Debugging Guide


Overview

This guide covers advanced debugging techniques for troubleshooting complex issues with Smallest Self-Host.

Debugging Tools

Docker Debugging

Enter Running Container

$docker exec -it <container-name> /bin/bash

Inside the container:

$ls -la
$ps aux
$df -h
$nvidia-smi
$env

Debug Failed Container

View logs of crashed container:

$docker logs <container-name>
$docker logs <container-name> --tail=100 --follow

Inspect container configuration:

$docker inspect <container-name>

Network Debugging

Check container networking:

$docker network ls
$docker network inspect <network-name>
$docker exec <container> ping license-proxy
$docker exec <container> curl http://license-proxy:3369/health

Kubernetes Debugging

Debug Pod

Interactive debug container:

$kubectl debug <pod-name> -it --image=ubuntu --target=<container-name>

Copy debug tools into pod:

$kubectl cp ./debug-script.sh <pod-name>:/tmp/debug.sh
$kubectl exec -it <pod-name> -- bash /tmp/debug.sh

Ephemeral Debug Container

Add temporary container to running pod:

$kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=lightning-asr

Inside debug container:

$nslookup license-proxy
$curl http://api-server:7100/health
$tcpdump -i eth0

Get Previous Logs

If the pod crashed and restarted, fetch the logs of the previous container instance:

$kubectl logs <pod-name> --previous
$kubectl logs <pod-name> -c <container-name> --previous

Network Debugging

Test Service Connectivity

From inside cluster:

$kubectl run netdebug --rm -it --restart=Never \
> --image=nicolaka/netshoot \
> --namespace=smallest \
> -- bash

Inside debug pod:

$nslookup api-server
$nslookup license-proxy
$nslookup lightning-asr
$
$curl http://api-server:7100/health
$curl http://license-proxy:3369/health
$
$traceroute api-server
$ping -c 3 lightning-asr
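Health endpoints can fail briefly while a service is still starting, so a single curl may give a misleading result. The following is a minimal sketch of a retry wrapper. The function name `retry` and its argument order are illustrative assumptions, not part of Smallest Self-Host.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Hypothetical helper for illustration only.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

Inside the debug pod it could be used as `retry 5 2 curl -fsS http://api-server:7100/health`, which tries up to five times with a two-second pause between attempts.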

DNS Resolution

Check DNS is working:

$kubectl run dnstest --rm -it --restart=Never \
> --image=busybox \
> -- nslookup kubernetes.default

Check CoreDNS logs:

$kubectl logs -n kube-system -l k8s-app=kube-dns

Network Policies

List network policies:

$kubectl get networkpolicy -n smallest
$kubectl describe networkpolicy <policy-name> -n smallest

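Temporarily disable for testing (save a copy first so the policy can be restored):

$kubectl get networkpolicy <policy-name> -n smallest -o yaml > networkpolicy-backup.yaml
$kubectl delete networkpolicy <policy-name> -n smallest

Recreate the network policy after testing, for example with kubectl apply -f networkpolicy-backup.yaml. Until then, traffic that the policy restricted is allowed.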

Performance Debugging

Resource Usage

Check pod resource consumption:

$kubectl top pods -n smallest
$kubectl top pods -n smallest --sort-by=memory
$kubectl top pods -n smallest --sort-by=cpu

Check node resource usage:

$kubectl top nodes
$kubectl describe node <node-name> | grep -A 10 "Allocated resources"
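To spot heavy pods quickly, the output of `kubectl top pods` can be filtered against a memory threshold. This is a sketch under assumptions: the function name `pods_over_mem` is hypothetical, and it expects the three-column `NAME CPU MEMORY` layout with memory in Ki, Mi, or Gi.

```shell
# pods_over_mem LIMIT_MI
# Reads "NAME CPU MEMORY" lines on stdin (no header row) and prints the
# pods whose memory exceeds LIMIT_MI mebibytes. Hypothetical helper.
pods_over_mem() {
  limit_mi="$1"
  awk -v limit="$limit_mi" '
    {
      mem = $3
      # Normalize the memory column to MiB; skip units we do not recognize.
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem = mem * 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mem = mem + 0 }
      else if (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mem = mem / 1024 }
      else                  { next }
      if (mem > limit) print $1, mem "Mi"
    }'
}
```

For example, `kubectl top pods -n smallest --no-headers | pods_over_mem 2048` would list pods using more than 2 GiB.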

GPU Debugging

Check GPU availability in pod:

$kubectl exec -it <lightning-asr-pod> -- nvidia-smi
$
$kubectl exec -it <lightning-asr-pod> -- nvidia-smi dmon

Watch GPU utilization:

$kubectl exec -it <lightning-asr-pod> -- watch -n 1 nvidia-smi

Query detailed GPU state (memory, utilization, power, clocks, performance):

$kubectl exec -it <lightning-asr-pod> -- nvidia-smi -q -d MEMORY,UTILIZATION,POWER,CLOCK,PERFORMANCE

Application Profiling

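Profile Lightning ASR (this assumes a Debian-based image with root access; py-spy may also need the SYS_PTRACE capability to attach to the process):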

$kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y python3-pip && pip3 install py-spy'
$
$kubectl exec -it <pod> -- py-spy top --pid 1

Memory profiling:

$kubectl exec -it <pod> -- sh -c 'grep -E "^(Vm|Rss)" /proc/1/status'
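The kernel reports the resident set size in `/proc/<pid>/status` as a `VmRSS` line in kB. The sketch below converts that value to MiB; the function name `rss_mib` is a hypothetical helper, not part of the product.

```shell
# rss_mib
# Reads the contents of /proc/<pid>/status on stdin and prints the
# resident set size (VmRSS) in MiB with one decimal place.
rss_mib() {
  awk '/^VmRSS:/ { printf "%.1f\n", $2 / 1024 }'
}
```

It could be fed from the pod with `kubectl exec <pod> -- cat /proc/1/status | rss_mib`, and run repeatedly to see whether memory grows over time.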

Log Analysis

Structured Log Parsing

Extract errors from logs:

$kubectl logs <pod> | grep -i "error\|exception\|failed"

Count errors:

$kubectl logs <pod> | grep -i "error" | wc -l

Show errors with context:

$kubectl logs <pod> | grep -i -B 5 -A 5 "error"
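The individual grep commands above can be combined into one pass that counts each keyword. This is a minimal sketch; the function name `summarize_errors` and the output format are illustrative choices.

```shell
# summarize_errors
# Reads log lines on stdin and prints how many lines mention
# "error", "exception", and "failed" (case-insensitive).
# A line that contains several keywords is counted once per keyword.
summarize_errors() {
  awk '
    { line = tolower($0) }
    line ~ /error/     { errors++ }
    line ~ /exception/ { exceptions++ }
    line ~ /failed/    { failed++ }
    END { printf "error=%d exception=%d failed=%d\n", errors, exceptions, failed }
  '
}
```

Usage: `kubectl logs <pod> | summarize_errors`.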

Log Aggregation

Combine logs from all replicas:

$kubectl logs -l app=lightning-asr -n smallest --tail=100 --all-containers=true

Follow logs from multiple pods:

$kubectl logs -l app=lightning-asr -n smallest -f --max-log-requests=10

Parse JSON Logs

Using jq:

$kubectl logs <pod> | jq 'select(.level=="error")'
$kubectl logs <pod> | jq 'select(.duration > 1000)'
$kubectl logs <pod> | jq '.message' -r
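jq stops with a parse error if the log stream contains lines that are not JSON, such as startup banners or stack traces. One way around this, sketched here, is to read each line as raw text and parse it leniently. The function name `json_errors` is hypothetical, and the `level` field is taken from the examples above.

```shell
# json_errors
# Reads mixed log lines on stdin and prints, in compact form, only the
# JSON objects whose "level" field equals "error". Lines that are not
# valid JSON are skipped instead of aborting jq.
json_errors() {
  jq -cR 'fromjson? | select(type == "object" and .level == "error")'
}
```

Usage: `kubectl logs <pod> | json_errors`.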

Database Debugging

Redis Debugging

Connect to Redis:

$kubectl exec -it <redis-pod> -- redis-cli

Inside Redis CLI (KEYS * and MONITOR can be expensive on a busy instance, so use them sparingly in production):

AUTH your-password
INFO
DBSIZE
KEYS *
GET some_key
MONITOR

Check Redis memory:

INFO memory

Check slow queries:

SLOWLOG GET 10

API Debugging

Test API Endpoints

Health check:

$kubectl port-forward svc/api-server 7100:7100
$curl http://localhost:7100/health

Test transcription:

$curl -X POST http://localhost:7100/v1/listen \
> -H "Authorization: Token ${LICENSE_KEY}" \
> -H "Content-Type: application/json" \
> -d '{"url": "https://www2.cs.uic.edu/~i101/SoundFiles/StarWars60.wav"}' \
> -v

Request Tracing

Add request ID tracking:

$curl -X POST http://localhost:7100/v1/listen \
> -H "Authorization: Token ${LICENSE_KEY}" \
> -H "X-Request-ID: debug-123" \
> -H "Content-Type: application/json" \
> -d '{"url": "..."}' \
> -v

Grep logs for request:

$kubectl logs -l app=api-server | grep "debug-123"
$kubectl logs -l app=lightning-asr | grep "debug-123"
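A fixed ID such as debug-123 can match older requests in the logs. The sketch below generates a fresh ID per test run; the function name `new_request_id` and the ID format are assumptions for illustration.

```shell
# new_request_id
# Prints an ID of the form debug-<unix-seconds>-<shell-pid>, which is
# unlikely to collide with earlier test requests in the logs.
new_request_id() {
  printf 'debug-%s-%s\n' "$(date +%s)" "$$"
}
```

For example, set `REQ_ID=$(new_request_id)`, send it with `-H "X-Request-ID: ${REQ_ID}"`, and then run `kubectl logs -l app=api-server | grep "${REQ_ID}"`.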

Packet Capture

Capture network traffic:

$kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y tcpdump'
$
$kubectl exec -it <pod> -- tcpdump -i any -w /tmp/capture.pcap port 7100
$
$kubectl cp <pod>:/tmp/capture.pcap ./capture.pcap

Analyze with Wireshark or:

$tcpdump -r capture.pcap -A

Event Debugging

Watch Events

Real-time events:

$kubectl get events -n smallest --watch

Filter by type:

$kubectl get events -n smallest --field-selector type=Warning

Sort by timestamp:

$kubectl get events -n smallest --sort-by='.lastTimestamp'

Event Analysis

Count events by reason:

$kubectl get events -n smallest -o json | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

Metrics Debugging

Check Prometheus Metrics

Port forward Prometheus:

$kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090

Query metrics:

Open http://localhost:9090 and run:

asr_active_requests
rate(asr_total_requests[5m])
asr_gpu_utilization

Check Custom Metrics

Verify metrics available to HPA:

$kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

Query specific metric:

$kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests" | jq .

Debugging Checklists

Startup Issues Checklist

1. Check Image Pull

$kubectl describe pod <pod> | grep -A 10 "Events"

2. Verify Secrets

$kubectl get secrets -n smallest
$kubectl describe secret <secret-name>

3. Check Resources

$kubectl describe node <node> | grep "Allocated resources" -A 10

4. Review Logs

$kubectl logs <pod> --all-containers=true
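The pod-level steps of this checklist can be run in one go. The following is a sketch, not an official tool: the function name `startup_checks`, the `KUBECTL` override variable, and the default namespace `smallest` are assumptions. The node resource check is left out because it needs a node name.

```shell
# startup_checks POD [NAMESPACE]
# Runs the pod-level startup checks in sequence. Set KUBECTL to point
# at a different kubectl binary (hypothetical convenience variable).
startup_checks() {
  pod="$1"
  ns="${2:-smallest}"
  k="${KUBECTL:-kubectl}"

  echo "== 1. Pod events =="
  "$k" describe pod "$pod" -n "$ns" | grep -A 10 "Events" \
    || echo "(no events found)"

  echo "== 2. Secrets =="
  "$k" get secrets -n "$ns"

  echo "== 3. Recent logs =="
  "$k" logs "$pod" -n "$ns" --all-containers=true --tail=50
}
```

Usage: `startup_checks <pod>` or `startup_checks <pod> <namespace>`.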

Performance Issues Checklist

1. Check Resource Usage

$kubectl top pods -n smallest
$kubectl top nodes

2. Verify GPU

$kubectl exec <pod> -- nvidia-smi

3. Check HPA

$kubectl get hpa
$kubectl describe hpa lightning-asr

4. Review Metrics

$kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

Advanced Techniques

Enable Debug Logging

Increase log verbosity:

lightningAsr:
  env:
    - name: LOG_LEVEL
      value: "DEBUG"

Simulate Failures

Test error handling (after draining a node, run kubectl uncordon <node-name> to make it schedulable again):

$kubectl delete pod <pod-name>
$kubectl drain <node-name> --ignore-daemonsets

Load Testing

Generate load:

$kubectl run load-test --rm -it --image=williamyeh/hey \
> -- -z 5m -c 50 http://api-server:7100/health

Chaos Engineering

Test resilience (requires Chaos Mesh):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - smallest
    labelSelectors:
      app: lightning-asr
  duration: "30s"
