Logs Analysis


Overview

Understanding log messages is crucial for diagnosing issues. This guide helps you interpret logs from each component and identify common error patterns.

Log Levels

All components use standard log levels:

| Level | Description | Example |
|---|---|---|
| DEBUG | Detailed diagnostic info | Variable values, function calls |
| INFO | Normal operation events | Request received, model loaded |
| WARNING | Potential issues | Slow response, retry attempt |
| ERROR | Error that needs attention | Failed request, connection error |
| CRITICAL | Severe error | Service crash, unrecoverable error |
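
Because the levels form a severity order, a simple filter can suppress everything below a chosen threshold. A minimal sketch over sample lines in the level-prefixed format used throughout this guide (the sample data is hypothetical):

```shell
# Hypothetical sample in the level-prefixed format shown in this guide.
cat > /tmp/sample.log <<'EOF'
INFO: Request received: req_abc123
DEBUG: Audio duration: 60.5s
WARNING: Retrying download...
ERROR: Download failed after 3 attempts
CRITICAL: Cannot initialize GPU, exiting
EOF

# Keep only WARNING and above.
grep -E '^(WARNING|ERROR|CRITICAL):' /tmp/sample.log
```

The same pattern works on live output, e.g. piped from `kubectl logs -f <pod>`.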

Lightning ASR Logs

Successful Startup

```
INFO: Starting Lightning ASR v1.0.0
INFO: GPU detected: NVIDIA A10 (24GB)
INFO: Downloading model from URL...
INFO: Model downloaded: 23.5GB
INFO: Loading model into GPU memory...
INFO: Model loaded successfully (5.2GB GPU memory)
INFO: Warmup inference completed in 3.2s
INFO: Server ready on port 2269
```

Request Processing

```
INFO: Request received: req_abc123
DEBUG: Audio duration: 60.5s, sample_rate: 44100
DEBUG: Preprocessing audio...
DEBUG: Running inference...
INFO: Transcription completed in 3.1s (RTF: 0.05x)
INFO: Confidence: 0.95
```

Common Errors

```
ERROR: No CUDA-capable device detected
ERROR: nvidia-smi command not found
CRITICAL: Cannot initialize GPU, exiting
```

Cause: GPU not available or drivers not installed

Solution:

  • Check nvidia-smi works
  • Verify GPU device plugin (Kubernetes)
  • Check NVIDIA Container Toolkit (Docker)

```
ERROR: CUDA out of memory
ERROR: Tried to allocate 2.5GB but only 1.2GB available
WARNING: Reducing batch size
```

Cause: Not enough GPU memory

Solution:

  • Reduce concurrent requests
  • Use larger GPU (A10 vs T4)
  • Scale horizontally (more pods)

```
INFO: Downloading model from https://example.com/model.bin
WARNING: Download attempt 1 failed: Connection timeout
WARNING: Retrying download...
ERROR: Download failed after 3 attempts
```

Cause: Network issues, invalid URL, disk full

Solution:

  • Verify MODEL_URL
  • Check disk space: df -h
  • Test URL: curl -I $MODEL_URL
  • Use shared storage (EFS)

```
ERROR: Failed to process audio: req_xyz789
ERROR: Unsupported audio format: audio/webm
ERROR: Audio file corrupted or invalid
```

Cause: Invalid audio file

Solution:

  • Verify audio format (WAV, MP3, FLAC supported)
  • Check file is not corrupted
  • Ensure proper sample rate (16kHz+)
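
As a quick local sanity check before re-submitting, you can verify that a file at least carries a WAV (RIFF) header. This is a rough sketch, not full validation; `ffprobe`, if installed, gives a complete report:

```shell
# Rough check: WAV files begin with the 4-byte magic "RIFF"
# (followed later by "WAVE"). A passing check does not prove the
# file is playable; a failing one means it is certainly not a WAV.
check_wav() {
  if [ "$(head -c 4 "$1")" = "RIFF" ]; then
    echo "looks like WAV"
  else
    echo "not a WAV"
  fi
}

# Hypothetical sample file for illustration.
printf 'RIFFxxxxWAVEfmt ' > /tmp/fake.wav
check_wav /tmp/fake.wav   # prints "looks like WAV"
```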

API Server Logs

Successful Startup

```
INFO: Starting API Server v1.0.0
INFO: Connecting to Lightning ASR at http://lightning-asr:2269
INFO: Connected to Lightning ASR (2 replicas)
INFO: Connecting to License Proxy at http://license-proxy:3369
INFO: License validated
INFO: API server listening on port 7100
```

Request Handling

```
INFO: POST /v1/listen from 10.0.1.5
DEBUG: Request ID: req_abc123
DEBUG: Audio URL: https://example.com/audio.wav
DEBUG: Routing to Lightning ASR pod: lightning-asr-0
INFO: Response time: 3.2s
INFO: Status: 200 OK
```

Common Errors

```
WARNING: Invalid license key from 10.0.1.5
WARNING: Missing Authorization header
ERROR: License validation failed: expired
```

Cause: Invalid, missing, or expired license key

Solution:

  • Verify Authorization: Token <key> header
  • Check license key is correct
  • Renew expired license

```
ERROR: No Lightning ASR workers available
WARNING: Request queued: req_abc123
WARNING: Queue size: 15
```

Cause: All Lightning ASR pods busy or down

Solution:

  • Check Lightning ASR pods: kubectl get pods
  • Scale up replicas
  • Check HPA configuration

```
ERROR: Request timeout after 300s
ERROR: Lightning ASR pod not responding: lightning-asr-0
WARNING: Retrying with different pod
```

Cause: Lightning ASR overloaded or crashed

Solution:

  • Check Lightning ASR logs
  • Increase timeout
  • Scale up pods

License Proxy Logs

Successful Validation

```
INFO: Starting License Proxy v1.0.0
INFO: License key loaded
INFO: Connecting to console-api.smallest.ai
INFO: License validated successfully
INFO: License valid until: 2025-12-31T23:59:59Z
INFO: Grace period: 24 hours
INFO: Server listening on port 3369
```

Usage Reporting

```
DEBUG: Reporting usage batch: 150 requests
DEBUG: Total duration: 3600s
DEBUG: Features: [streaming, punctuation]
INFO: Usage reported successfully
```

Common Errors

```
ERROR: License validation failed: Invalid license key
ERROR: License server returned 401 Unauthorized
CRITICAL: Cannot start without valid license
```

Cause: Invalid or expired license

Solution:

  • Verify the license key is correct
  • Renew an expired license
  • Contact support if the key should be valid


```
WARNING: Connection to console-api.smallest.ai failed
WARNING: Connection timeout after 10s
INFO: Using cached validation
INFO: Grace period active (23h remaining)
```

Cause: Network connectivity issue

Solution:

  • Test: curl https://console-api.smallest.ai
  • Check firewall allows HTTPS
  • Restore connectivity before grace period expires

```
WARNING: Grace period expires in 1 hour
WARNING: Cannot connect to license server
ERROR: Grace period expired
CRITICAL: Service will stop accepting requests
```

Cause: Extended network outage

Solution:

  • Restore network connectivity immediately
  • Check firewall rules
  • Contact support if persistent
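
Because the grace-period warnings above are the only early signal before the service stops accepting requests, it can be worth grepping for them proactively. A minimal sketch over a sample log (the warning text is taken from the example above; the sample file is hypothetical):

```shell
# Hypothetical sample of License Proxy output.
cat > /tmp/proxy.log <<'EOF'
INFO: Using cached validation
WARNING: Grace period expires in 1 hour
EOF

# Emit an alert line if the grace period is running out.
if grep -q 'Grace period expires' /tmp/proxy.log; then
  echo "ALERT: license grace period is running out"
fi
```

In a cluster, replace the sample file with output from `kubectl logs <license-proxy-pod> --since=10m` and wire the alert into whatever notifier you already use.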

Redis Logs

Normal Operation

```
Ready to accept connections
Client connected from 10.0.1.5:45678
DB 0: 1523 keys (expires: 0)
```

Common Errors

```
WARNING: Memory usage: 95%
ERROR: OOM command not allowed when used memory > 'maxmemory'
```

Solution:

  • Increase memory limit
  • Enable eviction policy
  • Clear old keys

```
ERROR: Failed writing the RDB file
ERROR: Disk is full
```

Solution:

  • Increase disk space
  • Disable persistence if not needed
  • Clean up old snapshots

Log Pattern Analysis

Error Rate Analysis

Count errors in last 1000 lines:

```shell
kubectl logs <pod> --tail=1000 | grep -c "ERROR"
```

Group errors by type:

```shell
kubectl logs <pod> | grep "ERROR" | sort | uniq -c | sort -rn
```

Performance Analysis

Extract response times:

```shell
kubectl logs <pod> | grep "Response time" | awk '{print $NF}' | sort -n
```

Calculate average:

```shell
kubectl logs <pod> | grep "Response time" | awk '{sum+=$NF; count++} END {print sum/count}'
```
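
Averages hide tail latency, so a percentile is often more useful. A sketch that computes an approximate nearest-rank p95 from already-extracted times (the sample numbers are made up; in practice, feed it the `awk '{print $NF}'` output from above):

```shell
# Hypothetical sample response times in seconds, one per line.
printf '%s\n' 1.2 3.4 2.1 0.9 5.6 2.8 3.1 1.7 4.2 2.5 > /tmp/times.txt

# Nearest-rank p95: sort numerically, then take the value at
# index int(NR * 0.95), clamped to at least 1.
sort -n /tmp/times.txt | awk '
  { v[NR] = $1 }
  END {
    idx = int(NR * 0.95)
    if (idx < 1) idx = 1
    print "p95:", v[idx]
  }'
# → p95: 4.2
```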

Request Tracking

Follow a specific request ID:

```shell
kubectl logs <pod> | grep "req_abc123"
```

Across all pods:

```shell
kubectl logs -l app=lightning-asr | grep "req_abc123"
```
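
When pods log with leading timestamps, the per-pod streams can be merged back into one chronological trace for a request. A sketch over hypothetical two-pod sample files, assuming each line starts with an ISO-8601 timestamp (which sorts lexicographically in time order; adjust the sort key if your format differs):

```shell
# Hypothetical per-pod log excerpts.
cat > /tmp/pod-a.log <<'EOF'
2024-01-15T10:00:02Z INFO: req_abc123 routed to lightning-asr-0
2024-01-15T10:00:09Z INFO: req_abc123 status 200
EOF
cat > /tmp/pod-b.log <<'EOF'
2024-01-15T10:00:03Z INFO: req_abc123 inference started
2024-01-15T10:00:08Z INFO: req_abc123 transcription completed
EOF

# -h suppresses filename prefixes so sort keys on the timestamp.
grep -h 'req_abc123' /tmp/pod-a.log /tmp/pod-b.log | sort
```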

Log Aggregation

Using stern

Install stern:

```shell
brew install stern
```

Follow logs from all Lightning ASR pods:

```shell
stern lightning-asr -n smallest
```

Filter by pattern:

```shell
stern lightning-asr -n smallest --grep "ERROR"
```

Using Loki (if installed)

Query logs via LogQL:

```
{app="lightning-asr"} |= "ERROR"
{app="api-server"} |= "req_abc123"
rate({app="lightning-asr"}[5m])
```

Structured Logging

Parse JSON Logs

If logs are in JSON format:

```shell
kubectl logs <pod> | jq 'select(.level=="ERROR")'
kubectl logs <pod> | jq 'select(.duration > 1000)'
kubectl logs <pod> | jq -r '.message'
```

Filter by Field

```shell
kubectl logs <pod> | jq 'select(.request_id=="req_abc123")'
kubectl logs <pod> | jq 'select(.component=="license_proxy")'
```
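
If `jq` is not available on the box, a rough error-rate figure can still be pulled from JSON logs with `grep` and `awk` alone (the sample lines below are hypothetical):

```shell
# Hypothetical JSON-lines log sample.
cat > /tmp/json.log <<'EOF'
{"level":"INFO","message":"ok"}
{"level":"ERROR","message":"Request failed"}
{"level":"INFO","message":"ok"}
{"level":"ERROR","message":"timeout"}
EOF

total=$(wc -l < /tmp/json.log)
errors=$(grep -c '"level":"ERROR"' /tmp/json.log)
awk -v e="$errors" -v t="$total" 'BEGIN { printf "error rate: %.0f%%\n", 100 * e / t }'
# → error rate: 50%
```

This is string matching, not JSON parsing; it will miscount if the field order or spacing differs, which is why `jq` is preferable when available.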

Log Retention

Configure Log Rotation

Docker:

docker-compose.yml:

```yaml
services:
  lightning-asr:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```

Kubernetes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lightning-asr
spec:
  containers:
  - name: lightning-asr
    imagePullPolicy: Always
```

Kubernetes automatically rotates logs via kubelet.

Export Logs

Save logs for analysis:

```shell
kubectl logs <pod> > logs.txt
kubectl logs <pod> --since=1h > logs-last-hour.txt
kubectl logs <pod> --since-time=2024-01-15T10:00:00Z > logs-since.txt
```

Debugging Log Issues

No Logs Appearing

Check pod is running:

```shell
kubectl get pods -n smallest
kubectl describe pod <pod-name>
```

Check stdout/stderr:

```shell
kubectl exec -it <pod> -- sh -c "ls -la /proc/1/fd/"
```

Logs Truncated

Increase log size limits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: privileged
spec:
  containers:
  - name: app
    env:
    - name: LOG_MAX_SIZE
      value: "100M"
```

Best Practices

Prefer JSON format for easier parsing:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "message": "Request failed",
  "request_id": "req_abc123",
  "duration_ms": 3200
}
```

Always include relevant context in logs:

  • Request ID
  • Component name
  • Timestamp
  • User/session info (if applicable)

Use correct log levels:

  • DEBUG: Development only
  • INFO: Normal operation
  • WARNING: Potential issues
  • ERROR: Actual problems
  • CRITICAL: Service-breaking issues

Use centralized logging:

  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Loki + Grafana
  • CloudWatch Logs (AWS)
  • Cloud Logging (GCP)

What’s Next?