Docker Troubleshooting


Common Issues

GPU Not Accessible

Symptoms:

  • Error: could not select device driver "nvidia"
  • Error: no NVIDIA GPU devices found
  • Lightning ASR fails to start

Diagnosis:

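Test GPU access from inside a container:

$docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

If the test fails, restart Docker and bring the stack back up:

$sudo systemctl restart docker
$docker compose up -d

If the error persists, reinstall the NVIDIA Container Toolkit:

$sudo apt-get remove nvidia-container-toolkit
$sudo apt-get update
$sudo apt-get install -y nvidia-container-toolkit
$sudo systemctl restart docker

Check the driver version on the host:

$nvidia-smi

If the driver version is below 470, update the driver: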

$sudo ubuntu-drivers autoinstall
$sudo reboot

Verify /etc/docker/daemon.json contains:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
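
The check on daemon.json can be scripted. The sketch below is a plain-text check rather than a JSON parser, and the function name `has_nvidia_runtime` is hypothetical. It assumes the runtime entry looks like the snippet shown here, optionally with a full path in front of `nvidia-container-runtime`.

```shell
# Hypothetical helper: succeeds if the given Docker daemon config mentions
# an "nvidia" runtime whose "path" ends in nvidia-container-runtime.
# Returns 2 if the file cannot be read. This is a text match, not a JSON parse.
has_nvidia_runtime() {
    config_file="${1:-/etc/docker/daemon.json}"
    [ -r "$config_file" ] || return 2
    grep -q '"nvidia"' "$config_file" &&
        grep -q '"path"[[:space:]]*:[[:space:]]*"[^"]*nvidia-container-runtime"' "$config_file"
}
```

Called without arguments it inspects /etc/docker/daemon.json, for example: has_nvidia_runtime && echo "nvidia runtime configured".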

Restart Docker after changes:

$sudo systemctl restart docker

License Validation Failed

Symptoms:

  • Error: License validation failed
  • Error: Invalid license key
  • Services fail to start

Diagnosis:

Check license-proxy logs:

$docker compose logs license-proxy

Check .env file:

$cat .env | grep LICENSE_KEY

Ensure there are no:

  • Extra spaces
  • Quotes around the key
  • Line breaks

Correct format:

LICENSE_KEY=abc123def456
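
The formatting rules above can be checked mechanically. This is a minimal sketch: the function name `check_license_line` is hypothetical, and it only tests the line's shape (prefix, quotes, whitespace, Windows line endings), not whether the key itself is valid.

```shell
# Hypothetical helper: validates the shape of one LICENSE_KEY line from .env.
# Prints "ok" or a short reason; returns non-zero when a problem is found.
check_license_line() {
    line="$1"
    case "$line" in
        LICENSE_KEY=*) ;;
        *) echo "line does not start with LICENSE_KEY="; return 1 ;;
    esac
    value="${line#LICENSE_KEY=}"
    cr=$(printf '\r')
    tab=$(printf '\t')
    case "$value" in
        "")               echo "key is empty"; return 1 ;;
        *[\"\']*)         echo "key contains quotes"; return 1 ;;
        *"$cr"*)          echo "key contains a carriage return"; return 1 ;;
        *" "*|*"$tab"*)   echo "key contains whitespace"; return 1 ;;
    esac
    echo "ok"
}
```

One possible use is: check_license_line "$(grep LICENSE_KEY .env)".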

Test connection to license server:

$curl -v https://console-api.smallest.ai

If this fails, check:

  • Firewall rules
  • Proxy settings
  • DNS resolution

If the key appears correct and network is accessible, your license may be:

  • Expired
  • Revoked
  • Invalid

Contact support@smallest.ai with:

  • Your license key (via a secure channel)
  • License-proxy logs
  • Error messages

Model Download Failed

Symptoms:

  • Lightning ASR stuck at startup
  • Error: Failed to download model
  • Error: Connection timeout

Diagnosis:

Check Lightning ASR logs:

$docker compose logs lightning-asr

Check .env file:

$cat .env | grep MODEL_URL

Test URL accessibility:

$curl -I "${MODEL_URL}"

Models require ~20-30 GB of disk space. Check available space:

$df -h
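
To turn the disk-space check into a pass/fail test, something like the following could be used. The function name `has_free_gb` is hypothetical, and the 30 GB threshold in the usage note is simply the upper end of the range quoted above.

```shell
# Hypothetical helper: succeeds if the filesystem holding TARGET_PATH has at
# least MIN_GB gigabytes available. `df -Pk` reports 1 KiB blocks in a
# stable, single-line-per-filesystem format; column 4 is the available space.
has_free_gb() {
    target_path="$1"
    min_gb="$2"
    avail_kb=$(df -Pk "$target_path" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example: has_free_gb / 30 || echo "less than 30 GB free".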

Free up space if needed. Note that this removes all unused images, stopped containers, and unused networks:

$docker system prune -a

Download model manually and use volume mount:

$mkdir -p ~/models
$cd ~/models
$wget "${MODEL_URL}" -O model.bin

Update docker-compose.yml:

lightning-asr:
  volumes:
    - ~/models:/app/models

For slow connections, increase download timeout:

lightning-asr:
  environment:
    - DOWNLOAD_TIMEOUT=3600

Port Already in Use

Symptoms:

  • Error: port is already allocated
  • Error: bind: address already in use

Diagnosis:

Find what’s using the port:

$sudo lsof -i :7100
$sudo netstat -tulpn | grep 7100
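
Plain grep 7100 also matches ports such as 17100 or process IDs containing those digits. A stricter filter is sketched below. The function name `port_listed` is hypothetical, and it relies on the local address being the fourth column, which holds for both `ss -ltn` and `netstat -tln` output.

```shell
# Hypothetical helper: reads `ss -ltn` or `netstat -tln` output on stdin and
# succeeds only if some socket's local address (column 4) ends in :PORT.
port_listed() {
    awk -v port="$1" '
        $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

For example: sudo ss -ltn | port_listed 7100 && echo "port 7100 is in use".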

If another service is using the port:

$sudo systemctl stop [service-name]

Or kill the process:

$sudo kill -9 [PID]

Modify docker-compose.yml to use a different host port:

api-server:
  ports:
    - "8080:7100"

Then access the API at http://localhost:8080 instead.

Old containers may still be bound:

$docker compose down
$docker container prune -f
$docker compose up -d

Out of Memory

Symptoms:

  • Container killed unexpectedly
  • Error: OOMKilled
  • System becomes unresponsive

Diagnosis:

Check container status:

$docker compose ps
$docker inspect [container-name] | grep OOMKilled

Lightning ASR requires a minimum of 16 GB of RAM.

Check current memory:

$free -h
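
For scripts, the total can also be read directly from /proc/meminfo on Linux. The function name `mem_total_gib` is hypothetical. The value is in GiB, and a host with 16 GB installed typically reports slightly less than 16.0 because some memory is reserved by firmware and the kernel.

```shell
# Hypothetical helper: prints total system memory in GiB with one decimal,
# read from a meminfo-style file (default: /proc/meminfo, Linux only).
# MemTotal is reported in kB, so divide by 1024 * 1024.
mem_total_gib() {
    meminfo_file="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1048576 }' "$meminfo_file"
}
```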

Prevent one service from consuming all memory:

services:
  lightning-asr:
    deploy:
      resources:
        limits:
          memory: 14G
        reservations:
          memory: 12G

Add swap space (temporary solution):

$sudo fallocate -l 16G /swapfile
$sudo chmod 600 /swapfile
$sudo mkswap /swapfile
$sudo swapon /swapfile

Use a smaller model or reduce the batch size:

lightning-asr:
  environment:
    - BATCH_SIZE=1
    - MODEL_PRECISION=fp16

Container Keeps Restarting

Symptoms:

  • Container status shows Restarting
  • Logs show crash loop

Diagnosis:

View recent logs:

$docker compose logs --tail=100 [service-name]
$docker inspect [container-name] --format='{{.State.ExitCode}}'

Common exit codes:

  • 137: Killed by SIGKILL, most often out of memory (OOMKilled)
  • 139: Segmentation fault
  • 1: General error
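
Exit codes above 128 conventionally mean the process was terminated by a signal, with the signal number equal to the code minus 128 (137 = 128 + 9, 139 = 128 + 11). A small lookup along those lines, with the hypothetical name `explain_exit_code`, might look like this:

```shell
# Hypothetical helper: prints a short interpretation of a container exit code.
# Codes above 128 are treated as "terminated by signal (code - 128)".
explain_exit_code() {
    code="$1"
    case "$code" in
        0)   echo "0: exited cleanly" ;;
        1)   echo "1: general application error" ;;
        137) echo "137: killed by SIGKILL (signal 9), often the OOM killer" ;;
        139) echo "139: segmentation fault (SIGSEGV, signal 11)" ;;
        *)
            if [ "$code" -gt 128 ]; then
                echo "$code: terminated by signal $((code - 128))"
            else
                echo "$code: application-specific error"
            fi
            ;;
    esac
}
```

It can be combined with the inspect command above, for example: explain_exit_code "$(docker inspect [container-name] --format='{{.State.ExitCode}}')".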

Temporarily disable restart to debug:

lightning-asr:
  restart: "no"

Start manually and watch logs:

$docker compose up lightning-asr

Ensure required services are healthy:

$docker compose ps

All services should show Up (healthy) or Up.

Slow Performance

Symptoms:

  • High latency (>500ms)
  • Low throughput
  • GPU underutilized

Diagnosis:

Monitor GPU usage:

$watch -n 1 nvidia-smi

Check container resources:

$docker stats

Ensure GPU is not throttling:

$nvidia-smi -q -d PERFORMANCE

Enable persistence mode:

$sudo nvidia-smi -pm 1

Adjust the CPU limit for Lightning ASR:

lightning-asr:
  deploy:
    resources:
      limits:
        cpus: '8'

For maximum performance (loses isolation):

api-server:
  network_mode: host

Use Redis with persistence disabled for speed:

redis:
  command: redis-server --save ""

Scale Lightning ASR workers:

$docker compose up -d --scale lightning-asr=2

Performance Optimization

Best Practices

1. Use Persistent Volumes

Cache models to avoid re-downloading:

volumes:
  - model-cache:/app/models

2. Enable GPU Persistence Mode

Reduces GPU initialization time:

$sudo nvidia-smi -pm 1

3. Optimize Container Resources

Allocate appropriate CPU/memory:

deploy:
  resources:
    limits:
      cpus: '8'
      memory: 14G

4. Monitor and Tune

Use monitoring tools:

$docker stats
$nvidia-smi dmon

Benchmark Your Deployment

Test transcription performance:

$time curl -X POST http://localhost:7100/v1/listen \
  -H "Authorization: Token ${LICENSE_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/test-audio-60s.wav"
  }'

Expected performance:

  • Cold start: First request after container start (5-10 seconds)
  • Warm requests: Subsequent requests (50-200ms)
  • Real-time factor: 0.05-0.15x (60s audio in 3-9 seconds)
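
The real-time factor is the processing time divided by the audio duration, so 6 seconds of processing for 60 seconds of audio gives 0.10. A minimal sketch of that arithmetic, using the hypothetical function name `rtf`:

```shell
# Hypothetical helper: real-time factor = processing seconds / audio seconds.
# Prints the result with two decimals; fails if the audio duration is not positive.
rtf() {
    awk -v proc="$1" -v audio="$2" 'BEGIN {
        if (audio <= 0) exit 1
        printf "%.2f\n", proc / audio
    }'
}
```

With the elapsed time reported by the benchmark command above, rtf 6 60 prints 0.10, which falls inside the expected 0.05-0.15 range.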

Debugging Tools

View All Logs

$docker compose logs -f

Follow Specific Service

$docker compose logs -f lightning-asr

Last N Lines

$docker compose logs --tail=100 api-server

Save Logs to File

$docker compose logs > deployment-logs.txt

Execute Commands in Container

$docker compose exec lightning-asr bash

Check Container Configuration

$docker inspect lightning-asr-1

Network Debugging

Test connectivity between containers:

$docker compose exec api-server ping lightning-asr
$docker compose exec api-server curl http://lightning-asr:2233/health

Health Checks

API Server

$curl http://localhost:7100/health

Expected: {"status": "healthy"}

Lightning ASR

$curl http://localhost:2233/health

Expected: {"status": "ready", "gpu": "NVIDIA A10"}

License Proxy

$docker compose exec license-proxy wget -q -O- http://localhost:6699/health

Expected: {"status": "valid"}

Redis

$docker compose exec redis redis-cli ping

Expected: PONG
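
Right after startup, a health endpoint may not respond yet, so a single failed check is not conclusive. One way to handle that is a small retry wrapper. The function name `retry` and the attempt and delay values in the usage note are assumptions, not part of the product.

```shell
# Hypothetical helper: runs a command up to ATTEMPTS times, sleeping DELAY
# seconds between failed attempts. Succeeds as soon as the command succeeds.
# The command's own output is discarded.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example: retry 30 2 curl -fsS http://localhost:7100/health || echo "API server did not become healthy".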

Log Analysis

Common Log Patterns

redis-1 | Ready to accept connections
license-proxy | License validated successfully
lightning-asr-1 | Model loaded successfully
lightning-asr-1 | GPU: NVIDIA A10 (24GB)
lightning-asr-1 | Server ready on port 2233
api-server | Connected to Lightning ASR
api-server | API server listening on port 7100
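
A saved log file can be scanned for these messages to see which startup step did not complete. The sketch below uses the hypothetical function name `check_startup_log`, matches the message texts listed above literally, and skips the GPU line because the model name varies by host.

```shell
# Hypothetical helper: prints every expected startup message that is missing
# from the given log file. Returns 0 if all are present, 1 otherwise.
check_startup_log() {
    log_file="$1"
    missing=0
    while IFS= read -r pattern; do
        if ! grep -qF -- "$pattern" "$log_file"; then
            echo "missing: $pattern"
            missing=1
        fi
    done <<'EOF'
Ready to accept connections
License validated successfully
Model loaded successfully
Server ready on port 2233
Connected to Lightning ASR
API server listening on port 7100
EOF
    return "$missing"
}
```

For example, after docker compose logs > all-logs.txt, run check_startup_log all-logs.txt.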

Getting Help

Before Contacting Support

Collect the following information:

1. System Information

$docker version
$docker compose version
$nvidia-smi
$uname -a

2. Container Status

$docker compose ps > status.txt
$docker stats --no-stream > resources.txt

3. Logs

$docker compose logs > all-logs.txt

4. Configuration

Sanitize and include:

  • docker-compose.yml
  • .env (remove license key)
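
Removing the key by hand is easy to forget, so the redaction can be scripted. The function name `sanitize_env` and the output file name in the usage note are hypothetical. Only the LICENSE_KEY line is rewritten, and any other secrets in .env still need to be removed manually.

```shell
# Hypothetical helper: copies .env content from stdin to stdout with the
# LICENSE_KEY value replaced by a placeholder. Other lines pass through unchanged.
sanitize_env() {
    sed 's/^\([[:space:]]*LICENSE_KEY[[:space:]]*=\).*/\1<redacted>/'
}
```

For example: sanitize_env < .env > env-sanitized.txt.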

Contact Support

Email: support@smallest.ai

Include:

  • Description of the issue
  • Steps to reproduce
  • System information
  • Logs and configuration
  • License key (via secure channel)
