Kubernetes Troubleshooting
Overview
This guide covers common issues encountered when deploying Smallest Self-Host on Kubernetes and how to resolve them.
Diagnostic Commands
Quick Status Check
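The command block for this check appears to have been lost in formatting; a typical quick status check, assuming the `smallest` namespace used elsewhere in this guide, might be:

```shell
# List pods, services, and deployments for the release at a glance
kubectl get pods,svc,deploy -n smallest
```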
Detailed Pod Information
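The command block here also appears to be missing; a typical way to get detailed pod information:

```shell
# Full spec, status, and recent events for a single pod
kubectl describe pod <pod-name> -n smallest
```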
Events
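A sketch of the likely events command (sort order flag is a common convention, not confirmed by the original):

```shell
# Namespace events, most recent last
kubectl get events -n smallest --sort-by=.metadata.creationTimestamp
```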
Common Issues
Pods Stuck in Pending
Symptoms:
Causes and Solutions:
Insufficient GPU Resources
Check:
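The check command appears to have been stripped; a typical way to see why a pod is unschedulable:

```shell
# The Events section at the bottom explains why scheduling failed
kubectl describe pod <pod-name> -n smallest
```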
Look for: `0/3 nodes are available: 3 Insufficient nvidia.com/gpu`
Solutions:
- Add GPU nodes to cluster
- Check GPU nodes are ready: `kubectl get nodes -l nvidia.com/gpu=true`
- Verify the GPU device plugin is running: `kubectl get pods -n kube-system -l name=nvidia-device-plugin`
- Reduce the number of requested GPUs, or add more nodes
Node Selector Mismatch
Check:
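The check command seems to be missing; comparing node labels against the configured `nodeSelector` is the usual first step:

```shell
# List all nodes with their labels; compare against nodeSelector in values.yaml
kubectl get nodes --show-labels
```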
Solutions:
- Update nodeSelector in values.yaml to match actual node labels
- Remove nodeSelector if not needed
- Add labels to nodes: `kubectl label nodes <node-name> workload=gpu`
Tolerations Missing
Check:
Solutions: Update tolerations in values.yaml:
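The YAML snippet for this solution appears to have been lost; a sketch of what a tolerations block in values.yaml might look like. The key, operator, and effect below are placeholders — match them to the actual taints on your GPU nodes:

```yaml
# Placeholder toleration; adjust key/effect to your GPU node taints
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```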
PVC Not Bound
Check:
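The check command appears to be missing; the usual check:

```shell
# A healthy PVC shows STATUS: Bound
kubectl get pvc -n smallest
```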
Look for: `STATUS: Pending`
Solutions:
- Check the storage class exists: `kubectl get storageclass`
- Verify sufficient storage: `kubectl describe pvc <pvc-name> -n smallest`
- Check the EFS/EBS CSI driver is running: `kubectl get pods -n kube-system -l app=efs-csi-controller`
ImagePullBackOff
Symptoms:
Diagnosis:
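The diagnosis command seems to have been dropped; a typical way to see the exact pull error:

```shell
# The Events section shows the image pull error verbatim
kubectl describe pod <pod-name> -n smallest
```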
Look for errors in the Events section of the output.
Solutions:
Invalid Credentials
Error: `unauthorized: authentication required`
Solutions:
- Verify imageCredentials in values.yaml
- Check the secret was created: `kubectl get secrets -n smallest | grep registry`
- Test credentials locally: `docker login quay.io`
- Recreate the secret:
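The recreate command appears to be missing; a sketch, assuming a docker-registry secret named `registry-credentials` (the name is a placeholder — use whatever name your values.yaml references):

```shell
# Delete and recreate the registry pull secret (name is illustrative)
kubectl delete secret registry-credentials -n smallest
kubectl create secret docker-registry registry-credentials \
  --docker-server=quay.io \
  --docker-username=<username> \
  --docker-password=<password> \
  -n smallest
```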
Image Not Found
Error: `manifest unknown` or `not found`
Solutions:
- Verify image name in values.yaml
- Check the image exists: `docker pull quay.io/smallestinc/lightning-asr:latest`
- Contact support@smallest.ai for access
Rate Limited
Error: `rate limit exceeded`
Solutions:
- Wait and retry
- Use authenticated pulls (imageCredentials)
CrashLoopBackOff
Symptoms:
Diagnosis:
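The diagnosis command appears to have been lost; the standard way to see why a container keeps crashing:

```shell
# Logs from the previous (crashed) container instance
kubectl logs <pod-name> -n smallest --previous
```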
Common Causes:
License Validation Failed
Error: `License validation failed` or `Invalid license key`
Solutions:
- Check the License Proxy is running: `kubectl get pods -l app=license-proxy -n smallest`
- Verify the license key in values.yaml
- Check License Proxy logs: `kubectl logs -l app=license-proxy -n smallest`
- Test the License Proxy: `kubectl exec -it <api-server-pod> -- curl http://license-proxy:3369/health`
Model Download Failed
Error: `Failed to download model` or `Connection timeout`
Solutions:
- Verify MODEL_URL in values.yaml
- Check network connectivity
- Check disk space: `kubectl exec -it <pod> -- df -h`
- Test the URL: `kubectl run test --rm -it --image=curlimages/curl -- curl -I $MODEL_URL`
Out of Memory
Error: pod killed with exit code 137 (SIGKILL, typically `OOMKilled`)
Solutions:
- Check memory limits:
- Increase memory:
- Check node capacity: `kubectl describe node <node-name>`
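The snippets for the first two bullets appear to be missing; a sketch of raising the memory limit in values.yaml (key layout and values are illustrative — adjust to your chart's structure):

```yaml
# Illustrative resource block; tune values to your workload
resources:
  requests:
    memory: "8Gi"
  limits:
    memory: "16Gi"
```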
GPU Not Accessible
Error: `No CUDA-capable device` or `GPU not found`
Solutions:
- Verify a GPU is available on the node: `kubectl describe node <node-name> | grep nvidia.com/gpu`
- Check the NVIDIA device plugin: `kubectl get pods -n kube-system -l name=nvidia-device-plugin`
- Restart the device plugin: `kubectl delete pod -n kube-system -l name=nvidia-device-plugin`
- Verify the GPU driver on the node
Service Not Accessible
Symptoms:
- Cannot connect to API server
- Connection refused errors
- Timeouts
Diagnosis:
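The diagnosis command block appears to have been lost; a typical first check:

```shell
# Confirm the service exists and has endpoints backing it
kubectl get svc,endpoints -n smallest
```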
Solutions:
No Endpoints
Issue: Service has no endpoints
Check:
Solutions:
- Verify pods are running: `kubectl get pods -l app=api-server -n smallest`
- Check pod labels match the service selector
- Check pods are ready: `kubectl get pods -l app=api-server -n smallest -o wide`
Wrong Port
Solutions:
- Verify service port:
- Use correct port in connections (7100 for API Server)
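The verification snippet for the first bullet appears to be missing; a sketch, assuming the service is named `api-server` (matching the pod label used elsewhere in this guide):

```shell
# The PORT(S) column should show 7100 for the API Server
kubectl get svc api-server -n smallest -o wide
```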
Network Policy Blocking
Check:
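The check command appears to be missing; listing policies in the namespace is the usual starting point:

```shell
# Any policy listed here may be restricting traffic to the pods
kubectl get networkpolicy -n smallest
```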
Solutions:
- Review network policies
- Temporarily disable to test:
HPA Not Scaling
Symptoms:
- HPA shows `<unknown>` for metrics
- Pods not scaling despite high load
Diagnosis:
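The diagnosis command block appears to have been lost; a typical check:

```shell
# Shows current metric values, targets, and recent scaling events
kubectl get hpa -n smallest
kubectl describe hpa -n smallest
```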
Solutions:
Metrics Not Available
Check:
Solutions:
- Enable ServiceMonitor:
- Verify Prometheus is scraping the metric; query: `asr_active_requests`
Already at Max Replicas
Check:
Solutions:
- Increase maxReplicas:
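The values.yaml snippet appears to be missing; a sketch of what it might look like (the key path is a placeholder — match it to your chart's autoscaling section):

```yaml
# Illustrative; key names depend on the chart's values schema
autoscaling:
  maxReplicas: 10
```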
Insufficient Cluster Resources
Solutions:
- Add more nodes
- Enable Cluster Autoscaler
- Check pending pods: `kubectl get pods --field-selector=status.phase=Pending`
Persistent Volume Issues
Symptoms:
- PVC stuck in Pending
- Mount failures
- Permission denied
Solutions:
No Storage Class
Check:
Solutions:
- Install EBS CSI driver (AWS)
- Install EFS CSI driver (AWS)
- Create storage class
EFS Mount Failed
Check:
Solutions:
- Verify EFS file system ID
- Check security group allows NFS (port 2049)
- Verify the EFS CSI driver is running: `kubectl get pods -n kube-system -l app=efs-csi-controller`
Permission Denied
Solutions:
- Check volume permissions
- Add fsGroup to pod securityContext:
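The YAML snippet for this bullet appears to be missing; a sketch (the group ID is a placeholder — use whatever UID/GID the container runs as):

```yaml
# Volumes will be group-owned and writable by this GID (value is illustrative)
securityContext:
  fsGroup: 1000
```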
Performance Issues
Slow Response Times
Check:
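The check commands appear to have been lost; a typical resource check (requires metrics-server in the cluster):

```shell
# Per-pod CPU and memory usage
kubectl top pods -n smallest
```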
Solutions:
- Increase pod resources
- Scale up replicas
- Check GPU utilization: `kubectl exec -it <lightning-asr-pod> -- nvidia-smi`
- Review model configuration
- Check network latency
High CPU/Memory Usage
Check:
Solutions:
- Increase resource limits
- Scale horizontally (more pods)
- Investigate memory leaks in logs
- Enable monitoring with Grafana
Debugging Tools
Interactive Shell
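The command block appears to have been lost; the standard way to open a shell in a running pod:

```shell
# Open an interactive shell (use /bin/sh if bash is not in the image)
kubectl exec -it <pod-name> -n smallest -- /bin/bash
```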
Debug Container
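The command block here also appears to be missing; a sketch using ephemeral debug containers (requires Kubernetes 1.23+; image choice is illustrative):

```shell
# Attach a debug container sharing the target container's process namespace
kubectl debug -it <pod-name> -n smallest --image=busybox --target=<container-name>
```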
Network Debugging
Inside the debug pod:
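Both the launch command and the in-pod commands appear to have been stripped; a sketch using the community netshoot image (the image and the `/health` path for the API Server are assumptions, not confirmed by the original):

```shell
# Launch a throwaway pod with networking tools (image is an example)
kubectl run netdebug --rm -it --image=nicolaka/netshoot -n smallest -- /bin/bash

# Inside the debug pod: resolve the service and test connectivity
nslookup api-server
curl -v http://api-server:7100/health
```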
Copy Files
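The command block appears to be missing; the standard way to copy files out of a pod:

```shell
# Copy a file from a pod (namespace/pod:path) to the local machine
kubectl cp smallest/<pod-name>:/path/in/pod ./local-file
```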
Getting Help
Collect Diagnostic Information
Before contacting support, collect:
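The collection commands appear to have been lost; a sketch of gathering the usual artifacts into files (namespace and labels taken from earlier in this guide):

```shell
# Gather cluster state, logs, and release info for support
kubectl get pods -n smallest -o wide > pods.txt
kubectl describe pods -n smallest > describe.txt
kubectl logs -l app=api-server -n smallest --tail=1000 > api-server-logs.txt
kubectl get events -n smallest --sort-by=.metadata.creationTimestamp > events.txt
helm list -n smallest > helm-releases.txt
```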
Contact Support
Email: support@smallest.ai
Include:
- Description of the issue
- Steps to reproduce
- Diagnostic files collected above
- Cluster information (EKS version, node types, etc.)
- Helm chart version

