Overview
This guide covers common issues encountered when deploying Smallest Self-Host on Kubernetes and how to resolve them.
Diagnostic Commands
Quick Status Check
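A minimal sketch of a quick status pass, assuming the release runs in the smallest namespace used by the other commands in this guide. The helper name quick_status is hypothetical; the kubectl subcommands are standard.

```shell
# Hypothetical helper: one-shot overview of the main resources in a namespace.
# The default namespace "smallest" follows the commands used in this guide.
quick_status() {
  local ns="${1:-smallest}"
  kubectl get pods -n "$ns" -o wide   # phase, restarts, and node placement
  kubectl get svc -n "$ns"            # service types and ports
  kubectl get pvc -n "$ns"            # volume claims and their binding status
  kubectl get hpa -n "$ns"            # autoscaler targets and replica counts
}
```

Run quick_status with no argument for the default namespace, or pass another namespace as the first argument.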
Events
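Cluster events are usually the fastest way to see why a pod is not starting. The sketch below lists them oldest to newest and, separately, only the warnings; the helper names are hypothetical and the smallest namespace is assumed.

```shell
# Hypothetical helpers for reading namespace events in time order.
recent_events() {
  kubectl get events -n "${1:-smallest}" --sort-by=.lastTimestamp
}

# Same listing, restricted to events of type Warning.
recent_warnings() {
  kubectl get events -n "${1:-smallest}" --field-selector type=Warning --sort-by=.lastTimestamp
}
```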
Common Issues
Pods Stuck in Pending
Symptoms: pods remain in the Pending state and are never scheduled onto a node.
Causes and Solutions:
Insufficient GPU Resources
Check:
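The scheduler records its reason for not placing a pod in the pod's events. As a sketch (helper names are hypothetical, namespace smallest assumed), the following lists Pending pods and prints only the Events section of kubectl describe pod:

```shell
# List pods the scheduler has not placed yet.
pending_pods() {
  kubectl get pods -n "${1:-smallest}" --field-selector=status.phase=Pending
}

# Print the Events section of `kubectl describe pod` (from the "Events:"
# line to the end of the output), where scheduling failures are reported.
scheduling_events() {
  local pod="$1" ns="${2:-smallest}"
  kubectl describe pod "$pod" -n "$ns" | sed -n '/^Events:/,$p'
}
```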
Look for: 0/3 nodes are available: 3 Insufficient nvidia.com/gpu
Solutions:
Add GPU nodes to cluster
Check GPU nodes are ready: kubectl get nodes -l nvidia.com/gpu=true
Verify GPU device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
Reduce requested GPUs or add more nodes
Node Selector Mismatch
Check:
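A mismatch shows up when the labels the pod asks for are not present on any node. A minimal sketch for comparing the two, with hypothetical helper names and the smallest namespace assumed:

```shell
# Show every node together with its labels.
node_labels() {
  kubectl get nodes --show-labels
}

# Print the nodeSelector the pod was created with (empty if none is set).
pod_node_selector() {
  local pod="$1" ns="${2:-smallest}"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeSelector}{"\n"}'
}
```

Every key/value pair printed by pod_node_selector must appear in the label list of at least one schedulable node.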
Solutions:
Update nodeSelector in values.yaml to match actual node labels
Remove nodeSelector if not needed
Add labels to nodes: kubectl label nodes <node-name> workload=gpu
Tolerations Missing
Check:
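GPU nodes are often tainted so that only GPU workloads land on them; a pod without a matching toleration stays Pending. The sketch below (hypothetical helper names, namespace smallest assumed) prints node taint keys next to the tolerations a pod carries:

```shell
# One row per node with the keys of its taints (<none> if untainted).
node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Print the tolerations defined on a pod.
pod_tolerations() {
  local pod="$1" ns="${2:-smallest}"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.tolerations}{"\n"}'
}
```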
Solutions:
Update tolerations in values.yaml so they match the taints on the target nodes
PVC Not Bound
Check: kubectl get pvc -n smallest
Look for: STATUS: Pending
Solutions:
Check that a storage class exists: kubectl get storageclass
Verify that sufficient storage is available: kubectl describe pvc <pvc-name> -n smallest
Check that the CSI driver is running: kubectl get pods -n kube-system -l app=efs-csi-controller (for EBS, use -l app=ebs-csi-controller)
ImagePullBackOff
Symptoms: pods report ImagePullBackOff or ErrImagePull in the STATUS column.
Diagnosis:
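As a sketch of the diagnosis step (the helper name is hypothetical and the smallest namespace is assumed), the following prints the image and waiting reason for each container, then the full pod description including its events:

```shell
# Show which image each container is waiting on and why, then the pod events.
image_pull_status() {
  local pod="$1" ns="${2:-smallest}"
  # One line per container: image, waiting reason, waiting message.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.image}{"\t"}{.state.waiting.reason}{"\t"}{.state.waiting.message}{"\n"}{end}'
  # Full description; the pull errors appear under Events.
  kubectl describe pod "$pod" -n "$ns"
}
```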
Look for errors in Events section.
Solutions:
Invalid Credentials
Error: unauthorized: authentication required
Solutions:
Verify imageCredentials in values.yaml
Check secret created: kubectl get secrets -n smallest | grep registry
Test credentials locally: docker login quay.io
Recreate secret:
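A minimal sketch of recreating a registry pull secret with kubectl. The secret name is an assumption: use the name returned by the kubectl get secrets command above. If the Helm chart generates the secret from imageCredentials, updating values.yaml and upgrading the release may be the more durable fix.

```shell
# Hypothetical helper: delete and recreate a docker-registry secret for quay.io.
# Arguments: secret name, registry username, registry password, [namespace].
recreate_registry_secret() {
  local name="$1" user="$2" password="$3" ns="${4:-smallest}"
  kubectl delete secret "$name" -n "$ns" --ignore-not-found
  kubectl create secret docker-registry "$name" -n "$ns" \
    --docker-server=quay.io \
    --docker-username="$user" \
    --docker-password="$password"
}
```

Passing the password as an argument leaves it in shell history; reading it from an environment variable or a prompt is safer in shared environments. Pods stuck in ImagePullBackOff retry on their own, but deleting them forces an immediate retry.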
Image Not Found
Error: manifest unknown or not found
Solutions:
Verify image name in values.yaml
Check image exists: docker pull quay.io/smallestinc/lightning-asr:latest
Contact support@smallest.ai for access
Rate Limited
Error: rate limit exceeded
Solutions:
Wait and retry
Use authenticated pulls (imageCredentials)
CrashLoopBackOff
Symptoms: pods report CrashLoopBackOff in the STATUS column and the restart count keeps rising.
Diagnosis:
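Because the container restarts, the useful logs are usually those of the previous instance. A sketch, with a hypothetical helper name and the smallest namespace assumed:

```shell
# Show the logs of the last crashed container instance and how it exited.
crash_report() {
  local pod="$1" ns="${2:-smallest}"
  # --previous reads the logs of the terminated instance, not the new one.
  kubectl logs "$pod" -n "$ns" --previous --tail=100
  # Exit code and reason of the last termination, one line per container.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{": exit code "}{.lastState.terminated.exitCode}{", reason "}{.lastState.terminated.reason}{"\n"}{end}'
}
```

For multi-container pods, add -c <container-name> to the logs command to select a container.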
Common Causes:
License Validation Failed
Error: License validation failed or Invalid license key
Solutions:
Check License Proxy is running: kubectl get pods -l app=license-proxy -n smallest
Verify license key in values.yaml
Check License Proxy logs: kubectl logs -l app=license-proxy -n smallest
Test License Proxy: kubectl exec -it <api-server-pod> -n smallest -- curl http://license-proxy:3369/health
Model Download Failed
Error: Failed to download model or Connection timeout
Solutions:
Verify MODEL_URL in values.yaml
Check network connectivity
Check disk space: kubectl exec -it <pod> -n smallest -- df -h
Test URL: kubectl run test --rm -it --image=curlimages/curl -- curl -I $MODEL_URL
Out of Memory
Error: Pod killed, exit code 137
Solutions:
Check the memory requests and limits currently set on the pod
Increase the memory request and limit in values.yaml
Check node capacity: kubectl describe node <node-name>
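Exit code 137 means the container received SIGKILL (128 + 9), which is what the kernel OOM killer sends when a memory limit is exceeded. A sketch for inspecting the configured limits against live usage follows; the helper name is hypothetical, and kubectl top only works if metrics-server is installed in the cluster.

```shell
# Print memory requests and limits per container, then current usage.
memory_report() {
  local pod="$1" ns="${2:-smallest}"
  # One line per container: name, memory request, memory limit.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'
  # Live usage per container (requires metrics-server).
  kubectl top pod "$pod" -n "$ns" --containers
}
```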
GPU Not Accessible
Error: No CUDA-capable device or GPU not found
Solutions:
Verify GPU available on node: kubectl describe node <node-name> | grep nvidia.com/gpu
Check NVIDIA device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
Restart device plugin: kubectl delete pod -n kube-system -l name=nvidia-device-plugin
Verify GPU driver on node
Service Not Accessible
Symptoms:
Cannot connect to API server
Connection refused errors
Timeouts
Diagnosis:
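A sketch of a basic service check. The service name is passed as an argument because the exact name depends on the Helm release; the helper name is hypothetical and the smallest namespace is assumed.

```shell
# Inspect a service, the pod IPs behind it, and its selector.
service_report() {
  local svc="$1" ns="${2:-smallest}"
  kubectl get svc "$svc" -n "$ns" -o wide   # type, cluster IP, ports, selector
  kubectl get endpoints "$svc" -n "$ns"     # empty list means no ready pods match
  kubectl describe svc "$svc" -n "$ns"      # target ports and recent events
}
```

If the endpoints list is populated but connections still fail, kubectl port-forward svc/<service-name> 7100:7100 -n smallest lets you test the API server port from your workstation, bypassing any ingress or load balancer.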
Solutions:
No Endpoints
Issue: Service has no endpoints
Check: kubectl get endpoints -n smallest
Solutions:
Verify pods are running: kubectl get pods -l app=api-server -n smallest
Check pod labels match service selector
Check pods are ready: kubectl get pods -l app=api-server -n smallest -o wide
Wrong Port
Solutions:
Verify the service port: kubectl get svc -n smallest
Use correct port in connections (7100 for API Server)
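Network Policy Blocking
Check: kubectl get networkpolicy -n smallest
Solutions:
Review network policies
Temporarily remove the suspect policy to test, after saving a copy to restore it: kubectl get networkpolicy <policy-name> -n smallest -o yaml > policy-backup.yaml, then kubectl delete networkpolicy <policy-name> -n smallest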
HPA Not Scaling
Symptoms:
HPA shows <unknown> for metrics
Pods not scaling despite high load
Diagnosis:
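A sketch of an HPA check, with a hypothetical helper name and the smallest namespace assumed. The last command assumes the HPA reads a custom metric through a metrics adapter such as prometheus-adapter, which this guide implies but does not state; it returns an error if no custom metrics API is registered.

```shell
# Show autoscaler targets, conditions, and the custom metrics API, if any.
hpa_report() {
  local ns="${1:-smallest}"
  kubectl get hpa -n "$ns"        # TARGETS shows <unknown> when metrics are missing
  kubectl describe hpa -n "$ns"   # Conditions and Events explain failed metric reads
  # Lists metrics exposed to the HPA controller (needs a custom metrics adapter).
  kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
}
```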
Solutions:
Metrics Not Available
Check: kubectl get servicemonitor -n smallest
Solutions:
Enable the ServiceMonitor in values.yaml
Verify Prometheus is scraping by running the query asr_active_requests in the Prometheus UI
Already at Max Replicas
Check whether REPLICAS has reached MAXPODS: kubectl get hpa -n smallest
Insufficient Cluster Resources
Solutions:
Add more nodes
Enable Cluster Autoscaler
Check pending pods: kubectl get pods -n smallest --field-selector=status.phase=Pending
Persistent Volume Issues
Symptoms:
PVC stuck in Pending
Mount failures
Permission denied
Solutions:
No Storage Class
Check: kubectl get storageclass
Solutions:
Install EBS CSI driver (AWS)
Install EFS CSI driver (AWS)
Create storage class
EFS Mount Failed
Check: kubectl describe pod <pod-name> -n smallest
Solutions:
Verify EFS file system ID
Check security group allows NFS (port 2049)
Verify EFS CSI driver: kubectl get pods -n kube-system -l app=efs-csi-controller
Permission Denied
Solutions:
Check volume permissions
Add fsGroup to the pod securityContext
Slow Response Times
Check: kubectl top pods -n smallest (requires metrics-server)
Solutions:
Increase pod resources
Scale up replicas
Check GPU utilization: kubectl exec -it <lightning-asr-pod> -n smallest -- nvidia-smi
Review model configuration
Check network latency
High CPU/Memory Usage
Check: kubectl top pods -n smallest --containers (requires metrics-server)
Solutions:
Increase resource limits
Scale horizontally (more pods)
Investigate memory leaks in logs
Enable monitoring with Grafana
Interactive Shell
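A minimal sketch for opening a shell in a running pod. The helper name is hypothetical, and /bin/sh is assumed to exist in the container image; minimal images may not ship a shell at all.

```shell
# Open an interactive shell in a running pod.
pod_shell() {
  local pod="$1" ns="${2:-smallest}"
  kubectl exec -it "$pod" -n "$ns" -- /bin/sh
}
```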
Debug Container
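When the application image has no shell or tools, an ephemeral debug container can be attached to the running pod instead. This sketch assumes a cluster version with ephemeral containers enabled (stable since Kubernetes 1.25); the helper name and the busybox image are illustrative choices.

```shell
# Attach an ephemeral busybox container to a running pod.
# --target shares the process namespace of the named container.
debug_container() {
  local pod="$1" target="$2" ns="${3:-smallest}"
  kubectl debug -it "$pod" -n "$ns" --image=busybox:1.36 --target="$target"
}
```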
Network Debugging
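A sketch for starting a throwaway pod to test connectivity from inside the cluster. It reuses the curlimages/curl image that appears earlier in this guide; the pod name netdebug and the helper name are hypothetical.

```shell
# Start a temporary pod with curl available and open a shell in it.
# --rm deletes the pod when the shell exits; --command overrides the
# image entrypoint so that sh runs directly.
net_debug_pod() {
  local ns="${1:-smallest}"
  kubectl run netdebug --rm -it --restart=Never -n "$ns" \
    --image=curlimages/curl --command -- sh
}
```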
Inside the debug pod:
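The checks below are a sketch of what to run from that shell. The license-proxy host and port 3369 come from the health check earlier in this guide; nslookup is assumed to be present in the debug image, and net_checks is a hypothetical name.

```shell
# Resolve the License Proxy service name, then call its health endpoint.
net_checks() {
  nslookup license-proxy
  # -sS: quiet but show errors; -m 5: give up after 5 seconds.
  curl -sS -m 5 http://license-proxy:3369/health
}
```

A DNS failure points at cluster DNS or a wrong namespace, while a successful lookup followed by a timeout points at a network policy or a pod that is not listening.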
Copy Files
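A minimal sketch using kubectl cp, which requires the tar binary inside the container. Helper names and the argument order are hypothetical; the smallest namespace is assumed.

```shell
# Copy a file or directory out of a pod to the local machine.
copy_from_pod() {
  local pod="$1" remote="$2" dest="$3" ns="${4:-smallest}"
  kubectl cp "$ns/$pod:$remote" "$dest"
}

# Copy a local file or directory into a pod.
copy_to_pod() {
  local src="$1" pod="$2" remote="$3" ns="${4:-smallest}"
  kubectl cp "$src" "$ns/$pod:$remote"
}
```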
Getting Help
Before contacting support, collect:
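One way to gather this information is a small script that writes the relevant kubectl and helm output into a directory and archives it. This is a sketch under assumptions: the function name, file names, and default output directory are hypothetical, and the smallest namespace is assumed.

```shell
# Collect resource listings, events, pod logs, and release info into
# a directory, then pack it as <dir>.tar.gz.
collect_diagnostics() {
  local ns="${1:-smallest}" out="${2:-smallest-diagnostics}"
  mkdir -p "$out"
  kubectl get all -n "$ns" -o wide > "$out/resources.txt"
  kubectl describe pods -n "$ns" > "$out/pods-describe.txt"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt"
  kubectl get nodes -o wide > "$out/nodes.txt"
  helm list -n "$ns" > "$out/helm-releases.txt"
  # Last 500 log lines of every container in every pod.
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs "$pod" -n "$ns" --all-containers --tail=500 \
      > "$out/logs-${pod#pod/}.txt" 2>&1
  done
  tar -czf "$out.tar.gz" "$out"
}
```

Review the collected files before sending them, since logs and descriptions can contain license keys or other sensitive values.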
Email: support@smallest.ai
Include:
Description of the issue
Steps to reproduce
Diagnostic files collected above
Cluster information (EKS version, node types, etc.)
Helm chart version