Kubernetes Troubleshooting

Overview

This guide covers common issues encountered when deploying Smallest Self-Host on Kubernetes and how to resolve them.

Diagnostic Commands

Quick Status Check

$ kubectl get all -n smallest
$ kubectl get pods -n smallest --show-labels
$ kubectl top pods -n smallest
$ kubectl top nodes

Detailed Pod Information

$ kubectl describe pod <pod-name> -n smallest
$ kubectl logs <pod-name> -n smallest
$ kubectl logs <pod-name> -n smallest --previous
$ kubectl logs <pod-name> -c <container-name> -n smallest -f

Events

$ kubectl get events -n smallest --sort-by='.lastTimestamp'
$ kubectl get events -n smallest --field-selector type=Warning

Common Issues

Pods Stuck in Pending

Symptoms:

NAME                READY   STATUS    RESTARTS   AGE
lightning-asr-xxx   0/1     Pending   0          5m

Causes and Solutions:

Issue: Insufficient GPU resources

Check:

$ kubectl describe pod lightning-asr-xxx -n smallest

Look for: 0/3 nodes are available: 3 Insufficient nvidia.com/gpu

Solutions:

  • Add GPU nodes to cluster
  • Check GPU nodes are ready: kubectl get nodes -l nvidia.com/gpu=true
  • Verify GPU device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
  • Reduce requested GPUs or add more nodes
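If GPU nodes exist but pods still cannot be placed, make sure the chart requests GPUs explicitly. A minimal sketch of the relevant values.yaml block, assuming the chart nests resources under lightningAsr the same way its memory settings do:

```yaml
# Sketch only: key names assume the chart's lightningAsr.resources layout.
lightningAsr:
  resources:
    requests:
      nvidia.com/gpu: 1
    limits:
      nvidia.com/gpu: 1
```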

Issue: Node selector mismatch

Check:

$ kubectl get nodes --show-labels
$ kubectl describe pod lightning-asr-xxx -n smallest | grep "Node-Selectors"

Solutions:

  • Update nodeSelector in values.yaml to match actual node labels
  • Remove nodeSelector if not needed
  • Add labels to nodes: kubectl label nodes <node-name> workload=gpu
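If you label nodes as above, the matching selector in values.yaml would look like the following (this assumes the chart nests nodeSelector under lightningAsr; verify against your chart's values schema):

```yaml
# Sketch: assumes lightningAsr.nodeSelector controls pod placement in this chart.
lightningAsr:
  nodeSelector:
    workload: gpu
```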

Issue: Node taints not tolerated

Check:

$ kubectl describe pod lightning-asr-xxx -n smallest | grep -A5 "Tolerations"
$ kubectl describe node <node-name> | grep "Taints"

Solutions: Update tolerations in values.yaml:

lightningAsr:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

Issue: PersistentVolumeClaim not bound

Check:

$ kubectl get pvc -n smallest

Look for: STATUS: Pending

Solutions:

  • Check storage class exists: kubectl get storageclass
  • Verify sufficient storage: kubectl describe pvc <pvc-name> -n smallest
  • Check EFS/EBS CSI driver running: kubectl get pods -n kube-system -l app=efs-csi-controller

ImagePullBackOff

Symptoms:

NAME                READY   STATUS             RESTARTS   AGE
lightning-asr-xxx   0/1     ImagePullBackOff   0          2m

Diagnosis:

$ kubectl describe pod lightning-asr-xxx -n smallest

Look for errors in Events section.

Solutions:

Error: unauthorized: authentication required

Solutions:

  • Verify imageCredentials in values.yaml
  • Check secret created: kubectl get secrets -n smallest | grep registry
  • Test credentials locally: docker login quay.io
  • Recreate secret:
    $ kubectl delete secret <pull-secret> -n smallest
    $ helm upgrade smallest-self-host ... -f values.yaml
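For reference, an imageCredentials block in values.yaml commonly has the following shape. The subkey names here are an assumption, not taken from this chart, so confirm them against the chart's documented values:

```yaml
# Assumed key names; confirm against the chart's values schema.
imageCredentials:
  registry: quay.io
  username: <your-username>
  password: <your-password>
```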

Error: manifest unknown or not found

Solutions:

  • Verify image name in values.yaml
  • Check image exists: docker pull quay.io/smallestinc/lightning-asr:latest
  • Contact support@smallest.ai for access

Error: rate limit exceeded

Solutions:

  • Wait and retry
  • Use authenticated pulls (imageCredentials)

CrashLoopBackOff

Symptoms:

NAME                READY   STATUS             RESTARTS   AGE
lightning-asr-xxx   0/1     CrashLoopBackOff   5          5m

Diagnosis:

$ kubectl logs lightning-asr-xxx -n smallest
$ kubectl logs lightning-asr-xxx -n smallest --previous
$ kubectl describe pod lightning-asr-xxx -n smallest

Common Causes:

Error: License validation failed or Invalid license key

Solutions:

  • Check License Proxy is running: kubectl get pods -l app=license-proxy -n smallest
  • Verify license key in values.yaml
  • Check License Proxy logs: kubectl logs -l app=license-proxy -n smallest
  • Test License Proxy: kubectl exec -it <api-server-pod> -- curl http://license-proxy:3369/health

Error: Failed to download model or Connection timeout

Solutions:

  • Verify MODEL_URL in values.yaml
  • Check network connectivity
  • Check disk space: kubectl exec -it <pod> -- df -h
  • Test URL: kubectl run test --rm -it --image=curlimages/curl -- curl -I $MODEL_URL

Error: Pod killed, exit code 137 (OOMKilled)

Solutions:

  • Check memory limits:
    $ kubectl describe pod lightning-asr-xxx -n smallest | grep -A5 Limits
  • Increase memory:
    lightningAsr:
      resources:
        limits:
          memory: 16Gi
  • Check node capacity: kubectl describe node <node-name>

Error: No CUDA-capable device or GPU not found

Solutions:

  • Verify GPU available on node: kubectl describe node <node-name> | grep nvidia.com/gpu
  • Check NVIDIA device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
  • Restart device plugin: kubectl delete pod -n kube-system -l name=nvidia-device-plugin
  • Verify GPU driver on node

Service Not Accessible

Symptoms:

  • Cannot connect to API server
  • Connection refused errors
  • Timeouts

Diagnosis:

$ kubectl get svc -n smallest
$ kubectl describe svc api-server -n smallest
$ kubectl get endpoints -n smallest

Solutions:

Issue: Service has no endpoints

Check:

$ kubectl get endpoints api-server -n smallest

Solutions:

  • Verify pods are running: kubectl get pods -l app=api-server -n smallest
  • Check pod labels match service selector
  • Check pods are ready: kubectl get pods -l app=api-server -o wide

Issue: Wrong service port

Solutions:

  • Verify the service port:
    $ kubectl get svc api-server -n smallest -o yaml
  • Use the correct port when connecting (7100 for the API Server)

Issue: Network policy blocking traffic

Check:

$ kubectl get networkpolicy -n smallest

Solutions:

  • Review network policies
  • Temporarily disable to test:
    $ kubectl delete networkpolicy <policy-name> -n smallest
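As a less disruptive test than deleting a policy, you can apply a narrowly scoped policy that explicitly allows ingress to the API Server on its service port. A sketch using the app=api-server label and port 7100 from this guide (the policy name is illustrative):

```yaml
# Illustrative policy: allows all ingress to api-server pods on TCP 7100.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-server-ingress
  namespace: smallest
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 7100
```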

HPA Not Scaling

Symptoms:

  • HPA shows <unknown> for metrics
  • Pods not scaling despite high load

Diagnosis:

$ kubectl get hpa -n smallest
$ kubectl describe hpa lightning-asr -n smallest
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

Solutions:

Issue: Custom metrics not available

Check:

$ kubectl get servicemonitor -n smallest
$ kubectl logs -n kube-system -l app.kubernetes.io/name=prometheus-adapter

Solutions:

  • Enable ServiceMonitor:
    scaling:
      auto:
        lightningAsr:
          servicemonitor:
            enabled: true
  • Verify Prometheus is scraping:
    $ kubectl port-forward svc/smallest-prometheus-stack-prometheus 9090:9090
    Query: asr_active_requests

Issue: maxReplicas limit reached

Check:

$ kubectl get hpa lightning-asr -n smallest

Solutions:

  • Increase maxReplicas:
    scaling:
      auto:
        lightningAsr:
          hpa:
            maxReplicas: 20

Issue: No node capacity for new pods

Solutions:

  • Add more nodes
  • Enable Cluster Autoscaler
  • Check pending pods: kubectl get pods --field-selector=status.phase=Pending

Persistent Volume Issues

Symptoms:

  • PVC stuck in Pending
  • Mount failures
  • Permission denied

Solutions:

Issue: No storage class available

Check:

$ kubectl get storageclass

Solutions:

  • Install EBS CSI driver (AWS)
  • Install EFS CSI driver (AWS)
  • Create storage class
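On AWS with the EBS CSI driver installed, a minimal StorageClass looks like the following (the class name and gp3 volume type are illustrative; adapt to your provisioner):

```yaml
# Illustrative gp3 StorageClass for the AWS EBS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```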

Issue: Mount failures

Check:

$ kubectl describe pod <pod-name> | grep -A10 "Events"

Solutions:

  • Verify EFS file system ID
  • Check security group allows NFS (port 2049)
  • Verify EFS CSI driver: kubectl get pods -n kube-system -l app=efs-csi-controller

Issue: Permission denied

Solutions:

  • Check volume permissions
  • Add fsGroup to pod securityContext:
    securityContext:
      fsGroup: 1000

Performance Issues

Slow Response Times

Check:

$ kubectl top pods -n smallest
$ kubectl top nodes
$ kubectl logs -l app=lightning-asr -n smallest | grep -i "latency\|duration"

Solutions:

  • Increase pod resources
  • Scale up replicas
  • Check GPU utilization: kubectl exec -it <lightning-asr-pod> -- nvidia-smi
  • Review model configuration
  • Check network latency

High CPU/Memory Usage

Check:

$ kubectl top pods -n smallest
$ kubectl describe pod <pod-name> -n smallest | grep -A5 "Limits"

Solutions:

  • Increase resource limits
  • Scale horizontally (more pods)
  • Investigate memory leaks in logs
  • Enable monitoring with Grafana

Debugging Tools

Interactive Shell

$ kubectl exec -it <pod-name> -n smallest -- /bin/sh

Debug Container

$ kubectl debug <pod-name> -n smallest -it --image=ubuntu -- bash

Network Debugging

$ kubectl run netdebug --rm -it --restart=Never \
> --image=nicolaka/netshoot \
> --namespace=smallest

Inside the debug pod:

$ nslookup api-server
$ curl http://api-server:7100/health
$ traceroute lightning-asr

Copy Files

$ kubectl cp <pod-name>:/path/to/file ./local-file -n smallest
$ kubectl cp ./local-file <pod-name>:/path/to/file -n smallest

Getting Help

Collect Diagnostic Information

Before contacting support, collect:

$ kubectl get all -n smallest > status.txt
$ kubectl describe pods -n smallest > pods.txt
$ kubectl logs -l app=lightning-asr -n smallest --tail=500 > asr-logs.txt
$ kubectl logs -l app=api-server -n smallest --tail=500 > api-logs.txt
$ kubectl logs -l app=license-proxy -n smallest --tail=500 > license-logs.txt
$ kubectl get events -n smallest --sort-by='.lastTimestamp' > events.txt
$ kubectl top nodes > nodes.txt
$ kubectl top pods -n smallest > pod-resources.txt
$ helm get values smallest-self-host -n smallest > values.txt
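The commands above can be wrapped in a small script that writes everything into one timestamped archive for support. This is a sketch: it assumes kubectl and helm are on your PATH and uses the namespace and release name from this guide, so adjust NS and RELEASE if yours differ.

```shell
#!/bin/sh
# Collect Smallest Self-Host diagnostics into a single tar.gz for support.
set -u
NS=smallest
RELEASE=smallest-self-host
OUT="smallest-diagnostics-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

# Each collector may fail on its own (|| true) so one missing resource
# or unreachable metrics server does not abort the whole bundle.
kubectl get all -n "$NS" > "$OUT/status.txt" 2>&1 || true
kubectl describe pods -n "$NS" > "$OUT/pods.txt" 2>&1 || true
kubectl logs -l app=lightning-asr -n "$NS" --tail=500 > "$OUT/asr-logs.txt" 2>&1 || true
kubectl logs -l app=api-server -n "$NS" --tail=500 > "$OUT/api-logs.txt" 2>&1 || true
kubectl logs -l app=license-proxy -n "$NS" --tail=500 > "$OUT/license-logs.txt" 2>&1 || true
kubectl get events -n "$NS" --sort-by='.lastTimestamp' > "$OUT/events.txt" 2>&1 || true
kubectl top nodes > "$OUT/nodes.txt" 2>&1 || true
kubectl top pods -n "$NS" > "$OUT/pod-resources.txt" 2>&1 || true
helm get values "$RELEASE" -n "$NS" > "$OUT/values.txt" 2>&1 || true

tar czf "$OUT.tar.gz" "$OUT"
echo "Wrote $OUT.tar.gz"
```

Attach the resulting archive to your support email instead of nine separate files.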

Contact Support

Email: support@smallest.ai

Include:

  • Description of the issue
  • Steps to reproduce
  • Diagnostic files collected above
  • Cluster information (EKS version, node types, etc.)
  • Helm chart version

What’s Next?