Horizontal Pod Autoscaling (HPA) automatically adjusts the number of Lightning ASR and API Server pods based on workload demand. This guide covers configuring HPA using custom metrics like active request count.
Lightning ASR exports the asr_active_requests metric, which tracks the number of requests currently being processed. HPA uses this to scale pods up or down.
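To make the scaling decision concrete, the sketch below approximates the standard Kubernetes HPA calculation for a per-pod average metric such as asr_active_requests. It is an illustration under stated assumptions, not the controller's actual code: the function name and default values are hypothetical, and only the formula (desired = ceil(current replicas × current value / target value)) and the default 10% tolerance follow the documented HPA behavior.

```python
import math


def desired_replicas(current_replicas, avg_active_requests, target_active_requests,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica recommendation for a per-pod average metric.

    The defaults for min_replicas and max_replicas are illustrative only.
    """
    ratio = avg_active_requests / target_active_requests
    # If the metric is within the tolerance band around the target,
    # the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the recommendation to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 2 pods averaging 10 active requests each, target of 5 per pod
print(desired_replicas(2, 10, 5))  # 4
```

With a target of 5, two pods averaging 10 active requests each produce a recommendation of 4 pods, while an average of 5.2 stays inside the tolerance band and causes no change.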
Configure autoscaling for Lightning ASR based on active requests:
Parameters:
minReplicas: Minimum number of pods (HPA never scales below this)
maxReplicas: Maximum number of pods (HPA never scales above this)
targetActiveRequests: Target active requests per pod (scales up when exceeded)
scaleUpStabilizationWindowSeconds: Delay before scaling up (0 = immediate)
scaleDownStabilizationWindowSeconds: Delay before scaling down (prevents flapping)
Configure autoscaling for API Server based on Lightning ASR replicas:
Parameters:
lightningAsrToApiServerRatio: Ratio of Lightning ASR pods to API Server pods (2 = 2 ASR pods per 1 API pod)
Fine-tune scaling behavior:
Scale Up Policies:
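selectPolicy: Max (when several policies are defined, the one allowing the largest change applies)
Scale Down Policies:
selectPolicy: Min (when several policies are defined, the one allowing the smallest change applies)
Scale based on multiple metrics: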
Expected output:
Look for:
Verify prometheus-adapter is providing metrics:
Should show asr_active_requests in the list.
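If you prefer to check the list programmatically, the sketch below parses a discovery response from the custom metrics API and looks for the metric. It assumes the response has already been saved as JSON text (for example from a raw request to /apis/custom.metrics.k8s.io/v1beta1) and that it follows the usual APIResourceList shape, with resource names such as "pods/asr_active_requests". The function names are hypothetical.

```python
import json


def list_custom_metrics(api_resource_list_json):
    """Return the sorted resource names from a custom metrics discovery response."""
    doc = json.loads(api_resource_list_json)
    return sorted({resource["name"] for resource in doc.get("resources", [])})


def has_metric(api_resource_list_json, metric="asr_active_requests"):
    """True if any resource (e.g. 'pods/<metric>') exposes the given metric."""
    names = list_custom_metrics(api_resource_list_json)
    # Names are of the form "<resource>/<metric>"; compare only the metric part.
    return any(name.split("/", 1)[-1] == metric for name in names)
```

An empty resources list usually means prometheus-adapter has no matching rule or Prometheus has no data for the metric yet.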
Query specific metric:
Generate load to trigger scaling:
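One possible way to generate concurrent load is a small script like the sketch below. It is a minimal illustration, not the tool this guide originally used: the endpoint URL, port, and payload in the usage comment are placeholders you must replace with your deployment's values, and the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def send_request(url, payload, timeout=30.0):
    """POST raw bytes to an endpoint and return the HTTP status code."""
    req = urllib.request.Request(url, data=payload, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def run_load(send, total_requests, concurrency):
    """Call send() total_requests times with at most `concurrency` calls in flight."""
    def attempt(_):
        try:
            send()
            return True
        except Exception:
            # Any error (timeout, HTTP error, connection refused) counts as a failure.
            return False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, range(total_requests)))
    return {"ok": results.count(True), "failed": results.count(False)}


# Example usage (placeholder URL and file, adjust to your deployment):
# audio = open("sample.wav", "rb").read()
# print(run_load(lambda: send_request("http://localhost:8080/your-endpoint", audio), 200, 20))
```

Keep the concurrency above targetActiveRequests multiplied by the current pod count, otherwise the per-pod average never exceeds the target and no scale-up is triggered.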
Watch scaling in action:
In another terminal:
You should see:
Situation: Sudden increase in requests
HPA Response:
asr_active_requests > 5 per pod
Configuration:
Situation: Traffic decreases after peak hours
HPA Response:
asr_active_requests < 5 per pod
Configuration:
Situation: No traffic during night
HPA Response:
minReplicas: 1
Configuration:
For complete cost savings during off-hours, use Cluster Autoscaler to scale nodes to zero.
Symptom:
Diagnosis:
Check prometheus-adapter logs:
Check ServiceMonitor:
Check Prometheus is scraping:
Open http://localhost:9090 and query: asr_active_requests
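The same check can be scripted against the Prometheus HTTP API instead of the web UI. The sketch below assumes Prometheus is reachable at the base URL you pass in (for example through the same port-forward on localhost:9090) and uses the standard /api/v1/query endpoint. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(body):
    """Return (labels, value) pairs from an instant-query JSON response."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {doc.get('error', 'unknown error')}")
    # Each sample carries its value as [timestamp, "<value as string>"].
    return [(sample["metric"], float(sample["value"][1])) for sample in doc["data"]["result"]]


def query_prometheus(base_url, promql, timeout=10.0):
    """Run an instant query and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(resp.read())


# Example usage:
# for labels, value in query_prometheus("http://localhost:9090", "asr_active_requests"):
#     print(labels.get("pod"), value)
```

An empty result means Prometheus is not scraping the metric, which points at the ServiceMonitor or the pod's metrics endpoint rather than at prometheus-adapter.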
Solutions:
Symptom: Metrics show high load, but the number of pods is not increasing
Check:
Look for events explaining why scaling didn’t occur:
Common causes:
maxReplicas
Symptom: Pods constantly scaling up and down
Solution: Increase stabilization windows:
Symptom: Pods remain after traffic drops
Solution: Reduce scale-down stabilization:
Be careful: an overly aggressive scale-down causes flapping.
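To see why the scale-down stabilization window matters, the sketch below models its documented effect in a simplified way: when scaling down, the HPA acts on the highest replica recommendation seen within the window, so a brief dip in load does not remove pods immediately. The class and method names are hypothetical, and the model ignores scale-up windows and rate policies.

```python
from collections import deque


class ScaleDownStabilizer:
    """Simplified model of the HPA scale-down stabilization window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now, desired):
        """Record a new recommendation and return the stabilized replica count."""
        self._history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # Use the highest recommendation still inside the window.
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(0, 6))    # 6
print(stabilizer.recommend(60, 2))   # 6 (the earlier peak is still in the window)
print(stabilizer.recommend(301, 2))  # 2 (the peak has aged out)
```

With a 300-second window the replica count stays at 6 until the peak recommendation ages out. With a window of 0 it would drop to 2 on the first low reading, which is what makes flapping likely.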
Choose targetActiveRequests based on your model performance:
Test under load to find the optimal value for your deployment.
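One way to turn load-test results into a starting value is sketched below. It uses Little's law (concurrency = throughput × latency) to estimate how many requests a single pod handles at once when saturated, then applies a headroom factor so that scaling starts before the pod is fully loaded. The 0.7 headroom and the function names are assumptions for illustration, not recommendations from Lightning ASR.

```python
import math


def concurrency_from_throughput(requests_per_second, avg_latency_seconds):
    """Little's law: average in-flight requests = throughput x latency."""
    return requests_per_second * avg_latency_seconds


def suggest_target_active_requests(saturation_concurrency, headroom=0.7):
    """Suggest a per-pod target below the measured saturation point.

    headroom=0.7 is an illustrative assumption; tune it from your own load tests.
    """
    return max(1, math.floor(saturation_concurrency * headroom))


# A pod that sustains 4 requests/s at 2 s average latency holds about 8 requests in flight.
saturation = concurrency_from_throughput(4, 2.0)
print(suggest_target_active_requests(saturation))  # 5
```

Treat the result as a first guess and confirm it by watching latency while the HPA scales under real load.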
Scale up quickly, scale down slowly:
This prevents request failures during traffic fluctuations.
Consider cluster capacity when setting maxReplicas:
Don’t set maxReplicas higher than the available GPU resources can support.
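A rough upper bound for maxReplicas can be derived from the GPU capacity of the cluster. The sketch below assumes every Lightning ASR pod requests a whole number of GPUs and that a pod cannot span nodes. The function name and the reserved_pods parameter (capacity set aside for other GPU workloads) are hypothetical.

```python
def max_replicas_for_capacity(gpu_nodes, gpus_per_node, gpus_per_pod=1, reserved_pods=0):
    """Upper bound on ASR pods that the GPU nodes can schedule."""
    # A pod must fit on a single node, so count whole pods per node first.
    pods_per_node = gpus_per_node // gpus_per_pod
    return max(0, gpu_nodes * pods_per_node - reserved_pods)


# 4 nodes with 1 GPU each, 1 GPU per pod
print(max_replicas_for_capacity(4, 1))  # 4
```

If Cluster Autoscaler can add GPU nodes, use its maximum node count for gpu_nodes rather than the current count.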
Use Grafana to visualize:
Regularly load test to verify HPA behavior: