HPA Configuration
Overview
Horizontal Pod Autoscaling (HPA) automatically adjusts the number of Lightning ASR and API Server pods based on workload demand. This guide covers configuring HPA using custom metrics like active request count.
How HPA Works
Lightning ASR exports the asr_active_requests metric, which tracks the number of requests currently being processed. HPA uses this to scale pods up or down.
Prerequisites
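The prerequisite steps are not reproduced here, but the rest of this guide assumes a Prometheus stack with prometheus-adapter serving the custom metrics API. A quick sanity check (a sketch; jq is optional):

```shell
# Confirm the custom metrics API is registered (served by prometheus-adapter)
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
```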
Enable HPA
Lightning ASR HPA
Configure autoscaling for Lightning ASR based on active requests:
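A minimal values sketch, assuming a Helm values layout with these keys under the Lightning ASR chart (key paths and defaults are assumptions; the parameters are explained below):

```yaml
# values.yaml (sketch; key paths are assumptions)
lightning-asr:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetActiveRequests: 5                  # scale when avg active requests per pod exceeds 5
    scaleUpStabilizationWindowSeconds: 0     # scale up immediately
    scaleDownStabilizationWindowSeconds: 300 # wait 5 minutes before scaling down
```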
Parameters:
- minReplicas: Minimum number of pods (never scales below)
- maxReplicas: Maximum number of pods (never scales above)
- targetActiveRequests: Target active requests per pod (scales when exceeded)
- scaleUpStabilizationWindowSeconds: Delay before scaling up (0 = immediate)
- scaleDownStabilizationWindowSeconds: Delay before scaling down (prevents flapping)
API Server HPA
Configure autoscaling for API Server based on Lightning ASR replicas:
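A minimal values sketch, assuming the same Helm values layout for the API Server chart (key paths are assumptions; the parameter is explained below):

```yaml
# values.yaml (sketch; key paths are assumptions)
api-server:
  autoscaling:
    enabled: true
    lightningAsrToApiServerRatio: 2  # 2 ASR pods per 1 API Server pod
```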
Parameters:
lightningAsrToApiServerRatio: Ratio of Lightning ASR to API Server pods (2 = 2 ASR pods per 1 API pod)
Advanced Scaling Behavior
Custom Scaling Policies
Fine-tune scaling behavior:
Scale Up Policies:
- Add up to 100% more pods every 15 seconds
- OR add up to 2 pods every 15 seconds
- Use whichever is higher (selectPolicy: Max)
Scale Down Policies:
- Remove up to 50% of pods every 60 seconds
- OR remove up to 1 pod every 60 seconds
- Use whichever is lower (selectPolicy: Min)
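The policies above correspond to the standard autoscaling/v2 behavior stanza; a sketch using the upstream Kubernetes field names, with the example values described above:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    selectPolicy: Max        # pick the policy allowing the larger change
    policies:
      - type: Percent
        value: 100           # up to 100% more pods...
        periodSeconds: 15    # ...every 15 seconds
      - type: Pods
        value: 2             # or up to 2 pods every 15 seconds
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300
    selectPolicy: Min        # pick the policy allowing the smaller change
    policies:
      - type: Percent
        value: 50            # up to 50% fewer pods...
        periodSeconds: 60    # ...every 60 seconds
      - type: Pods
        value: 1             # or up to 1 pod every 60 seconds
        periodSeconds: 60
```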
Multi-Metric HPA
Scale based on multiple metrics:
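A sketch in the standard autoscaling/v2 form, combining the custom asr_active_requests metric with CPU utilization (the 70% CPU threshold is an example, not a recommendation). The HPA scales to satisfy whichever metric demands the most replicas:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: asr_active_requests
      target:
        type: AverageValue
        averageValue: "5"          # target active requests per pod
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # example threshold
```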
Verify HPA Configuration
Check HPA Status
Expected output:
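For example (namespace, names, and figures are illustrative; the TARGETS column reads current/target average active requests per pod):

```shell
kubectl get hpa -n lightning-asr
# NAME            REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
# lightning-asr   Deployment/lightning-asr   2/5       1         10        3
```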
Describe HPA
Look for:
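A sketch (HPA and namespace names are assumptions); in the output, the AbleToScale and ScalingActive conditions should report True, and the Events section should list recent scaling decisions:

```shell
kubectl describe hpa lightning-asr -n lightning-asr
```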
Check Custom Metrics
Verify prometheus-adapter is providing metrics:
Should show asr_active_requests in the list.
Query specific metric:
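Both checks might look like this (namespace is an assumption; jq is optional):

```shell
# List metrics exposed through the custom metrics API
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'

# Query asr_active_requests across all pods in the namespace
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/lightning-asr/pods/*/asr_active_requests" | jq .
```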
Testing HPA
Load Testing
Generate load to trigger scaling:
Watch scaling in action:
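A sketch of both steps; the endpoint path, payload, address, and namespace are assumptions:

```shell
# Generate sustained load (endpoint and payload are placeholders)
for i in $(seq 1 100); do
  curl -s -X POST "http://<api-server-address>/v1/transcribe" \
    -F "file=@sample.wav" > /dev/null &
done

# Watch the HPA react to the load
kubectl get hpa -n lightning-asr -w
```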
Monitor Pod Count
In another terminal:
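For example (namespace is an assumption):

```shell
kubectl get pods -n lightning-asr -w
```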
You should see:
- Active requests increase
- HPA detects load above target
- New pods created
- Load distributed across pods
- After load decreases, pods scale down (after stabilization window)
Scaling Scenarios
Scenario 1: Traffic Spike
Situation: Sudden increase in requests
HPA Response:
- Detects asr_active_requests > 5 per pod
- Immediately scales up (stabilization: 0s)
- Adds pods based on policy (2 pods or 100%, whichever is higher)
- Repeats every 15 seconds until load is distributed
Configuration:
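A sketch of the relevant values (key names as in the parameters earlier in this guide):

```yaml
targetActiveRequests: 5
scaleUpStabilizationWindowSeconds: 0   # react immediately to spikes
```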
Scenario 2: Gradual Traffic Decline
Situation: Traffic decreases after peak hours
HPA Response:
- Detects asr_active_requests < 5 per pod
- Waits 300 seconds (5 minutes) before scaling down
- Gradually removes pods (1 pod or 50%, whichever is lower)
- Prevents premature scale-down
Configuration:
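A sketch of the relevant value (key name as in the parameters earlier in this guide):

```yaml
scaleDownStabilizationWindowSeconds: 300  # wait 5 minutes before removing pods
```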
Scenario 3: Off-Hours
Situation: No traffic during night
HPA Response:
- Scales down to minReplicas: 1
- Keeps one pod ready for incoming requests
- Scales up immediately when traffic resumes
Configuration:
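A sketch of the relevant value (key name as in the parameters earlier in this guide):

```yaml
minReplicas: 1  # keep one warm pod for incoming requests
```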
For complete cost savings during off-hours, use Cluster Autoscaler to scale nodes to zero.
Troubleshooting
HPA Shows “Unknown”
Symptom:
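The status typically looks like this (names are assumptions): the TARGETS column shows <unknown> because the custom metric cannot be fetched:

```shell
kubectl get hpa -n lightning-asr
# NAME            REFERENCE                  TARGETS       MINPODS   MAXPODS   REPLICAS
# lightning-asr   Deployment/lightning-asr   <unknown>/5   1         10        1
```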
Diagnosis:
Check prometheus-adapter logs:
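For example (the adapter's deployment name and namespace are assumptions):

```shell
kubectl logs -n monitoring deploy/prometheus-adapter --tail=50
```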
Check ServiceMonitor:
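For example (namespace is an assumption):

```shell
kubectl get servicemonitor -n lightning-asr
```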
Check Prometheus is scraping:
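One way is to port-forward Prometheus (service name and namespace are assumptions):

```shell
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
```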
Open http://localhost:9090 and query: asr_active_requests
Solutions:
- Ensure ServiceMonitor is created
- Verify Prometheus is scraping Lightning ASR pods
- Check prometheus-adapter configuration
HPA Not Scaling
Symptom: Metrics show high load but pods not increasing
Check:
Look for events explaining why scaling didn’t occur:
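A sketch (HPA and namespace names are assumptions); the Events section at the bottom of the output records why scaling was or was not performed:

```shell
kubectl describe hpa lightning-asr -n lightning-asr
```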
Common causes:
- Metrics not available (see above)
- Already at maxReplicas
- Insufficient cluster resources
- Stabilization window preventing scale-up
Pods Scaling Too Aggressively
Symptom: Pods constantly scaling up and down
Solution: Increase stabilization windows:
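A sketch (values are examples; larger windows smooth out bursty metrics):

```yaml
scaleUpStabilizationWindowSeconds: 60
scaleDownStabilizationWindowSeconds: 600
```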
Scale-Down Too Slow
Symptom: Pods remain after traffic drops
Solution: Reduce scale-down stabilization:
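A sketch (the value is an example):

```yaml
scaleDownStabilizationWindowSeconds: 120  # remove idle pods after 2 minutes
```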
Be careful: an overly aggressive scale-down causes flapping.
Best Practices
Set Appropriate Targets
Choose targetActiveRequests based on your model performance:
- Larger models (slower inference): Lower target (e.g., 3)
- Smaller models (faster inference): Higher target (e.g., 10)
Test with load to find optimal value.
Use Conservative Scale-Down
Scale up quickly, scale down slowly:
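A sketch of this asymmetric configuration (values are examples):

```yaml
scaleUpStabilizationWindowSeconds: 0     # react immediately to load
scaleDownStabilizationWindowSeconds: 300 # drain capacity slowly
```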
Prevents request failures during traffic fluctuations.
Set Realistic Limits
Consider cluster capacity when setting maxReplicas:
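For example (the GPU count is an assumption):

```yaml
# Example: 8 GPUs in the cluster, one GPU per ASR pod
maxReplicas: 8
```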
Don’t set higher than available GPU resources.
Monitor HPA Decisions
Use Grafana to visualize:
- Current vs target metrics
- Pod count over time
- Scale-up/down events
Test Under Load
Regularly load test to verify HPA behavior:

