HPA Configuration

Overview

Horizontal Pod Autoscaling (HPA) automatically adjusts the number of Lightning ASR and API Server pods based on workload demand. This guide covers configuring HPA using custom metrics like active request count.

How HPA Works

Lightning ASR exports the asr_active_requests metric, which tracks the number of requests currently being processed. HPA uses this to scale pods up or down.
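
The metric reaches the autoscaler through the monitoring stack: Prometheus scrapes it from the pods via a ServiceMonitor, prometheus-adapter republishes it on the Kubernetes custom metrics API, and the HPA controller reads it from there (each piece is covered under Prerequisites below). The HPA then targets it as a Pods-type metric; a sketch of that stanza in standard autoscaling/v2 terms (the chart's rendered output may differ):

metrics:
  - type: Pods
    pods:
      metric:
        name: asr_active_requests
      target:
        type: AverageValue
        averageValue: "5"   # set by targetActiveRequests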

Prerequisites

1. Prometheus Stack

Install kube-prometheus-stack (included in the Helm chart):

values.yaml
scaling:
  auto:
    enabled: true

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      serviceMonitorSelectorNilUsesHelmValues: false
  prometheusOperator:
    enabled: true
  grafana:
    enabled: true
2. Prometheus Adapter

Install prometheus-adapter (included in the Helm chart):

values.yaml
prometheus-adapter:
  prometheus:
    url: http://smallest-prometheus-stack-prometheus.default.svc
    port: 9090
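
prometheus-adapter also needs a rule that maps asr_active_requests from Prometheus into the custom metrics API. The bundled chart configuration may already include an equivalent rule; if you have to define it yourself, here is a sketch in the prometheus-adapter chart's rules.custom format (the label names are assumptions):

prometheus-adapter:
  rules:
    custom:
      - seriesQuery: 'asr_active_requests'
        resources:
          overrides:
            namespace: { resource: "namespace" }
            pod: { resource: "pod" }
        name:
          matches: "asr_active_requests"
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'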
3. Service Monitor

Enable ServiceMonitor for Lightning ASR:

values.yaml
scaling:
  auto:
    lightningAsr:
      servicemonitor:
        enabled: true
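
This renders a ServiceMonitor object that tells Prometheus which pods to scrape. It should look roughly like the sketch below; the name, label selector, port name, and interval shown here are assumptions based on defaults used elsewhere in this guide:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lightning-asr
  namespace: smallest
spec:
  selector:
    matchLabels:
      app: lightning-asr
  endpoints:
    - port: metrics   # assumed metrics port name
      interval: 15s   # assumed scrape interval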

Enable HPA

Lightning ASR HPA

Configure autoscaling for Lightning ASR based on active requests:

values.yaml
scaling:
  auto:
    enabled: true
    lightningAsr:
      hpa:
        enabled: true
        minReplicas: 1
        maxReplicas: 10
        targetActiveRequests: 5
        scaleUpStabilizationWindowSeconds: 0
        scaleDownStabilizationWindowSeconds: 300

Parameters:

  • minReplicas: Minimum number of pods (never scales below)
  • maxReplicas: Maximum number of pods (never scales above)
  • targetActiveRequests: Target average active requests per pod (scales up when the per-pod average exceeds this)
  • scaleUpStabilizationWindowSeconds: Delay before scaling up (0 = immediate)
  • scaleDownStabilizationWindowSeconds: Delay before scaling down (prevents flapping)
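
Under the hood, the replica count follows the standard Kubernetes HPA formula:

desiredReplicas = ceil(currentReplicas × currentAverageValue / targetValue)

For example, 2 pods averaging 8 active requests each against targetActiveRequests: 5 gives ceil(2 × 8 / 5) = 4 replicas.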

API Server HPA

Configure autoscaling for API Server based on Lightning ASR replicas:

values.yaml
scaling:
  auto:
    enabled: true
    apiServer:
      hpa:
        enabled: true
        minReplicas: 1
        maxReplicas: 10
        lightningAsrToApiServerRatio: 2
        scaleUpStabilizationWindowSeconds: 30
        scaleDownStabilizationWindowSeconds: 60

Parameters:

  • lightningAsrToApiServerRatio: Ratio of Lightning ASR to API Server pods (2 = 2 ASR pods per 1 API pod)
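
For example, with a ratio of 2, six Lightning ASR replicas yield three API Server replicas; the result is still clamped between minReplicas and maxReplicas.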

Advanced Scaling Behavior

Custom Scaling Policies

Fine-tune scaling behavior:

values.yaml
scaling:
  auto:
    lightningAsr:
      hpa:
        enabled: true
        minReplicas: 2
        maxReplicas: 20
        targetActiveRequests: 5
        behavior:
          scaleUp:
            stabilizationWindowSeconds: 5
            policies:
              - type: Percent
                value: 100
                periodSeconds: 15
              - type: Pods
                value: 2
                periodSeconds: 15
            selectPolicy: Max
          scaleDown:
            stabilizationWindowSeconds: 300
            policies:
              - type: Percent
                value: 50
                periodSeconds: 60
              - type: Pods
                value: 1
                periodSeconds: 60
            selectPolicy: Min

Scale Up Policies:

  • Add up to 100% more pods every 15 seconds
  • OR add up to 2 pods every 15 seconds
  • Use whichever is higher (selectPolicy: Max)

Scale Down Policies:

  • Remove up to 50% of pods every 60 seconds
  • OR remove up to 1 pod every 60 seconds
  • Use whichever is lower (selectPolicy: Min)

Multi-Metric HPA

Scale based on multiple metrics:

spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: asr_active_requests
        target:
          type: AverageValue
          averageValue: "5"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
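
When multiple metrics are configured, the HPA controller computes a desired replica count for each and applies the highest, so a single saturated resource is enough to trigger a scale-up. Note that this is a raw autoscaling/v2 metrics stanza; apply it to a manually managed HorizontalPodAutoscaler if the chart's values do not expose these fields.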

Verify HPA Configuration

Check HPA Status

$ kubectl get hpa -n smallest

Expected output:

NAME            REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
lightning-asr   Deployment/lightning-asr   3/5       1         10        2          5m
api-server      Deployment/api-server      2/4       1         10        1          5m

Describe HPA

$ kubectl describe hpa lightning-asr -n smallest

Look for:

Metrics:
  "asr_active_requests" on pods:
    Current: 3
    Target:  5 (average)
Events:
  Normal  SuccessfulRescale  1m  horizontal-pod-autoscaler  New size: 2; reason: pods metric asr_active_requests above target

Check Custom Metrics

Verify prometheus-adapter is providing metrics:

$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

The output should include asr_active_requests.

Query specific metric:

$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests" | jq .
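
A healthy response resembles the following (pod names and values will vary):

{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "smallest",
        "name": "lightning-asr-xxxxx"
      },
      "metricName": "asr_active_requests",
      "value": "3"
    }
  ]
}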

Testing HPA

Load Testing

Generate load to trigger scaling:

$ for i in {1..100}; do
>   curl -X POST http://api-server.smallest.svc.cluster.local:7100/v1/listen \
>     -H "Authorization: Token ${LICENSE_KEY}" \
>     -H "Content-Type: application/json" \
>     -d '{"url": "https://example.com/test-audio.wav"}' &
> done

Watch scaling in action:

$ kubectl get hpa -n smallest -w

Monitor Pod Count

In another terminal:

$ watch -n 2 kubectl get pods -l app=lightning-asr -n smallest

You should see:

  1. Active requests increase
  2. HPA detects load above target
  3. New pods created
  4. Load distributed across pods
  5. After load decreases, pods scale down (after stabilization window)

Scaling Scenarios

Scenario 1: Traffic Spike

Situation: Sudden increase in requests

HPA Response:

  1. Detects asr_active_requests > 5 per pod
  2. Immediately scales up (stabilization: 0s)
  3. Adds pods based on policy (2 pods or 100%, whichever is higher)
  4. Repeats every 15 seconds until load is distributed

Configuration:

scaleUpStabilizationWindowSeconds: 0
behavior:
  scaleUp:
    policies:
      - type: Percent
        value: 100
      - type: Pods
        value: 2
    selectPolicy: Max

Scenario 2: Gradual Traffic Decline

Situation: Traffic decreases after peak hours

HPA Response:

  1. Detects asr_active_requests < 5 per pod
  2. Waits 300 seconds (5 minutes) before scaling down
  3. Gradually removes pods (1 pod or 50%, whichever is lower)
  4. Prevents premature scale-down

Configuration:

scaleDownStabilizationWindowSeconds: 300
behavior:
  scaleDown:
    policies:
      - type: Percent
        value: 50
      - type: Pods
        value: 1
    selectPolicy: Min

Scenario 3: Off-Hours

Situation: No traffic during night

HPA Response:

  1. Scales down to minReplicas: 1
  2. Keeps one pod ready for incoming requests
  3. Scales up immediately when traffic resumes

Configuration:

minReplicas: 1
maxReplicas: 10

For complete cost savings during off-hours, use Cluster Autoscaler to scale nodes to zero.

Troubleshooting

HPA Shows “Unknown”

Symptom:

NAME            TARGETS       MINPODS   MAXPODS
lightning-asr   <unknown>/5   1         10

Diagnosis:

Check prometheus-adapter logs:

$ kubectl logs -n kube-system -l app.kubernetes.io/name=prometheus-adapter

Check ServiceMonitor:

$ kubectl get servicemonitor -n smallest
$ kubectl describe servicemonitor lightning-asr -n smallest

Check Prometheus is scraping:

$ kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090

Open http://localhost:9090 and query: asr_active_requests

Solutions:

  1. Ensure ServiceMonitor is created
  2. Verify Prometheus is scraping Lightning ASR pods
  3. Check prometheus-adapter configuration

HPA Not Scaling

Symptom: Metrics show high load but pods not increasing

Check:

$ kubectl describe hpa lightning-asr -n smallest

Look for events explaining why scaling didn’t occur:

Events:
  Warning  FailedGetPodsMetric  1m  horizontal-pod-autoscaler  unable to get metric asr_active_requests

Common causes:

  • Metrics not available (see above)
  • Already at maxReplicas
  • Insufficient cluster resources
  • Stabilization window preventing scale-up

Pods Scaling Too Aggressively

Symptom: Pods constantly scaling up and down

Solution: Increase stabilization windows:

scaleUpStabilizationWindowSeconds: 30
scaleDownStabilizationWindowSeconds: 600

Scale-Down Too Slow

Symptom: Pods remain after traffic drops

Solution: Reduce scale-down stabilization:

scaleDownStabilizationWindowSeconds: 120

Be careful: scaling down too aggressively causes flapping.

Best Practices

Choose targetActiveRequests based on your model performance:

  • Larger models (slower inference): Lower target (e.g., 3)
  • Smaller models (faster inference): Higher target (e.g., 10)

Test with load to find the optimal value.

Scale up quickly, scale down slowly:

scaleUpStabilizationWindowSeconds: 0
scaleDownStabilizationWindowSeconds: 300

Prevents request failures during traffic fluctuations.

Consider cluster capacity when setting maxReplicas:

maxReplicas: 10   # if the cluster has 10 GPU nodes

Don’t set it higher than your available GPU capacity.

Use Grafana to visualize:

  • Current vs target metrics
  • Pod count over time
  • Scale-up/down events

See Grafana Dashboards

Regularly load test to verify HPA behavior:

$ kubectl run load-test --image=williamyeh/hey -it --rm -- \
>   -z 5m -c 50 http://api-server:7100/health
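
Here -z 5m sets the test duration and -c 50 the number of concurrent workers. Note that /health traffic only exercises the API Server path; to drive asr_active_requests, point the test at an inference endpoint such as the /v1/listen route used in the load-generation loop above.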

What’s Next?