Metrics Setup


Overview

The metrics setup enables autoscaling by collecting Lightning ASR metrics with Prometheus and exposing them to Kubernetes HPA through the Prometheus Adapter.

Architecture

Lightning ASR pods expose metrics on `/metrics`. A ServiceMonitor tells Prometheus which services to scrape, Prometheus collects and stores the metrics, and the Prometheus Adapter exposes them through the Kubernetes custom metrics API, where the HPA consumes them.

Components

Prometheus

Collects and stores metrics from Lightning ASR pods.

Included in chart:

values.yaml

```yaml
scaling:
  auto:
    enabled: true

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      serviceMonitorSelectorNilUsesHelmValues: false
      retention: 7d
      resources:
        requests:
          memory: 2Gi
```

ServiceMonitor

CRD that tells Prometheus which services to scrape.

Enabled for Lightning ASR:

values.yaml

```yaml
scaling:
  auto:
    lightningAsr:
      servicemonitor:
        enabled: true
```

Prometheus Adapter

Converts Prometheus metrics to Kubernetes custom metrics API.

Configuration:

values.yaml

```yaml
prometheus-adapter:
  prometheus:
    url: http://smallest-prometheus-stack-prometheus.default.svc
    port: 9090
  rules:
    custom:
      - seriesQuery: "asr_active_requests"
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)$"
          as: "${1}"
        metricsQuery: "asr_active_requests{<<.LabelMatchers>>}"
```

Available Metrics

Lightning ASR exposes the following metrics:

| Metric | Type | Description |
| --- | --- | --- |
| `asr_active_requests` | Gauge | Current number of active transcription requests |
| `asr_total_requests` | Counter | Total requests processed |
| `asr_failed_requests` | Counter | Total failed requests |
| `asr_request_duration_seconds` | Histogram | Request processing time |
| `asr_model_load_time_seconds` | Gauge | Time to load the model on startup |
| `asr_gpu_utilization` | Gauge | GPU utilization percentage |
| `asr_gpu_memory_used_bytes` | Gauge | GPU memory used |
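These metrics are served in the Prometheus text exposition format. As a quick illustration of what a scrape sees, here is a minimal Python sketch that pulls values out of simple gauge/counter lines — the sample payload is made up, and a real consumer should use a Prometheus client library rather than this parser:

```python
# Illustrative payload only; real pod names and values will differ.
SAMPLE = """\
# HELP asr_active_requests Current active requests
# TYPE asr_active_requests gauge
asr_active_requests{pod="lightning-asr-abc"} 3
# HELP asr_total_requests Total requests processed
# TYPE asr_total_requests counter
asr_total_requests{pod="lightning-asr-abc"} 1523
"""

def parse_metrics(text):
    """Return {metric_name: value} for single-sample metric lines."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop the label set
        values[name] = float(value)
    return values

metrics = parse_metrics(SAMPLE)
print(metrics["asr_active_requests"])  # 3.0
```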

Verify Metrics Setup

Check Prometheus

Forward Prometheus port:

```shell
$ kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090
```

Open http://localhost:9090 and verify:

  1. Status → Targets: Lightning ASR endpoints should be “UP”
  2. Graph: query `asr_active_requests`; it should return data
  3. Status → Service Discovery: should list the ServiceMonitor

Check ServiceMonitor

```shell
$ kubectl get servicemonitor -n smallest
```

Expected output:

```
NAME            AGE
lightning-asr   5m
```

Describe ServiceMonitor:

```shell
$ kubectl describe servicemonitor lightning-asr -n smallest
```

Should show:

```
Spec:
  Endpoints:
    Port: metrics
    Path: /metrics
  Selector:
    Match Labels:
      app: lightning-asr
```

Check Prometheus Adapter

Verify custom metrics are available:

```shell
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name' | grep asr
```

Expected output:

```
pods/asr_active_requests
pods/asr_total_requests
pods/asr_failed_requests
```

Query specific metric:

```shell
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests" | jq .
```

Custom Metric Configuration

Add New Custom Metrics

To expose additional metrics to HPA:

values.yaml

```yaml
prometheus-adapter:
  rules:
    custom:
      - seriesQuery: "asr_active_requests"
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)$"
          as: "${1}"
        metricsQuery: "asr_active_requests{<<.LabelMatchers>>}"

      - seriesQuery: "asr_gpu_utilization"
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          as: "gpu_utilization"
        metricsQuery: "avg_over_time(asr_gpu_utilization{<<.LabelMatchers>>}[2m])"
```
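Once a rule exposes a pods metric, the HPA can target it directly. A minimal sketch using the `autoscaling/v2` Pods metric type — the `averageValue` threshold here is an illustrative assumption, not a recommended setting:

```yaml
spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: asr_active_requests
        target:
          type: AverageValue
          averageValue: "10"
```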

External Metrics

For cluster-wide metrics:

values.yaml

```yaml
prometheus-adapter:
  rules:
    external:
      - seriesQuery: 'kube_deployment_status_replicas{deployment="lightning-asr"}'
        metricsQuery: 'sum(kube_deployment_status_replicas{deployment="lightning-asr"})'
        name:
          as: "lightning_asr_replica_count"
        resources:
          overrides:
            namespace: {resource: "namespace"}
```

Use in HPA:

```yaml
spec:
  metrics:
    - type: External
      external:
        metric:
          name: lightning_asr_replica_count
        target:
          type: Value
          value: "5"
```

Prometheus Configuration

Retention Policy

Configure how long metrics are stored:

values.yaml

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      retention: 15d
      retentionSize: "50GB"
```

Storage

Persist Prometheus data:

values.yaml

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: gp3
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi
```

Scrape Interval

Adjust how frequently metrics are collected:

values.yaml

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      scrapeInterval: 30s
      evaluationInterval: 30s
```

Lower intervals (e.g., 15s) provide faster HPA response but increase storage.
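The tradeoff can be estimated with quick arithmetic. A sketch in Python — the series count and bytes-per-sample figures are illustrative assumptions, not measurements of this deployment:

```python
# Back-of-the-envelope estimate of how scrape interval affects sample volume.
def samples_per_day(series: int, scrape_interval_s: int) -> int:
    """Number of samples ingested per day across all series."""
    return series * (24 * 3600 // scrape_interval_s)

SERIES = 1000         # assumed active series across ASR pods
BYTES_PER_SAMPLE = 2  # rough TSDB average after compression

for interval in (15, 30, 60):
    n = samples_per_day(SERIES, interval)
    mb = n * BYTES_PER_SAMPLE / 1e6
    print(f"{interval}s scrape: {n:,} samples/day ≈ {mb:.0f} MB/day")
```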

Recording Rules

Pre-compute expensive queries:

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalScrapeConfigs:
        - job_name: 'lightning-asr-aggregated'
          scrape_interval: 15s
          static_configs:
            - targets: ['lightning-asr:2269']

  additionalPrometheusRulesMap:
    asr-rules:
      groups:
        - name: asr_aggregations
          interval: 30s
          rules:
            - record: asr:requests:rate5m
              expr: rate(asr_total_requests[5m])

            - record: asr:requests:active_avg
              expr: avg(asr_active_requests) by (namespace)

            - record: asr:gpu:utilization_avg
              expr: avg(asr_gpu_utilization) by (namespace)
```

Note that `additionalPrometheusRulesMap` is a chart-level value in kube-prometheus-stack, not part of `prometheusSpec`.

Use recording rules in HPA for better performance.
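Recorded series names contain colons, so it is common to rename them to an HPA-friendly name via an adapter rule. A sketch — the renamed metric name is an assumption:

```yaml
prometheus-adapter:
  rules:
    custom:
      - seriesQuery: "asr:requests:rate5m"
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "asr:requests:rate5m"
          as: "asr_requests_rate5m"
        metricsQuery: "asr:requests:rate5m{<<.LabelMatchers>>}"
```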

Alerting Rules

Create alerts for anomalies:

```yaml
kube-prometheus-stack:
  additionalPrometheusRulesMap:
    asr-alerts:
      groups:
        - name: asr_alerts
          rules:
            - alert: HighErrorRate
              expr: rate(asr_failed_requests[5m]) > 0.1
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High ASR error rate"
                description: "Error rate is {{ $value }} errors/sec"

            - alert: HighQueueLength
              expr: asr_active_requests > 50
              for: 2m
              labels:
                severity: warning
              annotations:
                summary: "ASR queue backing up"
                description: "{{ $value }} requests queued"

            - alert: GPUMemoryHigh
              expr: asr_gpu_memory_used_bytes / 24000000000 > 0.9  # assumes a 24 GB GPU
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "GPU memory usage high"
                description: "GPU memory at {{ $value | humanizePercentage }}"
```

Debugging Metrics

Check Metrics Endpoint

Directly query Lightning ASR metrics:

```shell
$ kubectl port-forward -n smallest svc/lightning-asr 2269:2269
$ curl http://localhost:2269/metrics
```

Expected output:

```
# HELP asr_active_requests Current active requests
# TYPE asr_active_requests gauge
asr_active_requests{pod="lightning-asr-xxx"} 3
# HELP asr_total_requests Total requests processed
# TYPE asr_total_requests counter
asr_total_requests{pod="lightning-asr-xxx"} 1523
...
```

Test Prometheus Query

Access Prometheus UI and test queries:

```
asr_active_requests
rate(asr_total_requests[5m])
histogram_quantile(0.95, rate(asr_request_duration_seconds_bucket[5m]))
```

Check Prometheus Targets

```shell
$ kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090
```

Navigate to: http://localhost:9090/targets

Verify that the Lightning ASR targets are “UP”.

View Prometheus Logs

```shell
$ kubectl logs -n default -l app.kubernetes.io/name=prometheus --tail=100
```

Look for scrape errors.

Troubleshooting

Metrics Not Appearing

Check ServiceMonitor is created:

```shell
$ kubectl get servicemonitor -n smallest
```

Check Prometheus is discovering:

```shell
$ kubectl logs -n default -l app.kubernetes.io/name=prometheus | grep lightning-asr
```

Check service has metrics port:

```shell
$ kubectl get svc lightning-asr -n smallest -o yaml
```

Should show:

```yaml
ports:
  - name: metrics
    port: 2269
```

Custom Metrics Not Available

Check Prometheus Adapter logs:

```shell
$ kubectl logs -n kube-system -l app.kubernetes.io/name=prometheus-adapter
```

Verify adapter configuration:

```shell
$ kubectl get configmap prometheus-adapter -n kube-system -o yaml
```

Test API manually:

```shell
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
```

High Cardinality Issues

If Prometheus is using too much memory:

  1. Reduce label cardinality
  2. Increase Prometheus memory limits
  3. Use recording rules for complex queries
```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 8Gi
```

Best Practices

Pre-compute expensive queries:

```yaml
- record: asr:requests:rate5m
  expr: rate(asr_total_requests[5m])
```

Then reference the recorded series in the HPA instead of the raw query.

Balance responsiveness vs storage:

  • Fast autoscaling: 15s
  • Normal: 30s
  • Cost-optimized: 60s

Always persist Prometheus data:

```yaml
storageSpec:
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 100Gi
```

Track Prometheus performance:

  • Query duration
  • Scrape duration
  • Memory usage
  • TSDB size

Don’t rely on the Prometheus UI for day-to-day operations; use Grafana dashboards instead. See Grafana Dashboards.

What’s Next?