Quick Start


Kubernetes deployment is currently available for ASR (Speech-to-Text) only. For TTS deployments, use Docker.

Ensure you’ve completed all prerequisites before starting.

Add Helm Repository

```shell
$ helm repo add smallest-self-host https://smallest-inc.github.io/smallest-self-host
$ helm repo update
```

Create Namespace

```shell
$ kubectl create namespace smallest
$ kubectl config set-context --current --namespace=smallest
```

Configure Values

Create a values.yaml file:

values.yaml
```yaml
global:
  licenseKey: "your-license-key-here"
  imageCredentials:
    create: true
    registry: quay.io
    username: "your-registry-username"
    password: "your-registry-password"
    email: "your-email@example.com"

models:
  asrModelUrl: "your-model-url-here"

scaling:
  replicas:
    lightningAsr: 1
    licenseProxy: 1

lightningAsr:
  nodeSelector:
  tolerations:

redis:
  enabled: true
  auth:
    enabled: true
```

Replace placeholder values with credentials provided by Smallest.ai support.
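Before installing, it can help to confirm no placeholder values remain in the file. A minimal sketch with a hypothetical helper `unfilled_placeholders` (the patterns match the placeholders shown above; adjust them if your template differs):

```python
import re

# Placeholder tokens used in the sample values.yaml above (an assumption;
# extend the pattern if your template uses different markers).
PLACEHOLDER = re.compile(r"your-[a-z-]+-here|your-email@example\.com|your-registry-\w+")

def unfilled_placeholders(values_text: str) -> list[str]:
    """Return any placeholder tokens still present in a values.yaml string."""
    return sorted(set(PLACEHOLDER.findall(values_text)))

with_placeholder = 'licenseKey: "your-license-key-here"'
print(unfilled_placeholders(with_placeholder))  # ['your-license-key-here']
```

An empty result means every sample placeholder has been replaced; it does not validate that the credentials themselves are correct.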

Install

```shell
$ helm install smallest-self-host smallest-self-host/smallest-self-host \
    -f values.yaml \
    --namespace smallest
```

Monitor the deployment:

```shell
$ kubectl get pods -w
```

| Component | Startup Time | Ready Indicator |
|---|---|---|
| Redis | ~30s | 1/1 Running |
| License Proxy | ~1m | 1/1 Running |
| Lightning ASR | 2-10m | 1/1 Running (model download on first run) |
| API Server | ~30s | 1/1 Running |

Model downloads are cached when using shared storage (EFS). Subsequent starts complete in under a minute.

Verify Installation

```shell
$ kubectl get pods,svc
```

All pods should show Running status with the following services available:

| Service | Port | Description |
|---|---|---|
| api-server | 7100 | REST API endpoint |
| lightning-asr-internal | 2269 | ASR inference service |
| license-proxy | 3369 | License validation |
| redis-master | 6379 | Request queue |

Test the API

Port forward and send a health check:

```shell
$ kubectl port-forward svc/api-server 7100:7100
```

Then, in a second terminal (the port-forward command stays in the foreground):

```shell
$ curl http://localhost:7100/health
```
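If you are scripting the check, a small readiness poll can replace the one-off curl. A minimal sketch using only the standard library, assuming `/health` returns HTTP 200 once the stack is up (the URL matches the port-forward above; the timeout and interval defaults are arbitrary):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str = "http://localhost:7100/health",
                    timeout: float = 60.0,
                    interval: float = 2.0) -> bool:
    """Poll the health endpoint until it returns HTTP 200 or the timeout expires.

    Returns True once the endpoint responds with 200, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Service not reachable yet; retry after a short pause.
            pass
        time.sleep(interval)
    return False
```

This is useful in CI or post-deploy hooks, where the first request may arrive before the Lightning ASR pod has finished its model download.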

Autoscaling

Enable automatic scaling based on real-time inference load:

values.yaml
```yaml
scaling:
  auto:
    enabled: true
```

This deploys HorizontalPodAutoscalers that scale based on active requests:

| Component | Metric | Default Target | Behavior |
|---|---|---|---|
| Lightning ASR | asr_active_requests | 4 per pod | Scales GPU workers based on inference queue depth |
| API Server | lightning_asr_replica_count | 2:1 ratio | Maintains API capacity proportional to ASR workers |

How It Works

  1. Lightning ASR exposes the `asr_active_requests` metric on port 9090
  2. Prometheus scrapes this metric via a ServiceMonitor
  3. Prometheus Adapter exposes it through the Kubernetes custom metrics API
  4. The HPA scales pods when the average requests per pod exceeds the target
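The last step follows the standard Kubernetes HPA formula: desired replicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to the configured bounds. A worked sketch (the min/max defaults mirror the configuration shown in this guide, not values read from the cluster):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         avg_metric_per_pod: float,
                         target_per_pod: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """Kubernetes HPA core scaling rule:
    desired = ceil(currentReplicas * currentMetric / target),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * (avg_metric_per_pod / target_per_pod))
    return max(min_replicas, min(max_replicas, desired))

# 3 ASR pods each averaging 8 active requests, target 4 per pod:
# ceil(3 * 8 / 4) = 6, so the HPA scales to 6 pods.
print(hpa_desired_replicas(3, 8, 4))  # 6
```

The real controller adds tolerances and stabilization windows on top of this rule, so small fluctuations around the target do not trigger scaling.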

Configuration

values.yaml
```yaml
scaling:
  auto:
    enabled: true
    lightningAsr:
      hpa:
        minReplicas: 1
        maxReplicas: 10
        targetActiveRequests: 4
```

Verify Autoscaling

```shell
$ kubectl get hpa
NAME            REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
lightning-asr   Deployment/lightning-asr   0/4       1         10        1
api-server      Deployment/api-server      1/2       1         10        1
```

The TARGETS column shows current/target. When current exceeds target, pods scale up.

Autoscaling requires the Prometheus stack. It’s included as a dependency and enabled by default.

Helm Operations

To apply configuration changes, upgrade the release with the updated values file:

```shell
$ helm upgrade smallest-self-host smallest-self-host/smallest-self-host \
    -f values.yaml -n smallest
```

Troubleshooting

| Issue | Cause | Resolution |
|---|---|---|
| Pods Pending | Insufficient resources or missing GPU nodes | Check `kubectl describe pod <name>` for scheduling errors |
| ImagePullBackOff | Invalid registry credentials | Verify `imageCredentials` in values.yaml |
| CrashLoopBackOff | Invalid license or insufficient memory | Check logs with `kubectl logs <pod> --previous` |
| Slow model download | Large model size (~20GB) | Use shared storage (EFS) for caching |

For detailed troubleshooting, see the Troubleshooting Guide.

Next Steps