Quick Start


Kubernetes deployment is currently available for ASR (Speech-to-Text) only. For TTS deployments, use Docker.

Ensure you’ve completed all prerequisites before starting.

Add Helm Repository

```shell
$ helm repo add smallest-self-host https://smallest-inc.github.io/smallest-self-host
$ helm repo update
```

Create Namespace

```shell
$ kubectl create namespace smallest
$ kubectl config set-context --current --namespace=smallest
```

Configure Values

Create a values.yaml file:

values.yaml
```yaml
global:
  licenseKey: "your-license-key-here"
  imageCredentials:
    create: true
    registry: quay.io
    username: "your-registry-username"
    password: "your-registry-password"
    email: "your-email@example.com"

models:
  asrModelUrl: "your-model-url-here"

scaling:
  replicas:
    lightningAsr: 1
    licenseProxy: 1

lightningAsr:
  nodeSelector:
  tolerations:

redis:
  enabled: true
  auth:
    enabled: true
```

Replace placeholder values with credentials provided by Smallest.ai support.
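Before installing, it can help to confirm no placeholder values remain in the file. A minimal sketch with a hypothetical helper `unfilled_placeholders` (the patterns match the placeholders shown above; adjust them if your template differs):

```python
import re

# Placeholder tokens used in the sample values.yaml above (an assumption;
# extend the pattern if your template uses different markers).
PLACEHOLDER = re.compile(r"your-[a-z-]+-here|your-email@example\.com|your-registry-\w+")

def unfilled_placeholders(values_text: str) -> list[str]:
    """Return any placeholder tokens still present in a values.yaml string."""
    return sorted(set(PLACEHOLDER.findall(values_text)))

with_placeholder = 'licenseKey: "your-license-key-here"'
print(unfilled_placeholders(with_placeholder))  # ['your-license-key-here']
```

An empty result means every sample placeholder has been replaced; it does not validate that the credentials themselves are correct.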

Install

```shell
$ helm install smallest-self-host smallest-self-host/smallest-self-host \
    -f values.yaml \
    --namespace smallest
```

Monitor the deployment:

```shell
$ kubectl get pods -w
```

| Component | Startup Time | Ready Indicator |
|---|---|---|
| Redis | ~30s | 1/1 Running |
| License Proxy | ~1m | 1/1 Running |
| Lightning ASR | 2-10m | 1/1 Running (model download on first run) |
| API Server | ~30s | 1/1 Running |

Model downloads are cached when using shared storage (EFS). Subsequent starts complete in under a minute.

Verify Installation

```shell
$ kubectl get pods,svc
```

All pods should show Running status with the following services available:

| Service | Port | Description |
|---|---|---|
| api-server | 7100 | REST API endpoint |
| lightning-asr-internal | 2269 | ASR inference service |
| license-proxy | 3369 | License validation |
| redis-master | 6379 | Request queue |

Test the API

Port forward and send a health check:

```shell
$ kubectl port-forward svc/api-server 7100:7100
```

Then, in a second terminal (the port-forward command stays in the foreground):

```shell
$ curl http://localhost:7100/health
```
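If you are scripting the check, a small readiness poll can replace the one-off curl. A minimal sketch using only the standard library, assuming `/health` returns HTTP 200 once the stack is up (the URL matches the port-forward above; the timeout and interval defaults are arbitrary):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str = "http://localhost:7100/health",
                    timeout: float = 60.0,
                    interval: float = 2.0) -> bool:
    """Poll the health endpoint until it returns HTTP 200 or the timeout expires.

    Returns True once the endpoint responds with 200, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Service not reachable yet; retry after a short pause.
            pass
        time.sleep(interval)
    return False
```

This is useful in CI or post-deploy hooks, where the first request may arrive before the Lightning ASR pod has finished its model download.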

Autoscaling

Enable automatic scaling based on real-time inference load:

values.yaml
```yaml
scaling:
  auto:
    enabled: true
```

This deploys HorizontalPodAutoscalers that scale based on active requests:

| Component | Metric | Default Target | Behavior |
|---|---|---|---|
| Lightning ASR | asr_active_requests | 4 per pod | Scales GPU workers based on inference queue depth |
| API Server | lightning_asr_replica_count | 2:1 ratio | Maintains API capacity proportional to ASR workers |

How It Works

  1. Lightning ASR exposes the `asr_active_requests` metric on port 9090
  2. Prometheus scrapes this metric via a ServiceMonitor
  3. Prometheus Adapter exposes it through the Kubernetes custom metrics API
  4. The HPA scales pods when the average requests per pod exceeds the target
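The last step follows the standard Kubernetes HPA formula: desired replicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to the configured bounds. A worked sketch (the min/max defaults mirror the configuration shown in this guide, not values read from the cluster):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         avg_metric_per_pod: float,
                         target_per_pod: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """Kubernetes HPA core scaling rule:
    desired = ceil(currentReplicas * currentMetric / target),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * (avg_metric_per_pod / target_per_pod))
    return max(min_replicas, min(max_replicas, desired))

# 3 ASR pods each averaging 8 active requests, target 4 per pod:
# ceil(3 * 8 / 4) = 6, so the HPA scales to 6 pods.
print(hpa_desired_replicas(3, 8, 4))  # 6
```

The real controller adds tolerances and stabilization windows on top of this rule, so small fluctuations around the target do not trigger scaling.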

Configuration

values.yaml
```yaml
scaling:
  auto:
    enabled: true
    lightningAsr:
      hpa:
        minReplicas: 1
        maxReplicas: 10
        targetActiveRequests: 4
```

Verify Autoscaling

```shell
$ kubectl get hpa
NAME            REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
lightning-asr   Deployment/lightning-asr   0/4       1         10        1
api-server      Deployment/api-server      1/2       1         10        1
```

The TARGETS column shows current/target. When current exceeds target, pods scale up.

Autoscaling requires the Prometheus stack. It’s included as a dependency and enabled by default.

Helm Operations

To apply configuration changes, upgrade the release with the updated values file:

```shell
$ helm upgrade smallest-self-host smallest-self-host/smallest-self-host \
    -f values.yaml -n smallest
```

Troubleshooting

| Issue | Cause | Resolution |
|---|---|---|
| Pods Pending | Insufficient resources or missing GPU nodes | Check `kubectl describe pod <name>` for scheduling errors |
| ImagePullBackOff | Invalid registry credentials | Verify `imageCredentials` in values.yaml |
| CrashLoopBackOff | Invalid license or insufficient memory | Check logs with `kubectl logs <pod> --previous` |
| Slow model download | Large model size (~20GB) | Use shared storage (EFS) for caching |

For detailed troubleshooting, see the Troubleshooting Guide.

Next Steps