---
title: Quick Start
description: Deploy Smallest Self-Host on Kubernetes with Helm
---

<Warning>
  Kubernetes deployment is currently available for **ASR (Speech-to-Text)** only. For TTS deployments, use [Docker](/waves/self-host/docker-setup/tts-deployment/quick-start).
</Warning>

<Note>
  Ensure you've completed all [prerequisites](/waves/self-host/kubernetes-setup/prerequisites/hardware-requirements) before starting.
</Note>

## Add Helm Repository

```bash
helm repo add smallest-self-host https://smallest-inc.github.io/smallest-self-host
helm repo update
```

## Create Namespace

```bash
kubectl create namespace smallest
kubectl config set-context --current --namespace=smallest
```

## Configure Values

Create a `values.yaml` file:

```yaml values.yaml
global:
  licenseKey: "your-license-key-here"
  imageCredentials:
    create: true
    registry: quay.io
    username: "your-registry-username"
    password: "your-registry-password"
    email: "your-email@example.com"

models:
  asrModelUrl: "your-model-url-here"

scaling:
  replicas:
    lightningAsr: 1
    licenseProxy: 1

lightningAsr:
  nodeSelector: {} # e.g. pin pods to GPU nodes
  tolerations: []  # e.g. tolerate GPU node taints

redis:
  enabled: true
  auth:
    enabled: true
```

<Warning>
  Replace placeholder values with credentials provided by Smallest.ai support.
</Warning>
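If you'd rather not store the license key or registry password in `values.yaml`, Helm can override individual values at install time. A sketch, using the same value paths as the file above:

```shell
# Sketch: keep secrets out of values.yaml by overriding them on the
# command line. The --set paths mirror the keys shown above.
helm install smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml \
  --namespace smallest \
  --set global.licenseKey="your-license-key-here" \
  --set global.imageCredentials.password="your-registry-password"
```

Values passed with `--set` take precedence over those in `-f values.yaml`.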

## Install

```bash
helm install smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml \
  --namespace smallest
```

Monitor the deployment:

```bash
kubectl get pods -w
```

<table>
  <thead>
    <tr>
      <th>
        Component
      </th>

      <th>
        Startup Time
      </th>

      <th>
        Ready Indicator
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Redis
      </td>

      <td>
        \~30s
      </td>

      <td>
        <code>1/1 Running</code>
      </td>
    </tr>

    <tr>
      <td>
        License Proxy
      </td>

      <td>
        \~1m
      </td>

      <td>
        <code>1/1 Running</code>
      </td>
    </tr>

    <tr>
      <td>
        Lightning ASR
      </td>

      <td>
        2-10m
      </td>

      <td>
        <code>1/1 Running</code>

         (model download on first run)
      </td>
    </tr>

    <tr>
      <td>
        API Server
      </td>

      <td>
        \~30s
      </td>

      <td>
        <code>1/1 Running</code>
      </td>
    </tr>
  </tbody>
</table>

<Tip>
  Model downloads are cached when using shared storage (EFS). Subsequent starts complete in under a minute.
</Tip>
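Instead of watching pods manually, you can block until everything reports Ready. A sketch, with the timeout sized for a first-run model download:

```shell
# Optional: block until all pods in the namespace report Ready.
# The 10m timeout allows for the first-run model download.
kubectl wait --for=condition=Ready pods --all \
  --namespace smallest --timeout=10m
```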

## Verify Installation

```bash
kubectl get pods,svc
```

All pods should show `Running` status with the following services available:

<table>
  <thead>
    <tr>
      <th>
        Service
      </th>

      <th>
        Port
      </th>

      <th>
        Description
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        api-server
      </td>

      <td>
        7100
      </td>

      <td>
        REST API endpoint
      </td>
    </tr>

    <tr>
      <td>
        lightning-asr-internal
      </td>

      <td>
        2269
      </td>

      <td>
        ASR inference service
      </td>
    </tr>

    <tr>
      <td>
        license-proxy
      </td>

      <td>
        3369
      </td>

      <td>
        License validation
      </td>
    </tr>

    <tr>
      <td>
        redis-master
      </td>

      <td>
        6379
      </td>

      <td>
        Request queue
      </td>
    </tr>
  </tbody>
</table>

## Test the API

Port forward and send a health check:

```bash
kubectl port-forward svc/api-server 7100:7100
```

```bash
curl http://localhost:7100/health
```
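Since `kubectl port-forward` blocks the terminal, you can also run it in the background and clean up afterwards. A sketch (the `/health` response body depends on the API server and is not shown here):

```shell
# Sketch: background the port-forward, probe the health endpoint,
# then tear the tunnel down.
kubectl port-forward svc/api-server 7100:7100 -n smallest &
PF_PID=$!
sleep 2   # give the tunnel a moment to establish
curl -fsS http://localhost:7100/health
kill $PF_PID
```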

## Autoscaling

Enable automatic scaling based on real-time inference load:

```yaml values.yaml
scaling:
  auto:
    enabled: true
```

This deploys HorizontalPodAutoscalers that scale based on active requests:

<table>
  <thead>
    <tr>
      <th>
        Component
      </th>

      <th>
        Metric
      </th>

      <th>
        Default Target
      </th>

      <th>
        Behavior
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Lightning ASR
      </td>

      <td>
        <code>asr_active_requests</code>
      </td>

      <td>
        4 per pod
      </td>

      <td>
        Scales GPU workers based on inference queue depth
      </td>
    </tr>

    <tr>
      <td>
        API Server
      </td>

      <td>
        <code>lightning_asr_replica_count</code>
      </td>

      <td>
        2:1 ratio
      </td>

      <td>
        Maintains API capacity proportional to ASR workers
      </td>
    </tr>
  </tbody>
</table>
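Both autoscalers follow the standard HPA averaging rule: desired replicas = ceil(total metric value / per-pod target). A small sketch with illustrative numbers:

```shell
# Sketch of the HPA sizing rule: desired = ceil(current_total / target_per_pod).
# Numbers below are illustrative, not measured values.
hpa_desired() {
  local current=$1 target=$2
  echo $(( (current + target - 1) / target ))   # integer ceil division
}

hpa_desired 9 4   # 9 active ASR requests, target 4 per pod -> 3 ASR pods
hpa_desired 3 2   # 3 ASR replicas at the 2:1 ratio -> 2 API server pods
```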

### How It Works

1. **Lightning ASR** exposes `asr_active_requests` metric on port 9090
2. **Prometheus** scrapes this metric via ServiceMonitor
3. **Prometheus Adapter** makes it available to the Kubernetes metrics API
4. **HPA** scales pods when average requests per pod exceeds target
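The steps above can be spot-checked by querying the custom metrics API directly. A sketch (the exact resource path can vary with your Prometheus Adapter configuration):

```shell
# Sketch: confirm asr_active_requests is visible to the Kubernetes
# custom metrics API, which is what the HPA reads from.
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests"
```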

### Configuration

```yaml values.yaml
scaling:
  auto:
    enabled: true
    lightningAsr:
      hpa:
        minReplicas: 1
        maxReplicas: 10
        targetActiveRequests: 4
```

### Verify Autoscaling

```bash
kubectl get hpa
```

```
NAME            REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
lightning-asr   Deployment/lightning-asr   0/4       1         10        1
api-server      Deployment/api-server      1/2       1         10        1
```

The `TARGETS` column shows `current/target`. When current exceeds target, pods scale up.

<Note>
  Autoscaling requires the Prometheus stack. It's included as a dependency and enabled by default.
</Note>

## Helm Operations

<CodeGroup>
  ```bash Upgrade
  helm upgrade smallest-self-host smallest-self-host/smallest-self-host \
    -f values.yaml -n smallest
  ```

  ```bash Rollback
  helm rollback smallest-self-host -n smallest
  ```

  ```bash Uninstall
  helm uninstall smallest-self-host -n smallest
  ```

  ```bash View Config
  helm get values smallest-self-host -n smallest
  ```
</CodeGroup>
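Before rolling back, it helps to see which revision you're returning to. A sketch (revision `2` is an example, not a real value):

```shell
# Sketch: list release revisions, then roll back to a specific one.
helm history smallest-self-host -n smallest
helm rollback smallest-self-host 2 -n smallest   # 2 = example revision number
```

Running `helm rollback` without a revision number reverts to the previous release.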

## Troubleshooting

<table>
  <thead>
    <tr>
      <th>
        Issue
      </th>

      <th>
        Cause
      </th>

      <th>
        Resolution
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Pods 

        <code>Pending</code>
      </td>

      <td>
        Insufficient resources or missing GPU nodes
      </td>

      <td>
        Check 

<code>kubectl describe pod &lt;name&gt;</code>

         for scheduling errors
      </td>
    </tr>

    <tr>
      <td>
        <code>ImagePullBackOff</code>
      </td>

      <td>
        Invalid registry credentials
      </td>

      <td>
        Verify 

        <code>imageCredentials</code>

         in values.yaml
      </td>
    </tr>

    <tr>
      <td>
        <code>CrashLoopBackOff</code>
      </td>

      <td>
        Invalid license or insufficient memory
      </td>

      <td>
        Check logs with 

<code>kubectl logs &lt;pod&gt; --previous</code>
      </td>
    </tr>

    <tr>
      <td>
        Slow model download
      </td>

      <td>
        Large model size (~20GB)
      </td>

      <td>
        Use shared storage (EFS) for caching
      </td>
    </tr>
  </tbody>
</table>
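A quick triage pass covering the cases above might look like this. The deployment name and pod label are assumptions; adjust them to match what the chart actually deploys in your cluster:

```shell
# Sketch: first-pass diagnostics for a failing deployment.
kubectl get events -n smallest --sort-by=.lastTimestamp   # recent scheduling/pull errors
kubectl describe deployment lightning-asr -n smallest     # assumes deployment name "lightning-asr"
kubectl logs -l app=lightning-asr -n smallest --tail=100  # assumes an app=lightning-asr label
```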

For detailed troubleshooting, see [Troubleshooting Guide](/waves/self-host/kubernetes-setup/k8s-troubleshooting).

## Next Steps

<CardGroup cols={2}>
  <Card title="AWS Setup" href="/waves/self-host/kubernetes-setup/aws/eks-setup">
    EKS-specific configuration
  </Card>

  <Card title="Model Storage" href="/waves/self-host/kubernetes-setup/storage-pvc/model-storage">
    Shared storage for faster cold starts
  </Card>

  <Card title="Advanced Autoscaling" href="/waves/self-host/kubernetes-setup/autoscaling/hpa-configuration">
    Fine-tune scaling behavior and thresholds
  </Card>

  <Card title="Monitoring" href="/waves/self-host/kubernetes-setup/autoscaling/grafana-dashboards">
    Grafana dashboards and alerting
  </Card>
</CardGroup>