> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Services Overview

> Detailed breakdown of each service component in the STT Docker deployment

## Architecture

The Docker deployment consists of four main services that work together:

```mermaid
graph LR
    Client[Client] -->|HTTP/WebSocket| API[API Server :7100]
    API -->|gRPC| ASR[Lightning ASR :2233]
    API -->|HTTP| License[License Proxy :6699]
    ASR -->|HTTP| License
    ASR -->|Cache| Redis[Redis :6379]
    License -->|HTTPS| External[Smallest License Server]
    
    style API fill:#07C983
    style ASR fill:#0D9373
    style License fill:#1E90FF
    style Redis fill:#DC382D
```

## API Server

The API Server is the main entry point for all client requests.

### Purpose

* Routes incoming API requests to Lightning ASR workers
* Manages WebSocket connections for streaming
* Handles request queuing and load balancing
* Provides a unified API interface

### Container Details

**Image:** `quay.io/smallestinc/self-hosted-api-server:latest`

**Port:** `7100` - Main API endpoint

**Resources:**

* CPU: 0.5-2 cores
* Memory: 512 MB - 2 GB
* No GPU required

### Key Endpoints

<table>
  <thead>
    <tr>
      <th>
        Endpoint
      </th>

      <th>
        Method
      </th>

      <th>
        Purpose
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <code>/health</code>
      </td>

      <td>
        GET
      </td>

      <td>
        Health check
      </td>
    </tr>

    <tr>
      <td>
        <code>/v1/listen</code>
      </td>

      <td>
        POST
      </td>

      <td>
        Synchronous transcription
      </td>
    </tr>

    <tr>
      <td>
        <code>/v1/listen/stream</code>
      </td>

      <td>
        WebSocket
      </td>

      <td>
        Streaming transcription
      </td>
    </tr>
  </tbody>
</table>
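
As a rough illustration of how a client might probe the endpoints in the table, the sketch below builds endpoint URLs and checks `/health` using only the Python standard library. It assumes the API Server is reachable at `http://localhost:7100` and that a healthy server answers `GET /health` with HTTP 200. The response body format is not documented on this page, so the code ignores it. The helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: the API Server port 7100 is published on localhost.
API_BASE = "http://localhost:7100"


def endpoint_url(path: str, base: str = API_BASE, websocket: bool = False) -> str:
    """Join a base URL and an endpoint path.

    With websocket=True the scheme is switched to ws:// (or wss://),
    which is what a client would use for /v1/listen/stream.
    """
    if websocket:
        base = base.replace("http://", "ws://", 1).replace("https://", "wss://", 1)
    return base.rstrip("/") + path


def is_healthy(base: str = API_BASE, timeout: float = 3.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(endpoint_url("/health", base), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


# Example usage (requires a running deployment):
#   print(is_healthy())
#   print(endpoint_url("/v1/listen/stream", websocket=True))
```

The request body and headers for `/v1/listen` are not specified here, so the sketch deliberately stops at URL construction and the health probe.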

### Environment Variables

```yaml
LICENSE_KEY: Your license key
LIGHTNING_ASR_BASE_URL: Internal URL to Lightning ASR
API_BASE_URL: Internal URL to License Proxy
```

### Logs

Key log messages:

```
✓ Connected to Lightning ASR at http://lightning-asr:2233
✓ License validation successful
✓ API server listening on port 7100
```

### Dependencies

* Requires Lightning ASR to be running
* Requires License Proxy for validation
* Optionally uses Redis for request coordination

## Lightning ASR

The core speech recognition engine powered by GPU acceleration.

### Purpose

* Performs audio-to-text transcription
* Processes both batch and streaming requests
* Manages GPU resources and model inference
* Handles audio preprocessing and postprocessing

### Container Details

**Image:** `quay.io/smallestinc/lightning-asr:latest`

**Port:** `2233` - ASR service endpoint

**Resources:**

* CPU: 4-8 cores
* Memory: 12-16 GB
* **GPU: 1x NVIDIA GPU (16+ GB VRAM)**

### GPU Requirements

Lightning ASR requires an NVIDIA GPU with CUDA support:

<table>
  <thead>
    <tr>
      <th>
        GPU Model
      </th>

      <th>
        VRAM
      </th>

      <th>
        Performance
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        A100
      </td>

      <td>
        40-80 GB
      </td>

      <td>
        Excellent
      </td>
    </tr>

    <tr>
      <td>
        A10
      </td>

      <td>
        24 GB
      </td>

      <td>
        Excellent
      </td>
    </tr>

    <tr>
      <td>
        L4
      </td>

      <td>
        24 GB
      </td>

      <td>
        Very Good
      </td>
    </tr>

    <tr>
      <td>
        T4
      </td>

      <td>
        16 GB
      </td>

      <td>
        Good
      </td>
    </tr>
  </tbody>
</table>

### Environment Variables

```yaml
MODEL_URL: Download URL for ASR model
LICENSE_KEY: Your license key
REDIS_URL: Redis connection string
PORT: Service port (default 2233)
GPU_DEVICE_ID: GPU to use (for multi-GPU)
```

### Model Loading

On first startup, Lightning ASR:

1. Downloads the model from `MODEL_URL` (~20 GB)
2. Validates model integrity
3. Loads model into GPU memory
4. Performs warmup inference

Use persistent volumes to cache models and avoid re-downloading on container restart.

### Logs

Key log messages:

```
✓ GPU detected: NVIDIA A10 (24GB)
✓ Downloading model from URL...
✓ Model loaded successfully (5.2GB)
✓ Warmup completed in 3.2s
✓ Server ready on port 2233
```

### Performance

Typical performance metrics:

<table>
  <thead>
    <tr>
      <th>
        Metric
      </th>

      <th>
        Value
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Real-time Factor
      </td>

      <td>
        0.05-0.15x
      </td>
    </tr>

    <tr>
      <td>
        Cold Start
      </td>

      <td>
        30-60 seconds
      </td>
    </tr>

    <tr>
      <td>
        Warm Inference
      </td>

      <td>
        50-200ms latency
      </td>
    </tr>

    <tr>
      <td>
        Throughput
      </td>

      <td>
        100+ hours/hour (A10)
      </td>
    </tr>
  </tbody>
</table>
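
To make the real-time factor concrete, the sketch below turns the range from the table into an estimated processing time. It assumes the usual definition, RTF = processing time / audio duration, and covers warm inference only: cold start, queueing, and network transfer are not included. The function names are hypothetical.

```python
# Real-time factor range taken from the performance table.
RTF_RANGE = (0.05, 0.15)


def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to transcribe audio of the given length.

    Assumes RTF = processing time / audio duration.
    """
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def estimate_range(audio_seconds: float) -> tuple[float, float]:
    """Return (best case, worst case) processing time for the documented RTF range."""
    best, worst = (processing_seconds(audio_seconds, r) for r in RTF_RANGE)
    return best, worst


# One hour of audio works out to roughly 3 to 9 minutes of processing.
best, worst = estimate_range(3600)
print(f"{best / 60:.1f} to {worst / 60:.1f} minutes")
```

This is a per-request estimate. It does not model concurrent requests, so it should not be read as an aggregate throughput figure.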

### Dependencies

* Requires License Proxy for validation
* Requires Redis for request coordination
* Requires NVIDIA GPU

## License Proxy

Validates license keys and reports usage to Smallest servers.

### Purpose

* Validates license keys on startup
* Reports usage metadata to Smallest
* Provides grace period for offline operation
* Acts as licensing gateway for all services

### Container Details

**Image:** `quay.io/smallestinc/license-proxy:latest`

**Port:** `6699` - License validation endpoint (internal)

**Resources:**

* CPU: 0.25-1 core
* Memory: 256-512 MB
* No GPU required

### Environment Variables

```yaml
LICENSE_KEY: Your license key
```

### Network Requirements

License Proxy requires outbound HTTPS access to:

* `api.smallest.ai` on port 443

Ensure your firewall allows this connection.

### Validation Process

1. On startup, validates license key with Smallest servers
2. Receives license terms and quotas
3. Caches validation (valid for grace period)
4. Periodically reports usage metadata

### Usage Reporting

License Proxy reports only metadata:

<table>
  <thead>
    <tr>
      <th>
        Data Reported
      </th>

      <th>
        Example
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Audio duration
      </td>

      <td>
        3600 seconds
      </td>
    </tr>

    <tr>
      <td>
        Request count
      </td>

      <td>
        150 requests
      </td>
    </tr>

    <tr>
      <td>
        Features used
      </td>

      <td>
        streaming, punctuation
      </td>
    </tr>

    <tr>
      <td>
        Response codes
      </td>

      <td>
        200, 400, 500
      </td>
    </tr>
  </tbody>
</table>

**No audio or transcript data is transmitted** to Smallest servers.

### Offline Mode

If the connection to the license server fails, License Proxy:

* Uses cached validation (24-hour grace period)
* Continues serving requests
* Logs warning messages
* Retries connection periodically
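
The grace-period behavior can be pictured with a small decision function. This is only an illustrative model, not the License Proxy implementation: the function name and parameters are hypothetical, and it assumes that requests stop being served once the 24-hour window has elapsed without a successful revalidation, which this page does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# 24-hour grace period, as described in the list above.
GRACE_PERIOD = timedelta(hours=24)


def may_serve(
    last_validated: Optional[datetime],
    now: datetime,
    validated_now: bool,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Decide whether requests may be served.

    validated_now:  the license server was reached and accepted the key.
    last_validated: time of the last successful validation, or None.
    """
    if validated_now:
        return True
    if last_validated is None:
        # Never validated, so there is no cached result to fall back on.
        return False
    # Offline: rely on the cached validation while still inside the grace window.
    return now - last_validated <= grace


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(may_serve(t0, t0 + timedelta(hours=23), validated_now=False))  # inside the window
print(may_serve(t0, t0 + timedelta(hours=25), validated_now=False))  # window elapsed
```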

### Logs

Key log messages:

```
✓ License validated successfully
✓ License valid until: 2024-12-31
✓ Server listening on port 6699
⚠ Connection to license server failed, using cached validation
```

## Redis

Provides caching and state management for the system.

### Purpose

* Request queuing and coordination
* Session state for streaming connections
* Caching of frequent requests
* Performance optimization

### Container Details

**Image:** `redis:latest` or `redis:7-alpine`

**Port:** `6379` - Redis protocol

**Resources:**

* CPU: 0.5-1 core
* Memory: 512 MB - 1 GB
* No GPU required

### Configuration Options

Default configuration with minimal setup:

```yaml
redis:
  image: redis:latest
  ports:
    - "6379:6379"
```

Enable data persistence:

```yaml
redis:
  image: redis:latest
  command: redis-server --appendonly yes
  volumes:
    - redis-data:/data
```

Add password protection:

```yaml
redis:
  image: redis:latest
  command: redis-server --requirepass ${REDIS_PASSWORD}
```

Use external Redis instance:

```yaml
environment:
  REDIS_URL: redis://external-host:6379
```

When using an external Redis instance, remove the Redis service from `docker-compose.yml`.

### Data Stored

Redis stores:

* Request queue state
* WebSocket session data
* Temporary audio chunks (streaming)
* Worker status and health

Data in Redis is temporary and can be safely cleared. No persistent state is stored.

### Health Check

Built-in health check:

```yaml
healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 5s
  timeout: 3s
  retries: 5
```

## Service Dependencies

Startup order and dependencies:

```mermaid
graph TD
    Redis[Redis] --> ASR[Lightning ASR]
    License[License Proxy] --> ASR
    ASR --> API[API Server]
    License --> API
    
    style Redis fill:#DC382D
    style License fill:#1E90FF
    style ASR fill:#0D9373
    style API fill:#07C983
```

### Recommended Startup Sequence

1. **Redis** - Starts almost immediately (about 5 seconds)
2. **License Proxy** - Validates license (10-15 seconds)
3. **Lightning ASR** - Downloads/loads model (30-600 seconds)
4. **API Server** - Connects to services (5-10 seconds)
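
The startup sequence follows from the dependency graph by topological sorting. The sketch below encodes the edges from the diagram and derives a valid order with the standard-library `graphlib` module (Python 3.9+). The service names are assumptions chosen for illustration and may differ from the names in your `docker-compose.yml`.

```python
from graphlib import TopologicalSorter

# service -> set of services it depends on (edges from the diagram above).
# Service names are hypothetical; adjust them to match your compose file.
DEPENDS_ON = {
    "redis": set(),
    "license-proxy": set(),
    "lightning-asr": {"redis", "license-proxy"},
    "api-server": {"lightning-asr", "license-proxy"},
}


def startup_order(deps: dict[str, set[str]] = DEPENDS_ON) -> list[str]:
    """Return one valid startup order: every service appears after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = startup_order()
print(" -> ".join(order))
```

Redis and License Proxy do not depend on each other, so either may come first. Lightning ASR is always third and the API Server always last.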

## Resource Planning

### Minimum Configuration

For development/testing:

```yaml
Total Resources:
  CPU: 6 cores
  Memory: 16 GB
  GPU: 1x T4 (16 GB VRAM)
  Storage: 100 GB
```

### Production Configuration

For production workloads:

```yaml
Total Resources:
  CPU: 12 cores
  Memory: 32 GB
  GPU: 1x A10 (24 GB VRAM)
  Storage: 200 GB
```
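
As a sanity check, the per-service CPU and memory ranges from the Container Details sections can be summed and compared with these totals. The sketch below does that arithmetic. It converts 256 MB and 512 MB to 0.25 GB and 0.5 GB, and it ignores host OS and Docker overhead, so treat the result as a lower bound rather than a sizing formula.

```python
# (min, max) per service, copied from the Container Details sections.
SERVICES = {
    "api-server":    {"cpu": (0.5, 2.0),  "mem_gb": (0.5, 2.0)},
    "lightning-asr": {"cpu": (4.0, 8.0),  "mem_gb": (12.0, 16.0)},
    "license-proxy": {"cpu": (0.25, 1.0), "mem_gb": (0.25, 0.5)},
    "redis":         {"cpu": (0.5, 1.0),  "mem_gb": (0.5, 1.0)},
}


def total(resource: str) -> tuple[float, float]:
    """Sum the lower and upper bounds of one resource across all services."""
    low = sum(spec[resource][0] for spec in SERVICES.values())
    high = sum(spec[resource][1] for spec in SERVICES.values())
    return low, high


print("CPU cores:", total("cpu"))     # (5.25, 12.0)
print("Memory GB:", total("mem_gb"))  # (13.25, 19.5)
```

The summed lower bounds (5.25 cores, 13.25 GB) fit within the minimum configuration, and the summed upper bounds (12 cores, 19.5 GB) fit within the production configuration.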

### Multi-Worker Configuration

For high-volume production:

```yaml
Total Resources:
  CPU: 24 cores
  Memory: 64 GB
  GPU: 2x A10 (24 GB VRAM each)
  Storage: 300 GB
```

## Monitoring

### Container Health

Check container status:

```bash
docker compose ps
```

### Resource Usage

Monitor resource consumption:

```bash
docker stats
```

### GPU Usage

Monitor GPU utilization:

```bash
watch -n 1 nvidia-smi
```
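
For scripted monitoring, `nvidia-smi` can also emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below is one possible way to collect that output from Python. The helper names are hypothetical, the sample line in the comment is made-up data, and the parser assumes every queried field is numeric (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess

# Fields requested from nvidia-smi; with "nounits", memory is in MiB and utilization in percent.
QUERY = "name,memory.total,memory.used,utilization.gpu"


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        name, total, used, util = [field.strip() for field in line.split(",")]
        gpus.append({
            "name": name,
            "memory_total_mib": int(total),
            "memory_used_mib": int(used),
            "utilization_pct": int(util),
        })
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi on the host and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Example with made-up output (no GPU needed):
#   parse_gpu_csv("NVIDIA A10, 23028, 5312, 41")
```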

### Logs

View service logs:

```bash
docker compose logs -f [service-name]
```

## What's Next?

* Customize service configuration and resource allocation
* Debug issues and optimize performance