Services Overview

Architecture

The Docker deployment consists of four main services that work together: API Server, Lightning ASR, License Proxy, and Redis.

API Server

The API Server is the main entry point for all client requests.

Purpose

  • Routes incoming API requests to Lightning ASR workers
  • Manages WebSocket connections for streaming
  • Handles request queuing and load balancing
  • Provides unified API interface

Container Details

Image (string): quay.io/smallestinc/self-hosted-api-server:latest

Port (integer): 7100 (main API endpoint)

Resources (object):
  • CPU: 0.5-2 cores
  • Memory: 512 MB - 2 GB
  • No GPU required

Key Endpoints

Endpoint | Method | Purpose
/health | GET | Health check
/v1/listen | POST | Synchronous transcription
/v1/listen/stream | WebSocket | Streaming transcription
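
A script can poll the health endpoint from the table above. The sketch below assumes the API Server is reachable on port 7100 and that a healthy server answers GET /health with HTTP 200. The source does not document the response body, so the body is not inspected. The helper names `endpoint_url` and `is_healthy` are illustrative, not part of the product.

```python
import http.client
import urllib.request

API_PORT = 7100  # main API endpoint, per Container Details


def endpoint_url(host: str, path: str, port: int = API_PORT) -> str:
    """Build an http:// URL for one of the API Server endpoints."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def is_healthy(host: str = "localhost", timeout: float = 3.0) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(endpoint_url(host, "/health"), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a non-2xx response
        return False
```

Calling `is_healthy()` on the Docker host checks the default mapping. Pass a different host when the API Server is published elsewhere.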

Environment Variables

LICENSE_KEY: Your license key
LIGHTNING_ASR_BASE_URL: Internal URL to Lightning ASR
API_BASE_URL: Internal URL to License Proxy
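
Before starting the container, it can help to confirm that these variables are set. The sketch below is a pre-flight check, assuming all three variables are required (the source lists them without saying which are optional). The function name `missing_vars` is illustrative.

```python
import os

# Variables listed for the API Server; treating all of them as required is an assumption.
REQUIRED_VARS = ("LICENSE_KEY", "LIGHTNING_ASR_BASE_URL", "API_BASE_URL")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not str(env.get(name, "")).strip()]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All API Server variables are set")
```

The function takes any mapping, so the same check can run against a parsed `.env` file instead of `os.environ`.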

Logs

Key log messages:

✓ Connected to Lightning ASR at http://lightning-asr:2233
✓ License validation successful
✓ API server listening on port 7100

Dependencies

  • Requires Lightning ASR to be running
  • Requires License Proxy for validation
  • Optionally uses Redis for request coordination

Lightning ASR

The core speech recognition engine powered by GPU acceleration.

Purpose

  • Performs audio-to-text transcription
  • Processes both batch and streaming requests
  • Manages GPU resources and model inference
  • Handles audio preprocessing and postprocessing

Container Details

Image (string): quay.io/smallestinc/lightning-asr:latest

Port (integer): 2233 (ASR service endpoint)

Resources (object):
  • CPU: 4-8 cores
  • Memory: 12-16 GB
  • GPU: 1x NVIDIA GPU (16+ GB VRAM)

GPU Requirements

Lightning ASR requires an NVIDIA GPU with CUDA support:

GPU Model | VRAM | Performance
A100 | 40-80 GB | Excellent
A10 | 24 GB | Excellent
L4 | 24 GB | Very Good
T4 | 16 GB | Good

Environment Variables

MODEL_URL: Download URL for ASR model
LICENSE_KEY: Your license key
REDIS_URL: Redis connection string
PORT: Service port (default 2233)
GPU_DEVICE_ID: GPU to use (for multi-GPU)

Model Loading

On first startup, Lightning ASR:

  1. Downloads model from MODEL_URL (~20 GB)
  2. Validates model integrity
  3. Loads model into GPU memory
  4. Performs warmup inference

Use persistent volumes to cache models and avoid re-downloading on container restart.

Logs

Key log messages:

✓ GPU detected: NVIDIA A10 (24GB)
✓ Downloading model from URL...
✓ Model loaded successfully (5.2GB)
✓ Warmup completed in 3.2s
✓ Server ready on port 2233

Performance

Typical performance metrics:

Metric | Value
Real-time Factor | 0.05-0.15x
Cold Start | 30-60 seconds
Warm Inference | 50-200 ms latency
Throughput | 100+ hours/hour (A10)
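
To make these figures concrete, the arithmetic below uses the usual definition of real-time factor: processing time divided by audio duration. Reading the 100+ hours/hour figure as aggregate throughput across concurrent requests is an assumption, not something the table states. The function names are illustrative.

```python
import math


def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds to transcribe `audio_seconds` of audio at real-time factor `rtf`."""
    return audio_seconds * rtf


def streams_for_throughput(target_audio_hours_per_hour: float, rtf: float) -> int:
    """Concurrent requests needed if each one alone processes 1/rtf audio hours per hour."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(target_audio_hours_per_hour * rtf, 9))


if __name__ == "__main__":
    for rtf in (0.05, 0.15):
        secs = processing_seconds(3600, rtf)
        streams = streams_for_throughput(100, rtf)
        print(f"RTF {rtf}: one hour of audio takes about {secs:.0f} s; "
              f"100 h/h needs about {streams} concurrent requests")
```

Under these assumptions, one hour of audio takes roughly 3 to 9 minutes to process, and reaching 100 audio hours per hour implies somewhere between 5 and 15 requests in flight at once.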

Dependencies

  • Requires License Proxy for validation
  • Requires Redis for request coordination
  • Requires NVIDIA GPU

License Proxy

Validates license keys and reports usage to Smallest servers.

Purpose

  • Validates license keys on startup
  • Reports usage metadata to Smallest
  • Provides grace period for offline operation
  • Acts as licensing gateway for all services

Container Details

Image (string): quay.io/smallestinc/license-proxy:latest

Port (integer): 6699 (license validation endpoint, internal)

Resources (object):
  • CPU: 0.25-1 core
  • Memory: 256-512 MB
  • No GPU required

Environment Variables

LICENSE_KEY: Your license key

Network Requirements

License Proxy requires outbound HTTPS access to:

  • api.smallest.ai on port 443

Ensure your firewall allows these connections.
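
One way to verify the firewall rule is to attempt a TCP connection from the host that will run License Proxy. The sketch below checks only that the TCP handshake succeeds. It does not validate TLS or account for an HTTP proxy, and the helper name `can_connect` is illustrative.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False


if __name__ == "__main__":
    ok = can_connect("api.smallest.ai", 443)
    print("outbound HTTPS reachable" if ok else "blocked or unreachable")
```

A `False` result from inside the container but `True` from the host usually points at container networking rather than the perimeter firewall.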

Validation Process

  1. On startup, validates license key with Smallest servers
  2. Receives license terms and quotas
  3. Caches validation (valid for grace period)
  4. Periodically reports usage metadata

Usage Reporting

License Proxy reports only metadata:

Data Reported | Example
Audio duration | 3600 seconds
Request count | 150 requests
Features used | streaming, punctuation
Response codes | 200, 400, 500

No audio or transcript data is transmitted to Smallest servers.

Offline Mode

If the connection to the license server fails, License Proxy:

  • Uses cached validation (24-hour grace period)
  • Continues serving requests
  • Logs warning messages
  • Retries connection periodically
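
The grace-period behavior above can be pictured as a simple time comparison. The sketch below illustrates that decision and is not License Proxy's actual implementation. The 24-hour window comes from the list above, while the function name `may_serve` and the choice to refuse requests once the window has expired are assumptions, since the source does not describe what happens after the grace period ends.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=24)  # from the Offline Mode description


def may_serve(last_validated: datetime, now: datetime, online_ok: bool) -> bool:
    """Decide whether requests may be served.

    online_ok: whether the most recent validation attempt reached the license server.
    Refusing requests after the grace period is an assumption of this sketch.
    """
    if online_ok:
        return True
    # Offline: fall back to the cached validation while it is still fresh
    return now - last_validated <= GRACE_PERIOD


if __name__ == "__main__":
    validated_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for hours in (1, 23, 25):
        now = validated_at + timedelta(hours=hours)
        print(f"{hours} h offline ->", may_serve(validated_at, now, online_ok=False))
```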

Logs

Key log messages:

✓ License validated successfully
✓ License valid until: 2024-12-31
✓ Server listening on port 6699
⚠ Connection to license server failed, using cached validation

Redis

Provides caching and state management for the system.

Purpose

  • Request queuing and coordination
  • Session state for streaming connections
  • Caching of frequent requests
  • Performance optimization

Container Details

Image (string): redis:latest or redis:7-alpine

Port (integer): 6379 (Redis protocol)

Resources (object):
  • CPU: 0.5-1 core
  • Memory: 512 MB - 1 GB
  • No GPU required

Configuration Options

Default configuration with minimal setup:

redis:
  image: redis:latest
  ports:
    - "6379:6379"

Data Stored

Redis stores:

  • Request queue state
  • WebSocket session data
  • Temporary audio chunks (streaming)
  • Worker status and health

Data in Redis is temporary and can be safely cleared. No persistent state is stored.

Health Check

Built-in health check:

healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 5s
  timeout: 3s
  retries: 5
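
The same probe can be run from outside the container without `redis-cli`. The sketch below sends Redis's inline `PING` command over a plain socket and expects the `+PONG` reply. It assumes Redis is published on port 6379 without authentication or TLS, and the helper name `redis_ping` is illustrative.

```python
import socket


def redis_ping(host: str = "localhost", port: int = 6379, timeout: float = 3.0) -> bool:
    """Return True if the server answers an inline PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            # A server that requires a password replies with an error line instead
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        return False


if __name__ == "__main__":
    print("redis ok" if redis_ping() else "redis not responding")
```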

Service Dependencies

Startup order and dependencies:

  1. Redis - Starts immediately (5 seconds)
  2. License Proxy - Validates license (10-15 seconds)
  3. Lightning ASR - Downloads/loads model (30-600 seconds)
  4. API Server - Connects to services (5-10 seconds)
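
The durations above give a rough bound for a readiness timeout. The arithmetic below assumes the services start strictly one after another, which the source implies but does not guarantee. The names `STARTUP_SECONDS` and `startup_window` are illustrative.

```python
# (service, fastest seconds, slowest seconds), taken from the startup list above
STARTUP_SECONDS = [
    ("Redis", 5, 5),
    ("License Proxy", 10, 15),
    ("Lightning ASR", 30, 600),
    ("API Server", 5, 10),
]


def startup_window(stages=STARTUP_SECONDS):
    """Return (best_case, worst_case) total seconds for a strictly sequential start."""
    best = sum(lo for _, lo, _ in stages)
    worst = sum(hi for _, _, hi in stages)
    return best, worst


if __name__ == "__main__":
    best, worst = startup_window()
    print(f"Stack ready in roughly {best} to {worst} seconds")
```

Under that assumption the whole stack takes between 50 and 630 seconds to come up, with the Lightning ASR model download dominating the worst case.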

Resource Planning

Minimum Configuration

For development/testing:

Total Resources:
  CPU: 6 cores
  Memory: 16 GB
  GPU: 1x T4 (16 GB VRAM)
  Storage: 100 GB

Production Configuration

For production workloads:

Total Resources:
  CPU: 12 cores
  Memory: 32 GB
  GPU: 1x A10 (24 GB VRAM)
  Storage: 200 GB

Multi-Worker Configuration

For high-volume production:

Total Resources:
  CPU: 24 cores
  Memory: 64 GB
  GPU: 2x A10 (24 GB VRAM each)
  Storage: 300 GB
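
As a cross-check, the per-service Resources ranges listed earlier can be summed and compared with these totals. The sketch below assumes one instance of each service, 1 GB = 1024 MB, and no allowance for the host operating system. The names `SERVICES` and `totals` are illustrative.

```python
# Per-service (low, high) ranges copied from each service's Resources entry
SERVICES = {
    "API Server":    {"cpu": (0.5, 2.0),  "mem_mb": (512, 2048)},
    "Lightning ASR": {"cpu": (4.0, 8.0),  "mem_mb": (12 * 1024, 16 * 1024)},
    "License Proxy": {"cpu": (0.25, 1.0), "mem_mb": (256, 512)},
    "Redis":         {"cpu": (0.5, 1.0),  "mem_mb": (512, 1024)},
}


def totals(services=SERVICES):
    """Return ((cpu_low, cpu_high), (mem_low_gb, mem_high_gb)) summed over all services."""
    cpu_lo = sum(s["cpu"][0] for s in services.values())
    cpu_hi = sum(s["cpu"][1] for s in services.values())
    mem_lo = sum(s["mem_mb"][0] for s in services.values()) / 1024
    mem_hi = sum(s["mem_mb"][1] for s in services.values()) / 1024
    return (cpu_lo, cpu_hi), (mem_lo, mem_hi)


if __name__ == "__main__":
    (cpu_lo, cpu_hi), (mem_lo, mem_hi) = totals()
    print(f"CPU: {cpu_lo}-{cpu_hi} cores, memory: {mem_lo}-{mem_hi} GB")
```

The lower bounds (5.25 cores, 13.25 GB) fit inside the minimum configuration of 6 cores and 16 GB, and the upper bounds (12 cores, 19.5 GB) fit inside the production configuration of 12 cores and 32 GB.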

Monitoring

Container Health

Check container status:

$ docker compose ps

Resource Usage

Monitor resource consumption:

$ docker stats

GPU Usage

Monitor GPU utilization:

$ watch -n 1 nvidia-smi

Logs

View service logs:

$ docker compose logs -f [service-name]
