> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Parallelism and Latency

> Throughput, real-time factor (RTFx), and latency figures for self-hosted Pulse and Pulse Pro deployments.

The numbers below come from internal benchmarks on a single GPU host at steady state. Use them to size deployments for batch transcription or as a starting point for customer-facing SLOs.

Multi-GPU clusters scale linearly. Re-benchmark in your own environment before locking SLOs; throughput depends on hardware revision, driver version, and audio characteristics.
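As a rough illustration of how the single-GPU figures translate into cluster size, the sketch below estimates a GPU count for a daily batch volume. It relies on the linear-scaling statement above; the `utilization` headroom factor and the function name are assumptions made for this example, not part of the product.

```python
import math


def gpus_needed(audio_hours_per_day: float, rtfx: float, utilization: float = 0.7) -> int:
    """Estimate GPU count for a daily audio volume, assuming linear multi-GPU scaling.

    `utilization` is a hypothetical headroom factor (fraction of the day each
    GPU is kept busy); it is an assumption, not a measured value.
    """
    if rtfx <= 0 or not 0 < utilization <= 1:
        raise ValueError("rtfx must be positive and utilization must be in (0, 1]")
    # At a given RTFx, one GPU transcribes `rtfx` hours of audio per wall-clock hour.
    audio_hours_per_gpu_per_day = rtfx * 24 * utilization
    return max(1, math.ceil(audio_hours_per_day / audio_hours_per_gpu_per_day))


# 100,000 audio hours per day at 200x with 70% utilization.
print(gpus_needed(100_000, rtfx=200, utilization=0.7))  # 30
```

Treat the result as a starting point and substitute the RTFx you measure on your own hardware.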

## Pulse Pro

**Recommended GPU:** 1× NVIDIA L4 (24 GB VRAM). The numbers below were measured on **L40S (48 GB)** for reference. L4 delivers lower throughput at materially lower cost; A100 and H100 deliver higher throughput. Re-benchmark on your target GPU before locking customer SLOs.

### Long-form audio on L40S (single 2 hr file)

| Mode                 | Throughput (RTFx) | 2 hr file latency |
| -------------------- | ----------------- | ----------------- |
| No word timestamps   | **250–300×**      | \~24–29 sec       |
| With word timestamps | **\~200×**        | \~36 sec          |

RTFx is the multiple of real-time speed: at 250×, a 1-hour audio file transcribes in \~14 seconds. The no-timestamps mode is roughly 25–50% faster than the timestamps mode (250–300× versus \~200×) because the alignment pass is skipped.
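The latency column in the table follows directly from the RTFx definition: latency is audio duration divided by RTFx. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def transcription_latency_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time to transcribe `audio_seconds` of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


two_hours = 2 * 3600
for rtfx in (300, 250, 200):
    print(f"{rtfx}x -> {transcription_latency_seconds(two_hours, rtfx):.1f} s")
# 300x -> 24.0 s
# 250x -> 28.8 s
# 200x -> 36.0 s
```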

### Sustained throughput on L40S (batched, 50-second chunks)

| Mode                 | RPS (requests / second) | Effective RTFx |
| -------------------- | ----------------------- | -------------- |
| No word timestamps   | 19–21                   | \~1,000×       |
| With word timestamps | 8–10                    | \~450×         |

These numbers assume optimal batching of typical-length audio. On a single challenging long-form file we have measured as low as **68× RTFx** (1.92 hr Earnings22 sample). When sizing for single very long files, plan around this lower bound.
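The effective RTFx column can be reproduced from the RPS column, since each request carries a 50-second chunk. The sketch below shows that conversion and the worst-case latency implied by the 68× measurement; the function name is illustrative.

```python
def effective_rtfx(requests_per_second: float, chunk_seconds: float) -> float:
    """Seconds of audio transcribed per wall-clock second under batched load."""
    return requests_per_second * chunk_seconds


CHUNK_SECONDS = 50  # chunk length used in the batched benchmark above

# No word timestamps: 19-21 RPS.
print(effective_rtfx(19, CHUNK_SECONDS), effective_rtfx(21, CHUNK_SECONDS))  # 950 1050
# With word timestamps: 8-10 RPS.
print(effective_rtfx(8, CHUNK_SECONDS), effective_rtfx(10, CHUNK_SECONDS))   # 400 500

# Lower bound: the 1.92 hr file at 68x takes roughly 102 seconds.
worst_case_seconds = 1.92 * 3600 / 68
print(f"{worst_case_seconds:.0f} s")
```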

### Hardware reference

The public Open ASR Leaderboard measures throughput on A100-80GB at batch 64. The L40S figures above are roughly half of what A100 delivers for the same workload. The recommended L4 deployment delivers lower throughput than L40S; expect a multi-fold drop in RTFx relative to the L40S numbers, especially with word timestamps enabled.

## Pulse

Pulse is multilingual and runs on a smaller GPU footprint than Pulse Pro.

### Time to first transcript (TTFT)

| Concurrency | TTFT   |
| ----------- | ------ |
| 1           | 64 ms  |
| 100         | 300 ms |

Streaming TTFT is the time from the first audio frame arriving until the first transcription event leaves the server.
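To compare your own deployment against the table, you can time the gap between sending the first frame and receiving the first event. The sketch below is client-side, so it includes network time and gives an upper bound on the server-side TTFT defined above. `send_first_frame` and `events` are hypothetical placeholders for whatever streaming client you use; the fake stream exists only to make the example runnable.

```python
import time
from typing import Any, Callable, Iterable, Iterator


def measure_ttft_ms(send_first_frame: Callable[[], None], events: Iterable[Any]) -> float:
    """Milliseconds from sending the first audio frame to receiving the first event."""
    start = time.perf_counter()
    send_first_frame()
    for _ in events:
        # Stop at the first transcription event.
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any transcription event arrived")


def fake_events(delay_seconds: float) -> Iterator[str]:
    """Stand-in for a real event stream: yields one event after a delay."""
    time.sleep(delay_seconds)
    yield "partial transcript"


ttft_ms = measure_ttft_ms(lambda: None, fake_events(0.05))
print(f"TTFT: {ttft_ms:.0f} ms")
```

Run the measurement at the concurrency you expect in production, since the table shows TTFT rising with concurrent streams.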

### Sustained throughput

Sustained throughput on Pulse depends on language and feature mix (diarization adds latency; word timestamps add a small amount). Benchmark in your environment for production sizing. As a rough guide, throughput is of the same order of magnitude as Pulse Pro batched mode.
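One way to benchmark in your own environment is to time repeated transcriptions of a file of known duration and report the median RTFx. The sketch below assumes a `transcribe` callable that you supply (for example, a wrapper around your client call); it is a placeholder, not a Pulse API, and the sleep-based stand-in is only there so the example runs.

```python
import statistics
import time
from typing import Callable


def measure_rtfx(transcribe: Callable[[], object], audio_seconds: float, runs: int = 3) -> float:
    """Median RTFx over `runs` timed calls to a user-supplied `transcribe` callable."""
    if runs < 1 or audio_seconds <= 0:
        raise ValueError("runs must be >= 1 and audio_seconds must be positive")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe()
        # Guard against a zero elapsed time on very coarse clocks.
        elapsed = max(time.perf_counter() - start, 1e-9)
        samples.append(audio_seconds / elapsed)
    return statistics.median(samples)


# Stand-in workload: pretends to transcribe 2 seconds of audio in about 20 ms.
rtfx = measure_rtfx(lambda: time.sleep(0.02), audio_seconds=2.0)
print(f"measured RTFx: {rtfx:.0f}x")
```

Discarding a warm-up run and testing with audio that resembles production traffic will make the figure more representative.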

***

## What affects throughput in practice

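* **Word timestamps.** On Pulse Pro, word alignment costs roughly 20–33% of throughput on long-form files (\~200× versus 250–300×) and more than half in the batched benchmark (8–10 versus 19–21 RPS). Skip them if you only need the transcript text.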
* **Speaker diarization.** Adds latency, more pronounced on shorter audio. On long-form files the relative cost is smaller.
* **Audio length and chunking.** Pulse Pro processes audio in internal chunks; very long single files do not parallelize across the GPU the way batches of medium-length files do.
* **Batch size.** The published 250–300× RTFx assumes the worker is fed efficiently. A bursty single-request workload realizes lower numbers; a steady batched workload realizes higher numbers.
* **GPU host class.** L4 is the recommended production GPU for STT. L40S is the internal benchmark reference (used for the numbers above); A100 / H100 deliver higher throughput; T4 is supported with reduced throughput.
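To quantify how much a feature such as word timestamps costs on your own hardware, compare throughput with and without it. The sketch below applies that calculation to the Pulse Pro L40S figures from this page; the function name is illustrative.

```python
def throughput_cost(without_feature: float, with_feature: float) -> float:
    """Fraction of throughput lost when a feature is enabled."""
    if without_feature <= 0:
        raise ValueError("baseline throughput must be positive")
    return 1.0 - with_feature / without_feature


# Long-form file: 250-300x without timestamps, ~200x with.
print(f"{throughput_cost(250, 200):.0%}")   # 20%
print(f"{throughput_cost(300, 200):.0%}")   # 33%
# Batched 50-second chunks: ~1,000x without timestamps, ~450x with.
print(f"{throughput_cost(1000, 450):.0%}")  # 55%
```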

## Next steps

* [Hardware requirements](/waves/self-host/docker-setup/stt-deployment/prerequisites/hardware-requirements) for picking a GPU.
* [Cloud deployment recommendations](/waves/self-host/docker-setup/stt-deployment/cloud-deployment) for AWS, GCP, and Azure.
* [Quick Start](/waves/self-host/docker-setup/stt-deployment/quick-start) to bring up a self-hosted STT cluster.