Parallelism and Latency | Smallest AI Docs

The numbers below come from internal benchmarks on a single GPU host. Use them to size deployments for batch transcription or to set customer-facing SLOs.

All figures are single-GPU steady-state. Throughput on multi-GPU clusters scales linearly with GPU count. Re-benchmark in your own environment before locking SLOs; throughput depends on hardware revision, driver version, and audio characteristics.
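Given linear scaling, a first-pass GPU count is simple arithmetic on daily audio volume and per-GPU speed, expressed as RTFx (the multiple of real-time speed, used in the Pulse Pro tables below). The following is a minimal sketch, not a sizing tool: the function name, the `utilization` headroom factor, and the 100,000-hour workload are hypothetical, and the two RTFx inputs are taken from the Pulse Pro long-form figures on this page.

```python
import math


def gpus_needed(audio_hours_per_day: float, rtfx_per_gpu: float,
                utilization: float = 0.7) -> int:
    """First-pass GPU count, assuming throughput scales linearly with GPU count.

    utilization is a hypothetical headroom factor (the fraction of each day a
    GPU spends on useful work); it is not a figure from this page.
    """
    if rtfx_per_gpu <= 0 or not 0 < utilization <= 1:
        raise ValueError("rtfx_per_gpu must be > 0 and utilization in (0, 1]")
    # One GPU at RTFx r transcribes r hours of audio per wall-clock hour.
    audio_hours_per_gpu_per_day = rtfx_per_gpu * 24 * utilization
    return max(1, math.ceil(audio_hours_per_day / audio_hours_per_gpu_per_day))


# Hypothetical workload: 100,000 hours of audio per day.
print(gpus_needed(100_000, rtfx_per_gpu=200))  # long-form with word timestamps
print(gpus_needed(100_000, rtfx_per_gpu=68))   # measured single-file lower bound
```

Sizing against the lower-bound RTFx rather than the typical figure gives a more conservative count; which one applies depends on how much of the workload is very long single files.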

Pulse Pro

Recommended GPU: 1× NVIDIA L4 (24 GB VRAM). The numbers below were measured on an L40S (48 GB) for reference; L4 delivers lower throughput at materially lower cost, while A100 and H100 deliver higher throughput. Re-benchmark on your target GPU before locking customer SLOs.

Long-form audio on L40S (single 2 hr file)

Mode	Throughput (RTFx)	2 hr file latency
No word timestamps	250–300×	~24–29 sec
With word timestamps	~200×	~36 sec

RTFx is the multiple of real-time speed: at 250×, a 1 hour audio file transcribes in ~14 seconds. The no-timestamps mode is roughly 25–50% faster (250–300× versus ~200×) because the alignment pass is skipped.
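The latency column in the table above follows directly from the RTFx definition: wall-clock time is audio duration divided by RTFx. A small sketch of that conversion (the function name is illustrative, not part of any Smallest AI SDK):

```python
def latency_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock transcription time for audio of a given length at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


ONE_HOUR = 3600.0
TWO_HOURS = 2 * ONE_HOUR

print(round(latency_seconds(ONE_HOUR, 250), 1))   # 14.4
print(round(latency_seconds(TWO_HOURS, 300), 1))  # 24.0
print(round(latency_seconds(TWO_HOURS, 250), 1))  # 28.8
print(round(latency_seconds(TWO_HOURS, 200), 1))  # 36.0
```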

Sustained throughput on L40S (batched, 50-second chunks)

Mode	RPS (requests / second)	Effective RTFx
No word timestamps	19–21	~1,000×
With word timestamps	8–10	~450×

These numbers assume optimal batching of typical-length audio. On a single challenging long-form file we have measured as low as 68× RTFx (a 1.92 hr Earnings22 sample). Plan for that lower bound on single very-long-form files.
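The "Effective RTFx" column can be reproduced from the RPS column if each request is assumed to carry one 50-second chunk (an assumption consistent with the table heading, not something this page states explicitly). The sketch below also works out the wall-clock time implied by the 68× lower bound; the function name is illustrative.

```python
def effective_rtfx(requests_per_second: float, chunk_seconds: float) -> float:
    """Seconds of audio transcribed per wall-clock second."""
    return requests_per_second * chunk_seconds


CHUNK_SECONDS = 50.0  # assumes one 50-second chunk per request

for label, low_rps, high_rps in [
    ("no word timestamps", 19, 21),
    ("with word timestamps", 8, 10),
]:
    low = effective_rtfx(low_rps, CHUNK_SECONDS)
    high = effective_rtfx(high_rps, CHUNK_SECONDS)
    print(f"{label}: {low:.0f}x to {high:.0f}x")

# Lower bound: a 1.92 hr file at 68x RTFx.
worst_case_seconds = 1.92 * 3600 / 68
print(round(worst_case_seconds))  # 102
```

The batched ranges come out at 950–1,050× and 400–500×, matching the rounded ~1,000× and ~450× in the table, while the same-length single file at the lower bound takes well over a minute.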

Hardware reference

The public Open ASR Leaderboard measures throughput on an A100-80GB at batch size 64. The L40S figures above are roughly half of what an A100 delivers for the same workload. The recommended L4 deployment delivers lower throughput than L40S; expect a multi-fold drop in RTFx relative to the L40S numbers, especially with word timestamps enabled.

Pulse

Pulse is multilingual and runs on a smaller GPU footprint than Pulse Pro.

Time to first transcript (TTFT)

Concurrency	TTFT
1	64 ms
100	300 ms

Streaming TTFT is the time from the first audio frame arriving until the first transcription event leaves the server.
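To check TTFT against your own deployment, one option is to time it from the client. This is a minimal sketch under stated assumptions: `send_first_frame` and `events` are hypothetical stand-ins for whatever streaming client you use, not a Smallest AI API. A client-side measurement also includes network time in both directions, so it will read higher than the server-side definition above.

```python
import time
from typing import Callable, Iterable


def measure_ttft_ms(send_first_frame: Callable[[], None],
                    events: Iterable[object],
                    clock: Callable[[], float] = time.perf_counter) -> float:
    """Milliseconds from sending the first audio frame to receiving the first event."""
    start = clock()
    send_first_frame()
    for _event in events:
        # Stop at the first transcription event; later events are irrelevant to TTFT.
        return (clock() - start) * 1000.0
    raise RuntimeError("stream closed before any transcription event arrived")


def fake_events():
    """Stand-in for a real event stream: one event after a short delay."""
    time.sleep(0.05)
    yield {"text": "hello"}


print(f"{measure_ttft_ms(lambda: None, fake_events()):.0f} ms")
```

The `clock` parameter exists so the function can be driven by a deterministic clock in tests; in normal use the default is sufficient.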

Sustained throughput

Sustained throughput on Pulse depends on language and feature mix: diarization adds latency, and word timestamps add a small amount. Benchmark in your environment for production sizing; the rough order of magnitude is similar to Pulse Pro batched mode.
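A benchmark in your own environment can start as a small timing harness. In the sketch below, `transcribe` is a hypothetical callable that wraps your client call, and `files` pairs each path with its known audio duration; neither is an API from this documentation. The harness sends files one at a time, so it measures single-request RTFx; a batched or concurrent workload needs a load generator instead.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_rtfx(transcribe: Callable[[str], object],
                 files: Iterable[Tuple[str, float]],
                 clock: Callable[[], float] = time.perf_counter) -> float:
    """Total audio seconds divided by total wall-clock seconds spent transcribing."""
    audio_seconds = 0.0
    wall_seconds = 0.0
    for path, duration_seconds in files:
        start = clock()
        transcribe(path)  # hypothetical: replace with your own client call
        wall_seconds += clock() - start
        audio_seconds += duration_seconds
    if wall_seconds <= 0:
        raise ValueError("no measurable transcription time; pass at least one file")
    return audio_seconds / wall_seconds


def fake_transcribe(path: str) -> str:
    """Stand-in that only sleeps, so the example runs without a server."""
    time.sleep(0.01)
    return ""


rtfx = measure_rtfx(fake_transcribe, [("a.wav", 30.0), ("b.wav", 60.0)])
print(f"measured RTFx: {rtfx:.0f}x")
```

Run it with audio that resembles production traffic in language, length, and enabled features, since those are the factors named above.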

What affects throughput in practice

Word timestamps. On Pulse Pro, word alignment costs roughly 20–33% of long-form throughput (~200× versus 250–300×) and more than half of batched throughput (~450× versus ~1,000×). Skip them if you only need the transcript text.
Speaker diarization. Adds latency, more pronounced on shorter audio. On long-form files the relative cost is smaller.
Audio length and chunking. Pulse Pro processes audio in internal chunks; very long single files do not parallelize across the GPU the way batches of medium-length files do.
Batch size. The published 250–300× RTFx assumes the worker is fed efficiently. A bursty single-request workload realizes lower numbers; a steady batched workload realizes higher numbers.
GPU host class. L4 is the recommended production GPU for STT. L40S is the internal benchmark reference (used for the numbers above); A100 / H100 deliver higher throughput; T4 is supported with reduced throughput.
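The cost of word timestamps can be worked out from the Pulse Pro tables earlier on this page. The sketch below computes the fraction of throughput lost when the feature is enabled; the function name is illustrative, and the inputs are the L40S figures quoted above, so the percentages apply to that hardware only.

```python
def relative_cost(rtfx_without: float, rtfx_with: float) -> float:
    """Fraction of throughput lost when a feature is enabled."""
    if rtfx_without <= 0:
        raise ValueError("rtfx_without must be positive")
    return 1.0 - rtfx_with / rtfx_without


# Long-form single file: 250-300x without timestamps, ~200x with.
print(f"{relative_cost(250, 200):.0%}")   # 20%
print(f"{relative_cost(300, 200):.0%}")   # 33%

# Batched 50-second chunks: ~1,000x without timestamps, ~450x with.
print(f"{relative_cost(1000, 450):.0%}")  # 55%
```

The gap between the long-form and batched results suggests the alignment pass weighs more heavily when the worker is otherwise fed efficiently, which is worth confirming on your own hardware.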

Next steps

Hardware requirements for picking a GPU.
Cloud deployment recommendations for AWS, GCP, and Azure.
Quick Start to bring up a self-hosted STT cluster.