The numbers below come from internal benchmarks on a single GPU host. Use them to size deployments for batch transcription or to set customer-facing SLOs.
All figures are single-GPU steady-state. Multi-GPU clusters scale linearly. Re-benchmark in your own environment before locking SLOs; throughput depends on hardware revision, driver version, and audio characteristics.
Recommended GPU: 1× NVIDIA L4 (24 GB VRAM). The numbers below were measured on an L40S (48 GB) for reference. An L4 delivers lower throughput at materially lower cost; A100 and H100 deliver higher throughput.
RTFx is the real-time speed multiple: audio duration divided by processing time. At 250×, a 1-hour audio file transcribes in about 14 seconds (3,600 s / 250 ≈ 14.4 s). The no-timestamps mode runs roughly one-third faster because the alignment pass is skipped.
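The conversion between RTFx and wall-clock time can be sketched in a few lines. The function names here are illustrative and are not part of any product API; the only input taken from the text is the 250× example.

```python
def transcription_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock processing time for audio of a given duration at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


def measured_rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """RTFx observed in a benchmark run: audio duration over processing time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# A 1-hour file at 250x RTFx, as in the example above.
print(round(transcription_seconds(3600, 250), 1))  # 14.4
```

The second helper is the one to use when re-benchmarking: time a run over a known amount of audio and compare the result against the published figures.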
These numbers assume optimal batching of typical-length audio. On a single challenging long-form file we have measured as low as 68× RTFx (a 1.92-hour Earnings22 sample). When sizing for single very long files, plan around this lower bound.
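As a rough illustration of sizing against the lower bound, the sketch below estimates a GPU count for a daily batch volume. It assumes the linear multi-GPU scaling stated earlier; the `utilization` headroom factor and the 100,000-hour daily volume are hypothetical inputs chosen for the example, not measured values.

```python
import math


def gpus_needed(audio_hours_per_day: float, rtfx: float,
                window_hours: float = 24.0, utilization: float = 0.7) -> int:
    """Estimate GPUs required to process a daily audio volume.

    Assumes throughput scales linearly with GPU count. `utilization` is a
    hypothetical headroom factor (fraction of the window spent transcribing).
    """
    if rtfx <= 0 or window_hours <= 0 or not 0 < utilization <= 1:
        raise ValueError("rtfx and window_hours must be positive, utilization in (0, 1]")
    gpu_hours = audio_hours_per_day / rtfx      # compute time on one GPU
    usable_hours = window_hours * utilization   # per-GPU budget in the window
    return max(1, math.ceil(gpu_hours / usable_hours))


# Hypothetical volume: 100,000 audio hours per day.
print(gpus_needed(100_000, rtfx=250))  # 24 (optimal batching)
print(gpus_needed(100_000, rtfx=68))   # 88 (long-form lower bound)

# Worst-case latency for the single 1.92-hour file measured at 68x.
print(round(1.92 * 3600 / 68, 1))      # 101.6 seconds
```

The gap between the two GPU counts shows why the workload mix matters: a fleet sized on the batched figure can be several times too small if traffic is dominated by single long files.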
The public Open ASR Leaderboard measures throughput on A100-80GB at batch 64. The L40S figures in this section are roughly half of what an A100 delivers for the same workload. The recommended L4 deployment delivers lower throughput than the L40S; expect a multi-fold drop in RTFx relative to the L40S numbers, especially with word timestamps enabled.
Pulse is multilingual and runs on a smaller GPU footprint than Pulse Pro.
Streaming TTFT is the time from the first audio frame arriving until the first transcription event leaves the server.
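A minimal sketch of computing TTFT from a timestamped event log, following the definition above. The `StreamEvent` type and the `"audio_in"` / `"transcript_out"` labels are hypothetical names for illustration; they do not describe the server's actual logging format. Timestamps taken on the client side would also include network transit, so they overstate the server-side figure.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamEvent:
    t: float    # timestamp in seconds, from a monotonic clock
    kind: str   # "audio_in" or "transcript_out" (hypothetical labels)


def ttft_seconds(events: Iterable[StreamEvent]) -> Optional[float]:
    """Time from the first audio frame to the first transcription event.

    Returns None if the log lacks either an audio frame or a later
    transcription event.
    """
    events = list(events)
    first_audio = min((e.t for e in events if e.kind == "audio_in"), default=None)
    if first_audio is None:
        return None
    first_out = min(
        (e.t for e in events if e.kind == "transcript_out" and e.t >= first_audio),
        default=None,
    )
    if first_out is None:
        return None
    return first_out - first_audio


log = [
    StreamEvent(10.0, "audio_in"),
    StreamEvent(10.02, "audio_in"),
    StreamEvent(10.25, "transcript_out"),
]
print(ttft_seconds(log))  # 0.25
```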
Sustained throughput on Pulse depends on language and feature mix: diarization adds latency, and word timestamps add a small amount. Benchmark in your environment for production sizing; as a rough guide, throughput is of the same order of magnitude as Pulse Pro in batched mode.