Parallelism and Latency

The numbers below come from internal benchmarks on a single GPU host. Use them to size deployments for batch transcription or to set customer-facing SLOs.

All figures are single-GPU steady-state. Multi-GPU clusters scale linearly. Re-benchmark in your own environment before locking SLOs; throughput depends on hardware revision, driver version, and audio characteristics.

Pulse Pro

Recommended GPU: 1× NVIDIA L4 (24 GB VRAM). The numbers below were measured on L40S (48 GB) for reference. L4 delivers lower throughput at materially lower cost; A100 and H100 deliver higher throughput. Re-benchmark on your target GPU before locking customer SLOs.

Long-form audio on L40S (single 2 hr file)

| Mode | Throughput (RTFx) | 2 hr file latency |
| --- | --- | --- |
| No word timestamps | 250–300× | ~24–29 sec |
| With word timestamps | ~200× | ~36 sec |

RTFx is the multiple of real-time speed: at 250×, a 1-hour audio file transcribes in about 14 seconds. The no-timestamps mode is roughly 25–50% faster (250–300× versus ~200×) because the alignment pass is skipped.
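The relationship between RTFx and latency is a single division, so the table values can be checked directly. The sketch below is illustrative only; the function names are hypothetical and not part of any product API.

```python
def latency_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


def rtfx_from_timing(audio_seconds: float, wall_seconds: float) -> float:
    """RTFx implied by a measured transcription time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


ONE_HOUR = 3600.0
TWO_HOURS = 7200.0

print(round(latency_seconds(ONE_HOUR, 250), 1))   # 14.4
print(round(latency_seconds(TWO_HOURS, 300), 1))  # 24.0
print(round(latency_seconds(TWO_HOURS, 250), 1))  # 28.8
print(round(latency_seconds(TWO_HOURS, 200), 1))  # 36.0
```

The outputs match the 2 hr latency column: 24–29 seconds without word timestamps and about 36 seconds with them.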

Sustained throughput on L40S (batched, 50-second chunks)

| Mode | RPS (requests / second) | Effective RTFx |
| --- | --- | --- |
| No word timestamps | 19–21 | ~1,000× |
| With word timestamps | 8–10 | ~450× |

These numbers assume optimal batching of typical-length audio. On a single challenging long-form file, we have measured as low as 68× RTFx (a 1.92 hr Earnings22 sample). Plan for that lower bound when transcribing single very-long-form files.
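Effective RTFx in the batched table is requests per second multiplied by the 50-second chunk length. A rough sizing sketch can build on that, assuming the linear multi-GPU scaling stated at the top of this page. The function names, the 24,000 hours/day workload, and the 0.7 utilization headroom are all hypothetical values chosen for illustration, not measured figures.

```python
import math


def effective_rtfx(rps: float, chunk_seconds: float) -> float:
    """Seconds of audio processed per wall-clock second."""
    return rps * chunk_seconds


def gpus_needed(audio_hours_per_day: float, rtfx: float,
                utilization: float = 0.7) -> int:
    """GPUs needed to clear a daily audio volume within 24 hours.

    Assumes linear scaling across GPUs. `utilization` is a hypothetical
    headroom factor (fraction of the day each GPU is kept busy).
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    gpu_hours = audio_hours_per_day / (rtfx * utilization)
    return max(1, math.ceil(gpu_hours / 24))


print(effective_rtfx(20, 50))  # 1000
print(effective_rtfx(9, 50))   # 450

# Hypothetical workload: 24,000 hours of audio per day.
print(gpus_needed(24_000, 1000))  # 2  (well-batched, no word timestamps)
print(gpus_needed(24_000, 68))    # 22 (worst-case single long-form files)
```

The gap between the last two results shows why the lower bound matters when the workload is dominated by single very long files.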

Hardware reference

The public Open ASR Leaderboard measures throughput on A100-80GB at batch 64. The L40S figures above are roughly half of what A100 delivers for the same workload. The recommended L4 deployment delivers lower throughput than L40S; expect a multi-fold drop on RTFx relative to the L40S numbers, especially with word timestamps enabled.

Pulse

Pulse is multilingual and runs on a smaller GPU footprint than Pulse Pro.

Time to first transcript (TTFT)

| Concurrency | TTFT |
| --- | --- |
| 1 | 64 ms |
| 100 | 300 ms |

Streaming TTFT is the time from the first audio frame arriving until the first transcription event leaves the server.
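When re-benchmarking, a client-side approximation of TTFT can be taken by timing the gap between sending the first frame and receiving the first event. The sketch below is generic and assumes nothing about the streaming API: `send_first_frame` and `events` are hypothetical placeholders for whatever client is in use. A client-side measurement also includes network round-trip time, so it will read higher than the server-side definition above.

```python
import time
from typing import Callable, Iterable


def measure_ttft_ms(send_first_frame: Callable[[], None],
                    events: Iterable[object],
                    clock: Callable[[], float] = time.perf_counter) -> float:
    """Milliseconds from sending the first audio frame to the first event.

    `send_first_frame` and `events` are hypothetical hooks supplied by the
    caller; they stand in for a real streaming client.
    """
    start = clock()
    send_first_frame()
    for _ in events:
        # Stop at the first transcription event.
        return (clock() - start) * 1000.0
    raise RuntimeError("stream ended before any transcription event")


def simulated_events():
    """Stand-in for a real event stream; waits briefly, then yields once."""
    time.sleep(0.05)
    yield {"text": "hello"}


ttft = measure_ttft_ms(lambda: None, simulated_events())
print(f"simulated TTFT: {ttft:.0f} ms")
```

Running many such measurements concurrently and reporting percentiles gives a closer analogue to the concurrency rows in the table.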

Sustained throughput

Sustained throughput on Pulse depends on language and feature mix (diarization adds latency; word timestamps add a small amount). Benchmark in your environment for production sizing; as a rough order of magnitude, expect throughput similar to Pulse Pro batched mode.


What affects throughput in practice

  • Word timestamps. Word alignment costs roughly one-third of throughput on Pulse Pro long-form files, and more than half in batched mode (~450× versus ~1,000×). Skip them if you only need the transcript text.
  • Speaker diarization. Adds latency, more pronounced on shorter audio. On long-form files the relative cost is smaller.
  • Audio length and chunking. Pulse Pro processes audio in internal chunks; very long single files do not parallelize across the GPU the way batches of medium-length files do.
  • Batch size. The published 250–300× RTFx assumes the worker is fed efficiently. A bursty single-request workload realizes lower numbers; a steady batched workload realizes higher numbers.
  • GPU host class. L4 is the recommended production GPU for STT. L40S is the internal benchmark reference (used for the numbers above); A100 / H100 deliver higher throughput; T4 is supported with reduced throughput.
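Because these factors interact, measuring RTFx on your own hardware and audio is the most reliable check. The sketch below times a caller-supplied `transcribe` function over a list of files with known durations. `transcribe`, the file names, and the durations are hypothetical placeholders, and the stub used here only sleeps. Requests are issued one at a time, so the result reflects the single-request case rather than batched throughput.

```python
import time
from typing import Callable, Sequence, Tuple


def benchmark_rtfx(transcribe: Callable[[str], object],
                   files: Sequence[Tuple[str, float]],
                   clock: Callable[[], float] = time.perf_counter) -> float:
    """Measured RTFx: total audio seconds divided by wall-clock seconds.

    `files` is a sequence of (path, duration_seconds) pairs. `transcribe`
    is a hypothetical callable wrapping whatever client is in use.
    """
    if not files:
        raise ValueError("files must not be empty")
    total_audio = sum(duration for _, duration in files)
    start = clock()
    for path, _ in files:
        transcribe(path)  # sequential: one request in flight at a time
    wall = clock() - start
    if wall <= 0:
        raise ValueError("elapsed time must be positive")
    return total_audio / wall


def stub_transcribe(path: str) -> str:
    """Placeholder that pretends to do work; replace with a real client call."""
    time.sleep(0.01)
    return ""


sample = [("call_a.wav", 50.0), ("call_b.wav", 50.0)]  # hypothetical files
print(f"measured RTFx: {benchmark_rtfx(stub_transcribe, sample):.0f}x")
```

Run the same harness with and without word timestamps and diarization, on short and long files, to see which of the factors above dominates for your workload.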

Next steps

  • Hardware requirements: for picking a GPU.
  • Cloud deployment: recommendations for AWS, GCP, and Azure.
  • Quick Start: to bring up a self-hosted STT cluster.