
Performance


Smallest STT models are evaluated on three open-source benchmarks (FLEURS, ESB, and WildASR), several open Hindi datasets, and internal perturbation suites for English and Hindi. Results are reported as Word Error Rate (WER) unless noted otherwise. Lower is better. NA = not available or not supported by that provider.
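
As a reference for how WER is usually computed, here is a minimal sketch: the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The `normalize` helper is a simplified, hypothetical stand-in and is not equivalent to the Whisper EnglishTextNormalizer used for the Pulse Pro numbers on this page, so scores from this sketch will not reproduce the tables exactly.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified normalizer (assumption): lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction; multiply by 100 for a percentage."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `wer("The cat sat on the mat.", "the cat sit on mat")` counts one substitution and one deletion against six reference words. Character Error Rate (CER) follows the same pattern with characters in place of words.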

This page covers both models:

  • Pulse Pro (English only) sits in the leaderboard-accuracy band. Benchmarks are on the Open ASR Leaderboard ESB suite and FLEURS English.
  • Pulse (38 languages, streaming + non-streaming) is evaluated on FLEURS, ESB, WildASR, and our internal perturbation suite below.

Pulse Pro: Open ASR Leaderboard

Pulse Pro is tied for #2 on the public Open ASR Leaderboard at 5.42% average WER across eight ESB datasets. Scores are normalized WER, computed after applying the Whisper EnglishTextNormalizer. Lower is better.

Head-to-head vs leaderboard top-3

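| Dataset | Pulse Pro | Granite 4.1 2B | Cohere Transcribe |
| --- | --- | --- | --- |
| AMI (meetings) | 7.32 | 8.09 | 8.13 |
| Earnings22 | 9.04 | 8.37 | 10.86 |
| GigaSpeech | 9.52 | 9.80 | 9.34 |
| LibriSpeech clean | 1.73 | 1.33 | 1.25 |
| LibriSpeech other | 3.74 | 2.50 | 2.37 |
| SPGISpeech (financial) | 2.04 | 3.78 | 3.08 |
| TED-LIUM | 3.68 | 3.07 | 2.49 |
| VoxPopuli | 6.32 | 5.70 | 5.87 |
| Average (8 datasets) | 5.42 | 5.33 | 5.42 |
| Open ASR rank | 🥈 #2 (tied) | 🥇 #1 | 🥈 #2 (tied) |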

Pulse Pro has the lowest WER of the three on conversational (AMI) and financial (SPGISpeech) audio. Cohere Transcribe leads on LibriSpeech and TED-LIUM, and Granite leads on Earnings22 and VoxPopuli.
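
The "Average (8 datasets)" row is consistent with an unweighted mean of the eight per-dataset scores. The page does not state the averaging method, so treat that as an assumption. A short sketch that reproduces the Pulse Pro figure from the head-to-head numbers:

```python
from statistics import mean

# Per-dataset WER (%) for Pulse Pro, copied from the head-to-head comparison.
pulse_pro_wer = {
    "AMI": 7.32,
    "Earnings22": 9.04,
    "GigaSpeech": 9.52,
    "LibriSpeech clean": 1.73,
    "LibriSpeech other": 3.74,
    "SPGISpeech": 2.04,
    "TED-LIUM": 3.68,
    "VoxPopuli": 6.32,
}


def macro_average(wer_by_dataset: dict[str, float]) -> float:
    """Unweighted mean across datasets, rounded to two decimals (assumed method)."""
    return round(mean(wer_by_dataset.values()), 2)


print(macro_average(pulse_pro_wer))  # 5.42
```

An unweighted mean gives every dataset equal influence regardless of how many hours of audio it contains, so a small dataset with a high WER moves the average as much as a large one.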

Position on the public leaderboard

Sorted by ESB average WER. Lower is better. Commercial APIs in our accuracy band:

| Rank | Model | ESB Avg WER ↓ |
| --- | --- | --- |
| 1 | IBM Granite Speech 4.1 2B | 5.33 |
| 2 | Pulse Pro | 5.42 |
| 2 | Cohere Labs Transcribe (tied) | 5.42 |
| 3 | Zoom Scribe v1 | 5.47 |
| 5 | NVIDIA Canary Qwen 2.5B | 5.63 |
| 8 | ElevenLabs Scribe v2 | 5.83 |
| 12 | AssemblyAI Universal-3 Pro | 6.21 |
| 18 | Speechmatics Enhanced | 6.91 |
| 23 | OpenAI Whisper Large v3 | 7.44 |

FLEURS English

| Metric | Pulse Pro |
| --- | --- |
| WER (FLEURS en_us) | 3.92% |
| CER (FLEURS en_us) | 1.73% |

Throughput

Measured on 1× NVIDIA L40S (48 GB), long-form audio.

| Mode | Throughput (RTFx) |
| --- | --- |
| No word timestamps | 250–300× |
| With word timestamps | ~200× |

L4 is the recommended production GPU and runs at lower throughput than the L40S reference. See Cloud deployment for sizing.
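
To turn an RTFx figure into a capacity estimate, the sketch below assumes RTFx means seconds of audio processed per second of wall-clock time (the usual inverse real-time factor; the page does not define it). The function names are illustrative only.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Throughput as audio duration divided by processing time (assumed definition)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def processing_seconds(audio_seconds: float, throughput_rtfx: float) -> float:
    """Estimated wall-clock time to transcribe audio at a given RTFx."""
    if throughput_rtfx <= 0:
        raise ValueError("throughput_rtfx must be positive")
    return audio_seconds / throughput_rtfx


one_hour = 3600.0
print(processing_seconds(one_hour, 250))  # about 14.4 seconds without word timestamps
print(processing_seconds(one_hour, 200))  # 18.0 seconds with word timestamps
```

These estimates apply to the L40S reference hardware only. Throughput on an L4 is lower, so measure RTFx on the target GPU before sizing a deployment.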


Pulse: multilingual benchmarks

Latency

Time-to-First-Transcript (TTFT)

TTFT measures the latency between when a user stops speaking and when the model returns the complete transcript. Lower TTFT means faster response times and better user experience in real-time applications.

| Model | Latency (ms) |
| --- | --- |
| Smallest Pulse STT | 64 |
| Deepgram Nova 2 | 76 |
| Deepgram Nova 3 | 71 |
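
If you want to measure TTFT in your own setup, one rough approach is to record when speech ends and when the transcript arrives, then summarize the differences. The sketch below assumes you already have both timestamps in seconds on the same clock. How the figures in the table were collected is not documented here, so this is not a reproduction of that methodology.

```python
from statistics import median


def ttft_ms(speech_end_s: float, transcript_received_s: float) -> float:
    """Latency in milliseconds between end of speech and transcript arrival."""
    if transcript_received_s < speech_end_s:
        raise ValueError("transcript cannot arrive before speech ends")
    return (transcript_received_s - speech_end_s) * 1000.0


def median_ttft_ms(events: list[tuple[float, float]]) -> float:
    """Median TTFT over (speech_end_s, transcript_received_s) pairs."""
    return median(ttft_ms(end, received) for end, received in events)


# Hypothetical timestamps for three utterances.
events = [(1.000, 1.064), (2.500, 2.571), (4.000, 4.076)]
print(round(median_ttft_ms(events)))  # 71
```

Reporting the median rather than the mean keeps a single slow request from dominating the result.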

FLEURS Streaming — English

WER on the English subset of FLEURS across providers in streaming mode. Lower is better.

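| Provider | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras V3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | 6.03% | 3.13% | 6.54% | 13.79% | 11.59% | 60.00% | 6.34% | 3.88% |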

A note on audio amplitude normalization

Audio amplitude normalization materially changes WER on FLEURS. Raw FLEURS has variable, often low amplitude, and most competitors benchmark on it without normalizing peak audio to −10 dBFS. Some models score far worse on the raw audio than on normalized audio, so results from a single amplitude regime can misrepresent real-world accuracy. Pulse is stable across the three regimes tested.

| Model | Raw FLEURS | −10 dBFS | −20 dBFS | Stable across regimes? |
| --- | --- | --- | --- | --- |
| Smallest Pulse | 6.03% | 6.06% | 5.81% | Yes |
| Deepgram Nova 3 | 11.59% | 6.57% | 6.51% | Partial: 1.8× degradation on raw |
| Grok | 60.00% | 7.58% | 8.59% | Collapses on raw |
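
Peak normalization to a target level can be sketched as follows. This assumes audio is held as floating-point samples in the range [-1.0, 1.0], where 1.0 is full scale (0 dBFS). The function names are illustrative, and the exact procedure used for the numbers above is not specified on this page.

```python
import math


def peak_dbfs(samples: list[float]) -> float:
    """Peak level in dBFS for float samples where 1.0 is full scale."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(peak)


def normalize_peak(samples: list[float], target_dbfs: float = -10.0) -> list[float]:
    """Scale samples so the loudest one sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # nothing to scale
    target_peak = 10.0 ** (target_dbfs / 20.0)  # -10 dBFS is roughly 0.316
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet = [0.0, 0.01, -0.02, 0.005]      # peak near -34 dBFS
louder = normalize_peak(quiet, -10.0)  # same waveform, peak at -10 dBFS
```

Running the same test set through raw, −10 dBFS, and −20 dBFS variants, as in the table, shows whether a model's accuracy depends on input level.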

Pre-recorded — FLEURS

FLEURS is Google’s multilingual speech dataset covering 102 languages, built on the FLoRes-101 translation benchmark. It contains ~12 hours of read speech per language and is the standard benchmark for evaluating multilingual ASR, including low-resource languages.

Evaluated on the FLEURS dataset (non-streaming / batch mode).

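| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Italian | 3.0% | 10.7% | 6.2% |
| English | 4.5% | 7.9% | 6.7% |
| Spanish | 3.2% | 8.6% | 4.1% |
| Portuguese | 5.0% | 9.9% | 7.5% |
| German | 6.4% | 8.2% | 8.5% |
| French | 7.1% | 13.3% | 10.7% |
| Russian | 9.6% | 7.9% | 11.8% |
| Ukrainian | 7.5% | 12.4% | NA |
| Polish | 10.3% | 12.2% | NA |
| Hindi | 6.3% | 23.5% | 23.6% |
| Kannada | 9.8% | NA | NA |
| Malayalam | 10.0% | NA | NA |
| Gujarati | 12.3% | NA | NA |
| Marathi | 11.5% | NA | NA |
| Czech | 12.4% | 22.9% | 19.2% |
| Oriya | 14.8% | NA | NA |
| Bengali | 16.4% | NA | NA |
| Slovak | 13.5% | 31.2% | NA |
| Dutch | 15.0% | 16.3% | 12.5% |
| Swedish | 18.7% | 17.7% | 14.3% |
| Telugu | 14.3% | NA | NA |
| Finnish | 18.3% | 14.1% | 13.2% |
| Latvian | 16.5% | 48.7% | NA |
| Romanian | 17.8% | 36.0% | NA |
| Punjabi | 18.3% | NA | NA |
| Estonian | 17.8% | 49.0% | NA |
| Bulgarian | 24.1% | 32.7% | NA |
| Danish | 19.8% | 21.1% | 16.1% |
| Tamil | 21.6% | NA | NA |
| Hungarian | 22.5% | 31.8% | 28.6% |
| Maltese | 25.5% | NA | NA |
| Lithuanian | 25.1% | 44.9% | NA |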

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.

Streaming — FLEURS

Evaluated on the FLEURS dataset (streaming mode). Values are WER in percent.

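| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Italian | 4.41 | 11.05 | 6.99 |
| English | 6.03 | 15.59 | 11.59 |
| Spanish | 5.99 | 10.67 | 7.52 |
| Portuguese | 8.32 | 14.15 | 11.46 |
| German | 9.5 | 11.1 | 10.15 |
| French | 10.71 | 14.3 | 12.07 |
| Russian | 14.35 | NA | NA |
| Hindi | 8.3 | 20.0 | 15.46 |
| Kannada | 16.97 | NA | NA |
| Malayalam | 15.91 | NA | NA |
| Gujarati | 20.05 | NA | NA |
| Marathi | 15.68 | NA | NA |
| Oriya | 22.74 | NA | NA |
| Bengali | 17.48 | NA | NA |
| Dutch | 11.90 | NA | NA |
| Telugu | 24.79 | NA | NA |
| Tamil | 20.15 | NA | NA |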

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.

English STT — ESB Dataset (Streaming)

ESB is a Hugging Face benchmark suite aggregating 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Lower WER is better.

Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.

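| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras V3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LibriSpeech Clean | 2.46 | 1.65 | 2.16 | 2.48 | 3.20 | 3.61 | 3.09 | 1.97 |
| LibriSpeech Other | 5.31 | 2.86 | 4.88 | 5.74 | 6.60 | 7.28 | 6.85 | 4.45 |
| Common Voice | 10.89 | 6.73 | 10.69 | 47.28 | 14.22 | 43.46 | 11.37 | 9.83 |
| VoxPopuli | 7.16 | 7.28 | 7.07 | 14.10 | 9.55 | 11.49 | 7.77 | 7.91 |
| TED-LIUM | 4.07 | 2.95 | 2.66 | 3.81 | 3.59 | 6.90 | 2.89 | 3.16 |
| GigaSpeech | 10.43 | 9.12 | 10.09 | 5.35 | 10.05 | 10.05 | 9.57 | 9.66 |
| SPGISpeech | 2.86 | 1.74 | 4.18 | 3.53 | 2.99 | 9.70 | 3.89 | 4.40 |
| Earnings22 | 12.25 | 11.52 | 12.21 | 8.54 | 15.79 | 27.02 | 11.97 | 12.20 |
| AMI | 10.58 | 14.60 | 13.19 | 8.46 | 17.04 | 19.19 | 13.08 | 12.23 |
| Aggregate | 7.33 | 6.49 | 7.46 | 11.03 | 9.23 | 15.41 | 7.83 | 7.31 |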

Hindi — multi-dataset (Streaming)

WER across seven Hindi speech datasets covering read speech (FLEURS), conversational speech (Kathbath, Common Voice), telephony / contact-center audio (MUCS, Gramvaani), TTS-derived audio (Indic-TTS), and a noise-augmented variant (Kathbath noisy). Compared against the strongest Hindi STT baselines: IndicWhisper, Sarvam Saaras v3, and Deepgram Nova-3.

Evaluated on the open-source datasets. Smallest Pulse numbers from internal evaluation. Lower is better.

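| Dataset | Smallest Pulse | IndicWhisper | Sarvam Saaras v3 | Deepgram Nova-3 |
| --- | --- | --- | --- | --- |
| FLEURS | 9.55 | 15.00 | 8.31 | 14.09 |
| Kathbath | 9.71 | 10.30 | 8.15 | 16.22 |
| Kathbath (noisy) | 10.94 | 12.00 | 10.81 | 17.06 |
| Common Voice | 11.20 | 11.40 | 11.36 | 23.55 |
| Indic-TTS | 6.39 | 7.60 | 6.49 | 10.72 |
| MUCS | 9.19 | 12.00 | 8.96 | 16.20 |
| Gramvaani | 21.43 | 26.80 | 21.80 | 31.44 |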

Internal Hindi Perturbation Benchmark

Not a public dataset. Hindi audio sliced by perturbation type to isolate model weaknesses. Lower WER is better except for Entity EDR where higher is better (↑).

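| Category | Smallest Pulse | Sarvam Saaras V3 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Noise | 15.75% | 22.18% | 21.52% |
| Silence | 9.72% | 11.38% | 18.40% |
| Entity | 10.82% | 17.36% | 14.67% |
| Entity NE-WER | 13.32% | 26.72% | 26.58% |
| Entity EDR (↑) | 83.13% | 76.13% | 67.80% |
| Boundary | 11.99% | 17.52% | 17.36% |
| Long Audios | 18.11% | 18.42% | 19.21% |
| Speed | 16.37% | 21.39% | 38.21% |
| Pitch | 11.81% | 11.92% | 19.59% |
| Audio Quality | 10.65% | 11.75% | 19.51% |
| Volume | 8.21% | 15.25% | 16.76% |
| Disfluency | 11.83% | 12.06% | 18.44% |
| Repetition | 11.44% | 11.27% | 20.40% |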

ASR Robustness — WildASR Dataset (Streaming)

WildASR is an open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Lower WER is better. n/a = not supported by that provider.

Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.

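| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Sarvam Saaras V3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Clean | 5.98 | 3.33 | 7.01 | 11.11 | 11.62 | 7.02 | 4.24 |
| Clipping | 14.03 | 6.59 | 42.10 | 4.35 | 47.35 | 28.74 | 11.20 |
| Far-field | 13.38 | 26.07 | 38.76 | n/a | 62.99 | 21.27 | 7.38 |
| Noise Gap | 8.90 | 4.04 | 9.77 | n/a | 15.04 | 9.74 | 6.30 |
| Phone Codec | 7.19 | 3.45 | 8.70 | n/a | 9.13 | 10.64 | 4.98 |
| Reverberation | 9.06 | 23.50 | 14.83 | n/a | 27.27 | 4.35 | 6.48 |
| Accent | 5.82 | 2.80 | 4.45 | n/a | 7.31 | n/a | 4.01 |
| Aggregate | 9.63 | 12.52 | 18.35 | 8.82 | 28.17 | 17.75 | 6.47 |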

Internal English Perturbation Benchmark

Not a public dataset. The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.

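| Category | Pulse | Assembly | AWS | Deepgram | Scribe |
| --- | --- | --- | --- | --- | --- |
| Noise | 10.53 | 11.93 | 14.19 | 14.58 | 10.05 |
| Silence | 5.81 | 4.22 | 8.22 | 13.28 | 10.61 |
| Telephony 911 | 21.03 | 23.93 | 27.88 | 28.43 | 20.29 |
| Boundary | 2.83 | 3.09 | 3.18 | 3.66 | 1.73 |
| Disfluency | 7.68 | 7.81 | 9.23 | 8.62 | 9.29 |
| Long Audios | 12.81 | 8.58 | 11.66 | 11.16 | 9.25 |
| Repetition | 11.38 | 9.82 | 10.39 | 9.57 | 10.81 |
| Entity | 12.43 | 10.13 | 13.35 | 11.69 | 9.48 |
| Accent | 8.68 | 7.89 | 9.51 | 10.42 | 7.25 |
| Emotion | 13.92 | 16.34 | 18.57 | 18.07 | 11.84 |
| Speaker Diversity | 7.33 | 6.72 | 8.81 | 9.48 | 5.95 |
| Speed | 4.32 | 3.63 | 4.40 | 6.88 | 3.74 |
| Pitch | 2.93 | 3.07 | 3.21 | 4.07 | 1.61 |
| Volume | 2.37 | 3.05 | 2.41 | 3.67 | 1.47 |
| Audio Quality | 2.73 | 2.86 | 3.03 | 4.08 | 1.60 |
| Average WER | 8.45 | 8.20 | 9.87 | 10.51 | 7.66 |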

Optimization Tips

  • Use 16 kHz sample rate for an optimal balance of quality and latency.
  • Choose linear16 encoding for the lowest latency.
  • Enable only the features your use case requires; each optional feature adds work.
  • Batch process when latency is not critical.
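
To prepare audio that matches the first two tips, the sketch below writes float samples as a 16 kHz, mono, 16-bit PCM WAV file using only the Python standard library. It assumes `linear16` means 16-bit signed little-endian PCM, which is the common meaning but is not defined on this page. It also assumes the samples are already at 16 kHz, since no resampling is performed. The function name is illustrative.

```python
import struct
import wave


def write_linear16_wav(path: str, samples: list[float], sample_rate: int = 16000) -> None:
    """Write float samples in [-1.0, 1.0] as mono 16-bit PCM WAV."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

A call such as `write_linear16_wav("clip.wav", samples)` produces a file you can inspect with `wave.open` to confirm the sample rate and width before sending it for transcription.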

Next Steps

  • Metrics Overview
  • Evaluation Walkthrough
  • Best Practices