Smallest STT models are evaluated against three open-source datasets (FLEURS, ESB, and WildASR) plus an internal English perturbation suite. Results are reported as Word Error Rate (WER) by language; lower is better. NA = not available or not supported by that provider.
This page covers both models:
Pulse Pro is tied for #2 on the public Open ASR Leaderboard at 5.42% average WER across eight ESB datasets. WER is computed on normalized text, with the Whisper EnglishTextNormalizer applied to transcripts before scoring. Lower is better.
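As a rough illustration of how a normalized WER figure is produced, the sketch below scores one reference/hypothesis pair with a word-level edit distance. The `normalize` helper is a deliberately simplified, hypothetical stand-in (lowercasing and punctuation stripping only). It is not the Whisper EnglishTextNormalizer, which applies many more rules, so numbers from this sketch will not match leaderboard values.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in normalizer (assumption): lowercase, drop punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep word characters, whitespace, and apostrophes
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(f"{wer('The cat sat on the mat.', 'the cat sit on mat'):.4f}")
```

Normalizing both sides before scoring keeps casing and punctuation differences from counting as word errors, which is why the normalizer used should always be stated alongside a WER figure.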
Pulse Pro leads on conversational (AMI) and financial (SPGISpeech) workloads. Cohere edges ahead on read speech (LibriSpeech, TED-LIUM).
Sorted by ESB average WER. Lower is better. Commercial APIs in our accuracy band:
Measured on 1× NVIDIA L40S (48 GB), long-form audio.
L4 is the recommended production GPU and runs at lower throughput than the L40S reference. See Cloud deployment for sizing.
TTFT measures the latency between when a user stops speaking and when the model returns the complete transcript. Lower TTFT means faster response times and better user experience in real-time applications.
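The definition above can be turned into a small measurement helper. This is a minimal sketch under stated assumptions: the two timestamps (end of speech and arrival of the complete transcript) are captured by the caller from the same monotonic clock, and the function names `ttft_ms` and `summarize` are hypothetical, not part of any Smallest AI SDK.

```python
import math
import statistics


def ttft_ms(speech_end_s: float, transcript_received_s: float) -> float:
    """Latency in milliseconds between end of speech and the complete transcript."""
    if transcript_received_s < speech_end_s:
        raise ValueError("transcript timestamp precedes end of speech")
    return (transcript_received_s - speech_end_s) * 1000.0


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Median and nearest-rank 95th percentile over a set of TTFT measurements."""
    ordered = sorted(latencies_ms)
    median = statistics.median(ordered)  # raises StatisticsError on an empty list
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile, 1-based
    return {"median_ms": median, "p95_ms": ordered[rank - 1]}


if __name__ == "__main__":
    samples = [ttft_ms(10.0, 10.25), ttft_ms(20.0, 20.5)]
    print(summarize(samples))
```

Reporting a tail percentile next to the median is a common choice for real-time systems, since occasional slow responses affect perceived responsiveness more than the typical case does.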
WER on the English subset of FLEURS across providers in streaming mode. Lower is better.
Audio amplitude normalization materially changes WER on FLEURS. Raw FLEURS audio has variable, often low amplitude, and most competitors benchmark on it without normalizing peak audio to −10 dBFS. As a result, published numbers for some models look much better than those models actually perform once amplitude is controlled. Pulse is stable across all amplitude regimes.
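Peak normalization to a target level such as −10 dBFS can be sketched as a single gain applied to the whole clip. The example assumes floating-point samples where full scale is 1.0. The helper names are hypothetical, and this is not necessarily the exact preprocessing used for the results on this page.

```python
import math


def peak_dbfs(samples: list[float]) -> float:
    """Peak level in dBFS, assuming full scale corresponds to an absolute value of 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return -math.inf  # digital silence has no finite peak level
    return 20.0 * math.log10(peak)


def normalize_peak(samples: list[float], target_dbfs: float = -10.0) -> list[float]:
    """Scale the clip so that its largest absolute sample sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # nothing to scale in a silent clip
    target_peak = 10.0 ** (target_dbfs / 20.0)  # -10 dBFS is roughly 0.316 of full scale
    gain = target_peak / peak
    return [s * gain for s in samples]


if __name__ == "__main__":
    quiet_clip = [0.01, -0.02, 0.005]
    louder = normalize_peak(quiet_clip)
    print(round(peak_dbfs(quiet_clip), 2), round(peak_dbfs(louder), 2))
```

Because the same gain is applied to every sample, the waveform shape is unchanged. Only the overall level moves, which makes this a reasonable way to compare models on equal amplitude footing.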
Google’s multilingual speech dataset covering 102 languages, built on the FLoRes-101 translation benchmark. Contains ~12 hours of read speech per language and is the standard benchmark for evaluating multilingual ASR, including low-resource languages.
Evaluated on the FLEURS dataset (non-streaming / batch mode).
Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.
Evaluated on the FLEURS dataset (streaming mode).
Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.
A Hugging Face benchmark suite aggregating 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Lower WER is better.
Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.
WER across seven Hindi speech datasets covering read speech (FLEURS), conversational speech (Kathbath, Common Voice), telephony / contact-center audio (Mucs, Gramvaani), TTS-derived audio (Indic-TTS), and a noise-augmented variant (Kathbath noisy). Compared against the strongest Hindi STT baselines: IndicWhisper, Sarvam Saaras v3, and Deepgram Nova-3.
Evaluated on the open-source datasets. Smallest Pulse numbers from internal evaluation. Lower is better.
Not a public dataset. Hindi audio is sliced by perturbation type to isolate model weaknesses. Lower WER is better, except for Entity EDR, where higher is better (↑).
An open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Lower WER is better. n/a = not supported by that provider.
Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.
Not a public dataset. The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.
Use linear16 encoding for the lowest latency.
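For context, linear16 is generally understood to mean raw 16-bit signed little-endian PCM. Assuming that definition, the sketch below converts floating-point samples in the range [-1.0, 1.0] into linear16 bytes. The function name is hypothetical, and the sample rate and channel layout expected by the API are not covered here.

```python
import struct


def float_to_linear16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # clamp out-of-range values to avoid overflow
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian byte order; "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


if __name__ == "__main__":
    pcm = float_to_linear16([0.0, 0.5, -0.5, 1.0])
    print(len(pcm), "bytes")  # two bytes per sample
```

Uncompressed PCM avoids a decode step on the receiving side, which is one plausible reason it is the lowest-latency option, at the cost of more bandwidth than a compressed codec.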