Performance
Smallest STT models are evaluated against three open-source datasets (FLEURS, ESB, and WildASR) plus an internal English perturbation suite. Results are reported as Word Error Rate (WER) by language; lower is better. NA = not available or not supported by that provider.
This page covers both models:
- Pulse Pro (English only) sits in the leaderboard-accuracy band. Benchmarks are on the Open ASR Leaderboard ESB suite and FLEURS English.
- Pulse (17 streaming + 26 pre-recorded languages) is evaluated on FLEURS, ESB, WildASR, and our internal perturbation suite below.
Pulse Pro: Open ASR Leaderboard
Pulse Pro is tied for #2 on the public Open ASR Leaderboard at 5.42% average WER across eight ESB datasets. Scores are normalized WER, computed after applying the Whisper EnglishTextNormalizer. Lower is better.
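As a rough illustration of what normalized WER means, the sketch below normalizes both transcripts and then divides the word-level edit distance by the reference length. The `normalize` function here is a simplified stand-in (lowercasing and punctuation stripping only) and is an assumption for illustration; it is not the Whisper EnglishTextNormalizer, which applies many more rules.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in normalizer (assumption): lowercase, drop punctuation."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(wer("The cat sat on the mat.", "the cat sit on mat"))
```

A leaderboard-style average would then be the mean of the per-dataset WER values; whether that mean is weighted by dataset size is not stated here.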
Head-to-head vs leaderboard top-3
Pulse Pro leads on conversational (AMI) and financial (SPGISpeech) workloads. Cohere edges ahead on read speech (LibriSpeech, TED-LIUM).
Position on the public leaderboard
Sorted by ESB average WER. Lower is better. Commercial APIs in our accuracy band:
FLEURS English
Throughput
Measured on 1× NVIDIA L40S (48 GB), long-form audio.
L4 is the recommended production GPU and runs at lower throughput than the L40S reference. See Cloud deployment for sizing.
Pulse: multilingual benchmarks
Latency
Time-to-First-Transcript (TTFT)
TTFT measures the latency between when a user stops speaking and when the model returns the complete transcript. Lower TTFT means faster response times and better user experience in real-time applications.
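The definition above can be sketched as a small measurement helper. The `Utterance` record and its field names are hypothetical; in practice the end-of-speech timestamp would come from your own audio pipeline and the transcript timestamp from when the complete transcript arrives at the client.

```python
import math
from dataclasses import dataclass


@dataclass
class Utterance:
    speech_end_s: float        # when the user stopped speaking (seconds)
    final_transcript_s: float  # when the complete transcript arrived (seconds)


def ttft_ms(utterance: Utterance) -> float:
    """TTFT in milliseconds for a single utterance."""
    return (utterance.final_transcript_s - utterance.speech_end_s) * 1000.0


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, useful for reporting p50 / p95 latency."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    runs = [Utterance(1.0, 1.25), Utterance(4.0, 4.5), Utterance(9.0, 9.125)]
    latencies = [ttft_ms(u) for u in runs]
    print("p50:", percentile(latencies, 50), "ms")
    print("p95:", percentile(latencies, 95), "ms")
```

Reporting a percentile rather than a single run is a design choice in this sketch, since tail latency tends to matter most in real-time applications.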
FLEURS Streaming — English
WER on the English subset of FLEURS across providers in streaming mode. Lower is better.
A note on audio amplitude normalization
Audio amplitude normalization materially changes WER on FLEURS. Most competitors benchmark on raw FLEURS — which has variable, often low amplitude — without normalizing peak audio to −10 dBFS. This makes some models look much better than they actually are. Pulse is stable across all amplitude regimes.
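For readers who want to reproduce the normalized condition, peak normalization to −10 dBFS can be sketched as follows. This is a minimal version that assumes audio is already decoded to floating-point samples in the range [-1.0, 1.0]; the function names are illustrative and are not part of any Smallest API.

```python
import math


def peak_normalize(samples: list[float], target_dbfs: float = -10.0) -> list[float]:
    """Scale samples so the loudest one sits at target_dbfs (relative to 1.0)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty audio: nothing to scale.
        return list(samples)
    target_peak = 10 ** (target_dbfs / 20)  # -10 dBFS is about 0.316 linear
    gain = target_peak / peak
    return [s * gain for s in samples]


def peak_dbfs(samples: list[float]) -> float:
    """Peak level in dBFS; returns -inf for silence."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")


if __name__ == "__main__":
    quiet = [0.01, -0.05, 0.02]
    print(peak_dbfs(quiet))                  # raw peak level
    print(peak_dbfs(peak_normalize(quiet)))  # approximately -10.0
```

Running the same evaluation on both the raw and the normalized audio gives a direct view of how sensitive a given model is to input level.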
Pre-recorded — FLEURS
FLEURS is Google’s multilingual speech dataset covering 102 languages, built on the FLoRes-101 translation benchmark. It contains ~12 hours of read speech per language and is the standard benchmark for evaluating multilingual ASR, including low-resource languages.
Evaluated on the FLEURS dataset (non-streaming / batch mode).
Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.
Streaming — FLEURS
Evaluated on the FLEURS dataset (streaming mode).
Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.
English STT — ESB Dataset (Streaming)
ESB is a Hugging Face benchmark suite aggregating 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Lower WER is better.
Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.
Hindi — multi-dataset (Streaming)
WER across seven Hindi speech datasets covering read speech (FLEURS), conversational speech (Kathbath, Common Voice), telephony / contact-center audio (Mucs, Gramvaani), TTS-derived audio (Indic-TTS), and a noise-augmented variant (Kathbath noisy). Compared against the strongest Hindi STT baselines: IndicWhisper, Sarvam Saaras v3, Scribe v2, and Deepgram Nova-3.
Evaluated on the open-source datasets. Smallest Pulse numbers from internal evaluation. Lower is better.
East Asian languages — Multi-dataset (Streaming)
WER for the four East Asian languages on the streaming endpoint (US region). Three datasets per language covering read speech (FLEURS), conversational/crowdsourced speech (Common Voice 25), and language-specific corpora (JSUT, Zeroth-Korean, MDCC, AISHELL-1). Compared head-to-head against Deepgram Nova-3. Lower WER is better.
Pulse averages 10.91% WER vs Deepgram Nova-3’s 15.50% across the four East Asian languages — Pulse leads on 10 of 12 dataset rows, with the largest gains on Japanese CV-25 and Cantonese CV-25.
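To put the two averages on a common scale, the gap can be expressed as a relative WER reduction. The sketch below uses only the two averages quoted above; the helper name is illustrative.

```python
def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction of the baseline's errors removed by the candidate."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - candidate_wer) / baseline_wer


if __name__ == "__main__":
    # Averages quoted above: Deepgram Nova-3 at 15.50%, Pulse at 10.91%.
    reduction = relative_wer_reduction(15.50, 10.91)
    print(f"{reduction:.1%}")  # prints 29.6%
```

The absolute gap is 4.59 percentage points, which corresponds to roughly 30% fewer word errors relative to the baseline average.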
These four languages stream from wss://api.us.smallest.ai/waves/v1/stt/live?model=pulse only (US region). See the Pulse model card for the region-routing details.
ASR Robustness — WildASR Dataset (Streaming)
WildASR is an open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Lower WER is better. n/a = not supported by that provider.
Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.
Internal English Perturbation Benchmark
Not a public dataset. The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.
Internal Hindi Perturbation Benchmark
Not a public dataset. Hindi audio is sliced by perturbation type to isolate model weaknesses. Compared against Sarvam Saaras v3 and Deepgram Nova-3. Most metrics are WER, where lower is better. For Entity EDR (↑), higher is better.
Optimization Tips
- Use 16 kHz sample rate for an optimal balance of quality and latency.
- Choose linear16 encoding for the lowest latency.
- Enable only the features your use case requires; each optional feature adds work.
- Batch process when latency is not critical.
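The first two tips can be sketched as a small audio-preparation step. This assumes linear16 means 16-bit signed little-endian PCM (the common meaning of the term) and that input audio is already mono float samples at 16 kHz; the 100 ms chunk size is an arbitrary illustrative choice, not a documented requirement.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # recommended sample rate from the tips above
BYTES_PER_SAMPLE = 2     # 16-bit PCM


def to_linear16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> list[bytes]:
    """Split PCM bytes into fixed-duration chunks for streaming."""
    bytes_per_chunk = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + bytes_per_chunk] for i in range(0, len(pcm), bytes_per_chunk)]


if __name__ == "__main__":
    one_second_of_silence = [0.0] * SAMPLE_RATE_HZ
    pcm = to_linear16(one_second_of_silence)
    chunks = chunk_pcm(pcm)
    print(len(pcm), "bytes in", len(chunks), "chunks")
```

Sending raw PCM avoids a decode step on the server side, which is one plausible reason an uncompressed encoding gives the lowest latency.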

