Performance


Smallest STT models are evaluated against three open-source datasets (FLEURS, ESB, and WildASR) plus an internal English perturbation suite. Results are reported as Word Error Rate (WER), broken down by language or dataset. Lower is better. NA = not available or not supported by that provider.

This page covers both models:

  • Pulse Pro (English only) sits in the leaderboard-accuracy band. Benchmarks are on the Open ASR Leaderboard ESB suite and FLEURS English.
  • Pulse (17 streaming + 26 pre-recorded languages) is evaluated on FLEURS, ESB, WildASR, and our internal perturbation suite below.

Pulse Pro: Open ASR Leaderboard

Pulse Pro is tied for #2 on the public Open ASR Leaderboard at 5.42% average WER across eight ESB datasets. Scores are normalized WER, computed after applying the Whisper EnglishTextNormalizer. Lower is better.
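As a rough illustration of what a normalized WER measures, the sketch below computes word-level edit distance divided by the reference length. The `normalize` function is a deliberately simplified stand-in (lowercasing and punctuation stripping), not the Whisper EnglishTextNormalizer, so its scores will not match leaderboard numbers exactly.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in for a text normalizer (assumption, not Whisper's):
    lowercase, replace punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `wer("the quick brown fox", "the quick fox")` counts one deletion against four reference words and returns 0.25.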

Head-to-head vs leaderboard top-3

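| Dataset | Pulse Pro | Granite 4.1 2B | Cohere Transcribe |
| --- | --- | --- | --- |
| AMI (meetings) | 7.32 | 8.09 | 8.13 |
| Earnings22 | 9.04 | 8.37 | 10.86 |
| GigaSpeech | 9.52 | 9.80 | 9.34 |
| LibriSpeech clean | 1.73 | 1.33 | 1.25 |
| LibriSpeech other | 3.74 | 2.50 | 2.37 |
| SPGISpeech (financial) | 2.04 | 3.78 | 3.08 |
| TED-LIUM | 3.68 | 3.07 | 2.49 |
| VoxPopuli | 6.32 | 5.70 | 5.87 |
| Average (8 datasets) | 5.42 | 5.33 | 5.42 |
| Open ASR rank | 🥈 #2 (tied) | 🥇 #1 | 🥈 #2 (tied) |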

Pulse Pro leads on conversational (AMI) and financial (SPGISpeech) workloads. Cohere edges ahead on read speech (LibriSpeech, TED-LIUM).
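The averages in the head-to-head table are consistent with an unweighted mean over the eight datasets. The sketch below reproduces them from the per-dataset figures, assuming that this is how the average is formed, and picks the lowest-WER model per dataset. The dictionary layout and function names are illustrative only.

```python
from statistics import mean

# WER (%) per dataset, copied from the head-to-head table above.
SCORES = {
    "AMI": {"Pulse Pro": 7.32, "Granite 4.1 2B": 8.09, "Cohere Transcribe": 8.13},
    "Earnings22": {"Pulse Pro": 9.04, "Granite 4.1 2B": 8.37, "Cohere Transcribe": 10.86},
    "GigaSpeech": {"Pulse Pro": 9.52, "Granite 4.1 2B": 9.80, "Cohere Transcribe": 9.34},
    "LibriSpeech clean": {"Pulse Pro": 1.73, "Granite 4.1 2B": 1.33, "Cohere Transcribe": 1.25},
    "LibriSpeech other": {"Pulse Pro": 3.74, "Granite 4.1 2B": 2.50, "Cohere Transcribe": 2.37},
    "SPGISpeech": {"Pulse Pro": 2.04, "Granite 4.1 2B": 3.78, "Cohere Transcribe": 3.08},
    "TED-LIUM": {"Pulse Pro": 3.68, "Granite 4.1 2B": 3.07, "Cohere Transcribe": 2.49},
    "VoxPopuli": {"Pulse Pro": 6.32, "Granite 4.1 2B": 5.70, "Cohere Transcribe": 5.87},
}


def average_wer(model: str) -> float:
    """Unweighted mean WER across datasets, rounded to two decimals."""
    return round(mean(row[model] for row in SCORES.values()), 2)


def best_model(dataset: str) -> str:
    """Model with the lowest WER on one dataset."""
    row = SCORES[dataset]
    return min(row, key=row.get)
```

An unweighted mean treats every dataset equally regardless of its size, so one hard domain such as Earnings22 moves the average as much as a large read-speech corpus does.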

Position on the public leaderboard

Sorted by ESB average WER. Lower is better. Commercial APIs in our accuracy band:

| Rank | Model | ESB Avg WER ↓ |
| --- | --- | --- |
| 1 | IBM Granite Speech 4.1 2B | 5.33 |
| 2 | Pulse Pro | 5.42 |
| 2 | Cohere Labs Transcribe (tied) | 5.42 |
| 3 | Zoom Scribe v1 | 5.47 |
| 5 | NVIDIA Canary Qwen 2.5B | 5.63 |
| 8 | ElevenLabs Scribe v2 | 5.83 |
| 12 | AssemblyAI Universal-3 Pro | 6.21 |
| 18 | Speechmatics Enhanced | 6.91 |
| 23 | OpenAI Whisper Large v3 | 7.44 |

FLEURS English

| Metric | Pulse Pro |
| --- | --- |
| WER (FLEURS en_us) | 3.92% |
| CER (FLEURS en_us) | 1.73% |

Throughput

Measured on 1× NVIDIA L40S (48 GB), long-form audio.

| Mode | Throughput (RTFx) |
| --- | --- |
| No word timestamps | 250–300× |
| With word timestamps | ~200× |

L4 is the recommended production GPU and runs at lower throughput than the L40S reference. See Cloud deployment for sizing.
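To turn the throughput figures into wall-clock estimates, the sketch below assumes RTFx is the inverse real-time factor, that is, seconds of audio processed per second of compute. The function names are illustrative and the results are back-of-the-envelope estimates for the L40S reference, not sizing guidance.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated compute time for a clip at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


def measured_rtfx(audio_seconds: float, elapsed_seconds: float) -> float:
    """RTFx observed for one run: audio duration divided by processing time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


ONE_HOUR = 3600.0
no_timestamps = processing_seconds(ONE_HOUR, 250)    # lower bound of 250-300x
with_timestamps = processing_seconds(ONE_HOUR, 200)  # approximate 200x figure
```

Under these assumptions, one hour of audio takes about 14.4 seconds without word timestamps and about 18 seconds with them.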


Pulse: multilingual benchmarks

Latency

Time-to-First-Transcript (TTFT)

TTFT measures the latency between when a user stops speaking and when the model returns the complete transcript. Lower TTFT means faster response times and better user experience in real-time applications.

| Model | Latency (ms) |
| --- | --- |
| Smallest Pulse STT | 64 |
| Deepgram Nova 2 | 76 |
| Deepgram Nova 3 | 71 |
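One way to measure TTFT as defined above is to timestamp the end of speech and the arrival of the complete transcript, then take the difference. The class below is a minimal sketch: `TtftTimer` and its method names are hypothetical helpers, not part of any Smallest SDK, and how you detect end of speech depends on your client.

```python
import time
from typing import Optional


class TtftTimer:
    """Tracks one utterance: end of speech to complete transcript (hypothetical helper)."""

    def __init__(self) -> None:
        self._speech_end: Optional[float] = None

    def mark_speech_end(self, now: Optional[float] = None) -> None:
        """Call when the user stops speaking. `now` is in seconds."""
        self._speech_end = time.monotonic() if now is None else now

    def mark_transcript(self, now: Optional[float] = None) -> float:
        """Call when the complete transcript arrives. Returns TTFT in milliseconds."""
        if self._speech_end is None:
            raise RuntimeError("mark_speech_end() must be called first")
        end = time.monotonic() if now is None else now
        return (end - self._speech_end) * 1000.0
```

A monotonic clock is used so that system clock adjustments cannot produce negative or inflated latencies.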

FLEURS Streaming — English

WER on the English subset of FLEURS across providers in streaming mode. Lower is better.

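| Provider | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras V3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | 6.03% | 3.13% | 6.54% | 13.79% | 11.59% | 60.00% | 6.34% | 3.88% |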

A note on audio amplitude normalization

Audio amplitude normalization materially changes WER on FLEURS. Most competitors benchmark on raw FLEURS, which has variable, often low amplitude, without normalizing peak audio to −10 dBFS. As a result, the WER reported for some models depends heavily on the amplitude regime they were tested in. Pulse is stable across all amplitude regimes.

| Model | Raw FLEURS | −10 dBFS | −20 dBFS | Stable across regimes? |
| --- | --- | --- | --- | --- |
| Smallest Pulse | 6.03% | 6.06% | 5.81% | Yes |
| Deepgram Nova 3 | 11.59% | 6.57% | 6.51% | Partial: 1.8× degradation on raw |
| Grok | 60.00% | 7.58% | 8.59% | Collapses on raw |
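Peak normalization to a dBFS target can be sketched as a single gain applied to the whole clip. The example assumes mono samples as floats in [-1.0, 1.0], where 1.0 is full scale. It illustrates the general technique and is not the exact preprocessing used for the numbers above.

```python
import math


def peak_dbfs(samples: list[float]) -> float:
    """Peak level of a clip in dBFS (0 dBFS = full scale)."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak)


def peak_normalize(samples: list[float], target_dbfs: float = -10.0) -> list[float]:
    """Scale the clip so its loudest sample sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    target_peak = 10 ** (target_dbfs / 20)  # -10 dBFS is roughly 0.316 of full scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Because the gain is uniform, the waveform shape is unchanged. Only the overall level moves, which is what separates the raw, −10 dBFS, and −20 dBFS columns.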

Pre-recorded — FLEURS

FLEURS is Google’s multilingual speech dataset covering 102 languages, built on the FLoRes-101 translation benchmark. It contains about 12 hours of read speech per language and is the standard benchmark for evaluating multilingual ASR, including low-resource languages.

Evaluated on the FLEURS dataset (non-streaming / batch mode).

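| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Italian | 3.0% | 10.7% | 6.2% |
| English | 4.5% | 7.9% | 6.7% |
| Spanish | 3.2% | 8.6% | 4.1% |
| Portuguese | 5.0% | 9.9% | 7.5% |
| German | 6.4% | 8.2% | 8.5% |
| French | 7.1% | 13.3% | 10.7% |
| Russian | 9.6% | 7.9% | 11.8% |
| Ukrainian | 7.5% | 12.4% | NA |
| Polish | 10.3% | 12.2% | NA |
| Hindi | 6.3% | 23.5% | 23.6% |
| Czech | 12.4% | 22.9% | 19.2% |
| Slovak | 13.5% | 31.2% | NA |
| Dutch | 15.0% | 16.3% | 12.5% |
| Swedish | 18.7% | 17.7% | 14.3% |
| Finnish | 18.3% | 14.1% | 13.2% |
| Latvian | 16.5% | 48.7% | NA |
| Romanian | 17.8% | 36.0% | NA |
| Estonian | 17.8% | 49.0% | NA |
| Bulgarian | 24.1% | 32.7% | NA |
| Danish | 19.8% | 21.1% | 16.1% |
| Hungarian | 22.5% | 31.8% | 28.6% |
| Maltese | 25.5% | NA | NA |
| Lithuanian | 25.1% | 44.9% | NA |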

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.

Streaming — FLEURS

Evaluated on the FLEURS dataset (streaming mode).

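| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Italian | 4.41 | 11.05 | 6.99 |
| English | 6.03 | 15.59 | 11.59 |
| Spanish | 5.99 | 10.67 | 7.52 |
| Portuguese | 8.32 | 14.15 | 11.46 |
| German | 9.5 | 11.1 | 10.15 |
| French | 10.71 | 14.3 | 12.07 |
| Russian | 14.35 | NA | NA |
| Hindi | 8.3 | 20.0 | 15.46 |
| Gujarati | 20.05 | NA | NA |
| Marathi | 15.68 | NA | NA |
| Oriya | 22.74 | NA | NA |
| Bengali | 17.48 | NA | NA |
| Dutch | 11.90 | NA | NA |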

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.

English STT — ESB Dataset (Streaming)

ESB is a Hugging Face benchmark suite that aggregates 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Lower WER is better.

Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.

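| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras V3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LibriSpeech Clean | 2.46 | 1.65 | 2.16 | 2.48 | 3.20 | 3.61 | 3.09 | 1.97 |
| LibriSpeech Other | 5.31 | 2.86 | 4.88 | 5.74 | 6.60 | 7.28 | 6.85 | 4.45 |
| Common Voice | 10.89 | 6.73 | 10.69 | 47.28 | 14.22 | 43.46 | 11.37 | 9.83 |
| VoxPopuli | 7.16 | 7.28 | 7.07 | 14.10 | 9.55 | 11.49 | 7.77 | 7.91 |
| TED-LIUM | 4.07 | 2.95 | 2.66 | 3.81 | 3.59 | 6.90 | 2.89 | 3.16 |
| GigaSpeech | 10.43 | 9.12 | 10.09 | 5.35 | 10.05 | 10.05 | 9.57 | 9.66 |
| SPGISpeech | 2.86 | 1.74 | 4.18 | 3.53 | 2.99 | 9.70 | 3.89 | 4.40 |
| Earnings22 | 12.25 | 11.52 | 12.21 | 8.54 | 15.79 | 27.02 | 11.97 | 12.20 |
| AMI | 10.58 | 14.60 | 13.19 | 8.46 | 17.04 | 19.19 | 13.08 | 12.23 |
| Aggregate | 7.33 | 6.49 | 7.46 | 11.03 | 9.23 | 15.41 | 7.83 | 7.31 |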

Hindi — multi-dataset (Streaming)

WER across seven Hindi speech datasets covering read speech (FLEURS), conversational speech (Kathbath, Common Voice), telephony / contact-center audio (MUCS, Gramvaani), TTS-derived audio (Indic-TTS), and a noise-augmented variant (Kathbath noisy). Pulse is compared against the strongest Hindi STT baselines: IndicWhisper, Sarvam Saaras v3, Scribe v2, and Deepgram Nova-3.

Evaluated on the open-source datasets. Smallest Pulse numbers from internal evaluation. Lower is better.

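| Dataset | Smallest Pulse | IndicWhisper | Sarvam Saaras v3 | Scribe v2 | Deepgram Nova-3 |
| --- | --- | --- | --- | --- | --- |
| FLEURS | 9.03 | 15.00 | 7.31 | 8.96 | 14.09 |
| Kathbath | 7.63 | 10.30 | 8.15 | 8.67 | 16.22 |
| Kathbath (noisy) | 8.52 | 12.00 | 8.85 | 10.11 | 17.06 |
| Common Voice | 8.58 | 11.40 | 10.36 | 13.61 | 23.55 |
| Indic-TTS | 6.38 | 7.60 | 6.29 | 8.75 | 10.72 |
| MUCS | 3.61 | 12.00 | 8.15 | 8.15 | 16.20 |
| Gramvaani | 19.77 | 26.80 | 20.80 | 24.09 | 31.44 |
| Average | 8.78 | 13.59 | 10.69 | 11.76 | 18.47 |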

East Asian languages — Multi-dataset (Streaming)

WER for the four East Asian languages on the streaming endpoint (US region). Three datasets per language covering read speech (FLEURS), conversational/crowdsourced speech (Common Voice 25), and language-specific corpora (JSUT, Zeroth-Korean, MDCC, AISHELL-1). Compared head-to-head against Deepgram Nova-3. Lower WER is better.

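| Lang | Dataset | Smallest Pulse | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Japanese | CV-25 | 23.84% | 34.81% |
| Japanese | FLEURS | 10.78% | 17.11% |
| Japanese | JSUT BASIC5000 | 11.47% | 11.65% |
| Korean | CV-25 | 9.79% | 9.66% |
| Korean | FLEURS | 7.95% | 10.79% |
| Korean | Zeroth-Korean | 5.25% | 6.46% |
| Cantonese | CV-25 | 6.16% | 14.09% |
| Cantonese | FLEURS | 13.06% | 15.43% |
| Cantonese | MDCC | 5.85% | 12.77% |
| Mandarin | CV-25 | 15.99% | 22.44% |
| Mandarin | FLEURS | 14.25% | 13.89% |
| Mandarin | AISHELL-1 | 7.34% | 8.69% |
| Average | | 10.91% | 15.50% |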

Pulse averages 10.91% WER vs Deepgram Nova-3’s 15.50% across the four East Asian languages — Pulse leads on 10 of 12 dataset rows, with the largest gains on Japanese CV-25 and Cantonese CV-25.
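The row-level comparison can be checked directly from the table. The sketch below counts the rows where Pulse has the lower WER and ranks rows by absolute gap. The tuple layout and function name are illustrative only.

```python
# (language, dataset, Pulse WER %, Nova-3 WER %), copied from the table above.
ROWS = [
    ("Japanese", "CV-25", 23.84, 34.81),
    ("Japanese", "FLEURS", 10.78, 17.11),
    ("Japanese", "JSUT BASIC5000", 11.47, 11.65),
    ("Korean", "CV-25", 9.79, 9.66),
    ("Korean", "FLEURS", 7.95, 10.79),
    ("Korean", "Zeroth-Korean", 5.25, 6.46),
    ("Cantonese", "CV-25", 6.16, 14.09),
    ("Cantonese", "FLEURS", 13.06, 15.43),
    ("Cantonese", "MDCC", 5.85, 12.77),
    ("Mandarin", "CV-25", 15.99, 22.44),
    ("Mandarin", "FLEURS", 14.25, 13.89),
    ("Mandarin", "AISHELL-1", 7.34, 8.69),
]

# Rows where Pulse has the lower (better) WER.
pulse_wins = [row for row in ROWS if row[2] < row[3]]


def largest_gaps(n: int) -> list[tuple[str, str]]:
    """The n rows with the biggest absolute WER advantage for Pulse."""
    ranked = sorted(ROWS, key=lambda row: row[3] - row[2], reverse=True)
    return [(lang, dataset) for lang, dataset, _, _ in ranked[:n]]
```

The two rows where Nova-3 is ahead are Korean CV-25 and Mandarin FLEURS, both by less than half a percentage point.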

These four languages stream from wss://api.us.smallest.ai/waves/v1/stt/live?model=pulse only (US region). See the Pulse model card for the region-routing details.

ASR Robustness — WildASR Dataset (Streaming)

WildASR is an open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Lower WER is better. n/a = not supported by that provider.

Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.

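| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Sarvam Saaras V3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Clean | 5.98 | 3.33 | 7.01 | 11.11 | 11.62 | 7.02 | 4.24 |
| Clipping | 14.03 | 6.59 | 42.10 | 4.35 | 47.35 | 28.74 | 11.20 |
| Far-field | 13.38 | 26.07 | 38.76 | n/a | 62.99 | 21.27 | 7.38 |
| Noise Gap | 8.90 | 4.04 | 9.77 | n/a | 15.04 | 9.74 | 6.30 |
| Phone Codec | 7.19 | 3.45 | 8.70 | n/a | 9.13 | 10.64 | 4.98 |
| Reverberation | 9.06 | 23.50 | 14.83 | n/a | 27.27 | 4.35 | 6.48 |
| Accent | 5.82 | 2.80 | 4.45 | n/a | 7.31 | n/a | 4.01 |
| Aggregate | 9.63 | 12.52 | 18.35 | 8.82 | 28.17 | 17.75 | 6.47 |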

Internal English Perturbation Benchmark

Not a public dataset. The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.

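| Category | Pulse | Assembly | AWS | Deepgram | Scribe |
| --- | --- | --- | --- | --- | --- |
| Noise | 10.53 | 11.93 | 14.19 | 14.58 | 10.05 |
| Silence | 5.81 | 4.22 | 8.22 | 13.28 | 10.61 |
| Telephony 911 | 21.03 | 23.93 | 27.88 | 28.43 | 20.29 |
| Boundary | 2.83 | 3.09 | 3.18 | 3.66 | 1.73 |
| Disfluency | 7.68 | 7.81 | 9.23 | 8.62 | 9.29 |
| Long Audios | 12.81 | 8.58 | 11.66 | 11.16 | 9.25 |
| Repetition | 11.38 | 9.82 | 10.39 | 9.57 | 10.81 |
| Entity | 12.43 | 10.13 | 13.35 | 11.69 | 9.48 |
| Accent | 8.68 | 7.89 | 9.51 | 10.42 | 7.25 |
| Emotion | 13.92 | 16.34 | 18.57 | 18.07 | 11.84 |
| Speaker Diversity | 7.33 | 6.72 | 8.81 | 9.48 | 5.95 |
| Speed | 4.32 | 3.63 | 4.40 | 6.88 | 3.74 |
| Pitch | 2.93 | 3.07 | 3.21 | 4.07 | 1.61 |
| Volume | 2.37 | 3.05 | 2.41 | 3.67 | 1.47 |
| Audio Quality | 2.73 | 2.86 | 3.03 | 4.08 | 1.60 |
| Average WER | 8.45 | 8.20 | 9.87 | 10.51 | 7.66 |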

Internal Hindi Perturbation Benchmark

Not a public dataset. Hindi audio is sliced by perturbation type to isolate model weaknesses. Pulse is compared against Sarvam Saaras v3 and Deepgram Nova-3. Most metrics are WER, where lower is better. For Entity EDR (↑), higher is better.

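| Category | Smallest Pulse | Sarvam Saaras v3 | Deepgram Nova-3 |
| --- | --- | --- | --- |
| Noise | 15.76% | 22.18% | 21.52% |
| Silence | 8.08% | 11.38% | 18.40% |
| Entity | 10.82% | 17.36% | 14.67% |
| Entity NE-WER | 13.32% | 26.72% | 26.58% |
| Entity EDR (↑) | 83.13% | 76.13% | 67.80% |
| Boundary | 11.67% | 17.52% | 17.36% |
| Long Audios | 17.93% | 18.42% | 19.21% |
| Speed | 16.16% | 21.39% | 38.21% |
| Pitch | 11.43% | 11.92% | 19.59% |
| Audio Quality | 10.86% | 11.75% | 19.51% |
| Volume | 9.31% | 15.25% | 16.76% |
| Disfluency | 11.51% | 12.06% | 18.44% |
| Repetition | 11.38% | 11.27% | 20.40% |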

Optimization Tips

  • Use 16 kHz sample rate for an optimal balance of quality and latency.
  • Choose linear16 encoding for the lowest latency.
  • Enable only the features your use case requires; each optional feature adds work.
  • Batch process when latency is not critical.
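As an illustration of the first two tips, the sketch below packs float samples into linear16, taken here to mean 16-bit signed little-endian PCM, and shows the resulting data rate at 16 kHz mono. It assumes the audio is already sampled at 16 kHz and normalized to [-1.0, 1.0]. Resampling and the request format itself are out of scope.

```python
import struct

SAMPLE_RATE_HZ = 16_000   # recommended sample rate from the tips above
BYTES_PER_SAMPLE = 2      # linear16 = 16-bit signed PCM


def to_linear16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range input
        ints.append(int(round(clipped * 32767)))
    return struct.pack(f"<{len(ints)}h", *ints)


def bytes_per_second(channels: int = 1) -> int:
    """Raw data rate of the stream."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels
```

At 16 kHz mono this is 32,000 bytes per second of raw audio. Because no codec is involved, nothing has to be decoded before inference.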
