Performance | Smallest AI Docs

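Pulse STT is evaluated on three open-source benchmarks (FLEURS, ESB, and WildASR), a set of open-source Hindi datasets, and one internal English perturbation suite. Accuracy is reported as Word Error Rate (WER), broken down by language, dataset, or perturbation category. Lower is better. NA means the result is not available or the language is not supported by that provider.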
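For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition. It assumes plain whitespace tokenization; the text normalization used for the numbers on this page (casing, punctuation, number formatting) is not documented here, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 4.5% therefore means roughly 4 to 5 word errors per 100 reference words. Because insertions count as errors, WER can exceed 100% on badly degraded audio.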

Latency

Time-to-First-Transcript (TTFT)

TTFT measures the latency between the moment a user stops speaking and the moment the model returns the complete transcript. Lower TTFT means faster responses and a better user experience in real-time applications.

Model	Latency (ms)
Smallest Pulse STT	64
Deepgram Nova 2	76
Deepgram Nova 3	71
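To reproduce a TTFT-style measurement on your own audio, you can record two timestamps per utterance (end of speech and transcript arrival) and summarize the differences. The sketch below assumes you capture those timestamps yourself, for example with `time.perf_counter()`. The helper names are hypothetical, and this page does not state whether the table reports a mean, median, or percentile.

```python
from statistics import median


def ttft_ms(speech_end_s: float, transcript_received_s: float) -> float:
    """Latency in milliseconds between end of speech and transcript arrival."""
    if transcript_received_s < speech_end_s:
        raise ValueError("transcript cannot arrive before speech ends")
    return (transcript_received_s - speech_end_s) * 1000.0


def summarize(samples):
    """samples: iterable of (speech_end_s, transcript_received_s) pairs."""
    latencies = sorted(ttft_ms(end, received) for end, received in samples)
    if not latencies:
        raise ValueError("at least one sample is required")
    return {"median_ms": median(latencies), "max_ms": latencies[-1]}
```

Reporting the maximum (or a high percentile) next to the median helps expose tail latency, which often matters more than the average in real-time applications.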

Pre-recorded — FLEURS

FLEURS is Google’s multilingual speech dataset covering 102 languages, built on the FLoRes-101 translation benchmark. It contains roughly 12 hours of read speech per language and is a standard benchmark for evaluating multilingual ASR, including low-resource languages.

Evaluated on the FLEURS dataset (non-streaming / batch mode).

Language	Smallest Pulse	Deepgram Nova 2	Deepgram Nova 3
Italian	3.0%	10.7%	6.2%
English	4.5%	7.9%	6.7%
Spanish	3.2%	8.6%	4.1%
Portuguese	5.0%	9.9%	7.5%
German	6.4%	8.2%	8.5%
French	7.1%	13.3%	10.7%
Russian	9.6%	7.9%	11.8%
Ukrainian	7.5%	12.4%	NA
Polish	10.3%	12.2%	NA
Hindi	6.3%	23.5%	23.6%
Kannada	9.8%	NA	NA
Malayalam	10.0%	NA	NA
Gujarati	12.3%	NA	NA
Marathi	11.5%	NA	NA
Czech	12.4%	22.9%	19.2%
Oriya	14.8%	NA	NA
Bengali	16.4%	NA	NA
Slovak	13.5%	31.2%	NA
Dutch	15.0%	16.3%	12.5%
Swedish	18.7%	17.7%	14.3%
Telugu	14.3%	NA	NA
Finnish	18.3%	14.1%	13.2%
Latvian	16.5%	48.7%	NA
Romanian	17.8%	36.0%	NA
Punjabi	18.3%	NA	NA
Estonian	17.8%	49.0%	NA
Bulgarian	24.1%	32.7%	NA
Danish	19.8%	21.1%	16.1%
Tamil	21.6%	NA	NA
Hungarian	22.5%	31.8%	28.6%
Maltese	25.5%	NA	NA
Lithuanian	25.1%	44.9%	NA

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.
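If you copy rows from the table above into your own analysis, the percent signs and NA cells need handling before any comparison. The following sketch parses three rows taken from the table and picks the provider with the lowest available WER. The function names are illustrative and the rows are only an excerpt.

```python
from typing import Optional

PROVIDERS = ["Smallest Pulse", "Deepgram Nova 2", "Deepgram Nova 3"]

# Three tab-separated rows copied from the pre-recorded FLEURS table.
ROWS = """Italian\t3.0%\t10.7%\t6.2%
Ukrainian\t7.5%\t12.4%\tNA
Kannada\t9.8%\tNA\tNA"""


def parse_wer(cell: str) -> Optional[float]:
    """Convert a cell such as '10.7%' or 'NA' to a percentage, or None for NA."""
    cell = cell.strip()
    if cell.upper() == "NA":
        return None
    return float(cell.rstrip("%"))


def best_provider(line: str):
    """Return (language, provider with the lowest non-NA WER) for one row."""
    language, *cells = line.split("\t")
    scores = {name: parse_wer(cell) for name, cell in zip(PROVIDERS, cells)}
    # NA cells are excluded rather than treated as zero or as a failure.
    available = {name: wer for name, wer in scores.items() if wer is not None}
    return language, min(available, key=available.get)
```

Excluding NA cells keeps an unsupported language from being counted as either a win or a loss for that provider.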

Streaming — FLEURS

Evaluated on the FLEURS dataset (streaming mode). Values are WER in percent.

Language	Smallest Pulse	Deepgram Nova 2	Deepgram Nova 3
Italian	4.41	11.05	6.99
English	4.5	15.59	11.21
Spanish	5.99	10.67	7.52
Portuguese	8.32	14.15	11.46
German	9.5	11.1	10.15
French	10.71	14.3	12.07
Russian	14.35	NA	NA
Hindi	8.3	20.0	15.46
Kannada	16.97	NA	NA
Malayalam	15.91	NA	NA
Gujarati	20.05	NA	NA
Marathi	15.68	NA	NA
Oriya	22.74	NA	NA
Bengali	17.48	NA	NA
Dutch	11.90	NA	NA
Telugu	24.79	NA	NA
Tamil	20.15	NA	NA

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.

English STT — ESB Dataset (Streaming)

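ESB is a Hugging Face benchmark suite aggregating 8 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. LibriSpeech contributes two test sets (Clean and Other), so the table below has nine dataset rows plus an overall score.

Evaluated on the open-source Hugging Face ESB datasets. Smallest Pulse numbers are from internal evaluation. Values are WER in percent.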

Dataset	Smallest Pulse	Deepgram Nova 2	Deepgram Nova 3
LibriSpeech Clean	1.80	4.35	3.71
LibriSpeech Other	3.94	9.36	7.72
Common Voice	9.20	17.79	14.59
VoxPopuli	3.17	9.95	9.38
TED-LIUM	2.36	4.35	3.57
GigaSpeech	4.74	11.63	10.05
SPGISpeech	2.67	5.26	3.28
Earnings22	8.73	18.98	15.34
AMI	11.93	19.86	16.06
Overall	5.39	11.28	9.30
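The Overall row above is numerically consistent with an unweighted mean of the nine dataset rows. This is an observation from the published numbers, not a documented aggregation method. A short check, with the values copied from the table:

```python
from statistics import mean

# Per-dataset WER (%) in table order: LibriSpeech Clean ... AMI.
ESB_WER = {
    "Smallest Pulse": [1.80, 3.94, 9.20, 3.17, 2.36, 4.74, 2.67, 8.73, 11.93],
    "Deepgram Nova 2": [4.35, 9.36, 17.79, 9.95, 4.35, 11.63, 5.26, 18.98, 19.86],
    "Deepgram Nova 3": [3.71, 7.72, 14.59, 9.38, 3.57, 10.05, 3.28, 15.34, 16.06],
}


def overall(scores):
    """Unweighted mean of per-dataset WER, rounded to two decimals."""
    return round(mean(scores), 2)


OVERALL = {model: overall(scores) for model, scores in ESB_WER.items()}
```

An unweighted mean gives every dataset equal influence regardless of its size, so a small, hard test set such as AMI moves the overall score as much as a large, easy one.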

Hindi — multi-dataset (Streaming)

The table reports WER across seven Hindi speech datasets covering read speech (FLEURS), conversational speech (Kathbath, Common Voice), telephony / contact-center audio (MUCS, Gramvaani), TTS-derived audio (Indic-TTS), and a noise-augmented variant (Kathbath noisy). Results are compared against the strongest Hindi STT baselines: IndicWhisper, Sarvam Saaras v3, and Deepgram Nova-3.

Evaluated on the open-source datasets. Smallest Pulse numbers are from internal evaluation. Values are WER in percent; lower is better.

Dataset	Smallest Pulse	IndicWhisper	Sarvam Saaras v3	Deepgram Nova-3
FLEURS	9.55	15.00	8.31	14.09
Kathbath	9.71	10.30	8.15	16.22
Kathbath (noisy)	10.94	12.00	10.81	17.06
Common Voice	11.20	11.40	11.36	23.55
Indic-TTS	6.39	7.60	6.49	10.72
MUCS	9.19	12.00	8.96	16.20
Gramvaani	21.43	26.80	21.80	31.44

ASR Robustness — WildASR Dataset (Streaming)

WildASR is an open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech.

Evaluated on the open-source WildASR dataset. Smallest Pulse numbers are from internal evaluation. Values are WER in percent.

Dataset	Smallest Pulse	Deepgram Nova 2	Deepgram Nova 3
Clean	4.41	15.28	10.76
Clipping	12.93	70.41	43.15
Far Field	12.09	74.52	58.72
Noise Gap	9.03	21.91	14.19
Phone Codec	5.71	12.22	9.27
Reverberation	7.91	40.71	27.21
Accent	5.35	9.17	7.23
Overall	8.76	34.89	24.36
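One way to read a robustness table is to express each degraded condition as a multiple of the same model's Clean WER, which separates "how much worse does it get" from the model's baseline accuracy. The sketch below applies that ratio to a few rows copied from the table. The ratio is an illustrative analysis choice, not a metric defined by WildASR or by this page.

```python
# WER (%) for selected conditions, copied from the table above.
WILDASR_WER = {
    "Smallest Pulse": {"Clean": 4.41, "Clipping": 12.93, "Far Field": 12.09, "Reverberation": 7.91},
    "Deepgram Nova 2": {"Clean": 15.28, "Clipping": 70.41, "Far Field": 74.52, "Reverberation": 40.71},
    "Deepgram Nova 3": {"Clean": 10.76, "Clipping": 43.15, "Far Field": 58.72, "Reverberation": 27.21},
}


def degradation(scores: dict, condition: str) -> float:
    """Ratio of WER under a degraded condition to the same model's Clean WER."""
    return scores[condition] / scores["Clean"]


def degradation_table(condition: str) -> dict:
    """Degradation ratio for every model under one condition, to two decimals."""
    return {
        model: round(degradation(scores, condition), 2)
        for model, scores in WILDASR_WER.items()
    }
```

A ratio near 1.0 means the condition barely affects the model; larger ratios indicate a sharper loss of accuracy relative to clean audio.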

Internal English Perturbation Benchmark

This suite is not a public dataset. The English audio is sliced by perturbation type (Emotion, Entity, Disfluency, Noise, Accent, Silence, Speaker Diversity, Speed, Boundary, Pitch, Audio Quality, Volume) to isolate model weaknesses. Values are WER.

Category	Pulse English (Streaming)	Deepgram Nova 3 (en)
Emotion	15.43%	19.42%
Entity	12.14%	11.80%
Disfluency	11.91%	8.64%
Noise	11.57%	14.61%
Accent	9.13%	10.43%
Silence	8.99%	13.17%
Speaker Diversity	7.77%	9.91%
Speed	3.54%	6.85%
Boundary	3.02%	6.30%
Pitch	2.60%	4.04%
Audio Quality	2.45%	4.05%
Volume	2.11%	3.59%
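Absolute WER gaps are hard to compare across categories with very different baselines, so it can help to express each gap as a relative change against the comparison model. The sketch below computes that for a few rows copied from the table. The function name and the choice of Deepgram Nova 3 as the baseline are illustrative.

```python
# (Pulse English streaming WER %, Deepgram Nova 3 WER %), copied from the table.
PERTURBATION_WER = {
    "Emotion": (15.43, 19.42),
    "Entity": (12.14, 11.80),
    "Disfluency": (11.91, 8.64),
    "Speed": (3.54, 6.85),
}


def relative_reduction(model_wer: float, baseline_wer: float) -> float:
    """Percent WER reduction versus the baseline; negative means the model is worse."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - model_wer) / baseline_wer * 100.0


REDUCTIONS = {
    category: round(relative_reduction(pulse, nova3), 1)
    for category, (pulse, nova3) in PERTURBATION_WER.items()
}
```

In this excerpt, Emotion and Speed come out positive (Pulse has the lower WER), while Entity and Disfluency come out negative, matching the two categories in the table where Deepgram Nova 3 scores lower.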

Optimization Tips

Use 16 kHz sample rate for an optimal balance of quality and latency.
Choose linear16 encoding for the lowest latency.
Enable only the features your use case requires; each optional feature adds work.
Batch process when latency is not critical.
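Before sending audio, it can be worth confirming that a file already matches the first two tips. The sketch below uses only Python's standard `wave` module and assumes that "linear16" refers to 16-bit linear PCM. It does not call any Smallest AI API, and the helper names are hypothetical.

```python
import io
import wave


def is_16k_linear16(wav_bytes: bytes) -> bool:
    """True if the WAV data is 16 kHz with 16-bit (2-byte) PCM samples."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate() == 16000 and wav.getsampwidth() == 2


def make_silence(sample_rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit WAV of silence, used here only for testing."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()
```

Files that fail the check would need resampling or re-encoding with an audio tool of your choice before upload; that step is outside the scope of this sketch.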