Performance

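Pulse STT is evaluated on three open-source benchmarks (FLEURS, ESB, and WildASR), a set of open-source Hindi datasets, and one internal English perturbation suite. Accuracy is reported as Word Error Rate (WER), in percent, broken down by language, dataset, or condition. Lower is better. NA means the result is not available or the language is not supported by that provider.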
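
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation harness used for the numbers on this page. The function name `word_error_rate` is hypothetical, and text normalization (casing, punctuation), which this page does not specify, is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer * 100:.2f}%")
```

Multiplying the result by 100 gives a percentage comparable in form to the tables below.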

Latency

Time-to-First-Transcript (TTFT)

TTFT measures the latency between when a user stops speaking and when the model returns the complete transcript. Lower TTFT means faster response times and better user experience in real-time applications.
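
As a rough illustration of the definition above, TTFT can be computed per utterance from two timestamps and then aggregated. This is a sketch under assumptions: the helper `ttft_ms` and the sample timestamps are hypothetical, and this page does not say which aggregate (mean, median, or a percentile) the reported figures use, so the median here is only one reasonable choice.

```python
from statistics import median


def ttft_ms(speech_end_s: float, transcript_received_s: float) -> float:
    """Latency in milliseconds between end of speech and transcript arrival."""
    if transcript_received_s < speech_end_s:
        raise ValueError("transcript cannot arrive before speech ends")
    return (transcript_received_s - speech_end_s) * 1000.0


# Hypothetical (end-of-speech, transcript-received) timestamps, in seconds.
utterances = [(1.250, 1.330), (4.000, 4.065), (7.500, 7.590)]

latencies = [ttft_ms(end, received) for end, received in utterances]
print(f"median TTFT: {median(latencies):.0f} ms")
```

Both timestamps should come from the same monotonic clock on the client so that clock drift does not leak into the measurement.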

| Model | Latency (ms) |
| --- | --- |
| Smallest Pulse STT | 64 |
| Deepgram Nova 2 | 76 |
| Deepgram Nova 3 | 71 |

Pre-recorded — FLEURS

Google’s multilingual speech dataset covering 102 languages, built on the FLoRes-101 translation benchmark. Contains ~12 hours of read speech per language and is the standard benchmark for evaluating multilingual ASR, including low-resource languages.

Evaluated on the FLEURS dataset (non-streaming / batch mode).

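| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Italian | 3.0% | 10.7% | 6.2% |
| English | 4.5% | 7.9% | 6.7% |
| Spanish | 3.2% | 8.6% | 4.1% |
| Portuguese | 5.0% | 9.9% | 7.5% |
| German | 6.4% | 8.2% | 8.5% |
| French | 7.1% | 13.3% | 10.7% |
| Russian | 9.6% | 7.9% | 11.8% |
| Ukrainian | 7.5% | 12.4% | NA |
| Polish | 10.3% | 12.2% | NA |
| Hindi | 6.3% | 23.5% | 23.6% |
| Kannada | 9.8% | NA | NA |
| Malayalam | 10.0% | NA | NA |
| Gujarati | 12.3% | NA | NA |
| Marathi | 11.5% | NA | NA |
| Czech | 12.4% | 22.9% | 19.2% |
| Oriya | 14.8% | NA | NA |
| Bengali | 16.4% | NA | NA |
| Slovak | 13.5% | 31.2% | NA |
| Dutch | 15.0% | 16.3% | 12.5% |
| Swedish | 18.7% | 17.7% | 14.3% |
| Telugu | 14.3% | NA | NA |
| Finnish | 18.3% | 14.1% | 13.2% |
| Latvian | 16.5% | 48.7% | NA |
| Romanian | 17.8% | 36.0% | NA |
| Punjabi | 18.3% | NA | NA |
| Estonian | 17.8% | 49.0% | NA |
| Bulgarian | 24.1% | 32.7% | NA |
| Danish | 19.8% | 21.1% | 16.1% |
| Tamil | 21.6% | NA | NA |
| Hungarian | 22.5% | 31.8% | 28.6% |
| Maltese | 25.5% | NA | NA |
| Lithuanian | 25.1% | 44.9% | NA |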

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.

Streaming — FLEURS

Evaluated on the FLEURS dataset (streaming mode).

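| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Italian | 4.41 | 11.05 | 6.99 |
| English | 4.5 | 15.59 | 11.21 |
| Spanish | 5.99 | 10.67 | 7.52 |
| Portuguese | 8.32 | 14.15 | 11.46 |
| German | 9.5 | 11.1 | 10.15 |
| French | 10.71 | 14.3 | 12.07 |
| Russian | 14.35 | NA | NA |
| Hindi | 8.3 | 20.0 | 15.46 |
| Kannada | 16.97 | NA | NA |
| Malayalam | 15.91 | NA | NA |
| Gujarati | 20.05 | NA | NA |
| Marathi | 15.68 | NA | NA |
| Oriya | 22.74 | NA | NA |
| Bengali | 17.48 | NA | NA |
| Dutch | 11.90 | NA | NA |
| Telugu | 24.79 | NA | NA |
| Tamil | 20.15 | NA | NA |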

Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.

English STT — ESB Dataset (Streaming)

A Hugging Face benchmark suite aggregating 8 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization.

Evaluated on the open-source Hugging Face ESB datasets. Smallest Pulse numbers from internal evaluation.

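| Dataset | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| LibriSpeech Clean | 1.80 | 4.35 | 3.71 |
| LibriSpeech Other | 3.94 | 9.36 | 7.72 |
| Common Voice | 9.20 | 17.79 | 14.59 |
| VoxPopuli | 3.17 | 9.95 | 9.38 |
| TED-LIUM | 2.36 | 4.35 | 3.57 |
| GigaSpeech | 4.74 | 11.63 | 10.05 |
| SPGISpeech | 2.67 | 5.26 | 3.28 |
| Earnings22 | 8.73 | 18.98 | 15.34 |
| AMI | 11.93 | 19.86 | 16.06 |
| Overall | 5.39 | 11.28 | 9.30 |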
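
The Overall row is consistent with an unweighted mean of the nine per-dataset rows above it. The sketch below reproduces that arithmetic from the table values. The aggregation method is inferred from the numbers, not stated on this page, and the variable names are illustrative.

```python
# Per-dataset WER (%) copied from the ESB table:
# (Smallest Pulse, Deepgram Nova 2, Deepgram Nova 3)
esb_wer = {
    "LibriSpeech Clean": (1.80, 4.35, 3.71),
    "LibriSpeech Other": (3.94, 9.36, 7.72),
    "Common Voice": (9.20, 17.79, 14.59),
    "VoxPopuli": (3.17, 9.95, 9.38),
    "TED-LIUM": (2.36, 4.35, 3.57),
    "GigaSpeech": (4.74, 11.63, 10.05),
    "SPGISpeech": (2.67, 5.26, 3.28),
    "Earnings22": (8.73, 18.98, 15.34),
    "AMI": (11.93, 19.86, 16.06),
}
models = ("Smallest Pulse", "Deepgram Nova 2", "Deepgram Nova 3")

# Unweighted (macro) average: every row counts equally, regardless of its size.
overall = {
    model: round(sum(row[i] for row in esb_wer.values()) / len(esb_wer), 2)
    for i, model in enumerate(models)
}

for model, wer in overall.items():
    print(f"{model}: {wer:.2f}")
```

A macro average gives small test sets the same weight as large ones, so it can differ from a WER computed over all utterances pooled together.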

Hindi — multi-dataset (Streaming)

WER across seven Hindi speech datasets covering read speech (FLEURS), conversational speech (Kathbath, Common Voice), telephony and contact-center audio (MUCS, Gramvaani), TTS-derived audio (Indic-TTS), and a noise-augmented variant (Kathbath noisy). Pulse is compared against three strong Hindi STT baselines: IndicWhisper, Sarvam Saaras v3, and Deepgram Nova-3.

Evaluated on the open-source datasets. Smallest Pulse numbers from internal evaluation. Lower is better.

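| Dataset | Smallest Pulse | IndicWhisper | Sarvam Saaras v3 | Deepgram Nova-3 |
| --- | --- | --- | --- | --- |
| FLEURS | 9.55 | 15.00 | 8.31 | 14.09 |
| Kathbath | 9.71 | 10.30 | 8.15 | 16.22 |
| Kathbath (noisy) | 10.94 | 12.00 | 10.81 | 17.06 |
| Common Voice | 11.20 | 11.40 | 11.36 | 23.55 |
| Indic-TTS | 6.39 | 7.60 | 6.49 | 10.72 |
| MUCS | 9.19 | 12.00 | 8.96 | 16.20 |
| Gramvaani | 21.43 | 26.80 | 21.80 | 31.44 |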

ASR Robustness — WildASR Dataset (Streaming)

An open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech.

Evaluated on the open-source WildASR dataset. Smallest Pulse numbers from internal evaluation.

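| Dataset | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Clean | 4.41 | 15.28 | 10.76 |
| Clipping | 12.93 | 70.41 | 43.15 |
| Far Field | 12.09 | 74.52 | 58.72 |
| Noise Gap | 9.03 | 21.91 | 14.19 |
| Phone Codec | 5.71 | 12.22 | 9.27 |
| Reverberation | 7.91 | 40.71 | 27.21 |
| Accent | 5.35 | 9.17 | 7.23 |
| Overall | 8.76 | 34.89 | 24.36 |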

Internal English Perturbation Benchmark

This benchmark is internal and not publicly available. The English audio is grouped by perturbation type (Emotion, Entity, Disfluency, Noise, Accent, Silence, Speaker Diversity, Speed, Boundary, Pitch, Audio Quality, Volume) so that each category isolates a specific model weakness.

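| Category | Pulse English (Streaming) | Deepgram Nova 3 (en) |
| --- | --- | --- |
| Emotion | 15.43% | 19.42% |
| Entity | 12.14% | 11.80% |
| Disfluency | 11.91% | 8.64% |
| Noise | 11.57% | 14.61% |
| Accent | 9.13% | 10.43% |
| Silence | 8.99% | 13.17% |
| Speaker Diversity | 7.77% | 9.91% |
| Speed | 3.54% | 6.85% |
| Boundary | 3.02% | 6.30% |
| Pitch | 2.60% | 4.04% |
| Audio Quality | 2.45% | 4.05% |
| Volume | 2.11% | 3.59% |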

Optimization Tips

  • Use 16 kHz sample rate for an optimal balance of quality and latency.
  • Choose linear16 encoding for the lowest latency.
  • Enable only the features your use case requires; each optional feature adds work.
  • Batch process when latency is not critical.
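
To make the first two tips concrete, the sketch below converts floating-point samples to linear16 bytes and computes how many bytes a streaming chunk occupies at 16 kHz. It assumes that linear16 means 16-bit signed little-endian PCM, as the name usually denotes, and that the audio is already sampled at 16 kHz (resampling is out of scope here). The helper names and the 20 ms chunk duration are illustrative, not values recommended by this page.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # sample rate suggested in the tips above
BYTES_PER_SAMPLE = 2     # assumption: linear16 = 16-bit signed PCM


def float_to_linear16(samples):
    """Encode floats in [-1.0, 1.0] as 16-bit signed little-endian PCM bytes."""
    # Clamp first so out-of-range samples saturate instead of overflowing.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_size_bytes(chunk_ms, sample_rate_hz=SAMPLE_RATE_HZ, channels=1):
    """Bytes needed to hold chunk_ms milliseconds of linear16 audio."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE * channels


pcm = float_to_linear16([0.0, 0.5, -0.5, 1.0])
print(len(pcm))              # 4 samples * 2 bytes each
print(chunk_size_bytes(20))  # 20 ms of mono audio at 16 kHz
```

At 16 kHz mono, each 20 ms chunk is 320 samples, or 640 bytes, which keeps per-message payloads small when streaming.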

Next Steps