Performance
Pulse STT is evaluated against three open-source datasets (FLEURS, ESB, and WildASR) and one internal English perturbation suite. Results are reported as Word Error Rate (WER) by language; lower is better. NA = not available or not supported by that provider.
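For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. A minimal sketch (not the evaluation harness used for these benchmarks):

```python
# WER = (substitutions + deletions + insertions) / reference word count,
# computed via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over 6 words
```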
Latency
Time-to-First-Transcript (TTFT)
TTFT measures the latency between when a user stops speaking and when the model returns its first transcript. Lower TTFT means faster perceived response and a better user experience in real-time applications.
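In practice, TTFT is measured by timestamping the end of speech and the first transcript event. The sketch below simulates this with a fake stream; `FakeStream` and its event shape are illustrative, not the Pulse STT SDK:

```python
import time

# FakeStream simulates a streaming STT connection with a fixed delay.
# A real integration would iterate events from the provider's SDK instead.
class FakeStream:
    def __init__(self, delay_s: float):
        self.delay_s = delay_s

    def events(self):
        time.sleep(self.delay_s)  # simulated network + inference latency
        yield {"type": "transcript", "text": "hello world"}

def measure_ttft(stream) -> float:
    end_of_speech = time.monotonic()       # the user just stopped speaking
    for event in stream.events():
        if event["type"] == "transcript":  # first transcript event arrives
            return time.monotonic() - end_of_speech

ttft = measure_ttft(FakeStream(delay_s=0.05))
print(f"TTFT: {ttft * 1000:.0f} ms")
```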
Pre-recorded — FLEURS
Google’s multilingual speech dataset covering 102 languages, built on the FLoRes-101 translation benchmark. Contains ~12 hours of read speech per language and is the standard benchmark for evaluating multilingual ASR, including low-resource languages.
Evaluated on the FLEURS dataset (non-streaming / batch mode).
Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.
Streaming — FLEURS
Evaluated on the FLEURS dataset (streaming mode).
Sources: Deepgram internal benchmarks; Smallest AI internal evaluation.
English STT — ESB Dataset (Streaming)
A Hugging Face benchmark suite aggregating 8 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization.
Evaluated on the open-source Hugging Face ESB datasets. Smallest Pulse numbers from internal evaluation.
Hindi — multi-dataset (Streaming)
WER across seven Hindi speech datasets covering read speech (FLEURS), conversational speech (Kathbath, Common Voice), telephony / contact-center audio (Mucs, Gramvaani), TTS-derived audio (Indic-TTS), and a noise-augmented variant (Kathbath noisy). Compared against the strongest Hindi STT baselines: IndicWhisper, Sarvam Saaras v3, and Deepgram Nova-3.
Evaluated on the open-source datasets. Smallest Pulse numbers from internal evaluation. Lower is better.
ASR Robustness — WildASR Dataset (Streaming)
An open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech.
Evaluated on the open-source WildASR dataset. Smallest Pulse numbers from internal evaluation.
Internal English Perturbation Benchmark
Not a public dataset. The English audio is sliced by perturbation type (Emotion, Entity, Disfluency, Noise, Accent, Silence, Speaker Diversity, Speed, Boundary, Pitch, Audio Quality, Volume) to isolate model weaknesses.
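Slicing results by perturbation type amounts to grouping per-utterance scores by their tag and averaging. A purely illustrative sketch with made-up records (the internal suite is not public):

```python
from collections import defaultdict

# Hypothetical per-utterance results; "perturbation" tags mirror the
# categories listed above (Noise, Accent, etc.). Values are invented.
records = [
    {"perturbation": "Noise", "wer": 0.12},
    {"perturbation": "Noise", "wer": 0.18},
    {"perturbation": "Accent", "wer": 0.09},
]

# Group scores by perturbation tag, then report the mean WER per slice.
by_type = defaultdict(list)
for r in records:
    by_type[r["perturbation"]].append(r["wer"])

for perturbation, wers in sorted(by_type.items()):
    print(f"{perturbation}: mean WER {sum(wers) / len(wers):.3f}")
```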
Optimization Tips
- Use 16 kHz sample rate for an optimal balance of quality and latency.
- Choose `linear16` encoding for the lowest latency.
- Enable only the features your use case requires; each optional feature adds work.
- Batch process when latency is not critical.
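Putting the tips above together, a low-latency request might be configured as follows. This is a sketch: the parameter names and values shown are assumptions, not the documented Pulse STT API.

```python
# Hypothetical low-latency streaming configuration per the tips above.
config = {
    "sample_rate": 16000,     # 16 kHz balances quality and latency
    "encoding": "linear16",   # raw PCM avoids transcode overhead
    "diarization": False,     # leave optional features off unless needed
    "word_timestamps": False,
}
print(config)
```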

