Pulse

Pulse is a high-accuracy, low-latency speech-to-text model built for real-time transcription across 38 languages, with streaming and non-streaming support.

  • 64ms TTFT at 1 concurrency
  • 300ms TTFT at 100 concurrency
  • 38 languages (streaming + non-streaming)
  • 2 modes (streaming + non-streaming)

Model Overview

Developed by: Smallest AI
Model type: Speech-to-Text
Languages: 38 supported (plus multi, multi-eu, multi-indic, multi-asian aggregators)
License: Proprietary
Model format (non-streaming): pulse_offline_<lang>_<version>.smlst
Model format (streaming): pulse_streaming_<lang>_<version>.smlst
Documentation: docs.smallest.ai/waves
Console: app.smallest.ai/dashboard
Support: support@smallest.ai
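
The two model-format patterns above can be split into their parts with a small parser. This is a minimal sketch, not an official tool: the function name `parse_model_filename` is hypothetical, the example file name (including the `v1` version string) is made up for illustration, and the assumption that the language segment contains only lowercase letters and hyphens is not confirmed by the source.

```python
import re

# Pattern derived from the documented formats:
#   pulse_offline_<lang>_<version>.smlst
#   pulse_streaming_<lang>_<version>.smlst
# Assumption: <lang> uses only lowercase letters and hyphens.
MODEL_FILE = re.compile(
    r"^pulse_(?P<mode>offline|streaming)_(?P<lang>[a-z-]+)_(?P<version>.+)\.smlst$"
)


def parse_model_filename(name: str) -> dict:
    """Return the mode, language, and version encoded in a model file name."""
    match = MODEL_FILE.match(name)
    if match is None:
        raise ValueError(f"not a Pulse model file name: {name!r}")
    return match.groupdict()


# Hypothetical file name; the real version format is not documented here.
print(parse_model_filename("pulse_streaming_en_v1.smlst"))
```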

Key Capabilities

Real-Time Optimized

Ultra-low latency architecture delivering 64ms TTFT at 1 concurrency and 300ms TTFT at 100 concurrent requests, designed for live transcription and conversational AI.

Multi-Language

38 languages supported across streaming and non-streaming modes, with automatic language detection and code-switching within a single session.

PII / PCI Redaction

Built-in redaction of personal and payment card data across both streaming and non-streaming use cases.

Speaker Diarization

Automatic multi-speaker identification across both streaming and non-streaming modes, with per-word and per-utterance speaker labels.

Noise Reduction

Background noise handling built into the model.

Code-Switching

Supports multi-language audio within a single session. Works best when the known primary language is set explicitly; for example, es (Spanish) handles English+Spanish audio automatically.


Performance & Benchmarks

Pulse STT is evaluated against three open-source datasets (FLEURS, ESB, and WildASR) and one internal English perturbation suite. Results are reported as Word Error Rate (WER) by language; lower is better. NA = not available or not supported by that provider.

For the full benchmark comparison across every dataset, see the Performance page.
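
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The sketch below shows that standard definition only. It applies no text normalization (casing, punctuation, number formatting), and the normalization used for the figures on this page is not described here, so it should not be expected to reproduce them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (fox -> box) and one insertion (jumps) over 4 reference words.
print(f"{word_error_rate('the quick brown fox', 'the quick brown box jumps'):.2%}")  # 50.00%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.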

FLEURS Streaming — English

WER on the English subset of FLEURS across providers in streaming mode. Lower is better.

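| Provider | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras v3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | 6.03% | 3.13% | 6.54% | 13.79% | 11.59% | 60.00% | 6.34% | 3.88% |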

A note on audio amplitude normalization

Audio amplitude normalization materially changes WER on FLEURS. Most competitors benchmark on raw FLEURS, which has variable and often low amplitude, without normalizing peak audio to −10 dBFS. This makes some models look much better than they actually are. Pulse is stable across all amplitude regimes, varying by at most 0.25 percentage points (5.81% to 6.06%).

| Model | Raw FLEURS | −10 dBFS | −20 dBFS | Stable across regimes? |
| --- | --- | --- | --- | --- |
| Pulse | 6.03% | 6.06% | 5.81% | Yes |
| Deepgram Nova 3 | 11.59% | 6.57% | 6.51% | Partial (1.8× degradation on raw) |
| Grok | 60.00% | 7.58% | 8.59% | Collapses on raw |
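
Peak normalization to a target level such as −10 dBFS is a simple gain change. The sketch below illustrates the general technique on float samples in the range [-1.0, 1.0]; it is not the evaluation pipeline used for this page, and the function names are hypothetical. The last line checks that the "1.8×" figure in the table is consistent with the ratio of Deepgram Nova 3's raw WER to its −10 dBFS WER.

```python
import math


def peak_dbfs(samples):
    """Peak level in dBFS for float samples in [-1.0, 1.0]; full scale is 0 dBFS."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        raise ValueError("silent audio has no defined peak level")
    return 20 * math.log10(peak)


def normalize_peak(samples, target_dbfs=-10.0):
    """Scale all samples by one gain so the loudest sample sits at target_dbfs."""
    gain_db = target_dbfs - peak_dbfs(samples)
    gain = 10 ** (gain_db / 20)
    return [s * gain for s in samples]


quiet = [0.0, 0.01, -0.02, 0.015]          # peak is about -34 dBFS
normalized = normalize_peak(quiet, target_dbfs=-10.0)
print(round(peak_dbfs(normalized), 2))     # -10.0

# Ratio of raw WER to -10 dBFS WER for Deepgram Nova 3, from the table above.
print(f"{11.59 / 6.57:.1f}x")              # 1.8x
```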

FLEURS — Streaming

European Languages
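| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Italian | 4.41% | 11.05% | 6.99% |
| English | 6.03% | 15.59% | 11.21% |
| Spanish | 5.99% | 10.67% | 7.52% |
| Portuguese | 8.32% | 14.15% | 11.46% |
| German | 9.5% | 11.1% | 10.15% |
| French | 10.71% | 14.3% | 12.07% |
| Russian | 14.35% | NA | NA |
| Dutch | 11.90% | NA | NA |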

FLEURS — Pre-recorded

European Languages
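| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| English | 4.55% | 7.9% | 6.7% |
| Italian | 3.0% | 10.7% | 6.2% |
| Spanish | 3.2% | 8.6% | 4.1% |
| Portuguese | 5.0% | 9.9% | 7.5% |
| German | 6.4% | 8.2% | 8.5% |
| French | 7.1% | 13.3% | 10.7% |
| Russian | 9.6% | 7.9% | 11.8% |
| Ukrainian | 7.5% | 12.4% | NA |
| Polish | 10.3% | 12.2% | NA |
| Dutch | 15.0% | 16.3% | 12.5% |
| Czech | 12.4% | 22.9% | 19.2% |
| Slovak | 13.5% | 31.2% | NA |
| Swedish | 18.7% | 17.7% | 14.3% |
| Finnish | 18.3% | 14.1% | 13.2% |
| Latvian | 16.5% | 48.7% | NA |
| Romanian | 17.8% | 36.0% | NA |
| Estonian | 17.8% | 49.0% | NA |
| Bulgarian | 24.1% | 32.7% | NA |
| Danish | 19.8% | 21.1% | 16.1% |
| Hungarian | 22.5% | 31.8% | 28.6% |
| Maltese | 25.5% | NA | NA |
| Lithuanian | 25.1% | 44.9% | NA |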

Hindi — multi-dataset (Streaming)

WER across seven Hindi datasets covering read speech, conversational speech, telephony / contact-center audio, and noise-augmented variants. Compared against IndicWhisper, Sarvam Saaras v3, and Deepgram Nova-3. Lower is better.

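| Dataset | Smallest Pulse | IndicWhisper | Sarvam Saaras v3 | Deepgram Nova-3 |
| --- | --- | --- | --- | --- |
| FLEURS | 9.55 | 15.00 | 8.31 | 14.09 |
| Kathbath | 9.71 | 10.30 | 8.15 | 16.22 |
| Kathbath (noisy) | 10.94 | 12.00 | 10.81 | 17.06 |
| Common Voice | 11.20 | 11.40 | 11.36 | 23.55 |
| Indic-TTS | 6.39 | 7.60 | 6.49 | 10.72 |
| MUCS | 9.19 | 12.00 | 8.96 | 16.20 |
| Gramvaani | 21.43 | 26.80 | 21.80 | 31.44 |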

For the full breakdown including training-data and evaluation-protocol notes, see the Performance page.

English STT — ESB Dataset (Streaming)

A Hugging Face benchmark suite aggregating 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Lower WER is better.

Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.

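| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras v3 | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LibriSpeech Clean | 2.46 | 1.65 | 2.16 | 2.48 | 3.20 | 3.61 | 3.09 | 1.97 |
| LibriSpeech Other | 5.31 | 2.86 | 4.88 | 5.74 | 6.60 | 7.28 | 6.85 | 4.45 |
| Common Voice | 10.89 | 6.73 | 10.69 | 47.28 | 14.22 | 43.46 | 11.37 | 9.83 |
| VoxPopuli | 7.16 | 7.28 | 7.07 | 14.10 | 9.55 | 11.49 | 7.77 | 7.91 |
| TED-LIUM | 4.07 | 2.95 | 2.66 | 3.81 | 3.59 | 6.90 | 2.89 | 3.16 |
| GigaSpeech | 10.43 | 9.12 | 10.09 | 5.35 | 10.05 | 10.05 | 9.57 | 9.66 |
| SPGISpeech | 2.86 | 1.74 | 4.18 | 3.53 | 2.99 | 9.70 | 3.89 | 4.40 |
| Earnings22 | 12.25 | 11.52 | 12.21 | 8.54 | 15.79 | 27.02 | 11.97 | 12.20 |
| AMI | 10.58 | 14.60 | 13.19 | 8.46 | 17.04 | 19.19 | 13.08 | 12.23 |
| Aggregate | 7.33 | 6.49 | 7.46 | 11.03 | 9.23 | 15.41 | 7.83 | 7.31 |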
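
The page does not state how the Aggregate row is computed. For the providers checked below, it is consistent with an unweighted mean of the nine per-dataset WER values, which is what this short sketch verifies using figures copied from the ESB table.

```python
from statistics import mean

# Per-dataset WER values (percent) copied from the ESB table, in row order:
# LibriSpeech Clean, LibriSpeech Other, Common Voice, VoxPopuli, TED-LIUM,
# GigaSpeech, SPGISpeech, Earnings22, AMI.
esb_wer = {
    "Smallest Pulse": [2.46, 5.31, 10.89, 7.16, 4.07, 10.43, 2.86, 12.25, 10.58],
    "Assembly Universal 3 Pro": [1.65, 2.86, 6.73, 7.28, 2.95, 9.12, 1.74, 11.52, 14.60],
    "Azure": [2.48, 5.74, 47.28, 14.10, 3.81, 5.35, 3.53, 8.54, 8.46],
}

for provider, scores in esb_wer.items():
    # Unweighted mean: every dataset counts equally, regardless of its size.
    print(f"{provider}: {mean(scores):.2f}")
# Smallest Pulse: 7.33
# Assembly Universal 3 Pro: 6.49
# Azure: 11.03
```

One consequence of an unweighted mean is that a single outlier dataset can dominate: Azure's Common Voice score of 47.28 accounts for most of the gap between its aggregate and the others.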

ASR Robustness — WildASR Dataset (Streaming)

An open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Lower WER is better. n/a = not supported by that provider.

Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.

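| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Sarvam Saaras v3 | ElevenLabs Scribe |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Clean | 5.98 | 3.33 | 7.01 | 11.11 | 11.62 | 7.02 | 4.24 |
| Clipping | 14.03 | 6.59 | 42.10 | 4.35 | 47.35 | 28.74 | 11.20 |
| Far-field | 13.38 | 26.07 | 38.76 | n/a | 62.99 | 21.27 | 7.38 |
| Noise Gap | 8.90 | 4.04 | 9.77 | n/a | 15.04 | 9.74 | 6.30 |
| Phone Codec | 7.19 | 3.45 | 8.70 | n/a | 9.13 | 10.64 | 4.98 |
| Reverberation | 9.06 | 23.50 | 14.83 | n/a | 27.27 | 4.35 | 6.48 |
| Accent | 5.82 | 2.80 | 4.45 | n/a | 7.31 | n/a | 4.01 |
| Aggregate | 9.63 | 12.52 | 18.35 | 8.82 | 28.17 | 17.75 | 6.47 |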

Internal English Perturbation Benchmark

Not a public dataset. The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.

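| Category | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Deepgram Nova 3 | ElevenLabs Scribe |
| --- | --- | --- | --- | --- | --- |
| Noise | 10.53 | 11.93 | 14.19 | 14.58 | 10.05 |
| Silence | 5.81 | 4.22 | 8.22 | 13.28 | 10.61 |
| Telephony 911 | 21.03 | 23.93 | 27.88 | 28.43 | 20.29 |
| Boundary | 2.83 | 3.09 | 3.18 | 3.66 | 1.73 |
| Disfluency | 7.68 | 7.81 | 9.23 | 8.62 | 9.29 |
| Long Audios | 12.81 | 8.58 | 11.66 | 11.16 | 9.25 |
| Repetition | 11.38 | 9.82 | 10.39 | 9.57 | 10.81 |
| Entity | 12.43 | 10.13 | 13.35 | 11.69 | 9.48 |
| Accent | 8.68 | 7.89 | 9.51 | 10.42 | 7.25 |
| Emotion | 13.92 | 16.34 | 18.57 | 18.07 | 11.84 |
| Speaker Diversity | 7.33 | 6.72 | 8.81 | 9.48 | 5.95 |
| Speed | 4.32 | 3.63 | 4.40 | 6.88 | 3.74 |
| Pitch | 2.93 | 3.07 | 3.21 | 4.07 | 1.61 |
| Volume | 2.37 | 3.05 | 2.41 | 3.67 | 1.47 |
| Audio Quality | 2.73 | 2.86 | 3.03 | 4.08 | 1.60 |
| Average WER | 8.45 | 8.20 | 9.87 | 10.51 | 7.66 |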

Internal Hindi Perturbation Benchmark

Not a public dataset. Hindi audio sliced by perturbation type to isolate model weaknesses. Lower WER is better except for Entity EDR where higher is better (↑).

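| Category | Smallest Pulse | Sarvam Saaras v3 | Deepgram Nova 3 |
| --- | --- | --- | --- |
| Noise | 15.75% | 22.18% | 21.52% |
| Silence | 9.72% | 11.38% | 18.40% |
| Entity | 10.82% | 17.36% | 14.67% |
| Entity NE-WER | 13.32% | 26.72% | 26.58% |
| Entity EDR (↑) | 83.13% | 76.13% | 67.80% |
| Boundary | 11.99% | 17.52% | 17.36% |
| Long Audios | 18.11% | 18.42% | 19.21% |
| Speed | 16.37% | 21.39% | 38.21% |
| Pitch | 11.81% | 11.92% | 19.59% |
| Audio Quality | 10.65% | 11.75% | 19.51% |
| Volume | 8.21% | 15.25% | 16.76% |
| Disfluency | 11.83% | 12.06% | 18.44% |
| Repetition | 11.44% | 11.27% | 20.40% |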

Features — Non-streaming

| Feature | Available | Notes |
| --- | --- | --- |
| Speaker diarization | Yes | Multi-speaker identification |
| PII redaction | Yes | Personal info redaction |
| PCI redaction | Yes | Payment card data redaction |
| Word-level timestamps | Yes | Per-word timing |
| Sentence-level timestamps | Yes | Requires word_timestamps=true to be enabled |
| Punctuation | Yes | Auto punctuation |
| Profanity filter | Yes | Explicit content filtering |
| Language detection | Yes | Auto language ID |
| Code-switching | Yes | Multi-language in same audio |
| Noise reduction | Yes | Background noise handling |
| Emotion and gender detection | Yes | Returns the percentage score of detected emotion and gender |

Features — Streaming

| Feature | Available | Notes |
| --- | --- | --- |
| Speaker diarization | Yes | Multi-speaker identification |
| Keyword boosting | Yes | Custom vocabulary enhancement |
| PII redaction | Yes | Personal info redaction |
| PCI redaction | Yes | Payment card data redaction |
| Word-level timestamps | Yes | Per-word timing |
| Sentence-level timestamps | Yes | Per-sentence timing |
| Punctuation | Yes | Auto punctuation |
| Profanity filter | No | - |
| Language detection | Yes | Auto language ID |
| Code-switching | Yes | Multi-language in same audio |
| Custom vocabulary | No | - |
| Noise reduction | Yes | Background noise handling |

Supported Languages — Non-streaming

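| Language | Code | Available |
| --- | --- | --- |
| English | en | Yes |
| Italian | it | Yes |
| Spanish | es | Yes |
| Portuguese | pt | Yes |
| Hindi | hi | Yes |
| German | de | Yes |
| French | fr | Yes |
| Ukrainian | uk | Yes |
| Russian | ru | Yes |
| Kannada | kn | Yes |
| Malayalam | ml | Yes |
| Polish | pl | Yes |
| Marathi | mr | Yes |
| Gujarati | gu | Yes |
| Czech | cs | Yes |
| Slovak | sk | Yes |
| Telugu | te | Yes |
| Oriya (Odia) | or | Yes |
| Dutch | nl | Yes |
| Bengali | bn | Yes |
| Latvian | lv | Yes |
| Estonian | et | Yes |
| Romanian | ro | Yes |
| Punjabi | pa | Yes |
| Finnish | fi | Yes |
| Swedish | sv | Yes |
| Bulgarian | bg | Yes |
| Tamil | ta | Yes |
| Hungarian | hu | Yes |
| Danish | da | Yes |
| Lithuanian | lt | Yes |
| Maltese | mt | Yes |
| Japanese | ja | Yes |
| Cantonese | yue | Yes |
| Mandarin | zh | Yes |
| Korean | ko | Yes |
| Tagalog | tl | Yes |
| Indonesian | id | Yes |
| Malay | ms | Yes |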

Supported Languages — Streaming

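| Language | Code | Available |
| --- | --- | --- |
| English | en | Yes |
| Italian | it | Yes |
| Spanish | es | Yes |
| Portuguese | pt | Yes |
| Hindi | hi | Yes |
| German | de | Yes |
| French | fr | Yes |
| Ukrainian | uk | Yes |
| Russian | ru | Yes |
| Kannada | kn | Yes |
| Malayalam | ml | Yes |
| Polish | pl | Yes |
| Marathi | mr | Yes |
| Gujarati | gu | Yes |
| Czech | cs | Yes |
| Slovak | sk | Yes |
| Telugu | te | Yes |
| Oriya (Odia) | or | Yes |
| Dutch | nl | Yes |
| Bengali | bn | Yes |
| Latvian | lv | Yes |
| Estonian | et | Yes |
| Romanian | ro | Yes |
| Punjabi | pa | Yes |
| Finnish | fi | Yes |
| Swedish | sv | Yes |
| Bulgarian | bg | Yes |
| Tamil | ta | Yes |
| Hungarian | hu | Yes |
| Danish | da | Yes |
| Lithuanian | lt | Yes |
| Maltese | mt | Yes |
| Japanese | ja | Yes |
| Cantonese | yue | Yes |
| Mandarin | zh | Yes |
| Korean | ko | Yes |
| Tagalog | tl | Yes |
| Indonesian | id | Yes |
| Malay | ms | Yes |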

Best Practices

Specify the language parameter when known

When the language of the audio is known in advance, always set it explicitly rather than relying on automatic detection. This yields better transcription accuracy because the model can optimize directly for that language without needing to first identify it.

For example, setting the language parameter to es (Spanish) tells the model to expect Spanish audio, and it also handles English+Spanish code-switching. This produces more accurate output than using multi-eu or multi.

| Parameter | Use case |
| --- | --- |
| en | English |
| es | Spanish (handles English+Spanish) |
| hi | Hindi (handles English+Hindi) |
| multi-eu | Unknown European-language audio (auto-detects across the European set) |
| multi | Truly unknown or mixed-language audio (full multilingual auto-detection) |

When to use multi-eu or multi:

  • When the language is truly unknown beforehand
  • When processing audio from varied or unpredictable sources
  • Prefer multi-eu for European-language input; use multi only for truly mixed multilingual audio
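
The selection rule above can be written as a small helper. This is a sketch of the decision logic only: the function name `choose_language_param` and its arguments are hypothetical, and it does not call any Pulse API. The returned string is the value you would pass as the language parameter.

```python
from typing import Optional


def choose_language_param(
    known_language: Optional[str] = None,
    european_only: bool = False,
) -> str:
    """Pick a language parameter value following the guidance above.

    known_language: a language code such as "en", "es", or "hi" when the
        audio language is known in advance, otherwise None.
    european_only: True when the language is unknown but the input is
        expected to be a European language.
    """
    if known_language:
        # A known language always wins over auto-detection.
        return known_language
    # Unknown language: prefer the narrower European aggregator when it applies.
    return "multi-eu" if european_only else "multi"


print(choose_language_param("es"))                # es
print(choose_language_param(european_only=True))  # multi-eu
print(choose_language_param())                    # multi
```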

Use Cases

Direct use

  • Real-time call transcription
  • Voice assistant input
  • Meeting transcription
  • Accessibility and captioning
  • Customer support recording analysis

Downstream use

  • Multi-turn conversational agents
  • Voice-to-text pipelines
  • Telephony and IVR systems
  • Content indexing and search
  • Compliance and audit logging

Safety & Compliance

Pulse must not be used for:

  • Recording or transcribing individuals without their explicit consent
  • Surveillance, stalking, or any form of unauthorized monitoring
  • Any illegal or unethical purposes

Additionally:

  • Usage is monitored for policy compliance
  • For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai

Contact

Support: support@smallest.ai
Documentation: docs.smallest.ai/waves
Console: app.smallest.ai/dashboard