Pulse


Pulse is a high-accuracy, low-latency speech-to-text model built for real-time transcription across 35 documented languages (21 on streaming + 26 on pre-recorded, with regional aggregators), supporting both streaming and non-streaming use cases.


Model Overview

Developed by: Smallest AI
Model type: Speech-to-Text Streaming · Speech-to-Text Batch
Languages: Streaming: 21 single-language codes plus the north_indic, multi-asian, and multi-south-indic aggregators. Non-streaming (batch): 26 single-language codes plus the multi-eu and multi-asian aggregators. On streaming, East Asian codes (zh, yue, ja, ko, multi-asian) are US-region only, and South Indian codes (ta, te, kn, ml, multi-south-indic) are India-region only.
Audio input formats: WAV, MP3, FLAC, Opus, μ-law, A-law, raw PCM
Pricing (Standard Plan): Realtime: $0.008/min · Batch: $0.005/min
Concurrency (Standard Plan): Streaming: 100 concurrent requests · Batch: 25 RPM
Recommended sample rate: 16,000 Hz
Recommended GPU: 1× NVIDIA L4 (24 GB VRAM). Larger GPUs (L40S, A100, H100) are supported.
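
As a rough budgeting aid, the Standard Plan rates above can be turned into a cost estimate. The sketch below assumes billing is strictly linear in audio minutes, with no rounding, minimums, or volume discounts; that billing model is an assumption, not something stated here, and `estimate_cost` is a hypothetical helper name.

```python
# Standard Plan per-minute rates quoted in the Model Overview (USD).
RATES_USD_PER_MIN = {"realtime": 0.008, "batch": 0.005}


def estimate_cost(audio_minutes: float, mode: str) -> float:
    """Estimate cost in USD, ASSUMING simple linear per-minute billing."""
    if mode not in RATES_USD_PER_MIN:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    return audio_minutes * RATES_USD_PER_MIN[mode]


if __name__ == "__main__":
    # Compare both modes for the same volume of audio.
    for mode in RATES_USD_PER_MIN:
        print(f"{mode}: ${estimate_cost(10_000, mode):.2f} for 10,000 minutes")
```

Actual invoices may differ; the Concurrency & Limits page and your plan terms are the source of truth.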

Key Capabilities

Real-Time Optimized

Sub-100 ms TTFT at 1 concurrency, ~300 ms at 100 concurrent requests. Designed for live transcription and conversational AI.

Multi-Language

35 documented languages (21 streaming + 26 pre-recorded) with regional auto-detect aggregators and in-session code-switching.

PII / PCI Redaction

Built-in redaction of personal data and payment-card information on both streaming and non-streaming surfaces.

Speaker Diarization

Automatic multi-speaker identification with per-word and per-utterance speaker labels.

Noise Handling

Background-noise handling built into the model — no preprocessing required.

Code-Switching

Multi-language audio within a single session. Set the known primary language (e.g. es for Spanish) — English+Spanish is handled automatically.


Performance & Benchmarks

Pulse STT is evaluated against three open-source datasets (FLEURS, ESB, and WildASR) and one internal English perturbation suite. Results are reported as Word Error Rate (WER); lower is better. NA = not available or not supported by that provider.

For the full benchmark comparison across every dataset, see the Performance page. Pick the benchmark closest to your workload; each table below covers one benchmark.
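
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal illustration with whitespace tokenization only; the text normalization (casing, punctuation, numerals) used for the numbers on this page is not specified here, so this will not reproduce them exactly. `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("turn the lights off", "turn the light off"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.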

WER on the English subset of FLEURS across providers in streaming mode. Lower is better.

| Provider | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras v3 | ElevenLabs Scribe v2 |
|---|---|---|---|---|---|---|---|---|
| WER | 6.03% | 3.13% | 6.54% | 13.79% | 11.59% | 60.00% | 6.34% | 3.88% |

A note on audio amplitude normalization

Audio amplitude normalization materially changes WER on FLEURS. Raw FLEURS has variable, often low amplitude, and some models degrade sharply on it unless peak audio is first normalized to a fixed level such as −10 dBFS, so results reported on normalized audio alone can overstate real-world accuracy. Pulse is stable across all amplitude regimes.

| Model | Raw FLEURS | −10 dBFS | −20 dBFS | Stable across regimes? |
|---|---|---|---|---|
| Pulse | 6.03% | 6.06% | 5.81% | Yes |
| Deepgram Nova 3 | 11.59% | 6.57% | 6.51% | Partial (1.8× degradation on raw) |
| Grok | 60.00% | 7.58% | 8.59% | Collapses on raw |
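
To make the normalization step concrete, the sketch below scales a clip so that its peak sits at a target level in dBFS. It assumes floating-point samples where full scale is 1.0 and uses plain peak normalization; the exact preprocessing used for the table above is not described here, and `peak_normalize` is a hypothetical helper name.

```python
import math


def peak_normalize(samples, target_dbfs=-10.0):
    """Scale float samples (full scale = 1.0) so the peak sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    # Convert the dBFS target to a linear amplitude: -10 dBFS is about 0.316.
    target_peak = 10.0 ** (target_dbfs / 20.0)
    gain = target_peak / peak
    return [s * gain for s in samples]


if __name__ == "__main__":
    quiet_clip = [0.01, -0.02, 0.005]
    normalized = peak_normalize(quiet_clip, target_dbfs=-10.0)
    peak_db = 20.0 * math.log10(max(abs(s) for s in normalized))
    print(f"peak after normalization: {peak_db:.2f} dBFS")
```

Running a benchmark on both the raw and the normalized audio is a simple way to check how sensitive a model is to input level.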

WER on FLEURS in streaming mode, broken down by language. Lower is better.

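| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
|---|---|---|---|
| Italian | 4.41% | 11.05% | 6.99% |
| English | 6.03% | 15.59% | 11.21% |
| Spanish | 5.99% | 10.67% | 7.52% |
| Portuguese | 8.32% | 14.15% | 11.46% |
| German | 9.5% | 11.1% | 10.15% |
| French | 10.71% | 14.3% | 12.07% |
| Russian | 14.35% | NA | NA |
| Dutch | 11.90% | NA | NA |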

WER on FLEURS in pre-recorded mode (full-file upload). Lower is better.

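| Language | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
|---|---|---|---|
| English | 4.55% | 7.9% | 6.7% |
| Italian | 3.0% | 10.7% | 6.2% |
| Spanish | 3.2% | 8.6% | 4.1% |
| Portuguese | 5.0% | 9.9% | 7.5% |
| German | 6.4% | 8.2% | 8.5% |
| French | 7.1% | 13.3% | 10.7% |
| Russian | 9.6% | 7.9% | 11.8% |
| Dutch | 15.0% | 16.3% | 12.5% |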

WER (%) across seven Hindi datasets covering read speech, conversational speech, telephony / contact-center audio, and noise-augmented variants. Compared against IndicWhisper, Sarvam Saaras v3, ElevenLabs Scribe v2, and Deepgram Nova-3. Lower is better.

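| Dataset | Smallest Pulse | IndicWhisper | Sarvam Saaras v3 | ElevenLabs Scribe v2 | Deepgram Nova-3 |
|---|---|---|---|---|---|
| FLEURS | 9.03 | 15.00 | 7.31 | 8.96 | 14.09 |
| Kathbath | 7.63 | 10.30 | 8.15 | 8.67 | 16.22 |
| Kathbath (noisy) | 8.52 | 12.00 | 8.85 | 10.11 | 17.06 |
| Common Voice | 8.58 | 11.40 | 10.36 | 13.61 | 23.55 |
| Indic-TTS | 6.38 | 7.60 | 6.29 | 8.75 | 10.72 |
| MUCS | 3.61 | 12.00 | 8.15 | 8.15 | 16.20 |
| Gramvaani | 19.77 | 26.80 | 20.80 | 24.09 | 31.44 |
| Average | 8.78 | 13.59 | 10.69 | 11.76 | 18.47 |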

For the full breakdown including training-data and evaluation-protocol notes, see the Performance page.

ESB is a Hugging Face benchmark suite aggregating 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Lower WER is better.

Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.

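| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Grok | Sarvam Saaras v3 | ElevenLabs Scribe v2 |
|---|---|---|---|---|---|---|---|---|
| LibriSpeech Clean | 2.46 | 1.65 | 2.16 | 2.48 | 3.20 | 3.61 | 3.09 | 1.97 |
| LibriSpeech Other | 5.31 | 2.86 | 4.88 | 5.74 | 6.60 | 7.28 | 6.85 | 4.45 |
| Common Voice | 10.89 | 6.73 | 10.69 | 47.28 | 14.22 | 43.46 | 11.37 | 9.83 |
| VoxPopuli | 7.16 | 7.28 | 7.07 | 14.10 | 9.55 | 11.49 | 7.77 | 7.91 |
| TED-LIUM | 4.07 | 2.95 | 2.66 | 3.81 | 3.59 | 6.90 | 2.89 | 3.16 |
| GigaSpeech | 10.43 | 9.12 | 10.09 | 5.35 | 10.05 | 10.05 | 9.57 | 9.66 |
| SPGISpeech | 2.86 | 1.74 | 4.18 | 3.53 | 2.99 | 9.70 | 3.89 | 4.40 |
| Earnings22 | 12.25 | 11.52 | 12.21 | 8.54 | 15.79 | 27.02 | 11.97 | 12.20 |
| AMI | 10.58 | 14.60 | 13.19 | 8.46 | 17.04 | 19.19 | 13.08 | 12.23 |
| Aggregate | 7.33 | 6.49 | 7.46 | 11.03 | 9.23 | 15.41 | 7.83 | 7.31 |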

WildASR is an open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Lower WER is better. n/a = not supported by that provider.

Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.

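| Dataset | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Sarvam Saaras v3 | ElevenLabs Scribe |
|---|---|---|---|---|---|---|---|
| Clean | 5.98 | 3.33 | 7.01 | 11.11 | 11.62 | 7.02 | 4.24 |
| Clipping | 14.03 | 6.59 | 42.10 | 4.35 | 47.35 | 28.74 | 11.20 |
| Far-field | 13.38 | 26.07 | 38.76 | n/a | 62.99 | 21.27 | 7.38 |
| Noise Gap | 8.90 | 4.04 | 9.77 | n/a | 15.04 | 9.74 | 6.30 |
| Phone Codec | 7.19 | 3.45 | 8.70 | n/a | 9.13 | 10.64 | 4.98 |
| Reverberation | 9.06 | 23.50 | 14.83 | n/a | 27.27 | 4.35 | 6.48 |
| Accent | 5.82 | 2.80 | 4.45 | n/a | 7.31 | n/a | 4.01 |
| Aggregate | 9.63 | 12.52 | 18.35 | 8.82 | 28.17 | 17.75 | 6.47 |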

WER for the four East Asian languages newly enabled on the streaming endpoint (us-west-2). Three datasets per language covering read speech (FLEURS), conversational/crowdsourced speech (Common Voice 25), and language-specific corpora (JSUT, Zeroth-Korean, MDCC, AISHELL-1). Compared head-to-head against Deepgram Nova-3. Lower WER is better.

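| Language | Dataset | Smallest Pulse | Deepgram Nova 3 |
|---|---|---|---|
| Japanese | CV-25 | 23.84% | 34.81% |
| Japanese | FLEURS | 10.78% | 17.11% |
| Japanese | JSUT BASIC5000 | 11.47% | 11.65% |
| Korean | CV-25 | 9.79% | 9.66% |
| Korean | FLEURS | 7.95% | 10.79% |
| Korean | Zeroth-Korean | 5.25% | 6.46% |
| Cantonese | CV-25 | 6.16% | 14.09% |
| Cantonese | FLEURS | 13.06% | 15.43% |
| Cantonese | MDCC | 5.85% | 12.77% |
| Mandarin | CV-25 | 15.99% | 22.44% |
| Mandarin | FLEURS | 14.25% | 13.89% |
| Mandarin | AISHELL-1 | 7.34% | 8.69% |
| Average | | 10.91% | 15.50% |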

Pulse averages 10.91% WER vs Deepgram Nova-3’s 15.50% across the four East Asian languages — Pulse leads on 10 of 12 dataset rows, with the largest gains on Japanese CV-25 and Cantonese CV-25.

These four languages stream from wss://api.us.smallest.ai/waves/v1/stt/live?model=pulse only (US region). See the streaming Asian-language documentation for the region-routing details.

Internal English perturbation suite (not a public dataset). The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.

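| Category | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Deepgram Nova 3 | ElevenLabs Scribe |
|---|---|---|---|---|---|
| Noise | 10.53 | 11.93 | 14.19 | 14.58 | 10.05 |
| Silence | 5.81 | 4.22 | 8.22 | 13.28 | 10.61 |
| Telephony 911 | 21.03 | 23.93 | 27.88 | 28.43 | 20.29 |
| Boundary | 2.83 | 3.09 | 3.18 | 3.66 | 1.73 |
| Disfluency | 7.68 | 7.81 | 9.23 | 8.62 | 9.29 |
| Long Audios | 12.81 | 8.58 | 11.66 | 11.16 | 9.25 |
| Repetition | 11.38 | 9.82 | 10.39 | 9.57 | 10.81 |
| Entity | 12.43 | 10.13 | 13.35 | 11.69 | 9.48 |
| Accent | 8.68 | 7.89 | 9.51 | 10.42 | 7.25 |
| Emotion | 13.92 | 16.34 | 18.57 | 18.07 | 11.84 |
| Speaker Diversity | 7.33 | 6.72 | 8.81 | 9.48 | 5.95 |
| Speed | 4.32 | 3.63 | 4.40 | 6.88 | 3.74 |
| Pitch | 2.93 | 3.07 | 3.21 | 4.07 | 1.61 |
| Volume | 2.37 | 3.05 | 2.41 | 3.67 | 1.47 |
| Audio Quality | 2.73 | 2.86 | 3.03 | 4.08 | 1.60 |
| Average WER | 8.45 | 8.20 | 9.87 | 10.51 | 7.66 |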

Not a public dataset. Hindi audio is sliced by perturbation type to isolate model weaknesses. Lower WER is better, except for Entity EDR, where higher is better (↑).

| Category | Smallest Pulse | Sarvam Saaras v3 | Deepgram Nova 3 |
|---|---|---|---|
| Noise | 15.76% | 22.18% | 21.52% |
| Silence | 8.08% | 11.38% | 18.40% |
| Entity | 10.82% | 17.36% | 14.67% |
| Entity NE-WER | 13.32% | 26.72% | 26.58% |
| Entity EDR (↑) | 83.13% | 76.13% | 67.80% |
| Boundary | 11.67% | 17.52% | 17.36% |
| Long Audios | 17.93% | 18.42% | 19.21% |
| Speed | 16.16% | 21.39% | 38.21% |
| Pitch | 11.43% | 11.92% | 19.59% |
| Audio Quality | 10.86% | 11.75% | 19.51% |
| Volume | 9.31% | 15.25% | 16.76% |
| Disfluency | 11.51% | 12.06% | 18.44% |
| Repetition | 11.38% | 11.27% | 20.40% |

Supported Languages

35 unique language codes across the two modes (21 on streaming, 26 on pre-recorded). The full per-mode lists follow.

Streaming: 21 languages + 3 regional aggregators

| # | Language | Code |
|---|---|---|
| 1 | English | en |
| 2 | Hindi | hi |
| 3 | German | de |
| 4 | Spanish | es |
| 5 | Russian | ru |
| 6 | Italian | it |
| 7 | French | fr |
| 8 | Dutch | nl |
| 9 | Portuguese | pt |
| 10 | Mandarin | zh (US region only [*]) |
| 11 | Cantonese | yue (US region only [*]) |
| 12 | Japanese | ja (US region only [*]) |
| 13 | Korean | ko (US region only [*]) |
| 14 | Gujarati | gu |
| 15 | Marathi | mr |
| 16 | Oriya | or |
| 17 | Bengali | bn |
| 18 | Tamil | ta (India region only [**]) |
| 19 | Telugu | te (India region only [**]) |
| 20 | Kannada | kn (India region only [**]) |
| 21 | Malayalam | ml (India region only [**]) |
| 22 | North-Indic aggregator | north_indic: auto-detects across en, hi, gu, mr, bn, or |
| 23 | Multi-Asian aggregator | multi-asian: auto-detects across zh, yue, ko, ja, en. US region only [*]. Contact sales for access in the India region. |
| 24 | South-Indic aggregator | multi-south-indic: auto-detects across ta, te, kn, ml, and English code-switching. India region only [**]. Use when the South Indian language is not known in advance. |

[*] East Asian languages (zh, yue, ja, ko, multi-asian) are served from the US region only. Connect to wss://api.us.smallest.ai/waves/v1/stt/live instead of wss://api.smallest.ai/... for these.

[**] South Indian languages (ta, te, kn, ml, multi-south-indic) are served from the India region only (wss://api.smallest.ai/...). Requests to wss://api.us.smallest.ai/... are rejected with error code LANGUAGE_NOT_ENABLED_IN_REGION. Contact support to request access.
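
The region rules in the two footnotes above can be captured in a small routing helper. This is a sketch under stated assumptions: it only picks the WebSocket host and appends the documented `model=pulse` query string. How the language itself is passed (query parameter name or message field) is not given in this section, so it is deliberately left out, and `streaming_url` is a hypothetical helper name.

```python
# Codes served from the US region only (footnote [*]).
US_ONLY_CODES = {"zh", "yue", "ja", "ko", "multi-asian"}
# Codes served from the India region only (footnote [**]).
INDIA_ONLY_CODES = {"ta", "te", "kn", "ml", "multi-south-indic"}

US_HOST = "api.us.smallest.ai"
DEFAULT_HOST = "api.smallest.ai"  # the India-region host per footnote [**]


def streaming_url(language: str) -> str:
    """Pick the streaming endpoint whose region serves the given language code."""
    host = US_HOST if language in US_ONLY_CODES else DEFAULT_HOST
    return f"wss://{host}/waves/v1/stt/live?model=pulse"


if __name__ == "__main__":
    for code in ("en", "ja", "ta", "multi-asian"):
        print(code, "->", streaming_url(code))
```

Routing on the client avoids a round trip that would otherwise end in LANGUAGE_NOT_ENABLED_IN_REGION for South Indian codes sent to the US host.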

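Pre-recorded: 26 languages + 2 regional aggregators

| # | Language | Code |
|---|---|---|
| 1 | English | en |
| 2 | Hindi | hi |
| 3 | German | de |
| 4 | Spanish | es |
| 5 | Russian | ru |
| 6 | Italian | it |
| 7 | French | fr |
| 8 | Dutch | nl |
| 9 | Portuguese | pt |
| 10 | Ukrainian | uk |
| 11 | Polish | pl |
| 12 | Czech | cs |
| 13 | Slovak | sk |
| 14 | Latvian | lv |
| 15 | Estonian | et |
| 16 | Romanian | ro |
| 17 | Finnish | fi |
| 18 | Swedish | sv |
| 19 | Bulgarian | bg |
| 20 | Hungarian | hu |
| 21 | Danish | da |
| 22 | Lithuanian | lt |
| 23 | Maltese | mt |
| 24 | Mandarin | zh |
| 25 | Japanese | ja |
| 26 | Korean | ko |
| 27 | Multi-EU aggregator | multi-eu: auto-detects across all 21 European codes above plus en |
| 28 | Multi-Asian aggregator | multi-asian: auto-detects across zh, ko, ja, en |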

Single language code vs. aggregator: Use a specific language code (e.g. hi, es, en) whenever you know the language of the audio. The model optimizes directly for that language and also handles code-switching with English (e.g. hi covers Hindi–English mixed speech). Use an aggregator (north_indic, multi-south-indic, multi-eu, multi-asian) only when the language is genuinely unknown or the source is mixed across multiple languages; auto-detection adds a small accuracy overhead compared to an explicit code.


Features — Streaming

| Feature | Notes |
|---|---|
| Speaker diarization | Identifies and labels each speaker |
| Keyword boosting | Improves accuracy for custom vocabulary |
| PII redaction | Personal information redaction |
| PCI redaction | Payment card data redaction |
| Word-level timestamps | Start and end time for each word |
| Sentence-level timestamps | Start and end time for each sentence |
| Punctuation | Automatically adds punctuation |
| Code-switching | Handles multiple languages in one session |

Features — Non-streaming

| Feature | Notes |
|---|---|
| Speaker diarization | Identifies and labels each speaker |
| PII redaction | Personal information redaction |
| PCI redaction | Payment card data redaction |
| Word-level timestamps | Start and end time for each word |
| Sentence-level timestamps | Requires word_timestamps=true |
| Punctuation | Automatically adds punctuation |
| Code-switching | Handles multiple languages in one session |

API Reference

| Endpoint | Method | Use case |
|---|---|---|
| https://api.smallest.ai/waves/v1/stt/?model=pulse | POST | Pre-recorded transcription |
| wss://api.smallest.ai/waves/v1/stt/live?model=pulse | WebSocket | Streaming transcription |

See Transcribe (Pre-recorded) for the full request/response schema, supported parameters, and error codes. The streaming surface shares parameters where applicable; see the Realtime quickstart for the WebSocket protocol details.
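
As an illustration of the pre-recorded endpoint, the sketch below builds a POST request with the Python standard library. Only the URL and the POST method come from the table above. The bearer-token `Authorization` header, the raw-audio request body, and the `audio/wav` content type are assumptions made for illustration; the Transcribe (Pre-recorded) reference is the authority on the real request format. `build_transcription_request` is a hypothetical helper name.

```python
import urllib.request

# Endpoint and method as listed in the API Reference table.
PRERECORDED_URL = "https://api.smallest.ai/waves/v1/stt/?model=pulse"


def build_transcription_request(audio_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for pre-recorded transcription."""
    headers = {
        # ASSUMPTION: bearer-token auth; confirm against the API reference.
        "Authorization": f"Bearer {api_key}",
        # ASSUMPTION: the audio is sent as the raw request body.
        "Content-Type": "audio/wav",
    }
    return urllib.request.Request(
        PRERECORDED_URL, data=audio_bytes, headers=headers, method="POST"
    )


if __name__ == "__main__":
    request = build_transcription_request(b"RIFF...", api_key="YOUR_API_KEY")
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request) and parse the JSON response.
```

Keeping request construction separate from sending makes the call easy to inspect and unit-test without network access.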


Throughput, Latency & Pricing

| Mode | Typical | Notes |
|---|---|---|
| Streaming TTFT (1 concurrency) | 150 ms | See Measuring latency for methodology. |
| Streaming TTFT (100 concurrency) | ~300 ms | Concurrency curve documented in the Performance page. |
| Pre-recorded RTFx | 50× or higher | Wall-clock; metadata.processing_time_ms excludes network. |
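
RTFx (inverse real-time factor) is commonly computed as audio duration divided by processing time, so 50× means one minute of audio is processed in about 1.2 seconds. The sketch below applies that common definition; `rtfx` is a hypothetical helper name. Feeding it the `metadata.processing_time_ms` value gives a server-side figure that excludes network time, while feeding it your own wall-clock measurement gives the end-to-end figure.

```python
def rtfx(audio_duration_s: float, processing_time_ms: float) -> float:
    """Return how many times faster than real time the audio was processed."""
    if processing_time_ms <= 0:
        raise ValueError("processing_time_ms must be positive")
    # Convert the audio duration to milliseconds so both terms share a unit.
    return audio_duration_s * 1000.0 / processing_time_ms


if __name__ == "__main__":
    # A 60-second file processed in 1,200 ms.
    print(f"{rtfx(60, 1200):.1f}x real time")
```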

Rate limits, concurrency caps, and pricing tiers are documented on the Concurrency & Limits page. For enterprise pricing, contact sales@smallest.ai.


Use Cases

| Direct Use | Downstream Use |
|---|---|
| Real-time call transcription | Multi-turn conversational agents |
| Voice assistant input | Voice-to-text pipelines |
| Meeting transcription | Telephony and IVR systems |
| Accessibility and captioning | Content indexing and search |
| Customer support recording analysis | Compliance and audit logging |

Safety & Compliance

Pulse must not be used for:

  • Recording or transcribing individuals without their explicit consent
  • Surveillance, stalking, or any form of unauthorized monitoring
  • Any illegal or unethical purposes

Additionally:

  • Usage is monitored for policy compliance
  • For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai

FAQ

Pulse runs two independent inference paths. Streaming uses a WebSocket connection and emits partial transcripts in real time, which suits live call transcription, voice assistants, and conversational AI where first-token latency matters. Batch accepts a full audio file over HTTP and returns the complete transcript once processing is done, which suits call recordings, media archives, and any workload where you have the full audio upfront. Features also differ: keyword boosting is streaming-only, and batch supports a broader set of 26 languages vs 21 on streaming.

Use a specific language code (e.g. hi, es, en) whenever you know the language of the audio. Pulse optimizes directly for that language and handles code-switching with English automatically, so hi covers Hindi–English mixed speech without needing an aggregator. Use an aggregator (north_indic, multi-south-indic, multi-eu, multi-asian) only when the source language is genuinely unknown or mixed across several languages. Auto-detection adds a small accuracy overhead compared to an explicit code.

The complete request/response schema, all query parameters, error codes, and WebSocket protocol details are in the API Reference. For step-by-step setup, see the Realtime quickstart for streaming and the Pre-recorded quickstart for batch.

Pulse supports both streaming and batch transcription across 35 languages (21 streaming + 26 pre-recorded). Pulse Pro is English-only and batch-only, but achieves higher accuracy: it is tied for #2 on the Open ASR Leaderboard at 5.42% average WER, vs Pulse’s 6.03% on English FLEURS. Use Pulse for live streaming, multilingual audio, or latency-sensitive workloads. Use Pulse Pro for high-volume English batch transcription where accuracy is the top requirement.

Pulse applies built-in redaction on both streaming and batch surfaces, with no preprocessing required. PII redaction masks personal identifiers such as names, phone numbers, email addresses, and SSNs. PCI redaction masks payment card data including card numbers, CVVs, and expiry dates. Both are enabled via query parameters in the API request. Redacted content is replaced with a placeholder in the transcript; the original audio is not retained post-processing.


Support