Pulse
Pulse is a high-accuracy, low-latency speech-to-text model built for real-time transcription across 35 documented languages (21 on streaming + 26 on pre-recorded, with regional aggregators), supporting both streaming and non-streaming use cases.
Model Overview
Key Capabilities
Sub-100 ms TTFT at 1 concurrency, ~300 ms at 100 concurrent requests. Designed for live transcription and conversational AI.
35 documented languages (21 streaming + 26 pre-recorded) with regional auto-detect aggregators and in-session code-switching.
Built-in redaction of personal data and payment-card information on both streaming and non-streaming surfaces.
Automatic multi-speaker identification with per-word and per-utterance speaker labels.
Background-noise handling built into the model — no preprocessing required.
Multi-language audio within a single session. Set the known primary language (e.g. es for Spanish) — English+Spanish is handled automatically.
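The latency figures above can be checked against your own workload. The following is a minimal sketch, reading TTFT as the time from starting to consume a stream until its first partial transcript arrives (an assumption; this page does not define the measurement protocol). `time_to_first` and `fake_stream` are hypothetical helpers, and the simulated stream stands in for a real Pulse connection.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first(events: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first event, the first event itself)."""
    start = time.perf_counter()
    iterator = iter(events)
    first = next(iterator)  # blocks until the stream yields something
    return time.perf_counter() - start, first


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a transcript stream; not the Pulse API."""
    time.sleep(delay_s)  # simulated network and model latency
    yield "hel"
    yield "hello"


ttft_s, first_partial = time_to_first(fake_stream())
print(f"first partial {first_partial!r} after {ttft_s * 1000:.0f} ms")
```

To approximate the concurrent case, run many such measurements in parallel and look at the distribution rather than a single value.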
Performance & Benchmarks
Pulse STT is evaluated against three open-source datasets (FLEURS, ESB, and WildASR) and one internal English perturbation suite. The tables report Word Error Rate (WER) by language; lower is better. NA = not available or not supported by that provider.
For the full benchmark comparison across every dataset, see the Performance page. Pick the benchmark closest to your workload — each accordion below expands its full table.
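For readers unfamiliar with the metric, here is a minimal sketch of the standard WER calculation: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the evaluation harness behind the numbers on this page. The whitespace tokenization and the absence of any text normalization (casing, punctuation) are assumptions made to keep the example short.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.2%}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.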
FLEURS Dataset Streaming - English
WER on the English subset of FLEURS across providers in streaming mode. Lower is better.
A note on audio amplitude normalization
Audio amplitude normalization materially changes WER on FLEURS. Most competitors benchmark on raw FLEURS — which has variable, often low amplitude — without normalizing peak audio to −10 dBFS. This makes some models look much better than they actually are. Pulse is stable across all amplitude regimes.
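As an illustration of the normalization step described above, this sketch scales a clip so that its peak sits at −10 dBFS. It assumes floating-point samples in the range [-1.0, 1.0], where 0 dBFS corresponds to an absolute value of 1.0; `peak_normalize` is a hypothetical helper, not part of any Pulse tooling.

```python
import math
from typing import List, Sequence


def peak_normalize(samples: Sequence[float], target_dbfs: float = -10.0) -> List[float]:
    """Scale samples so the largest absolute value equals target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: there is nothing to scale
    target_peak = 10 ** (target_dbfs / 20.0)  # dBFS to linear amplitude
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet_clip = [0.01, -0.02, 0.005]  # low-amplitude input
normalized = peak_normalize(quiet_clip)
new_peak = max(abs(s) for s in normalized)
print(f"peak after normalization: {20 * math.log10(new_peak):.1f} dBFS")
```

Peak normalization applies one gain to the whole clip, so the relative dynamics of the recording are unchanged.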
FLEURS Dataset Streaming - European + Indic Languages
WER on FLEURS in streaming mode, broken down by language family. Lower is better.
European Languages
Indic Languages
FLEURS Dataset Batch - European + Indic Languages
WER on FLEURS in pre-recorded mode (full-file upload). Lower is better.
European Languages
Indic Languages
VISTAAR Streaming - Hindi
WER across seven Hindi datasets covering read speech, conversational speech, telephony / contact-center audio, and noise-augmented variants. Compared against IndicWhisper, Sarvam Saaras v3, Scribe v2, and Deepgram Nova-3. Lower is better.
For the full breakdown including training-data and evaluation-protocol notes, see the Performance page.
HF ESB Dataset Streaming - English
A Hugging Face benchmark suite aggregating 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Lower WER is better.
Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.
WildASR Dataset Streaming - English (STT Robustness)
An open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Lower WER is better. n/a = not supported by that provider.
Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.
Multi-dataset Streaming - East Asian Languages
WER for the four East Asian languages newly enabled on the streaming endpoint (us-west-2). Three datasets per language covering read speech (FLEURS), conversational/crowdsourced speech (Common Voice 25), and language-specific corpora (JSUT, Zeroth-Korean, MDCC, AISHELL-1). Compared head-to-head against Deepgram Nova-3. Lower WER is better.
Pulse averages 10.91% WER vs Deepgram Nova-3’s 15.50% across the four East Asian languages — Pulse leads on 10 of 12 dataset rows, with the largest gains on Japanese CV-25 and Cantonese CV-25.
These four languages stream from wss://api.us.smallest.ai/waves/v1/stt/live?model=pulse only (US region). See the streaming Asian-language documentation for the region-routing details.
Internal Perturbation Benchmark Streaming - English
Not a public dataset. The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.
Internal Perturbation Benchmark Streaming - Hindi
Not a public dataset. Hindi audio sliced by perturbation type to isolate model weaknesses. Lower WER is better except for Entity EDR where higher is better (↑).
Supported Languages
35 unique language codes across the two modes (21 on streaming, 26 on pre-recorded). Click an accordion to expand the full per-mode list.
Streaming — 21 languages + 3 regional aggregators
[*] East Asian languages (zh, yue, ja, ko, multi-asian) are served from the US region only. Connect to wss://api.us.smallest.ai/waves/v1/stt/live instead of wss://api.smallest.ai/... for these.
[**] South Indian languages (ta, te, kn, ml, multi-south-indic) are served from the India region only (wss://api.smallest.ai/...). Requests to wss://api.us.smallest.ai/... are rejected with error code LANGUAGE_NOT_ENABLED_IN_REGION. Contact support to request access.
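The two region notes above can be folded into a small client-side routing helper. This is a sketch only: the hostnames and language codes come from the notes, while `streaming_host` and its `preferred` argument are hypothetical. Which region to prefer for languages not covered by either note is not stated on this page, so the default below is an assumption.

```python
US_HOST = "api.us.smallest.ai"   # US region
INDIA_HOST = "api.smallest.ai"   # India region

# Codes listed in the two region notes above.
US_ONLY = {"zh", "yue", "ja", "ko", "multi-asian"}
INDIA_ONLY = {"ta", "te", "kn", "ml", "multi-south-indic"}


def streaming_host(language: str, preferred: str = INDIA_HOST) -> str:
    """Pick the streaming host for a language code.

    `preferred` applies to languages with no region restriction;
    the default value is an assumption, not documented behavior.
    """
    if language in US_ONLY:
        return US_HOST
    if language in INDIA_ONLY:
        return INDIA_HOST
    return preferred


for code in ("ja", "ta", "en"):
    print(code, "->", streaming_host(code))
```

Routing on the client avoids a round trip that would otherwise end in LANGUAGE_NOT_ENABLED_IN_REGION for South Indian languages sent to the US host.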
Non-Streaming (Pre-Recorded) — 26 languages + 2 regional aggregators
Single language code vs. aggregator: Use a specific language code (e.g. hi, es, en) whenever you know the language of the audio — the model optimizes directly for that language and also handles code-switching with English (e.g. hi covers Hindi–English mixed speech). Use an aggregator (north_indic, multi-eu, multi-asian) only when the language is genuinely unknown or the source is mixed across multiple languages; auto-detection adds a small accuracy overhead compared to an explicit code.
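The selection rule above reduces to a simple fallback. The sketch below assumes your application tracks the audio language as an optional string; `choose_language` is a hypothetical helper, and the name of the request parameter that carries the result is not shown because this page does not specify it.

```python
from typing import Optional


def choose_language(known_language: Optional[str], fallback_aggregator: str) -> str:
    """Prefer an explicit language code; use an aggregator only when unknown."""
    if known_language and known_language.strip():
        return known_language.strip()
    return fallback_aggregator


# Known Hindi audio, including Hindi-English mixed speech: use "hi".
print(choose_language("hi", fallback_aggregator="north_indic"))
# Language genuinely unknown: fall back to the regional aggregator.
print(choose_language(None, fallback_aggregator="multi-eu"))
```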
Features — Streaming
Features — Non-streaming
API Reference
See Transcribe (Pre-recorded) for the full request/response schema, supported parameters, and error codes. The streaming surface shares parameters where applicable; see the Realtime quickstart for the WebSocket protocol details.
Throughput, Latency & Pricing
Rate limits, concurrency caps, and pricing tiers are documented on the Concurrency & Limits page. For enterprise pricing, contact sales@smallest.ai.
Use Cases
Safety & Compliance
Pulse must not be used for:
- Recording or transcribing individuals without their explicit consent
- Surveillance, stalking, or any form of unauthorized monitoring
- Any illegal or unethical purposes
Additionally:
- Usage is monitored for policy compliance
- For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai
FAQ
What is the difference between Pulse streaming and batch modes?
Pulse runs two independent inference paths. Streaming uses a WebSocket connection and emits partial transcripts in real time, which suits live call transcription, voice assistants, and conversational AI where first-token latency matters. Batch accepts a full audio file over HTTP and returns the complete transcript once processing is done, which suits call recordings, media archives, and any workload where you have the full audio upfront. Features also differ: keyword boosting and sentence-level timestamps are streaming-only, and batch supports a broader set of 26 languages vs 21 on streaming.
How do I choose between a specific language code and a regional aggregator?
Use a specific language code (e.g. hi, es, en) whenever you know the language of the audio. Pulse optimizes directly for that language and handles code-switching with English automatically, so hi covers Hindi–English mixed speech without needing an aggregator. Use an aggregator such as north_indic or multi-asian only when the source language is genuinely unknown or mixed across several languages. Auto-detection adds a small accuracy overhead compared to an explicit code.
Where can I find the full API reference and quickstart guides?
The complete request/response schema, all query parameters, error codes, and WebSocket protocol details are in the API Reference. For step-by-step setup, see the Realtime quickstart for streaming and the Pre-recorded quickstart for batch.
What is the difference between Pulse and Pulse Pro?
Pulse supports both streaming and batch transcription across 35 languages (21 streaming + 26 pre-recorded). Pulse Pro is English-only and batch-only, but achieves higher accuracy: it is tied for #2 on the Open ASR Leaderboard at 5.42% average WER, vs Pulse’s 6.03% on English FLEURS. Use Pulse for live streaming, multilingual audio, or latency-sensitive workloads. Use Pulse Pro for high-volume English batch transcription where accuracy is the top requirement.
How does PII and PCI redaction work?
Pulse applies built-in redaction on both streaming and batch surfaces — no preprocessing required. PII redaction masks personal identifiers such as names, phone numbers, email addresses, and SSNs. PCI redaction masks payment card data including card numbers, CVVs, and expiry dates. Both are enabled via query parameters in the API request. Redacted content is replaced with a placeholder in the transcript; the original audio is not retained post-processing.

