Pulse Pro | Smallest AI Docs

Latest Release

Pulse Pro is the premium non-streaming (batch) speech-to-text model for English. It is tied for #2 on the public Open ASR Leaderboard (5.42% average WER), ahead of ElevenLabs Scribe v2, AssemblyAI Universal-3 Pro, Speechmatics Enhanced, and every Whisper variant.


5.42% WER

Open ASR Leaderboard average, English

250–300× RTFx

Long-form transcription, no timestamps

English Only

Single language: English (en)

Batch Only

No streaming, HTTP transport only

Model Overview


Developed by	Smallest AI
Model type	Speech-to-Text Batch
Languages	English (`en`)
Audio input formats	WAV, MP3, FLAC, Opus, μ-law, A-law, raw PCM
Pricing (Standard Plan)	$0.004/min
Rate limit (Standard Plan)	25 RPM (requests per minute)
Recommended Sample Rate	16,000 Hz
Recommended GPU	1× NVIDIA L4 (24 GB VRAM). Larger GPUs (L40S, A100, H100) supported.

Key Capabilities

Leaderboard-Ranked Accuracy

Tied for #2 on the public Open ASR Leaderboard at 5.42% average WER. Outranks every commercial English STT API in our accuracy band.

Domain Wins

Best-in-class on AMI (meetings, 7.32% WER) and SPGISpeech (financial, 2.04% WER). Workloads enterprise customers actually have.

High Throughput

250–300× real-time factor on long-form audio without timestamps. ~200× with word timestamps enabled.

Word Timestamps

Per-word timing in the response when timestamps are enabled. Costs up to roughly one-third of throughput vs no-timestamps mode (~200× vs 250–300×).

Speaker Diarization

Multi-speaker identification with per-word speaker labels.

Noise Handling

Built-in handling for background noise, telephony artifacts, and variable audio quality — no preprocessing required.

Performance & Benchmarks

#1 for Average WER

CodeSota Leaderboard

#1 for Speed Factor

Artificial Analysis

Pulse Pro is evaluated on the public Open ASR Leaderboard (ESB benchmark, Whisper EnglishTextNormalizer) and FLEURS English.

Open ASR Leaderboard, head-to-head

WER (%) on the eight ESB datasets; lower is better, and the lowest value in each row is the winner. Text is normalized with the Whisper EnglishTextNormalizer.

Dataset	Pulse Pro	Granite 4.1 2B	Cohere Transcribe	AssemblyAI Universal-3 Pro	ElevenLabs Scribe v2	Deepgram Nova 3
AMI	7.32	8.09	8.13	14.60	12.23	17.04
Earnings22	9.04	8.37	10.86	11.52	12.20	15.79
GigaSpeech	9.52	9.80	9.34	9.12	9.66	10.05
LibriSpeech clean	1.73	1.33	1.25	1.65	1.97	3.20
LibriSpeech other	3.74	2.50	2.37	2.86	4.45	6.60
SPGISpeech	2.04	3.78	3.08	1.74	4.40	2.99
TED-LIUM	3.68	3.07	2.49	2.95	3.16	3.59
VoxPopuli	6.32	5.70	5.87	7.28	7.91	9.55
Average (8 datasets)	5.42	5.33	5.42	6.47	7.00	8.60
Open ASR rank	🥈 #2 (tied)	🥇 #1	🥈 #2 (tied)	#12	#8	—

Pulse Pro matches or beats every commercial API in this comparison on aggregate WER. It ties Cohere Transcribe at 5.42%: against Cohere, Pulse Pro wins outright on AMI (meetings) and SPGISpeech (financial), while Cohere edges ahead on read speech (LibriSpeech, TED-LIUM). AssemblyAI and ElevenLabs Scribe v2 are competitive on individual datasets but trail on aggregate.
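The "Average (8 datasets)" row is consistent with an unweighted mean of the eight per-dataset WERs. A minimal Python sketch that recomputes it from the table values for the top three models (the dictionary and function names are illustrative and not part of any Smallest AI tooling):

```python
from statistics import mean

# Per-dataset WER (%) copied from the head-to-head table, in row order:
# AMI, Earnings22, GigaSpeech, LibriSpeech clean, LibriSpeech other,
# SPGISpeech, TED-LIUM, VoxPopuli.
ESB_WER = {
    "Pulse Pro": [7.32, 9.04, 9.52, 1.73, 3.74, 2.04, 3.68, 6.32],
    "Granite 4.1 2B": [8.09, 8.37, 9.80, 1.33, 2.50, 3.78, 3.07, 5.70],
    "Cohere Transcribe": [8.13, 10.86, 9.34, 1.25, 2.37, 3.08, 2.49, 5.87],
}


def esb_average(wers):
    """Unweighted mean WER across datasets, rounded to two decimals."""
    return round(mean(wers), 2)


for model, wers in ESB_WER.items():
    print(f"{model}: {esb_average(wers):.2f}")
```

Running it reproduces the 5.42, 5.33, and 5.42 averages shown in the table.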

Position on the public leaderboard

Sorted by ESB average WER. Source: HF Open ASR Leaderboard.

Rank	Model	ESB Avg WER ↓
1	IBM Granite Speech 4.1 2B	5.33
2	Pulse Pro	5.42
2	Cohere Labs Transcribe (tied)	5.42
3	Zoom Scribe v1	5.47
4	IBM Granite Speech 4.0 1B	5.52
5	NVIDIA Canary Qwen 2.5B	5.63
8	ElevenLabs Scribe v2	5.83
12	AssemblyAI Universal-3 Pro	6.21
18	Speechmatics Enhanced	6.91
23	OpenAI Whisper Large v3	7.44

FLEURS English

Metric	Pulse Pro
WER (FLEURS en_us)	3.92%
CER (FLEURS en_us)	1.73%

Per-language FLEURS tables for the broader European and Indic sets are tracked on standard Pulse.

Performance notes. Two caveats to keep in mind when setting expectations:

RTFx hardware reference: the public leaderboard measures throughput on A100-80GB. Pulse Pro’s published 250–300× was measured on L40S; the recommended L4 deployment delivers lower throughput than L40S, and A100 delivers higher. Re-benchmark on your target GPU before locking SLOs.
Long-form single-file RTFx is lower than batched. On a challenging 1.92-hour Earnings22 sample we measured 68×. The 250–300× headline assumes optimal batching of typical-length audio. Plan for the lower bound on single very-long-form files.
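To re-benchmark on your own GPU, one option is to time a full transcription job and divide the audio duration by the wall-clock time. The sketch below assumes RTFx means seconds of audio processed per second of wall-clock time; `transcribe` is a hypothetical placeholder for whatever runs your job end to end, not a Smallest AI function.

```python
import time
from typing import Callable


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio per wall-clock second."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def measure_rtfx(
    transcribe: Callable[[], object],
    audio_seconds: float,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one call to `transcribe` (a placeholder) and return its RTFx."""
    start = clock()
    transcribe()
    elapsed = clock() - start
    return rtfx(audio_seconds, elapsed)


# Worked example from the note above: a 1.92-hour file processed at 68x.
audio_seconds = 1.92 * 3600
print(f"{audio_seconds / 68:.0f} s of processing at 68x")
```

At 68×, the 1.92-hour sample takes roughly 102 seconds of processing, which is the kind of lower bound to plan for on single very long files.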

Supported Languages

Pulse Pro is English-only. For multilingual transcription, use standard Pulse (17 streaming + 26 pre-recorded languages).

Language	Code
English	`en`

API Reference

Endpoint

Endpoint	Method	Use case
`https://api.smallest.ai/waves/v1/stt/?model=pulse-pro`	POST	Synchronous (or async via `webhook_url`) transcription
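As a rough illustration of calling this endpoint, the Python sketch below builds (but does not send) a POST request using only the standard library. Only the URL, the `model=pulse-pro` parameter, and the existence of `webhook_url` come from this page. The Bearer-token header, the raw-audio request body, the `audio/wav` content type, and passing `webhook_url` as a query parameter are all assumptions; check the API reference for the actual request schema.

```python
import urllib.parse
import urllib.request
from typing import Optional

STT_URL = "https://api.smallest.ai/waves/v1/stt/"


def build_transcription_request(
    audio: bytes,
    api_key: str,
    webhook_url: Optional[str] = None,
    content_type: str = "audio/wav",
) -> urllib.request.Request:
    """Build a POST request for Pulse Pro without sending it."""
    params = {"model": "pulse-pro"}
    if webhook_url is not None:
        # Assumption: webhook_url is accepted as a query parameter.
        params["webhook_url"] = webhook_url
    url = STT_URL + "?" + urllib.parse.urlencode(params)
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumption: Bearer auth
        "Content-Type": content_type,  # assumption: raw audio body
    }
    return urllib.request.Request(url, data=audio, headers=headers, method="POST")


# To send the request:
#     with urllib.request.urlopen(request) as response:
#         body = response.read()
```

Building the request separately from sending it makes it easy to inspect the URL and headers before any audio leaves your machine.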

Throughput, Latency & Pricing

Mode	Throughput (RTFx, long-form)	2 hr file latency
No word timestamps	250–300×	~24–29 sec
With word timestamps	~200×	~36 sec

Customer pricing: $0.004 per minute of audio (Standard plan, non-streaming HTTP). Standard plan rate-limit defaults: 25 RPM per model. Enterprise tier is unlimited and configurable per-customer.
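The latency column follows from dividing audio duration by RTFx, and cost is a flat per-minute rate. A small Python sketch of both calculations, assuming latency is dominated by compute (upload time and queueing are ignored); the function names are illustrative:

```python
PRICE_PER_MINUTE_USD = 0.004  # Standard plan rate quoted on this page


def estimate_latency_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute-only latency: audio duration divided by the real-time factor."""
    return audio_seconds / rtfx


def estimate_cost_usd(
    audio_seconds: float, price_per_minute: float = PRICE_PER_MINUTE_USD
) -> float:
    """Cost of transcribing `audio_seconds` of audio at a per-minute rate."""
    return audio_seconds / 60 * price_per_minute


two_hours = 2 * 60 * 60  # seconds
for label, factor in [
    ("no timestamps, 300x", 300),
    ("no timestamps, 250x", 250),
    ("word timestamps, 200x", 200),
]:
    print(f"{label}: {estimate_latency_seconds(two_hours, factor):.1f} s")
print(f"cost: ${estimate_cost_usd(two_hours):.2f}")
```

For a 2-hour file this gives 24.0 s, 28.8 s, and 36.0 s, matching the table, at a cost of $0.48.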

Use Cases

Strong Fit	Not a Fit
High-volume English batch transcription (call centers, media archives, compliance)	Multilingual workloads — use Pulse instead
Meeting and financial audio (AMI, SPGISpeech leaderboard wins)	Live streaming or sub-100ms conversational AI
Regulated industries needing on-prem or VPC deployment	Audiobook / broadcast read-speech where Cohere or IBM Granite edge ahead
Customers at scale (>1M minutes/month)

Safety & Compliance

Pulse Pro must not be used for:

Recording or transcribing individuals without their explicit consent
Surveillance, stalking, or any form of unauthorized monitoring
Any illegal or unethical purposes

Additionally:

Usage is monitored for policy compliance
For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai

FAQ

Why Pulse Pro over Whisper Large v3?

Whisper Large v3 ranks 23rd on the Open ASR Leaderboard at 7.44% WER. Pulse Pro is tied for #2 at 5.42%, roughly a 27% relative WER improvement. Pulse Pro is also cheaper per minute than every hosted Whisper API.
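The 27% figure is a relative reduction, not a difference in percentage points. A short Python sketch of the arithmetic (the function name is illustrative):

```python
def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction of the baseline's WER removed by the candidate model."""
    return (baseline_wer - candidate_wer) / baseline_wer


# Whisper Large v3 (7.44%) vs Pulse Pro (5.42%), both from the leaderboard table.
print(f"{relative_wer_reduction(7.44, 5.42):.1%}")
```

The absolute gap is 2.02 points; dividing by the 7.44% baseline gives about 27%.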

Why Pulse Pro over IBM Granite Speech 4.1 2B?

Granite 4.1 2B is 0.09 WER points ahead on aggregate (5.33 vs 5.42). For most workloads the gap is operationally invisible. Pulse Pro is managed, hosted, and metered per minute. Granite is open-weights: self-hosting it has the same cost basis as our infrastructure, but you also take on deployment, autoscaling, and operations.

Why Pulse Pro over ElevenLabs Scribe v2?

Scribe v2 ranks 8th on the Open ASR Leaderboard at 5.83% WER, behind Pulse Pro by 0.41 points. The “Scribe v2 is #1 for accuracy” talking point comes from a different (smaller) benchmark. On the public, reproducible ESB benchmark Pulse Pro is more accurate and ~1,500× cheaper per minute.

Why Pulse Pro over AssemblyAI Universal-3 Pro?

Universal-3 Pro ranks 12th on ESB at 6.21% WER, behind Pulse Pro by 0.79 points. AssemblyAI is $3.50 per 1,000 minutes; Pulse Pro is $0.004/min ($4 per 1,000 minutes). Pulse Pro is more accurate at a comparable price.

Why English only?

Pulse Pro v4.1 is trained exclusively on English. For multilingual transcription, use standard Pulse (17 streaming + 26 pre-recorded languages). Pulse Pro and Pulse share the same `/waves/v1/stt/` endpoint, so adding multilingual capability is a one-line `?model=` swap.
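The swap can be sketched as a small routing helper in Python. The model identifiers `pulse-pro` and `pulse` and the endpoint URL come from this page; the helper itself and the example language code are illustrative. Any extra parameters standard Pulse may need, such as a language code, are not shown here.

```python
from urllib.parse import urlencode

STT_URL = "https://api.smallest.ai/waves/v1/stt/"


def stt_url(language: str) -> str:
    """Pick Pulse Pro for English audio and standard Pulse for anything else."""
    model = "pulse-pro" if language.lower() == "en" else "pulse"
    return f"{STT_URL}?{urlencode({'model': model})}"


print(stt_url("en"))
print(stt_url("de"))  # "de" is only an example of a non-English code
```

Because only the query string changes, the rest of the request-building code can stay the same for both models.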

Why are word timestamps slower than no-timestamps mode?

Word timestamps require an alignment pass after acoustic decoding. Pulse Pro currently runs alignment in the standard pipeline, which costs between one-fifth and one-third of overall throughput (~200× with timestamps vs 250–300× without). A vLLM-backed alignment port is in development to close this gap.

Why no streaming?

The streaming worker for Pulse Pro is on the roadmap but not yet deployed. Today, calls to `WS /waves/v1/stt/live?model=pulse-pro` return HTTP 400 before the WebSocket upgrade completes, with a message directing you to the HTTP endpoint. For streaming, use the standard Pulse model (`?model=pulse`).
