Pulse Pro
Pulse Pro is the premium Speech-to-Text model in the Pulse family, built for English transcription where accuracy matters more than streaming. It is tied for #2 on the public Open ASR Leaderboard (5.42% average WER), ahead of ElevenLabs Scribe v2, AssemblyAI Universal-3 Pro, Speechmatics Enhanced, and every Whisper variant. It handles pre-recorded audio only and has no streaming worker. Use standard Pulse for live streaming or multilingual audio.
- 5.42% average WER: Open ASR Leaderboard average, English
- 250–300× real-time: long-form transcription, no timestamps
- Pre-recorded HTTP transport
- $0.004 per minute: customer rate, non-streaming
Model Overview
How to use it
Pulse Pro is selected via the model query parameter on the unified Speech-to-Text endpoint.
Replace the inline sample URL with --data-binary "@./your.wav" to send a local file.
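The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the host in `BASE_URL` is a placeholder (this page names only the path `/waves/v1/stt/`), the bearer-token header is an assumed auth scheme, and `build_request` and `transcribe_file` are hypothetical helper names.

```python
import urllib.parse
import urllib.request

# Assumption: placeholder host. This page only names the path /waves/v1/stt/.
BASE_URL = "https://api.example.com/waves/v1/stt/"


def build_request(audio: bytes, api_key: str, model: str = "pulse-pro") -> urllib.request.Request:
    """Build (without sending) a POST that carries raw audio bytes."""
    query = urllib.parse.urlencode({"model": model})
    return urllib.request.Request(
        url=f"{BASE_URL}?{query}",
        data=audio,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            # Assumption: bearer-token auth; confirm the scheme for your account.
            "Authorization": f"Bearer {api_key}",
        },
    )


def transcribe_file(path: str, api_key: str) -> str:
    """Send a local file and return the raw response body as text."""
    with open(path, "rb") as f:
        request = build_request(f.read(), api_key)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before any audio leaves your machine.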
Sample response:
For long files where you do not want to hold the HTTP connection open, pass a webhook_url:
The endpoint returns 200 immediately with {"status": "processing", "request_id": "..."}. The webhook receives the full transcription payload when ready.
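A rough Python sketch of the asynchronous flow follows. It assumes `webhook_url` travels as a query parameter next to `model` (this page does not say whether it belongs in the query string or the body), uses a placeholder host, and the helper names `build_async_url` and `parse_ack` are hypothetical.

```python
import json
import urllib.parse

# Assumption: placeholder host. This page only names the path /waves/v1/stt/.
BASE_URL = "https://api.example.com/waves/v1/stt/"


def build_async_url(webhook_url: str, model: str = "pulse-pro") -> str:
    """Return the request URL for a webhook-backed transcription."""
    # Assumption: webhook_url is sent as a query parameter alongside model.
    query = urllib.parse.urlencode({"model": model, "webhook_url": webhook_url})
    return f"{BASE_URL}?{query}"


def parse_ack(body: str) -> str:
    """Return the request_id from the immediate 200 acknowledgement."""
    ack = json.loads(body)
    if ack.get("status") != "processing":
        raise ValueError(f"unexpected status: {ack.get('status')!r}")
    return ack["request_id"]
```

Storing the returned `request_id` lets you match the later webhook delivery to the file you submitted.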
Pulse Pro has no streaming worker. Calls to WS /waves/v1/stt/live?model=pulse-pro return 400 with a clear message. For live transcription use standard Pulse (?model=pulse).
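If your application switches between live and pre-recorded transcription, a small client-side guard can catch the unsupported combination before a request is made. The sketch below is illustrative only: `stt_path` is a hypothetical helper, and the paths are the ones quoted on this page.

```python
def stt_path(model: str, live: bool) -> str:
    """Pick the endpoint path, refusing the combination the service rejects."""
    if live and model == "pulse-pro":
        # Mirrors the documented 400 response without a network round trip.
        raise ValueError("pulse-pro has no streaming worker; use model='pulse' for live audio")
    path = "/waves/v1/stt/live" if live else "/waves/v1/stt/"
    return f"{path}?model={model}"
```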
Key Capabilities
Tied for #2 on the public Open ASR Leaderboard at 5.42% average WER. Outranks every commercial English STT API in our accuracy band.
Best-in-class on AMI (meetings, 7.32% WER) and SPGISpeech (financial, 2.04% WER). These are workloads enterprise customers actually have.
250–300× real-time factor on long-form audio without timestamps. Around 200× with word timestamps enabled.
Per-word timing on every response. Enabling timestamps reduces throughput by roughly 20–33% versus no-timestamps mode (about 200× versus 250–300×).
Multi-speaker identification with per-word speaker labels.
Pass webhook_url to offload long-file transcription; receive results on your callback when ready.
Performance & Benchmarks
Pulse Pro is evaluated on the public Open ASR Leaderboard (ESB benchmark, Whisper EnglishTextNormalizer) and FLEURS English.
Open ASR Leaderboard, head-to-head
WER % on the eight ESB datasets. Bold = winner per row. Whisper EnglishTextNormalizer, normalized.
Pulse Pro and Cohere Transcribe are a statistical tie on aggregate WER. Pulse Pro leads on conversational and financial workloads (AMI, SPGISpeech); Cohere edges ahead on read speech (LibriSpeech, TED-LIUM).
Position on the public leaderboard
Sorted by ESB average WER. Source: HF Open ASR Leaderboard.
FLEURS English
Per-language FLEURS tables for the broader European and Indic sets are tracked on standard Pulse.
Performance notes. Two caveats that matter for accurate expectation-setting:
- RTFx hardware reference: the public leaderboard measures throughput on A100-80GB. Pulse Pro’s published 250–300× was measured on L40S; the recommended L4 deployment delivers lower throughput than L40S, and A100 delivers higher. Re-benchmark on your target GPU before locking SLOs.
- Long-form single-file RTFx is lower than batched. On a challenging 1.92-hour Earnings22 sample we measured 68×. The 250–300× headline assumes optimal batching of typical-length audio. Plan for the lower bound on single very-long-form files.
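To see what these figures mean in wall-clock terms, divide the audio duration by the real-time factor. The sketch below applies that arithmetic to the 1.92-hour sample; `processing_seconds` is a hypothetical helper, and the results are estimates from the RTFx values quoted above, not measured latencies.

```python
def processing_seconds(audio_hours: float, rtfx: float) -> float:
    """Wall-clock seconds to transcribe audio at a given real-time factor."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_hours * 3600 / rtfx


scenarios = [
    ("single long file, measured 68x", 68),
    ("batched, 250x headline", 250),
    ("batched, 300x headline", 300),
]
for label, rtfx in scenarios:
    print(f"{label}: {processing_seconds(1.92, rtfx):.1f} s")
# single long file, measured 68x: 101.6 s
# batched, 250x headline: 27.6 s
# batched, 300x headline: 23.0 s
```

The single-file case takes roughly four times as long as the headline range suggests, which is why SLOs for very long recordings should be sized against the lower figure.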
Throughput, latency, and pricing
Customer pricing: $0.004 per minute of audio (Standard plan, non-streaming HTTP). Standard plan rate-limit defaults: 25 RPM per model. Enterprise tier is unlimited and configurable per-customer.
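For budgeting, the Standard plan figures above reduce to two simple calculations. This is a back-of-the-envelope sketch with hypothetical helper names; it ignores any volume discounts or Enterprise terms, which this page does not specify.

```python
PRICE_PER_MINUTE_USD = 0.004  # Standard plan, non-streaming HTTP
REQUESTS_PER_MINUTE = 25      # Standard plan default, per model


def audio_cost_usd(audio_minutes: float) -> float:
    """Cost of transcribing the given number of audio minutes."""
    return audio_minutes * PRICE_PER_MINUTE_USD


def min_submit_minutes(file_count: int) -> float:
    """Lower bound on minutes needed to submit file_count requests at the RPM cap."""
    return file_count / REQUESTS_PER_MINUTE


print(f"${audio_cost_usd(1_000_000):,.2f}")          # $4,000.00
print(f"{min_submit_minutes(10_000):.0f} minutes")   # 400 minutes
```

At the default rate limit, request count rather than audio volume can become the bottleneck for archives made of many short files.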
Supported Languages
Pulse Pro is English-only. For multilingual transcription, use standard Pulse (17 streaming + 26 pre-recorded languages).
API Reference
Endpoint
Query parameters
Request body
- Raw audio bytes: Content-Type: application/octet-stream
- Audio-by-URL is not supported on Pulse Pro. For URL-based input use standard Pulse.
Use Cases
Strong fit
- High-volume English batch transcription (call center QA, meeting platforms, media archives, compliance audits)
- Meeting and financial audio workloads where Pulse Pro leads the leaderboard (AMI, SPGISpeech)
- Regulated industries needing on-prem or VPC deployment
- Customers with budget pressure at scale (>1M minutes per month)
Not a fit
- Multilingual workloads; use standard Pulse (17 streaming + 26 pre-recorded languages)
- Live streaming or sub-100ms conversational AI; use standard Pulse streaming
- Audiobook or broadcast read-speech transcription where Cohere Transcribe and IBM Granite edge ahead on LibriSpeech and TED-LIUM
FAQ
Why Pulse Pro over Whisper Large v3?
Whisper Large v3 ranks 23rd on the Open ASR Leaderboard at 7.44% WER. Pulse Pro is tied for #2 at 5.42%, roughly a 27% relative WER improvement. Pulse Pro is also cheaper per minute than every hosted Whisper API.
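The relative improvement is the absolute WER gap divided by the baseline WER. A short sketch of that arithmetic, using the two leaderboard figures quoted above (`relative_wer_reduction` is a hypothetical helper name):

```python
def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction by which the candidate lowers WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - candidate_wer) / baseline_wer


# Whisper Large v3 (7.44%) as baseline, Pulse Pro (5.42%) as candidate.
print(f"{relative_wer_reduction(7.44, 5.42):.1%}")  # 27.2%
```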
Why Pulse Pro over Cohere Transcribe?
Pulse Pro and Cohere Transcribe are tied on aggregate ESB WER (both 5.42%). Pulse Pro wins on AMI (meetings, 7.32 vs 8.13) and SPGISpeech (financial, 2.04 vs 3.08); Cohere wins on LibriSpeech and TED-LIUM (read speech). Pulse Pro ships as a managed API at $0.004/min; Cohere is open-weights and requires you to self-host.
Why Pulse Pro over IBM Granite Speech 4.1 2B?
Granite 4.1 2B is 0.09 WER points ahead on aggregate (5.33 vs 5.42). For most workloads the gap is operationally invisible. Pulse Pro is managed, hosted, and metered per minute. Granite is open-weights: the underlying infrastructure cost is comparable to ours, but you take on deployment, autoscaling, and operations yourself.
Why Pulse Pro over ElevenLabs Scribe v2?
Scribe v2 ranks 8th on the Open ASR Leaderboard at 5.83% WER, behind Pulse Pro by 0.41 points. The “Scribe v2 is #1 for accuracy” talking point comes from a different (smaller) benchmark. On the public, reproducible ESB benchmark Pulse Pro is more accurate and ~1,500× cheaper per minute.
Why Pulse Pro over AssemblyAI Universal-3 Pro?
Universal-3 Pro ranks 12th on ESB at 6.21% WER, behind Pulse Pro by 0.79 points. AssemblyAI is priced at $0.004/min ($4 per 1,000 minutes). Pulse Pro is more accurate at a comparable price.
Why Pulse Pro over NVIDIA Parakeet TDT?
Parakeet TDT 0.6B v3 runs at ~3,300× RTFx on A100, roughly 11–13× the published Pulse Pro throughput of 250–300× (which was measured on L40S, so the comparison is not like-for-like). But it ranks 13th on ESB at 6.32% WER, behind Pulse Pro by 0.90 WER points. For pure overnight bulk transcription where throughput dominates, Parakeet is competitive. For accuracy-sensitive workloads (meetings, finance, compliance), the WER gap matters.
Why English only?
Pulse Pro v4.1 is trained exclusively on English. For multilingual transcription, use standard Pulse (17 streaming + 26 pre-recorded languages). Pulse Pro and Pulse share the same /waves/v1/stt/ endpoint, so adding multilingual capability is a one-line ?model= swap.
Why are word timestamps slower than no-timestamps mode?
Word timestamps require an alignment pass after acoustic decoding. Pulse Pro currently runs alignment in the standard pipeline, which costs between one-fifth and one-third of overall throughput (~200× with timestamps vs 250–300× without). A vLLM-backed alignment port is in development to close this gap.
Why no streaming?
The streaming worker for Pulse Pro is on the roadmap but not yet deployed. Today, calls to WS /waves/v1/stt/live?model=pulse-pro return 400 before the WebSocket upgrades, with a message directing you to the HTTP endpoint. For streaming use the standard Pulse model (?model=pulse).
Safety & Compliance
Pulse Pro must not be used for:
- Recording or transcribing individuals without their explicit consent
- Surveillance, stalking, or any form of unauthorized monitoring
- Any illegal or unethical purposes
Additionally:
- Usage is monitored for policy compliance
- For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai

