Pulse Pro is the premium Speech-to-Text model in the Pulse family. Built for English transcription where accuracy matters more than streaming. Tied for #2 on the public Open ASR Leaderboard (5.42% average WER), beating ElevenLabs Scribe v2, AssemblyAI Universal-3 Pro, Speechmatics Enhanced, and every Whisper variant. Pre-recorded only, with no streaming worker. Use standard Pulse for live streaming or multilingual audio.
Open ASR Leaderboard average, English
Long-form transcription, no timestamps
Pre-recorded HTTP transport
Customer rate, non-streaming
Pulse Pro is selected via the model query parameter on the unified Speech-to-Text endpoint.
Replace the inline sample URL with --data-binary "@./your.wav" to send a local file.
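The same upload can be sketched in Python. This is a minimal sketch under stated assumptions, not the official client: the host `https://api.example.com` is a placeholder, the exact path under `/waves/v1/stt/` and the bearer-token `Authorization` header are assumptions, and `build_transcription_request` is a hypothetical helper. It only builds the request; sending it is left to the caller.

```python
import urllib.parse
import urllib.request

# Placeholder host and assumed path; substitute the values for your account.
BASE_URL = "https://api.example.com/waves/v1/stt/"


def build_transcription_request(audio: bytes, api_key: str,
                                model: str = "pulse-pro") -> urllib.request.Request:
    """Build (but do not send) a POST that uploads raw audio bytes."""
    # The model is selected through the ?model= query parameter.
    query = urllib.parse.urlencode({"model": model})
    return urllib.request.Request(
        url=f"{BASE_URL}?{query}",
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/octet-stream",  # raw audio bytes
        },
    )
```

To send it, read the file with `open("your.wav", "rb").read()`, build the request, and pass it to `urllib.request.urlopen`.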
Sample response:
For long files where you do not want to hold the HTTP connection open, pass a webhook_url:
Returns 200 immediately with {"status": "processing", "request_id": "..."}. The webhook receives the full transcription payload when ready.
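A minimal Python sketch of the asynchronous flow described above. It assumes `webhook_url` is sent as a query parameter next to `model` (the placement is not confirmed here), and `with_webhook` and `parse_ack` are hypothetical helper names. Only the acknowledgement fields `status` and `request_id` come from the response shown above.

```python
import json
import urllib.parse


def with_webhook(endpoint: str, webhook_url: str, model: str = "pulse-pro") -> str:
    """Append model and webhook_url as query parameters (assumed placement)."""
    query = urllib.parse.urlencode({"model": model, "webhook_url": webhook_url})
    return f"{endpoint}?{query}"


def parse_ack(body: str) -> str:
    """Return the request_id from the immediate acknowledgement body."""
    ack = json.loads(body)
    if ack.get("status") != "processing":
        raise ValueError(f"unexpected status: {ack.get('status')!r}")
    return ack["request_id"]
```

Storing the returned `request_id` gives you a key for correlating the later webhook delivery, assuming the callback payload carries the same identifier.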
Pulse Pro has no streaming worker. Calls to WS /waves/v1/stt/live?model=pulse-pro return 400 with a clear message. For live transcription use standard Pulse (?model=pulse).
Tied for #2 on the public Open ASR Leaderboard at 5.42% average WER. Outranks every commercial English STT API in our accuracy band.
Best-in-class on AMI (meetings, 7.32% WER) and SPGISpeech (financial, 2.04% WER). These are workloads enterprise customers actually have.
250–300× real-time factor on long-form audio without timestamps. Around 200× with word timestamps enabled.
Per-word timing on every response when word timestamps are enabled. Enabling them costs roughly one-third of throughput compared with no-timestamps mode.
Multi-speaker identification with per-word speaker labels.
Pass webhook_url to offload long-file transcription; receive results on your callback when ready.
Pulse Pro is evaluated on the public Open ASR Leaderboard (ESB benchmark, Whisper EnglishTextNormalizer) and FLEURS English.
WER % on the eight ESB datasets. Bold = winner per row. Whisper EnglishTextNormalizer, normalized.
Pulse Pro and Cohere Transcribe are a statistical tie on aggregate WER. Pulse Pro leads on conversational and financial workloads (AMI, SPGISpeech); Cohere edges ahead on read speech (LibriSpeech, TED-LIUM).
Sorted by ESB average WER. Source: HF Open ASR Leaderboard.
Per-language FLEURS tables for the broader European and Indic sets are tracked on standard Pulse.
Performance notes. Two caveats that matter for accurate expectation-setting:
Customer pricing: $0.004 per minute of audio (Standard plan, non-streaming HTTP). Standard plan rate-limit defaults: 25 RPM per model. Enterprise tier is unlimited and configurable per-customer.
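The pricing and rate-limit figures above translate into a quick budgeting calculation. The sketch below assumes billing is pro-rated by the second, which is not stated here; `estimate_cost` and `min_request_interval` are hypothetical helper names.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.004")  # USD, Standard plan, non-streaming HTTP
STANDARD_RPM = 25                    # default Standard plan limit per model


def estimate_cost(audio_seconds: float) -> Decimal:
    """Cost in USD, assuming per-second pro-rating (an assumption)."""
    minutes = Decimal(str(audio_seconds)) / Decimal(60)
    return (minutes * PRICE_PER_MINUTE).quantize(Decimal("0.0001"))


def min_request_interval(rpm: int = STANDARD_RPM) -> float:
    """Seconds to wait between requests to stay at or under the limit."""
    return 60.0 / rpm
```

Under these assumptions, one hour of audio costs $0.24, 1,000 minutes cost $4.00, and a client on the default limit should space requests at least 2.4 seconds apart.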
Pulse Pro is English-only. For multilingual transcription, use standard Pulse (38 languages, streaming + non-streaming).
Whisper Large v3 ranks 23rd on the Open ASR Leaderboard at 7.44% WER. Pulse Pro is tied for #2 at 5.42%, roughly a 27% relative WER improvement. Pulse Pro is also cheaper per minute than every hosted Whisper API.
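The relative improvement is the absolute WER gap divided by the baseline WER. A short Python check of that arithmetic, with `relative_wer_reduction` as a hypothetical helper name:

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Percentage by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Whisper Large v3 (7.44% WER) as baseline, Pulse Pro (5.42% WER) as the new model.
improvement = relative_wer_reduction(7.44, 5.42)
```

Here the absolute gap is 2.02 WER points, and 2.02 / 7.44 is about 0.27, which is where the 27% figure comes from.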
Pulse Pro and Cohere Transcribe are tied on aggregate ESB WER (both 5.42%). Pulse Pro wins on AMI (meetings, 7.32 vs 8.13) and SPGISpeech (financial, 2.04 vs 3.08); Cohere wins on LibriSpeech and TED-LIUM (read speech). Pulse Pro ships as a managed API at $0.004/min; Cohere is open-weights and requires you to self-host.
Granite 4.1 2B is 0.09 WER points ahead on aggregate (5.33 vs 5.42). For most workloads the gap is operationally invisible. Pulse Pro is managed, hosted, and metered per minute. Granite is open-weights: self-hosting it has the same compute cost basis as our infrastructure, but you also take on deployment, autoscaling, and operations.
Scribe v2 ranks 8th on the Open ASR Leaderboard at 5.83% WER, behind Pulse Pro by 0.41 points. The “Scribe v2 is #1 for accuracy” talking point comes from a different (smaller) benchmark. On the public, reproducible ESB benchmark Pulse Pro is more accurate and ~1,500× cheaper per minute.
Universal-3 Pro ranks 12th on ESB at 6.21% WER, behind Pulse Pro by 0.79 points. AssemblyAI is $0.004/min ($4 per 1,000 minutes). Pulse Pro is more accurate at a comparable price.
Parakeet TDT 0.6B v3 runs at ~3,300× RTFx on A100, roughly 10× the published Pulse Pro throughput. But it ranks 13th on ESB at 6.32% WER, behind Pulse Pro by 0.90 WER points. For pure overnight bulk transcription where throughput dominates, Parakeet is competitive. For accuracy-sensitive workloads (meetings, finance, compliance), the WER gap matters.
Pulse Pro v4.1 is trained exclusively on English. For multilingual transcription, use standard Pulse (38 languages with streaming + non-streaming). Pulse Pro and Pulse share the same /waves/v1/stt/ endpoint, so adding multilingual capability is a one-line ?model= swap.
Word timestamps require an alignment pass after acoustic decoding. Pulse Pro currently runs alignment in the standard pipeline, which costs roughly one-third of overall throughput (~200× with timestamps vs 250–300× without). A vLLM-backed alignment port is in development to close this gap.
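To turn these real-time factors into wall-clock expectations, divide the audio duration by the factor. The sketch below covers model compute time only; upload, queueing, and network time are not included, and `processing_seconds` is a hypothetical helper name.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """RTFx is audio duration divided by processing time, so invert it."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


ONE_HOUR = 3600.0
without_timestamps = (processing_seconds(ONE_HOUR, 300), processing_seconds(ONE_HOUR, 250))
with_timestamps = processing_seconds(ONE_HOUR, 200)
```

For one hour of audio this gives roughly 12 to 14.4 seconds without timestamps and about 18 seconds with them.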
The streaming worker for Pulse Pro is on the roadmap but not yet deployed. Today, calls to WS /waves/v1/stt/live?model=pulse-pro return 400 before the WebSocket upgrades, with a message directing you to the HTTP endpoint. For streaming use the standard Pulse model (?model=pulse).
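The routing guidance above (English and pre-recorded goes to Pulse Pro, everything else to standard Pulse) can be captured in a small client-side helper. This is a sketch, not part of any SDK: `choose_model` is a hypothetical name, and it assumes language tags such as `en` or `en-US`.

```python
def choose_model(language: str, streaming: bool) -> str:
    """Pick the ?model= value: pulse-pro only for English, pre-recorded audio."""
    # Treat "en", "en-US", "EN-gb", etc. as English (assumed tag format).
    is_english = language.lower().split("-")[0] == "en"
    if is_english and not streaming:
        return "pulse-pro"
    # Live streaming and non-English audio go to standard Pulse.
    return "pulse"
```

Selecting the model client-side avoids sending a WebSocket request that would be rejected with a 400 before the upgrade.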
Pulse Pro must not be used for:
Additionally: