Pulse (Realtime)
Pulse (Realtime)
Pulse (Realtime)
Transcribe audio in real time over a persistent WebSocket. The fit-for-purpose path for live captioning, voice agents, and any flow where you need partial transcripts as the user is still speaking.
POST /waves/v1/pulse/get_text) when you have a complete file. Single request, single response, less plumbing.wss://api.smallest.ai/waves/v1/pulse/get_text with Authorization: Bearer <key> and the session params (language, sample_rate, encoding, etc.) as query string.encoding) over the socket as binary frames.transcriptionResponse messages — partial results (is_final: false) as you speak, finalized text (is_final: true) when an utterance closes.finalize message to force end-of-utterance, or close_stream to end the session.Python (real-time mic input)
JavaScript / TypeScript (using ws)
sample_rate to your audio. The server will not resample for you — sending 44.1 kHz audio with sample_rate=16000 produces garbage transcripts.finalize vs close_stream: finalize ends the current utterance and triggers a final transcript without closing the session. close_stream ends the session entirely.keywords is WebSocket-only. Pass them on connect for proper-noun / jargon boosting; not available on the REST endpoint.format/punctuate/capitalize are accepted at the wire level today. They currently return the same transcript regardless of value — pass them in your integration so it works as the behavior changes.redact_pii, redact_pci) runs server-side on finalized transcripts only — partials may show the unredacted text briefly before being replaced.smallestai npm package predates the Pulse model, so connect with the ws library directly as shown above.Header authentication of the form Bearer <token>
Bearer token for authentication. Format: Bearer YOUR_API_KEY
Audio sample rate in Hz of the bytes you stream. Must match the actual rate of your audio source. Streaming-only — the pre-recorded REST endpoint reads the rate from the file’s container.
Include word-level timestamps in transcription
Include sentence-level timestamps (utterances) in transcription
Redact personally identifiable information (name, surname, address, etc)
Redact payment card information (credit card, CVV, zip, account number, etc)
When false, strips end-of-sentence punctuation (., ,, ?, !)
from the transcript, words[].word, and utterances[].transcript.
Does not affect casing — use capitalize for that. Overridden to
false when format=false.
When false, lowercases the entire transcript output (final
transcript, words[].word, and utterances[].transcript). Does
not affect punctuation — use punctuate for that. Overridden to
false when format=false.
Enable Inverse Text Normalization to convert spoken-form entities (numbers, dates, currencies, phone numbers, etc.) into written form in finalized transcripts.
When false, disables automatic word-count-based finalization. Use with itn_normalize for agentic pipelines where you control finalization via the finalize message.
Stream audio data in chunks for real-time transcription
Force an immediate is_final transcript for pending speech without ending the session. Useful in agentic pipelines.
Signal that audio streaming is complete. The server flushes remaining audio, delivers final transcripts, and responds with is_last=true.
Get real-time transcription results as audio is processed
Language code for transcription. Set explicitly to the known language for best accuracy.
Use multi-eu for unknown European-language audio (auto-detects
across the European set: de, en, fr, it, nl, pt, ru, es). Use
multi for full multilingual auto-detection across all supported
languages.
Omitting language routes to multi-eu, which can mis-detect on
non-European audio (e.g., returning Russian for English input).
Always pass language explicitly when the source language is known.
Audio encoding of the bytes you stream over the socket. The server uses this to decode incoming frames — set it to match what your client is sending.
linear16, linear32 — raw PCM (16-bit and 32-bit). Pair
with the appropriate sample_rate.alaw, mulaw — 8 kHz telephony codecs. Pair with
sample_rate=8000.opus, ogg_opus — Opus compressed audio (raw and Ogg
container).Streaming-only — the pre-recorded REST endpoint
(POST /pulse/get_text) auto-detects the format from the
file’s container header and ignores this parameter.
Master formatting switch for transcript responses. When false,
forces punctuate=false, capitalize=false, and also disables
Inverse Text Normalization (ITN) so it cannot silently reintroduce
punctuation or casing.
When true, the punctuate and capitalize params take effect
independently. Leave format=true and use those two to fine-tune.
Boost recognition of specific words or phrases for this session.
Useful for product names, jargon, proper nouns, and other
domain-specific terms the model might otherwise mis-transcribe.
Streaming (WebSocket) only — not supported on the HTTP
/pulse/get_text endpoint.
Format: a single comma-separated string (not a JSON array).
Each entry is a word or phrase, optionally followed by
:INTENSIFIER — a numeric boost multiplier. Defaults to 1.0
when omitted.
Example: I:20,smiling:26
small language model:3.5).2, 2.5, 0.5).INTENSIFIER range: 0 to 20. Recommended value is 6.
Higher values bias recognition more aggressively toward the
keyword, but also increase the risk of hallucination and
repetition in the transcript. Values of 10 or above are
not recommended — the model may insert the keyword even when
it was not spoken. Start around 3–6 and tune from there.