Pulse (Realtime)

Transcribe audio in real time over a persistent WebSocket. The fit-for-purpose path for live captioning, voice agents, and any flow where you need partial transcripts as the user is still speaking.

## When to use this

- **Use this** for live audio: microphone input, voice-agent turns, simultaneous interpretation, low-latency captioning. Partial results stream back while audio is still arriving.
- **Use the pre-recorded REST endpoint** (`POST /waves/v1/pulse/get_text`) when you have a complete file. Single request, single response, less plumbing.

## How it works

1. Open a WebSocket to `wss://api.smallest.ai/waves/v1/pulse/get_text` with `Authorization: Bearer <key>` and the session params (`language`, `sample_rate`, `encoding`, etc.) as a query string.
2. Stream raw PCM (or your chosen `encoding`) over the socket as binary frames.
3. The server pushes back JSON `transcriptionResponse` messages: partial results (`is_final: false`) as you speak, finalized text (`is_final: true`) when an utterance closes.
4. Send a `finalize` message to force end-of-utterance, or `close_stream` to end the session.
## Examples

**Python** (real-time mic input)

```python
import asyncio
import json
import os

import pyaudio
import websockets

API_KEY = os.environ["SMALLEST_API_KEY"]  # same variable the TypeScript example reads
URL = (
    "wss://api.smallest.ai/waves/v1/pulse/get_text"
    "?language=en&sample_rate=16000&encoding=linear16&word_timestamps=true"
)
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
CHUNK_FRAMES = 320  # 20 ms of audio at 16 kHz


async def stream_mic():
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=CHUNK_FRAMES)
    # additional_headers needs websockets >= 14; older releases call it extra_headers
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:

        async def send_audio():
            while True:
                # stream.read blocks, so run it in a thread to keep the event loop free
                chunk = await asyncio.to_thread(stream.read, CHUNK_FRAMES)
                await ws.send(chunk)

        async def recv_transcripts():
            async for msg in ws:
                data = json.loads(msg)
                tag = "FINAL " if data.get("is_final") else "partial"
                print(f"[{tag}] {data.get('transcript')}")

        await asyncio.gather(send_audio(), recv_transcripts())


asyncio.run(stream_mic())
```

**JavaScript / TypeScript** (using `ws`)

```typescript
import WebSocket from "ws";
import { readFileSync } from "node:fs";

const params = new URLSearchParams({
  language: "en",
  sample_rate: "16000",
  encoding: "linear16",
  word_timestamps: "true",
});

const ws = new WebSocket(`wss://api.smallest.ai/waves/v1/pulse/get_text?${params}`, {
  headers: { Authorization: `Bearer ${process.env.SMALLEST_API_KEY}` },
});

ws.on("open", () => {
  const audio = readFileSync("./call.pcm"); // 16-bit mono PCM at 16 kHz
  // Send the file in 3200-byte slices (100 ms of audio each)
  for (let i = 0; i < audio.length; i += 3200) {
    ws.send(audio.subarray(i, i + 3200));
  }
  ws.send(JSON.stringify({ type: "finalize" }));
});

ws.on("message", (raw) => {
  const data = JSON.parse(raw.toString());
  const tag = data.is_final ? "FINAL " : "partial";
  console.log(`[${tag}] ${data.transcript}`);
  if (data.is_last) ws.close();
});
```

## Common gotchas

- **Match `sample_rate` to your audio.** The server will not resample for you: sending 44.1 kHz audio with `sample_rate=16000` produces garbage transcripts.
- **`finalize` vs `close_stream`**: `finalize` ends the current utterance and triggers a final transcript without closing the session. `close_stream` ends the session entirely.
- **`keywords` is WebSocket-only.** Pass them on connect for proper-noun / jargon boosting; not available on the REST endpoint.
- **`format`/`punctuate`/`capitalize`** are accepted at the wire level today. They currently return the same transcript regardless of value. Pass them in your integration anyway so it keeps working as the behavior changes.
- **PII/PCI redaction (`redact_pii`, `redact_pci`)** runs server-side on finalized transcripts only, so partials may briefly show the unredacted text before being replaced.
- **JavaScript / TypeScript**: the official `smallestai` npm package predates the Pulse model, so connect with the `ws` library directly as shown above.

Handshake

WSS
wss://api.smallest.ai/waves/v1/pulse/get_text

Authentication

Authorization: Bearer

Header authentication of the form `Bearer <token>`.

Headers

`Authorization` (string, required)

Bearer token for authentication. Format: `Bearer YOUR_API_KEY`

Query parameters

`language` (enum, optional, defaults to `multi-eu`)

Language code for transcription. Set explicitly to the known language for best accuracy. Use `multi-eu` for unknown European-language audio (auto-detects across the European set: de, en, fr, it, nl, pt, ru, es). Use `multi` for full multilingual auto-detection across all supported languages. Omitting `language` routes to `multi-eu`, which can mis-detect on non-European audio (e.g., returning Russian for English input). Always pass `language` explicitly when the source language is known.

`encoding` (enum, optional, defaults to `linear16`)

Audio encoding of the bytes you stream over the socket. The server uses this to decode incoming frames, so set it to match what your client is sending.

- `linear16`, `linear32`: raw PCM (16-bit and 32-bit). Pair with the appropriate `sample_rate`.
- `alaw`, `mulaw`: 8 kHz telephony codecs. Pair with `sample_rate=8000`.
- `opus`, `ogg_opus`: Opus compressed audio (raw and Ogg container).

Streaming-only: the pre-recorded REST endpoint (`POST /waves/v1/pulse/get_text`) auto-detects the format from the file's container header and ignores this parameter.

`sample_rate` (enum, optional, defaults to `16000`)

Audio sample rate in Hz of the bytes you stream. Must match the actual rate of your audio source. Streaming-only: the pre-recorded REST endpoint reads the rate from the file's container.
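
As a worked example of how `encoding` and `sample_rate` together determine frame sizes, the sketch below computes how many bytes a chunk of raw PCM occupies for a given duration. The helper name `pcm_chunk_bytes` is hypothetical and not part of any SDK. It assumes mono audio by default, and the byte widths follow from the 16-bit and 32-bit PCM encodings listed above.

```python
# Bytes per sample for the raw PCM encodings (16-bit and 32-bit).
BYTES_PER_SAMPLE = {"linear16": 2, "linear32": 4}


def pcm_chunk_bytes(sample_rate: int, encoding: str, chunk_ms: int, channels: int = 1) -> int:
    """Return the size in bytes of chunk_ms milliseconds of raw PCM audio.

    Hypothetical helper for illustration; not part of the Pulse API.
    """
    frames = sample_rate * chunk_ms // 1000  # samples per channel in the chunk
    return frames * BYTES_PER_SAMPLE[encoding] * channels


print(pcm_chunk_bytes(16000, "linear16", 100))  # 3200
print(pcm_chunk_bytes(16000, "linear16", 20))   # 640
```

By this arithmetic, the 3200-byte slices in the TypeScript example correspond to 100 ms of 16 kHz `linear16` audio, and the 320-frame reads in the Python example to 20 ms.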

`word_timestamps` (enum, optional, defaults to `true`)

Include word-level timestamps in the transcription.

`sentence_timestamps` (enum, optional, defaults to `false`)

Include sentence-level timestamps (utterances) in the transcription.

`redact_pii` (enum, optional, defaults to `false`)

Redact personally identifiable information (name, surname, address, etc.).

`redact_pci` (enum, optional, defaults to `false`)

Redact payment card information (credit card, CVV, zip, account number, etc.).

`numerals` (enum, optional, defaults to `auto`)

Convert spoken numerals into digit form (e.g., 'twenty five' to '25'). `auto` enables automatic detection based on context. For new integrations we recommend `itn_normalize=true` instead: it covers digits as well as dates, currencies, phone numbers, and other spoken-form entities, and gives more consistent results across languages.

`format` (enum, optional, defaults to `true`)

Master formatting switch for transcript responses. When `false`, forces `punctuate=false` and `capitalize=false`, and also disables Inverse Text Normalization (ITN) so it cannot silently reintroduce punctuation or casing. When `true`, the `punctuate` and `capitalize` params take effect independently. Leave `format=true` and use those two to fine-tune.

`punctuate` (enum, optional, defaults to `true`)

When `false`, strips punctuation (`.`, `,`, `?`, `!`) from the `transcript`, `words[].word`, and `utterances[].transcript`. Does not affect casing; use `capitalize` for that. Overridden to `false` when `format=false`.

`capitalize` (enum, optional, defaults to `true`)

When `false`, lowercases the entire transcript output (final `transcript`, `words[].word`, and `utterances[].transcript`). Does not affect punctuation; use `punctuate` for that. Overridden to `false` when `format=false`.
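
The precedence between `format`, `punctuate`, and `capitalize` described above can be sketched as a small function. This is a client-side illustration of the documented rules only, not the server's implementation; `effective_formatting` and its argument `fmt` (standing in for the `format` parameter) are hypothetical names.

```python
def effective_formatting(fmt: bool = True, punctuate: bool = True,
                         capitalize: bool = True, itn_normalize: bool = False) -> dict:
    """Resolve which transformations apply, following the documented precedence.

    Illustrative only: `fmt` stands in for the `format` query parameter.
    """
    if not fmt:
        # format=false forces punctuate and capitalize off and disables ITN.
        return {"punctuate": False, "capitalize": False, "itn_normalize": False}
    # format=true: punctuate and capitalize take effect independently.
    return {"punctuate": punctuate, "capitalize": capitalize, "itn_normalize": itn_normalize}


print(effective_formatting(fmt=False, punctuate=True, itn_normalize=True))
print(effective_formatting(punctuate=False))
```

The first call shows that `punctuate=true` has no effect once `format=false`; the second shows punctuation being switched off on its own while casing is left untouched.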
`eou_timeout_ms` (string, optional, defaults to `800`)

Time in milliseconds to wait after speech ends before flushing the transcript.

`diarize` (enum, optional, defaults to `false`)

Enable speaker diarization to identify different speakers in the audio.

`keywords` (string, optional)

Boost recognition of specific words or phrases for this session. Useful for product names, jargon, proper nouns, and other domain-specific terms the model might otherwise mis-transcribe.

**Streaming (WebSocket) only.** Not supported on the HTTP `/pulse/get_text` endpoint.

**Format:** a single comma-separated string (not a JSON array). Each entry is a word or phrase, optionally followed by `:INTENSIFIER`, a numeric boost multiplier. Defaults to `1.0` when omitted.

**Example:** `I:20,smiling:26`

- Phrases can include spaces (`small language model:3.5`).
- Intensifier accepts integers or decimals (`2`, `2.5`, `0.5`).
- Mixing entries with and without intensifiers is fine.
- Maximum 100 keywords per session.

**INTENSIFIER range:** `0` to `20`. Recommended value is `6`. Higher values bias recognition more aggressively toward the keyword, but also **increase the risk of hallucination and repetition** in the transcript. Values of `10` or above are not recommended, because the model may insert the keyword even when it was not spoken. Start around `3–6` and tune from there.
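
The format rules above can be captured in a small builder that serializes entries and enforces the documented limits. This is a minimal sketch: `build_keywords` is a hypothetical helper, and rejecting phrases that contain `,` or `:` is an assumption made here because those characters act as separators in the format (the source does not say how the server treats them).

```python
from typing import List, Optional, Tuple


def build_keywords(entries: List[Tuple[str, Optional[float]]]) -> str:
    """Serialize (phrase, intensifier) pairs into the comma-separated keywords string.

    Hypothetical helper; pass None as the intensifier to omit it.
    """
    if len(entries) > 100:
        raise ValueError("at most 100 keywords per session")
    parts = []
    for phrase, boost in entries:
        # Assumption: separators inside a phrase would make the string ambiguous.
        if "," in phrase or ":" in phrase:
            raise ValueError(f"phrase may not contain ',' or ':': {phrase!r}")
        if boost is None:
            parts.append(phrase)
            continue
        if not 0 <= boost <= 20:
            raise ValueError(f"intensifier must be between 0 and 20, got {boost}")
        parts.append(f"{phrase}:{boost:g}")  # :g renders 6.0 as "6" and 3.5 as "3.5"
    return ",".join(parts)


print(build_keywords([("Smallest", 6), ("small language model", 3.5), ("Pulse", None)]))
# Smallest:6,small language model:3.5,Pulse
```

Because phrases can contain spaces, the resulting string should be URL-encoded when it is placed in the query string (for example with `urllib.parse.urlencode`).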
`itn_normalize` (enum, optional, defaults to `false`)

Enable Inverse Text Normalization to convert spoken-form entities (numbers, dates, currencies, phone numbers, etc.) into written form in finalized transcripts.

`finalize_on_words` (enum, optional, defaults to `true`)

When `false`, disables automatic word-count-based finalization. Use with `itn_normalize` for agentic pipelines where you control finalization via the `finalize` message.

`max_words` (integer, optional)

Maximum number of words before forced finalization. Useful for keeping ITN chunks short and accurate.
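
To show how these parameters end up on the wire, here is a minimal sketch that assembles the connection URL from keyword arguments. `build_pulse_url` is a hypothetical helper, not an SDK function. Serializing booleans as lowercase `true`/`false` follows the `word_timestamps=true` form used in the examples on this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/pulse/get_text"


def build_pulse_url(**params) -> str:
    """Build the WebSocket URL with session parameters as the query string.

    Hypothetical helper: parameters set to None are omitted so the
    server-side default applies.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = str(value)
    return f"{BASE_URL}?{urlencode(query)}"


url = build_pulse_url(
    language="en",
    sample_rate=16000,
    encoding="linear16",
    itn_normalize=True,
    finalize_on_words=False,  # finalization driven by explicit finalize messages
    max_words=None,           # omitted from the URL
)
print(url)
```

`urlencode` also percent-encodes the commas, colons, and spaces in a `keywords` value, so the same helper can carry that parameter.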

Send

`sendAudioData` (string, required, format: `binary`)

Stream audio data in chunks for real-time transcription.

OR

`sendFinalizeSignal` (object, required)

Force an immediate `is_final` transcript for pending speech without ending the session. Useful in agentic pipelines.

OR

`sendCloseStreamSignal` (object, required)

Signal that audio streaming is complete. The server flushes remaining audio, delivers final transcripts, and responds with `is_last=true`.

Receive

`receiveTranscription` (object, required)

Get real-time transcription results as audio is processed.
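
On the client side, incoming messages are typically folded into a running transcript. The sketch below is one way to do that using only the fields that appear on this page (`transcript`, `is_final`, `is_last`). `TranscriptBuffer` is a hypothetical class, and it assumes each partial supersedes the previous partial for the utterance in progress.

```python
import json
from typing import List


class TranscriptBuffer:
    """Collect finalized segments and track the latest partial (illustrative)."""

    def __init__(self) -> None:
        self.finals: List[str] = []
        self.partial = ""
        self.done = False

    def feed(self, raw: str) -> None:
        """Consume one JSON message received from the socket."""
        data = json.loads(raw)
        text = data.get("transcript") or ""
        if data.get("is_final"):
            self.finals.append(text)
            self.partial = ""      # the utterance is closed
        else:
            self.partial = text    # assumption: a new partial replaces the old one
        if data.get("is_last"):
            self.done = True       # server has flushed everything after close_stream

    def text(self) -> str:
        """Finalized text plus whatever partial is currently pending."""
        return " ".join(s for s in [*self.finals, self.partial] if s)


buf = TranscriptBuffer()
buf.feed('{"transcript": "hello", "is_final": false}')
buf.feed('{"transcript": "hello world", "is_final": true}')
buf.feed('{"transcript": "how are", "is_final": false}')
print(buf.text())  # hello world how are
```

Keeping finalized and partial text separate also fits the redaction note in the gotchas: only finalized segments should be stored or logged when `redact_pii` or `redact_pci` is enabled.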