Pulse (Realtime)

WSS wss://api.smallest.ai/waves/v1/pulse/get_text

Handshake

URL: wss://api.smallest.ai/waves/v1/pulse/get_text
Method: GET
Status: 101 Switching Protocols
Transcribe audio in real time over a persistent WebSocket. Use this endpoint for live captioning, voice agents, and any flow that needs partial transcripts while the user is still speaking.

When to use this

  • Use this for live audio: microphone input, voice-agent turns, simultaneous interpretation, low-latency captioning. Partial results stream back while audio is still arriving.
  • Use the pre-recorded REST endpoint (POST /waves/v1/pulse/get_text) when you have a complete file. Single request, single response, less plumbing.

How it works

  1. Open a WebSocket to wss://api.smallest.ai/waves/v1/pulse/get_text with the header Authorization: Bearer <key> and the session params (language, sample_rate, encoding, etc.) in the query string.
  2. Stream raw PCM (or your chosen encoding) over the socket as binary frames.
  3. The server pushes back JSON transcriptionResponse messages — partial results (is_final: false) as you speak, finalized text (is_final: true) when an utterance closes.
  4. Send a finalize message to force end-of-utterance, or close_stream to end the session.
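
Step 1 can be sketched as follows. This is a minimal illustration that assumes the session parameters are passed under the names listed in the parameter reference on this page; `build_pulse_url` is a hypothetical helper name, not part of any SDK.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/pulse/get_text"

def build_pulse_url(**params):
    """Return the WebSocket URL with session params encoded as a query string.

    Booleans are lowercased ("true"/"false") to match the style used in the
    examples on this page; parameters set to None are skipped.
    """
    query = {}
    for key, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[key] = str(value)
    return f"{BASE_URL}?{urlencode(query)}" if query else BASE_URL

url = build_pulse_url(language="en", sample_rate=16000,
                      encoding="linear16", word_timestamps=True)
print(url)
```

Using `urlencode` also takes care of escaping values that contain spaces or commas, such as a `keywords` string.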

Examples

Python (real-time mic input)

import asyncio, json, os
import pyaudio
import websockets

URL = "wss://api.smallest.ai/waves/v1/pulse/get_text?language=en&sample_rate=16000&encoding=linear16&word_timestamps=true"
# Read the key from the environment (same variable name as the JavaScript example).
HEADERS = {"Authorization": f"Bearer {os.environ['SMALLEST_API_KEY']}"}

async def stream_mic():
    pa = pyaudio.PyAudio()
    # 16-bit mono capture at 16 kHz; 320 frames is 20 ms of audio per chunk.
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=320)
    # additional_headers is the keyword in recent websockets releases (older ones used extra_headers).
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        async def send_audio():
            while True:
                # stream.read blocks, so run it in a worker thread to keep the event loop free.
                await ws.send(await asyncio.to_thread(stream.read, 320))

        async def recv_transcripts():
            async for msg in ws:
                data = json.loads(msg)
                tag = "FINAL " if data.get("is_final") else "partial"
                print(f"[{tag}] {data.get('transcript')}")

        await asyncio.gather(send_audio(), recv_transcripts())

asyncio.run(stream_mic())

JavaScript / TypeScript (using ws)

import WebSocket from "ws";
import { readFileSync } from "node:fs";

const params = new URLSearchParams({
  language: "en", sample_rate: "16000", encoding: "linear16", word_timestamps: "true",
});
const ws = new WebSocket(`wss://api.smallest.ai/waves/v1/pulse/get_text?${params}`, {
  headers: { Authorization: `Bearer ${process.env.SMALLEST_API_KEY}` },
});

ws.on("open", () => {
  const audio = readFileSync("./call.pcm"); // 16-bit mono PCM at 16 kHz
  // 3200 bytes is 100 ms of 16-bit mono audio at 16 kHz.
  for (let i = 0; i < audio.length; i += 3200) {
    ws.send(audio.subarray(i, i + 3200));
  }
  // finalize flushes the pending utterance without ending the session.
  ws.send(JSON.stringify({ type: "finalize" }));
});

ws.on("message", (raw) => {
  const data = JSON.parse(raw.toString());
  const tag = data.is_final ? "FINAL " : "partial";
  console.log(`[${tag}] ${data.transcript}`);
  // Per the Send section, is_last=true arrives in response to close_stream.
  if (data.is_last) ws.close();
});

Common gotchas

  • Match sample_rate to your audio. The server will not resample for you — sending 44.1 kHz audio with sample_rate=16000 produces garbage transcripts.
  • finalize vs close_stream: finalize ends the current utterance and triggers a final transcript without closing the session. close_stream ends the session entirely.
  • keywords is WebSocket-only. Pass them on connect for proper-noun / jargon boosting; not available on the REST endpoint.
  • format/punctuate/capitalize are accepted at the wire level today. They currently return the same transcript regardless of value; pass them anyway so your integration picks up the new behavior once it changes.
  • PII/PCI redaction (redact_pii, redact_pci) runs server-side on finalized transcripts only — partials may show the unredacted text briefly before being replaced.
  • JavaScript / TypeScript: the official smallestai npm package predates the Pulse model, so connect with the ws library directly as shown above.

Authentication

Authorization: Bearer

Header authentication of the form Bearer <token>

Headers

Authorization: string, required

Bearer token for authentication. Format: Bearer YOUR_API_KEY

Query parameters

language: any, optional, defaults to multi-eu

Language code for transcription. Set explicitly to the known language for best accuracy.

Use multi-eu for unknown European-language audio (auto-detects across the European set: de, en, fr, it, nl, pt, ru, es). Use multi for full multilingual auto-detection across all supported languages.

Omitting language routes to multi-eu, which can mis-detect on non-European audio (e.g., returning Russian for English input). Always pass language explicitly when the source language is known.

encoding: any, optional, defaults to linear16

Audio encoding of the bytes you stream over the socket. The server uses this to decode incoming frames, so set it to match what your client is sending.

  • linear16, linear32: raw PCM (16-bit and 32-bit). Pair with the appropriate sample_rate.
  • alaw, mulaw: 8 kHz telephony codecs. Pair with sample_rate=8000.
  • opus, ogg_opus: Opus compressed audio (raw and Ogg container).

Streaming-only: the pre-recorded REST endpoint (POST /pulse/get_text) auto-detects the format from the file's container header and ignores this parameter.

sample_rate: any, optional, defaults to 16000

Audio sample rate in Hz of the bytes you stream. Must match the actual rate of your audio source. Streaming-only — the pre-recorded REST endpoint reads the rate from the file’s container.
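
To keep the bytes you send consistent with the declared encoding and sample_rate, it can help to derive the chunk size from those two values rather than hard-coding it. The sketch below is illustrative only: `chunk_size_bytes` is a hypothetical helper, it assumes mono audio, the 20 ms default chunk duration is an assumption rather than a documented requirement, and the strict 8 kHz check for alaw/mulaw simply enforces the pairing recommended above. Opus is left out because compressed frames have no fixed size.

```python
# Bytes per sample for the uncompressed encodings listed above.
# alaw and mulaw (G.711) use one byte per sample.
BYTES_PER_SAMPLE = {"linear16": 2, "linear32": 4, "alaw": 1, "mulaw": 1}

def chunk_size_bytes(encoding, sample_rate, chunk_ms=20):
    """Return the number of bytes in chunk_ms of mono audio."""
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"no fixed chunk size for encoding {encoding!r}")
    if encoding in ("alaw", "mulaw") and sample_rate != 8000:
        # Assumption: treat the documented 8 kHz pairing as mandatory.
        raise ValueError(f"{encoding} should be paired with sample_rate=8000")
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]

print(chunk_size_bytes("linear16", 16000))       # 20 ms chunk
print(chunk_size_bytes("linear16", 16000, 100))  # 100 ms chunk
print(chunk_size_bytes("mulaw", 8000))           # 20 ms telephony chunk
```

With these inputs the helper yields 640, 3200, and 160 bytes: the first matches the 320-frame reads in the Python example and the second matches the 3200-byte slices in the JavaScript example.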

word_timestamps: any, optional, defaults to true

Include word-level timestamps in the transcription.

sentence_timestamps: any, optional, defaults to false

Include sentence-level timestamps (utterances) in the transcription.

redact_pii: any, optional, defaults to false

Redact personally identifiable information (name, surname, address, etc.).

redact_pci: any, optional, defaults to false

Redact payment card information (credit card, CVV, zip, account number, etc.).

format: any, optional, defaults to true

Master formatting switch for transcript responses. When false, forces punctuate=false, capitalize=false, and also disables Inverse Text Normalization (ITN) so it cannot silently reintroduce punctuation or casing.

When true, the punctuate and capitalize params take effect independently. Leave format=true and use those two to fine-tune.

punctuate: any, optional, defaults to true

When false, strips end-of-sentence punctuation (., ,, ?, !) from the transcript, words[].word, and utterances[].transcript. Does not affect casing — use capitalize for that. Overridden to false when format=false.

capitalize: any, optional, defaults to true

When false, lowercases the entire transcript output (final transcript, words[].word, and utterances[].transcript). Does not affect punctuation — use punctuate for that. Overridden to false when format=false.
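
The precedence between format, punctuate, and capitalize described above can be written out as a small resolver. This is only a client-side sketch of the documented rules, not server code; `effective_formatting` is a hypothetical name, and the parameter is called `fmt` to avoid shadowing Python's built-in `format`.

```python
def effective_formatting(fmt=True, punctuate=True, capitalize=True, itn_normalize=False):
    """Resolve which formatting steps apply, following the documented precedence.

    fmt=False overrides everything: punctuation, capitalization, and ITN are all off.
    fmt=True lets punctuate, capitalize, and itn_normalize apply independently.
    """
    if not fmt:
        return {"punctuate": False, "capitalize": False, "itn_normalize": False}
    return {
        "punctuate": punctuate,
        "capitalize": capitalize,
        "itn_normalize": itn_normalize,
    }

# format=false wins even when the other flags are left at their defaults.
print(effective_formatting(fmt=False, itn_normalize=True))
# format=true: lowercase output that keeps its punctuation.
print(effective_formatting(capitalize=False))
```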

eou_timeout_ms: string, optional, defaults to 800

Time in milliseconds to wait after speech ends before flushing the transcript.

diarize: any, optional, defaults to false

Enable speaker diarization to identify different speakers in the audio.

keywords: string, optional

Boost recognition of specific words or phrases for this session. Useful for product names, jargon, proper nouns, and other domain-specific terms the model might otherwise mis-transcribe. Streaming (WebSocket) only: not supported on the HTTP /pulse/get_text endpoint.

Format: a single comma-separated string (not a JSON array). Each entry is a word or phrase, optionally followed by :INTENSIFIER, a numeric boost multiplier. Defaults to 1.0 when omitted.

Example: I:20,smiling:26

  • Phrases can include spaces (small language model:3.5).
  • Intensifier accepts integers or decimals (2, 2.5, 0.5).
  • Mixing entries with and without intensifiers is fine.
  • Maximum 100 keywords per session.

INTENSIFIER range: 0 to 20. Recommended value is 6. Higher values bias recognition more aggressively toward the keyword, but also increase the risk of hallucination and repetition in the transcript. Values of 10 or above are not recommended because the model may insert the keyword even when it was not spoken. Start around 3–6 and tune from there.
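
A keywords string in this format can be assembled and checked on the client before connecting. The helper below is a sketch under stated assumptions: `build_keywords` is a hypothetical name, the limits come from the description above, and rejecting phrases that contain a comma or colon is an extra precaution, since the format does not describe any escaping.

```python
def build_keywords(entries):
    """Build the comma-separated keywords value.

    entries is a list whose items are either a phrase (str) or a
    (phrase, intensifier) tuple. Raises ValueError on documented limits.
    """
    if len(entries) > 100:
        raise ValueError("at most 100 keywords per session")
    parts = []
    for entry in entries:
        phrase, boost = (entry, None) if isinstance(entry, str) else entry
        if "," in phrase or ":" in phrase:
            # Assumption: separators cannot appear inside a phrase.
            raise ValueError(f"phrase {phrase!r} contains a separator")
        if boost is None:
            parts.append(phrase)  # server default intensifier applies
        elif 0 <= boost <= 20:
            parts.append(f"{phrase}:{boost}")
        else:
            raise ValueError(f"intensifier {boost} is outside 0 to 20")
    return ",".join(parts)

keywords = build_keywords([("Smallest", 6), "Pulse", ("small language model", 3.5)])
print(keywords)
```

The resulting string still needs URL encoding when it is placed in the query string, because phrases may contain spaces.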
itn_normalize: any, optional, defaults to false

Enable Inverse Text Normalization to convert spoken-form entities (numbers, dates, currencies, phone numbers, etc.) into written form in finalized transcripts.

finalize_on_words: any, optional, defaults to true

When false, disables automatic word-count-based finalization. Use with itn_normalize for agentic pipelines where you control finalization via the finalize message.

max_words: integer, optional

Maximum number of words before forced finalization. Useful for keeping ITN chunks short and accurate.

Send

sendAudioData: string, required, format: "binary"

Stream audio data in chunks for real-time transcription.

OR

sendFinalizeSignal: object, required

Force an immediate is_final transcript for pending speech without ending the session. Useful in agentic pipelines.

OR

sendCloseStreamSignal: object, required

Signal that audio streaming is complete. The server flushes remaining audio, delivers final transcripts, and responds with is_last=true.

Receive

receiveTranscription: object, required

Get real-time transcription results as audio is processed.
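
On the client, received messages are usually split into a running partial and a list of finalized utterances. The sketch below uses only the transcript, is_final, and is_last fields mentioned on this page; `collect_finals` is a hypothetical helper, the sample messages are made up for illustration, and it assumes each partial replaces the previous one rather than extending it.

```python
import json

def collect_finals(raw_messages):
    """Return (finalized utterances, latest pending partial) from JSON messages."""
    finals = []
    partial = ""
    for raw in raw_messages:
        data = json.loads(raw)
        text = data.get("transcript", "")
        if data.get("is_final"):
            finals.append(text)  # the utterance is closed, keep it
            partial = ""         # nothing pending any more
        else:
            partial = text       # assumption: newest partial replaces the old one
        if data.get("is_last"):
            break                # the server has signalled the end of the session
    return finals, partial

# Made-up messages standing in for what arrives over the socket.
sample = [
    '{"transcript": "hello", "is_final": false}',
    '{"transcript": "hello world", "is_final": true}',
    '{"transcript": "how are", "is_final": false}',
    '{"transcript": "how are you", "is_final": true, "is_last": true}',
]
finals, pending = collect_finals(sample)
print(finals, repr(pending))
```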
