Transcribe audio in real time over a persistent WebSocket. Use it for live captioning, voice agents, and any flow where you need partial transcripts while the user is still speaking.
## When to use this
- **Use this** for live audio: microphone input, voice-agent turns, simultaneous interpretation, low-latency captioning. Partial results stream back while audio is still arriving.
- **Use the pre-recorded REST endpoint** (`POST /waves/v1/pulse/get_text`) when you have a complete file. Single request, single response, less plumbing.
## How it works
1. Open a WebSocket to `wss://api.smallest.ai/waves/v1/pulse/get_text` with an `Authorization: Bearer <key>` header and the session params (`language`, `sample_rate`, `encoding`, etc.) as a query string.
2. Stream raw PCM (or your chosen `encoding`) over the socket as binary frames.
3. The server pushes back JSON `transcriptionResponse` messages — partial results (`is_final: false`) as you speak, finalized text (`is_final: true`) when an utterance closes.
4. Send a `finalize` message to force end-of-utterance, or `close_stream` to end the session.
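The connection URL from step 1 can be assembled programmatically. The following is a minimal sketch, assuming booleans are sent as lowercase `true`/`false` (as in the example URLs on this page); `build_stream_url` is a hypothetical helper name, not part of any SDK.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/pulse/get_text"

def build_stream_url(**params) -> str:
    """Append session params to the WebSocket URL as a query string."""
    query = {}
    for key, value in params.items():
        if value is None:
            continue  # leave unset params out so the server default applies
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumption: lowercase booleans
        query[key] = str(value)
    return f"{BASE_URL}?{urlencode(query)}" if query else BASE_URL

url = build_stream_url(
    language="en", sample_rate=16000, encoding="linear16", word_timestamps=True
)
print(url)
```

Because `urlencode` percent-encodes values, the same helper can carry a `keywords` string that contains spaces, colons, and commas.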
## Examples
**Python** (real-time mic input)
```python
import asyncio, json, os
import pyaudio, websockets

API_KEY = os.environ["SMALLEST_API_KEY"]  # same env var as the JavaScript example
URL = "wss://api.smallest.ai/waves/v1/pulse/get_text?language=en&sample_rate=16000&encoding=linear16&word_timestamps=true"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

async def stream_mic():
    pa = pyaudio.PyAudio()
    # 320 frames of 16-bit mono audio at 16 kHz = 20 ms (640 bytes) per read
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=320)
    # additional_headers needs websockets 14+; the legacy client calls it extra_headers
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        async def send_audio():
            while True:
                # Read in a worker thread so the blocking call does not stall the event loop
                await ws.send(await asyncio.to_thread(stream.read, 320))

        async def recv_transcripts():
            async for msg in ws:
                data = json.loads(msg)
                tag = "FINAL " if data.get("is_final") else "partial"
                print(f"[{tag}] {data.get('transcript')}")

        await asyncio.gather(send_audio(), recv_transcripts())

asyncio.run(stream_mic())
```
**JavaScript / TypeScript** (using `ws`)
```typescript
import WebSocket from "ws";
import { readFileSync } from "node:fs";

const params = new URLSearchParams({
  language: "en", sample_rate: "16000", encoding: "linear16", word_timestamps: "true",
});
const ws = new WebSocket(`wss://api.smallest.ai/waves/v1/pulse/get_text?${params}`, {
  headers: { Authorization: `Bearer ${process.env.SMALLEST_API_KEY}` },
});

ws.on("open", () => {
  const audio = readFileSync("./call.pcm"); // 16-bit mono PCM at 16 kHz
  for (let i = 0; i < audio.length; i += 3200) {
    ws.send(audio.subarray(i, i + 3200));
  }
  ws.send(JSON.stringify({ type: "finalize" }));
});

ws.on("message", (raw) => {
  const data = JSON.parse(raw.toString());
  const tag = data.is_final ? "FINAL " : "partial";
  console.log(`[${tag}] ${data.transcript}`);
  if (data.is_last) ws.close();
});
```
## Common gotchas
- **Match `sample_rate` to your audio.** The server will not resample for you — sending 44.1 kHz audio with `sample_rate=16000` produces garbage transcripts.
- **`finalize` vs `close_stream`**: `finalize` ends the current utterance and triggers a final transcript without closing the session. `close_stream` ends the session entirely.
- **`keywords` is WebSocket-only.** Pass them on connect for proper-noun / jargon boosting; not available on the REST endpoint.
- **`format`/`punctuate`/`capitalize`** are accepted at the wire level today, but the transcript currently comes back the same regardless of their values. Pass them anyway so your integration picks up the behavior once it takes effect.
- **PII/PCI redaction (`redact_pii`, `redact_pci`)** runs server-side on finalized transcripts only — partials may show the unredacted text briefly before being replaced.
- **JavaScript / TypeScript**: the official `smallestai` npm package predates the Pulse model, so connect with the `ws` library directly as shown above.
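Because partials can later be replaced by a finalized (and possibly redacted) transcript, a client usually wants to keep the two apart. The sketch below shows one way to do that, assuming only the `is_final` and `transcript` fields used in the examples above; `TranscriptView` and the sample payloads are hypothetical.

```python
import json

class TranscriptView:
    """Keeps finalized utterances separate from the in-progress partial."""

    def __init__(self):
        self.finals = []   # finalized utterances, in arrival order
        self.partial = ""  # latest in-progress text, may still change

    def handle(self, raw: str) -> None:
        data = json.loads(raw)
        text = data.get("transcript") or ""
        if data.get("is_final"):
            self.finals.append(text)
            self.partial = ""  # the final supersedes whatever partial was shown
        else:
            self.partial = text

    def committed(self) -> str:
        """Text that is safe to persist: finalized utterances only."""
        return " ".join(t for t in self.finals if t)

    def display(self) -> str:
        """Text for a live caption: finals plus the current partial."""
        return " ".join(t for t in [*self.finals, self.partial] if t)

view = TranscriptView()
view.handle('{"transcript": "hello wor", "is_final": false}')   # hypothetical partial
view.handle('{"transcript": "Hello world.", "is_final": true}')  # hypothetical final
print(view.committed())
```

Logging or storing only `committed()` avoids writing unredacted partial text to disk when redaction is enabled.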
## Handshake
`WSS wss://api.smallest.ai/waves/v1/pulse/get_text`
**Authentication**
Header authentication of the form `Authorization: Bearer <token>`.
**Headers**
**`Authorization`** (string, required)
Bearer token for authentication. Format: `Bearer YOUR_API_KEY`
**Query parameters**
**`language`** (enum, optional, defaults to `multi-eu`)
Language code for transcription. Set explicitly to the known
language for best accuracy.
Use `multi-eu` for unknown European-language audio (auto-detects
across the European set: de, en, fr, it, nl, pt, ru, es). Use
`multi` for full multilingual auto-detection across all supported
languages.
Omitting `language` routes to `multi-eu`, which can mis-detect on
non-European audio (e.g., returning Russian for English input).
Always pass `language` explicitly when the source language is known.
**`encoding`** (enum, optional, defaults to `linear16`)
Audio encoding of the bytes you stream over the socket. The
server uses this to decode incoming frames — set it to match
what your client is sending.
- `linear16`, `linear32` — raw PCM (16-bit and 32-bit). Pair
with the appropriate `sample_rate`.
- `alaw`, `mulaw` — 8 kHz telephony codecs. Pair with
`sample_rate=8000`.
- `opus`, `ogg_opus` — Opus compressed audio (raw and Ogg
container).
Streaming-only: the pre-recorded REST endpoint
(`POST /waves/v1/pulse/get_text`) auto-detects the format from the
file's container header and ignores this parameter.
**`sample_rate`** (enum, optional, defaults to `16000`)
Audio sample rate in Hz of the bytes you stream. Must match
the actual rate of your audio source. Streaming-only — the
pre-recorded REST endpoint reads the rate from the file’s
container.
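Since `encoding` and `sample_rate` together determine how many bytes one slice of audio occupies, it can help to compute the chunk size rather than hard-code it. This is a rough sketch for the uncompressed encodings only; the 20 ms default frame length and the `frame_bytes` helper are assumptions for illustration, not documented requirements.

```python
# Bytes per sample for the uncompressed encodings listed above.
BYTES_PER_SAMPLE = {"linear16": 2, "linear32": 4, "alaw": 1, "mulaw": 1}

def frame_bytes(encoding: str, sample_rate: int, frame_ms: int = 20, channels: int = 1) -> int:
    """Size in bytes of one audio frame of `frame_ms` milliseconds."""
    if encoding not in BYTES_PER_SAMPLE:
        # opus / ogg_opus are compressed, so frames have no fixed byte size
        raise ValueError(f"no fixed frame size for encoding {encoding!r}")
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding] * channels

print(frame_bytes("linear16", 16000))       # 640
print(frame_bytes("linear16", 16000, 100))  # 3200
print(frame_bytes("mulaw", 8000))           # 160
```

The first two values match the chunk sizes used in the Python (320 frames of 16-bit audio) and JavaScript (3200 bytes) examples on this page.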
**`word_timestamps`** (enum, optional, defaults to `true`)
Include word-level timestamps in the transcription.
**`sentence_timestamps`** (enum, optional, defaults to `false`)
Include sentence-level timestamps (utterances) in the transcription.
**`redact_pii`** (enum, optional, defaults to `false`)
Redact personally identifiable information (name, surname, address, etc.).
Convert spoken numerals into digit form (e.g., 'twenty five' to '25'). `auto` enables automatic detection based on context.
For new integrations we recommend `itn_normalize=true` instead — it covers digits as well as dates, currencies, phone numbers, and other spoken-form entities, and gives more consistent results across languages.
**`format`** (enum, optional, defaults to `true`)
Master formatting switch for transcript responses. When `false`,
forces `punctuate=false`, `capitalize=false`, and also disables
Inverse Text Normalization (ITN) so it cannot silently reintroduce
punctuation or casing.
When `true`, the `punctuate` and `capitalize` params take effect
independently. Leave `format=true` and use those two to fine-tune.
**`punctuate`** (enum, optional, defaults to `true`)
When `false`, strips punctuation (`.`, `,`, `?`, `!`)
from the `transcript`, `words[].word`, and `utterances[].transcript`.
Does not affect casing; use `capitalize` for that. Overridden to
`false` when `format=false`.
**`capitalize`** (enum, optional, defaults to `true`)
When `false`, lowercases the entire transcript output (final
`transcript`, `words[].word`, and `utterances[].transcript`). Does
not affect punctuation; use `punctuate` for that. Overridden to
`false` when `format=false`.
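The precedence between `format`, `punctuate`, `capitalize`, and ITN described above can be modeled on the client side. The following is a small sketch of that documented rule only; `effective_formatting` is a hypothetical helper, and it says nothing about what the server currently returns.

```python
def effective_formatting(format_=True, punctuate=True, capitalize=True, itn_normalize=False):
    """Resolve which formatting features apply, per the documented precedence."""
    if not format_:
        # format=false is the master switch: it forces everything off, ITN included
        return {"punctuate": False, "capitalize": False, "itn_normalize": False}
    # format=true: each flag takes effect independently
    return {"punctuate": punctuate, "capitalize": capitalize, "itn_normalize": itn_normalize}

print(effective_formatting(format_=False, itn_normalize=True))
print(effective_formatting(punctuate=False))
```

The defaults in the signature mirror the documented defaults (`format`, `punctuate`, and `capitalize` on; `itn_normalize` off).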
**`eou_timeout_ms`** (string, optional, defaults to `800`)
Time in milliseconds to wait after speech ends before flushing the transcript.
**`diarize`** (enum, optional, defaults to `false`)
Enable speaker diarization to identify different speakers in the audio.
**`keywords`** (string, optional)
Boost recognition of specific words or phrases for this session.
Useful for product names, jargon, proper nouns, and other
domain-specific terms the model might otherwise mis-transcribe.
**Streaming (WebSocket) only.** Not supported on the HTTP
`/waves/v1/pulse/get_text` endpoint.
**Format:** a single comma-separated string (not a JSON array).
Each entry is a word or phrase, optionally followed by
`:INTENSIFIER` — a numeric boost multiplier. Defaults to `1.0`
when omitted.
**Example:** `Pulse:6,small language model:3.5`
- Phrases can include spaces (`small language model:3.5`).
- Intensifier accepts integers or decimals (`2`, `2.5`, `0.5`).
- Mixing entries with and without intensifiers is fine.
- Maximum 100 keywords per session.
**INTENSIFIER range:** `0` to `20`. Recommended value is `6`.
Higher values bias recognition more aggressively toward the
keyword, but also **increase the risk of hallucination and
repetition** in the transcript. Values of `10` or above are
not recommended — the model may insert the keyword even when
it was not spoken. Start around `3–6` and tune from there.
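The format rules above are easy to get wrong by hand, so a small builder can validate them before connecting. This is a minimal sketch under the limits stated in this section (at most 100 entries, intensifier between `0` and `20`); `build_keywords` is a hypothetical helper, and rejecting `,` and `:` inside phrases is an assumption, since no escaping rule is documented.

```python
def build_keywords(boosts: dict) -> str:
    """Serialize {phrase: intensifier or None} into the comma-separated format."""
    if len(boosts) > 100:
        raise ValueError("at most 100 keywords per session")
    parts = []
    for phrase, intensifier in boosts.items():
        if "," in phrase or ":" in phrase:
            # assumption: separators cannot appear inside a phrase
            raise ValueError(f"phrase {phrase!r} contains a separator character")
        if intensifier is None:
            parts.append(phrase)  # server default intensifier applies
            continue
        if not 0 <= intensifier <= 20:
            raise ValueError(f"intensifier for {phrase!r} must be between 0 and 20")
        parts.append(f"{phrase}:{intensifier:g}")  # :g drops a trailing ".0"
    return ",".join(parts)

print(build_keywords({"Pulse": 6, "small language model": 3.5, "Waves": None}))
```

When the result goes into the connection URL it still needs percent-encoding, for example via `urllib.parse.urlencode`.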
**`itn_normalize`** (enum, optional, defaults to `false`)
Enable Inverse Text Normalization to convert spoken-form entities (numbers, dates, currencies, phone numbers, etc.) into written form in finalized transcripts.
**`finalize_on_words`** (enum, optional, defaults to `true`)
When `false`, disables automatic word-count-based finalization. Use with `itn_normalize` for agentic pipelines where you control finalization via the `finalize` message.
**`max_words`** (integer, optional)
Maximum number of words before forced finalization. Useful for keeping ITN chunks short and accurate.
## Send
Each client message is one of the following:
**`sendAudioData`** (string, required, format: `binary`)
Stream audio data in chunks for real-time transcription.
**`sendFinalizeSignal`** (object, required)
Force an immediate `is_final` transcript for pending speech without ending the session. Useful in agentic pipelines.
**`sendCloseStreamSignal`** (object, required)
Signal that audio streaming is complete. The server flushes remaining audio, delivers final transcripts, and responds with `is_last=true`.
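For a finite audio buffer, the three message kinds above typically appear in a fixed order: binary audio chunks first, then the close signal. The sketch below generates that sequence. The `{"type": "close_stream"}` payload is an assumption modeled on the `{ type: "finalize" }` message in the JavaScript example, so verify the exact shape before relying on it; `outgoing_messages` is a hypothetical helper.

```python
import json

def outgoing_messages(audio: bytes, chunk_size: int = 3200):
    """Yield binary audio chunks, then the close-stream control message."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]  # bytes -> sent as a binary frame
    # Assumed payload shape, mirroring {"type": "finalize"}
    yield json.dumps({"type": "close_stream"})  # str -> sent as a text frame

messages = list(outgoing_messages(b"\x00" * 7000))
print([len(m) if isinstance(m, bytes) else m for m in messages])
```

With the `websockets` library, `await ws.send(msg)` sends `bytes` as binary frames and `str` as text frames, so the generator's output can be passed through unchanged; the receive loop can then stop once a message arrives with `is_last` set.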
## Receive
**`receiveTranscription`** (object, required)
Get real-time transcription results as audio is processed.