Response Format
For every audio chunk sent over the WebSocket, the server responds with a JSON message. You can structure response handling to suit your needs: read quick interim responses with lower accuracy, or wait for the larger, more accurate responses the server sends less frequently.
Example Response
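The example below is illustrative, assembled from the fields documented in the next section; the `session_id` and `transcript` values are placeholders:

```json
{
  "type": "transcription",
  "status": "success",
  "session_id": "sess_abc123",
  "transcript": "hello world this is a partial result",
  "is_final": false,
  "is_last": false
}
```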
Response Fields
- `type`: Message type identifier, set to `"transcription"` for transcription results.
- `status`: Status of the transcription request, typically `"success"` for valid responses.
- `session_id`: Unique identifier for the transcription session.
- `transcript`: Partial or complete transcription text for the current segment.
- `is_final`: Indicates whether this is the final transcription for the current segment. `false` indicates a partial/interim transcript; `true` indicates a final transcript.
- `is_last`: Indicates whether this is the last transcription in the session. `true` when the session is complete.
Optional Fields
The following fields may be included in responses under certain conditions:
- `language`: Detected primary language code. Only returned when `is_final=true`.
- `languages`: Array of language codes detected in the audio. Only returned when `is_final=true`.
- `words`: Array of word-level timestamps (only included when `word_timestamps=true` in query parameters). Each word object contains `word`, `start`, `end`, and `confidence` fields. When `diarize=true`, also includes `speaker` (integer ID) and `speaker_confidence` (0.0 to 1.0) fields.
- `utterances`: Array of sentence-level timestamps (only included when `sentence_timestamps=true` in query parameters). Each utterance object contains `text`, `start`, and `end` fields. When `diarize=true`, also includes a `speaker` (integer ID) field.
- `redacted_entities`: Array of redacted entity placeholders (only included when `redact_pii=true` or `redact_pci=true`). Examples: `[FIRSTNAME_1]`, `[CREDITCARDCVV_1]`.
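Because these fields are optional, response handlers should not assume they are present. The sketch below shows one way to read word-level timestamps defensively; the sample message values are illustrative, not real API output:

```python
import json

def extract_word_timings(message: str) -> list[tuple[str, float, float]]:
    """Return (word, start, end) tuples from a transcription response.

    The `words` field is optional: it only appears when word_timestamps=true
    was set in the query parameters, so we default to an empty list.
    """
    response = json.loads(message)
    return [(w["word"], w["start"], w["end"]) for w in response.get("words", [])]

# Example response with word-level timestamps (values are illustrative).
sample = json.dumps({
    "type": "transcription",
    "status": "success",
    "transcript": "hello world",
    "is_final": True,
    "is_last": False,
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
        {"word": "world", "start": 0.5, "end": 0.9, "confidence": 0.95},
    ],
})
print(extract_word_timings(sample))
```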
Handling Responses
We maintain an internal server-side buffer that collects the audio chunks you send. Once this buffer reaches a certain size, the server sends a response with the `is_final` field set to `true`, containing the transcription of all audio collected since the last such response.
is_final = true
We recommend processing responses of this kind for optimal transcription accuracy. The internal buffer size is calibrated to optimize response times and accuracy.
- The `language` field is set to the specified language, or to the detected language if the `language` parameter is set to `multi` or `multi-eu`. Other responses will not include the `language` field.
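A minimal client-side dispatcher for final and interim responses can be sketched as follows. This is an assumption about how you might structure handling, not part of the API itself:

```python
import json

class TranscriptAssembler:
    """Collects is_final=true segments into a running transcript,
    while keeping the latest interim (is_final=false) text separately."""

    def __init__(self):
        self.final_segments: list[str] = []
        self.interim: str = ""

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg.get("type") != "transcription":
            return
        if msg.get("is_final"):
            # Final segment: append it and discard any stale interim text.
            self.final_segments.append(msg["transcript"])
            self.interim = ""
        else:
            # Interim segment: a low-latency preview that may be revised.
            self.interim = msg["transcript"]

    @property
    def transcript(self) -> str:
        return " ".join(self.final_segments)
```

Interim text is overwritten rather than accumulated because each interim response replaces the previous one for the current segment.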
is_final = false
These are interim transcript responses sent for each chunk. They provide quick feedback for low-latency use cases.
- These responses may be inaccurate for the most recent words, because the audio for those words may not have fully reached the server in the current chunk.
is_last = true
This response is the final one received after you send the close-stream token `{"type": "close_stream"}`. When `is_last=true`, the server has finished processing all audio and the session is complete.
- This is the last response of the live transcription session and contains all the fields of an `is_final=true` response.

`{"type": "finalize"}` does not trigger `is_last=true`. It only forces an immediate `is_final=true` transcript while keeping the session open. Use it for per-turn finalization in agentic pipelines.
Do not close the WebSocket connection immediately after sending `{"type": "close_stream"}`. Wait for the `is_last=true` response to ensure all audio has been processed and you receive the complete transcript.
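The close-out sequence above can be sketched as below. To keep the sketch self-contained and library-agnostic, `send` and `recv` are assumed to be callables wrapping your WebSocket client's send and receive methods; they are not part of this API:

```python
import json
from typing import Callable

def close_stream_and_drain(send: Callable[[str], None],
                           recv: Callable[[], str]) -> list[dict]:
    """Send the close-stream token, then keep reading until the server
    returns the is_last=true response. Returns every message received
    after the token was sent, so no trailing transcript is lost."""
    send(json.dumps({"type": "close_stream"}))
    messages = []
    while True:
        msg = json.loads(recv())
        messages.append(msg)
        if msg.get("is_last"):
            return messages  # session complete; now it is safe to close the socket
```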

