Response Format
For every audio chunk sent over the WebSocket, the server responds with a JSON message. You can structure response handling to suit your needs: read quick interim responses with lower accuracy, or wait for the larger, more accurate responses the server sends less frequently.
Example Response
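The example below is illustrative, assembled from the fields documented in the next section; the `session_id` and `transcript` values are placeholders:

```json
{
  "type": "transcription",
  "status": "success",
  "session_id": "sess_abc123",
  "transcript": "hello world this is a partial result",
  "is_final": false,
  "is_last": false
}
```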
Response Fields
- `type`: Message type identifier, set to `"transcription"` for transcription results.
- `status`: Status of the transcription request, typically `"success"` for valid responses.
- `session_id`: Unique identifier for the transcription session.
- `transcript`: Partial or complete transcription text for the current segment.
- `is_final`: Indicates whether this is the final transcription for the current segment. `false` indicates a partial/interim transcript; `true` indicates a final transcript.
- `is_last`: Indicates whether this is the last transcription in the session. `true` when the session is complete.
Optional Fields
The following fields may be included in responses under certain conditions:
- `language`: Detected primary language code. Only returned when `is_final=true`.
- `languages`: Array of language codes detected in the audio. Only returned when `is_final=true`.
- `words`: Array of word-level timestamps (only included when `word_timestamps=true` in query parameters). Each word object contains `word`, `start`, `end`, and `confidence` fields. When `diarize=true`, also includes `speaker` (integer ID) and `speaker_confidence` (0.0 to 1.0) fields.
- `utterances`: Array of sentence-level timestamps (only included when `sentence_timestamps=true` in query parameters). Each utterance object contains `text`, `start`, and `end` fields. When `diarize=true`, also includes a `speaker` (integer ID) field.
- `redacted_entities`: Array of redacted entity placeholders (only included when `redact_pii=true` or `redact_pci=true`). Examples: `[FIRSTNAME_1]`, `[CREDITCARDCVV_1]`.
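Because these fields are optional, response handlers should not assume they are present. The sketch below shows one way to read word-level timestamps defensively; the sample message values are illustrative, not real API output:

```python
import json

def extract_word_timings(message: str) -> list[tuple[str, float, float]]:
    """Return (word, start, end) tuples from a transcription response.

    The `words` field is optional: it only appears when word_timestamps=true
    was set in the query parameters, so we default to an empty list.
    """
    response = json.loads(message)
    return [(w["word"], w["start"], w["end"]) for w in response.get("words", [])]

# Example response with word-level timestamps (values are illustrative).
sample = json.dumps({
    "type": "transcription",
    "status": "success",
    "transcript": "hello world",
    "is_final": True,
    "is_last": False,
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
        {"word": "world", "start": 0.5, "end": 0.9, "confidence": 0.95},
    ],
})
print(extract_word_timings(sample))
```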
Handling Responses
We maintain an internal server-side buffer that collects the audio chunks you send. Once this buffer reaches a certain size, the server sends a response with the `is_final` field set to `true`, containing the transcription of all audio collected since the last such response.
is_final = true
We recommend processing responses of this kind for optimal transcription accuracy. The internal buffer size is calibrated to optimize response times and accuracy.
- The `language` field is set to the specified language, or to the detected language if the `language` parameter is set to `multi` or `multi-eu`. Other responses will not include the `language` field.
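A minimal client-side dispatcher for final and interim responses can be sketched as follows. This is an assumption about how you might structure handling, not part of the API itself:

```python
import json

class TranscriptAssembler:
    """Collects is_final=true segments into a running transcript,
    while keeping the latest interim (is_final=false) text separately."""

    def __init__(self):
        self.final_segments: list[str] = []
        self.interim: str = ""

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg.get("type") != "transcription":
            return
        if msg.get("is_final"):
            # Final segment: append it and discard any stale interim text.
            self.final_segments.append(msg["transcript"])
            self.interim = ""
        else:
            # Interim segment: a low-latency preview that may be revised.
            self.interim = msg["transcript"]

    @property
    def transcript(self) -> str:
        return " ".join(self.final_segments)
```

Interim text is overwritten rather than accumulated because each interim response replaces the previous one for the current segment.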
is_final = false
These are interim transcript responses sent for each chunk. They provide quick feedback for low-latency use cases.
- These responses may be inaccurate for the most recent words, because the audio for those words may not have fully reached the server in the current chunk.
is_last = true
This response is the final one received after you send the close-stream token `{"type": "close_stream"}`. When `is_last=true`, the server has finished processing all audio and the session is complete.
- This is the last response of the live transcription session and contains all the fields of an `is_final=true` response.

`{"type": "finalize"}` does not trigger `is_last=true`. It only forces an immediate `is_final=true` transcript while keeping the session open. Use it for per-turn finalization in agentic pipelines.
Do not close the WebSocket connection immediately after sending `{"type": "close_stream"}`. Wait for the `is_last=true` response to ensure all audio has been processed and you receive the complete transcript.
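The close-out sequence above can be sketched as below. To keep the sketch self-contained and library-agnostic, `send` and `recv` are assumed to be callables wrapping your WebSocket client's send and receive methods; they are not part of this API:

```python
import json
from typing import Callable

def close_stream_and_drain(send: Callable[[str], None],
                           recv: Callable[[], str]) -> list[dict]:
    """Send the close-stream token, then keep reading until the server
    returns the is_last=true response. Returns every message received
    after the token was sent, so no trailing transcript is lost."""
    send(json.dumps({"type": "close_stream"}))
    messages = []
    while True:
        msg = json.loads(recv())
        messages.append(msg)
        if msg.get("is_last"):
            return messages  # session complete; now it is safe to close the socket
```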

