Sentence-level timestamps
Sentence-level timestamps (utterances) are supported in both Pre-Recorded and Real-Time transcription APIs. The utterances array aggregates contiguous words into sentence-level segments, providing structured timing information for longer audio chunks.
For the Pre-Recorded API, set word_timestamps=true in your query parameters. When word timestamps are enabled, the response includes both words and utterances arrays.
Required dependency: utterances appear in the response only when word_timestamps=true is set. Without it, the utterances array is empty. For the Real-Time API, set sentence_timestamps=true as well.
For the Real-Time WebSocket API, set sentence_timestamps=true as a query parameter when establishing the WebSocket connection.
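As a rough illustration of the parameters described above, the sketch below builds the query string for each API. The endpoint URLs and the build_url helper are hypothetical placeholders, not part of the documented API. Only the parameter names word_timestamps, sentence_timestamps, and diarize come from this page.

```python
from urllib.parse import urlencode

# Hypothetical endpoints, used only for illustration; substitute the real ones.
PRE_RECORDED_URL = "https://api.example.com/v1/transcribe"
REAL_TIME_URL = "wss://api.example.com/v1/stream"


def build_url(base_url: str, *, real_time: bool, diarize: bool = False) -> str:
    """Return base_url with the timestamp query parameters attached."""
    # word_timestamps=true is required for utterances in both APIs.
    params = {"word_timestamps": "true"}
    if real_time:
        # The Real-Time API additionally needs sentence_timestamps=true.
        params["sentence_timestamps"] = "true"
    if diarize:
        # Optional: adds a speaker field to each utterance.
        params["diarize"] = "true"
    return f"{base_url}?{urlencode(params)}"


pre_recorded = build_url(PRE_RECORDED_URL, real_time=False)
real_time = build_url(REAL_TIME_URL, real_time=True, diarize=True)
```

The resulting real_time URL would be passed to whichever WebSocket client you use when opening the connection. Authentication is not covered here.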
Each utterances entry contains text, start, end, and optional speaker fields (when diarization is enabled). Use these sentence-level timestamps when you need to display readable captions, synchronize larger chunks of audio, or store structured call summaries.
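For the caption use case, a minimal sketch of turning an utterances array into SRT-style caption blocks might look like the following. It assumes start and end are expressed in seconds, which this page does not state explicitly. The helper names format_srt_time and utterances_to_srt are illustrative only.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def utterances_to_srt(utterances: list[dict]) -> str:
    """Render one numbered SRT caption block per utterance."""
    blocks = []
    for index, utt in enumerate(utterances, start=1):
        # Assumption: start and end are in seconds.
        start = format_srt_time(utt["start"])
        end = format_srt_time(utt["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{utt['text']}")
    return "\n\n".join(blocks) + "\n"


captions = utterances_to_srt(
    [{"text": "Hello world.", "start": 0.0, "end": 0.9}]
)
```

Each utterance maps to exactly one caption here. Long sentences may need to be split further for on-screen readability.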
When diarize=true is set in the query, each utterances entry in Real-Time API responses also includes a speaker field (an integer ID). For example: { "text": "Hello world.", "start": 0.0, "end": 0.9, "speaker": 0 }
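One way to use the speaker field is to group utterance text by speaker ID, for instance when storing a structured call summary. The sketch below assumes the utterances array has already been parsed from the response JSON. The second sample entry and the text_by_speaker helper are made up for illustration.

```python
import json
from collections import defaultdict


def text_by_speaker(utterances: list[dict]) -> dict:
    """Map each speaker ID to the list of sentences attributed to it."""
    grouped = defaultdict(list)
    for utt in utterances:
        # The speaker field is absent when diarization is off; use None then.
        grouped[utt.get("speaker")].append(utt["text"])
    return dict(grouped)


# Sample payload: the first entry mirrors the example above,
# the second is invented for this sketch.
sample = json.loads(
    '[{"text": "Hello world.", "start": 0.0, "end": 0.9, "speaker": 0},'
    ' {"text": "Hi there.", "start": 1.1, "end": 1.8, "speaker": 1}]'
)
summary = text_by_speaker(sample)
```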