Word-level timestamps
Pass word_timestamps: true on a Lightning TTS WebSocket request to receive per-word timing events interleaved with the audio stream. Each event tells you exactly when a word starts and ends in the generated audio.
If you need word timing, use the WebSocket path.
Word events are emitted only when the selected voice family includes an aligner checkpoint.
For voice families without one, the flag is still accepted: audio is generated normally, but no word_timestamp frames are emitted.
word_timestamps defaults to false. Clients that don’t set the flag see no behavior change.
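As a minimal sketch of opting in, assume the request is sent as a JSON text message over the WebSocket. The helper name build_tts_request and the text and voice_id field names are hypothetical placeholders; only the word_timestamps flag and its default of false come from the description above.

```python
import json


def build_tts_request(text, voice_id, word_timestamps=False):
    """Serialize a request message, adding the flag only when it is enabled.

    The "text" and "voice_id" keys are illustrative assumptions, not a
    confirmed request schema. Only "word_timestamps" is taken from the docs.
    """
    payload = {"text": text, "voice_id": voice_id}
    if word_timestamps:
        # Leaving the key out entirely is equivalent to the default (false).
        payload["word_timestamps"] = True
    return json.dumps(payload)
```

Because the flag is omitted unless requested, a client built this way sends the same message as before until it opts in.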
The server interleaves word_timestamp frames with chunk frames in audio-time order, followed by a single complete:
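A client might consume that sequence roughly as follows. This is a sketch under assumptions: each frame is taken to be a JSON object whose hypothetical type field holds chunk, word_timestamp, or complete, and the function name consume_frames is illustrative. The frame bodies are kept opaque, since their fields are not assumed here.

```python
import json


def consume_frames(messages):
    """Split an ordered stream of JSON frames into audio and word-timing lists.

    Assumes a "type" discriminator on every frame (an assumption, not a
    confirmed schema). Stops reading at the first "complete" frame.
    """
    chunks, words = [], []
    for raw in messages:
        frame = json.loads(raw)
        kind = frame.get("type")
        if kind == "chunk":
            chunks.append(frame)
        elif kind == "word_timestamp":
            words.append(frame)
        elif kind == "complete":
            break  # single terminator: nothing after it is read
        # Frames of any other type are ignored.
    return chunks, words
```

For a voice family that emits no word_timestamp frames, the second list simply comes back empty, so the client should treat complete, not the arrival of word events, as the end of the stream.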
Each word_timestamp frame:
The feature is purely additive. Existing integrations that don’t set word_timestamps see the same wire shape as before: the same chunk frames, the same complete terminator, and no new event types to handle.