Word-level timestamps
Pass word_timestamps: true on a Lightning TTS WebSocket request to receive per-word timing events interleaved with the audio stream. Each event tells you exactly when a word starts and ends in the generated audio.
When to use this
- Live captions — render words to the screen as they’re spoken.
- Karaoke-style highlighting — sync the active-word UI to playback.
- Avatar lip-sync — drive viseme transitions off word boundaries.
- Word-level analytics — log per-word latency or downstream events.
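The use cases above share one core operation: given the current playback position, find the word being spoken. The following is a minimal sketch of that lookup, assuming the word events have already been collected into (word, start, end) tuples sorted by start time and expressed in seconds. The tuple layout, the sample values, and the `active_word` helper are illustrative and not part of the API.

```python
from bisect import bisect_right

# Hypothetical, already-collected timings: (word, start_seconds, end_seconds),
# sorted by start time. The layout and values are assumptions for illustration.
WORDS = [("Hello", 0.00, 0.42), ("world", 0.50, 0.95)]
STARTS = [start for _, start, _ in WORDS]


def active_word(playback_s: float):
    """Return the word being spoken at playback_s, or None during silence."""
    i = bisect_right(STARTS, playback_s) - 1  # last word starting at or before now
    if i < 0:
        return None
    word, _start, end = WORDS[i]
    return word if playback_s < end else None
```

A binary search keeps the lookup cheap enough to run on every animation frame, and returning None between words lets a UI clear the highlight during pauses.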
Endpoint support
If you need word timing, use the WebSocket path.
Voice support
Word events are emitted only when the chosen voice family has the aligner checkpoint baked in.
For unsupported voice families the flag is accepted — audio works normally, but no word_timestamp frames are emitted.
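Because an unsupported voice still produces audio but no word events, a client may want a fallback instead of waiting for timings that never arrive. The sketch below assumes the client has recorded which words arrived before the stream completed; `caption_plan` is a hypothetical helper, not something the service provides.

```python
def caption_plan(text: str, words_seen: list) -> list:
    """Decide what to caption once the stream has completed.

    words_seen holds the words received as word_timestamp events (possibly none).
    """
    if words_seen:
        # Timings arrived: caption word by word.
        return list(words_seen)
    # No word events (for example, an unsupported voice family): the audio
    # still played, so show the whole utterance as a single caption.
    return [text]
```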
Request
word_timestamps defaults to false. Clients that don’t set the flag see no behavior change.
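As a rough sketch of opting in, the helper below adds the flag only when it is requested, so the default payload is unchanged. Only `word_timestamps` comes from this page; the `text` and `voice_id` field names and the `build_request` helper are placeholders, so consult the Streaming TTS guide for the actual request schema.

```python
import json


def build_request(text: str, voice_id: str, word_timestamps: bool = False) -> str:
    """Serialize a TTS request; the flag is omitted unless explicitly enabled."""
    payload = {"text": text, "voice_id": voice_id}  # placeholder field names
    if word_timestamps:
        payload["word_timestamps"] = True  # opt in to per-word timing events
    return json.dumps(payload)
```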
Response frames
The server interleaves word_timestamp frames with chunk frames in audio-time order, followed by a single complete frame.
Each word_timestamp frame carries one word together with its start and end time in the generated audio.
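One way a client might model such a frame is sketched below. The JSON field names (`type`, `word`, `start`, `end`) and the assumption that times are in seconds are hypothetical; this page only establishes that a frame identifies a word and when it starts and ends.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordTiming:
    word: str
    start: float  # start time in the generated audio (unit assumed: seconds)
    end: float    # end time in the generated audio (same assumed unit)


def parse_word_frame(raw: str) -> Optional[WordTiming]:
    """Return a WordTiming for a word_timestamp frame, else None."""
    frame = json.loads(raw)
    if frame.get("type") != "word_timestamp":  # assumed discriminator field
        return None
    return WordTiming(frame["word"], float(frame["start"]), float(frame["end"]))
```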
Example
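The following is a client-side sketch of consuming the interleaved stream, run against simulated frames rather than a live connection. The frame shapes are assumptions: a `type` discriminator, a base64-encoded `audio` field on chunk frames, and `word`, `start`, and `end` on word_timestamp frames. The `consume` helper is likewise hypothetical.

```python
import base64


def consume(frames):
    """Split a frame sequence into audio bytes and word timings.

    Stops at the first complete frame, mirroring the stream terminator.
    """
    audio, words = bytearray(), []
    for frame in frames:
        kind = frame.get("type")  # assumed discriminator field
        if kind == "chunk":
            audio += base64.b64decode(frame["audio"])  # assumed base64 payload
        elif kind == "word_timestamp":
            words.append((frame["word"], frame["start"], frame["end"]))
        elif kind == "complete":
            break
    return bytes(audio), words


# Simulated stream standing in for frames read off the WebSocket.
SAMPLE = [
    {"type": "chunk", "audio": base64.b64encode(b"\x00\x01").decode()},
    {"type": "word_timestamp", "word": "Hello", "start": 0.0, "end": 0.42},
    {"type": "chunk", "audio": base64.b64encode(b"\x02\x03").decode()},
    {"type": "complete"},
]
audio, words = consume(SAMPLE)
```

Frame types the loop does not recognize are skipped, so the same handler also works for streams that contain no word events.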
Backward compatibility
Pure addition. Existing integrations that don’t set word_timestamps see the same wire shape as before — same chunk frames, same complete terminator, no new event types to handle.
See also
- Lightning v3.1 model card — model context for the feature.
- Streaming TTS — full WebSocket setup and audio handling.

