Lightning v3.1 — per-word timestamps on WebSocket streaming
Lightning v3.1 now exposes per-word timing events to WebSocket clients. Opt in with one flag — useful for captioning UIs, karaoke-style word highlighting, avatar lip-sync, and word-level analytics.
What changed
Two changes to a WebSocket request: add word_timestamps: true and handle the new status: "word_timestamp" frame.
word is the exact substring from the input text — un-normalized. "$100" stays "$100", "25th" stays "25th", "3" stays "3". Non-Latin scripts come back verbatim (e.g., Devanagari for Hindi).
start and end are floats in seconds, relative to the start of the audio stream. word_timestamp frames interleave with chunk frames in audio-time order; a single complete frame then terminates the session.
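As a rough illustration of the two client-side changes, the Python sketch below builds an opt-in request, parses word_timestamp frames, and maps each word back to character offsets in the input text (handy for highlighting). It relies on several assumptions that this note does not confirm: the request fields "text" and "voice_id" are hypothetical, and word, start, and end are assumed to be top-level keys of the frame. Check the wire spec on the model card before relying on this layout.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WordEvent:
    word: str     # exact, un-normalized substring of the input text
    start: float  # seconds from the start of the audio stream
    end: float    # seconds from the start of the audio stream


def build_request(text: str, voice_id: str) -> str:
    """Serialize a request with the opt-in flag set.

    Only "word_timestamps" comes from the release note; "text" and
    "voice_id" are hypothetical field names used for illustration.
    """
    return json.dumps({"text": text, "voice_id": voice_id, "word_timestamps": True})


def parse_word_frame(raw: str) -> Optional[WordEvent]:
    """Return a WordEvent for a word_timestamp frame, otherwise None.

    Assumes word/start/end sit at the top level of the JSON frame.
    """
    frame = json.loads(raw)
    if frame.get("status") != "word_timestamp":
        return None
    return WordEvent(frame["word"], float(frame["start"]), float(frame["end"]))


def locate_words(text: str, events: List[WordEvent]) -> List[Tuple[int, int]]:
    """Map each event to (begin, end) character offsets in the input text.

    Because each word is a verbatim substring, a forward search from a
    moving cursor resolves repeated words in their spoken order.
    """
    spans = []
    cursor = 0
    for event in events:
        begin = text.find(event.word, cursor)
        if begin == -1:
            raise ValueError("word not found in input text: %r" % event.word)
        cursor = begin + len(event.word)
        spans.append((begin, cursor))
    return spans
```

A captioning UI could call parse_word_frame on every incoming text frame, collect the non-None results, and use locate_words to decide which slice of the original text to highlight between start and end.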
Where it works
Voice + language support
For unsupported voice families the flag is accepted — audio works normally, but no word_timestamp frames are emitted. Detect this client-side by counting received word events after complete arrives.
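The counting check described above can be sketched in Python as follows. This is a minimal sketch, assuming that "chunk" and "complete" are delivered as values of the same status field as "word_timestamp"; the class name TimestampSupportProbe is hypothetical and not part of any SDK.

```python
import json


class TimestampSupportProbe:
    """Counts word_timestamp frames and reports once complete arrives."""

    def __init__(self) -> None:
        self.word_events = 0
        self.completed = False

    def feed(self, raw: str) -> None:
        """Inspect one JSON text frame from the WebSocket."""
        # Assumption: every frame carries a top-level "status" field.
        status = json.loads(raw).get("status")
        if status == "word_timestamp":
            self.word_events += 1
        elif status == "complete":
            self.completed = True

    def timestamps_unsupported(self) -> bool:
        """True only after complete has arrived with zero word events."""
        return self.completed and self.word_events == 0
```

The verdict is only meaningful once complete has been seen, since word events may still be in flight before then. An empty input text would also yield zero word events, so a client may want to skip the check in that case.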
Backward compatibility
word_timestamps defaults to false. Clients that don’t set the flag see no behavior change — same audio chunks, same completion frame, no new event type to handle. Purely opt-in.
Migration: none — pure addition. Existing integrations keep working untouched.
→ Word-level timestamps on the Lightning v3.1 model card — full wire spec, JS example, support matrix.

