Word-level timestamps

Pass word_timestamps: true on a Lightning TTS WebSocket request to receive per-word timing events interleaved with the audio stream. Each event tells you exactly when a word starts and ends in the generated audio.

When to use this

  • Live captions — render words to the screen as they’re spoken.
  • Karaoke-style highlighting — sync the active-word UI to playback.
  • Avatar lip-sync — drive viseme transitions off word boundaries.
  • Word-level analytics — log per-word latency or downstream events.
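
For the captioning and highlighting cases above, the core operation is finding which word is active at a given playback time. The following is a minimal sketch, assuming events carry the `start` and `end` fields described under Response frames and are kept sorted by `start`. The helper name `active_word` and the sample timings are hypothetical and for illustration only.

```python
from bisect import bisect_right


def active_word(events, t):
    """Return the event whose [start, end) interval contains time t, else None.

    `events` must be sorted by "start".
    """
    starts = [e["start"] for e in events]
    i = bisect_right(starts, t) - 1  # index of the last event starting at or before t
    if i >= 0 and t < events[i]["end"]:
        return events[i]
    return None  # t falls in a gap between words or outside the utterance


# Made-up timings, for illustration only.
events = [
    {"id": 0, "word": "I", "start": 0.00, "end": 0.15},
    {"id": 1, "word": "bought", "start": 0.15, "end": 0.48},
    {"id": 2, "word": "3", "start": 0.55, "end": 0.80},
]
print(active_word(events, 0.20))
print(active_word(events, 0.50))
```

A UI would typically call this once per animation frame with the player's current position and highlight the returned word, if any.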

Endpoint support

| Surface | Word timestamps |
| --- | --- |
| WSS /waves/v1/tts/live (unified) | ✅ |
| WSS /waves/v1/lightning-v3.1/get_speech/stream (legacy, retiring 2026-07-14) | ✅ |
| POST /waves/v1/tts (sync HTTP) | ❌ (flag accepted, silently ignored) |
| POST /waves/v1/tts/live (HTTP SSE) | ❌ (flag accepted, silently ignored) |

If you need word timing, use the WebSocket path.

Voice support

Word events are emitted only when the chosen voice family has the aligner checkpoint baked in.

| Language | Voice family | Word events |
| --- | --- | --- |
| English (en) | Base-queue voices: meher, devansh, kartik, maithili, liam, avery | ✅ |
| Hindi (hi) | Base-queue voices (same list) | ✅ |
| Marathi / Bengali / Gujarati / Punjabi / Odia | north-Indic family | ✅ |
| Tamil / Telugu / Kannada / Malayalam | south-Indic family | ✅ |

For unsupported voice families the flag is accepted — audio works normally, but no word_timestamp frames are emitted.
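
Because an unsupported voice yields audio but no `word_timestamp` frames, a client may want a fallback rather than an empty caption track. The sketch below shows one possible approach: if no word events arrived, it returns a single span covering the whole text, with the duration derived from the PCM byte count. The helper `caption_spans` is hypothetical, and the 16-bit mono PCM layout is an assumption taken from how this page's example writes its WAV file.

```python
def caption_spans(text, word_events, pcm, sample_rate, sample_width=2, channels=1):
    """Return per-word (word, start, end) spans, or one span for the whole text.

    Assumes `pcm` is raw PCM with `sample_width` bytes per sample (2 = 16-bit)
    and `channels` channels; both defaults are assumptions, not documented facts.
    """
    if word_events:
        return [(w["word"], w["start"], w["end"]) for w in word_events]
    # No word_timestamp frames arrived: derive the clip length from the PCM size.
    duration = len(pcm) / (sample_rate * sample_width * channels)
    return [(text, 0.0, duration)]


# One second of silence at 44.1 kHz, 16-bit mono, and no word events.
print(caption_spans("hello world", [], b"\x00" * 88200, 44100))
```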

Request

{
  "voice_id": "meher",
  "text": "I bought 3 cats for $100 on Dec 25th",
  "model": "lightning_v3.1_pro",
  "sample_rate": 44100,
  "word_timestamps": true
}

word_timestamps defaults to false. Clients that don’t set the flag see no behavior change.

Response frames

The server interleaves word_timestamp frames with chunk frames in audio-time order, followed by a single complete frame:

chunk → chunk → word_timestamp{id:0} → chunk → word_timestamp{id:1} → … → complete
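
The frame sequence above can be handled with a small dispatcher keyed on `status`. This is a sketch rather than a reference client: the function name `consume_frames` is hypothetical, and it assumes each chunk frame carries base64 audio in `data.audio`, as the Example section on this page does.

```python
import base64
import json


def consume_frames(raw_frames):
    """Split a stream of JSON frames into (pcm_bytes, word_events, completed)."""
    pcm = bytearray()
    words = []
    completed = False
    for raw in raw_frames:
        msg = json.loads(raw)
        status = msg.get("status")
        if status == "chunk":
            pcm += base64.b64decode(msg["data"]["audio"])
        elif status == "word_timestamp":
            words.append(msg["data"])
        elif status == "complete":
            completed = True
            break  # the terminator is the last frame of the request
        # Any other status is ignored here; real code would likely log it.
    return bytes(pcm), words, completed


# Synthetic frames for illustration; a real stream comes from the WebSocket.
frames = [
    json.dumps({"status": "chunk", "data": {"audio": base64.b64encode(b"\x00\x01").decode()}}),
    json.dumps({"status": "word_timestamp", "data": {"id": 0, "word": "I", "start": 0.0, "end": 0.1}}),
    json.dumps({"status": "chunk", "data": {"audio": base64.b64encode(b"\x02\x03").decode()}}),
    json.dumps({"status": "complete"}),
]
pcm, words, done = consume_frames(frames)
print(len(pcm), [w["word"] for w in words], done)
```

Keeping audio and word events in separate buffers lets playback start from the chunks while the word list grows independently.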

Each word_timestamp frame:

{
  "session_id": "ws_550e8400-e29b",
  "request_id": "task_6ba7b810-9dad",
  "status": "word_timestamp",
  "data": {
    "id": 5,
    "word": "$100",
    "start": 1.12,
    "end": 1.92
  }
}

| Field | Type | Description |
| --- | --- | --- |
| data.id | integer | 0-indexed position of the word within the input text. |
| data.word | string | Exact substring from the input, un-normalized. "$100" stays "$100", "25th" stays "25th", "3" stays "3". Non-Latin scripts are preserved verbatim (e.g. Devanagari for Hindi). |
| data.start | float (seconds) | Start of the word in the audio stream. |
| data.end | float (seconds) | End of the word in the audio stream. |
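
Since `data.word` is an exact substring of the input, it can be mapped back to character offsets for highlighting, and `data.start` / `data.end` can be converted to sample indices for slicing audio. The helpers below are a sketch; `char_spans` and `sample_range` are hypothetical names. The left-to-right search is an assumption about how words line up with the text, not documented behavior.

```python
def char_spans(text, word_events):
    """Map each event to (start_char, end_char) in `text` via left-to-right search."""
    spans = []
    cursor = 0
    for w in sorted(word_events, key=lambda e: e["id"]):
        start = text.find(w["word"], cursor)
        if start == -1:
            spans.append(None)  # word not found; leave the cursor where it is
            continue
        cursor = start + len(w["word"])
        spans.append((start, cursor))
    return spans


def sample_range(event, sample_rate):
    """Convert an event's start/end seconds into sample indices."""
    return round(event["start"] * sample_rate), round(event["end"] * sample_rate)


TEXT = "I bought 3 cats for $100 on Dec 25th"
event = {"id": 5, "word": "$100", "start": 1.12, "end": 1.92}  # frame shown above
print(char_spans(TEXT, [event]))
print(sample_range(event, 44100))
```

In this example, `id` 5 also happens to be the index of "$100" when the text is split on whitespace. Whether that holds for every input is not stated here, so treat it as an observation about this one request.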

Example

# Requires: pip install websocket-client
# Opt into word_timestamps on the Lightning TTS WebSocket, save the audio to WAV,
# and print every received word event with its timing.
import base64
import json
import os
import wave

from websocket import WebSocketApp

API_KEY = os.environ["SMALLEST_API_KEY"]
WS_URL = "wss://api.smallest.ai/waves/v1/tts/live"
SAMPLE_RATE = 44100
TEXT = "I bought 3 cats for $100 on Dec 25th"
VOICE_ID = "meher"

audio_chunks: list[str] = []  # base64-encoded audio payloads, in arrival order
word_events: list[dict] = []  # "data" objects from word_timestamp frames


def on_open(ws):
    # Send the synthesis request as soon as the socket is open.
    ws.send(json.dumps({
        "voice_id": VOICE_ID,
        "text": TEXT,
        "sample_rate": SAMPLE_RATE,
        "model": "lightning_v3.1_pro",
        "word_timestamps": True,
    }))


def on_message(ws, raw):
    msg = json.loads(raw)
    status = msg.get("status")
    if status == "chunk":
        audio_chunks.append(msg["data"]["audio"])
    elif status == "word_timestamp":
        w = msg["data"]
        word_events.append(w)
        print(f"  id={w['id']:<2} {w['word']!r:<15} {w['start']:.3f}s → {w['end']:.3f}s")
    elif status == "complete":
        ws.close()


def on_close(_ws, *_):
    # Decode the collected chunks and write them out as mono 16-bit PCM.
    pcm = b"".join(base64.b64decode(c) for c in audio_chunks)
    with wave.open("out.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SAMPLE_RATE)
        f.writeframes(pcm)
    print(f"\nsaved out.wav ({len(audio_chunks)} chunks, {len(word_events)} word events)")


WebSocketApp(
    WS_URL,
    header=[f"Authorization: Bearer {API_KEY}"],
    on_open=on_open,
    on_message=on_message,
    on_close=on_close,
).run_forever()
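
Once the `word_events` list from the example is populated, one way to use it for captions is to render a WebVTT track with one cue per word. This is a sketch under stated assumptions: `vtt_time` and `to_webvtt` are hypothetical helpers, not part of any SDK, and cue text containing `&` or `<` would need escaping, which is omitted here.

```python
def vtt_time(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_webvtt(word_events):
    """Render one WebVTT cue per word event, ordered by start time."""
    lines = ["WEBVTT", ""]
    for w in sorted(word_events, key=lambda e: e["start"]):
        lines.append(f"{vtt_time(w['start'])} --> {vtt_time(w['end'])}")
        lines.append(w["word"])
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


# Using the frame shown under Response frames as sample input.
print(to_webvtt([{"id": 5, "word": "$100", "start": 1.12, "end": 1.92}]))
```

One cue per word gives karaoke-style captions; grouping consecutive words into phrase-level cues would be a natural variation for ordinary subtitles.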

Backward compatibility

This change is purely additive. Existing integrations that don’t set word_timestamps see the same wire shape as before: the same chunk frames, the same complete terminator, and no new event types to handle.

See also