Word-level timestamps | Smallest AI Docs

Pass word_timestamps: true on a Lightning TTS WebSocket request to receive per-word timing events interleaved with the audio stream. Each event tells you exactly when a word starts and ends in the generated audio.

When to use this

Live captions — render words to the screen as they’re spoken.
Karaoke-style highlighting — sync the active-word UI to playback.
Avatar lip-sync — drive viseme transitions off word boundaries.
Word-level analytics — log per-word latency or downstream events.
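
For captions and karaoke-style highlighting, the core operation is finding which word is active at a given playback time. A minimal sketch, assuming the word events have already been collected and sorted by `start`; the `active_word` helper and the sample timings are illustrative, not part of the API:

```python
from bisect import bisect_right


def active_word(events, t):
    """Return the event whose [start, end) interval contains time t, else None.

    `events` must be sorted by "start"; times are in seconds.
    """
    starts = [e["start"] for e in events]
    i = bisect_right(starts, t) - 1  # last event starting at or before t
    if i >= 0 and t < events[i]["end"]:
        return events[i]
    return None  # before the first word, after the last, or in a pause


# Illustrative events; the timings are invented for this example.
events = [
    {"id": 0, "word": "I", "start": 0.00, "end": 0.15},
    {"id": 1, "word": "bought", "start": 0.15, "end": 0.50},
    {"id": 2, "word": "3", "start": 0.60, "end": 0.85},
]

print(active_word(events, 0.20))  # the "bought" event
print(active_word(events, 0.55))  # None: gap between "bought" and "3"
```

Returning `None` during pauses lets the UI clear the highlight instead of holding the previous word.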

Endpoint support

Surface	Word timestamps
`WSS /waves/v1/tts/live` (unified)	✅
`WSS /waves/v1/lightning-v3.1/get_speech/stream` (legacy, retiring 2026-07-14)	✅
`POST /waves/v1/tts` (sync HTTP)	❌ (flag accepted, silently ignored)
`POST /waves/v1/tts/live` (HTTP SSE)	❌ (flag accepted, silently ignored)

If you need word timing, use the WebSocket path.

Voice support

Word events are emitted only when the chosen voice family has the aligner checkpoint baked in.

Language	Voice family	Word events
English (`en`)	Base-queue voices — `meher`, `devansh`, `kartik`, `maithili`, `liam`, `avery`	✅
Hindi (`hi`)	Base-queue voices (same list)	✅
Marathi / Bengali / Gujarati / Punjabi / Odia	north-Indic family	❌
Tamil / Telugu / Kannada / Malayalam	south-Indic family	❌

For unsupported voice families the flag is accepted — audio works normally, but no word_timestamp frames are emitted.

Request

{
  "voice_id": "meher",
  "text": "I bought 3 cats for $100 on Dec 25th",
  "model": "lightning_v3.1_pro",
  "sample_rate": 44100,
  "word_timestamps": true
}

word_timestamps defaults to false. Clients that don’t set the flag see no behavior change.

Response frames

The server interleaves word_timestamp frames with chunk frames in audio-time order, followed by a single complete:

chunk → chunk → word_timestamp{id:0} → chunk → word_timestamp{id:1} → … → complete

Each word_timestamp frame:

{
  "session_id": "ws_550e8400-e29b",
  "request_id": "task_6ba7b810-9dad",
  "status": "word_timestamp",
  "data": {
    "id": 5,
    "word": "$100",
    "start": 1.12,
    "end": 1.92
  }
}

Field	Type	Description
`data.id`	integer	0-indexed position of the word within the input text.
`data.word`	string	Exact substring from the input — un-normalized. `"$100"` stays `"$100"`, `"25th"` stays `"25th"`, `"3"` stays `"3"`. Non-Latin scripts are preserved verbatim (e.g. Devanagari for Hindi).
`data.start`	float (seconds)	Start of the word in the audio stream.
`data.end`	float (seconds)	End of the word.
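
Since `start` and `end` are offsets in seconds into the audio stream, they can be converted to byte offsets in the decoded audio, for example to cut out a single word. A rough sketch, assuming the decoded audio is raw 16-bit mono PCM at the requested `sample_rate`; the `word_byte_range` helper is hypothetical:

```python
def word_byte_range(start, end, sample_rate=44100, sample_width=2, channels=1):
    """Map a word's [start, end) in seconds to a byte range in raw PCM.

    Rounds to whole sample frames so slices never split a sample.
    """
    frame_bytes = sample_width * channels
    first = round(start * sample_rate) * frame_bytes
    last = round(end * sample_rate) * frame_bytes
    return first, last


# The "$100" frame shown above: 1.12 s to 1.92 s at 44100 Hz.
first, last = word_byte_range(1.12, 1.92)
print(first, last)  # 98784 169344

# With the full PCM buffer in hand, the word's audio would be pcm[first:last].
```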

Example

# Requires: pip install websocket-client
# Opt into word_timestamps on the v3.1 WebSocket, save the audio to WAV,
# and dump every received word event with its timing.
import base64
import json
import os
import wave

from websocket import WebSocketApp

API_KEY = os.environ["SMALLEST_API_KEY"]
WS_URL = "wss://api.smallest.ai/waves/v1/tts/live"
SAMPLE_RATE = 44100
TEXT = "I bought 3 cats for $100 on Dec 25th"
VOICE_ID = "meher"

audio_chunks: list[str] = []  # base64 audio payloads, in arrival order
word_events: list[dict] = []  # "data" objects from word_timestamp frames


def on_open(ws):
    # Send one synthesis request with word timing enabled.
    ws.send(json.dumps({
        "voice_id": VOICE_ID,
        "text": TEXT,
        "sample_rate": SAMPLE_RATE,
        "model": "lightning_v3.1_pro",
        "word_timestamps": True,
    }))


def on_message(ws, raw):
    # Dispatch on the frame's status: audio chunk, word event, or terminator.
    msg = json.loads(raw)
    status = msg.get("status")
    if status == "chunk":
        audio_chunks.append(msg["data"]["audio"])
    elif status == "word_timestamp":
        w = msg["data"]
        word_events.append(w)
        print(f"  id={w['id']:<2}  {w['word']!r:<15}  {w['start']:.3f}s → {w['end']:.3f}s")
    elif status == "complete":
        ws.close()


def on_close(_ws, *_):
    # Decode and join the chunks, then write them out as 16-bit mono PCM.
    pcm = b"".join(base64.b64decode(c) for c in audio_chunks)
    with wave.open("out.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SAMPLE_RATE)
        f.writeframes(pcm)
    print(f"\nsaved out.wav  ({len(audio_chunks)} chunks, {len(word_events)} word events)")


WebSocketApp(
    WS_URL,
    header=[f"Authorization: Bearer {API_KEY}"],
    on_open=on_open,
    on_message=on_message,
    on_close=on_close,
).run_forever()
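
As a follow-on, the collected `word_events` list could be turned into a caption file. The sketch below emits one WebVTT cue per word. The `vtt_time` and `to_vtt` helpers are hypothetical, and the sample event reuses the `$100` frame shown earlier:

```python
def vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_vtt(events) -> str:
    """Build a WebVTT document with one cue per word event, ordered by id."""
    lines = ["WEBVTT", ""]
    for e in sorted(events, key=lambda e: e["id"]):
        lines.append(f"{vtt_time(e['start'])} --> {vtt_time(e['end'])}")
        lines.append(e["word"])
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)


sample = [{"id": 5, "word": "$100", "start": 1.12, "end": 1.92}]
print(to_vtt(sample))
```

This sketch does not escape `&` or `<` in cue text, which a production caption writer would need to handle. Grouping several words per cue is usually more readable than one cue per word.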

Backward compatibility

Pure addition. Existing integrations that don’t set word_timestamps see the same wire shape as before — same chunk frames, same complete terminator, no new event types to handle.