Text to Speech (Lightning)

Word-level timestamps

Pass word_timestamps: true on a Lightning TTS WebSocket request to receive per-word timing events interleaved with the audio stream. Each event tells you exactly when a word starts and ends in the generated audio.

When to use this

  • Live captions — render words to the screen as they’re spoken.
  • Karaoke-style highlighting — sync the active-word UI to playback.
  • Avatar lip-sync — drive viseme transitions off word boundaries.
  • Word-level analytics — log per-word latency or downstream events.
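
For the caption and highlighting cases, the core operation is looking up which word is active at the current playback time. The sketch below is one way to do that, assuming word events have been collected into a list of dicts with `id`, `word`, `start`, and `end` keys, sorted by start time. The helper name `active_word` and the sample timings are hypothetical, made up for illustration.

```python
from bisect import bisect_right


def active_word(events, t):
    """Return the word event being spoken at playback time t (seconds), or None.

    Assumes `events` is sorted by "start" and that words do not overlap.
    """
    starts = [e["start"] for e in events]
    i = bisect_right(starts, t) - 1  # last event that started at or before t
    if i >= 0 and t < events[i]["end"]:
        return events[i]
    return None  # before the first word, between words, or after the last


# Hypothetical timings, not real server output.
events = [
    {"id": 0, "word": "I", "start": 0.00, "end": 0.12},
    {"id": 1, "word": "bought", "start": 0.15, "end": 0.48},
    {"id": 2, "word": "3", "start": 0.52, "end": 0.80},
]

current = active_word(events, 0.20)
print(current["word"] if current else "(silence)")
```

A UI loop would call this with the audio player's current position on each frame and highlight the returned word, clearing the highlight when the result is `None`.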

Endpoint support

| Surface | Word timestamps |
| --- | --- |
| WSS /waves/v1/tts/live (unified) | ✅ |
| WSS /waves/v1/lightning-v3.1/get_speech/stream (legacy, retiring 2026-07-14) | ✅ |
| POST /waves/v1/tts (sync HTTP) | ❌ (flag accepted, silently ignored) |
| POST /waves/v1/tts/live (HTTP SSE) | ❌ (flag accepted, silently ignored) |

If you need word timing, use the WebSocket path.

Voice support

Word events are emitted only when the chosen voice family has the aligner checkpoint baked in.

| Language | Voice family | Word events |
| --- | --- | --- |
| English (en) | Base-queue voices: meher, devansh, kartik, maithili, liam, avery | ✅ |
| Hindi (hi) | Base-queue voices (same list) | ✅ |
| Marathi / Bengali / Gujarati / Punjabi / Odia | north-Indic family | ❌ |
| Tamil / Telugu / Kannada / Malayalam | south-Indic family | ❌ |

For unsupported voice families the flag is accepted — audio works normally, but no word_timestamp frames are emitted.
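
Because an unsupported voice fails silently, a client may want to decide up front whether to show a caption UI at all. The sketch below hard-codes a snapshot of the support table as a client-side check. The helper name `expects_word_events` and the idea of passing a language code alongside the voice are assumptions for illustration, not part of the API, and the snapshot would need updating whenever voice support changes.

```python
# Client-side snapshot of the voice support table; illustrative, not authoritative.
WORD_EVENT_LANGUAGES = {"en", "hi"}
BASE_QUEUE_VOICES = {"meher", "devansh", "kartik", "maithili", "liam", "avery"}


def expects_word_events(voice_id: str, language: str) -> bool:
    """Return True if word_timestamp frames are expected for this voice and language."""
    return language in WORD_EVENT_LANGUAGES and voice_id in BASE_QUEUE_VOICES


print(expects_word_events("meher", "en"))  # base-queue voice, English
print(expects_word_events("meher", "ta"))  # Tamil is in the south-Indic family
```

A more defensive client could also check at runtime: if the flag was set and the `complete` frame arrives without any `word_timestamp` frames, fall back to audio-only playback.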

Request

    {
      "voice_id": "meher",
      "text": "I bought 3 cats for $100 on Dec 25th",
      "model": "lightning_v3.1",
      "sample_rate": 44100,
      "word_timestamps": true
    }

word_timestamps defaults to false. Clients that don’t set the flag see no behavior change.
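
Since the flag defaults to false, a request builder can leave it out entirely unless the caller opts in. The following is a minimal sketch of that pattern. `build_tts_request` is a hypothetical helper, and its default values simply mirror the sample request rather than any documented defaults.

```python
import json


def build_tts_request(text, voice_id, *, word_timestamps=False,
                      sample_rate=44100, model="lightning_v3.1"):
    """Serialize a TTS request, including word_timestamps only when opted in."""
    payload = {
        "voice_id": voice_id,
        "text": text,
        "model": model,
        "sample_rate": sample_rate,
    }
    if word_timestamps:
        # Omitting the key is equivalent to sending false (the documented default).
        payload["word_timestamps"] = True
    return json.dumps(payload)


print(build_tts_request("Hello there", "meher", word_timestamps=True))
```

Keeping the key out of the payload for existing callers means their requests stay byte-for-byte what they were before the feature existed.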

Response frames

The server interleaves word_timestamp frames with chunk frames in audio-time order, followed by a single complete frame:

chunk → chunk → word_timestamp{id:0} → chunk → word_timestamp{id:1} → … → complete

Each word_timestamp frame:

    {
      "session_id": "ws_550e8400-e29b",
      "request_id": "task_6ba7b810-9dad",
      "status": "word_timestamp",
      "data": {
        "id": 5,
        "word": "$100",
        "start": 1.12,
        "end": 1.92
      }
    }

| Field | Type | Description |
| --- | --- | --- |
| data.id | integer | 0-indexed position of the word within the input text. |
| data.word | string | Exact substring from the input (un-normalized). "$100" stays "$100", "25th" stays "25th", "3" stays "3". Non-Latin scripts are preserved verbatim (e.g. Devanagari for Hindi). |
| data.start | float (seconds) | Start of the word in the audio stream. |
| data.end | float (seconds) | End of the word in the audio stream. |
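
Because `data.word` is an exact substring of the input, a word event can be mapped back to character offsets in the original text, which is useful for highlighting. The sketch below parses a frame into a small dataclass and locates the word, assuming `data.id` counts whitespace-separated tokens. That tokenization rule is an assumption that happens to match the sample frame (`"$100"` is token 5 of the sample text), not documented behavior. `WordTimestamp`, `parse_word_frame`, and `char_span` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WordTimestamp:
    id: int
    word: str
    start: float  # seconds
    end: float    # seconds


def parse_word_frame(frame: dict):
    """Return a WordTimestamp for word_timestamp frames, or None for other frames."""
    if frame.get("status") != "word_timestamp":
        return None
    d = frame["data"]
    return WordTimestamp(int(d["id"]), str(d["word"]), float(d["start"]), float(d["end"]))


def char_span(text: str, event: WordTimestamp):
    """Return (start, end) character offsets of the word in `text`, or None.

    Assumes data.id indexes whitespace-separated tokens; returns None when the
    token at that index does not match data.word.
    """
    tokens = list(re.finditer(r"\S+", text))
    if 0 <= event.id < len(tokens) and tokens[event.id].group() == event.word:
        return tokens[event.id].span()
    return None


text = "I bought 3 cats for $100 on Dec 25th"
frame = {
    "status": "word_timestamp",
    "data": {"id": 5, "word": "$100", "start": 1.12, "end": 1.92},
}
event = parse_word_frame(frame)
print(char_span(text, event))
```

Returning `None` on a mismatch, rather than guessing, lets the caller fall back to searching for `event.word` in the text if the tokenization assumption turns out not to hold.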

Example

    # Requires: pip install websocket-client
    # Opt into word_timestamps on the v3.1 WebSocket, save the audio to WAV,
    # and dump every received word event with its timing.
    import base64
    import json
    import os
    import wave
    from websocket import WebSocketApp

    API_KEY = os.environ["SMALLEST_API_KEY"]
    WS_URL = "wss://api.smallest.ai/waves/v1/tts/live"
    SAMPLE_RATE = 44100
    TEXT = "I bought 3 cats for $100 on Dec 25th"
    VOICE_ID = "meher"

    audio_chunks: list[str] = []   # base64-encoded audio payloads, in arrival order
    word_events: list[dict] = []   # "data" objects from word_timestamp frames


    def on_open(ws):
        # Send the synthesis request as soon as the socket is open.
        ws.send(json.dumps({
            "voice_id": VOICE_ID,
            "text": TEXT,
            "sample_rate": SAMPLE_RATE,
            "model": "lightning_v3.1",
            "word_timestamps": True,
        }))


    def on_message(ws, raw):
        # Dispatch on the frame's status field.
        msg = json.loads(raw)
        status = msg.get("status")
        if status == "chunk":
            audio_chunks.append(msg["data"]["audio"])
        elif status == "word_timestamp":
            w = msg["data"]
            word_events.append(w)
            print(f" id={w['id']:<2} {w['word']!r:<15} {w['start']:.3f}s → {w['end']:.3f}s")
        elif status == "complete":
            ws.close()


    def on_close(_ws, *_):
        # Decode the buffered chunks and write them out as 16-bit mono WAV.
        pcm = b"".join(base64.b64decode(c) for c in audio_chunks)
        with wave.open("out.wav", "wb") as f:
            f.setnchannels(1)
            f.setsampwidth(2)
            f.setframerate(SAMPLE_RATE)
            f.writeframes(pcm)
        print(f"\nsaved out.wav ({len(audio_chunks)} chunks, {len(word_events)} word events)")


    WebSocketApp(
        WS_URL,
        header=[f"Authorization: Bearer {API_KEY}"],
        on_open=on_open,
        on_message=on_message,
        on_close=on_close,
    ).run_forever()
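
Once the audio and word events are both in hand, the timings can be turned into byte offsets to cut out the audio for a single word. The sketch below assumes the decoded audio is 16-bit mono PCM at the requested sample rate, as the WAV header settings in the example imply, and that `start` and `end` are measured from the beginning of the stream. `word_pcm_slice` is a hypothetical helper, and the buffer here is synthetic silence rather than real output.

```python
def word_pcm_slice(pcm: bytes, start: float, end: float,
                   sample_rate: int = 44100, sample_width: int = 2) -> bytes:
    """Return the bytes of mono PCM audio between start and end (seconds)."""
    # Round to whole samples first so the slice never splits a sample in half.
    first = int(round(start * sample_rate)) * sample_width
    last = int(round(end * sample_rate)) * sample_width
    return pcm[first:last]


# Two seconds of synthetic silence standing in for the decoded stream.
pcm = bytes(2 * 44100 * 2)

# Timings taken from the sample word_timestamp frame for "$100".
clip = word_pcm_slice(pcm, 1.12, 1.92)
print(len(clip) // 2, "samples")
```

The same arithmetic works for per-word analytics, for example measuring the duration of each word in samples or trimming leading silence before the first word.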

Backward compatibility

Pure addition. Existing integrations that don’t set word_timestamps see the same wire shape as before — same chunk frames, same complete terminator, no new event types to handle.

See also

  • Lightning v3.1 model card — model context for the feature.
  • Streaming TTS — full WebSocket setup and audio handling.