# Word-level timestamps

> Per-word timing events interleaved with audio chunks on the Lightning TTS WebSocket — for live captions, karaoke highlighting, and avatar lip-sync.

Set `word_timestamps: true` in a Lightning TTS WebSocket request to receive per-word timing events interleaved with the audio stream. Each event gives the word's start and end time, in seconds, within the generated audio.

## When to use this

* **Live captions** — render words to the screen as they're spoken.
* **Karaoke-style highlighting** — sync the active-word UI to playback.
* **Avatar lip-sync** — drive viseme transitions off word boundaries.
* **Word-level analytics** — log per-word latency or downstream events.
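
For captions and karaoke-style highlighting, the core client-side step is finding which word is active at the current playback position. The following is a minimal sketch, assuming the word events have already been collected into a list ordered by `start`. The helper name `active_word` and the sample timings are illustrative and are not real API output.

```python
from bisect import bisect_right


def active_word(events, t):
    """Return the event whose [start, end) interval contains time t, else None."""
    starts = [e["start"] for e in events]
    # Index of the last event that starts at or before t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < events[i]["end"]:
        return events[i]
    return None  # t falls before the first word, in a gap, or after the last word


# Illustrative timings only (made up for this sketch).
events = [
    {"id": 0, "word": "I", "start": 0.00, "end": 0.12},
    {"id": 1, "word": "bought", "start": 0.15, "end": 0.48},
    {"id": 2, "word": "3", "start": 0.50, "end": 0.80},
]

print(active_word(events, 0.20)["word"])  # bought
print(active_word(events, 0.13))          # None
```

A UI would call this on every playback tick with the player's current time and highlight the returned word, if any.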

## Endpoint support

| Surface                                                                        | Word timestamps                     |
| ------------------------------------------------------------------------------ | ----------------------------------- |
| `WSS /waves/v1/tts/live` (unified)                                             | ✅                                   |
| `WSS /waves/v1/lightning-v3.1/get_speech/stream` (legacy, retiring 2026-07-14) | ✅                                   |
| `POST /waves/v1/tts` (sync HTTP)                                               | ❌ (flag accepted, silently ignored) |
| `POST /waves/v1/tts/live` (HTTP SSE)                                           | ❌ (flag accepted, silently ignored) |

If you need word timing, use the WebSocket path.

## Voice support

Word events are emitted only for voice families whose model includes an aligner checkpoint.

| Language                                      | Voice family                                                                  | Word events |
| --------------------------------------------- | ----------------------------------------------------------------------------- | ----------- |
| English (`en`)                                | Base-queue voices — `meher`, `devansh`, `kartik`, `maithili`, `liam`, `avery` | ✅           |
| Hindi (`hi`)                                  | Base-queue voices (same list)                                                 | ✅           |
| Marathi / Bengali / Gujarati / Punjabi / Odia | north-Indic family                                                            | ❌           |
| Tamil / Telugu / Kannada / Malayalam          | south-Indic family                                                            | ❌           |

For unsupported voice families the flag is **accepted** — audio works normally, but no `word_timestamp` frames are emitted.
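
Because the flag is accepted silently for unsupported voices, a client may want to decide up front whether to expect word events. This is a minimal sketch that hardcodes the table above. The names `WORD_EVENT_LANGUAGES`, `BASE_QUEUE_VOICES`, and `expects_word_events` are hypothetical, and the lists would need updating by hand if voice support changes.

```python
# Copied from the voice-support table above; not fetched from the API.
WORD_EVENT_LANGUAGES = {"en", "hi"}
BASE_QUEUE_VOICES = {"meher", "devansh", "kartik", "maithili", "liam", "avery"}


def expects_word_events(language, voice_id):
    """True if the table above lists this language/voice pair as emitting word events."""
    return language in WORD_EVENT_LANGUAGES and voice_id in BASE_QUEUE_VOICES


print(expects_word_events("en", "meher"))  # True
print(expects_word_events("hi", "liam"))   # True
```

When this returns `False`, a caption UI can fall back to showing the full text at once instead of waiting for `word_timestamp` frames that will never arrive.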

## Request

```json
{
  "voice_id": "meher",
  "text": "I bought 3 cats for $100 on Dec 25th",
  "model": "lightning_v3.1",
  "sample_rate": 44100,
  "word_timestamps": true
}
```

`word_timestamps` defaults to `false`. Clients that don't set the flag see no behavior change.

## Response frames

The server interleaves `word_timestamp` frames with `chunk` frames in audio-time order, followed by a single `complete`:

```
chunk → chunk → word_timestamp{id:0} → chunk → word_timestamp{id:1} → … → complete
```
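
The sequence above can be consumed with a small reducer that is independent of any WebSocket library. This is a sketch, assuming each frame arrives as a JSON string and that `chunk` frames carry base64 audio in `data.audio`, as in the example further down. The function name `collect_frames` is illustrative.

```python
import base64
import json


def collect_frames(raw_frames):
    """Split a frame sequence into (pcm_bytes, word_events), stopping at `complete`."""
    pcm = bytearray()
    words = []
    for raw in raw_frames:
        msg = json.loads(raw)
        status = msg.get("status")
        if status == "chunk":
            pcm.extend(base64.b64decode(msg["data"]["audio"]))
        elif status == "word_timestamp":
            words.append(msg["data"])
        elif status == "complete":
            break  # terminator: nothing after this belongs to the request
        # Any other status is ignored in this sketch.
    return bytes(pcm), words
```

Keeping the reducer pure makes it easy to unit-test against recorded frames before wiring it to a live socket.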

Each `word_timestamp` frame:

```json
{
  "session_id": "ws_550e8400-e29b",
  "request_id": "task_6ba7b810-9dad",
  "status": "word_timestamp",
  "data": {
    "id": 5,
    "word": "$100",
    "start": 1.12,
    "end": 1.92
  }
}
```

| Field        | Type            | Description                                                                                                                                                                                |
| ------------ | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `data.id`    | integer         | 0-indexed position of the word within the input text.                                                                                                                                      |
| `data.word`  | string          | Exact substring from the input — un-normalized. `"$100"` stays `"$100"`, `"25th"` stays `"25th"`, `"3"` stays `"3"`. Non-Latin scripts are preserved verbatim (e.g. Devanagari for Hindi). |
| `data.start` | float (seconds) | Start of the word in the audio stream.                                                                                                                                                     |
| `data.end`   | float (seconds) | End of the word.                                                                                                                                                                           |
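
Two common follow-ups are mapping an event back to a character range in the input text (for highlighting) and to a byte range in the decoded audio (for slicing). This is a sketch under stated assumptions: it relies on `data.word` being an exact substring of the input, as described above, and it assumes 16-bit mono PCM at the requested `sample_rate`. Both helper names are hypothetical.

```python
def char_spans(text, events):
    """Map each event to a (start, end) character span in `text`, or None if not found."""
    spans, cursor = [], 0
    for event in events:
        # Search forward from the previous match so repeated words resolve in order.
        idx = text.find(event["word"], cursor)
        if idx == -1:
            spans.append(None)
            continue
        cursor = idx + len(event["word"])
        spans.append((idx, cursor))
    return spans


def pcm_byte_range(event, sample_rate, bytes_per_sample=2, channels=1):
    """Convert start/end seconds to byte offsets in raw PCM (assumes 16-bit mono by default)."""
    frame_size = bytes_per_sample * channels
    start = round(event["start"] * sample_rate) * frame_size
    end = round(event["end"] * sample_rate) * frame_size
    return start, end


text = "I bought 3 cats for $100 on Dec 25th"
event = {"id": 5, "word": "$100", "start": 1.12, "end": 1.92}
print(char_spans(text, [event]))     # [(20, 24)]
print(pcm_byte_range(event, 44100))  # (98784, 169344)
```

Rounding to whole samples before multiplying keeps the byte offsets aligned to sample boundaries, so a slice such as `pcm[start:end]` never splits a 16-bit sample.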

## Example

```python Python
# Requires: pip install websocket-client
# Opt into word_timestamps on the unified Lightning TTS WebSocket, save the
# audio to WAV, and print every received word event with its timing.
import base64
import json
import os
import wave
from websocket import WebSocketApp

API_KEY = os.environ["SMALLEST_API_KEY"]
WS_URL = "wss://api.smallest.ai/waves/v1/tts/live"
SAMPLE_RATE = 44100
TEXT = "I bought 3 cats for $100 on Dec 25th"
VOICE_ID = "meher"

audio_chunks: list[str] = []
word_events: list[dict] = []


def on_open(ws):
    ws.send(json.dumps({
        "voice_id": VOICE_ID,
        "text": TEXT,
        "sample_rate": SAMPLE_RATE,
        "model": "lightning_v3.1",
        "output_format": "pcm",  # raw PCM, matching the WAV header written in on_close
        "word_timestamps": True,
    }))


def on_message(ws, raw):
    msg = json.loads(raw)
    status = msg.get("status")
    if status == "chunk":
        audio_chunks.append(msg["data"]["audio"])
    elif status == "word_timestamp":
        w = msg["data"]
        word_events.append(w)
        print(f"  id={w['id']:<2}  {w['word']!r:<15}  {w['start']:.3f}s → {w['end']:.3f}s")
    elif status == "complete":
        ws.close()


def on_close(_ws, *_):
    # Chunks arrive base64-encoded; decode and write them as 16-bit mono PCM.
    pcm = b"".join(base64.b64decode(c) for c in audio_chunks)
    with wave.open("out.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SAMPLE_RATE)
        f.writeframes(pcm)
    print(f"\nsaved out.wav  ({len(audio_chunks)} chunks, {len(word_events)} word events)")


WebSocketApp(
    WS_URL,
    header=[f"Authorization: Bearer {API_KEY}"],
    on_open=on_open,
    on_message=on_message,
    on_close=on_close,
).run_forever()
```

```javascript JavaScript
// Requires: npm install ws
const WebSocket = require("ws");

const API_KEY = process.env.SMALLEST_API_KEY;
const ws = new WebSocket("wss://api.smallest.ai/waves/v1/tts/live", {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

const captionTrack = [];
const audioChunks = [];

ws.on("open", () => {
  ws.send(JSON.stringify({
    voice_id: "meher",
    text: "I bought 3 cats for $100 on Dec 25th",
    model: "lightning_v3.1",
    sample_rate: 44100,
    output_format: "pcm",
    word_timestamps: true,
  }));
});

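ws.on("message", (raw) => {
  // ws delivers messages as Buffers; convert explicitly before parsing.
  const msg = JSON.parse(raw.toString());
  switch (msg.status) {
    case "chunk":
      // Audio payloads are base64-encoded.
      audioChunks.push(Buffer.from(msg.data.audio, "base64"));
      break;
    case "word_timestamp": {
      const { id, word, start, end } = msg.data;
      captionTrack.push({ id, word, startSec: start, endSec: end });
      break;
    }
    case "complete": {
      const audioBytes = Buffer.concat(audioChunks).length;
      console.log(`Got ${captionTrack.length} word events, ${audioBytes} bytes of audio`);
      ws.close();
      break;
    }
  }
});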
```
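
Once the word events are collected, they can be turned into a caption file. This is a minimal sketch that groups events into WebVTT cues. The helper names `format_ts` and `to_webvtt` and the `words_per_cue` parameter are illustrative choices and are not part of the API.

```python
def format_ts(seconds):
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_webvtt(events, words_per_cue=4):
    """Group word events into cues spanning the first word's start to the last word's end."""
    lines = ["WEBVTT", ""]
    for i in range(0, len(events), words_per_cue):
        group = events[i:i + words_per_cue]
        lines.append(f"{format_ts(group[0]['start'])} --> {format_ts(group[-1]['end'])}")
        lines.append(" ".join(e["word"] for e in group))
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)


print(to_webvtt([{"id": 5, "word": "$100", "start": 1.12, "end": 1.92}]))
# WEBVTT
#
# 00:00:01.120 --> 00:00:01.920
# $100
```

With the Python example above, calling `to_webvtt(word_events)` inside `on_close` would produce a caption track to save alongside `out.wav`.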

## Backward compatibility

The feature is purely additive. Existing integrations that don't set `word_timestamps` see the same wire shape as before: the same `chunk` frames, the same `complete` terminator, and no new event types to handle.

## See also

* [Lightning v3.1 model card](/waves/model-cards/text-to-speech/lightning-v-3-1#word-level-timestamps) — model context for the feature.
* [Streaming TTS](/waves/documentation/text-to-speech-lightning/streaming) — full WebSocket setup and audio handling.