Synthesize speech and stream the audio back over Server-Sent Events (SSE). The request body is the same as for `/waves/v1/tts`. The only difference is that the response is a stream of base64-encoded audio chunks (raw PCM by default) instead of one binary blob.
Select the model with the `model` body parameter, exactly as on the sync route.
<Note>
**The same URL also serves the WebSocket endpoint.** `wss://api.smallest.ai/waves/v1/tts/live` accepts a WebSocket upgrade for streaming-text scenarios (LLM token streams, live captioning). The HTTP `POST` documented on this page returns SSE; connect over `wss://` instead to use the WebSocket protocol. See the [WebSocket reference](/waves/api-reference/api-reference/text-to-speech/tts).
</Note>
## When to use this
- **Use this** when you want playback to start before synthesis is complete, for example with long passages, latency-sensitive UIs, or live narration.
- **Use the sync `/waves/v1/tts` route** when time to first audio doesn't matter and you'd rather receive a single buffer.
- **Use the WebSocket endpoint (`wss://api.smallest.ai/waves/v1/tts/live`)** when the *text* itself arrives incrementally, as with an LLM token stream. SSE assumes you have the full text up front.
## How it works
1. POST your text and voice settings. The payload is the same as for `/waves/v1/tts`, including the optional `model` field.
2. The response is `Content-Type: text/event-stream`. Each chunk frame is `event: audio\n` followed by `data: {"audio": "<base64-pcm>"}\n\n`.
3. Base64-decode each chunk's `audio` field and feed the resulting bytes (raw PCM with the default `output_format`) to your audio pipeline: the Web Audio API in a browser, an ffmpeg pipe, a raw PCM player, and so on.
4. A final `data: {"done": true}\n\n` frame marks the end of the stream.
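The steps above can be sketched as a small Bash function. This is a minimal, hypothetical helper (`decode_sse_audio` is not part of any Smallest.ai SDK). It assumes every `data:` line carries a single-line JSON object shaped like the frames described above, pulls out the `audio` field with a regular expression rather than a full JSON parser, appends each decoded chunk to a file, and stops at the `done` frame.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of any Smallest.ai SDK): read an SSE stream on
# stdin and append the base64-decoded audio of every chunk frame to file "$1".
# ASSUMPTION: each `data:` line holds a single-line JSON object such as
# {"audio": "<base64>"} or {"done": true}.
decode_sse_audio() {
  local out=$1 line payload
  local audio_re='"audio"[[:space:]]*:[[:space:]]*"([^"]*)"'
  local done_re='"done"[[:space:]]*:[[:space:]]*true'
  : > "$out"  # start with an empty output file
  while IFS= read -r line || [[ -n $line ]]; do
    line=${line%$'\r'}                    # tolerate CRLF line endings
    [[ $line == data:* ]] || continue     # skip event:, comment, and blank lines
    payload=${line#data:}
    if [[ $payload =~ $done_re ]]; then
      return 0                            # end-of-stream frame
    fi
    if [[ $payload =~ $audio_re ]]; then
      # Each frame is base64-encoded on its own, so decode frame by frame.
      printf '%s' "${BASH_REMATCH[1]}" | base64 --decode >> "$out" || return 1
    fi
  done
  return 1  # the stream closed without a {"done": true} frame
}
```

To try it, add `-s` to the cURL command in the Examples section and pipe its output into `decode_sse_audio speech.pcm`. The function returns a non-zero status if the stream closes before the `done` frame, which can signal a truncated response.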
## Examples
**cURL**
```bash
# -N turns off curl's output buffering so each SSE frame prints as it arrives.
curl -N -X POST "https://api.smallest.ai/waves/v1/tts/live" \
  -H "Authorization: Bearer $SMALLEST_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Streaming this paragraph chunk by chunk so playback can start sooner.",
    "voice_id": "magnus",
    "sample_rate": 24000,
    "output_format": "pcm"
  }'
```
## Common gotchas
- **Use a streaming-friendly client**, such as `curl -N`, Python `requests` with `stream=True` and `iter_lines()`, or a `fetch` response body read through a `ReadableStream` reader. Clients that buffer the whole response erase the latency benefit.
- **The audio is base64 inside the event payload**; the raw event bytes are not audio. Decode the `audio` field of each event's JSON `data`.
- **`"output_format": "pcm"`** gives the lowest overhead for streaming playback. `wav` and `mp3` also work but add per-chunk framing bytes.
## Request

The request body is a JSON object with the following fields.
**`text`** (string, required; defaults to `Hello from Waves TTS.`)

The text to convert to speech.
**`voice_id`** (string, required; defaults to `magnus`)

The voice identifier to use for speech generation. See the model card for the voices available on each model.
**`model`** (enum, optional; defaults to `lightning_v3.1`)

The TTS model to route the request to. This controls which model pool serves the synthesis.

- `lightning_v3.1` (default): standard Lightning v3.1.
- `lightning_v3.1_pro`: the Lightning v3.1 Pro pool, with improved audio quality and naturalness and a curated voice catalog. See the Lightning v3.1 Pro model card for supported voice IDs.

Both models share the same concurrency and latency profile, and all other request parameters behave identically.
**`sample_rate`** (enum, optional; defaults to `44100`)

The sample rate of the generated audio, in Hz.
**`speed`** (double, optional; range 0.5 to 2; defaults to `1`)

The speed of the generated speech, where `1` is normal speed.
**`language`** (enum, optional; defaults to `en`)

Language code for synthesis. It influences pronunciation, number and date normalization, and phoneme selection.

Each voice has its own `tags.language` set in the voice catalog; query `GET /waves/v1/lightning-v3.1/get_voices` to see it. Pass a language the voice was trained on: the API accepts other codes, but the output is pronounced as English.

- On `lightning_v3.1`, the full 12-language catalog applies.
- On `lightning_v3.1_pro`:
  - Pass `en` for UK and American accented English.
  - Pass `hi` for Indian accented English and Hindi (code-switching).
  - Omit `language` to get the `en` + `hi` default (mixed Indian and Western English coverage).
**`output_format`** (enum, optional; defaults to `pcm`)

Format of the returned audio. `pcm` is the lowest-latency option but needs a decoder to play; `mp3` and `wav` play directly in browsers and most media players. The server uses `pcm` when the field is omitted, while the API playground sends `mp3` so the generated audio is directly playable.
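Raw `pcm` output has no container, so most players can't open it directly. One workaround is to prepend a standard 44-byte WAV header. The sketch below assumes the samples are signed 16-bit little-endian integers, mono by default; this page does not state the sample format, so confirm it before relying on the helper. The function names `le_bytes` and `pcm_to_wav` are hypothetical, not part of any Smallest.ai tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: wrap raw PCM in a 44-byte WAV header so ordinary
# players can open it. ASSUMPTION: signed 16-bit little-endian samples.

# Write the unsigned integer $1 as $2 little-endian bytes.
le_bytes() {
  local value=$1 count=$2 i
  for ((i = 0; i < count; i++)); do
    # Format "\NNN" makes printf emit the byte with that octal value.
    printf "\\$(printf '%03o' $(( (value >> (8 * i)) & 0xFF )))"
  done
}

# Usage: pcm_to_wav IN.pcm OUT.wav SAMPLE_RATE [CHANNELS] [BITS_PER_SAMPLE]
pcm_to_wav() {
  local in=$1 out=$2 rate=$3 channels=${4:-1} bits=${5:-16}
  local data_size block_align byte_rate
  data_size=$(( $(wc -c < "$in") ))       # arithmetic strips wc's padding
  block_align=$(( channels * bits / 8 ))  # bytes per sample frame
  byte_rate=$(( rate * block_align ))     # bytes per second
  {
    printf 'RIFF'; le_bytes $(( 36 + data_size )) 4; printf 'WAVE'
    printf 'fmt '; le_bytes 16 4          # fmt chunk size for plain PCM
    le_bytes 1 2                          # audio format 1 = linear PCM
    le_bytes "$channels" 2; le_bytes "$rate" 4; le_bytes "$byte_rate" 4
    le_bytes "$block_align" 2; le_bytes "$bits" 2
    printf 'data'; le_bytes "$data_size" 4
    cat "$in"                             # the PCM samples themselves
  } > "$out"
}
```

For a stream requested with `"sample_rate": 24000`, run `pcm_to_wav speech.pcm speech.wav 24000`. Pass the same rate you requested: a mismatched header plays the audio back at the wrong speed and pitch.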
**`pronunciation_dicts`** (list of strings, optional)

The IDs of the pronunciation dictionaries to use for speech generation. Available on both `lightning_v3.1` and `lightning_v3.1_pro`.
**`word_timestamps`** (boolean, optional; defaults to `false`)

A WebSocket-only feature. This endpoint accepts the field but ignores it: neither the sync HTTP response nor the SSE response includes per-word timing. To receive `status: "word_timestamp"` frames with per-word `{ id, word, start, end }` data, use the WebSocket endpoint `wss://api.smallest.ai/waves/v1/tts/live`. See Word-level timestamps.
**`session_id`** (string, optional; at most 128 characters; pattern `^[a-zA-Z0-9_\-.]+$`)

Optional client-provided session identifier for correlation. Only alphanumeric characters, hyphens, underscores, and dots are allowed, up to 128 characters. The value is echoed back in the `X-External-Session-Id` response header.

**`request_id`** (string, optional; at most 128 characters; pattern `^[a-zA-Z0-9_\-.]+$`)

Optional client-provided request identifier for correlation. Only alphanumeric characters, hyphens, underscores, and dots are allowed, up to 128 characters. The value is echoed back in the `X-External-Request-Id` response header.
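A minimal client-side check for `session_id` and `request_id`, written as a hypothetical Bash helper: it enforces the documented 128-character limit and character set before the value goes into a request. The allowed characters are spelled out instead of using ranges like `a-z`, because bracket ranges can be locale-dependent.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if $1 satisfies the documented rules for
# session_id / request_id (1 to 128 characters from letters, digits, _ . -).
is_valid_correlation_id() {
  local id=$1
  # Allowed characters listed explicitly; "-" comes last so it is literal.
  local allowed='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_.-'
  local id_re="^[${allowed}]+$"   # "+" also rejects the empty string
  (( ${#id} <= 128 )) && [[ $id =~ $id_re ]]
}
```

For example, `is_valid_correlation_id "$REQUEST_ID" || echo "request_id is invalid" >&2` before building the request body.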
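## Response

Synthesized speech retrieved successfully. The body is a `text/event-stream` of audio frames, terminated by a `{"done": true}` frame as described in How it works.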