Stream Speech (SSE)

Synthesize speech and stream the audio back over Server-Sent Events. Same body as `/waves/v1/tts`; the only difference is that the response is a stream of base64-encoded PCM chunks instead of one binary blob. Pick the model with the `model` body parameter, same as the sync route.

<Note>
**The same URL serves the WebSocket endpoint.** `wss://api.smallest.ai/waves/v1/tts/live` accepts a WebSocket upgrade for streaming-text scenarios (LLM token streams, live captioning). The HTTP `POST` documented on this page returns SSE; use `wss://` to use the WebSocket protocol instead. See the [WebSocket reference](/waves/api-reference/api-reference/text-to-speech/tts).
</Note>

## When to use this

- **Use this** when you want playback to start before synthesis is complete: long passages, latency-sensitive UI, live narration.
- **Use sync `/waves/v1/tts`** when total latency doesn't matter and you'd rather get one buffer.
- **Use `/waves/v1/tts/live`** (WebSocket) when the *text* arrives incrementally (LLM token stream). SSE assumes you have the full text up front.

## How it works

1. POST your text + voice settings: same payload as `/waves/v1/tts`, plus optional `model`.
2. The response is `Content-Type: text/event-stream`. Each chunk frame is `event: audio\n` followed by `data: {"audio": "<base64-pcm>"}\n\n`.
3. Decode each chunk's `audio` field with base64 and feed the PCM bytes to your audio pipeline (browser `MediaSource`, ffmpeg pipe, raw PCM player, etc.).
4. A final `data: {"done": true}\n\n` frame marks end of stream.

## Examples

**cURL**

```bash
curl -N -X POST "https://api.smallest.ai/waves/v1/tts/live" \
  -H "Authorization: Bearer $SMALLEST_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Streaming this paragraph chunk by chunk so playback can start sooner.",
    "voice_id": "magnus",
    "sample_rate": 24000,
    "output_format": "pcm"
  }'
```

## Common gotchas

- **Use a streaming-friendly client.** `curl -N`, Python `iter_lines`, or a `fetch` `ReadableStream` reader. Buffering clients will hide the latency win.
- **Audio is base64 inside the event payload**, not the raw event bytes. Decode the `data.audio` field per event.
- **`output_format=pcm`** gives the lowest overhead for streaming playback. `wav`/`mp3` work but add per-chunk framing bytes.
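**Python (sketch)**

The frame format described under "How it works" can be consumed with the Python standard library alone. The following is a minimal sketch rather than an official client: the helper names `iter_pcm_chunks` and `stream_tts` are hypothetical, and the parser assumes each `data:` payload arrives on a single line, as in the frames shown above.

```python
import base64
import json
import urllib.request
from typing import Iterable, Iterator


def iter_pcm_chunks(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decoded audio bytes from an iterable of raw SSE lines."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        # Only `data:` lines carry a payload; `event:` lines and the blank
        # separator lines between frames are skipped.
        if not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):].strip())
        if payload.get("done"):
            return  # final frame: end of stream
        if "audio" in payload:
            yield base64.b64decode(payload["audio"])


def stream_tts(api_key: str, text: str, voice_id: str = "magnus") -> Iterator[bytes]:
    """POST to the SSE endpoint and yield audio chunks as lines are read."""
    body = json.dumps({
        "text": text,
        "voice_id": voice_id,
        "sample_rate": 24000,
        "output_format": "pcm",
    }).encode("utf-8")
    request = urllib.request.Request(
        "https://api.smallest.ai/waves/v1/tts/live",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # The response object can be iterated line by line, so each frame is
    # handed to the parser as soon as its line has been read.
    with urllib.request.urlopen(request) as response:
        yield from iter_pcm_chunks(response)
```

Keeping the parser separate from the HTTP call means it can be exercised against canned frames without touching the network. A caller would typically loop over `stream_tts(...)` and write each chunk to an audio sink as it arrives.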

Authentication

`Authorization` (Bearer)

Header authentication of the form `Authorization: Bearer <token>`.

Request

This endpoint expects an object.
`text` (string, required; defaults to `Hello from Waves TTS.`)
The text to convert to speech.
`voice_id` (string, required; defaults to `magnus`)
The voice identifier to use for speech generation. See the model card for available voices per model.
`model` (enum, optional; defaults to `lightning_v3.1`)

TTS model to route the request to. Controls which model pool serves this synthesis.

  • lightning_v3.1 (default) — standard Lightning v3.1.
  • lightning_v3.1_pro — Lightning v3.1 Pro pool. Improved audio quality and naturalness, with a curated voice catalog. See the Lightning v3.1 Pro model card for supported voice IDs.

Both models share the same concurrency and latency profile, and the other request parameters behave identically unless noted otherwise for that parameter.

`sample_rate` (enum, optional; defaults to `44100`)
The sample rate for the generated audio.
`speed` (double, optional; range 0.5 to 2; defaults to `1`)
The speed of the generated speech.
`language` (enum, optional; defaults to `en`)

Language code for synthesis. Influences pronunciation, number/date normalization, and phoneme selection.

Each voice has its own tags.language set in the voice catalog — query GET /waves/v1/lightning-v3.1/get_voices. Pass a language the voice was trained on; passing other codes is accepted by the API but produces English-pronounced output.

On lightning_v3.1, the full 12-language catalog applies.

On lightning_v3.1_pro:

  • Pass en → UK + American accented English.
  • Pass hi → Indian accented English + Hindi (code-switching).
  • Omit language → defaults to en + hi (mixed Indian + Western English coverage).
`output_format` (enum, optional; defaults to `pcm`)

Format of the returned audio. `pcm` is the lowest-latency option, but it is raw, headerless audio, so the player must be told the sample rate and sample format; `mp3` and `wav` are directly playable in browsers and most media players. The server default is `pcm` when the field is omitted; the API playground uses `mp3` so that the generated audio is directly playable.
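To make collected `pcm` output playable in ordinary media players, one option is to wrap it in a WAV container after the fact. The sketch below uses Python's standard `wave` module. The helper name `pcm_to_wav` is hypothetical, and the defaults of 16-bit mono samples are assumptions, since this page does not state the sample width or channel count; adjust them to match what the API actually returns.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 44100,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # assumed mono
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 = 16-bit (assumed)
        wav_file.setframerate(sample_rate)   # must match the requested sample_rate
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

Pass the same `sample_rate` that was sent in the request; a mismatch produces audio that plays at the wrong pitch and speed.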

`pronunciation_dicts` (list of strings, optional)

The IDs of the pronunciation dictionaries to use for speech generation. Available on both lightning_v3.1 and lightning_v3.1_pro.

`word_timestamps` (boolean, optional; defaults to `false`)

WebSocket-only feature. Accepted on this endpoint but ignored — no per-word timing information is returned in the sync HTTP or SSE response shape. To receive status: "word_timestamp" frames with per-word { id, word, start, end } data, use the WebSocket endpoint wss://api.smallest.ai/waves/v1/tts/live. See Word-level timestamps.

`session_id` (string, optional; format `^[a-zA-Z0-9_\-.]+$`; at most 128 characters)

Optional client-provided session identifier for correlation. Only alphanumeric characters, hyphens, underscores, and dots are allowed. Max 128 characters. Echoed back in response headers as X-External-Session-Id.

`request_id` (string, optional; format `^[a-zA-Z0-9_\-.]+$`; at most 128 characters)

Optional client-provided request identifier for correlation. Only alphanumeric characters, hyphens, underscores, and dots are allowed. Max 128 characters. Echoed back in response headers as X-External-Request-Id.
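The documented constraints on `speed`, `session_id`, and `request_id` can be checked on the client before a request is sent. The following is a sketch of that idea, not part of any official SDK: `build_payload` and `check_id` are hypothetical helper names, and only the range, pattern, and length limit come from the parameter descriptions above.

```python
import re
from typing import Any, Dict, Optional

# Pattern and length limit as documented for session_id and request_id.
ID_PATTERN = re.compile(r"^[a-zA-Z0-9_\-.]+$")
MAX_ID_LENGTH = 128


def check_id(name: str, value: str) -> str:
    """Return value unchanged if it satisfies the documented ID format."""
    if len(value) > MAX_ID_LENGTH or not ID_PATTERN.fullmatch(value):
        raise ValueError(
            f"{name} must be 1-{MAX_ID_LENGTH} characters of letters, "
            "digits, hyphens, underscores, or dots"
        )
    return value


def build_payload(text: str, voice_id: str, speed: float = 1.0,
                  session_id: Optional[str] = None,
                  request_id: Optional[str] = None) -> Dict[str, Any]:
    """Assemble a request body, rejecting values outside the documented limits."""
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be between 0.5 and 2")
    payload: Dict[str, Any] = {"text": text, "voice_id": voice_id, "speed": speed}
    # Optional correlation IDs are only included when the caller supplies them.
    if session_id is not None:
        payload["session_id"] = check_id("session_id", session_id)
    if request_id is not None:
        payload["request_id"] = check_id("request_id", request_id)
    return payload
```

For example, `build_payload("Hello", "magnus", session_id="call-42.a_b")` returns a dictionary ready for `json.dumps`, while a `session_id` containing a space raises `ValueError` locally.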

Response headers

`X-Session-Id` (string)

Internal session identifier (system-generated UUID).

`X-Request-Id` (string)

Internal request identifier (system-generated UUID).

`X-External-Session-Id` (string)

Echoed client-provided session_id (empty if not provided).

`X-External-Request-Id` (string)

Echoed client-provided request_id (empty if not provided).

Response

Synthesized speech retrieved successfully.

Errors

- `400` Bad Request Error
- `401` Unauthorized Error
- `500` Internal Server Error