Streaming
Streaming TTS delivers audio chunks as they are generated, so playback can begin as soon as the first chunk arrives (about 100 ms) instead of waiting for the full file.
WebSocket Streaming
Persistent connections for continuous, low-latency audio. Best for conversational AI and real-time apps.
Endpoint: wss://api.smallest.ai/waves/v1/tts/live
Pass word_timestamps: true on the WebSocket request to receive per-word timing events (status: "word_timestamp") interleaved with the audio chunks. These events are useful for live captions, karaoke-style highlighting, and avatar lip-sync. Words come back verbatim from the input text ("$100" stays "$100" and is not normalized). Word timestamps are supported on English and Hindi base-queue voices. See Word-level timestamps on the Lightning v3.1 model card for the wire shape, voice support matrix, and a worked example.
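Because timing events arrive interleaved with audio, a client usually separates the two before processing. The following is a minimal sketch, assuming each WebSocket message is a JSON string with a top-level status field. The helper name split_word_events is hypothetical, and the payload of a timing event is deliberately left uninterpreted because its wire shape is documented on the model card, not here.

```python
import json
from typing import Dict, Iterable, List, Tuple


def split_word_events(messages: Iterable[str]) -> Tuple[List[Dict], List[Dict]]:
    """Separate word-timestamp events from all other messages.

    Hypothetical helper: only the "status" field is inspected; the rest of
    each message is passed through untouched.
    """
    word_events: List[Dict] = []
    other: List[Dict] = []
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("status") == "word_timestamp":
            word_events.append(msg)
        else:
            other.append(msg)
    return word_events, other
```

Keeping the timing events as raw dictionaries lets the caption or lip-sync layer read whatever fields the model card specifies without this routing step needing to know them.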
SSE Streaming
Server-Sent Events over HTTP — simpler to set up, no persistent connection needed.
Endpoint: POST https://api.smallest.ai/waves/v1/tts/live
Streaming Text Input (SDK)
For real-time applications where text arrives incrementally (e.g., from an LLM), the SDK supports streaming text input.
WebSocket vs SSE
Use WebSocket when sending multiple TTS requests over time (conversations, voice bots). Use SSE for simple one-shot streaming where you don’t need a persistent connection.
Response Format
The two transports emit different JSON shapes — match your parser to the transport you’re using.
WebSocket: each message is a nested envelope. Access audio at data["data"]["audio"]; the terminator is data["status"] == "complete".
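The access path and terminator for WebSocket messages can be sketched as a small collector. This is an illustration under assumptions, not the SDK's implementation: collect_ws_audio is a hypothetical name, the messages are taken to be JSON strings already received from the socket, and the audio field is assumed to be base64-encoded.

```python
import base64
import json
from typing import Iterable


def collect_ws_audio(messages: Iterable[str]) -> bytes:
    """Concatenate audio from WebSocket messages until the terminator."""
    audio = bytearray()
    for raw in messages:
        data = json.loads(raw)
        if data.get("status") == "complete":
            break  # terminator: data["status"] == "complete"
        payload = data.get("data") or {}
        chunk = payload.get("audio")  # nested path: data["data"]["audio"]
        if chunk:
            # Assumption: audio is carried as base64 text inside the envelope.
            audio.extend(base64.b64decode(chunk))
    return bytes(audio)
```

Messages without an audio field (for example timing events) are skipped, so the same loop works whether or not word timestamps are enabled.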
SSE: each data: line is a flat object. Access audio at data["audio"]; the terminator is data["done"] == true. SSE frames are prefixed with event: audio\n followed by data: {...}\n\n.
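The SSE framing described above can be parsed line by line. The sketch below is hypothetical (iter_sse_audio is not an SDK function) and assumes the audio field is base64-encoded. It takes any iterable of decoded text lines, so it can be fed from whichever HTTP client is in use.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_sse_audio(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded audio chunks from SSE lines until done == true."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip "event: audio" lines and blank frame separators
        data = json.loads(line[len("data:"):].strip())
        chunk = data.get("audio")  # flat path: data["audio"]
        if chunk:
            # Assumption: audio is carried as base64 text in the JSON object.
            yield base64.b64decode(chunk)
        if data.get("done") is True:
            return  # terminator: data["done"] == true
```

Audio is yielded before the done check so that a final frame carrying both fields, if the server ever sends one, is not dropped.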
Configuration Parameters
For concurrency limits and connection management, see Concurrency and Limits.
Full runnable source: streaming-python.py

