Streaming | Smallest AI Docs

Streaming TTS delivers audio chunks as they’re generated — playback starts immediately instead of waiting for the full file. First chunk arrives in ~100ms.

WebSocket Streaming

Persistent connections for continuous, low-latency audio. Best for conversational AI and real-time apps.

Endpoint: wss://api.smallest.ai/waves/v1/tts/live

Pass word_timestamps: true on the WebSocket request to receive per-word timing events (status: "word_timestamp") interleaved with the audio chunks. These events are useful for live captions, karaoke-style highlighting, and avatar lip-sync. Words come back verbatim from the input text: "$100" stays "$100" and is not normalized. Timestamps are supported on English and Hindi base-queue voices. See Word-level timestamps on the Lightning v3.1 model card for the wire shape, voice support matrix, and a worked example.

# ci:skip (requires sounddevice + an audio device, not available in CI)
# Plays audio chunks as they arrive AND saves the full stream to streamed.wav.
# Requires: pip install websockets sounddevice numpy
import asyncio
import base64
import json
import os
import wave

import numpy as np
import sounddevice as sd
import websockets

API_KEY = os.environ["SMALLEST_API_KEY"]
WS_URL = "wss://api.smallest.ai/waves/v1/tts/live"
SAMPLE_RATE = 24000

async def stream_tts(text):
    audio_chunks = []
    # Open a non-blocking output stream so each chunk plays as it arrives.
    stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    try:
        stream.start()
        async with websockets.connect(
            WS_URL,
            additional_headers={"Authorization": f"Bearer {API_KEY}"},
        ) as ws:
            await ws.send(json.dumps({
                "text": text,
                "voice_id": "meher",
                "model": "lightning_v3.1_pro",
                "sample_rate": SAMPLE_RATE,
            }))

            while True:
                data = json.loads(await ws.recv())
                status = data.get("status")
                if status == "chunk":
                    pcm = base64.b64decode(data["data"]["audio"])
                    audio_chunks.append(pcm)
                    # Play immediately (16-bit mono PCM).
                    stream.write(np.frombuffer(pcm, dtype=np.int16))
                elif status == "complete":
                    break
    finally:
        # Always release the audio device, even if the WS errored mid-stream.
        stream.stop()
        stream.close()

    with wave.open("streamed.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(b"".join(audio_chunks))
    print(f"Saved streamed.wav ({len(audio_chunks)} chunks)")

asyncio.run(stream_tts("Streaming delivers audio in real-time for voice assistants and chatbots."))
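
When word_timestamps is enabled, the receive loop sees a third status next to "chunk" and "complete". The sketch below shows one way to route the three message kinds. The names `StreamState` and `route_message` are hypothetical, and because the inner shape of a word-timestamp event is documented on the model card rather than on this page, the sketch stores the whole message instead of guessing field names.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Accumulates what a single TTS stream has produced so far."""
    audio: list = field(default_factory=list)  # decoded PCM chunks
    words: list = field(default_factory=list)  # raw word_timestamp messages
    done: bool = False


def route_message(raw: str, state: StreamState) -> None:
    """Sort one WebSocket message into audio, word timing, or completion."""
    msg = json.loads(raw)
    status = msg.get("status")
    if status == "chunk":
        state.audio.append(base64.b64decode(msg["data"]["audio"]))
    elif status == "word_timestamp":
        # Field names inside the event are not listed on this page,
        # so keep the message intact for the caller to interpret.
        state.words.append(msg)
    elif status == "complete":
        state.done = True
```

In a loop like the one above, this would be called once per `ws.recv()` result until `state.done` is set, with `"word_timestamps": True` added to the request payload.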

SSE Streaming

Server-Sent Events over HTTP — simpler to set up, no persistent connection needed.

Endpoint: POST https://api.smallest.ai/waves/v1/tts/live

# ci:skip (requires sounddevice + an audio device, not available in CI)
# Plays audio chunks as they arrive AND saves the full stream to sse_output.wav.
# Requires: pip install requests sounddevice numpy
import base64
import json
import os
import wave

import numpy as np
import requests
import sounddevice as sd

API_KEY = os.environ["SMALLEST_API_KEY"]
SAMPLE_RATE = 24000

response = requests.post(
    "https://api.smallest.ai/waves/v1/tts/live",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    },
    json={
        "text": "SSE streaming is simpler to set up than WebSocket.",
        "voice_id": "meher",
        "model": "lightning_v3.1_pro",
        "sample_rate": SAMPLE_RATE,
    },
    stream=True,
)
response.raise_for_status()

audio_chunks = []
stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
try:
    stream.start()
    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode()
        if not decoded.startswith("data: "):
            continue
        data = json.loads(decoded[6:])
        if data.get("done"):
            break
        if data.get("audio"):
            pcm = base64.b64decode(data["audio"])
            audio_chunks.append(pcm)
            # Play immediately as each chunk arrives.
            stream.write(np.frombuffer(pcm, dtype=np.int16))
finally:
    # Always release the audio device, even if the stream errored mid-flight.
    stream.stop()
    stream.close()

with wave.open("sse_output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(audio_chunks))
print(f"Saved sse_output.wav ({len(audio_chunks)} chunks)")

Streaming Text Input (SDK)

For real-time applications where text arrives incrementally (e.g., from an LLM), the SDK supports streaming text input:

# ci:skip (requires sounddevice + an audio device, not available in CI)
# Requires `smallestai>=4.4.0`, plus: pip install sounddevice numpy
import os

import numpy as np
import sounddevice as sd
from smallestai.waves import TTSConfig, WavesStreamingTTS

SAMPLE_RATE = 24000
config = TTSConfig(voice_id="magnus", api_key=os.environ["SMALLEST_API_KEY"], sample_rate=SAMPLE_RATE)
streaming_tts = WavesStreamingTTS(config)

def text_stream():
    """Simulates text arriving word by word (e.g., from an LLM)."""
    text = "Streaming synthesis with chunked text input."
    for word in text.split():
        yield word + " "

stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
try:
    stream.start()
    for chunk in streaming_tts.synthesize_streaming(text_stream()):
        # Each chunk is raw 16-bit PCM. Play it immediately.
        stream.write(np.frombuffer(chunk, dtype=np.int16))
finally:
    # Always release the audio device, even if synthesis errored.
    stream.stop()
    stream.close()
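
LLM output often arrives as sub-word tokens rather than whole words. This page does not say whether the SDK buffers text internally, so the following is only a sketch of one optional pre-processing step: a hypothetical `sentence_chunks` generator that groups incoming fragments into sentence-sized pieces before they are handed to the synthesizer. The punctuation rule is deliberately naive and is an assumption, not SDK behavior.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group incoming text fragments into sentence-sized pieces."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit once the buffered text ends with sentence punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the upstream generator ends.
        yield buffer
```

A generator like this could wrap the token source, for example `sentence_chunks(text_stream())`, before it is passed on. The rule will split on abbreviations and decimals, so treat it as a starting point only.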

WebSocket vs SSE

Feature	WebSocket	SSE
Connection	Persistent, bidirectional	New HTTP request each time
Multiple messages	Reuse same connection	New request per message
Best for	Voice assistants, chatbots	Simple one-off streaming
Latency	Lowest (no reconnect overhead)	Slightly higher
Concurrency	Up to 5 connections per unit	Per-request

Use WebSocket when sending multiple TTS requests over time (conversations, voice bots). Use SSE for simple one-shot streaming where you don’t need a persistent connection.
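
To illustrate the "reuse same connection" row, here is a minimal sketch that sends several texts over one already-open WebSocket and collects the PCM for each. The function name `synthesize_many` is hypothetical, and the sketch assumes requests are handled one at a time, with each stream ending in a "complete" message before the next text is sent.

```python
import asyncio
import base64
import json


async def synthesize_many(ws, texts, voice_id="meher", sample_rate=24000):
    """Send several texts over one open connection; return PCM bytes per text."""
    results = []
    for text in texts:
        await ws.send(json.dumps({
            "text": text,
            "voice_id": voice_id,
            "sample_rate": sample_rate,
        }))
        pcm = bytearray()
        while True:
            msg = json.loads(await ws.recv())
            status = msg.get("status")
            if status == "chunk":
                pcm += base64.b64decode(msg["data"]["audio"])
            elif status == "complete":
                break  # this text is finished; reuse the socket for the next
        results.append(bytes(pcm))
    return results
```

Here `ws` would be the object yielded by `websockets.connect(...)` in the WebSocket example, so the handshake cost is paid once for the whole conversation.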

Response Format

The two transports emit different JSON shapes — match your parser to the transport you’re using.

WebSocket — each message is a nested envelope:

// audio chunk
{ "status": "chunk", "data": { "audio": "base64_encoded_pcm_data" } }

// stream complete
{ "status": "complete", "message": "All chunks sent", "done": true }

Access audio at data["data"]["audio"]; terminator is data["status"] == "complete".

SSE — each data: line is a flat object:

// audio chunk
{ "audio": "base64_encoded_pcm_data" }

// stream complete
{ "done": true }

Access audio at data["audio"]; terminator is data["done"] == true. SSE frames are prefixed with event: audio\n followed by data: {...}\n\n.
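
The SSE framing described above can be parsed without discarding the event: line. The sketch below is one possible approach, not the SDK's parser: a hypothetical `iter_sse_events` helper that takes already-decoded text lines and yields one (event name, payload) pair per frame. Whether the final done frame carries an event name is not stated here, so the sketch allows it to be absent.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], dict]]:
    """Yield (event_name, payload) for each blank-line-terminated SSE frame."""
    event, data_parts = None, []
    for line in lines:
        if line == "":
            # A blank line closes the current frame.
            if data_parts:
                yield event, json.loads("\n".join(data_parts))
            event, data_parts = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].lstrip(" "))
    if data_parts:
        # Tolerate a final frame that was not followed by a blank line.
        yield event, json.loads("\n".join(data_parts))
```

With requests, the lines could come from `response.iter_lines(decode_unicode=True)`; the caller then decodes `payload["audio"]` and stops when `payload.get("done")` is true.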

Configuration Parameters

Parameter	Default	Description
`voice_id`	required	Voice identifier
`model`	`lightning_v3.1`	TTS pool to use. Pass `lightning_v3.1_pro` to route to the Pro pool.
`sample_rate`	`44100`	Audio sample rate (8000–44100 Hz)
`speed`	`1.0`	Speech speed (0.5–2.0)
`language`	`en`	Language code matching the voice. Each voice’s `tags.language` constrains what works — see `GET /waves/v1/lightning-v3.1/get_voices`.
`output_format`	`pcm`	`pcm`, `mp3`, `wav`, `ulaw`, or `alaw`
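
The defaults and ranges in the table can be checked on the client before a request is sent. The following is a minimal sketch with a hypothetical `build_tts_request` helper. It assumes any integer inside the documented sample-rate range is acceptable, which the table does not confirm, and the server remains the authority on what is valid.

```python
ALLOWED_FORMATS = {"pcm", "mp3", "wav", "ulaw", "alaw"}


def build_tts_request(text, voice_id, model="lightning_v3.1", sample_rate=44100,
                      speed=1.0, language="en", output_format="pcm"):
    """Validate parameters against the documented ranges and build the JSON body."""
    if not voice_id:
        raise ValueError("voice_id is required")
    if not 8000 <= sample_rate <= 44100:
        raise ValueError("sample_rate must be between 8000 and 44100 Hz")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "text": text,
        "voice_id": voice_id,
        "model": model,
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }
```

The returned dictionary can be passed to `json.dumps` for the WebSocket transport or as the `json=` argument of `requests.post` for SSE.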

For concurrency limits and connection management, see Concurrency and Limits.

Full runnable source: streaming-python.py