Streaming

Streaming TTS delivers audio chunks as they are generated, so playback can start as soon as the first chunk arrives instead of waiting for the full file. The first chunk arrives in ~100 ms.

WebSocket Streaming

Persistent connections for continuous, low-latency audio. Best for conversational AI and real-time apps.

Endpoint: wss://api.smallest.ai/waves/v1/tts/live

Pass word_timestamps: true on the WebSocket request to receive per-word timing events (status: "word_timestamp") interleaved with the audio chunks — useful for live captions, karaoke-style highlighting, and avatar lip-sync. Words come back verbatim from the input text ("$100" stays "$100", not normalized). Supported on English + Hindi base-queue voices. See Word-level timestamps on the Lightning v3.1 model card for the wire shape, voice support matrix, and a worked example.

# ci:skip - requires sounddevice + an audio device, not available in CI
# Plays audio chunks as they arrive AND saves the full stream to streamed.wav.
# Requires: pip install websockets sounddevice numpy
import asyncio
import base64
import json
import os
import wave

import numpy as np
import sounddevice as sd
import websockets

API_KEY = os.environ["SMALLEST_API_KEY"]
WS_URL = "wss://api.smallest.ai/waves/v1/tts/live"
SAMPLE_RATE = 24000

async def stream_tts(text):
    audio_chunks = []
    # Open a non-blocking output stream so each chunk plays as it arrives.
    stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    try:
        stream.start()
        async with websockets.connect(
            WS_URL,
            additional_headers={"Authorization": f"Bearer {API_KEY}"},
        ) as ws:
            await ws.send(json.dumps({
                "text": text,
                "voice_id": "meher",
                "model": "lightning_v3.1_pro",
                "sample_rate": SAMPLE_RATE,
            }))

            while True:
                data = json.loads(await ws.recv())
                status = data.get("status")
                if status == "chunk":
                    pcm = base64.b64decode(data["data"]["audio"])
                    audio_chunks.append(pcm)
                    # Play immediately (16-bit mono PCM).
                    stream.write(np.frombuffer(pcm, dtype=np.int16))
                elif status == "complete":
                    break
    finally:
        # Always release the audio device, even if the WS errored mid-stream.
        stream.stop()
        stream.close()

    with wave.open("streamed.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(b"".join(audio_chunks))
    print(f"Saved streamed.wav ({len(audio_chunks)} chunks)")

asyncio.run(stream_tts("Streaming delivers audio in real-time for voice assistants and chatbots."))
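
When word_timestamps: true is set, the receive loop also has to handle the interleaved word_timestamp messages. The following is a minimal sketch of one way to route incoming frames by their status field. The helper name route_message is hypothetical, and because the body of a word_timestamp message is not shown on this page, the sketch passes it through unchanged rather than guessing at its fields.

```python
import base64
import json


def route_message(raw):
    """Classify one WebSocket text frame by its "status" field.

    Returns (kind, payload) where kind is "audio", "word", "done", or "other".
    """
    msg = json.loads(raw)
    status = msg.get("status")
    if status == "chunk":
        # Audio chunks carry base64-encoded PCM under data.audio.
        return "audio", base64.b64decode(msg["data"]["audio"])
    if status == "word_timestamp":
        # Payload layout is an assumption-free pass-through: see the model card
        # for the actual wire shape before reading fields from it.
        return "word", msg
    if status == "complete":
        return "done", None
    return "other", msg


# Offline demonstration with hand-built frames (no network needed).
frames = [
    json.dumps({"status": "chunk",
                "data": {"audio": base64.b64encode(b"\x00\x01").decode()}}),
    json.dumps({"status": "word_timestamp"}),
    json.dumps({"status": "complete", "done": True}),
]
kinds = [route_message(frame)[0] for frame in frames]
print(kinds)
```

In the streaming loop above, the "audio" branch would feed the output stream while the "word" branch would be handed to the caption or lip-sync code.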

SSE Streaming

Server-Sent Events over HTTP — simpler to set up, no persistent connection needed.

Endpoint: POST https://api.smallest.ai/waves/v1/tts/live

# ci:skip - requires sounddevice + an audio device, not available in CI
# Plays audio chunks as they arrive AND saves the full stream to sse_output.wav.
# Requires: pip install requests sounddevice numpy
import base64
import json
import os
import wave

import numpy as np
import requests
import sounddevice as sd

API_KEY = os.environ["SMALLEST_API_KEY"]
SAMPLE_RATE = 24000

response = requests.post(
    "https://api.smallest.ai/waves/v1/tts/live",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    },
    json={
        "text": "SSE streaming is simpler to set up than WebSocket.",
        "voice_id": "meher",
        "model": "lightning_v3.1_pro",
        "sample_rate": SAMPLE_RATE,
    },
    stream=True,
)
response.raise_for_status()

audio_chunks = []
stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
try:
    stream.start()
    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode()
        if not decoded.startswith("data: "):
            continue
        data = json.loads(decoded[6:])
        if data.get("done"):
            break
        if data.get("audio"):
            pcm = base64.b64decode(data["audio"])
            audio_chunks.append(pcm)
            # Play immediately as each chunk arrives.
            stream.write(np.frombuffer(pcm, dtype=np.int16))
finally:
    # Always release the audio device, even if the stream errored mid-flight.
    stream.stop()
    stream.close()

with wave.open("sse_output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(audio_chunks))
print(f"Saved sse_output.wav ({len(audio_chunks)} chunks)")

Streaming Text Input (SDK)

For real-time applications where text arrives incrementally (e.g., from an LLM), the SDK supports streaming text input:

# ci:skip - requires sounddevice + an audio device, not available in CI
# Requires `smallestai>=4.4.0`, plus: pip install sounddevice numpy
import os

import numpy as np
import sounddevice as sd
from smallestai.waves import TTSConfig, WavesStreamingTTS

SAMPLE_RATE = 24000
config = TTSConfig(voice_id="magnus", api_key=os.environ["SMALLEST_API_KEY"], sample_rate=SAMPLE_RATE)
streaming_tts = WavesStreamingTTS(config)

def text_stream():
    """Simulates text arriving word by word (e.g., from an LLM)."""
    text = "Streaming synthesis with chunked text input."
    for word in text.split():
        yield word + " "

stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
try:
    stream.start()
    for chunk in streaming_tts.synthesize_streaming(text_stream()):
        # Each chunk is raw 16-bit PCM. Play it immediately.
        stream.write(np.frombuffer(chunk, dtype=np.int16))
finally:
    # Always release the audio device, even if synthesis errored.
    stream.stop()
    stream.close()

WebSocket vs SSE

| | WebSocket | SSE |
|---|---|---|
| Connection | Persistent, bidirectional | New HTTP request each time |
| Multiple messages | Reuse same connection | New request per message |
| Best for | Voice assistants, chatbots | Simple one-off streaming |
| Latency | Lowest (no reconnect overhead) | Slightly higher |
| Concurrency | Up to 5 connections per unit | Per-request |

Use WebSocket when sending multiple TTS requests over time (conversations, voice bots). Use SSE for simple one-shot streaming where you don’t need a persistent connection.
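
The connection reuse described above can be sketched as a loop that sends several requests over one open socket and reads each reply up to its "complete" message. This is a sketch under assumptions: synthesize_many and FakeSocket are hypothetical names, the fake socket simply echoes the request text back as "audio" so the example runs offline, and it assumes the server answers requests on a shared connection one at a time in the order they were sent.

```python
import asyncio
import base64
import json


async def synthesize_many(ws, texts, voice_id="meher", sample_rate=24000):
    """Send each text over the same open connection and collect its PCM bytes."""
    results = []
    for text in texts:
        await ws.send(json.dumps({
            "text": text,
            "voice_id": voice_id,
            "sample_rate": sample_rate,
        }))
        pcm = bytearray()
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("status") == "chunk":
                pcm += base64.b64decode(msg["data"]["audio"])
            elif msg.get("status") == "complete":
                break  # this request is finished; the socket stays open
        results.append(bytes(pcm))
    return results


class FakeSocket:
    """Offline stand-in for a WebSocket connection (illustration only).

    It echoes the request text back as one fake "audio" chunk, then "complete".
    """

    def __init__(self):
        self._queue = []

    async def send(self, raw):
        text = json.loads(raw)["text"]
        audio = base64.b64encode(text.encode()).decode()
        self._queue.append(json.dumps({"status": "chunk", "data": {"audio": audio}}))
        self._queue.append(json.dumps({"status": "complete", "done": True}))

    async def recv(self):
        return self._queue.pop(0)


clips = asyncio.run(synthesize_many(FakeSocket(), ["Hello.", "How can I help?"]))
print(len(clips), "clips received over one connection")
```

With a real connection, the ws argument would be the object yielded by websockets.connect(...) in the WebSocket example, kept open for the length of the conversation.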

Response Format

The two transports emit different JSON shapes — match your parser to the transport you’re using.

WebSocket — each message is a nested envelope:

// audio chunk
{ "status": "chunk", "data": { "audio": "base64_encoded_pcm_data" } }

// stream complete
{ "status": "complete", "message": "All chunks sent", "done": true }

Access audio at data["data"]["audio"]; terminator is data["status"] == "complete".

SSE — each data: line is a flat object:

// audio chunk
{ "audio": "base64_encoded_pcm_data" }

// stream complete
{ "done": true }

Access audio at data["audio"]; terminator is data["done"] == true. SSE frames are prefixed with event: audio\n followed by data: {...}\n\n.
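
As an illustration of the SSE framing just described, here is a minimal sketch of a parser that walks SSE lines, skips the event: lines and blank separators, and yields decoded PCM until it sees the done marker. The function name iter_sse_audio is hypothetical, and the sample frames are hand-built for the demonstration rather than captured from the API.

```python
import base64
import json


def iter_sse_audio(lines):
    """Yield decoded PCM bytes from an iterable of SSE text lines.

    Stops at the first payload whose "done" field is true.
    """
    for line in lines:
        if not line.startswith("data:"):
            continue  # skips "event: audio" lines and blank frame separators
        payload = json.loads(line[len("data:"):].strip())
        if payload.get("done"):
            return
        if payload.get("audio"):
            yield base64.b64decode(payload["audio"])


def make_frame(pcm):
    """Build one hand-made SSE audio frame for the offline demonstration."""
    body = json.dumps({"audio": base64.b64encode(pcm).decode()})
    return f"event: audio\ndata: {body}\n\n"


raw = make_frame(b"\x00\x01\x02") + make_frame(b"\x03\x04") + 'data: {"done": true}\n\n'
chunks = list(iter_sse_audio(raw.split("\n")))
print(f"{len(chunks)} chunks, {sum(len(c) for c in chunks)} bytes")
```

The same generator can be fed the decoded lines from response.iter_lines() in the SSE example.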

Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| voice_id | required | Voice identifier |
| model | lightning_v3.1 | TTS pool to use. Pass lightning_v3.1_pro to route to the Pro pool. |
| sample_rate | 44100 | Audio sample rate (8000–44100 Hz) |
| speed | 1.0 | Speech speed (0.5–2.0) |
| language | en | Language code matching the voice. Each voice’s tags.language constrains what works; see GET /waves/v1/lightning-v3.1/get_voices. |
| output_format | pcm | pcm, mp3, wav, ulaw, or alaw |
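
To catch out-of-range values before they reach the API, a client can validate a request against the table above. The sketch below is one possible approach, not part of the SDK: build_tts_request is a hypothetical helper, it only checks the bounds listed in the table (the API may impose further constraints, such as accepting only specific sample rates), and the text field is taken from the request examples earlier on this page.

```python
ALLOWED_FORMATS = {"pcm", "mp3", "wav", "ulaw", "alaw"}


def build_tts_request(text, voice_id, model="lightning_v3.1", sample_rate=44100,
                      speed=1.0, language="en", output_format="pcm"):
    """Return a request body dict, rejecting values outside the documented ranges."""
    if not voice_id:
        raise ValueError("voice_id is required")
    if not 8000 <= sample_rate <= 44100:
        raise ValueError(f"sample_rate {sample_rate} is outside 8000-44100 Hz")
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed {speed} is outside 0.5-2.0")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format}")
    return {
        "text": text,
        "voice_id": voice_id,
        "model": model,
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }


payload = build_tts_request("Hello there.", voice_id="meher", sample_rate=24000)
print(payload)
```

The returned dict can be passed to json.dumps for the WebSocket request or as the json= argument in the SSE example.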

For concurrency limits and connection management, see Concurrency and Limits.

Full runnable source: streaming-python.py