How to Use Streaming Text to Speech (TTS) with WebSockets

Real-time Text to Speech Synthesis

The WavesStreamingTTS class converts text to speech over a streaming connection and exposes configurable synthesis and buffering parameters. It targets low-latency uses such as voice assistants, live narration, and other interactive applications where audio should start playing as soon as it is available.

Configuration Setup

The streaming TTS system uses a TTSConfig object to manage synthesis parameters:

from smallestai.waves import TTSConfig, WavesStreamingTTS

config = TTSConfig(
    voice_id="aditi",
    api_key="YOUR_SMALLEST_API_KEY",
    sample_rate=24000,
    speed=1.0,
    max_buffer_flush_ms=100
)

streaming_tts = WavesStreamingTTS(config)
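
The configuration above hard-codes the API key, which is convenient for a quick test. A common alternative is to read the key from the environment. The following is a minimal sketch under stated assumptions: the helper name load_tts_settings and the variable name SMALLEST_API_KEY are hypothetical and not defined by the SDK. It only builds keyword arguments matching the fields shown above.

```python
import os


def load_tts_settings(env=None):
    """Build keyword arguments for TTSConfig from the environment.

    Hypothetical helper: SMALLEST_API_KEY is an assumed variable name,
    not one defined by the SDK.
    """
    env = os.environ if env is None else env
    api_key = env.get("SMALLEST_API_KEY")
    if not api_key:
        raise RuntimeError("Set SMALLEST_API_KEY before creating the TTS client")
    # The remaining values mirror the configuration shown above.
    return {
        "voice_id": "aditi",
        "api_key": api_key,
        "sample_rate": 24000,
        "speed": 1.0,
        "max_buffer_flush_ms": 100,
    }


# Usage with the SDK (not executed here):
# config = TTSConfig(**load_tts_settings())
# streaming_tts = WavesStreamingTTS(config)
```

Passing a plain dictionary as env makes the helper easy to exercise without touching the real environment.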

Basic Text Synthesis

For straightforward text-to-speech conversion, use the synthesize method:

text = "Hello world, this is a test of the Smallest AI streaming TTS SDK."
audio_chunks = []

for chunk in streaming_tts.synthesize(text):
    audio_chunks.append(chunk)

Streaming Text Input

For real-time applications where text arrives incrementally, use synthesize_streaming:

def text_stream():
    text = "Streaming synthesis with chunked text input for Smallest SDK."
    for word in text.split():
        yield word + " "

audio_chunks = []
for chunk in streaming_tts.synthesize_streaming(text_stream()):
    audio_chunks.append(chunk)
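
The example above feeds one word at a time. If the upstream source (for instance, a language model emitting tokens) produces fragments that are too fine-grained for your use case, one option is to group them into sentence-sized pieces before handing them to synthesize_streaming. The sketch below is illustrative only: sentence_chunks is a hypothetical helper, not part of the SDK, and whether sentence-level input changes latency or prosody with this service is not established here.

```python
def sentence_chunks(tokens, terminators=".!?"):
    """Group incoming text fragments into sentence-sized strings.

    Hypothetical helper, not part of the SDK.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush once the buffered text ends with sentence punctuation.
        if buffer.rstrip().endswith(tuple(terminators)):
            yield buffer.strip() + " "
            buffer = ""
    # Emit whatever is left if the input ends mid-sentence.
    if buffer.strip():
        yield buffer.strip() + " "


# Usage with the SDK (not executed here):
# for chunk in streaming_tts.synthesize_streaming(sentence_chunks(text_stream())):
#     audio_chunks.append(chunk)
```

Because the helper is itself a generator, it stays lazy: each sentence is yielded as soon as its terminator arrives, rather than after the whole input is read.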

Saving Audio to WAV File

Convert the raw PCM audio chunks to a standard WAV file:

import wave

def save_audio_chunks_to_wav(audio_chunks, filename="output.wav"):
    # Wrap the raw PCM bytes in a WAV container.
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)        # Mono
        wf.setsampwidth(2)        # 16-bit samples (2 bytes each)
        wf.setframerate(24000)    # 24 kHz; must match sample_rate in TTSConfig
        wf.writeframes(b''.join(audio_chunks))

text = "Your text to synthesize here."
audio_chunks = list(streaming_tts.synthesize(text))
save_audio_chunks_to_wav(audio_chunks, "speech_output.wav")
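
When the audio should go to an HTTP response or a cache instead of a file on disk, the same WAV wrapping can be done in memory. The following sketch uses only the standard library; pcm_chunks_to_wav_bytes is a hypothetical helper name, and it assumes the same 16-bit mono format as the file-based example above.

```python
import wave
from io import BytesIO


def pcm_chunks_to_wav_bytes(audio_chunks, sample_rate=24000):
    """Wrap raw 16-bit mono PCM chunks in a WAV container, in memory.

    Hypothetical helper, not part of the SDK.
    """
    buffer = BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(1)             # Mono
        wf.setsampwidth(2)             # 16-bit samples
        wf.setframerate(sample_rate)   # Should match TTSConfig.sample_rate
        wf.writeframes(b"".join(audio_chunks))
    # wave does not close file objects it did not open, so the buffer
    # is still readable here.
    return buffer.getvalue()
```

The returned bytes start with a RIFF header and can be written to disk or sent over the network as a complete WAV file.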

Configuration Parameters

voice_id: Voice identifier (e.g., “aditi”, “male-1”, “female-2”)
api_key: Your Smallest AI API key
language: Language code for synthesis (default: “en”)
sample_rate: Audio sample rate in Hz (default: 24000)
speed: Speech speed multiplier (default: 1.0 for normal speed; 0.5 = half speed, 2.0 = double speed)
consistency: Voice consistency parameter (default: 0.5, range: 0.0-1.0)
enhancement: Audio enhancement level (default: 1)
similarity: Voice similarity parameter (default: 0, range: 0.0-1.0)
max_buffer_flush_ms: Maximum buffer time in milliseconds before forcing audio output (default: 0)
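
The documented ranges can be checked locally before a configuration is created. The sketch below is a hypothetical helper, not part of the SDK: the 0.0-1.0 bounds for consistency and similarity come from the list above, while the positivity checks on speed, sample_rate, and max_buffer_flush_ms are assumptions about what sensible values look like.

```python
def validate_tts_params(speed=1.0, consistency=0.5, similarity=0.0,
                        sample_rate=24000, max_buffer_flush_ms=0):
    """Return a list of problems; an empty list means nothing was flagged.

    Hypothetical helper, not part of the SDK.
    """
    problems = []
    # Ranges taken from the parameter list above.
    if not 0.0 <= consistency <= 1.0:
        problems.append("consistency must be between 0.0 and 1.0")
    if not 0.0 <= similarity <= 1.0:
        problems.append("similarity must be between 0.0 and 1.0")
    # The checks below are assumptions, not documented limits.
    if speed <= 0:
        problems.append("speed must be positive")
    if sample_rate <= 0:
        problems.append("sample_rate must be positive")
    if max_buffer_flush_ms < 0:
        problems.append("max_buffer_flush_ms must not be negative")
    return problems
```

Returning a list instead of raising on the first failure lets a caller report every invalid value at once.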

Output Format

The streaming TTS returns raw PCM audio data as bytes objects. Each chunk represents a portion of the synthesized audio that can be:

Played directly through audio hardware
Saved to audio files (WAV directly; MP3 and other compressed formats after encoding)
Streamed over network protocols
Processed with additional audio effects

Because the chunks are unencoded PCM, they can be played or forwarded as soon as they arrive, with no decoding step. The WAV example above treats them as 16-bit mono samples at the configured sample rate.
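
To illustrate working with the raw bytes, here is a minimal sketch that computes the duration of a set of chunks and applies a simple gain. Both helpers are hypothetical and not part of the SDK. They assume 16-bit signed little-endian mono samples, which matches the WAV settings used earlier but is not otherwise confirmed here.

```python
import sys
from array import array


def pcm_duration_seconds(audio_chunks, sample_rate=24000, sample_width=2):
    """Duration of mono PCM audio: total bytes / (bytes per second)."""
    total_bytes = sum(len(chunk) for chunk in audio_chunks)
    return total_bytes / (sample_rate * sample_width)


def apply_gain(pcm_bytes, gain):
    """Scale 16-bit little-endian samples by gain, clipping to the int16 range.

    The input length must be a multiple of 2 bytes, so join chunks
    first if a chunk boundary could split a sample.
    """
    samples = array("h")
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()   # Convert little-endian data to native order
    scaled = array(
        "h",
        (max(-32768, min(32767, int(s * gain))) for s in samples),
    )
    if sys.byteorder == "big":
        scaled.byteswap()    # Convert back to little-endian
    return scaled.tobytes()
```

At 24000 Hz with 2 bytes per sample, one second of audio is 48000 bytes, so a 100 ms slice is 4800 bytes.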