How to use Streaming Text to Speech (TTS) with WebSockets


Real-time Text to Speech Synthesis

The WavesStreamingTTS class provides high-performance text-to-speech conversion with configurable streaming parameters. This implementation is optimized for low-latency applications where immediate audio feedback is critical, such as voice assistants, live narration, or interactive applications.

Configuration Setup

The streaming TTS system uses a TTSConfig object to manage synthesis parameters:

from smallestai.waves import TTSConfig, WavesStreamingTTS

config = TTSConfig(
    voice_id="aditi",
    api_key="YOUR_SMALLEST_API_KEY",
    sample_rate=24000,
    speed=1.0,
    max_buffer_flush_ms=100
)

streaming_tts = WavesStreamingTTS(config)
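
To keep the API key out of source code, the settings can be assembled from the environment before the config object is built. The following is a minimal sketch, not part of the SDK: the helper name load_tts_settings and the environment variable SMALLEST_API_KEY are hypothetical, and the keyword arguments simply mirror the example above.

```python
import os


def load_tts_settings(env=os.environ):
    """Collect TTSConfig keyword arguments, reading the API key from `env`."""
    # SMALLEST_API_KEY is an assumed variable name, not one defined by the SDK.
    api_key = env.get("SMALLEST_API_KEY")
    if not api_key:
        raise RuntimeError("Set SMALLEST_API_KEY before creating the TTS client.")
    return {
        "voice_id": "aditi",
        "api_key": api_key,
        "sample_rate": 24000,
        "speed": 1.0,
        "max_buffer_flush_ms": 100,
    }


# Usage (requires the smallestai package to be installed):
# from smallestai.waves import TTSConfig, WavesStreamingTTS
# streaming_tts = WavesStreamingTTS(TTSConfig(**load_tts_settings()))
```

Failing early with a clear error is usually easier to debug than an authentication failure surfacing later during synthesis.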

Basic Text Synthesis

For straightforward text-to-speech conversion, use the synthesize method:

text = "Hello world, this is a test of the Smallest AI streaming TTS SDK."
audio_chunks = []

for chunk in streaming_tts.synthesize(text):
    audio_chunks.append(chunk)

Streaming Text Input

For real-time applications where text arrives incrementally, use synthesize_streaming:

def text_stream():
    text = "Streaming synthesis with chunked text input for Smallest SDK."
    for word in text.split():
        yield word + " "

audio_chunks = []
for chunk in streaming_tts.synthesize_streaming(text_stream()):
    audio_chunks.append(chunk)
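
When the text comes from a source that emits very small fragments, such as tokens from a language model, it can be convenient to group them into phrases before handing them to synthesize_streaming. The generator below is an illustrative sketch only: phrases_from_tokens is a hypothetical helper, and whether phrase-sized input changes latency or audio quality with this SDK is an assumption that should be measured.

```python
def phrases_from_tokens(tokens, boundaries=".!?,;:"):
    """Group small text fragments into phrases that end at punctuation."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit the buffer once it ends with any boundary character.
        if buffer.rstrip().endswith(tuple(boundaries)):
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer  # flush whatever is left when the stream ends


# Usage with the streaming client from the example above:
# for chunk in streaming_tts.synthesize_streaming(phrases_from_tokens(token_source)):
#     audio_chunks.append(chunk)
```

Since the helper is itself a generator, it adds no buffering beyond the current phrase and can be dropped in wherever text_stream() is used.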

Saving Audio to WAV File

Convert the raw PCM audio chunks to a standard WAV file:

import wave

def save_audio_chunks_to_wav(audio_chunks, filename="output.wav", sample_rate=24000):
    # Wrap the raw PCM bytes in a WAV container
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)  # Mono
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(sample_rate)  # Must match the sample_rate in TTSConfig (24 kHz here)
        wf.writeframes(b''.join(audio_chunks))

text = "Your text to synthesize here."
audio_chunks = list(streaming_tts.synthesize(text))
save_audio_chunks_to_wav(audio_chunks, "speech_output.wav")
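
For long outputs, joining every chunk in memory before writing may be undesirable. The sketch below writes each chunk as it arrives and reports the resulting duration. The function name write_wav_incrementally is hypothetical, and the format (mono, 16-bit samples) is taken from the example above rather than from a formal SDK specification.

```python
import wave


def write_wav_incrementally(chunks, filename, sample_rate=24000):
    """Write PCM chunks to a WAV file as they arrive; return duration in seconds."""
    bytes_per_sample = 2  # 16-bit mono, as in the example above
    total_bytes = 0
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(bytes_per_sample)
        wf.setframerate(sample_rate)
        for chunk in chunks:
            wf.writeframes(chunk)  # the wave module keeps the header up to date
            total_bytes += len(chunk)
    # duration = number of samples / samples per second
    return total_bytes / (bytes_per_sample * sample_rate)


# Usage with the streaming client from the example above:
# seconds = write_wav_incrementally(streaming_tts.synthesize(text), "speech_output.wav")
```

At 24000 Hz, mono, 16-bit, one second of audio occupies 48000 bytes, which is a quick way to sanity-check the size of the output.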

Configuration Parameters

  • voice_id: Voice identifier (e.g., “aditi”, “male-1”, “female-2”)
  • api_key: Your Smallest AI API key
  • language: Language code for synthesis (default: “en”)
  • sample_rate: Audio sample rate in Hz (default: 24000)
  • speed: Speech speed multiplier (default: 1.0; 1.0 = normal speed, 0.5 = half speed, 2.0 = double speed)
  • consistency: Voice consistency parameter (default: 0.5, range: 0.0-1.0)
  • enhancement: Audio enhancement level (default: 1)
  • similarity: Voice similarity parameter (default: 0, range: 0.0-1.0)
  • max_buffer_flush_ms: Maximum buffer time in milliseconds before forcing audio output (default: 0)
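
The documented ranges can be checked locally before a config is constructed. This is a hedged sketch: validate_tts_params is a hypothetical helper, the 0.0-1.0 ranges for consistency and similarity come from the list above, and the requirement that speed and sample_rate be positive is an assumption rather than a documented limit.

```python
def validate_tts_params(speed=1.0, consistency=0.5, similarity=0.0, sample_rate=24000):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    if speed <= 0:  # assumed constraint: a multiplier should be positive
        problems.append("speed must be positive")
    if not 0.0 <= consistency <= 1.0:  # range from the parameter list
        problems.append("consistency must be between 0.0 and 1.0")
    if not 0.0 <= similarity <= 1.0:  # range from the parameter list
        problems.append("similarity must be between 0.0 and 1.0")
    if sample_rate <= 0:  # assumed constraint
        problems.append("sample_rate must be positive")
    return problems


# Usage: refuse to build a config when any check fails.
# issues = validate_tts_params(speed=1.2, consistency=0.7)
# if issues:
#     raise ValueError("; ".join(issues))
```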

Output Format

The streaming TTS returns raw PCM audio data as bytes objects. Each chunk represents a portion of the synthesized audio that can be:

  • Played directly through audio hardware
  • Saved to audio files (WAV directly; MP3 and other compressed formats after encoding)
  • Streamed over network protocols
  • Processed with additional audio effects

Because the chunks are raw PCM, they need no decoding before playback, which keeps latency low and leaves the choice of container or codec to the application.
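
As an illustration of processing raw chunks, the sketch below applies a volume gain using only the standard library. It assumes the chunks are 16-bit signed little-endian mono PCM (consistent with the WAV example above, but an assumption nonetheless) and that each chunk contains a whole number of samples. The function apply_gain is a hypothetical helper, not part of the SDK.

```python
import array
import sys


def apply_gain(chunk, gain):
    """Scale 16-bit little-endian PCM samples by `gain`, clipping to the int16 range."""
    samples = array.array("h")
    samples.frombytes(chunk)  # raises ValueError if the chunk splits a sample
    if sys.byteorder == "big":
        samples.byteswap()  # convert little-endian data to native order
    scaled = array.array(
        "h", (max(-32768, min(32767, int(s * gain))) for s in samples)
    )
    if sys.byteorder == "big":
        scaled.byteswap()  # convert back to little-endian for output
    return scaled.tobytes()


# Usage with the streaming client from the examples above:
# quieter = [apply_gain(chunk, 0.5) for chunk in streaming_tts.synthesize(text)]
```

Clipping prevents integer overflow when the gain is greater than 1.0, at the cost of distortion on loud passages.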