WebSocket Support for TTS API

Our Text-to-Speech (TTS) API supports WebSocket communication, providing a real-time, low-latency streaming experience for applications that require instant speech synthesis. WebSockets allow continuous data exchange, making them ideal for use cases that demand uninterrupted audio generation.

When to Use WebSockets

1. Real-Time Streaming

WebSockets suit applications that need real-time speech synthesis, because audio is streamed over an open connection instead of waiting on a separate HTTP request for each piece of text.

2. Interactive Applications

For voice assistants, chatbots, and live transcription services, WebSockets support smooth, uninterrupted audio playback and fast response times.

3. Reduced Latency

A persistent WebSocket connection avoids repeated request-response cycles and connection setup, which lowers latency for applications that require rapid audio generation.

How It Works

1. Establish a Connection: The client opens a WebSocket connection to our TTS API.
2. Send Text Data: The client sends the text payload to be synthesized.
3. Process in Chunks: The API breaks the text into chunks and processes them individually.
4. Receive Audio Stream: As each chunk is processed, its audio is sent back to the client as a base64-encoded audio buffer.
5. Completion: Once all chunks are processed, a completion message is sent to indicate the end of the stream.
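
The receiving side of this flow can be sketched in Python. This is a minimal illustration, not the official client: it assumes each message is a JSON string shaped like the example messages in this document (audio under data.audio, and a done flag on the final message), and the function name iter_audio_chunks is hypothetical. The messages are simulated here; in practice they would be frames read from whatever WebSocket client library you use.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_audio_chunks(messages: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded audio bytes for each chunk message until the stream ends."""
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("done"):
            # Final message: the server has sent every chunk.
            return
        if msg.get("status") == "chunk":
            # Audio arrives base64-encoded; decode it back to raw bytes.
            yield base64.b64decode(msg["data"]["audio"])


# Simulated server messages standing in for frames read from a real socket.
simulated = [
    json.dumps({"status": "chunk", "data": {"audio": base64.b64encode(b"\x00\x01").decode()}}),
    json.dumps({"status": "chunk", "data": {"audio": base64.b64encode(b"\x02\x03").decode()}}),
    json.dumps({"message": "All chunks sent", "done": True}),
]
audio = b"".join(iter_audio_chunks(simulated))
```

Because the function is a generator, each decoded chunk can be handed to an audio player as soon as it arrives instead of waiting for the whole stream.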

Timeout Behavior

By default, the WebSocket connection enforces a 20-second inactivity timeout. This means that if the client does not send any data within 20 seconds, the server will automatically close the connection to free up resources.

To support longer sessions for use cases where clients need more time (e.g., long pauses between messages), the timeout can be extended up to 60 seconds.

To extend the timeout, include the timeout parameter in the WebSocket URL:

wss://api.smallest.ai/waves/v1/lightning-v2/get_speech/stream?timeout=60

This sets the inactivity timeout to 60 seconds. Valid values range from 20 (default) to 60 seconds.
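
A small helper can build the URL and reject values outside the documented range before connecting. This is a sketch: the helper name build_stream_url and the choice to raise an error (rather than clamp the value) are assumptions, and how the server itself treats out-of-range values is not described here.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/lightning-v2/get_speech/stream"
MIN_TIMEOUT_S = 20  # documented default
MAX_TIMEOUT_S = 60  # documented maximum


def build_stream_url(timeout_s: int = MIN_TIMEOUT_S) -> str:
    """Return the stream URL with a validated inactivity timeout (hypothetical helper)."""
    if not MIN_TIMEOUT_S <= timeout_s <= MAX_TIMEOUT_S:
        raise ValueError(
            f"timeout must be between {MIN_TIMEOUT_S} and {MAX_TIMEOUT_S} seconds"
        )
    return f"{BASE_URL}?{urlencode({'timeout': timeout_s})}"


url = build_stream_url(60)
```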

Implementation Details

The WebSocket TTS API is optimized to handle real-time text-to-speech conversions efficiently. Key aspects include:

Input Validation: Ensures the provided text and voice ID are valid before processing.
Chunk Processing: Long texts are split into smaller chunks (e.g., 240 characters) to optimize processing.
Voice Caching: The API fetches and caches voice configurations to reduce redundant database queries.
Task Queue System: Tasks are pushed to a Redis-based queue for efficient processing and real-time audio generation.
Error Handling: If any chunk fails, an error message is logged and sent to the client.
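
To make the chunking step concrete, here is one way a long text could be split into pieces of at most 240 characters. This is an illustration only: the server's actual splitting rules are not documented here, and the function split_text, its whitespace-based strategy, and its handling of overlong words are all assumptions.

```python
def split_text(text: str, max_len: int = 240) -> list[str]:
    """Split text on whitespace into chunks of at most max_len characters."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than max_len is hard-split into max_len pieces.
        while len(word) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_len])
            word = word[max_len:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_len:
            current = candidate
        else:
            # Adding this word would overflow the chunk, so start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


chunks = split_text("word " * 100)
```

Splitting at word boundaries keeps words from being cut in half, which usually matters for pronunciation; a production splitter would more likely break at sentence or clause boundaries.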

Example Request Flow

The client sends a WebSocket message:

{
  "text": "Hello, world!",
  "voice_id": "12345",
  "speed": 1.0,
  "sample_rate": 24000
}
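
A client might assemble this message with a small helper that checks the text and voice ID locally before sending. This is a sketch: build_request is a hypothetical name, the defaults simply mirror the example values above, and the local checks are an assumption that does not replace the API's own validation.

```python
import json


def build_request(
    text: str,
    voice_id: str,
    speed: float = 1.0,
    sample_rate: int = 24000,
) -> str:
    """Serialize a synthesis request as a JSON string (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if not voice_id:
        raise ValueError("voice_id must not be empty")
    return json.dumps(
        {
            "text": text,
            "voice_id": voice_id,
            "speed": speed,
            "sample_rate": sample_rate,
        }
    )


payload = build_request("Hello, world!", "12345")
```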

The API validates the request and retrieves the voice settings.
The text is split into chunks and processed in the background.
The client receives responses like:

{
  "request_id": "047c9091-b770-41d8-b96b-907d1c8406c0",
  "status": "chunk",
  "data": {
    "audio": "<base64_encoded_audio_chunk>"
  }
}

Once all chunks are sent, a final message is returned:

{
  "request_id": "047c9091-b770-41d8-b96b-907d1c8406c0",
  "status": "complete",
  "message": "All chunks sent",
  "done": true
}
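
Because every message carries a request_id, a client that sends several texts over one connection can keep each request's audio separate. The sketch below is one possible approach, not part of the API: the StreamTracker and RequestAudio names are hypothetical, and it relies on the done flag rather than the status string to detect the end of a request.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class RequestAudio:
    """Audio collected so far for one request."""
    audio: bytearray = field(default_factory=bytearray)
    done: bool = False


class StreamTracker:
    """Accumulate audio per request_id from incoming JSON messages."""

    def __init__(self) -> None:
        self.requests: dict[str, RequestAudio] = {}

    def handle(self, raw: str) -> RequestAudio:
        msg = json.loads(raw)
        state = self.requests.setdefault(msg["request_id"], RequestAudio())
        if msg.get("done"):
            # Final message for this request: no more chunks will follow.
            state.done = True
        elif msg.get("status") == "chunk":
            state.audio.extend(base64.b64decode(msg["data"]["audio"]))
        return state


tracker = StreamTracker()
rid = "047c9091-b770-41d8-b96b-907d1c8406c0"
tracker.handle(json.dumps({
    "request_id": rid,
    "status": "chunk",
    "data": {"audio": base64.b64encode(b"\x10\x20").decode()},
}))
result = tracker.handle(json.dumps({
    "request_id": rid,
    "message": "All chunks sent",
    "done": True,
}))
```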

For implementation details, check our WebSocket API documentation.