# Stream Speech (WebSocket)

GET /waves/v1/tts/live

# Live TTS WebSocket — `/waves/v1/tts/live`

Real-time text-to-speech over a persistent WebSocket connection. The
`model` field in the request payload selects which Lightning pool serves
the synthesis.

## When to use this

- **Use this** when text arrives incrementally (LLM token streams, live
  captioning, conversational pipelines where playback should start as
  soon as the first chunk is ready).
- **Use HTTP POST to `/waves/v1/tts/live` (SSE)** when you have the full
  text up front but still want chunked playback. The URL is the same and
  only the protocol differs: an HTTP POST returns SSE, while a WSS
  connection opens a WebSocket.
- **Use `/waves/v1/tts` (sync)** when total latency doesn't matter.
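
For the incremental case, a client would typically send one JSON message per text fragment. The sketch below builds such a sequence. It assumes that every fragment except the last sets `continue: true`, and that the last one sets `continue: false` together with `flush: true`; that is one reading of the `continue` and `flush` fields in the request schema, not confirmed server behavior. `incremental_messages` is a hypothetical helper name and `meher` is just an example voice.

```python
import json
from typing import Iterable, Iterator


def incremental_messages(fragments: Iterable[str], voice_id: str) -> Iterator[str]:
    """Yield one JSON request per text fragment (hypothetical helper).

    Assumption: `continue: true` signals that more text is coming, and the
    final message sets `continue: false` plus `flush: true` so the server
    drains whatever is still buffered.
    """
    pending = None
    for fragment in fragments:
        # Hold one fragment back so the last one can be marked as final.
        if pending is not None:
            yield json.dumps({"voice_id": voice_id, "text": pending, "continue": True})
        pending = fragment
    if pending is not None:
        yield json.dumps(
            {"voice_id": voice_id, "text": pending, "continue": False, "flush": True}
        )


# Example: two fragments from an LLM token stream.
messages = list(incremental_messages(["Hello, ", "world."], voice_id="meher"))
```

Each string in `messages` would be sent as one WebSocket text message, in order.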

## Selecting the model

Pass `"model": "lightning_v3.1"` (default) or
`"model": "lightning_v3.1_pro"` on each request. Concurrency and latency
are identical across both. Voice catalogs differ — see the
[Lightning v3.1](/waves/model-cards/text-to-speech/lightning-v-3-1) and
[Lightning v3.1 Pro](/waves/model-cards/text-to-speech/lightning-v-3-1-pro)
model cards for the per-model catalog.
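
As an illustration, the following sketch serializes a request message and rejects model names outside the two documented values before anything is sent. `build_tts_request` is a hypothetical helper, and `YOUR_VOICE_ID` is a placeholder to replace with a voice from the chosen model's catalog.

```python
import json

# The two model values listed in the request schema.
SUPPORTED_MODELS = ("lightning_v3.1", "lightning_v3.1_pro")


def build_tts_request(voice_id: str, text: str, model: str = "lightning_v3.1") -> str:
    """Serialize a request message, failing early on an unknown model name."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    return json.dumps({"voice_id": voice_id, "text": text, "model": model})


# Route one request to the Pro pool; omit `model` to use the default.
payload = build_tts_request("YOUR_VOICE_ID", "Hello from the Pro pool.", model="lightning_v3.1_pro")
```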

## Optional features

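Set `word_timestamps: true` to receive per-word timing events
(`status: "word_timestamp"`) interleaved with the audio chunks. This is
supported on the base-queue English and Hindi voices (`meher`, `devansh`,
`kartik`, `maithili`, `liam`, `avery`); other voice families still return
audio but silently emit no word events. See
[Word-level timestamps](/waves/documentation/text-to-speech-lightning/word-timestamps).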
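
A receiving loop has to tell audio frames from timing frames. The sketch below dispatches on `status` using the frame shapes from the response schema (`data.audio` for `chunk`, `data.{id,word,start,end}` for `word_timestamp`). `handle_frame` is a hypothetical helper, and the frames fed to it here are synthetic stand-ins for real server output.

```python
import base64
import json


def handle_frame(raw: str, audio: bytearray, words: list) -> bool:
    """Process one server frame; return True once the stream is complete."""
    frame = json.loads(raw)
    status = frame.get("status")
    data = frame.get("data") or {}
    if status == "chunk":
        # Audio arrives base64-encoded; append the decoded bytes.
        audio.extend(base64.b64decode(data["audio"]))
    elif status == "word_timestamp":
        words.append((data["id"], data["word"], data["start"], data["end"]))
    elif status == "error":
        raise RuntimeError(frame.get("message", "server reported an error"))
    return status == "complete"


# Synthetic frames for illustration only.
audio, words = bytearray(), []
frames = [
    json.dumps({"status": "chunk", "data": {"audio": base64.b64encode(b"\x00\x01").decode()}}),
    json.dumps({"status": "word_timestamp", "data": {"id": 0, "word": "$100", "start": 0.0, "end": 0.42}}),
    json.dumps({"status": "complete"}),
]
done = [handle_frame(f, audio, words) for f in frames]
```

Because word events are only emitted for supported voices, the loop should not block waiting for them; an empty `words` list is a valid outcome.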

## Connection timeout

The server closes idle WebSocket connections to free resources. The
default idle timeout is **60 seconds** — if your client does not send
a message within that window the server closes the connection with:

```json
{"status": "error", "message": "Connection timed out after 60 seconds of inactivity"}
```

Override the value with the `timeout` query parameter on the URL:

```
wss://api.smallest.ai/waves/v1/tts/live?timeout=120
```

Pass any positive integer (seconds). Smaller values are honored
verbatim (e.g. `?timeout=5` closes after 5 s of silence). Use a larger
value when your application has known pauses between turns — voice
agents with long human-thinking windows, agentic pipelines waiting on
an LLM round-trip, etc.

The timeout resets on every message you send, so keep-alive traffic
restarts the clock.
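
A small helper can keep the timeout override in one place. This is a minimal sketch; `live_url` is a hypothetical name, and the positive-integer check mirrors the rule stated above rather than any documented server-side validation.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/tts/live"


def live_url(timeout_s: Optional[int] = None) -> str:
    """Return the WebSocket URL, optionally overriding the idle timeout."""
    if timeout_s is None:
        return BASE_URL  # server default of 60 seconds applies
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(timeout_s, bool) or not isinstance(timeout_s, int) or timeout_s <= 0:
        raise ValueError("timeout must be a positive integer number of seconds")
    return f"{BASE_URL}?{urlencode({'timeout': timeout_s})}"


# A voice agent that expects long pauses between turns.
url = live_url(120)
```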

## Migrating from `/waves/v1/lightning-v3.1/get_speech/stream`

Same protocol, same payload shape — only the URL changes. Existing
clients should:

1. Update the WebSocket URL to `wss://api.smallest.ai/waves/v1/tts/live`.
2. Optionally add `"model": "lightning_v3.1_pro"` to route to the Pro
   pool. Omitting `model` keeps the existing standard-pool behavior.

Voice IDs, sample rates, auth, and the response/streaming format are
unchanged, so downstream audio handling, jitter buffers, and barge-in
logic stay the same.
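
For codebases that keep the endpoint in configuration, the URL change in step 1 can be applied mechanically. The sketch below swaps only the path and leaves the host and any query string untouched; `migrate_url` is a hypothetical helper, not part of any SDK.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_PATH = "/waves/v1/lightning-v3.1/get_speech/stream"
NEW_PATH = "/waves/v1/tts/live"


def migrate_url(url: str) -> str:
    """Rewrite the legacy streaming path to the new one; leave other URLs alone."""
    parts = urlsplit(url)
    if parts.path.rstrip("/") != OLD_PATH:
        return url
    # SplitResult is a namedtuple, so _replace returns an updated copy.
    return urlunsplit(parts._replace(path=NEW_PATH))


new_url = migrate_url("wss://api.smallest.ai/waves/v1/lightning-v3.1/get_speech/stream")
```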


Reference: https://docs.smallest.ai/waves/api-reference/api-reference/text-to-speech/tts

## AsyncAPI Specification

```yaml
asyncapi: 2.6.0
info:
  title: TTS
  version: subpackage_tts.TTS
channels:
  /waves/v1/tts/live:
    publish:
      operationId: tts-publish
      summary: TtsResponse
      description: Receive audio data chunks and completion status from the server.
      message:
        name: TtsResponse
        title: TtsResponse
        description: Receive audio data chunks and completion status from the server.
        payload:
          $ref: '#/components/schemas/ttsStream_ttsResponse.message'
    subscribe:
      operationId: tts-subscribe
      summary: TtsRequest
      description: >-
        Send a JSON message with `voice_id`, `text`, and optional parameters
        (including `model`) to generate speech audio.
      message:
        name: TtsRequest
        title: TtsRequest
        description: >-
          Send a JSON message with `voice_id`, `text`, and optional parameters
          (including `model`) to generate speech audio.
        payload:
          $ref: '#/components/schemas/ttsStream_ttsRequest.message'
servers:
  waves:
    url: wss://api.smallest.ai/
    protocol: wss
components:
  schemas:
    ChannelsTtsStreamMessagesTtsResponseMessageStatus:
      type: string
      enum:
        - chunk
        - word_timestamp
        - complete
      description: >
        Frame type discriminator:

        - `chunk` — base64-encoded audio chunk in `data.audio`.

        - `word_timestamp` — per-word timing event in
        `data.{id,word,start,end}`. Only emitted when the request set
        `word_timestamps: true` and the voice family supports it.

        - `complete` — terminal frame; the server closes the WebSocket after
        this.
      title: ChannelsTtsStreamMessagesTtsResponseMessageStatus
    ChannelsTtsStreamMessagesTtsResponseMessageData:
      type: object
      properties:
        audio:
          type: string
          description: 'Base64-encoded audio chunk (present on `status: "chunk"` frames).'
        id:
          type: integer
          description: >-
            0-indexed position of the word within the input text (present on
            `status: "word_timestamp"` frames).
        word:
          type: string
          description: >-
            Exact substring from the input text, un-normalized — `"$100"` stays
            `"$100"`, `"25th"` stays `"25th"` (present on `status:
            "word_timestamp"` frames).
        start:
          type: number
          format: double
          description: >-
            Start of the word in seconds, relative to the start of the audio
            stream (present on `status: "word_timestamp"` frames).
        end:
          type: number
          format: double
          description: >-
            End of the word in seconds, relative to the start of the audio
            stream (present on `status: "word_timestamp"` frames).
      description: >-
        Frame-specific payload. Shape depends on `status`: `chunk` frames carry
        `audio`; `word_timestamp` frames carry `id`, `word`, `start`, and `end`.
      title: ChannelsTtsStreamMessagesTtsResponseMessageData
    ttsStream_ttsResponse.message:
      type: object
      properties:
        session_id:
          type: string
          description: >-
            Internal session identifier (system-generated, stable for the
            WebSocket connection lifetime).
        request_id:
          type: string
          description: >-
            Internal request identifier (system-generated UUID, unique per TTS
            synthesis).
        external_session_id:
          type: string
          description: Echoed client-provided session_id (omitted if not provided).
        external_request_id:
          type: string
          description: Echoed client-provided request_id (omitted if not provided).
        status:
          $ref: >-
            #/components/schemas/ChannelsTtsStreamMessagesTtsResponseMessageStatus
          description: >
            Frame type discriminator:

            - `chunk` — base64-encoded audio chunk in `data.audio`.

            - `word_timestamp` — per-word timing event in
            `data.{id,word,start,end}`. Only emitted when the request set
            `word_timestamps: true` and the voice family supports it.

            - `complete` — terminal frame; the server closes the WebSocket after
            this.
        data:
          $ref: '#/components/schemas/ChannelsTtsStreamMessagesTtsResponseMessageData'
          description: >-
            Frame-specific payload. Shape depends on `status`: `chunk` frames
            carry `audio`; `word_timestamp` frames carry `id`, `word`, `start`,
            and `end`.
      title: ttsStream_ttsResponse.message
    ChannelsTtsStreamMessagesTtsRequestMessageModel:
      type: string
      enum:
        - lightning_v3.1
        - lightning_v3.1_pro
      default: lightning_v3.1
      description: |
        TTS model to route the request to. Controls which model pool
        serves this synthesis.

        - `lightning_v3.1` (default) — standard Lightning v3.1.
        - `lightning_v3.1_pro` — Lightning v3.1 Pro pool with a
          curated voice catalog. See the
          [Pro model card](/waves/model-cards/text-to-speech/lightning-v-3-1-pro).

        Same concurrency and latency profile across both. Other
        request fields behave identically.
      title: ChannelsTtsStreamMessagesTtsRequestMessageModel
    ChannelsTtsStreamMessagesTtsRequestMessageLanguage:
      type: string
      enum:
        - en
        - hi
        - mr
        - kn
        - ta
        - bn
        - gu
        - te
        - ml
        - pa
        - or
        - es
      default: en
      description: >
        Language code for synthesis. Influences pronunciation,

        number/date normalization, and phoneme selection.


        Each voice has its own `tags.language` set in the voice catalog —

        query `GET /waves/v1/lightning-v3.1/get_voices`. Pass a language

        the voice was trained on; passing other codes is accepted by the

        API but produces English-pronounced output.


        **On `lightning_v3.1`**, the full 12-language catalog applies.


        **On `lightning_v3.1_pro`**:

        - Pass `en` → UK + American accented English.

        - Pass `hi` → Indian accented English + Hindi (code-switching).

        - Omit `language` → defaults to `en + hi` (mixed Indian + Western
        English coverage).
      title: ChannelsTtsStreamMessagesTtsRequestMessageLanguage
    ttsStream_ttsRequest.message:
      type: object
      properties:
        voice_id:
          type: string
          description: >-
            The ID of the voice to use. See the model card for available voices
            per model.
        text:
          type: string
          description: The text to convert to speech.
        model:
          $ref: '#/components/schemas/ChannelsTtsStreamMessagesTtsRequestMessageModel'
          default: lightning_v3.1
          description: |
            TTS model to route the request to. Controls which model pool
            serves this synthesis.

            - `lightning_v3.1` (default) — standard Lightning v3.1.
            - `lightning_v3.1_pro` — Lightning v3.1 Pro pool with a
              curated voice catalog. See the
              [Pro model card](/waves/model-cards/text-to-speech/lightning-v-3-1-pro).

            Same concurrency and latency profile across both. Other
            request fields behave identically.
        max_buffer_flush_ms:
          type: integer
          default: 0
          description: >-
            The maximum time (in ms) to wait for more input before generating
            output. The buffer flushes when this time elapses or when enough
            input has arrived for optimal output, whichever comes first. Useful
            for streamed input. Defaults to 0.
        continue:
          type: boolean
          default: false
          description: >-
            Whether the server should buffer and wait for more input after
            receiving the current message. If not set, the server assumes no
            more input is coming.
        flush:
          type: boolean
          default: false
          description: >-
            Whether the server should flush the current buffer.
        complete_backoff_ms:
          type: number
          format: double
          default: 4000
          description: >-
            The time in ms to wait after the last chunk is sent before sending
            the `complete` frame. Default is 4000 ms. Maximum is 10000 ms.
        language:
          $ref: >-
            #/components/schemas/ChannelsTtsStreamMessagesTtsRequestMessageLanguage
          default: en
          description: >
            Language code for synthesis. Influences pronunciation,

            number/date normalization, and phoneme selection.


            Each voice has its own `tags.language` set in the voice catalog —

            query `GET /waves/v1/lightning-v3.1/get_voices`. Pass a language

            the voice was trained on; passing other codes is accepted by the

            API but produces English-pronounced output.


            **On `lightning_v3.1`**, the full 12-language catalog applies.


            **On `lightning_v3.1_pro`**:

            - Pass `en` → UK + American accented English.

            - Pass `hi` → Indian accented English + Hindi (code-switching).

            - Omit `language` → defaults to `en + hi` (mixed Indian + Western
            English coverage).
        sample_rate:
          type: integer
          default: 44100
          description: 'Audio sample rate in Hz. Supported values: 8000, 16000, 24000, 44100'
        speed:
          type: number
          format: double
          default: 1
          description: Speaking speed multiplier
        session_id:
          type: string
          description: >-
            Optional client-provided session identifier for correlation. Only
            alphanumeric characters, hyphens, underscores, and dots allowed. Max
            128 characters. Echoed back in responses as `external_session_id`.
        request_id:
          type: string
          description: >-
            Optional client-provided request identifier for correlation. Only
            alphanumeric characters, hyphens, underscores, and dots allowed. Max
            128 characters. Echoed back in responses as `external_request_id`.
        word_timestamps:
          type: boolean
          default: false
          description: >
            Opt in to per-word timing events for the synthesized audio. When
            `true`, the server interleaves `status: "word_timestamp"` frames
            with the audio `chunk` frames; each carries `data: { id, word,
            start, end }` where `start`/`end` are floats in seconds relative to
            the start of the audio stream, and `word` is verbatim from the input
            text (un-normalized — `"$100"` stays `"$100"`, not `"one hundred
            dollars"`). Supported on base-queue English + Hindi voices (`meher`,
            `devansh`, `kartik`, `maithili`, `liam`, `avery`); other voice
            families silently emit no word events (audio still works). Defaults
            to `false` so existing integrations see no change.
      required:
        - voice_id
        - text
      title: ttsStream_ttsRequest.message

```
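
Several constraints in the request schema can be checked on the client before a message is sent. The sketch below covers the required fields, the supported sample rates, the `complete_backoff_ms` ceiling, and the character and length rules for `session_id` and `request_id`. `validate_request` is a hypothetical helper. Two details are assumptions: "alphanumeric" is read as ASCII letters and digits, and `complete_backoff_ms` is treated as non-negative (the schema states only the maximum).

```python
import re

SAMPLE_RATES = {8000, 16000, 24000, 44100}
# Assumption: ASCII letters and digits plus '.', '_' and '-', 1 to 128 chars.
ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{1,128}")


def validate_request(msg: dict) -> list:
    """Return a list of problems found in a request dict (empty if none)."""
    problems = []
    for field in ("voice_id", "text"):
        if not isinstance(msg.get(field), str):
            problems.append(f"{field} is required and must be a string")
    if msg.get("sample_rate", 44100) not in SAMPLE_RATES:
        problems.append("sample_rate must be one of 8000, 16000, 24000, 44100")
    backoff = msg.get("complete_backoff_ms", 4000)
    if not 0 <= backoff <= 10000:
        problems.append("complete_backoff_ms must be between 0 and 10000")
    for field in ("session_id", "request_id"):
        value = msg.get(field)
        if value is not None and not ID_PATTERN.fullmatch(str(value)):
            problems.append(f"{field} has invalid characters or length")
    return problems


ok = validate_request({"voice_id": "meher", "text": "Hi", "session_id": "call-42.a_b"})
bad = validate_request({"voice_id": "meher", "sample_rate": 22050, "request_id": "has space"})
```

Here `ok` is empty, while `bad` reports the missing `text`, the unsupported sample rate, and the invalid `request_id`.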