Hydra — Speech to Speech

Hydra is a realtime speech-to-speech model. The client streams microphone audio over a WebSocket, the model returns synthesised speech over the same socket, and turn-taking is handled server-side. No transcript travels on the wire: audio bytes are the payload.

If you’ve used the OpenAI Realtime API, Hydra fills the same role on the Smallest AI stack.
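Because the payload is raw audio, a client typically slices captured audio into small fixed-duration frames before sending each one over the socket. The sketch below shows only that slicing step. The 16 kHz sample rate, 16-bit mono PCM format, 20 ms frame size, and the helper name `pcm_frames` are illustrative assumptions, not values taken from the Hydra documentation; check the WebSocket connection and Audio I/O pages for the formats Hydra actually accepts.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16_000,   # assumed sample rate, not confirmed for Hydra
    frame_ms: int = 20,          # assumed frame duration
    bytes_per_sample: int = 2,   # 16-bit mono PCM (assumption)
) -> Iterator[bytes]:
    """Yield fixed-duration frames from a mono PCM buffer.

    The final frame is zero-padded (silence) so every frame has the same size.
    """
    frame_bytes = (sample_rate * frame_ms // 1000) * bytes_per_sample
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        # Pad a short trailing frame with silence.
        yield frame.ljust(frame_bytes, b"\x00")


if __name__ == "__main__":
    one_second_of_silence = bytes(16_000 * 2)
    frames = list(pcm_frames(one_second_of_silence))
    print(len(frames), len(frames[0]))
```

In a real client each yielded frame would be written to the open WebSocket as it becomes available, rather than collected into a list.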

Common use cases

Phone agents

Sub-second latency from end-of-user-speech to first audio chunk. Drop-in for outbound and inbound voice flows.

In-product voice copilots

Hands-free assistants embedded in web and mobile apps — barge-in handled by the model.

Kiosks & in-car

One WebSocket, predictable failure modes, no STT/LLM/TTS glue to maintain.

Voice-first consumer apps

Companions, audio diaries, language tutors — natural turn-taking out of the box.

What’s on the wire

Two things to know up front:

  • Stream audio continuously: there is no manual commit or end-of-turn signal to send. Hydra detects turn boundaries on its own.
  • Full-duplex: the user can speak over the model. The in-flight response is cancelled automatically and reported with status: "cancelled", reason: "interrupted".
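On the client, the practical consequence of barge-in is that any audio already queued for playback should be dropped as soon as the cancellation arrives, otherwise the agent keeps talking over the user. The following is a minimal sketch of that reaction, assuming server events have already been decoded into Python dicts. Only the `status` and `reason` values come from the text above; the `audio` key, the `PlaybackQueue` class, and the `handle_event` function are hypothetical names used for illustration, and the real event schema may differ.

```python
from collections import deque
from typing import Any, Deque, Dict


class PlaybackQueue:
    """Hypothetical buffer of audio chunks waiting to be played."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def enqueue(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def flush(self) -> int:
        """Drop all pending audio and return how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def __len__(self) -> int:
        return len(self._chunks)


def is_interrupted(event: Dict[str, Any]) -> bool:
    # The status/reason values are quoted from the docs; where they sit
    # inside the event is an assumption.
    return (
        event.get("status") == "cancelled"
        and event.get("reason") == "interrupted"
    )


def handle_event(event: Dict[str, Any], queue: PlaybackQueue) -> None:
    """Queue incoming audio, or stop playback when the user barges in."""
    if is_interrupted(event):
        queue.flush()
    elif isinstance(event.get("audio"), bytes):  # "audio" key is hypothetical
        queue.enqueue(event["audio"])


if __name__ == "__main__":
    q = PlaybackQueue()
    handle_event({"audio": b"\x00\x01"}, q)
    handle_event({"audio": b"\x02\x03"}, q)
    handle_event({"status": "cancelled", "reason": "interrupted"}, q)
    print(len(q))
```

Keeping the interruption check in its own function makes it easy to adjust once the actual event layout is known, without touching the playback logic.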

Next

Quickstart

Clone the reference client, paste your API key, and talk to Hydra in your browser.

WebSocket connection

Connect URL, auth, idle timeout, close codes. Python + Node + Browser snippets.

Managing sessions

Session lifecycle, persona, voice, mid-session updates, conversation items.

Turn detection & barge-in

How the model detects speech, how to handle barge-in cleanly on the client.

Tool calling

Declare tools, stream arguments, post results back, narrate the answer.

Prompting voice agents

System prompts, voice identity, length discipline, tool-call prompting.

Related

  • Model card — Hydra — voices, performance, pricing
  • Reference client (Next.js) — production-grade browser client with barge-in, multi-agent presets, tool execution