Hydra — Speech to Speech | Smallest AI Docs

Hydra is a realtime speech-to-speech model. The client streams microphone audio over a WebSocket, the model returns synthesised speech over the same socket, and turn-taking is handled server-side. There is no transcript on the wire: audio bytes are the payload.

If you’ve used the OpenAI Realtime API, Hydra fills the same role on the Smallest AI stack.
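To make the streaming model above concrete, here is a minimal sketch of how a client might cut captured audio into small fixed-size frames before sending each one over the socket. The audio format is an assumption for illustration only (16-bit mono PCM at 16 kHz, 20 ms per frame); this page does not state the format Hydra expects, so check the WebSocket connection page before relying on these numbers. The names `frame_size_bytes` and `iter_frames` are hypothetical helpers, not part of any Hydra SDK.

```python
from typing import Iterator

# Assumed capture format, for illustration only; not confirmed by this page.
SAMPLE_RATE_HZ = 16_000   # assumption: 16 kHz
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM
FRAME_MS = 20             # assumption: 20 ms of audio per WebSocket message


def frame_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    frame_ms: int = FRAME_MS,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> int:
    """Return the number of bytes in one frame of the given duration."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames of at most frame_bytes bytes.

    The final frame may be shorter if the buffer is not an exact multiple.
    """
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

In a real client each yielded frame would be handed to the WebSocket send call as it becomes available, so that audio flows continuously instead of being uploaded in one batch.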

Common use cases

Phone agents

Sub-second latency from end-of-user-speech to first audio chunk. Drop-in for outbound and inbound voice flows.

In-product voice copilots

Hands-free assistants embedded in web and mobile apps — barge-in handled by the model.

Kiosks & in-car

One WebSocket, predictable failure modes, no STT/LLM/TTS glue to maintain.

Voice-first consumer apps

Companions, audio diaries, language tutors — natural turn-taking out of the box.

What’s on the wire

Two things to know up front:

Stream audio continuously: there is no manual commit or end-of-turn signal to send. Hydra detects turn boundaries on its own.
Full-duplex: the user can speak over the model. When that happens, the in-flight response is cancelled automatically and reported with status: "cancelled", reason: "interrupted".
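The second point has a client-side consequence: audio that was already received but not yet played should be dropped once the response is cancelled, otherwise the user keeps hearing the interrupted answer. The sketch below shows one way to do that. Only the `status` and `reason` values come from this page; the rest is assumed for illustration, including that model audio arrives as binary messages and control events as JSON text messages. `PlaybackBuffer`, `is_interrupted`, and `handle_message` are hypothetical names.

```python
import json
from collections import deque
from typing import Deque, Optional, Union


class PlaybackBuffer:
    """Holds model audio that has been received but not yet played."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def __len__(self) -> int:
        return len(self._chunks)

    def push(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop all queued audio and return how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped


def is_interrupted(event: dict) -> bool:
    """True when an event reports a response cancelled by barge-in."""
    return event.get("status") == "cancelled" and event.get("reason") == "interrupted"


def handle_message(message: Union[bytes, str], buffer: PlaybackBuffer) -> str:
    """Route one incoming WebSocket message and report what was done.

    Assumption: binary messages carry audio, text messages carry JSON events.
    """
    if isinstance(message, (bytes, bytearray)):
        buffer.push(bytes(message))
        return "audio"
    event = json.loads(message)
    if is_interrupted(event):
        buffer.flush()  # stop playing the answer the user just talked over
        return "flushed"
    return "ignored"
```

A playback loop would call `pop()` on its own schedule; because `flush()` empties the queue immediately, the next `pop()` after an interruption returns `None` and the speaker goes quiet.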

Next

Quickstart

Clone the reference client, paste your API key, and talk to Hydra in your browser.

WebSocket connection

Connect URL, auth, idle timeout, close codes. Python + Node + Browser snippets.

Managing sessions

Session lifecycle, persona, voice, mid-session updates, conversation items.

Turn detection & barge-in

How the model detects speech, how to handle barge-in cleanly on the client.

Tool calling

Declare tools, stream arguments, post results back, narrate the answer.

Prompting voice agents

System prompts, voice identity, length discipline, tool-call prompting.

Related

Model card (Hydra): voices, performance, pricing
Reference client (Next.js): production-grade browser client with barge-in, multi-agent presets, and tool execution
