Hydra — Speech to Speech

Hydra is a realtime speech-to-speech model. The client streams microphone audio over a WebSocket, the model returns synthesised speech over the same socket, and turn-taking is handled server-side. There is no transcript on the wire: audio bytes are the payload.
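The shape of a client follows from that description: one task pushes microphone chunks up the socket while another pulls synthesised audio down, both on the same connection. The sketch below is a minimal illustration of that pattern, not Hydra's SDK. `DuplexSocket`, `LoopbackSocket`, `fake_mic`, and `run_session` are hypothetical names, and the in-memory loopback stands in for a real WebSocket so the example runs without a network or an endpoint URL.

```python
import asyncio
from typing import AsyncIterator, Callable, Optional, Protocol


class DuplexSocket(Protocol):
    """Hypothetical minimal socket interface (an assumption, not Hydra's API)."""

    async def send(self, data: bytes) -> None: ...
    async def recv(self) -> Optional[bytes]: ...  # None means the socket closed
    async def close(self) -> None: ...


async def uplink(sock: DuplexSocket, mic_chunks: AsyncIterator[bytes]) -> None:
    # Stream audio as it arrives; nothing is sent to mark the end of a turn.
    async for chunk in mic_chunks:
        await sock.send(chunk)
    await sock.close()  # only reached when the microphone source ends


async def downlink(sock: DuplexSocket, play: Callable[[bytes], None]) -> None:
    # Hand every received audio chunk to the playback callback until close.
    while True:
        data = await sock.recv()
        if data is None:
            break
        play(data)


async def run_session(sock, mic_chunks, play) -> None:
    # Both directions run concurrently on the same socket.
    await asyncio.gather(uplink(sock, mic_chunks), downlink(sock, play))


class LoopbackSocket:
    """In-memory stand-in for a WebSocket: echoes sent audio back."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    async def send(self, data: bytes) -> None:
        await self._queue.put(data)

    async def recv(self) -> Optional[bytes]:
        return await self._queue.get()

    async def close(self) -> None:
        await self._queue.put(None)


async def fake_mic(n_chunks: int) -> AsyncIterator[bytes]:
    # Placeholder for a real microphone capture loop.
    for i in range(n_chunks):
        yield bytes([i]) * 4
        await asyncio.sleep(0)


async def demo() -> list:
    sock = LoopbackSocket()  # created inside the running event loop
    played: list = []
    await run_session(sock, fake_mic(3), played.append)
    return played


if __name__ == "__main__":
    print(len(asyncio.run(demo())), "chunks played")
```

In a real client, `LoopbackSocket` would be replaced by a thin wrapper around whichever WebSocket library you use, and `play` would feed an audio output device.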

If you’ve used the OpenAI Realtime API, Hydra fills the same role on the Smallest AI stack.

Common use cases

Phone agents

Sub-second latency from end-of-user-speech to first audio chunk. Drop-in for outbound and inbound voice flows.

In-product voice copilots

Hands-free assistants embedded in web and mobile apps — barge-in handled by the model.

Kiosks & in-car

One WebSocket, predictable failure modes, no STT/LLM/TTS glue to maintain.

Voice-first consumer apps

Companions, audio diaries, language tutors — natural turn-taking out of the box.

What’s on the wire

Two things to know up front:

  • Stream audio continuously: there is no manual commit or end-of-turn signal to send. Hydra detects turn boundaries on its own.
  • Full-duplex: the user can speak over the model. The in-flight response is cancelled automatically and reported with status: "cancelled", reason: "interrupted".
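On the client, the practical consequence of barge-in is that audio already queued for playback should be dropped once the response is cancelled, otherwise the model keeps talking over the user. The following is a minimal sketch of that handling. It assumes audio arrives as binary frames and that the cancellation arrives as a JSON text frame with top-level `status` and `reason` fields; only the two field values come from the text above, the framing is an assumption. `PlaybackQueue` is a hypothetical helper, not part of any Hydra SDK.

```python
import json
from collections import deque
from typing import Deque, Union


class PlaybackQueue:
    """Buffers model audio and flushes it when the response is interrupted."""

    def __init__(self) -> None:
        self.pending: Deque[bytes] = deque()
        self.interruptions = 0

    def on_frame(self, frame: Union[bytes, str]) -> None:
        # Assumption: binary frames carry audio, text frames carry JSON events.
        if isinstance(frame, bytes):
            self.pending.append(frame)
            return

        event = json.loads(frame)
        # Field values taken from the docs; the envelope shape is assumed.
        if event.get("status") == "cancelled" and event.get("reason") == "interrupted":
            self.pending.clear()  # drop audio the user has talked over
            self.interruptions += 1


if __name__ == "__main__":
    queue = PlaybackQueue()
    queue.on_frame(b"\x00\x01")
    queue.on_frame(b"\x02\x03")
    queue.on_frame(json.dumps({"status": "cancelled", "reason": "interrupted"}))
    print(len(queue.pending), queue.interruptions)
```

Other events are ignored here, so the queue keeps accepting audio for the next response after an interruption.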