For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
DocumentationAPI ReferenceSelf HostModel CardsClient LibrariesIntegrationsDeveloper ToolsChangelog
DocumentationAPI ReferenceSelf HostModel CardsClient LibrariesIntegrationsDeveloper ToolsChangelog
  • Getting Started
    • Introduction
    • Models
    • Authentication
  • Text to Speech (Lightning)
    • Quickstart
    • Overview
    • Sync & Async
    • Streaming
    • Pronunciation Dictionaries
    • Voices & Languages
    • HTTP vs Streaming vs WebSockets
  • Speech to Text (Pulse)
    • Quickstart
    • Overview
  • Speech to Speech (Hydra)
    • Overview
    • Quickstart
    • WebSocket connection
    • Managing sessions
    • Audio I/O
    • Turn detection & barge-in
    • Tool calling
    • Prompting voice agents
    • Errors & reconnection
  • LLM (Electron)
    • Quickstart
    • Overview
    • Chat Completions
    • Streaming
    • Tool / Function Calling
    • Prefix Caching
    • Supported Parameters
    • Migrate from OpenAI
    • Best Practices
  • Cookbooks
    • Speech to Text
    • Text to Speech
    • Voice Agent (Electron + Pulse + Lightning)
  • Voice Cloning
    • Instant Clone (UI)
    • Instant Clone (API)
    • Instant Clone (Python SDK)
    • Delete Cloned Voice
  • Best Practices
    • Voice Cloning Best Practices
    • TTS Best Practices
  • Troubleshooting
    • Error reference
LogoLogo
Voice AgentsModels
Voice AgentsModels
On this page
  • Events per turn
  • Barge-in
  • Dropping scheduled audio on barge-in
  • Programmatic cancellation
  • Common gotchas
  • Next
Speech to Speech (Hydra)

Turn detection & barge-in

||View as Markdown|
Was this page helpful?
Previous

Audio I/O

Next

Tool calling

Built with

Hydra does turn detection server-side. The client streams audio continuously — even while the model is speaking — and the server emits events as it detects the start and end of each user turn.

There’s nothing you need to do for turn detection to work. The events below are how you observe it.

Events per turn

client: input_audio_buffer.append (continuous)
server: input_audio_buffer.speech_started { audio_start_ms, item_id }
server: conversation.item.added (role=user, in_progress)
server: input_audio_buffer.speech_stopped { audio_end_ms, item_id }
server: conversation.item.done (role=user, completed)
server: response.created { response: { id } }
server: conversation.item.added (role=assistant, in_progress)
server: response.output_audio.delta (N audio chunks)
server: response.output_audio.done
server: conversation.item.done (role=assistant, completed)
server: response.done (status=completed)
EventMeaning
input_audio_buffer.speech_startedVAD fired — the user started talking. audio_start_ms is the offset from session start.
input_audio_buffer.speech_stoppedThe user finished. The audio between the two events becomes a user message item.
response.createdThe model is generating a reply. Use this as your client-side “the assistant is about to talk” hook.
response.doneThe turn is over. Inspect status (completed, cancelled, incomplete, failed).

You don’t send anything between user turns. Hydra decides when a turn ends.

Barge-in

Because the channel is full-duplex, the user can speak while the model is speaking. When that happens:

  1. The server emits input_audio_buffer.speech_started for the new user turn.

  2. The in-flight response is cancelled — response.done arrives with:

    1{
    2 "type": "response.done",
    3 "response": {
    4 "id": "resp_…",
    5 "status": "cancelled",
    6 "status_details": { "reason": "interrupted" }
    7 }
    8}
  3. The new user turn proceeds normally.

You don’t have to send anything to opt into this. It’s automatic.

Dropping scheduled audio on barge-in

The hard part is client-side: when response.created arrives for a new turn, any audio chunks from the previous response that you’ve already scheduled for playback will continue to play unless you drop them. Closing the AudioContext alone leaves a tail.

The cleanest pattern in the browser is to reset your playback cursor on every fresh response.created. (playPCM16 and b64ToInt16 are defined in Audio I/O.)

1let playCursor = 0;
2
3ws.onmessage = (ev) => {
4 const evt = JSON.parse(ev.data);
5 if (evt.type === "response.created") {
6 // Fresh response — drop anything still scheduled from the previous one.
7 playCursor = playCtx.currentTime;
8 } else if (evt.type === "response.output_audio.delta") {
9 playPCM16(b64ToInt16(evt.delta));
10 }
11};

If you’ve kept references to AudioBufferSourceNodes, also call stop() on each one — playCursor only controls future schedules, it doesn’t cancel buffers already started.

In Python with sounddevice, drain your playback queue on response.created — shown here as the relevant branch of the event handler:

1async for raw in ws:
2 evt = json.loads(raw)
3 if evt["type"] == "response.created":
4 # Fresh response — drop anything still scheduled in the playback queue.
5 while not play_queue.empty():
6 try:
7 play_queue.get_nowait()
8 except asyncio.QueueEmpty:
9 break

Programmatic cancellation

To cancel an in-flight response without the user speaking — e.g. you got new info from your backend and want the model to stop — send response.cancel:

1{ "type": "response.cancel" }

The server replies with response.done carrying status: "cancelled" and status_details.reason: "client_cancelled". The frame is a no-op if no response is in flight.

Common gotchas

  • Audio keeps playing after barge-in — you didn’t reset playCursor (browser) or didn’t drain your playback queue (Python).
  • speech_started fires from background noise — Hydra’s VAD is conservative but not perfect. Discarded turns arrive as conversation.item.done with status: "incomplete" and should be ignored.
  • Sending input_audio_buffer.append only when the user is talking — don’t. Stream continuously. Hydra needs the silence to detect turn boundaries.

Next

  • Tool calling — what happens between speech_stopped and audio output when the model decides to call a function
  • Prompting voice agents — how to write instructions so the model handles turn-taking gracefully