Turn detection & barge-in | Smallest AI Docs

Hydra does turn detection server-side. The client streams audio continuously — even while the model is speaking — and the server emits events as it detects the start and end of each user turn.

There’s nothing you need to do for turn detection to work. The events below are how you observe it.

Events per turn

client: input_audio_buffer.append (continuous)
server: input_audio_buffer.speech_started      { audio_start_ms, item_id }
server: conversation.item.added                (role=user, in_progress)
server: input_audio_buffer.speech_stopped      { audio_end_ms, item_id }
server: conversation.item.done                 (role=user, completed)
server: response.created                       { response: { id } }
server: conversation.item.added                (role=assistant, in_progress)
server: response.output_audio.delta            (N audio chunks)
server: response.output_audio.done
server: conversation.item.done                 (role=assistant, completed)
server: response.done                          (status=completed)

Event	Meaning
`input_audio_buffer.speech_started`	VAD fired — the user started talking. `audio_start_ms` is the offset from session start.
`input_audio_buffer.speech_stopped`	The user finished. The audio between the two events becomes a user message item.
`response.created`	The model is generating a reply. Use this as your client-side “the assistant is about to talk” hook.
`response.done`	The turn is over. Inspect `status` (`completed`, `cancelled`, `incomplete`, `failed`).

You don’t send anything between user turns. Hydra decides when a turn ends.
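One way to observe these events is to fold them into a coarse per-turn state. The sketch below is illustrative only: `TurnTracker` and its state names are hypothetical, and it assumes each event is a decoded JSON object with a `type` field and, for `response.done`, a nested `response.status` field.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTracker:
    """Folds server events into a coarse conversation state (hypothetical helper)."""

    # One of: idle, user_speaking, waiting, assistant_speaking
    state: str = "idle"
    last_status: Optional[str] = None
    history: List[str] = field(default_factory=list)

    def handle(self, evt: dict) -> str:
        etype = evt.get("type")
        if etype == "input_audio_buffer.speech_started":
            self.state = "user_speaking"
        elif etype == "input_audio_buffer.speech_stopped":
            self.state = "waiting"
        elif etype == "response.created":
            self.state = "assistant_speaking"
        elif etype == "response.done":
            self.last_status = (evt.get("response") or {}).get("status")
            # If the user has already started a new turn, keep that state
            # instead of dropping back to idle.
            if self.state == "assistant_speaking":
                self.state = "idle"
        self.history.append(self.state)
        return self.state
```

Events the tracker does not recognise leave the state unchanged, so it can sit in front of an existing handler without filtering.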

Barge-in

Because the channel is full-duplex, the user can speak while the model is speaking. When that happens:

The server emits input_audio_buffer.speech_started for the new user turn.

The in-flight response is cancelled — response.done arrives with:

{
  "type": "response.done",
  "response": {
    "id": "resp_…",
    "status": "cancelled",
    "status_details": { "reason": "interrupted" }
  }
}

The new user turn proceeds normally.

You don’t have to send anything to opt into this. It’s automatic.
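Although barge-in needs no opt-in, a client may still want to know that a reply was cut short, for example to mark the assistant's last message as truncated in its own UI. A minimal sketch, assuming the `response.done` payload has exactly the shape shown above; `is_barge_in` is a hypothetical helper name:

```python
def is_barge_in(evt: dict) -> bool:
    """True when a response.done event reports a response cut off by user speech."""
    if evt.get("type") != "response.done":
        return False
    response = evt.get("response") or {}
    details = response.get("status_details") or {}
    return (
        response.get("status") == "cancelled"
        and details.get("reason") == "interrupted"
    )
```

The helper tolerates a missing `status_details` object, so a `response.done` with `status: "completed"` simply returns `False`.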

Dropping scheduled audio on barge-in

The hard part is client-side: when response.created arrives for a new turn, any audio chunks from the previous response that you’ve already scheduled for playback will continue to play unless you drop them. Closing the AudioContext alone leaves a tail.

The cleanest pattern in the browser is to reset your playback cursor on every fresh response.created. (playPCM16 and b64ToInt16 are defined in Audio I/O.)

let playCursor = 0;

ws.onmessage = (ev) => {
  const evt = JSON.parse(ev.data);
  if (evt.type === "response.created") {
    // Fresh response: reset the cursor so new audio is scheduled from now,
    // not behind whatever was queued for the previous response.
    playCursor = playCtx.currentTime;
  } else if (evt.type === "response.output_audio.delta") {
    playPCM16(b64ToInt16(evt.delta));
  }
};

If you’ve kept references to AudioBufferSourceNodes, also call stop() on each one. Resetting playCursor only affects where future chunks are scheduled; it doesn’t cancel buffers that are already scheduled or playing.

In Python with sounddevice, drain your playback queue on response.created — shown here as the relevant branch of the event handler:

async for raw in ws:
    evt = json.loads(raw)
    if evt["type"] == "response.created":
        # Fresh response: drop anything still scheduled in the playback queue.
        while not play_queue.empty():
            try:
                play_queue.get_nowait()
            except asyncio.QueueEmpty:
                break
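The drain loop can be factored into a reusable helper. Because sounddevice runs its audio callback on a separate thread, the playback queue in a real client may be a thread-safe `queue.Queue` rather than an `asyncio.Queue`, and the two raise different exceptions when empty. The sketch below handles both; `drain` is a hypothetical name, not part of any library.

```python
import asyncio
import queue


def drain(q) -> int:
    """Discard every item currently in q and return how many were dropped.

    Works with asyncio.Queue and queue.Queue, which both expose get_nowait().
    """
    dropped = 0
    while True:
        try:
            q.get_nowait()
        except (asyncio.QueueEmpty, queue.Empty):
            return dropped
        dropped += 1
```

Calling `drain(play_queue)` in the `response.created` branch replaces the inline loop, and the returned count can be logged to see how much audio each barge-in discarded.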

Programmatic cancellation

To cancel an in-flight response without the user speaking — e.g. you got new info from your backend and want the model to stop — send response.cancel:

{ "type": "response.cancel" }

The server replies with response.done carrying status: "cancelled" and status_details.reason: "client_cancelled". The frame is a no-op if no response is in flight.
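A minimal sketch of the cancel round trip. It assumes `send` is an async callable that accepts a text frame (for example the `send` method of a `websockets` connection); `cancel_response` and `cancelled_by_client` are hypothetical helper names.

```python
import asyncio
import json


async def cancel_response(send) -> None:
    """Send a response.cancel frame through an async send callable."""
    await send(json.dumps({"type": "response.cancel"}))


def cancelled_by_client(evt: dict) -> bool:
    """True when response.done confirms a client-initiated cancel."""
    if evt.get("type") != "response.done":
        return False
    response = evt.get("response") or {}
    details = response.get("status_details") or {}
    return (
        response.get("status") == "cancelled"
        and details.get("reason") == "client_cancelled"
    )
```

Since the frame is a no-op when nothing is in flight, the client does not need to track whether a response is active before calling `cancel_response`.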

Common gotchas

Audio keeps playing after barge-in — you didn’t reset playCursor (browser) or didn’t drain your playback queue (Python).
speech_started fires from background noise — Hydra’s VAD is conservative but not perfect. Discarded turns arrive as conversation.item.done with status: "incomplete" and should be ignored.
Sending input_audio_buffer.append only when the user is talking — don’t. Stream continuously. Hydra needs the silence to detect turn boundaries.
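For the background-noise case, a client can filter discarded turns before they reach its transcript or UI. This page does not show the full `conversation.item.done` payload, so the sketch below rests on an assumption: it looks for `status` on a nested `item` object first and falls back to a top-level `status` field. Check the actual payload before relying on either path; `is_discarded_turn` is a hypothetical name.

```python
def is_discarded_turn(evt: dict) -> bool:
    """True for a conversation.item.done event whose status is "incomplete"."""
    if evt.get("type") != "conversation.item.done":
        return False
    item = evt.get("item") or {}
    # Assumption: status lives on the nested item; fall back to the top level.
    status = item.get("status", evt.get("status"))
    return status == "incomplete"
```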

Next

Tool calling: what happens between speech_stopped and audio output when the model decides to call a function
Prompting voice agents: how to write instructions so the model handles turn-taking gracefully
