Turn detection & barge-in
Hydra does turn detection server-side. The client streams audio continuously — even while the model is speaking — and the server emits events as it detects the start and end of each user turn.
There’s nothing you need to do for turn detection to work. The events below are how you observe it.
You don’t send any control message between user turns. Keep streaming audio, and Hydra decides when a turn ends.
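As a rough illustration of continuous streaming, the sketch below wraps every captured chunk, silence included, in an input_audio_buffer.append frame. The JSON envelope with a "type" key, the "audio" field name, and the base64 encoding are assumptions made for this example, as is the helper name append_frames. Check the Audio I/O page for the actual frame shape.

```python
import base64
import json
from typing import Iterable, Iterator


def append_frames(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield one append frame per captured PCM chunk (hypothetical helper)."""
    for chunk in chunks:
        # Deliberately no voice-activity check: silent chunks are sent too,
        # because the server needs the silence to find turn boundaries.
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # "audio" as a base64 string is an assumed field name and encoding.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

The point of the design is what is absent: the sender never inspects the audio, so there is no client-side gate that could hide silence from the server.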
Because the channel is full-duplex, the user can speak while the model is speaking. When that happens:
The server emits input_audio_buffer.speech_started for the new user turn.
The in-flight response is cancelled, and response.done arrives for it with a cancelled status.
The new user turn proceeds normally.
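The sequence above can be observed on the client with a small state tracker. This is a minimal sketch, assuming events arrive as decoded JSON objects with a "type" key. TurnTracker and its fields are hypothetical names, not part of Hydra.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTracker:
    """Tracks whether a response is in flight and counts barge-ins."""
    response_in_flight: bool = False
    barge_ins: int = 0

    def handle(self, event: dict) -> Optional[str]:
        kind = event.get("type")  # assumed: every event carries a "type" key
        if kind == "response.created":
            self.response_in_flight = True
        elif kind == "response.done":
            self.response_in_flight = False
        elif kind == "input_audio_buffer.speech_started" and self.response_in_flight:
            # The user started speaking while the model was still responding.
            self.barge_ins += 1
            return "barge_in"
        return None
```

A caller could use the "barge_in" return value as the cue to stop local playback early, rather than waiting for the next response.created.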
You don’t have to send anything to opt into this. It’s automatic.
The hard part is client-side: when response.created arrives for a new turn, any audio chunks from the previous response that you’ve already scheduled for playback will continue to play unless you drop them. Closing the AudioContext alone leaves a tail.
The cleanest pattern in the browser is to reset your playback cursor on every fresh response.created. (playPCM16 and b64ToInt16 are defined in Audio I/O.)
If you’ve kept references to AudioBufferSourceNodes, also call stop() on each one — playCursor only controls future schedules, it doesn’t cancel buffers already started.
In Python with sounddevice, drain your playback queue on response.created — shown here as the relevant branch of the event handler:
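A minimal sketch of that branch, assuming decoded audio chunks wait in a standard queue.Queue that your sounddevice output callback reads from. The names playback_queue, drain, and on_event are hypothetical, and the "type" key on events is an assumption.

```python
import queue

# Chunks of decoded PCM waiting to be played (assumed application state).
playback_queue: "queue.Queue[bytes]" = queue.Queue()


def drain(q: "queue.Queue[bytes]") -> int:
    """Discard everything currently queued; return how many chunks were dropped."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            return dropped
        dropped += 1


def on_event(event: dict, q: "queue.Queue[bytes]" = playback_queue) -> int:
    """Relevant branch of an event handler: drop stale audio on a new response."""
    if event.get("type") == "response.created":
        return drain(q)
    return 0
```

Draining only removes chunks that have not yet been handed to the audio device. Anything already in the device's output buffer will still play, so a small buffer keeps the leftover tail short.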
To cancel an in-flight response without the user speaking — e.g. you got new info from your backend and want the model to stop — send response.cancel:
The server replies with response.done carrying status: "cancelled" and status_details.reason: "client_cancelled". The frame is a no-op if no response is in flight.
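The exchange could look roughly like this in Python. It is a sketch under stated assumptions: frames are JSON objects with a "type" key, and the status fields sit either at the top level of response.done or under a nested "response" object (the helper checks both, since the exact nesting is not confirmed here). cancel_frame and is_client_cancel are hypothetical helper names.

```python
import json


def cancel_frame() -> str:
    """Serialize a response.cancel frame (assumed to need no other fields)."""
    return json.dumps({"type": "response.cancel"})


def is_client_cancel(event: dict) -> bool:
    """True if this event is the response.done that acknowledges our cancel."""
    if event.get("type") != "response.done":
        return False
    # Assumption: status may be nested under "response" or sit at the top level.
    body = event.get("response") or event
    details = body.get("status_details") or {}
    return (
        body.get("status") == "cancelled"
        and details.get("reason") == "client_cancelled"
    )
```

Because the frame is a no-op when nothing is in flight, the sender does not need to track response state before sending it.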
Old audio keeps playing after a barge-in: you didn’t reset playCursor (browser) or didn’t drain your playback queue (Python).
speech_started fires from background noise: Hydra’s VAD is conservative but not perfect. Discarded turns arrive as conversation.item.done with status: "incomplete" and should be ignored.
Sending input_audio_buffer.append only when the user is talking: don’t. Stream continuously. Hydra needs the silence to detect turn boundaries.
Related topics: what happens between speech_stopped and audio output when the model decides to call a function, and writing instructions so the model handles turn-taking gracefully.