This page defines what each metric on the Performance page actually measures.
A turn is counted as “passing” if the model’s reply is (a) semantically aligned with the user’s prompt, (b) factually correct given the available context, and (c) delivered as audio without the response being cancelled, dropped, or replaced with an error. Hallucinated content, off-topic replies, silent failures, and turns that error out all count as misses.
Pass rate is the integrative quality metric — it rolls turn-taking accuracy, instruction following, tool-call correctness, and audio synthesis reliability into a single number.
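The pass criteria above can be sketched as a simple conjunction over per-turn judgements. This is a minimal illustration, not the benchmark's actual scoring code: the `TurnResult` record and its field names are hypothetical, and how each judgement is produced (human or automated grading) is outside the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnResult:
    # Hypothetical per-turn judgement fields; the names are illustrative only.
    on_topic: bool           # (a) semantically aligned with the user's prompt
    factually_correct: bool  # (b) correct given the available context
    audio_delivered: bool    # (c) audio arrived; not cancelled, dropped, or errored


def turn_passes(turn: TurnResult) -> bool:
    """A turn passes only if all three criteria hold."""
    return turn.on_topic and turn.factually_correct and turn.audio_delivered


def pass_rate(turns: List[TurnResult]) -> float:
    """Fraction of turns that pass; any failed criterion counts as a miss."""
    if not turns:
        raise ValueError("pass rate is undefined for an empty run")
    return sum(turn_passes(t) for t in turns) / len(turns)
```

Because the criteria are combined with a logical AND, a reply that is accurate but never reaches the client as audio is scored the same as a hallucinated one.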
The clock starts when Hydra emits input_audio_buffer.speech_stopped and ends when the first response.output_audio.delta arrives on the client. This is the lower-bound conversational latency: how fast the model starts talking when no external system is in the loop.
It is reported as the median (typical experience) and the max (worst-case 1-in-N).
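As a rough sketch of the measurement described above, the latency for one turn can be derived from a timestamped event log. The event names come from the text; the `(event_type, timestamp_ms)` tuple format and the use of client-side receive times for both events are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

# Event type names as given in the text above.
SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
AUDIO_DELTA = "response.output_audio.delta"

# Assumed log format: (event_type, timestamp_ms) in arrival order for one turn.
Event = Tuple[str, float]


def turn_latency_ms(events: List[Event]) -> Optional[float]:
    """Time from speech_stopped to the first audio delta, or None if either is missing."""
    start: Optional[float] = None
    for event_type, t_ms in events:
        if event_type == SPEECH_STOPPED:
            start = t_ms
        elif event_type == AUDIO_DELTA and start is not None:
            return t_ms - start
    return None


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Median for the typical experience, max for the worst observed turn."""
    return {"median": median(latencies_ms), "max": max(latencies_ms)}
```

Turns for which `turn_latency_ms` returns `None` never produced audio; under the pass criteria they are misses and would be excluded from the latency summary in this sketch.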
This is the headline metric for voice agents that do something: book a table, look up an order, query an account. The clock starts at input_audio_buffer.speech_stopped and ends at the first response.output_audio.delta of the narration response (the second response.created in the turn, after the tool call). It is reported as a mean because the distribution is skewed by tool execution time itself.
Internally, this metric decomposes into:
The first two sub-steps and the last one are server-controlled; the middle two are client-controlled. Hydra’s lead on this metric is driven by the server-controlled portions — particularly the streaming-args behaviour that lets the client begin tool execution before arguments finish streaming.
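The tool-call latency defined above differs from the plain latency only in where the clock stops: at the first audio delta that follows the second response.created in the turn. A minimal sketch, assuming the same hypothetical `(event_type, timestamp_ms)` log format and client-side timestamps:

```python
from statistics import mean
from typing import List, Optional, Tuple

# Event type names as given in the text above.
SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
RESPONSE_CREATED = "response.created"
AUDIO_DELTA = "response.output_audio.delta"

# Assumed log format: (event_type, timestamp_ms) in arrival order for one turn.
Event = Tuple[str, float]


def tool_call_latency_ms(events: List[Event]) -> Optional[float]:
    """Time from speech_stopped to the first audio delta of the narration response."""
    start: Optional[float] = None
    responses_created = 0
    for event_type, t_ms in events:
        if event_type == SPEECH_STOPPED and start is None:
            start = t_ms
        elif event_type == RESPONSE_CREATED and start is not None:
            responses_created += 1
        elif (
            event_type == AUDIO_DELTA
            and start is not None
            and responses_created >= 2
        ):
            # Audio from the first response (before the tool call) is ignored.
            return t_ms - start
    return None


def mean_tool_call_latency_ms(turns: List[List[Event]]) -> float:
    """Mean over turns that reached narration audio."""
    latencies = [lat for lat in map(tool_call_latency_ms, turns) if lat is not None]
    if not latencies:
        raise ValueError("no turn produced narration audio")
    return mean(latencies)
```

The measured interval includes the client-controlled tool execution, so comparing models fairly requires the same tool implementation on every run.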
AIEWF S2S is run head-to-head, in a single session, with fixed prompts. Every model gets the same inputs under the same conditions; differences in the numbers are differences in the models, not differences in the test setup.
The methodology was designed to surface differences in tool-calling latency (which is where realtime voice agents tend to fall apart) without conflating them with prompt-engineering or harness differences.