Metrics Overview

This page defines what each metric on the Performance page actually measures.

Pass rate

What it measures: Fraction of turns where the model produced a usable, on-topic response to the test prompt
Range: 0%–100% (higher is better)
Computed from: The full transcript of all turns across all runs
Sample size in current benchmark: 30 turns × 10 runs = 300 turns per model

A turn is counted as “passing” if the model’s reply is (a) semantically aligned with the user’s prompt, (b) factually correct given the available context, and (c) delivered as audio without the response being cancelled, dropped, or replaced with an error. Hallucinated content, off-topic replies, silent failures, and turns that error out all count as misses.

Pass rate is the integrative quality metric — it rolls turn-taking accuracy, instruction following, tool-call correctness, and audio synthesis reliability into a single number.
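The pass criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code: the `TurnResult` record and its field names are hypothetical, and how each criterion is actually judged is not specified here.

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    # Hypothetical per-turn judgement record; field names are illustrative only.
    on_topic: bool            # (a) semantically aligned with the user's prompt
    factually_correct: bool   # (b) correct given the available context
    audio_delivered: bool     # (c) audio arrived, not cancelled, dropped, or errored


def pass_rate(turns: list[TurnResult]) -> float:
    """Return the percentage of turns that satisfy all three criteria."""
    if not turns:
        raise ValueError("no turns to score")
    passed = sum(
        1 for t in turns
        if t.on_topic and t.factually_correct and t.audio_delivered
    )
    return 100.0 * passed / len(turns)


turns = [
    TurnResult(True, True, True),
    TurnResult(True, False, True),   # hallucinated content counts as a miss
    TurnResult(True, True, False),   # dropped audio counts as a miss
    TurnResult(True, True, True),
]
print(pass_rate(turns))  # 50.0
```

A turn passes only when all three conditions hold, which is why a single failure mode (wrong content or missing audio) is enough to count as a miss.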

Non-tool voice-to-voice (V2V) latency

What it measures: Time from end-of-user-speech to first audio chunk of the model’s reply, on turns that do not invoke a tool
Lower is better: Yes
Computed from: transcript.jsonl timestamps
Sample size in current benchmark: n = 224 non-tool turns (across all 10 runs)

The clock starts when Hydra emits input_audio_buffer.speech_stopped and ends when the first response.output_audio.delta arrives on the client. This is the lower-bound conversational latency: how fast the model starts talking when no external system is in the loop.

Reported as median (typical experience) and max (worst-case 1-in-N).
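As a rough sketch of how these two numbers could be derived from a transcript, the Python below pairs each turn's `input_audio_buffer.speech_stopped` with the first `response.output_audio.delta` that follows it. The schema of transcript.jsonl is not documented on this page, so the `turn`, `type`, and `t_ms` fields are assumptions made for illustration, as is the rule of skipping any turn that contains a function-call event.

```python
import json
import statistics
from collections import defaultdict

SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
AUDIO_DELTA = "response.output_audio.delta"
TOOL_PREFIX = "response.function_call_arguments"


def non_tool_latencies_ms(jsonl_lines):
    """Per-turn latency (ms) for turns without a tool call.

    Assumes each line is a JSON object with hypothetical keys
    "turn", "type", and "t_ms" (a millisecond timestamp).
    """
    events_by_turn = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():
            event = json.loads(line)
            events_by_turn[event["turn"]].append(event)

    latencies = []
    for turn in sorted(events_by_turn):
        events = sorted(events_by_turn[turn], key=lambda e: e["t_ms"])
        if any(e["type"].startswith(TOOL_PREFIX) for e in events):
            continue  # tool turns are measured by a separate metric
        start = next((e["t_ms"] for e in events if e["type"] == SPEECH_STOPPED), None)
        if start is None:
            continue
        end = next(
            (e["t_ms"] for e in events
             if e["type"] == AUDIO_DELTA and e["t_ms"] >= start),
            None,
        )
        if end is not None:
            latencies.append(end - start)
    return latencies


def _ev(turn, type_, t_ms):
    # Helper that builds one synthetic JSONL line for the example.
    return json.dumps({"turn": turn, "type": type_, "t_ms": t_ms})


sample = [
    _ev(1, SPEECH_STOPPED, 1000), _ev(1, AUDIO_DELTA, 1420), _ev(1, AUDIO_DELTA, 1500),
    _ev(2, SPEECH_STOPPED, 5000), _ev(2, AUDIO_DELTA, 5610),
    _ev(3, SPEECH_STOPPED, 9000), _ev(3, TOOL_PREFIX + ".delta", 9300),
    _ev(3, AUDIO_DELTA, 10500),
    _ev(4, SPEECH_STOPPED, 12000), _ev(4, AUDIO_DELTA, 12380),
]
latencies = non_tool_latencies_ms(sample)
print(statistics.median(latencies), max(latencies))  # 420 610
```

The sample data is synthetic. Turn 3 is excluded because it streams function-call arguments, leaving three latencies whose median and max are printed.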

Tool voice-to-voice (V2V) latency

What it measures: Time from end-of-user-speech to start of the model narrating the tool result, on turns that invoke a tool
Lower is better: Yes
Computed from: transcript.jsonl timestamps
Sample size in current benchmark: n = 64 tool turns (across all 10 runs)

This is the headline metric for voice agents that do something — book a table, look up an order, query an account. The clock starts at speech_stopped and ends at the first response.output_audio.delta of the narration response (the second response.created in the turn, after the tool call). Reported as mean because the distribution is skewed by tool execution time itself.
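The end-of-clock rule is the subtle part: the measurement stops at the first audio delta after the second `response.created`, not at the first audio delta in the turn. A minimal Python sketch of that rule follows. It assumes each event is a dict with hypothetical `type` and `t_ms` keys, and the pre-tool audio in turn B is an invented case used only to show why the distinction matters.

```python
import statistics

SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
RESPONSE_CREATED = "response.created"
AUDIO_DELTA = "response.output_audio.delta"


def tool_v2v_latency_ms(events):
    """Latency (ms) from speech_stopped to the narration's first audio delta.

    `events` is one turn's events; the "type" and "t_ms" keys are assumed.
    Returns None if the turn has no narration response.
    """
    events = sorted(events, key=lambda e: e["t_ms"])
    start = next((e["t_ms"] for e in events if e["type"] == SPEECH_STOPPED), None)
    created = [e["t_ms"] for e in events if e["type"] == RESPONSE_CREATED]
    if start is None or len(created) < 2:
        return None
    narration_start = created[1]  # second response.created in the turn
    end = next(
        (e["t_ms"] for e in events
         if e["type"] == AUDIO_DELTA and e["t_ms"] >= narration_start),
        None,
    )
    return None if end is None else end - start


turn_a = [
    {"type": SPEECH_STOPPED, "t_ms": 0},
    {"type": RESPONSE_CREATED, "t_ms": 100},
    {"type": "response.function_call_arguments.done", "t_ms": 500},
    {"type": RESPONSE_CREATED, "t_ms": 950},
    {"type": AUDIO_DELTA, "t_ms": 1300},
]
turn_b = [
    {"type": SPEECH_STOPPED, "t_ms": 0},
    {"type": RESPONSE_CREATED, "t_ms": 80},
    {"type": AUDIO_DELTA, "t_ms": 350},   # hypothetical pre-tool audio; ignored
    {"type": "response.function_call_arguments.done", "t_ms": 600},
    {"type": RESPONSE_CREATED, "t_ms": 1500},
    {"type": AUDIO_DELTA, "t_ms": 1900},
]
per_turn = [tool_v2v_latency_ms(turn_a), tool_v2v_latency_ms(turn_b)]
print(per_turn, statistics.mean(per_turn))  # [1300, 1900] 1600
```

The mean is taken over per-turn values, matching how the metric is reported.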

Internally, this metric decomposes into:

Sub-step: What’s happening
speech_stopped → response.function_call_arguments.delta: Hydra deciding to call a tool and starting to stream args
Streaming delta → response.function_call_arguments.done: Arguments JSON fully streamed
done → conversation.item.create: Client executes the tool
conversation.item.create → response.create: Client requests narration
response.create → first response.output_audio.delta: Model produces narration audio

The first two sub-steps and the last one are server-controlled; the third and fourth are client-controlled. Hydra’s lead on this metric is driven by the server-controlled portions, particularly the streaming-args behaviour that lets the client begin tool execution before arguments finish streaming.
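To make the split concrete, the sketch below breaks one tool turn into the five sub-steps and sums the time attributed to each side. It is illustrative only: it assumes the log records the client-sent `conversation.item.create` and `response.create` events alongside server events, that each event carries hypothetical `type` and `t_ms` keys, and the timestamps are invented.

```python
# Ordered boundaries of the five sub-steps, taken from the breakdown above.
BOUNDARIES = [
    "input_audio_buffer.speech_stopped",
    "response.function_call_arguments.delta",
    "response.function_call_arguments.done",
    "conversation.item.create",
    "response.create",
    "response.output_audio.delta",
]
# Which side controls each sub-step, in order.
OWNERS = ["server", "server", "client", "client", "server"]


def decompose(events):
    """Return (per-step durations in ms, totals by owner) for one tool turn."""
    events = sorted(events, key=lambda e: e["t_ms"])
    marks = []
    cursor = float("-inf")
    for name in BOUNDARIES:
        # First occurrence of each boundary at or after the previous one.
        t = next(
            (e["t_ms"] for e in events
             if e["type"] == name and e["t_ms"] >= cursor),
            None,
        )
        if t is None:
            raise ValueError(f"missing event: {name}")
        marks.append(t)
        cursor = t
    steps = [later - earlier for earlier, later in zip(marks, marks[1:])]
    totals = {"server": 0, "client": 0}
    for owner, duration in zip(OWNERS, steps):
        totals[owner] += duration
    return steps, totals


turn = [
    {"type": "input_audio_buffer.speech_stopped", "t_ms": 0},
    {"type": "response.function_call_arguments.delta", "t_ms": 300},
    {"type": "response.function_call_arguments.done", "t_ms": 500},
    {"type": "conversation.item.create", "t_ms": 900},
    {"type": "response.create", "t_ms": 910},
    {"type": "response.output_audio.delta", "t_ms": 1300},
]
steps, totals = decompose(turn)
print(steps)   # [300, 200, 400, 10, 390]
print(totals)  # {'server': 890, 'client': 410}
```

The five durations sum to the turn's full tool V2V latency (1300 ms in this invented example), so the server and client totals show how much of the headline number each side accounts for.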

What’s NOT measured

  • Glass-to-glass latency. Doesn’t include mic capture, network egress, or speaker playback. Add 100–300 ms on a real device.
  • Audio quality. The benchmark scores semantic correctness, not naturalness, prosody, or speaker similarity. Listener evaluations are a separate effort.
  • Cost per call. Not part of this comparison.
  • Concurrency under load. Each run is single-session; concurrent-session behaviour is measured separately.

Why this benchmark

AIEWF S2S is run head-to-head and single-session, with fixed prompts. Every model gets the same inputs under the same conditions; differences in the numbers are differences in the models, not differences in the test setup.

The methodology was designed to surface differences in tool-calling latency (which is where realtime voice agents tend to fall apart) without conflating them with prompt-engineering or harness differences.
