Metrics Overview | Smallest AI Docs

This page defines what each metric on the Performance page actually measures.

Pass rate


What it measures	Fraction of turns where the model produced a usable, on-topic response to the test prompt
Range	0 % – 100 % (higher is better)
Computed from	The full transcript of all turns across all runs
Sample size in current benchmark	30 turns × 10 runs = 300 turns per model

A turn is counted as “passing” if the model’s reply is (a) semantically aligned with the user’s prompt, (b) factually correct given the available context, and (c) delivered as audio without the response being cancelled, dropped, or replaced with an error. Hallucinated content, off-topic replies, silent failures, and turns that error out all count as misses.

Pass rate is the integrative quality metric — it rolls turn-taking accuracy, instruction following, tool-call correctness, and audio synthesis reliability into a single number.
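As a rough illustration of the definition above, the sketch below scores a turn as passing only when all three criteria hold, then reports the passing share as a percentage. The `TurnResult` type and its field names are hypothetical, introduced only for this example; the benchmark's actual grading pipeline is not described on this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TurnResult:
    # Hypothetical per-turn judgements mirroring criteria (a), (b) and (c).
    on_topic: bool           # (a) semantically aligned with the prompt
    factually_correct: bool  # (b) correct given the available context
    audio_delivered: bool    # (c) audio arrived, not cancelled/dropped/errored


def turn_passes(turn: TurnResult) -> bool:
    """A turn passes only if every criterion holds."""
    return turn.on_topic and turn.factually_correct and turn.audio_delivered


def pass_rate(turns: Sequence[TurnResult]) -> float:
    """Percentage of passing turns, in the range 0-100."""
    if not turns:
        raise ValueError("pass rate is undefined for zero turns")
    passed = sum(1 for t in turns if turn_passes(t))
    return 100.0 * passed / len(turns)


# Illustrative data only: three passing turns and one silent failure.
example = [
    TurnResult(True, True, True),
    TurnResult(True, True, True),
    TurnResult(True, True, True),
    TurnResult(True, True, False),
]
print(pass_rate(example))  # 75.0
```

Because the three criteria are combined with a logical AND, a single failure mode (for example, correct text whose audio was dropped) is enough to count the turn as a miss.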

Non-tool voice-to-voice (V2V) latency


What it measures	Time from end-of-user-speech to first audio chunk of the model’s reply, on turns that do not invoke a tool
Lower is better	Yes
Computed from	`transcript.jsonl` timestamps
Sample size in current benchmark	n = 224 non-tool turns (across all 10 runs)

The clock starts when Hydra emits `input_audio_buffer.speech_stopped` and ends when the first `response.output_audio.delta` arrives on the client. This is the lower-bound conversational latency: how fast the model starts talking when no external system is in the loop.

Reported as median (typical experience) and max (worst-case 1-in-N).
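The measurement above can be sketched as follows. The layout of `transcript.jsonl` is not documented on this page, so the record shape used here (one JSON object per line with `turn`, `type` and `ts_ms` fields) is an assumption made purely for illustration; only the two event names come from the text.

```python
import json
import statistics
from typing import Dict, Iterable, List

START = "input_audio_buffer.speech_stopped"
END = "response.output_audio.delta"
TOOL_PREFIX = "response.function_call_arguments"


def non_tool_v2v_ms(lines: Iterable[str]) -> List[float]:
    """Per-turn latency from speech_stopped to the first audio delta.

    Assumes (hypothetically) that each line is a JSON object with
    'turn', 'type' and 'ts_ms' keys. Turns that contain any
    function-call event are treated as tool turns and excluded.
    """
    starts: Dict[int, float] = {}
    latencies: Dict[int, float] = {}
    tool_turns = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        turn, kind, ts = event["turn"], event["type"], event["ts_ms"]
        if kind.startswith(TOOL_PREFIX):
            tool_turns.add(turn)
        elif kind == START:
            starts[turn] = ts
        elif kind == END and turn in starts and turn not in latencies:
            # Only the first audio chunk of the turn stops the clock.
            latencies[turn] = ts - starts[turn]
    return [v for t, v in latencies.items() if t not in tool_turns]


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Median for the typical experience, max for the worst case."""
    return {"median": statistics.median(latencies), "max": max(latencies)}
```

With a real transcript this might be called as `summarize(non_tool_v2v_ms(open("transcript.jsonl")))`, adjusting the field names to whatever the file actually contains.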

Tool voice-to-voice (V2V) latency


What it measures	Time from end-of-user-speech to start of the model narrating the tool result, on turns that invoke a tool
Lower is better	Yes
Computed from	`transcript.jsonl` timestamps
Sample size in current benchmark	n = 64 tool turns (across all 10 runs)

This is the headline metric for voice agents that do something: book a table, look up an order, query an account. The clock starts at `speech_stopped` and ends at the first `response.output_audio.delta` of the narration response (the response opened by the second `response.created` in the turn, after the tool call). Reported as mean because the distribution is skewed by tool execution time itself.
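A minimal sketch of that clock, assuming each tool turn is available as an ordered list of `(event_type, ts_ms)` pairs. That input shape is a hypothetical simplification for this example, not the documented transcript format. Audio deltas that arrive before the second `response.created` are ignored, since they would belong to the first response rather than the narration.

```python
import statistics
from typing import List, Optional, Sequence, Tuple

Event = Tuple[str, float]  # (event type, timestamp in ms); assumed shape


def tool_v2v_ms(events: Sequence[Event]) -> Optional[float]:
    """Latency from speech_stopped to the narration's first audio chunk.

    Returns None if the turn never reaches narration audio.
    """
    start: Optional[float] = None
    responses_created = 0
    for kind, ts in events:
        if kind == "input_audio_buffer.speech_stopped" and start is None:
            start = ts
        elif kind == "response.created":
            responses_created += 1
        elif (
            kind == "response.output_audio.delta"
            and responses_created >= 2
            and start is not None
        ):
            return ts - start
    return None


def mean_tool_v2v_ms(turns: Sequence[Sequence[Event]]) -> float:
    """Mean over the turns that produced a measurable latency."""
    values: List[float] = []
    for events in turns:
        latency = tool_v2v_ms(events)
        if latency is not None:
            values.append(latency)
    return statistics.mean(values)
```

Turns that never produce narration audio return `None` and are left out of the mean here; whether the benchmark treats them the same way is not stated on this page.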

Internally, this metric decomposes into:

Sub-step	What’s happening
`speech_stopped` → `response.function_call_arguments.delta`	Hydra deciding to call a tool and starting to stream args
Streaming delta → `response.function_call_arguments.done`	Arguments JSON fully streamed
`done` → `conversation.item.create`	Client executes the tool
`conversation.item.create` → `response.create`	Client requests narration
`response.create` → first `response.output_audio.delta`	Model produces narration audio

The first two sub-steps and the last one are server-controlled; the middle two are client-controlled. Hydra’s lead on this metric is driven by the server-controlled portions — particularly the streaming-args behaviour that lets the client begin tool execution before arguments finish streaming.
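The decomposition in the table can be sketched as a walk over the six boundary events in order, taking the first timestamp at which each one appears after the previous boundary. As before, the `(event_type, ts_ms)` input shape is an assumption for illustration; the event names and the server/client split come from the table and paragraph above.

```python
from typing import Dict, List, Sequence, Tuple

Event = Tuple[str, float]  # (event type, timestamp in ms); assumed shape

# The six boundaries that delimit the five sub-steps, in order.
BOUNDARY_EVENTS = [
    "input_audio_buffer.speech_stopped",
    "response.function_call_arguments.delta",
    "response.function_call_arguments.done",
    "conversation.item.create",
    "response.create",
    "response.output_audio.delta",
]

# Who controls each sub-step: first two and last are server-side.
OWNERS = ["server", "server", "client", "client", "server"]


def decompose(events: Sequence[Event]) -> Tuple[List[float], Dict[str, float]]:
    """Return the five sub-step durations and per-owner totals."""
    marks: List[float] = []
    for kind, ts in events:
        # Only the next expected boundary advances the walk, so an audio
        # delta is counted only once response.create has been seen.
        if len(marks) < len(BOUNDARY_EVENTS) and kind == BOUNDARY_EVENTS[len(marks)]:
            marks.append(ts)
    if len(marks) != len(BOUNDARY_EVENTS):
        raise ValueError("turn is missing a boundary event")
    steps = [later - earlier for earlier, later in zip(marks, marks[1:])]
    totals = {"server": 0.0, "client": 0.0}
    for owner, duration in zip(OWNERS, steps):
        totals[owner] += duration
    return steps, totals
```

The five durations sum to the overall tool V2V latency for the turn, which makes it straightforward to see how much of a given number is attributable to the server and how much to the client's own tool execution.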

What’s NOT measured

Glass-to-glass latency. Doesn’t include mic capture, network egress, or speaker playback. Add 100–300 ms on a real device.
Audio quality. The benchmark scores semantic correctness, not naturalness, prosody, or speaker similarity. Listener evaluations are a separate effort.
Cost per call. Not part of this comparison.
Concurrency under load. Each run is single-session; concurrent-session behaviour is measured separately.

Why this benchmark

AIEWF S2S is run head-to-head and single-session, with fixed prompts. Every model gets the same inputs under the same conditions, so differences in the numbers reflect differences in the models, not in the test setup.

The methodology was designed to surface differences in tool-calling latency (which is where realtime voice agents tend to fall apart) without conflating them with prompt-engineering or harness differences.

Performance

The head-to-head comparison table and analysis.

Tool calling

How Hydra’s streaming function-call arguments work — the mechanism behind the V2V tool latency win.
