Metrics Overview

This page defines what each metric on the Performance page actually measures.

Pass rate

What it measures: Fraction of turns where the model produced a usable, on-topic response to the test prompt
Range: 0 % – 100 % (higher is better)
Computed from: The full transcript of all turns across all runs
Sample size in current benchmark: 30 turns × 10 runs = 300 turns per model

A turn is counted as “passing” if the model’s reply is (a) semantically aligned with the user’s prompt, (b) factually correct given the available context, and (c) delivered as audio without the response being cancelled, dropped, or replaced with an error. Hallucinated content, off-topic replies, silent failures, and turns that error out all count as misses.

Pass rate is the integrative quality metric — it rolls turn-taking accuracy, instruction following, tool-call correctness, and audio synthesis reliability into a single number.
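
As a rough illustration of the definition above, the sketch below computes a pass rate from per-turn judgments. The `Turn` record, its field names, and the sample data are assumptions made for this example; the benchmark's actual grading pipeline and data format are not described on this page.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    # Hypothetical per-turn judgments; the field names are illustrative only.
    on_topic: bool           # (a) semantically aligned with the user's prompt
    factually_correct: bool  # (b) correct given the available context
    audio_delivered: bool    # (c) audio not cancelled, dropped, or replaced by an error


def turn_passes(turn: Turn) -> bool:
    """A turn passes only if all three criteria hold."""
    return turn.on_topic and turn.factually_correct and turn.audio_delivered


def pass_rate(turns: list[Turn]) -> float:
    """Return the percentage of passing turns (0.0 to 100.0)."""
    if not turns:
        raise ValueError("pass rate is undefined for zero turns")
    passed = sum(turn_passes(t) for t in turns)
    return 100.0 * passed / len(turns)


# Made-up data: 300 turns (30 turns x 10 runs), 15 of which miss.
turns = [Turn(True, True, True)] * 285
turns += [Turn(True, False, True)] * 10   # e.g. hallucinated content
turns += [Turn(True, True, False)] * 5    # e.g. turn errored out
print(f"pass rate: {pass_rate(turns):.1f} %")
```

Because a turn must satisfy all three criteria at once, a single failure mode (such as dropped audio) is enough to count the turn as a miss.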

Non-tool voice-to-voice (V2V) latency

What it measures: Time from end-of-user-speech to first audio chunk of the model’s reply, on turns that do not invoke a tool
Lower is better: Yes
Computed from: transcript.jsonl timestamps
Sample size in current benchmark: n = 224 non-tool turns (across all 10 runs)

The clock starts when Hydra emits input_audio_buffer.speech_stopped and ends when the first response.output_audio.delta arrives on the client. This is the lower-bound conversational latency: how fast the model starts talking when no external system is in the loop.

Reported as median (typical experience) and max (worst-case 1-in-N).
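
The measurement described above can be sketched as follows. The layout of transcript.jsonl is not documented on this page, so the example assumes one JSON object per line with a `type` field holding the event name and a hypothetical `t_ms` field holding a millisecond timestamp. Turns in which any `response.function_call_arguments.*` event appears are treated as tool turns and skipped.

```python
import json
from statistics import median

SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
FIRST_AUDIO = "response.output_audio.delta"
TOOL_PREFIX = "response.function_call_arguments"


def non_tool_v2v_ms(lines):
    """Return speech_stopped -> first audio delta latencies for non-tool turns."""
    latencies = []
    stopped_at = None  # timestamp of the pending speech_stopped, if any
    for line in lines:
        event = json.loads(line)
        kind = event["type"]
        if kind == SPEECH_STOPPED:
            stopped_at = event["t_ms"]       # clock starts
        elif kind.startswith(TOOL_PREFIX):
            stopped_at = None                # tool turn: exclude from this metric
        elif kind == FIRST_AUDIO and stopped_at is not None:
            latencies.append(event["t_ms"] - stopped_at)
            stopped_at = None                # only the first delta stops the clock
    return latencies


# Made-up transcript lines using the assumed schema.
sample = [
    '{"type": "input_audio_buffer.speech_stopped", "t_ms": 1000}',
    '{"type": "response.output_audio.delta", "t_ms": 1420}',
    '{"type": "response.output_audio.delta", "t_ms": 1500}',
    '{"type": "input_audio_buffer.speech_stopped", "t_ms": 5000}',
    '{"type": "response.function_call_arguments.delta", "t_ms": 5200}',
    '{"type": "response.output_audio.delta", "t_ms": 6500}',
    '{"type": "input_audio_buffer.speech_stopped", "t_ms": 9000}',
    '{"type": "response.output_audio.delta", "t_ms": 9380}',
]
latencies = non_tool_v2v_ms(sample)
print("median ms:", median(latencies), "max ms:", max(latencies))
```

In this sample the second turn invokes a tool, so only the first and third turns contribute to the median and max.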

Tool voice-to-voice (V2V) latency

What it measures: Time from end-of-user-speech to start of the model narrating the tool result, on turns that invoke a tool
Lower is better: Yes
Computed from: transcript.jsonl timestamps
Sample size in current benchmark: n = 64 tool turns (across all 10 runs)

This is the headline metric for voice agents that take actions, such as booking a table, looking up an order, or querying an account. The clock starts at input_audio_buffer.speech_stopped and ends at the first response.output_audio.delta of the narration response (the response opened by the second response.created in the turn, after the tool call). Reported as mean because the distribution is skewed by tool execution time itself.

Internally, this metric decomposes into:

Sub-step | What’s happening
speech_stopped → response.function_call_arguments.delta | Hydra deciding to call a tool and starting to stream args
Streaming delta → response.function_call_arguments.done | Arguments JSON fully streamed
done → conversation.item.create | Client executes the tool
conversation.item.create → response.create | Client requests narration
response.create → first response.output_audio.delta | Model produces narration audio

The first two sub-steps and the last one are server-controlled; the third and fourth are client-controlled. Hydra’s lead on this metric is driven by the server-controlled portions, particularly the streaming-args behaviour that lets the client begin tool execution before arguments finish streaming.
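
The decomposition above can be sketched for a single tool turn as follows. The input is assumed to be a mapping from event name to a millisecond timestamp, where `response.output_audio.delta` refers to the first audio delta after `response.create`. The mapping, the helper names, and the sample timings are all hypothetical and not taken from the benchmark.

```python
# (start event, end event, who controls the interval), in turn order.
SUB_STEPS = [
    ("input_audio_buffer.speech_stopped", "response.function_call_arguments.delta", "server"),
    ("response.function_call_arguments.delta", "response.function_call_arguments.done", "server"),
    ("response.function_call_arguments.done", "conversation.item.create", "client"),
    ("conversation.item.create", "response.create", "client"),
    ("response.create", "response.output_audio.delta", "server"),
]


def decompose(timestamps_ms):
    """Return (label, owner, duration_ms) for each sub-step of one tool turn."""
    rows = []
    for start, end, owner in SUB_STEPS:
        duration = timestamps_ms[end] - timestamps_ms[start]
        rows.append((f"{start} -> {end}", owner, duration))
    return rows


def totals_by_owner(rows):
    """Sum the sub-step durations separately for server and client."""
    totals = {"server": 0, "client": 0}
    for _, owner, duration in rows:
        totals[owner] += duration
    return totals


# Made-up timings for one tool turn, in ms since speech_stopped.
turn = {
    "input_audio_buffer.speech_stopped": 0,
    "response.function_call_arguments.delta": 300,
    "response.function_call_arguments.done": 450,
    "conversation.item.create": 950,
    "response.create": 960,
    "response.output_audio.delta": 1260,
}
rows = decompose(turn)
totals = totals_by_owner(rows)
print(totals, "total V2V ms:", sum(totals.values()))
```

The sub-step durations sum to the full tool V2V latency, so splitting them by owner shows how much of a given turn is attributable to the server versus the client's own tool execution.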

What’s NOT measured

  • Glass-to-glass latency. Doesn’t include mic capture, network egress, or speaker playback. Add 100–300 ms on a real device.
  • Audio quality. The benchmark scores semantic correctness, not naturalness, prosody, or speaker similarity. Listener evaluations are a separate effort.
  • Cost per call. Not part of this comparison.
  • Concurrency under load. Each run is single-session; concurrent-session behaviour is measured separately.

Why this benchmark

AIEWF S2S is run head-to-head and single-session, with fixed prompts. Every model gets the same inputs under the same conditions; differences in the numbers are differences in the models, not differences in the test setup.

The methodology was designed to surface differences in tool-calling latency (which is where realtime voice agents tend to fall apart) without conflating them with prompt-engineering or harness differences.

Next

Performance

The head-to-head comparison table and analysis.

Tool calling

How Hydra’s streaming function-call arguments work — the mechanism behind the V2V tool latency win.