Performance

Hydra is benchmarked against eight other production-grade voice / realtime models on the AIEWF S2S benchmark. For each metric, the comparison set is identical across runs and Hydra is the only model under test from Smallest AI.

AIEWF S2S — 10 runs × 30 turns, aiwf_medium_context

The same fixed prompt + transcript distribution is replayed against every model, ten times each. Latency numbers are computed from transcript.jsonl (n = 224 non-tool turns, n = 64 tool turns). Pass rate is the fraction of turns that completed the expected interaction.
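As a rough illustration of how the table columns could be derived from per-turn records, here is a minimal sketch. The schema of transcript.jsonl is not documented here, so the field names `v2v_ms`, `is_tool_turn`, and `passed` are hypothetical placeholders, and the sample rows are made up for the example.

```python
import json
from statistics import mean, median

def summarize(lines):
    """Aggregate per-turn JSONL records into the four table columns.

    Field names (v2v_ms, is_tool_turn, passed) are assumptions for
    illustration, not the real transcript.jsonl schema.
    """
    turns = [json.loads(line) for line in lines if line.strip()]
    non_tool = [t["v2v_ms"] for t in turns if not t["is_tool_turn"]]
    tool = [t["v2v_ms"] for t in turns if t["is_tool_turn"]]
    passed = sum(1 for t in turns if t["passed"])
    return {
        "pass_rate_pct": round(100 * passed / len(turns), 1),
        "non_tool_median_ms": median(non_tool),
        "non_tool_max_ms": max(non_tool),
        "tool_mean_ms": mean(tool),
    }

# Made-up sample records, one JSON object per line.
sample = [
    '{"v2v_ms": 800, "is_tool_turn": false, "passed": true}',
    '{"v2v_ms": 900, "is_tool_turn": false, "passed": true}',
    '{"v2v_ms": 1900, "is_tool_turn": false, "passed": false}',
    '{"v2v_ms": 1600, "is_tool_turn": true, "passed": true}',
]
summary = summarize(sample)
print(summary)
```

Note that the median and max are taken over non-tool turns only, while the mean is taken over tool turns only, mirroring the column headers.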

| Model | Pass rate | Non-tool V2V median | Non-tool V2V max | Tool V2V mean |
| --- | --- | --- | --- | --- |
| ultravox-v0.7 | 97.7 % | 864 ms | 1888 ms | 2406 ms |
| gpt-realtime-2 (low) | 96.0 % | 1728 ms | 4032 ms | 2005 ms |
| Hydra | 95.9 % | 864 ms | 1984 ms | 1624 ms |
| grok-voice-think-fast-1.0 | 95.3 % | 2336 ms | 4800 ms | 2753 ms |
| gpt-realtime-1.5 | 93.3 % | 1152 ms | 2304 ms | 2251 ms |
| gemini-3.1-flash-live-preview | 91.7 % | 1632 ms | 5664 ms | 3172 ms |
| gpt-realtime | 86.7 % | 1536 ms | 4672 ms | 2199 ms |
| gemini-live | 86.0 % | 2624 ms | 30000 ms | 4082 ms |
| nova-2-sonic | not reported | 1280 ms | 3232 ms | 1689 ms |

Where Hydra leads

Tool V2V mean

1624 ms, the fastest of the 9 models. Hydra beats nova-2-sonic by 65 ms, gpt-realtime-2 (low) by 381 ms, and ultravox-v0.7 by 782 ms.

Non-tool V2V median

864 ms, tied for fastest with ultravox-v0.7. Hydra beats gpt-realtime-2 (low) by 864 ms and gemini-live by 1760 ms.

Pass rate

95.9 %, third of the 8 models that reported a pass rate (nova-2-sonic did not report one). Hydra is 1.8 percentage points behind the leader, ultravox-v0.7.
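The rankings and margins quoted in this section can be rechecked directly from the table. The sketch below copies the table values verbatim; only the variable and function names are introduced for the example.

```python
# (model, pass rate %, non-tool median ms, non-tool max ms, tool mean ms)
# Values copied from the benchmark table; None means not reported.
ROWS = [
    ("ultravox-v0.7", 97.7, 864, 1888, 2406),
    ("gpt-realtime-2 (low)", 96.0, 1728, 4032, 2005),
    ("Hydra", 95.9, 864, 1984, 1624),
    ("grok-voice-think-fast-1.0", 95.3, 2336, 4800, 2753),
    ("gpt-realtime-1.5", 93.3, 1152, 2304, 2251),
    ("gemini-3.1-flash-live-preview", 91.7, 1632, 5664, 3172),
    ("gpt-realtime", 86.7, 1536, 4672, 2199),
    ("gemini-live", 86.0, 2624, 30000, 4082),
    ("nova-2-sonic", None, 1280, 3232, 1689),
]

def rank(model, column, reverse=False):
    """1-based rank of `model` on `column`, skipping unreported values."""
    reported = [r for r in ROWS if r[column] is not None]
    ordered = sorted(reported, key=lambda r: r[column], reverse=reverse)
    return [r[0] for r in ordered].index(model) + 1

tool_mean = {r[0]: r[4] for r in ROWS}
pass_rate = {r[0]: r[1] for r in ROWS if r[1] is not None}

tool_rank = rank("Hydra", 4)                # lower latency is better
pass_rank = rank("Hydra", 1, reverse=True)  # higher pass rate is better
margin_vs_nova = tool_mean["nova-2-sonic"] - tool_mean["Hydra"]
gap_to_leader = round(max(pass_rate.values()) - pass_rate["Hydra"], 1)

print(tool_rank, pass_rank, margin_vs_nova, gap_to_leader)
```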

How to read this

  • Pass rate is the fraction of turns where the model produced a usable, on-topic response to the test prompt. It is the overall quality metric.
  • Non-tool V2V latency is the time from end-of-user-speech to the first audio chunk on turns where the model replies directly (no tool call). This is the lower-bound conversational latency you’d see on a phone call where the user just asks something.
  • Tool V2V latency is the same end-to-end timing on turns that route through a tool, measured from end-of-user-speech to the start of the model narrating the tool result. This is the metric that matters for voice agents that do anything real: phone bookings, account lookups, order placement.

See Metrics Overview for the exact definitions.
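To make the two latency definitions above concrete, here is a minimal sketch that computes each from per-turn timestamps. The `TurnTimestamps` type and its field names are hypothetical, and the sample timestamps are invented; the benchmark's actual instrumentation may record these events differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnTimestamps:
    """Hypothetical per-turn timestamps, in milliseconds on a shared clock."""
    user_speech_end_ms: int
    first_audio_chunk_ms: Optional[int] = None        # direct-reply turns
    tool_result_audio_start_ms: Optional[int] = None  # tool turns

def non_tool_v2v_ms(t: TurnTimestamps) -> int:
    """End of user speech to the first audio chunk of a direct reply."""
    if t.first_audio_chunk_ms is None:
        raise ValueError("turn has no direct audio reply")
    return t.first_audio_chunk_ms - t.user_speech_end_ms

def tool_v2v_ms(t: TurnTimestamps) -> int:
    """End of user speech to the start of narrating the tool result."""
    if t.tool_result_audio_start_ms is None:
        raise ValueError("turn did not narrate a tool result")
    return t.tool_result_audio_start_ms - t.user_speech_end_ms

# Invented timestamps for illustration only.
direct = TurnTimestamps(user_speech_end_ms=10_000, first_audio_chunk_ms=10_864)
tooled = TurnTimestamps(user_speech_end_ms=20_000, tool_result_audio_start_ms=21_624)
print(non_tool_v2v_ms(direct), tool_v2v_ms(tooled))
```

Both measurements start at the same point (end of user speech); they differ only in which audio event stops the clock.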

Why Hydra wins tool V2V

The model is purpose-built for realtime voice agents that call tools. Three architectural choices show up in the numbers:

  1. Streaming tool-call arguments. Hydra emits response.function_call_arguments.delta as soon as the arguments start materialising — the client can begin executing tools before the arguments JSON has finished streaming. Most other models wait until arguments are fully formed before emitting them.
  2. Server-side VAD tuned for voice agents. Hydra detects end-of-turn without waiting for a fixed silence window; it adapts to the speaker’s cadence. Models with conservative VAD wait longer on short utterances.
  3. No intermediate text serialisation. Audio in → model → audio out, all on a single socket. No STT → text → LLM → text → TTS hop.
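The first point can be sketched from the client side as follows. Only the `response.function_call_arguments.delta` event type comes from the text above; the `call_id` and `delta` fields and the matching `.done` event are assumptions modeled on common realtime API conventions, so check Hydra's event reference before relying on them. The sketch buffers argument fragments per call and records the moment a call first appears, which is where a client could start warming up the tool (opening a connection, for example) before the JSON is complete.

```python
import json
from collections import defaultdict

DELTA = "response.function_call_arguments.delta"
DONE = "response.function_call_arguments.done"  # assumed event name

class ToolCallAssembler:
    """Buffers streamed argument fragments, keyed by call id (assumed field)."""

    def __init__(self):
        self._buffers = defaultdict(str)
        self.started = []  # call ids in the order their first delta arrived

    def handle(self, event):
        """Return (call_id, parsed_args) when a call completes, else None."""
        etype = event.get("type")
        if etype == DELTA:
            call_id = event["call_id"]
            if call_id not in self._buffers:
                # First fragment: a client could begin tool warm-up here.
                self.started.append(call_id)
            self._buffers[call_id] += event["delta"]
            return None
        if etype == DONE:
            call_id = event["call_id"]
            raw = self._buffers.pop(call_id, "")
            return call_id, json.loads(raw)
        return None

# Simulated event stream; a real client would read these off the socket.
events = [
    {"type": DELTA, "call_id": "call_1", "delta": '{"city": "Par'},
    {"type": DELTA, "call_id": "call_1", "delta": 'is"}'},
    {"type": DONE, "call_id": "call_1"},
]
assembler = ToolCallAssembler()
completed = [r for e in events if (r := assembler.handle(e)) is not None]
print(completed)
```

Parsing is deferred until the call completes because a partial JSON fragment is not valid on its own; the early hook is useful for work that does not depend on the full arguments.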
