Performance
Hydra is benchmarked against eight other production-grade voice / realtime models on the AIEWF S2S benchmark. For each metric, the comparison set is identical across runs and Hydra is the only model under test from Smallest AI.
AIEWF S2S — 10 runs × 30 turns, aiwf_medium_context
The same fixed prompt + transcript distribution is replayed against every model, ten times each. Latency numbers are computed from transcript.jsonl (n = 224 non-tool turns, n = 64 tool turns). Pass rate is the fraction of turns that completed the expected interaction.
Where Hydra leads
Tool V2V latency: 1624 ms, fastest of 9. Beats nova-2-sonic by 65 ms, gpt-realtime-2 low by 381 ms, and ultravox by 782 ms.
Non-tool V2V latency: 864 ms, tied-fastest with ultravox. Beats gpt-realtime-2 low by 864 ms and gemini-live by 1760 ms.
Pass rate: 95.9 %, #3 of 8 (nova-2-sonic did not report pass rate). Within ~2 pp of the leader.
How to read this
- Pass rate is the fraction of turns where the model produced a usable, on-topic response to the test prompt. It is the overall quality metric.
- Non-tool V2V latency is end-of-user-speech to first audio chunk on turns where the model replies directly (no tool call). This is the lower-bound conversational latency you’d see on a phone call where the user just asks something.
- Tool V2V latency is the same end-to-end timing but on turns that route through a tool — measured from end-of-user-speech to the start of the model narrating the tool result. This is the metric that matters for voice agents that do anything real: phone bookings, account lookups, order placement.
See Metrics Overview for the exact definitions.
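As a rough illustration of how the three metrics above could be aggregated from per-turn records, here is a minimal Python sketch. The field names (`used_tool`, `v2v_ms`, `passed`) are hypothetical and not taken from the actual transcript.jsonl schema, and the use of the median is an assumption, since the summary statistic behind the reported latencies is not stated here.

```python
import json
import statistics


def summarize(lines):
    """Aggregate pass rate and V2V latency from JSONL transcript lines.

    Assumes one JSON object per line with hypothetical fields:
    used_tool (bool), v2v_ms (number), passed (bool).
    """
    turns = [json.loads(line) for line in lines if line.strip()]
    non_tool = [t["v2v_ms"] for t in turns if not t["used_tool"]]
    tool = [t["v2v_ms"] for t in turns if t["used_tool"]]
    passed = sum(1 for t in turns if t["passed"])
    return {
        "pass_rate": passed / len(turns),
        # Median is an assumption; the benchmark may report another statistic.
        "non_tool_v2v_ms": statistics.median(non_tool),
        "tool_v2v_ms": statistics.median(tool),
        "n_non_tool": len(non_tool),
        "n_tool": len(tool),
    }


# Tiny made-up sample, for illustration only (not benchmark data).
sample = [
    '{"used_tool": false, "v2v_ms": 800, "passed": true}',
    '{"used_tool": false, "v2v_ms": 900, "passed": true}',
    '{"used_tool": true, "v2v_ms": 1600, "passed": false}',
    '{"used_tool": true, "v2v_ms": 1700, "passed": true}',
]
summary = summarize(sample)
```

Splitting turns by whether a tool was used before aggregating keeps the two latency figures from mixing, which matches how the non-tool and tool sample sizes are reported separately.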
Why Hydra wins tool V2V
The model is purpose-built for realtime voice agents that call tools. Three architectural choices show up in the numbers:
- Streaming tool-call arguments. Hydra emits response.function_call_arguments.delta as soon as the arguments start materialising, so the client can begin executing tools before the arguments JSON has finished streaming. Most other models wait until arguments are fully formed before emitting them.
- Server-side VAD tuned for voice agents. Hydra detects end-of-turn without waiting for a fixed silence window; it adapts to the speaker’s cadence. Models with conservative VAD wait longer on short utterances.
- No intermediate text serialisation. Audio in → model → audio out, all on a single socket. No STT → text → LLM → text → TTS hop.
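A client might consume the streamed tool-call arguments from the list above roughly as follows. This is a sketch under stated assumptions, not Hydra's documented client API: only the `response.function_call_arguments.delta` event name comes from the text above. The `.done` event name and the `call_id` and `delta` fields are assumptions, and `ToolCallAssembler` and `on_start` are hypothetical names. The sketch fires a callback on the first fragment, where a client could start warm-up work such as opening a connection, and parses the JSON only once the call is complete.

```python
import json

DELTA = "response.function_call_arguments.delta"  # event name from the text above
DONE = "response.function_call_arguments.done"    # assumed completion event name


class ToolCallAssembler:
    """Accumulates streamed argument fragments per tool call (hypothetical helper)."""

    def __init__(self, on_start=None):
        self._buffers = {}
        self._on_start = on_start  # called once per call_id, on its first fragment

    def handle(self, event):
        """Feed one event; return (call_id, args) when a call completes, else None."""
        etype = event.get("type")
        if etype == DELTA:
            call_id = event["call_id"]  # assumed field name
            if call_id not in self._buffers:
                self._buffers[call_id] = ""
                if self._on_start is not None:
                    self._on_start(call_id)
            self._buffers[call_id] += event["delta"]  # assumed field name
            return None
        if etype == DONE:
            call_id = event["call_id"]
            raw = self._buffers.pop(call_id, "")
            # Treat a call with no fragments as having empty arguments.
            return call_id, json.loads(raw or "{}")
        return None


# Illustrative event sequence; the payloads are made up.
started = []
assembler = ToolCallAssembler(on_start=started.append)
events = [
    {"type": DELTA, "call_id": "call_1", "delta": '{"city": "Par'},
    {"type": DELTA, "call_id": "call_1", "delta": 'is"}'},
    {"type": DONE, "call_id": "call_1"},
]
results = [r for r in map(assembler.handle, events) if r is not None]
```

Buffering per `call_id` keeps interleaved tool calls separate. A client that wants to act on arguments before completion would need an incremental JSON parser, which this sketch deliberately leaves out.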

