Performance

Hydra is benchmarked against eight other production-grade voice / realtime models on the AIEWF S2S benchmark. For each metric, the comparison set is identical across runs and Hydra is the only model under test from Smallest AI.

AIEWF S2S — 10 runs × 30 turns, aiwf_medium_context

The same fixed prompt + transcript distribution is replayed against every model, ten times each. Latency numbers are computed from transcript.jsonl (n = 224 non-tool turns, n = 64 tool turns). Pass rate is the fraction of turns that completed the expected interaction.

| Model | Pass rate | Non-tool V2V median | Non-tool V2V max | Tool V2V mean |
|---|---|---|---|---|
| ultravox-v0.7 | 97.7 % | 864 ms | 1888 ms | 2406 ms |
| gpt-realtime-2 (low) | 96.0 % | 1728 ms | 4032 ms | 2005 ms |
| Hydra | 95.9 % | 864 ms | 1984 ms | 1624 ms |
| grok-voice-think-fast-1.0 | 95.3 % | 2336 ms | 4800 ms | 2753 ms |
| gpt-realtime-1.5 | 93.3 % | 1152 ms | 2304 ms | 2251 ms |
| gemini-3.1-flash-live-preview | 91.7 % | 1632 ms | 5664 ms | 3172 ms |
| gpt-realtime | 86.7 % | 1536 ms | 4672 ms | 2199 ms |
| gemini-live | 86.0 % | 2624 ms | 30000 ms | 4082 ms |
| nova-2-sonic | not reported | 1280 ms | 3232 ms | 1689 ms |
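
As a rough illustration of how the four columns relate to per-turn data, the sketch below aggregates a handful of made-up turn records. The field names (`tool`, `v2v_ms`, `passed`) and the sample values are assumptions for illustration only; the actual schema of transcript.jsonl is not documented on this page and may differ.

```python
import json
import statistics

# Hypothetical per-turn records. The real transcript.jsonl schema may differ.
SAMPLE_JSONL = """\
{"turn": 1, "tool": false, "v2v_ms": 864, "passed": true}
{"turn": 2, "tool": false, "v2v_ms": 1024, "passed": true}
{"turn": 3, "tool": true, "v2v_ms": 1600, "passed": true}
{"turn": 4, "tool": false, "v2v_ms": 800, "passed": false}
{"turn": 5, "tool": true, "v2v_ms": 1700, "passed": true}
"""


def summarize(lines):
    """Aggregate per-turn records into the four table columns."""
    turns = [json.loads(line) for line in lines if line.strip()]
    non_tool = [t["v2v_ms"] for t in turns if not t["tool"]]
    tool = [t["v2v_ms"] for t in turns if t["tool"]]
    return {
        # Fraction of turns marked as passed, as a percentage.
        "pass_rate_pct": 100 * sum(t["passed"] for t in turns) / len(turns),
        "non_tool_median_ms": statistics.median(non_tool),
        "non_tool_max_ms": max(non_tool),
        "tool_mean_ms": statistics.mean(tool),
    }


summary = summarize(SAMPLE_JSONL.splitlines())
print(summary)
```

Note that the table mixes statistics: median and max for non-tool turns, mean for tool turns, so the columns are not directly comparable with each other.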

Where Hydra leads

Tool V2V mean

1624 ms: fastest of 9. Beats nova-2-sonic by 65 ms, gpt-realtime-2 (low) by 381 ms, and ultravox-v0.7 by 782 ms.

Non-tool V2V median

864 ms: tied for fastest with ultravox-v0.7. Beats gpt-realtime-2 (low) by 864 ms and gemini-live by 1760 ms.

Pass rate

95.9 %: third of the 8 models that reported a pass rate (nova-2-sonic did not). Within 1.8 pp of the leader, ultravox-v0.7.
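
The rankings and margins quoted above follow directly from the benchmark table. The sketch below recomputes them from the published figures; the `RESULTS` layout and the helper names `rank` and `margin` are illustrative choices, not part of any Smallest AI tooling.

```python
# Figures copied from the benchmark table:
# (pass rate %, non-tool V2V median ms, non-tool V2V max ms, tool V2V mean ms).
# None marks the pass rate that nova-2-sonic did not report.
RESULTS = {
    "ultravox-v0.7": (97.7, 864, 1888, 2406),
    "gpt-realtime-2 (low)": (96.0, 1728, 4032, 2005),
    "Hydra": (95.9, 864, 1984, 1624),
    "grok-voice-think-fast-1.0": (95.3, 2336, 4800, 2753),
    "gpt-realtime-1.5": (93.3, 1152, 2304, 2251),
    "gemini-3.1-flash-live-preview": (91.7, 1632, 5664, 3172),
    "gpt-realtime": (86.7, 1536, 4672, 2199),
    "gemini-live": (86.0, 2624, 30000, 4082),
    "nova-2-sonic": (None, 1280, 3232, 1689),
}
PASS, MEDIAN, MAX, TOOL = range(4)


def rank(model, column, higher_is_better=False):
    """Return (rank, field size); ties share the better rank."""
    values = [row[column] for row in RESULTS.values() if row[column] is not None]
    mine = RESULTS[model][column]
    better = sum((v > mine) if higher_is_better else (v < mine) for v in values)
    return 1 + better, len(values)


def margin(model, other, column):
    """How much larger `other` is than `model` in the given column."""
    return RESULTS[other][column] - RESULTS[model][column]


print("tool mean rank:", rank("Hydra", TOOL))
print("median rank:", rank("Hydra", MEDIAN))
print("pass-rate rank:", rank("Hydra", PASS, higher_is_better=True))
print("gap to pass-rate leader (pp):", round(margin("Hydra", "ultravox-v0.7", PASS), 1))
```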

How to read this

  • Pass rate is the fraction of turns where the model produced a usable, on-topic response to the test prompt. It’s the single summary metric for response quality.
  • Non-tool V2V latency is end-of-user-speech to first audio chunk on turns where the model replies directly (no tool call). This is the lower-bound conversational latency you’d see on a phone call where the user just asks something.
  • Tool V2V latency is the same end-to-end timing but on turns that route through a tool — measured from end-of-user-speech to the start of the model narrating the tool result. This is the metric that matters for voice agents that do anything real: phone bookings, account lookups, order placement.

See Metrics Overview for the exact definitions.
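
To make the two latency definitions concrete, here is a minimal sketch that derives a voice-to-voice (V2V) latency from a per-turn event timeline. The `Event` type and the event labels (`user_speech_end`, `tool_result`, `audio_chunk`) are hypothetical, and treating "the start of the model narrating the tool result" as the first audio after the last tool result is an assumption about the definition, not a confirmed one.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str  # hypothetical labels: "user_speech_end", "tool_result", "audio_chunk"
    t_ms: int  # timestamp in milliseconds


def v2v_latency_ms(events):
    """Return (latency_ms, is_tool_turn) for one turn's time-ordered events."""
    speech_end = next(e.t_ms for e in events if e.kind == "user_speech_end")
    tool_results = [e.t_ms for e in events if e.kind == "tool_result"]
    # Assumption: on tool turns, only audio that starts after the last
    # tool result counts as narration of that result.
    floor = max(tool_results) if tool_results else speech_end
    first_audio = next(
        e.t_ms for e in events if e.kind == "audio_chunk" and e.t_ms >= floor
    )
    return first_audio - speech_end, bool(tool_results)


direct_turn = [
    Event("user_speech_end", 1000),
    Event("audio_chunk", 1864),
    Event("audio_chunk", 1900),
]
tool_turn = [
    Event("user_speech_end", 1000),
    Event("audio_chunk", 1300),  # filler audio before the tool returns
    Event("tool_result", 2000),
    Event("audio_chunk", 2624),
]
print(v2v_latency_ms(direct_turn))
print(v2v_latency_ms(tool_turn))
```

Under this reading, filler audio emitted before the tool returns does not shorten the tool V2V figure, which is why tool turns are reported separately from direct replies.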

Why Hydra wins tool V2V

The model is purpose-built for realtime voice agents that call tools. Three architectural choices show up in the numbers:

  1. Streaming tool-call arguments. Hydra emits response.function_call_arguments.delta as soon as the arguments start materialising — the client can begin executing tools before the arguments JSON has finished streaming. Most other models wait until arguments are fully formed before emitting them.
  2. Server-side VAD tuned for voice agents. Hydra detects end-of-turn without waiting for a fixed silence window; it adapts to the speaker’s cadence. Models with conservative VAD wait longer on short utterances.
  3. No intermediate text serialisation. Audio in → model → audio out, all on a single socket. No STT → text → LLM → text → TTS hop.
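
The first point can be sketched on the client side as follows. Only the `response.function_call_arguments.delta` event name comes from this page; the `call_id` and `delta` fields, the `response.function_call_arguments.done` event, and the `ToolCallAssembler` class are assumptions modelled on common realtime API conventions, so check the Hydra API reference before relying on them.

```python
import json


class ToolCallAssembler:
    """Accumulates streamed argument fragments per tool call (illustrative only)."""

    def __init__(self, on_start=None):
        self._buffers = {}
        self._on_start = on_start  # optional hook fired on the first fragment

    def handle(self, event):
        """Return (call_id, arguments) once a call completes, else None."""
        etype = event.get("type")
        call_id = event.get("call_id")  # assumed field name
        if etype == "response.function_call_arguments.delta":
            if call_id not in self._buffers:
                self._buffers[call_id] = []
                if self._on_start:
                    # Early work can begin here, e.g. warming a connection,
                    # before the arguments JSON is complete.
                    self._on_start(call_id)
            self._buffers[call_id].append(event["delta"])  # assumed field name
            return None
        if etype == "response.function_call_arguments.done":  # assumed event name
            raw = "".join(self._buffers.pop(call_id, []))
            return call_id, json.loads(raw)
        return None


started = []
assembler = ToolCallAssembler(on_start=started.append)
stream = [
    {"type": "response.function_call_arguments.delta", "call_id": "c1", "delta": '{"ci'},
    {"type": "response.function_call_arguments.delta", "call_id": "c1", "delta": 'ty": "Paris"}'},
    {"type": "response.function_call_arguments.done", "call_id": "c1"},
]
results = [assembler.handle(ev) for ev in stream]
print(started, results[-1])
```

The `on_start` hook is where a client could overlap its own setup with the remaining argument stream; the full arguments are parsed only once the stream for that call ends.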

Next

Metrics Overview

Exact definitions of pass rate, non-tool V2V, and tool V2V.

Model card

Capabilities, voices, audio formats, pricing.