# Metrics Overview

> Definitions of the metrics Hydra is benchmarked on — pass rate, voice-to-voice latency (tool and non-tool), and how each is computed.

This page defines what each metric on the [Performance](/waves/documentation/speech-to-speech-hydra/benchmarks/performance) page actually measures.

## Pass rate

|                                  |                                                                                           |
| -------------------------------- | ----------------------------------------------------------------------------------------- |
| What it measures                 | Fraction of turns where the model produced a usable, on-topic response to the test prompt |
| Range                            | 0 % – 100 % (higher is better)                                                            |
| Computed from                    | The full transcript of all turns across all runs                                          |
| Sample size in current benchmark | 30 turns × 10 runs = 300 turns per model                                                  |

A turn is counted as "passing" if the model's reply is **(a)** semantically aligned with the user's prompt, **(b)** factually correct given the available context, and **(c)** delivered as audio without the response being cancelled, dropped, or replaced with an error. Hallucinated content, off-topic replies, silent failures, and turns that error out all count as misses.

Pass rate is the **integrative quality metric** — it rolls turn-taking accuracy, instruction following, tool-call correctness, and audio synthesis reliability into a single number.
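As a minimal sketch of the scoring rule above, the snippet below counts a turn as passing only when all three criteria hold. The `TurnResult` record and its field names are hypothetical, introduced for illustration; how the benchmark's grader actually represents its judgements is not specified on this page. The sample numbers are made up and are not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    # Hypothetical per-turn grading record; field names are illustrative only.
    aligned: bool          # (a) semantically aligned with the user's prompt
    correct: bool          # (b) factually correct given the available context
    audio_delivered: bool  # (c) audio delivered, not cancelled, dropped, or errored


def turn_passes(turn: TurnResult) -> bool:
    """A turn passes only if all three criteria hold."""
    return turn.aligned and turn.correct and turn.audio_delivered


def pass_rate(turns: list[TurnResult]) -> float:
    """Percentage of passing turns, from 0.0 to 100.0."""
    if not turns:
        raise ValueError("no turns to score")
    return 100.0 * sum(turn_passes(t) for t in turns) / len(turns)


# Illustrative data: 300 turns, 4 factual misses and 2 dropped audio responses.
turns = (
    [TurnResult(True, True, True)] * 294
    + [TurnResult(True, False, True)] * 4
    + [TurnResult(True, True, False)] * 2
)
print(f"{pass_rate(turns):.1f}%")  # 98.0%
```

Because a single failed criterion fails the whole turn, one number can reflect a regression in any of the underlying capabilities.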

## Non-tool voice-to-voice (V2V) latency

|                                  |                                                                                                                |
| -------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| What it measures                 | Time from end-of-user-speech to first audio chunk of the model's reply, **on turns that do not invoke a tool** |
| Lower is better                  | Yes                                                                                                            |
| Computed from                    | `transcript.jsonl` timestamps                                                                                  |
| Sample size in current benchmark | n = 224 non-tool turns (across all 10 runs)                                                                    |

The clock starts when Hydra emits `input_audio_buffer.speech_stopped` and ends when the first `response.output_audio.delta` arrives on the client. This is the lower-bound conversational latency: how fast the model starts talking when no external system is in the loop.

Reported as **median** (the typical experience) and **max** (the slowest single turn observed in the sample).
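The computation can be sketched as follows. This page does not document the schema of `transcript.jsonl`, so the sketch assumes one JSON object per line with hypothetical `turn`, `type`, and `ts_ms` fields. Only the event names come from the description above.

```python
import json
import statistics
from collections import defaultdict

STOP = "input_audio_buffer.speech_stopped"
AUDIO = "response.output_audio.delta"
TOOL_PREFIX = "response.function_call_arguments"


def non_tool_v2v_ms(jsonl_lines):
    """Per-turn latency in ms for turns that contain no function-call events.

    Assumes each line is a JSON object with hypothetical fields
    `turn`, `type`, and `ts_ms`.
    """
    turns = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():
            event = json.loads(line)
            turns[event["turn"]].append(event)

    latencies = []
    for events in turns.values():
        if any(e["type"].startswith(TOOL_PREFIX) for e in events):
            continue  # tool turn: covered by the tool V2V metric instead
        stops = [e["ts_ms"] for e in events if e["type"] == STOP]
        if not stops:
            continue
        start = min(stops)  # first speech_stopped in the turn starts the clock
        audio = [e["ts_ms"] for e in events
                 if e["type"] == AUDIO and e["ts_ms"] >= start]
        if audio:
            latencies.append(min(audio) - start)  # first audio chunk stops it
    return latencies


# Illustrative events only, not benchmark data.
sample = [json.dumps(e) for e in [
    {"turn": 1, "type": STOP, "ts_ms": 1000},
    {"turn": 1, "type": AUDIO, "ts_ms": 1420},
    {"turn": 1, "type": AUDIO, "ts_ms": 1500},
    {"turn": 2, "type": STOP, "ts_ms": 5000},
    {"turn": 2, "type": AUDIO, "ts_ms": 5380},
    {"turn": 3, "type": STOP, "ts_ms": 7000},
    {"turn": 3, "type": "response.function_call_arguments.delta", "ts_ms": 7200},
    {"turn": 3, "type": AUDIO, "ts_ms": 8100},
    {"turn": 4, "type": STOP, "ts_ms": 9000},
    {"turn": 4, "type": AUDIO, "ts_ms": 9650},
]]

latencies = non_tool_v2v_ms(sample)
print(statistics.median(latencies), max(latencies))  # 420 650
```

To run it against a real file, pass an open file handle, which iterates line by line, in place of `sample`.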

## Tool voice-to-voice (V2V) latency

|                                  |                                                                                                               |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| What it measures                 | Time from end-of-user-speech to start of the model narrating the tool result, **on turns that invoke a tool** |
| Lower is better                  | Yes                                                                                                           |
| Computed from                    | `transcript.jsonl` timestamps                                                                                 |
| Sample size in current benchmark | n = 64 tool turns (across all 10 runs)                                                                        |

This is the headline metric for voice agents that *do* something — book a table, look up an order, query an account. The clock starts at `speech_stopped` and ends at the first `response.output_audio.delta` of the narration response (the second `response.created` in the turn, after the tool call). Reported as **mean** because the distribution is skewed by tool execution time itself.

Internally, this metric decomposes into:

| Sub-step                                                    | What's happening                                          |
| ----------------------------------------------------------- | --------------------------------------------------------- |
| `speech_stopped` → `response.function_call_arguments.delta` | Hydra deciding to call a tool and starting to stream args |
| Streaming delta → `response.function_call_arguments.done`   | Arguments JSON fully streamed                             |
| `done` → `conversation.item.create`                         | Client executes the tool                                  |
| `conversation.item.create` → `response.create`              | Client requests narration                                 |
| `response.create` → first `response.output_audio.delta`     | Model produces narration audio                            |

The first two sub-steps and the last one are *server-controlled*; the middle two are *client-controlled*. Hydra's lead on this metric is driven by the server-controlled portions — particularly the streaming-args behaviour that lets the client begin tool execution before arguments finish streaming.
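The decomposition above can be sketched as a walk over one tool turn's events. The event records here (dicts with `type` and `ts_ms`) are a hypothetical shape, not a documented schema, and the sketch assumes the milestones occur in the order the table lists them. It takes the first occurrence of each milestone after the previous one, so any audio emitted before the tool call is not mistaken for the narration.

```python
MILESTONES = [
    "input_audio_buffer.speech_stopped",
    "response.function_call_arguments.delta",
    "response.function_call_arguments.done",
    "conversation.item.create",
    "response.create",
    "response.output_audio.delta",
]
# Indices of the sub-steps the server controls: the first two and the last.
SERVER_STEPS = (0, 1, 4)


def decompose_tool_turn(events):
    """Split one tool turn into the five sub-step durations (ms).

    `events` is a time-ordered list of dicts with hypothetical
    `type` and `ts_ms` fields.
    """
    marks = []
    cursor = float("-inf")
    for name in MILESTONES:
        ts = next(
            (e["ts_ms"] for e in events
             if e["type"] == name and e["ts_ms"] >= cursor),
            None,
        )
        if ts is None:
            raise ValueError(f"missing milestone: {name}")
        marks.append(ts)
        cursor = ts

    steps = [later - earlier for earlier, later in zip(marks, marks[1:])]
    server = sum(steps[i] for i in SERVER_STEPS)
    return {
        "steps_ms": steps,
        "server_ms": server,
        "client_ms": sum(steps) - server,
        "total_ms": marks[-1] - marks[0],
    }


# Illustrative timestamps only, not benchmark data.
turn = [
    {"type": "input_audio_buffer.speech_stopped", "ts_ms": 0},
    {"type": "response.function_call_arguments.delta", "ts_ms": 180},
    {"type": "response.function_call_arguments.delta", "ts_ms": 200},
    {"type": "response.function_call_arguments.done", "ts_ms": 260},
    {"type": "conversation.item.create", "ts_ms": 560},
    {"type": "response.create", "ts_ms": 565},
    {"type": "response.output_audio.delta", "ts_ms": 800},
    {"type": "response.output_audio.delta", "ts_ms": 850},
]
result = decompose_tool_turn(turn)
print(result["steps_ms"], result["server_ms"], result["client_ms"])
```

The five durations always sum to the turn's total tool V2V latency, so the server and client shares can be compared directly.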

## What's NOT measured

* **Glass-to-glass latency.** The measurement excludes mic capture, network egress, and speaker playback. Add 100–300 ms on a real device.
* **Audio quality.** The benchmark scores semantic correctness, not naturalness, prosody, or speaker similarity. Listener evaluations are a separate effort.
* **Cost per call.** Not part of this comparison.
* **Concurrency under load.** Each run is single-session; concurrent-session behaviour is measured separately.

## Why this benchmark

AIEWF S2S is run head-to-head, in a single session, with fixed prompts. Every model gets the same inputs under the same conditions, so differences in the numbers reflect differences in the models, not in the test setup.

The methodology was designed to surface differences in tool-calling latency (which is where realtime voice agents tend to fall apart) without conflating them with prompt-engineering or harness differences.

## Next

[Performance](/waves/documentation/speech-to-speech-hydra/benchmarks/performance): the head-to-head comparison table and analysis.

How Hydra's streaming function-call arguments work — the mechanism behind the V2V tool latency win.