Hydra

Hydra is Smallest AI’s speech-to-speech model. A single WebSocket carries microphone audio from your client to the model and streams synthesised response audio back. There is no STT → LLM → TTS pipeline in the middle, and no transcript on the wire.

Audio in, audio out

One model. One socket. No glue code.

Full-duplex

Barge-in handled by the model. In-flight responses cancel automatically.

Tool calling

Standard JSON-schema function calls, executed on your side.

Bot speaks first

Set generate_initial_response: true to have the model open with a greeting.

Model Overview

Developed by: Smallest AI
Model type: Full-duplex speech-to-speech
API surface: WebSocket (wss://api.smallest.ai/waves/v1/s2s)
Model ID (query param): model=hydra
Wire version: v1
License: Proprietary, hosted API

Key capabilities

6 voices

wren, sloane, marlowe, reed, knox, tate.

Tool calling

JSON-schema tools, streamed args, client-side execution.

Mid-session updates

Live-patch tools via session.update, without reconnecting.
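
As a rough sketch of the tool-calling flow described above, the snippet below builds a JSON-schema tool definition, wraps it in a session.update event, and runs a call on the client once the streamed argument fragments have arrived. The tool object layout (type, name, description, parameters), the session.tools path, and the get_weather tool with its handler are assumptions made for illustration; the event catalog is the authority on the real shapes.

```python
import json


def make_tool(name, description, properties, required=()):
    """Describe one function tool with a JSON-schema parameter object."""
    return {
        "type": "function",  # assumed layout, not confirmed by this page
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(required),
        },
    }


def build_session_update(tools):
    """Serialise a session.update event carrying a new tool list."""
    # Assumption: tools sit under session.tools, mirroring session.voice.
    return json.dumps({"type": "session.update", "session": {"tools": tools}})


def execute_tool_call(registry, name, argument_chunks):
    """Join streamed argument fragments, parse them, and run the local handler."""
    arguments = json.loads("".join(argument_chunks))
    return registry[name](**arguments)


# Hypothetical tool and handler, for illustration only.
weather_tool = make_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {"city": {"type": "string"}},
    required=["city"],
)
registry = {"get_weather": lambda city: {"city": city, "forecast": "sunny"}}
```

Because execution happens on your side, the registry is an ordinary dictionary of local callables; the model only ever sees the schema.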

Audio formats

| Direction | Format | Sample rate | Channels | Encoding |
| --- | --- | --- | --- | --- |
| Input (client → server) | PCM16, signed little-endian | 16 kHz | mono | base64 inside input_audio_buffer.append |
| Output (server → client) | PCM16, signed little-endian | 48 kHz | mono | base64 inside response.output_audio.delta |
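
The encoding rules above can be sketched in a few lines: pack signed 16-bit samples as little-endian bytes, base64-encode them, and place the result in a JSON text frame. The event type names come from the table; the payload field names ("audio" on the input event, "delta" on the output event) are assumptions and should be checked against the event catalog.

```python
import base64
import json
import struct

INPUT_SAMPLE_RATE = 16_000   # Hz, client -> server (from the table above)
OUTPUT_SAMPLE_RATE = 48_000  # Hz, server -> client (from the table above)


def encode_append_event(samples):
    """Pack signed 16-bit samples as little-endian PCM16 inside an append event."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Field name "audio" is an assumption, not confirmed by this page.
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def decode_output_delta(message):
    """Return the PCM16 samples carried by a response.output_audio.delta event."""
    event = json.loads(message)
    # Field name "delta" is an assumption, not confirmed by this page.
    pcm = base64.b64decode(event["delta"])
    return list(struct.unpack(f"<{len(pcm) // 2}h", pcm))
```

Note the asymmetry in sample rates: audio you send is 16 kHz, while decoded output must be played back (or resampled) at 48 kHz.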

Languages

Hydra currently supports English only. Additional languages are on the roadmap.

| Language | ISO code | Status |
| --- | --- | --- |
| English | en | ✅ Production |

Voices

Six voice IDs are accepted in the session.voice field of the session.configure event. The voice is frozen at handshake.

| Voice ID | Notes |
| --- | --- |
| wren | Default (server-side default if voice is omitted). |
| sloane | |
| marlowe | |
| reed | |
| knox | |
| tate | |
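
Since the voice is frozen at handshake, it can help to validate the ID on the client before sending session.configure. The following is a minimal sketch: the session.voice path follows the text above, while placing generate_initial_response inside the same session object is an assumption.

```python
import json

# The six voice IDs listed in the table above.
VOICES = ("wren", "sloane", "marlowe", "reed", "knox", "tate")


def build_session_configure(voice="wren", generate_initial_response=False):
    """Serialise a session.configure event after checking the voice ID locally."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    return json.dumps({
        "type": "session.configure",
        "session": {
            "voice": voice,
            # Assumption: this flag sits alongside voice in the session object.
            "generate_initial_response": generate_initial_response,
        },
    })
```

Failing fast on an unknown voice avoids discovering the mistake only after the connection is established.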

Performance & benchmarks

Hydra is benchmarked against eight other production-grade voice / realtime models. Full methodology and metric definitions live on the dedicated Performance and Metrics Overview pages.

AIEWF S2S — 10 runs × 30 turns, aiwf_medium_context

| Model | Pass rate | Non-tool V2V median | Non-tool V2V max | Tool V2V mean |
| --- | --- | --- | --- | --- |
| ultravox-v0.7 | 97.7 % | 864 ms | 1888 ms | 2406 ms |
| gpt-realtime-2 (low) | 96.0 % | 1728 ms | 4032 ms | 2005 ms |
| Hydra | 95.9 % | 864 ms | 1984 ms | 1624 ms |
| grok-voice-think-fast-1.0 | 95.3 % | 2336 ms | 4800 ms | 2753 ms |
| gpt-realtime-1.5 | 93.3 % | 1152 ms | 2304 ms | 2251 ms |
| gemini-3.1-flash-live-preview | 91.7 % | 1632 ms | 5664 ms | 3172 ms |
| gpt-realtime | 86.7 % | 1536 ms | 4672 ms | 2199 ms |
| gemini-live | 86.0 % | 2624 ms | 30000 ms | 4082 ms |
| nova-2-sonic | not reported | 1280 ms | 3232 ms | 1689 ms |

Reading the table:

  • Tool V2V mean latency: Hydra is the fastest of 9 at 1624 ms, ahead of nova-2-sonic (1689 ms), gpt-realtime-2 low (2005 ms), and ultravox (2406 ms).
  • Non-tool V2V median latency: tied-fastest (864 ms, level with ultravox).
  • Pass rate: third of the 8 models that reported one, 1.8 pp behind the leader (nova-2-sonic did not report a pass rate).
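
The three readings above can be reproduced directly from the table. The snippet below is only a convenience check over the published figures; the Row type and helper names are illustrative and not part of any benchmark tooling.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    pass_rate: Optional[float]  # percent; None where not reported
    median_ms: int              # non-tool V2V median
    max_ms: int                 # non-tool V2V max
    tool_mean_ms: int           # tool V2V mean


ROWS = [
    Row("ultravox-v0.7", 97.7, 864, 1888, 2406),
    Row("gpt-realtime-2 (low)", 96.0, 1728, 4032, 2005),
    Row("Hydra", 95.9, 864, 1984, 1624),
    Row("grok-voice-think-fast-1.0", 95.3, 2336, 4800, 2753),
    Row("gpt-realtime-1.5", 93.3, 1152, 2304, 2251),
    Row("gemini-3.1-flash-live-preview", 91.7, 1632, 5664, 3172),
    Row("gpt-realtime", 86.7, 1536, 4672, 2199),
    Row("gemini-live", 86.0, 2624, 30000, 4082),
    Row("nova-2-sonic", None, 1280, 3232, 1689),
]


def fastest_tool_mean(rows):
    """Model with the lowest tool V2V mean latency."""
    return min(rows, key=lambda r: r.tool_mean_ms).model


def tied_fastest_median(rows):
    """All models sharing the lowest non-tool V2V median, sorted by name."""
    best = min(r.median_ms for r in rows)
    return sorted(r.model for r in rows if r.median_ms == best)


def pass_rate_rank(rows, model):
    """Return (rank, number of models reporting, gap to leader in pp)."""
    reported = sorted(
        (r for r in rows if r.pass_rate is not None),
        key=lambda r: r.pass_rate,
        reverse=True,
    )
    position = [r.model for r in reported].index(model) + 1
    gap = round(reported[0].pass_rate - reported[position - 1].pass_rate, 1)
    return position, len(reported), gap
```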

Latency numbers are computed from transcript.jsonl across all 10 runs (n = 224 non-tool turns, n = 64 tool turns). Pass rate is the fraction of turns that completed the expected interaction.

Hydra is evaluated on voice-agent axes — voice-to-voice latency, turn-taking accuracy, barge-in handling, and tool-call reliability under realistic conditions. Generic LLM benchmarks (MMLU, IFEval) target a different objective and aren’t the right yardstick for a realtime voice model.

Operational metrics

| Metric | Value |
| --- | --- |
| Idle timeout | ~30 s with no traffic from either side. Keep streaming audio (silence frames are fine) to hold the connection. |
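
One way to hold the connection during quiet periods is to send short frames of digital silence. A sketch, assuming 20 ms frames (an arbitrary choice) and an "audio" payload field on input_audio_buffer.append (also an assumption):

```python
import base64
import json

SAMPLE_RATE_HZ = 16_000  # input rate for client -> server audio
BYTES_PER_SAMPLE = 2     # PCM16


def silence_frame(duration_ms=20):
    """Build an append event containing duration_ms of all-zero PCM16 audio."""
    n_samples = SAMPLE_RATE_HZ * duration_ms // 1000
    pcm = bytes(n_samples * BYTES_PER_SAMPLE)  # zero-filled buffer = silence
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Field name "audio" is an assumption, not confirmed by this page.
        "audio": base64.b64encode(pcm).decode("ascii"),
    })
```

At 16 kHz mono PCM16, a 20 ms frame is 320 samples, or 640 bytes before base64 encoding.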

Capacity & rate limits

  • One voice session per WebSocket connection.
  • Concurrency follows your plan’s WebSocket pool. Excess connections receive an error event with code: "server_full", followed by close code 1013. Back off with jitter and retry.
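
"Back off with jitter" can be implemented in several ways; the sketch below uses exponential backoff with full jitter, where each delay is drawn uniformly between zero and a capped exponential ceiling. The base delay, cap, and function names are illustrative choices, not values recommended by the API.

```python
import random

SERVER_FULL_CLOSE_CODE = 1013  # close code sent after a "server_full" error


def should_retry(close_code):
    """Retry only when the server closed the socket because it was full."""
    return close_code == SERVER_FULL_CLOSE_CODE


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Randomising the whole interval, rather than adding a small offset to a fixed delay, spreads reconnect attempts out when many clients are rejected at the same moment.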

For higher limits, contact your account manager.

Pricing

Contact your Smallest AI account manager for current pricing. Hydra is billed by session minute; usage per turn is reported on response.done when available.

Use cases

Direct use

  • Realtime voice assistants — companion apps, concierge bots, in-app tutors.
  • Phone agents — restaurant reservations, banking concierges, customer support.
  • Voice copilots embedded in web and mobile apps.
  • Accessibility — voice-first interfaces for visually-impaired users.
  • Voice-controlled IoT and games — kiosks, in-car assistants, gaming companions.

Downstream use

  • Conversational analytics over recorded phone-call audio (transcribe the captured audio with Pulse STT afterwards).
  • Multi-agent voice systems where Hydra is one specialised speaker.
  • Hybrid voice + text agents where a text fallback is needed for compliance.

Specs & API surface

Endpoint: wss://api.smallest.ai/waves/v1/s2s?model=hydra&api_key=<KEY>
Frame format: JSON (UTF-8 text frames). No binary frames.
Authentication: api_key query parameter (browser clients should mint a short-lived token server-side).
Close codes: 1000 normal, 1013 server full. Auth failures are HTTP 401 during the WS handshake (no close code).

See the documentation hub for the full event catalog, session config, tool calling, interruption handling, and errors.
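
For orientation, the endpoint above can be assembled with the standard library so the key is URL-encoded correctly. This sketch only builds the URL and labels the two documented close codes; it does not open a connection, and the helper names are illustrative.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/s2s"
CLOSE_CODES = {1000: "normal", 1013: "server full"}


def build_endpoint(api_key, model="hydra"):
    """Return the WebSocket URL with model and api_key as query parameters."""
    return f"{BASE_URL}?{urlencode({'model': model, 'api_key': api_key})}"


def describe_close(code):
    """Map a close code to the label given on this page, if there is one."""
    return CLOSE_CODES.get(code, "undocumented")
```

Pass the resulting URL to the WebSocket client of your choice. An HTTP 401 during the handshake means the key was rejected, and no close code follows.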

Safety & responsible use

Hydra is intended for voice-agent and conversational workloads. Customers building user-facing applications should layer their own content moderation, prompt-injection defenses, and PII handling appropriate to their domain. Hydra does not currently apply content moderation server-side — outputs reflect the model’s training and the prompts you provide.

For voice-agent applications handling regulated content (financial, healthcare), the standard pattern applies: keep PII out of prompts where practical, apply post-processing redaction on outputs, and — if you need a transcript for compliance — transcribe the PCM you sent/received via the Pulse STT API and store that transcript with your moderation log.