Hydra — full-duplex speech-to-speech model

Hydra, Smallest AI’s in-house speech-to-speech model, is now live. A single WebSocket carries microphone audio from your client to the model and streams synthesised response audio back — no STT → LLM → TTS pipeline in the middle.

import asyncio, base64, json, os, wave
import websockets

API_KEY = os.environ["SMALLEST_API_KEY"]
URL = f"wss://api.smallest.ai/waves/v1/s2s?model=hydra&api_key={API_KEY}"

async def main():
    async with websockets.connect(URL, max_size=None) as ws:
        async for raw in ws:
            evt = json.loads(raw)
            if evt["type"] == "session.created":
                await ws.send(json.dumps({
                    "type": "session.configure",
                    "session": {"instructions": "Be brief.", "voice": "wren"},
                }))
            elif evt["type"] == "response.output_audio.delta":
                ...  # decode base64 PCM16 and play

asyncio.run(main())
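
The elided "decode base64 PCM16 and play" step could look roughly like the sketch below, which collects decoded chunks and wraps them in a WAV container instead of playing them live. It relies only on the output format stated in this note (PCM16 mono 48 kHz). The payload field name `delta` is an assumption, as are the helper names `decode_audio_delta` and `pcm16_to_wav`; check the event reference for the real field before using it.

```python
import base64
import io
import wave

OUTPUT_RATE_HZ = 48_000  # output format from the launch notes: PCM16 mono 48 kHz


def decode_audio_delta(evt: dict, field: str = "delta") -> bytes:
    """Return raw PCM16 bytes from one response.output_audio.delta event.

    ASSUMPTION: the base64 payload sits under `field` ("delta" by default).
    The launch notes name the event type but not its payload field.
    """
    return base64.b64decode(evt[field])


def pcm16_to_wav(pcm: bytes, rate_hz: int = OUTPUT_RATE_HZ) -> bytes:
    """Wrap mono PCM16 samples in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(rate_hz)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    # Two fake events, each carrying 10 ms of silence (480 samples at 48 kHz).
    silence = base64.b64encode(b"\x00\x00" * 480).decode("ascii")
    events = [{"type": "response.output_audio.delta", "delta": silence}] * 2
    pcm = b"".join(decode_audio_delta(e) for e in events)
    with open("reply.wav", "wb") as fh:
        fh.write(pcm16_to_wav(pcm))
```

For live playback you would hand each decoded chunk to an audio output library rather than buffering the whole reply; the WAV route is mainly useful for debugging what the model sent.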

What’s in this launch:

Single WebSocket endpoint at wss://api.smallest.ai/waves/v1/s2s?model=hydra&api_key=.... JSON text frames; no binary frames.
Full-duplex with server-side VAD — stream input_audio_buffer.append continuously while the mic is open. Hydra detects turn boundaries on its own. If the user speaks while the model is responding, the in-flight response is cancelled (response.done with reason: "interrupted") and a fresh turn begins.
Six voices: wren, sloane, marlowe, reed, knox, tate.
Tool calling: declare JSON-schema tools in session.configure. Hydra streams the call arguments via response.function_call_arguments.delta; your client executes the tool and posts the result back via conversation.item.create followed by response.create.
Bot speaks first — set generate_initial_response: true on session.configure for greetings and concierge openers.
Mid-session updates — live-patch tools via session.update without reconnecting.
Audio formats: input PCM16 mono 16 kHz; output PCM16 mono 48 kHz.
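
As a rough illustration of the client side of the list above, the sketch below frames microphone audio as input_audio_buffer.append events and reassembles streamed tool-call arguments before posting a result. Only the event type names and the 16 kHz PCM16 input format come from this note. Every payload field (`audio`, `call_id`, `delta`, `item`, `output`, and the `function_call_output` item type) is an assumption, and the helper names are hypothetical; the Overview page is the authority on the real schema.

```python
import base64
import json

INPUT_RATE_HZ = 16_000   # input format from the launch notes: PCM16 mono 16 kHz
BYTES_PER_SAMPLE = 2     # 16-bit samples


def audio_append_events(pcm: bytes, chunk_ms: int = 20):
    """Yield input_audio_buffer.append frames (JSON text) for a PCM16 buffer.

    ASSUMPTION: the base64 audio goes in a field named "audio".
    """
    step = INPUT_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        chunk = pcm[start:start + step]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


class ToolCallAssembler:
    """Collects response.function_call_arguments.delta fragments per call."""

    def __init__(self):
        self._parts = {}

    def feed(self, evt: dict) -> None:
        # ASSUMPTION: each delta event carries "call_id" and "delta" fields.
        self._parts.setdefault(evt["call_id"], []).append(evt["delta"])

    def finish(self, call_id: str) -> dict:
        """Join the fragments for one call and parse them as JSON arguments."""
        return json.loads("".join(self._parts.pop(call_id)))


def tool_result_events(call_id: str, result) -> list:
    """Build the two frames the notes name for returning a tool result.

    ASSUMPTION: the item shape below is illustrative, not the documented one.
    """
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),
            },
        }),
        json.dumps({"type": "response.create"}),
    ]
```

A caller would send each string from `audio_append_events` with `ws.send(...)` while the mic is open, call `feed` on every arguments delta, and call `finish` once the server signals that the arguments are complete. This note does not name that completion signal, so the sketch leaves the trigger to the caller.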

When to use Hydra vs the three-model stack:

Use Hydra when latency-to-voice matters above all else — phone agents, kiosks, in-car assistants.
Use Pulse → Electron → Lightning v3.1 when you need explicit text in the middle: analytics, custom retrieval, regulated content moderation, or bring-your-own-model (BYOM) setups.

Docs:

Quickstart — clone the reference client and talk to Hydra in your browser.
Overview — full event reference, session config, tool calling, interruption, errors.
Model Card — voices, performance, pricing.
Reference client (Next.js) — production-grade browser client with live wire-log, multi-agent presets, tool execution.