The Python “Try it” snippets attached to the Lightning v3.1 TTS and Pulse STT API reference operations used the pre-4.4.5 SDK constructor signature:

client = SmallestAI(token="YOUR_API_KEY")

The keyword argument was renamed from token to api_key in SDK 4.4.5. Customers copy-pasting the old snippet hit:

TypeError: SmallestAI.__init__() got an unexpected keyword argument 'token'

All affected Python snippets in lightning-v3.1-openapi.yaml, pulse-stt-openapi.yaml, and their v4 overrides now use SmallestAI(api_key="YOUR_API_KEY"). No wire-protocol change. Existing customer code on smallestai >= 4.4.5 is unaffected.
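For code that has to run against both older and newer SDK installs, one option is a small fallback around the constructor. The sketch below is illustrative only: make_client is a hypothetical helper, and NewStyleClient / OldStyleClient are stand-ins for the two constructor signatures so the example runs without the SDK installed. In real code you would pass the SmallestAI class instead.

```python
def make_client(client_cls, key):
    """Build a client, preferring api_key= and falling back to token=."""
    try:
        return client_cls(api_key=key)
    except TypeError:
        # Older constructor signature: api_key is rejected, token is accepted.
        return client_cls(token=key)


# Stand-ins for the two constructor signatures (hypothetical, illustration only).
class NewStyleClient:
    def __init__(self, api_key):
        self.key = api_key


class OldStyleClient:
    def __init__(self, token):
        self.key = token


new_client = make_client(NewStyleClient, "YOUR_API_KEY")
old_client = make_client(OldStyleClient, "YOUR_API_KEY")
```

Catching TypeError this broadly can also hide a TypeError raised inside the constructor for other reasons, so pinning the SDK version is usually the simpler fix.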

June 3, 2026

WebSocket auth and default URL fixes

Fixed authentication configuration and default server URLs across the WebSocket specs for Lightning TTS, Pulse STT, and related endpoints.

Additional endpoints for v2.2.0 and v3.0.1 are now marked deprecated in the API reference.

May 23, 2026

Hydra — full-duplex speech-to-speech model

Hydra, Smallest AI’s in-house speech-to-speech model, is now live. A single WebSocket carries microphone audio from your client to the model and streams synthesised response audio back — no STT → LLM → TTS pipeline in the middle.

import asyncio, base64, json, os, wave
import websockets

API_KEY = os.environ["SMALLEST_API_KEY"]
URL = f"wss://api.smallest.ai/waves/v1/s2s?model=hydra&api_key={API_KEY}"

async def main():
    async with websockets.connect(URL, max_size=None) as ws:
        async for raw in ws:
            evt = json.loads(raw)
            if evt["type"] == "session.created":
                await ws.send(json.dumps({
                    "type": "session.configure",
                    "session": {"instructions": "Be brief.", "voice": "wren"},
                }))
            elif evt["type"] == "response.output_audio.delta":
                ...  # decode base64 PCM16 and play

asyncio.run(main())

What’s in this launch:

Single WebSocket endpoint at wss://api.smallest.ai/waves/v1/s2s?model=hydra&api_key=.... JSON text frames; no binary frames.
Full-duplex with server-side VAD — stream input_audio_buffer.append continuously while the mic is open. Hydra detects turn boundaries on its own. If the user speaks while the model is responding, the in-flight response is cancelled (response.done with reason: "interrupted") and a fresh turn begins.
Six voices: wren, sloane, marlowe, reed, knox, tate.
Tool calling: declare JSON-schema tools in session.configure. Hydra streams the arguments via response.function_call_arguments.delta; your client executes the tool and posts the result back via conversation.item.create + response.create.
Bot speaks first — set generate_initial_response: true on session.configure for greetings and concierge openers.
Mid-session updates — live-patch tools via session.update without reconnecting.
Audio formats: input PCM16 mono 16 kHz; output PCM16 mono 48 kHz.
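The audio formats above translate into a little client-side plumbing. The following is a minimal sketch, not the reference client: it slices 16 kHz PCM16 microphone audio into fixed-size JSON text frames and wraps 48 kHz PCM16 output in a WAV container. The "audio" payload field name and the 20 ms frame size are assumptions made for illustration; check the event reference for the actual field names.

```python
import base64
import io
import json
import wave

INPUT_RATE_HZ = 16_000   # from the notes above: input PCM16 mono 16 kHz
OUTPUT_RATE_HZ = 48_000  # from the notes above: output PCM16 mono 48 kHz
BYTES_PER_SAMPLE = 2     # PCM16 = 2 bytes per sample


def append_events(pcm: bytes, frame_ms: int = 20):
    """Yield JSON text frames carrying fixed-size chunks of 16 kHz PCM16 audio."""
    frame_bytes = INPUT_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), frame_bytes):
        chunk = pcm[start:start + frame_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # ASSUMPTION: the payload field name is hypothetical.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw 48 kHz mono PCM16 output in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(OUTPUT_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Each yielded string can be passed to ws.send(...) as a text frame, and pcm_to_wav is handy for dumping a response to disk while debugging.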

When to use Hydra vs the three-model stack:

Use Hydra when latency-to-voice matters above all else — phone agents, kiosks, in-car assistants.
Use Pulse → Electron → Lightning v3.1 when you need explicit text in the middle: analytics, custom retrieval, regulated content moderation, BYOM.

Docs:

Quickstart — clone the reference client and talk to Hydra in your browser.
Overview — full event reference, session config, tool calling, interruption, errors.
Model Card — voices, performance, pricing.
Reference client (Next.js) — production-grade browser client with live wire-log, multi-agent presets, tool execution.

May 22, 2026

Electron LLM — chat completions on the Waves API

Electron, Smallest AI’s in-house language model, is now generally available on the Waves API. Use it as a drop-in replacement for OpenAI’s chat completions — point the OpenAI SDK at https://api.smallest.ai/waves/v1 and pass "model": "electron".

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.smallest.ai/waves/v1",
    api_key=os.environ["SMALLEST_API_KEY"],
)

response = client.chat.completions.create(
    model="electron",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)

What’s in this launch:

OpenAI-compatible endpoint — POST /waves/v1/chat/completions. Same wire format as api.openai.com/v1/chat/completions. Streaming (SSE with optional final usage chunk), tool/function calling, JSON mode, multi-turn — all work via standard OpenAI request bodies. The official OpenAI SDKs (Python / JavaScript / Go / Java / Ruby) work with no code changes beyond the base URL and API key.
Sub-300 ms time-to-first-token on warm connections.
32,768-token context (combined input + output).
70 languages with first-class Indic support — Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Urdu — plus broad coverage across Western/Eastern Europe, Middle East, East/Southeast/South/Central Asia, and Africa. See the Electron model card for the full list.
Voice-agent-optimized tool calling — with a voice-agent-style system prompt, Electron emits a short filler phrase in content alongside tool_calls (e.g. “Let me check that for you…”) so a downstream TTS layer can mask tool-call latency. See Tool Calling for the voice-agent pattern.
Automatic prefix caching — cached input tokens billed at a discounted rate vs normal input. Reported on every response as usage.prompt_tokens_details.cached_tokens so you can audit cache hits. See Prefix Caching.
Cookbook: Voice Agent (Electron + Pulse + Lightning) wires Pulse (STT) + Electron (LLM + tools) + Lightning (TTS) into an end-to-end voice pipeline.
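To illustrate the streaming and prefix-caching items above, here is a rough sketch that assembles assistant text from SSE lines and reads the cache-hit share from the final usage chunk. It assumes the stream follows the standard OpenAI chunk shape (data: lines, a closing data: [DONE], and an optional last chunk with empty choices and a usage object). collect_stream and cached_fraction are hypothetical helper names.

```python
import json


def collect_stream(sse_lines):
    """Assemble assistant text and the final usage object from SSE lines."""
    text_parts = []
    usage = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]  # optional final usage chunk
    return "".join(text_parts), usage


def cached_fraction(usage):
    """Share of prompt tokens served from the prefix cache (0.0 if unknown)."""
    details = usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("prompt_tokens", 0)
    if not prompt_tokens:
        return 0.0
    return details.get("cached_tokens", 0) / prompt_tokens
```

When using the official OpenAI SDK you would not parse SSE by hand; the same cached_tokens field is exposed on the usage object of the response.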

Pricing: Contact your Smallest AI account manager for the current rate card.

Plan limits: Standard 10 RPM / 3 concurrent; Enterprise 200 RPM / 20 concurrent.
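A client can stay under the RPM ceiling by pacing itself. The sketch below is one possible sliding-window limiter sized for the Standard plan's 10 RPM. How the server actually counts its window is not stated here, so treat the 60-second sliding window as an assumption. RpmLimiter is a hypothetical name.

```python
import time
from collections import deque


class RpmLimiter:
    """Sliding-window limiter: at most max_calls within any window_s seconds."""

    def __init__(self, max_calls=10, window_s=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self.sent = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_s:
            self.sent.popleft()
        if len(self.sent) < self.max_calls:
            return 0.0
        return self.window_s - (now - self.sent[0])

    def record(self):
        """Note that a request was just sent."""
        self.sent.append(self.clock())
```

Typical use is to sleep for wait_time() before each request and call record() right after sending. The concurrency cap (3 on Standard) is a separate limit and could be handled with a semaphore of the same size.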

Rejected parameters (vs OpenAI): n > 1 and prompt_logprobs — both return HTTP 400 with invalid_request_error.
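When migrating request bodies from OpenAI, a pre-flight check can surface these two parameters before the API returns a 400. This is a minimal sketch; rejected_params is a hypothetical helper, and treating any non-null prompt_logprobs value as rejected is an assumption.

```python
def rejected_params(body):
    """Return the names of request parameters listed above as rejected."""
    problems = []
    if (body.get("n") or 1) > 1:
        problems.append("n")
    # ASSUMPTION: any non-null prompt_logprobs value counts as rejected.
    if body.get("prompt_logprobs") is not None:
        problems.append("prompt_logprobs")
    return problems


request_body = {
    "model": "electron",
    "messages": [{"role": "user", "content": "Hi"}],
    "n": 2,
}
print(rejected_params(request_body))
```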

No vision, no audio in/out on the public API — Electron is text-only.

→ Quickstart · Overview · Chat Completions API · Migrate from OpenAI · Model card

May 22, 2026

Fern CLI 5.28.2 + Python SDK generator 5.12.12

The Fern CLI has been updated to 5.28.2 and the Python SDK generator (fernapi/fern-python-sdk) bumped from 4.61.3 to 5.12.12.

Why: CLI 5.28.2 ships an updated @fern-api/replay that fixes a regression where customer commits made directly on a fern-bot regeneration PR branch could be silently dropped on the next regen if the PR was merged via the GitHub merge-commit button. Pinning the generator to the latest stable (5.12.12) aligns automated regenerations going forward.

Impact for SDK users: The next regeneration will use the v5 generator line, which restructured a few API surfaces compared to v4 (e.g., client.waves.transcribe_pulse → client.waves.speech_to_text.pulse; the format, punctuate, and capitalize query parameters were dropped from the v5 spec).
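Code that must work on either side of the regeneration could resolve the Pulse method by attribute lookup. The sketch below uses SimpleNamespace stand-ins for the two client layouts so it runs without the SDK. resolve_pulse is a hypothetical helper, and only the two attribute paths named above are taken from the release notes.

```python
from types import SimpleNamespace


def resolve_pulse(client):
    """Return the Pulse transcription callable under either SDK layout."""
    waves = client.waves
    speech_to_text = getattr(waves, "speech_to_text", None)
    if speech_to_text is not None and hasattr(speech_to_text, "pulse"):
        return speech_to_text.pulse  # v5 generator layout
    return waves.transcribe_pulse    # v4 generator layout


# Stand-in clients (hypothetical) that mimic the two attribute layouts.
v4_client = SimpleNamespace(
    waves=SimpleNamespace(transcribe_pulse=lambda: "v4 path"),
)
v5_client = SimpleNamespace(
    waves=SimpleNamespace(speech_to_text=SimpleNamespace(pulse=lambda: "v5 path")),
)
```

This only papers over the rename. Because format, punctuate, and capitalize are dropped in the v5 spec, call sites that pass them still need to be updated.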

May 12, 2026

API reference rewrite — Lightning v3.1 + Pulse

The Lightning v3.1 (sync + SSE + WebSocket) and Pulse (REST + WebSocket) endpoints in the API reference have been rewritten end-to-end. Each page now opens with a one-paragraph “what”, a “when to use this” comparison against the alternatives, a “how it works” walkthrough, copy-paste examples for cURL / Python / JavaScript, and a “common gotchas” section.

A note for JavaScript users: the official smallestai npm package (v1.0.1) predates Lightning v3.1 and Pulse, so the docs show fetch (REST/SSE) and ws (WebSocket) examples for those endpoints — they work in Node and the browser with no SDK install.

May 7, 2026

Waves API spec — corrected `output_format` enum and resynced base ↔ v4 overrides

The output_format enum on Lightning v3.1 (POST /waves/v1/lightning-v3.1/get_speech and /stream) is now correctly documented as ["pcm", "mp3", "wav", "ulaw", "alaw"]. The previously listed mulaw is rejected by the platform with an invalid_enum_value error and has been removed. Verified live against api.smallest.ai.

The sample_rate enum on the same endpoints now correctly includes 44100 (the model’s native rate) on both the SDK schema and the rendered API reference.
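Callers that still have mulaw in their configuration could normalize the value client-side before sending a request. The sketch below validates against the corrected output_format enum. Mapping mulaw to ulaw assumes the caller meant μ-law, and the lowercasing is a client-side convenience, not documented API behavior. normalize_output_format is a hypothetical helper.

```python
VALID_OUTPUT_FORMATS = ("pcm", "mp3", "wav", "ulaw", "alaw")


def normalize_output_format(value):
    """Return a value from the corrected output_format enum or raise ValueError."""
    fmt = value.strip().lower()
    if fmt == "mulaw":
        fmt = "ulaw"  # ASSUMPTION: "mulaw" was intended to mean μ-law
    if fmt not in VALID_OUTPUT_FORMATS:
        raise ValueError(
            f"unsupported output_format {value!r}; "
            f"expected one of {', '.join(VALID_OUTPUT_FORMATS)}"
        )
    return fmt
```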

Internally, the base API specs at fern/apis/waves/{openapi,asyncapi}/* and the v4 docs override at fern/apis/waves-v4/overrides/* have been resynced — descriptions, error response shapes, and channel summaries that had drifted between the two layers are now identical. A new CI workflow (spec-drift-check) blocks any future PR that edits one layer without the other.

No customer-facing request shape, response field, or default value changes beyond the output_format/sample_rate corrections.

April 22, 2026

OpenWhispr integration guide

Added a documentation page for using Smallest AI’s Pulse model as the speech-to-text engine inside OpenWhispr, the open-source desktop dictation app for macOS, Windows, and Linux.

→ OpenWhispr integration

April 20, 2026

Unified voice cloning API

Voice cloning now has a single cross-model endpoint: POST /waves/v1/voice-cloning. Upload a short audio sample, get back a voice_... ID that works across supported Lightning TTS models.

The legacy Lightning-large clone endpoint is deprecated and will be removed in a future release.

→ Voice Cloning guide
