> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Voice Agent (Electron + Pulse + Lightning)

> Build an end-to-end voice agent on Smallest AI: Pulse for transcription, Electron for the LLM brain (with tool calling), Lightning for speech output.

This cookbook wires together all three Smallest AI products to build a working voice agent:

```
Mic audio  →  Pulse (STT)  →  Electron (LLM + tools)  →  Lightning (TTS)  →  Speaker
```

The same pattern underlies most production voice agents — customer support, sales calls, voice-driven UIs. Each piece is independently optimizable; this guide shows the minimum viable wiring.

If you want a full voice-agent platform with built-in telephony, campaigns, knowledge base, and call analytics, see [**Atoms**](/atoms/atoms-platform/get-started/quick-start) — it's built on top of this exact stack. Use this cookbook when you want to build the pipeline yourself.

## Architecture

**Pulse.** Streaming STT. Audio chunks in → partial + final transcripts out. Supports 38 languages with auto-detection.

**Electron.** Chat completions + tool calling. Generates a filler phrase before tool calls so the user hears natural speech while tools run.

**Lightning.** Streaming TTS. 44.1 kHz audio, \~200 ms TTFB, 12 TTS languages including Indic. See the model card for the full latency profile.

## End-to-end flow

1. **Capture audio** from the user (mic, telephony, WebRTC, etc.).
2. **Send to Pulse WebSocket** in real-time. Receive partial transcripts as the user speaks, and a final transcript when they pause (end-of-utterance).
3. **Send the final transcript to Electron** as a `user` message in your ongoing conversation. Stream the response.
4. **As Electron streams content**, feed it to Lightning for TTS immediately — don't wait for the full response.
5. **If Electron returns `tool_calls`**: the filler in `content` is spoken via Lightning **while you run the tool in parallel**. When the tool returns, append the tool result and continue the conversation.
6. **Play Lightning's audio** back to the user (speaker, telephony, WebRTC).
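
The steps above can be sketched as a chain of async generators, where each stage starts consuming as soon as the previous one produces something. This is only an illustration of the streaming shape: `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins, not Pulse, Electron, or Lightning calls, and no network traffic is involved.

```python
import asyncio
from typing import AsyncIterator


async def fake_stt(audio_chunks: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in for Pulse: emits one 'final transcript' per audio chunk."""
    async for chunk in audio_chunks:
        yield chunk.decode()


async def fake_llm(transcript: str) -> AsyncIterator[str]:
    """Stand-in for Electron: streams a reply word by word."""
    for word in f"You said: {transcript}".split():
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield word + " "


async def fake_tts(text_deltas: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for Lightning: turns each text delta into an 'audio' frame."""
    async for delta in text_deltas:
        yield delta.encode()


async def run_pipeline(audio_chunks: AsyncIterator[bytes]) -> list[bytes]:
    played = []
    async for transcript in fake_stt(audio_chunks):       # steps 1-2
        async for frame in fake_tts(fake_llm(transcript)):  # steps 3-4
            played.append(frame)                           # step 6: play it
    return played


async def mic() -> AsyncIterator[bytes]:
    """Hypothetical audio source producing a single utterance."""
    yield b"hello"


frames = asyncio.run(run_pipeline(mic()))
```

Because `fake_tts` pulls from `fake_llm` one delta at a time, the first audio frame is available before the full reply has been generated, which is the property step 4 asks for.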

## Minimal implementation (Python)

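This is a sketch: production code needs proper async coordination, jitter handling, and error recovery, but the wiring shape is real. `tts_speak_delta`, `accumulate_tool_calls`, and `run_tool` are not defined here; they are placeholders for your own implementations.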

```python
import asyncio
import json
import os
import websockets
from openai import AsyncOpenAI

SMALLEST = os.environ["SMALLEST_API_KEY"]

# Async client: iterating a sync stream inside a coroutine would block the event loop.
client = AsyncOpenAI(
    base_url="https://api.smallest.ai/waves/v1",
    api_key=SMALLEST,
)

SYSTEM_PROMPT = """You are a friendly customer support agent for Acme Corp.
Keep responses concise: under three sentences.
Use the get_account_info tool when the user asks about their account.
"""

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_account_info",
            "description": "Look up account information for the current caller.",
            "parameters": {
                "type": "object",
                "properties": {
                    "account_id": {"type": "string"},
                },
                "required": ["account_id"],
            },
        },
    }
]

conversation = [{"role": "system", "content": SYSTEM_PROMPT}]


async def transcribe_loop(audio_in, transcripts_out):
    """Stream audio to Pulse, push final transcripts onto the queue."""
    url = "wss://api.smallest.ai/waves/v1/pulse/get_text?language=en"
    # `additional_headers` is the keyword in websockets >= 14; older releases use `extra_headers`.
    async with websockets.connect(
        url, additional_headers={"Authorization": f"Bearer {SMALLEST}"}
    ) as ws:
        async def send():
            async for chunk in audio_in:
                await ws.send(chunk)

        async def recv():
            async for msg in ws:
                event = json.loads(msg)
                # Pulse WS frames: {"type": "transcription", "is_final": true|false, "transcript": "..."}
                # See /waves/documentation/speech-to-text-pulse/realtime-web-socket/response-format
                if event.get("type") == "transcription" and event.get("is_final"):
                    await transcripts_out.put(event["transcript"])

        await asyncio.gather(send(), recv())


async def respond(user_text):
    """Send to Electron, stream tokens. Returns the assistant message + any tool calls."""
    conversation.append({"role": "user", "content": user_text})

    stream = await client.chat.completions.create(
        model="electron",
        messages=conversation,
        tools=TOOLS,
        stream=True,
        stream_options={"include_usage": True},
    )

    content = ""
    tool_calls = []   # accumulated across chunks
    async for chunk in stream:
        if not chunk.choices:
            continue  # e.g. the final usage-only chunk
        delta = chunk.choices[0].delta
        if delta.content:
            content += delta.content
            # 🎙️ feed delta.content to Lightning for TTS immediately
            await tts_speak_delta(delta.content)   # your implementation
        if delta.tool_calls:
            # accumulate tool_calls (function name + arguments stream as deltas)
            accumulate_tool_calls(tool_calls, delta.tool_calls)   # your implementation

    # Whatever Electron emitted gets recorded into history.
    assistant_msg = {"role": "assistant", "content": content or None}
    if tool_calls:
        assistant_msg["tool_calls"] = tool_calls
    conversation.append(assistant_msg)
    return assistant_msg


async def handle_tools(assistant_msg):
    """Run any tool calls, append results, re-prompt Electron."""
    if not assistant_msg.get("tool_calls"):
        return

    for call in assistant_msg["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        # Your implementation. If it blocks, offload it (e.g. asyncio.to_thread).
        result = run_tool(call["function"]["name"], args)
        conversation.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })

    # Re-prompt with tool results; this gives the final spoken response.
    await respond_continuation()


async def respond_continuation():
    """Stream Electron's follow-up after tool results are in the history."""
    stream = await client.chat.completions.create(
        model="electron",
        messages=conversation,
        tools=TOOLS,
        stream=True,
        stream_options={"include_usage": True},
    )
    content = ""
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            delta = chunk.choices[0].delta.content
            content += delta
            await tts_speak_delta(delta)
    conversation.append({"role": "assistant", "content": content})


async def turn_loop(audio_in):
    """One iteration per user utterance."""
    transcripts = asyncio.Queue()
    # Keep a reference: a task with no references can be garbage-collected mid-run.
    stt_task = asyncio.create_task(transcribe_loop(audio_in, transcripts))

    try:
        while True:
            user_text = await transcripts.get()
            assistant_msg = await respond(user_text)
            await handle_tools(assistant_msg)
    finally:
        stt_task.cancel()
```
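
The sketch above calls `accumulate_tool_calls` without defining it. One possible shape is shown below. It assumes Electron streams tool calls in the OpenAI-compatible delta format, where each fragment carries an `index` and optional partial `id`, `function.name`, and `function.arguments` strings; that format is an assumption here, so check it against the Electron reference. `SimpleNamespace` objects stand in for the SDK's delta objects so the example runs offline.

```python
import json
from types import SimpleNamespace


def accumulate_tool_calls(tool_calls: list, deltas) -> None:
    """Merge streamed tool-call fragments into `tool_calls` in place."""
    for d in deltas:
        # Grow the list until there is a slot for this fragment's index.
        while len(tool_calls) <= d.index:
            tool_calls.append(
                {"id": "", "type": "function",
                 "function": {"name": "", "arguments": ""}}
            )
        slot = tool_calls[d.index]
        if getattr(d, "id", None):
            slot["id"] = d.id
        fn = getattr(d, "function", None)
        if fn is not None:
            if getattr(fn, "name", None):
                slot["function"]["name"] += fn.name
            if getattr(fn, "arguments", None):
                # Arguments arrive as pieces of one JSON string.
                slot["function"]["arguments"] += fn.arguments


# Simulated stream: the JSON arguments are split across two chunks.
calls = []
accumulate_tool_calls(calls, [SimpleNamespace(
    index=0, id="call_1",
    function=SimpleNamespace(name="get_account_info", arguments='{"account'),
)])
accumulate_tool_calls(calls, [SimpleNamespace(
    index=0, id=None,
    function=SimpleNamespace(name=None, arguments='_id": "42"}'),
)])

args = json.loads(calls[0]["function"]["arguments"])
```

The arguments string is only valid JSON once the stream has finished, which is why the sketch parses it after accumulation rather than per chunk.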

## The voice-agent latency win

This is the part to internalize: **Electron's filler phrase + parallel tool execution.**

When the user asks *"What's my account balance?"*, the timeline looks roughly like this:

| Time      | Event                                                                 |
| --------- | --------------------------------------------------------------------- |
| 0 ms      | User finishes speaking; Pulse emits final transcript                  |
| \~5 ms    | Transcript sent to Electron                                           |
| \~250 ms  | Electron emits first `delta.content` — *"Let me check your account…"* |
| \~250 ms  | Lightning starts TTSing the filler **immediately**                    |
| \~280 ms  | Electron emits `delta.tool_calls` — `get_account_info(...)`           |
| \~280 ms  | You start the tool call **in parallel** with TTS                      |
| \~600 ms  | User hears *"Let me check your account…"* through their speaker       |
| \~800 ms  | Tool returns the balance                                              |
| \~850 ms  | Electron emits final response — *"Your balance is ₹12,450."*          |
| \~900 ms  | Lightning TTSes the final response                                    |
| \~1200 ms | User hears the answer                                                 |

Without the filler-phrase pattern, the user would hear silence from 0 ms until the answer plays at \~1200 ms. With it, they hear natural speech start at \~600 ms, which feels conversational instead of robotic.
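
The overlap between filler speech and tool execution can be illustrated with plain `asyncio`. In this sketch, `speak` and `run_tool` are hypothetical stand-ins that sleep instead of calling Lightning or a real backend, and the durations are made up; the point is only that starting the tool as a task before awaiting the filler makes the two waits overlap.

```python
import asyncio
import time


async def speak(text: str, seconds: float) -> str:
    """Stand-in for Lightning playback; sleeps for the audio duration."""
    await asyncio.sleep(seconds)
    return text


async def run_tool(seconds: float) -> dict:
    """Stand-in for a slow backend lookup."""
    await asyncio.sleep(seconds)
    return {"balance": 12450}


async def filler_then_tool_parallel() -> tuple[dict, float]:
    start = time.perf_counter()
    # Start the tool first so it runs while the filler is being spoken.
    tool_task = asyncio.create_task(run_tool(0.3))
    await speak("Let me check your account...", 0.3)
    result = await tool_task
    return result, time.perf_counter() - start


result, elapsed = asyncio.run(filler_then_tool_parallel())
```

Run serially, the two 0.3 s waits would add up to about 0.6 s; overlapped, the total stays close to the longer of the two.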

## Production checklist

* [ ] **Stream everything.** Pulse WebSocket for STT, Electron `stream: true` for LLM, Lightning streaming for TTS. Any non-streaming hop adds hundreds of milliseconds.
* [ ] **`stream_options.include_usage: true`** on Electron so you bill accurately on disconnects.
* [ ] **Run tool calls in parallel with TTS** — never serialize the filler-then-tool path.
* [ ] **Capture `X-Request-Id`** from every Electron response for support traceability.
* [ ] **Match Pulse `language`, Electron prompts, and Lightning voice language.** A Hindi caller should be transcribed in Hindi, prompted in Hindi, and synthesized with a Hindi voice. Don't translate in the middle of the pipeline — Electron and Lightning both handle Indic natively.
* [ ] **Cap conversation history** so prompts don't grow unbounded. Truncate older turns once you exceed your token budget; prefix caching keeps recent turns cheap.
* [ ] **Set a per-utterance timeout** (\~10 s on the full STT→LLM→TTS round). Voice users won't wait longer; better to fall back to *"Sorry, can you repeat?"* than to hang silently.
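
Two of the checklist items, capping history and the per-utterance timeout, can be sketched in a few lines. Both helpers are hypothetical: `trim_history` counts messages rather than tokens, which is a simplification, and `with_utterance_timeout` wraps any coroutine with `asyncio.wait_for`. `slow_round` simulates a stalled round trip.

```python
import asyncio


def trim_history(conversation: list[dict], max_messages: int) -> list[dict]:
    """Keep system messages plus the most recent `max_messages` others."""
    system = [m for m in conversation if m["role"] == "system"]
    rest = [m for m in conversation if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    # Avoid starting on a tool result whose originating tool call was dropped.
    while recent and recent[0]["role"] == "tool":
        recent = recent[1:]
    return system + recent


async def with_utterance_timeout(coro, seconds: float, fallback: str) -> str:
    """Return the coroutine's result, or `fallback` if it takes too long."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


async def slow_round() -> str:
    await asyncio.sleep(1.0)  # simulates a stalled STT -> LLM -> TTS round
    return "Your balance is 12,450."


history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "what's my balance?"},
]
trimmed = trim_history(history, max_messages=2)
reply = asyncio.run(
    with_utterance_timeout(slow_round(), 0.05, "Sorry, can you repeat?")
)
```

In a real agent the timeout would be closer to the \~10 s suggested above, and a token-based budget would replace the message count.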

## Related

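* **Pulse:** Streaming STT setup.
* **Lightning:** Streaming TTS setup.
* **Electron:** Filler-phrase pattern in depth.
* [**Atoms**](/atoms/atoms-platform/get-started/quick-start): Skip the wiring and use the platform.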