Streaming

Set "stream": true and the response becomes Server-Sent Events (SSE). Each token (or small group of tokens) arrives as a data: {...} line. The stream ends with a data: [DONE] marker.

This is the same wire format OpenAI uses, so OpenAI client SDKs work once base_url and api_key point at this API; no other code changes are needed.

Request

{
  "model": "electron",
  "messages": [{"role": "user", "content": "Tell me a one-sentence fun fact."}],
  "stream": true,
  "stream_options": { "include_usage": true }
}

Always send stream_options.include_usage: true if you bill or log per token. With it, the server emits a final usage chunk with exact token counts. Usage is also finalized on the server when the client disconnects mid-stream (see Client disconnect behavior).

SSE format

Each delta chunk:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1740000000,"model":"electron","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

Field-by-field:

| Field | Meaning |
| --- | --- |
| choices[0].delta.role | Sent once, on the first chunk ("assistant"). |
| choices[0].delta.content | Partial content. Concatenate across chunks to reconstruct the full message. |
| choices[0].delta.tool_calls | Partial function-call payload; see Tool Calling: Streaming. |
| choices[0].finish_reason | null until the final content delta, then "stop" / "length" / "tool_calls". |
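
The field semantics above can be sketched as a small fold over already-parsed chunks. This is a minimal illustration, not the SDK's implementation: the function name `accumulate` and the sample chunks are hypothetical, and tool_calls deltas are deliberately left out.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def accumulate(
    chunks: Iterable[Dict[str, Any]],
) -> Tuple[Optional[str], str, Optional[str]]:
    """Fold chat.completion.chunk dicts into (role, content, finish_reason)."""
    role: Optional[str] = None
    parts: List[str] = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # a chunk with empty choices carries no delta
        choice = choices[0]
        delta = choice.get("delta") or {}
        if delta.get("role"):
            role = delta["role"]  # sent once, on the first chunk
        if delta.get("content"):
            parts.append(delta["content"])  # concatenate partial content
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]  # null until the end
    return role, "".join(parts), finish_reason


# Hypothetical chunks shaped like the delta example above.
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " world"}, "finish_reason": "stop"}]},
]
result = accumulate(sample_chunks)
```

Collecting the pieces in a list and joining once avoids repeated string copies on long responses.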

Final usage chunk (with include_usage: true)

After the final content delta and before [DONE], the server emits one extra chunk with empty choices and a usage object:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1740000000,"model":"electron","choices":[],"usage":{"prompt_tokens":20,"completion_tokens":8,"total_tokens":28,"prompt_tokens_details":{"cached_tokens":0}}}
data: [DONE]
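
If you read the stream without an SDK, the framing described above (data: lines, a usage-only chunk, then [DONE]) can be handled with a few lines of standard-library Python. This is a rough sketch under stated assumptions: `iter_sse_chunks` and `final_usage` are hypothetical helper names, the sample id `chatcmpl-example` is made up, and each event is assumed to fit on a single data: line as in the examples on this page.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON chunk per data: line, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream marker; nothing after it is read
        yield json.loads(payload)


def final_usage(chunks: Iterable[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the usage object from the usage-only chunk, if one was sent."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return usage


# Sample lines modeled on the examples above; the id is a placeholder.
sample_lines = [
    'data: {"id":"chatcmpl-example","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-example","choices":[],"usage":{"prompt_tokens":20,"completion_tokens":8,"total_tokens":28}}',
    "",
    "data: [DONE]",
    'data: {"ignored": true}',
]
chunks = list(iter_sse_chunks(sample_lines))
usage = final_usage(chunks)
```

In a real client, `lines` would come from the HTTP response body decoded line by line; that part is omitted here because it depends on the HTTP library you use.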

Consuming the stream

import os
from openai import OpenAI

# Point the OpenAI SDK at the Electron endpoint.
client = OpenAI(
    base_url="https://api.smallest.ai/waves/v1",
    api_key=os.environ["SMALLEST_API_KEY"],
)

stream = client.chat.completions.create(
    model="electron",
    messages=[{"role": "user", "content": "Tell me a fun fact."}],
    stream=True,
    stream_options={"include_usage": True},
)

text = ""
usage = None
for chunk in stream:
    # Content deltas; the usage-only chunk has an empty choices list.
    if chunk.choices and chunk.choices[0].delta.content:
        delta = chunk.choices[0].delta.content
        print(delta, end="", flush=True)
        text += delta
    # The last chunk before [DONE] carries the usage object.
    if chunk.usage is not None:
        usage = chunk.usage

print(f"\n\ntokens: {usage}")

Client disconnect behavior

If the client closes the connection mid-stream, the server still finalizes usage and records the tokens that were actually generated. You’re billed only for what the model produced before the disconnect — no double-charge on retry.
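
On the client side, a deliberate early stop might look like the sketch below: read until you have enough text, then close the stream instead of draining it. The helper `read_until` and the `FakeStream` stand-in are hypothetical. The sketch assumes your stream object exposes a `close()` method (the stream returned by the OpenAI Python SDK does) and that closing it drops the connection.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator, List


def read_until(stream: Iterable[Any], max_chars: int) -> str:
    """Collect content deltas, then close the stream once max_chars is reached."""
    parts: List[str] = []
    total = 0
    try:
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                piece = chunk.choices[0].delta.content
                parts.append(piece)
                total += len(piece)
                if total >= max_chars:
                    break  # stop reading; the finally block closes the stream
    finally:
        close = getattr(stream, "close", None)
        if callable(close):
            close()  # assumption: this disconnects from the server
    return "".join(parts)


class FakeStream:
    """Stand-in for an SDK stream: iterable, with a close() method."""

    def __init__(self, pieces: List[str]) -> None:
        self._pieces = pieces
        self.closed = False

    def __iter__(self) -> Iterator[Any]:
        for piece in self._pieces:
            delta = SimpleNamespace(content=piece)
            yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

    def close(self) -> None:
        self.closed = True


fake = FakeStream(["Hello", ", ", "world", "!"])
partial = read_until(fake, max_chars=6)
```

Note that a client that stops early never sees the final usage chunk, so any token counts for that request have to come from the server's own records.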

Latency profile

  • Time to first token (TTFT): typically under 300 ms warm.
  • Per-token interval after the first token: a few milliseconds; full sentences arrive in tens of milliseconds.
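
To check these figures against your own deployment, you can timestamp the deltas as they arrive. The helper below is an illustrative sketch, not part of any SDK: `measure_latency` is a hypothetical name, and `start` should be a `time.perf_counter()` reading taken just before the request is sent (if omitted, the clock starts when the function is called).

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_latency(
    deltas: Iterable[str],
    start: Optional[float] = None,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft_seconds, mean_gap_seconds) for an iterable of content deltas."""
    if start is None:
        start = clock()
    stamps: List[float] = []
    for piece in deltas:
        if piece:  # ignore empty deltas such as a role-only first chunk
            stamps.append(clock())
    if not stamps:
        return None, None
    ttft = stamps[0] - start
    if len(stamps) < 2:
        return ttft, None
    # Average spacing between consecutive content deltas after the first.
    mean_gap = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    return ttft, mean_gap


# Dry run with a scripted clock instead of real time.
ticks = iter([0.0, 0.25, 0.26, 0.27])
ttft, mean_gap = measure_latency(["Hi", " there", "!"], clock=lambda: next(ticks))
```

With a live stream, pass a generator that yields `chunk.choices[0].delta.content` for each chunk that has choices, and leave `clock` at its default.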

For voice-agent pipelines, start your TTS engine on the first delta.content chunk to mask end-to-end latency. See Best Practices and the Voice Agent cookbook.
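
A rough sketch of that handoff follows. Everything TTS-related here is hypothetical: `TTSSession`, `forward_to_tts`, and `RecordingSession` are illustrative names, not the Lightning API, and the sketch assumes your TTS client accepts incremental text. The point it shows is only the ordering: the session is opened on the first non-empty content delta rather than after the full response.

```python
from typing import Callable, Iterable, List, Optional, Protocol


class TTSSession(Protocol):
    """Hypothetical interface; adapt it to your TTS client."""

    def feed(self, text: str) -> None: ...

    def finish(self) -> None: ...


def forward_to_tts(
    deltas: Iterable[str], open_session: Callable[[], TTSSession]
) -> int:
    """Open a TTS session on the first content delta and forward every delta."""
    session: Optional[TTSSession] = None
    forwarded = 0
    for piece in deltas:
        if not piece:
            continue  # nothing to speak yet
        if session is None:
            session = open_session()  # started on the first content delta
        session.feed(piece)
        forwarded += 1
    if session is not None:
        session.finish()
    return forwarded


class RecordingSession:
    """Stand-in sink that records what it was fed, for a dry run."""

    def __init__(self) -> None:
        self.fed: List[str] = []
        self.finished = False

    def feed(self, text: str) -> None:
        self.fed.append(text)

    def finish(self) -> None:
        self.finished = True


opened: List[RecordingSession] = []


def open_recording_session() -> RecordingSession:
    session = RecordingSession()
    opened.append(session)
    return session


sent = forward_to_tts(["", "Hello", " there", "."], open_recording_session)
```

If your TTS engine produces better prosody from clause-sized input, you could buffer deltas up to the next punctuation mark before calling `feed`, at the cost of a slightly later start.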