Streaming | Smallest AI Docs

Set "stream": true and the response becomes Server-Sent Events (SSE). Each token (or small group of tokens) arrives as a data: {...} line. The stream ends with a data: [DONE] marker.

This is the same wire format OpenAI uses, so OpenAI client SDKs work once you point them at the Smallest AI base URL and API key.

Request

{
  "model": "electron",
  "messages": [{"role": "user", "content": "Tell me a one-sentence fun fact."}],
  "stream": true,
  "stream_options": { "include_usage": true }
}

Always send stream_options.include_usage: true if you bill or log per-token. With it, the server emits a final usage chunk so you get exact token counts even when the client disconnects mid-stream.

SSE format

Each delta chunk:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1740000000,"model":"electron","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

Field-by-field:

Field	Meaning
`choices[0].delta.role`	Sent once, on the first chunk (`"assistant"`).
`choices[0].delta.content`	Partial content. Concatenate across chunks to reconstruct the full message.
`choices[0].delta.tool_calls`	Partial function-call payload — see Tool Calling: Streaming.
`choices[0].finish_reason`	`null` until the final content delta, then `"stop"` / `"length"` / `"tool_calls"`.

Final usage chunk (with `include_usage: true`)

After the final content delta and before [DONE], the server emits one extra chunk with empty choices and a usage object:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1740000000,"model":"electron","choices":[],"usage":{"prompt_tokens":20,"completion_tokens":8,"total_tokens":28,"prompt_tokens_details":{"cached_tokens":0}}}
data: [DONE]
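
If you are not using an SDK, the wire format above can be parsed by hand. The following is a minimal sketch, assuming the response body has already been read as decoded text lines and that only one choice (index 0) is requested. `parse_sse_lines` is a hypothetical helper name, not part of any Smallest AI or OpenAI library.

```python
import json


def parse_sse_lines(lines):
    """Collect message text, finish reason, and usage from SSE text lines.

    `lines` is any iterable of decoded lines from the response body.
    Assumes a single choice (index 0); other indices are ignored.
    """
    text_parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and anything that is not a data line.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        # The usage chunk carries an empty choices list and a usage object.
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason, usage
```

The function returns a `(text, finish_reason, usage)` tuple; `usage` stays `None` if the stream was requested without `include_usage: true`.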

Consuming the stream

import os
from openai import OpenAI

# Point the OpenAI SDK at the Smallest AI endpoint.
client = OpenAI(
    base_url="https://api.smallest.ai/waves/v1",
    api_key=os.environ["SMALLEST_API_KEY"],
)

stream = client.chat.completions.create(
    model="electron",
    messages=[{"role": "user", "content": "Tell me a fun fact."}],
    stream=True,
    stream_options={"include_usage": True},
)

text = ""
usage = None
for chunk in stream:
    # Content deltas: print as they arrive and accumulate the full message.
    if chunk.choices and chunk.choices[0].delta.content:
        delta = chunk.choices[0].delta.content
        print(delta, end="", flush=True)
        text += delta
    # The final usage chunk has empty choices and a populated usage object.
    if chunk.usage is not None:
        usage = chunk.usage

print(f"\n\ntokens: {usage}")

Client disconnect behavior

If the client closes the connection mid-stream, the server still finalizes usage and records the tokens that were actually generated. You’re billed only for what the model produced before the disconnect — no double-charge on retry.

Latency profile

Time to first token (TTFT): typically under 300 ms warm.
Per-token interval after the first token: a few milliseconds; full sentences arrive in tens of milliseconds.
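
To check these numbers against your own traffic, you can time the stream yourself. The sketch below is one way to do it, under these assumptions: `measure_stream_latency` is a hypothetical helper, and `make_deltas` is a zero-argument callable you supply that sends the request and returns an iterable of text deltas. The clock starts before the request is sent so that TTFT includes network and queueing time.

```python
import time


def measure_stream_latency(make_deltas, clock=time.perf_counter):
    """Return (ttft_seconds, mean_gap_seconds, text) for one streamed reply.

    `make_deltas` sends the request and returns an iterable of text deltas.
    `clock` is injectable so the function can be tested deterministically.
    """
    start = clock()
    ttft = None
    last = None
    gaps = []
    parts = []
    for delta in make_deltas():
        now = clock()
        if ttft is None:
            ttft = now - start        # time to first token
        else:
            gaps.append(now - last)   # interval between consecutive deltas
        last = now
        parts.append(delta)
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return ttft, mean_gap, "".join(parts)
```

Both timings are `None` when the stream yields fewer deltas than needed to compute them (no deltas for TTFT, fewer than two for the mean gap).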

For voice-agent pipelines, start your TTS engine on the first delta.content chunk to mask end-to-end latency. See Best Practices and the Voice Agent cookbook.
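
As a rough illustration of that hand-off, the sketch below forwards each text delta to a speech callback as soon as it arrives. `pipe_deltas_to_tts` and the `speak` callback are hypothetical names; a real TTS engine will have its own streaming interface and may prefer larger text units than single deltas.

```python
def pipe_deltas_to_tts(deltas, speak):
    """Hand each text delta to `speak` immediately; return the full text.

    `deltas` is an iterable of delta.content values (possibly None or empty).
    `speak` is a callable that accepts a string, standing in for a TTS engine.
    """
    parts = []
    for delta in deltas:
        if not delta:
            continue  # skip None or empty deltas, e.g. role-only chunks
        speak(delta)  # synthesis can begin on the very first content delta
        parts.append(delta)
    return "".join(parts)
```

With the OpenAI SDK loop shown earlier, `deltas` could be a generator that yields `chunk.choices[0].delta.content` for each chunk that has choices.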