Streaming

Set "stream": true and the response becomes Server-Sent Events (SSE). Each token (or small group of tokens) arrives as a data: {...} line. The stream ends with a data: [DONE] marker.

This is the same wire format OpenAI uses, so OpenAI client SDKs work without changes.

Request

{
  "model": "electron",
  "messages": [{"role": "user", "content": "Tell me a one-sentence fun fact."}],
  "stream": true,
  "stream_options": { "include_usage": true }
}

Always send stream_options.include_usage: true if you bill or log per token. With it, the server emits a final usage chunk carrying exact token counts for the request. If the client disconnects mid-stream, the server still finalizes and records usage (see Client disconnect behavior).

SSE format

Each delta chunk:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1740000000,"model":"electron","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

Field-by-field:

Field                          Meaning
choices[0].delta.role          Sent once, on the first chunk ("assistant").
choices[0].delta.content       Partial content. Concatenate across chunks to reconstruct the full message.
choices[0].delta.tool_calls    Partial function-call payload. See Tool Calling: Streaming.
choices[0].finish_reason       null until the final content delta, then "stop" / "length" / "tool_calls".
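
If you are not using an SDK, the fields above can be reassembled by hand. The following is a minimal sketch, assuming the response body has already been split into text lines; the helper names iter_chunks and collect_message and the sample lines are illustrative, not part of the API. It accepts finish_reason either on the last content chunk or on a separate chunk with an empty delta.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from raw SSE lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_message(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Concatenate delta.content for choice 0; return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue
            delta = choice.get("delta") or {}
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


# Hand-written sample lines in the shape documented above (not captured output).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
text, reason = collect_message(sample)
print(text, reason)
```

Collecting the pieces in a list and joining once at the end avoids repeated string copies on long responses.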

Final usage chunk (with include_usage: true)

After the final content delta and before [DONE], the server emits one extra chunk with empty choices and a usage object:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1740000000,"model":"electron","choices":[],"usage":{"prompt_tokens":20,"completion_tokens":8,"total_tokens":28,"prompt_tokens_details":{"cached_tokens":0}}}
data: [DONE]
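
For billing or logging code that reads the raw stream, the usage chunk can be picked out by its non-empty usage object. This is a sketch under the assumption that the input is a list of SSE text lines; the Usage dataclass and find_usage helper are hypothetical names introduced for illustration, and the sample values are copied from the chunk shown above.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    cached_tokens: int


def find_usage(lines: Iterable[str]) -> Optional[Usage]:
    """Return the usage reported before [DONE], or None if no chunk carried one."""
    for line in lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        usage = json.loads(payload).get("usage")
        if usage:  # content chunks have no usage (or a null one)
            details = usage.get("prompt_tokens_details") or {}
            return Usage(
                prompt_tokens=usage["prompt_tokens"],
                completion_tokens=usage["completion_tokens"],
                total_tokens=usage["total_tokens"],
                cached_tokens=details.get("cached_tokens", 0),
            )
    return None


# Sample lines mirroring the documented shapes (ids omitted for brevity).
usage_lines = [
    'data: {"choices":[{"index":0,"delta":{"content":"Hi"},"finish_reason":"stop"}]}',
    'data: {"choices":[],"usage":{"prompt_tokens":20,"completion_tokens":8,'
    '"total_tokens":28,"prompt_tokens_details":{"cached_tokens":0}}}',
    "data: [DONE]",
]
print(find_usage(usage_lines))
```

A None result means the stream ended, or was cut off, before the usage chunk arrived, so client-side logging should treat that case as "unknown" and not as zero tokens.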

Consuming the stream

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.smallest.ai/waves/v1",
    api_key=os.environ["SMALLEST_API_KEY"],
)

stream = client.chat.completions.create(
    model="electron",
    messages=[{"role": "user", "content": "Tell me a fun fact."}],
    stream=True,
    stream_options={"include_usage": True},
)

text = ""
usage = None
for chunk in stream:
    # Content chunks carry a delta; the final usage chunk has empty choices.
    if chunk.choices and chunk.choices[0].delta.content:
        delta = chunk.choices[0].delta.content
        print(delta, end="", flush=True)
        text += delta
    # Only the final usage chunk has a non-null usage object.
    if chunk.usage is not None:
        usage = chunk.usage

print(f"\n\ntokens: {usage}")

Client disconnect behavior

If the client closes the connection mid-stream, the server still finalizes usage and records the tokens that were actually generated. You are billed only for what the model produced before the disconnect, with no double charge on retry.
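
A deliberate early disconnect might look like the sketch below. It assumes chunks shaped like the SDK objects used earlier (chunk.choices[0].delta.content) and a stream object that exposes close(), as the OpenAI Python SDK's stream does; read_prefix and FakeStream are hypothetical names, and FakeStream only stands in for a real network stream so the example runs offline.

```python
from types import SimpleNamespace


def read_prefix(stream, max_chars):
    """Read content deltas until max_chars is reached, then close the stream."""
    parts = []
    size = 0
    try:
        for chunk in stream:
            if not chunk.choices:
                continue  # e.g. the final usage chunk
            piece = chunk.choices[0].delta.content
            if not piece:
                continue
            parts.append(piece)
            size += len(piece)
            if size >= max_chars:
                break  # stop reading; the finally block drops the connection
    finally:
        close = getattr(stream, "close", None)
        if callable(close):
            close()
    return "".join(parts)


class FakeStream:
    """Offline stand-in for an SDK stream (illustration only)."""

    def __init__(self, pieces):
        self.pieces = pieces
        self.closed = False

    def __iter__(self):
        for p in self.pieces:
            delta = SimpleNamespace(content=p)
            yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

    def close(self):
        self.closed = True


fake = FakeStream(["Hello", ", ", "world", "!"])
prefix = read_prefix(fake, max_chars=6)
print(repr(prefix), fake.closed)
```

Closing in a finally block means the connection is released even if processing a chunk raises, which is the case where a silent mid-stream disconnect is most likely.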

Latency profile

  • Time to first token (TTFT): typically under 300 ms when the model is warm.
  • Per-token interval after the first token: a few milliseconds; full sentences arrive in tens of milliseconds.
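
To check these figures against your own deployment, you can time the stream on the client. The sketch below assumes a zero-argument callable (here called open_stream, a hypothetical name) that sends the request and returns an iterable of content strings, so the timer starts before the request goes out. The demo uses a fake generator with artificial sleeps, so its numbers say nothing about the real service.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_latency(
    open_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft_seconds, mean_gap_seconds) for one streamed response."""
    start = clock()
    stamps = []
    for piece in open_stream():
        if piece:  # ignore empty deltas such as the role-only first chunk
            stamps.append(clock())
    if not stamps:
        return None, None
    ttft = stamps[0] - start
    if len(stamps) < 2:
        return ttft, None
    mean_gap = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    return ttft, mean_gap


def fake_stream():
    """Stand-in generator with made-up delays (illustration only)."""
    time.sleep(0.02)
    yield "Hello"
    for piece in (" there", "!"):
        time.sleep(0.005)
        yield piece


ttft, gap = measure_latency(fake_stream)
print(f"ttft={ttft * 1000:.1f} ms, mean gap={gap * 1000:.1f} ms")
```

With the SDK, open_stream could wrap client.chat.completions.create(...) and yield chunk.choices[0].delta.content for each chunk that has choices.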

For voice-agent pipelines, start your TTS engine on the first delta.content chunk to mask end-to-end latency. See Best Practices and the Voice Agent cookbook.
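
One way to feed a TTS engine incrementally is to forward text as soon as a sentence boundary appears, instead of waiting for the whole reply. The sketch below is an assumption about how such a pipeline could be wired, not the approach the cookbook prescribes: iter_phrases is a hypothetical helper, and the boundary rule (., ! or ? followed by whitespace) is deliberately naive and will split on abbreviations such as "Dr. ".

```python
import re
from typing import Iterable, Iterator

# A sentence terminator followed by whitespace marks a speakable boundary.
_BOUNDARY = re.compile(r"[.!?]\s")


def iter_phrases(deltas: Iterable[str]) -> Iterator[str]:
    """Group streamed content deltas into sentence-sized pieces for TTS."""
    buffer = ""
    for piece in deltas:
        buffer += piece
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            cut = match.end()
            phrase = buffer[:cut].strip()
            buffer = buffer[cut:]
            if phrase:
                yield phrase
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


# Made-up deltas standing in for chunk.choices[0].delta.content values.
demo_deltas = ["Hel", "lo there. How", " are you", " today?"]
for phrase in iter_phrases(demo_deltas):
    print(phrase)  # a real pipeline would hand each phrase to its TTS engine
```

Because iter_phrases is a generator, the first phrase is available as soon as its boundary arrives, while later deltas are still streaming.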