Streaming


Streaming sends response chunks to the user as they’re generated, rather than waiting for the complete response. For voice agents, this is essential — users hear the first words immediately instead of waiting in silence.

Why Streaming Matters

| Approach | Time to First Audio | User Experience |
| --- | --- | --- |
| Non-streaming | 800–1500 ms | Awkward silence, then full response |
| Streaming | 200–400 ms | Natural conversation flow |

Basic Streaming

Set stream=True and yield each chunk.content for instant TTS playback.

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True  # Required for streaming
    )

    async for chunk in response:
        if chunk.content:
            yield chunk.content  # Sent to TTS immediately
```

Streaming with Tools

Collect tool calls while streaming, execute them, then stream the follow-up response.

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True,
        tools=self.tool_schemas
    )

    tool_calls = []

    # Stream first response
    async for chunk in response:
        if chunk.content:
            yield chunk.content
        if chunk.tool_calls:
            tool_calls.extend(chunk.tool_calls)

    # Handle tools if present
    if tool_calls:
        results = await self.tool_registry.execute(
            tool_calls=tool_calls, parallel=True
        )

        # Add tool calls and results to context
        self.context.add_messages([
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": tc.id,
                        "type": "function",
                        "function": {"name": tc.name, "arguments": str(tc.arguments)},
                    }
                    for tc in tool_calls
                ],
            },
            *[
                {"role": "tool", "tool_call_id": tc.id, "content": str(result)}
                for tc, result in zip(tool_calls, results)
            ],
        ])

        # Stream follow-up response
        final_response = await self.llm.chat(
            messages=self.context.messages, stream=True
        )

        async for chunk in final_response:
            if chunk.content:
                yield chunk.content
```

Chunking Strategies

Word-by-Word (Default)

LLMs typically stream tokens, which map roughly to words or word fragments:

```python
async for chunk in response:
    if chunk.content:
        yield chunk.content
```

Sentence Buffering

Buffer complete sentences for more natural speech boundaries:

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages, stream=True
    )

    buffer = ""
    endings = (". ", "! ", "? ")

    async for chunk in response:
        if chunk.content:
            buffer += chunk.content

            # Yield at the earliest sentence boundary in the buffer
            while True:
                positions = [p for p in (buffer.find(e) for e in endings) if p != -1]
                if not positions:
                    break
                cut = min(positions)
                sentence, buffer = buffer[:cut + 1], buffer[cut + 2:]
                yield sentence + " "

    # Yield remaining content
    if buffer.strip():
        yield buffer
```

Phrase Buffering

Buffer by phrase for smoother speech rhythm:

```python
import re

async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages, stream=True
    )

    buffer = ""
    min_phrase_length = 20  # Characters

    async for chunk in response:
        if chunk.content:
            buffer += chunk.content

            # Yield at punctuation once the buffer reaches minimum length
            if len(buffer) >= min_phrase_length:
                match = re.search(r'[,.:;!?]\s', buffer)
                if match:
                    phrase = buffer[:match.end()]
                    buffer = buffer[match.end():]
                    yield phrase

    # Yield remaining content
    if buffer.strip():
        yield buffer
```

Intermediate Feedback

Provide feedback while processing long operations:

```python
async def generate_response(self):
    response = await self.llm.chat(...)

    tool_calls = []
    async for chunk in response:
        if chunk.content:
            yield chunk.content
        if chunk.tool_calls:
            tool_calls.extend(chunk.tool_calls)

    if tool_calls:
        # Immediate feedback
        yield "One moment while I look that up."

        # Long operation
        results = await self.tool_registry.execute(tool_calls, parallel=True)

        # Continue with results
        # ...
```

Streaming Best Practices

Do: Set stream=True on every LLM call. Yield chunks as soon as they're available. Provide intermediate feedback during long operations. Keep responses concise, since shorter text is faster to speak.

Don't: Buffer the entire response before yielding. Leave users in silence for more than two seconds. Yield empty strings or whitespace-only chunks.
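The last point can be enforced with a small wrapper around any chunk stream. This is a sketch assuming chunks arrive as plain strings; non_empty_chunks is an illustrative name:

```python
import asyncio

async def non_empty_chunks(stream):
    """Drop empty or whitespace-only chunks before they reach TTS."""
    async for chunk in stream:
        if chunk and chunk.strip():
            yield chunk

async def demo():
    async def fake_stream():
        # Simulates an LLM stream containing blank chunks
        for c in ["Hello", "  ", "", "world"]:
            yield c
    return [c async for c in non_empty_chunks(fake_stream())]

result = asyncio.run(demo())
# result → ["Hello", "world"]
```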

Measuring Stream Performance

Track time-to-first-chunk:

```python
import logging
import time

logger = logging.getLogger(__name__)

async def generate_response(self):
    start = time.time()
    first_chunk_sent = False

    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True
    )

    async for chunk in response:
        if chunk.content:
            if not first_chunk_sent:
                ttfc = (time.time() - start) * 1000
                logger.info(f"Time to first chunk: {ttfc:.0f}ms")
                first_chunk_sent = True

            yield chunk.content
```