Streaming


Streaming sends response chunks to the user as they’re generated, rather than waiting for the complete response. For voice agents, this is essential — users hear the first words immediately instead of waiting in silence.

Why Streaming Matters

| Approach | Time to First Audio | User Experience |
| --- | --- | --- |
| Non-streaming | 800–1500 ms | Awkward silence, then full response |
| Streaming | 200–400 ms | Natural conversation flow |

Basic Streaming

Set stream=True and yield each chunk.content for instant TTS playback.

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True  # Required for streaming
    )

    async for chunk in response:
        if chunk.content:
            yield chunk.content  # Sent to TTS immediately
```

Streaming with Tools

Collect tool calls while streaming, execute them, then stream the follow-up response.

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True,
        tools=self.tool_schemas
    )

    tool_calls = []

    # Stream first response
    async for chunk in response:
        if chunk.content:
            yield chunk.content
        if chunk.tool_calls:
            tool_calls.extend(chunk.tool_calls)

    # Handle tools if present
    if tool_calls:
        results = await self.tool_registry.execute(
            tool_calls=tool_calls, parallel=True
        )

        # Add tool calls and results to context
        self.context.add_messages([
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": tc.id,
                        "type": "function",
                        "function": {"name": tc.name, "arguments": str(tc.arguments)},
                    }
                    for tc in tool_calls
                ],
            },
            *[
                {"role": "tool", "tool_call_id": tc.id, "content": str(result)}
                for tc, result in zip(tool_calls, results)
            ],
        ])

        # Stream follow-up response
        final_response = await self.llm.chat(
            messages=self.context.messages, stream=True
        )

        async for chunk in final_response:
            if chunk.content:
                yield chunk.content
```

Chunking Strategies

Word-by-Word (Default)

LLMs typically stream tokens, which map roughly to words or word fragments:

```python
async for chunk in response:
    if chunk.content:
        yield chunk.content
```

Sentence Buffering

Buffer complete sentences for more natural speech boundaries:

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages, stream=True
    )

    buffer = ""
    endings = (". ", "! ", "? ")

    async for chunk in response:
        if chunk.content:
            buffer += chunk.content

            # Yield at the earliest sentence boundary in the buffer
            while True:
                positions = [p for p in (buffer.find(e) for e in endings) if p != -1]
                if not positions:
                    break
                cut = min(positions)
                sentence, buffer = buffer[:cut + 1], buffer[cut + 2:]
                yield sentence + " "

    # Yield remaining content
    if buffer.strip():
        yield buffer
```

Phrase Buffering

Buffer by phrase for smoother speech rhythm:

```python
import re

async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages, stream=True
    )

    buffer = ""
    min_phrase_length = 20  # Characters

    async for chunk in response:
        if chunk.content:
            buffer += chunk.content

            # Yield at punctuation once the buffer reaches minimum length
            if len(buffer) >= min_phrase_length:
                match = re.search(r'[,.:;!?]\s', buffer)
                if match:
                    phrase = buffer[:match.end()]
                    buffer = buffer[match.end():]
                    yield phrase

    # Yield remaining content
    if buffer.strip():
        yield buffer
```

Intermediate Feedback

Provide feedback while processing long operations:

```python
async def generate_response(self):
    response = await self.llm.chat(...)

    tool_calls = []
    async for chunk in response:
        if chunk.content:
            yield chunk.content
        if chunk.tool_calls:
            tool_calls.extend(chunk.tool_calls)

    if tool_calls:
        # Immediate feedback
        yield "One moment while I look that up."

        # Long operation
        results = await self.tool_registry.execute(tool_calls, parallel=True)

        # Continue with results
        # ...
```

Streaming Best Practices

Do: Set stream=True on every LLM call. Yield chunks as soon as they're available. Provide intermediate feedback during long operations. Keep responses concise, since shorter text is faster to speak.

Don't: Buffer the entire response before yielding. Leave users in silence for more than two seconds. Yield empty strings or whitespace-only chunks.
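The last point can be enforced with a small wrapper around any chunk stream. This is a sketch assuming chunks arrive as plain strings; non_empty_chunks is an illustrative name:

```python
import asyncio

async def non_empty_chunks(stream):
    """Drop empty or whitespace-only chunks before they reach TTS."""
    async for chunk in stream:
        if chunk and chunk.strip():
            yield chunk

async def demo():
    async def fake_stream():
        # Simulates an LLM stream containing blank chunks
        for c in ["Hello", "  ", "", "world"]:
            yield c
    return [c async for c in non_empty_chunks(fake_stream())]

result = asyncio.run(demo())
# result → ["Hello", "world"]
```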

Measuring Stream Performance

Track time-to-first-chunk:

```python
import logging
import time

logger = logging.getLogger(__name__)

async def generate_response(self):
    start = time.time()
    first_chunk_sent = False

    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True
    )

    async for chunk in response:
        if chunk.content:
            if not first_chunk_sent:
                ttfc = (time.time() - start) * 1000
                logger.info(f"Time to first chunk: {ttfc:.0f}ms")
                first_chunk_sent = True

            yield chunk.content
```