---
title: Streaming
description: Stream responses to minimize latency and improve user experience.
---

Streaming sends response chunks to the user as they're generated, rather than waiting for the complete response. For voice agents, this is essential: users hear the first words immediately instead of waiting in silence.

## Why Streaming Matters

| Approach      | Time to First Audio | User Experience                     |
| ------------- | ------------------- | ----------------------------------- |
| Non-streaming | 800–1500ms          | Awkward silence, then full response |
| Streaming     | 200–400ms           | Natural conversation flow           |

## Basic Streaming

Set `stream=True` and `yield` each `chunk.content` for instant TTS playback.

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True  # Required for streaming
    )

    async for chunk in response:
        if chunk.content:
            yield chunk.content  # Sent to TTS immediately
```

## Streaming with Tools

Collect tool calls while streaming, execute them, then stream the follow-up response.

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True,
        tools=self.tool_schemas
    )

    tool_calls = []

    # Stream first response
    async for chunk in response:
        if chunk.content:
            yield chunk.content
        if chunk.tool_calls:
            tool_calls.extend(chunk.tool_calls)

    # Handle tools if present
    if tool_calls:
        results = await self.tool_registry.execute(
            tool_calls=tool_calls,
            parallel=True
        )

        # Add tool calls and results to context
        self.context.add_messages([
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": tc.id,
                        "type": "function",
                        "function": {"name": tc.name, "arguments": str(tc.arguments)},
                    }
                    for tc in tool_calls
                ],
            },
            *[
                {"role": "tool", "tool_call_id": tc.id, "content": str(result)}
                for tc, result in zip(tool_calls, results)
            ],
        ])

        # Stream follow-up response
        final_response = await self.llm.chat(
            messages=self.context.messages,
            stream=True
        )
        async for chunk in final_response:
            if chunk.content:
                yield chunk.content
```

## Chunking Strategies

### Word-by-Word (Default)

LLMs typically stream tokens, which map roughly to words or word fragments:

```python
async for chunk in response:
    if chunk.content:
        yield chunk.content
```

### Sentence Buffering

Buffer complete sentences for more natural speech boundaries:

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True
    )

    buffer = ""
    async for chunk in response:
        if chunk.content:
            buffer += chunk.content

            # Yield on sentence boundaries, earliest first, so two
            # sentences never get merged into one chunk
            while any(end in buffer for end in [". ", "! ", "? "]):
                pos, end = min(
                    (buffer.find(end), end)
                    for end in [". ", "! ", "? "]
                    if end in buffer
                )
                sentence, buffer = buffer[:pos], buffer[pos + len(end):]
                yield sentence + end.strip() + " "

    # Yield remaining content
    if buffer.strip():
        yield buffer
```

### Phrase Buffering

Buffer by phrase for smoother speech rhythm:

```python
import re

async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True
    )

    buffer = ""
    min_phrase_length = 20  # Characters

    async for chunk in response:
        if chunk.content:
            buffer += chunk.content

            # Yield at pause punctuation once the buffer reaches
            # the minimum length
            if len(buffer) >= min_phrase_length:
                match = re.search(r'[,.:;!?]\s', buffer)
                if match:
                    phrase = buffer[:match.end()]
                    buffer = buffer[match.end():]
                    yield phrase

    if buffer.strip():
        yield buffer
```

## Intermediate Feedback

Provide feedback while processing long operations:

```python
async def generate_response(self):
    response = await self.llm.chat(...)

    tool_calls = []
    async for chunk in response:
        if chunk.content:
            yield chunk.content
        if chunk.tool_calls:
            tool_calls.extend(chunk.tool_calls)

    if tool_calls:
        # Immediate feedback
        yield "One moment while I look that up."

        # Long operation
        results = await self.tool_registry.execute(tool_calls, parallel=True)

        # Continue with results
        # ...
```
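The filler above is spoken even when the tools return almost instantly, which can feel chatty. One refinement is to delay it: only speak the filler if execution is still running after a short grace period. A minimal sketch of that pattern, reusing `self.tool_registry.execute` from the examples above; the 0.8-second threshold is an assumed value to tune, not a recommendation from any particular framework:

```python
import asyncio

async def generate_response(self):
    # ... stream the first response and collect tool_calls as above ...

    if tool_calls:
        # Start tool execution without blocking
        task = asyncio.create_task(
            self.tool_registry.execute(tool_calls, parallel=True)
        )

        # Speak the filler only if the tools are still running
        # after the grace period (0.8s is an assumed value)
        done, _ = await asyncio.wait({task}, timeout=0.8)
        if not done:
            yield "One moment while I look that up."

        results = await task
        # Continue with results as before
```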
## Streaming Best Practices

**Do:**

- Always set `stream=True` for LLM calls
- Yield chunks as soon as they're available
- Provide intermediate feedback during long operations
- Keep responses concise, since shorter means faster to speak

**Don't:**

- Buffer the entire response before yielding
- Leave users in silence for more than 2 seconds
- Yield empty strings or whitespace-only chunks

## Measuring Stream Performance

Track time-to-first-chunk:

```python
import logging
import time

logger = logging.getLogger(__name__)

async def generate_response(self):
    start = time.time()
    first_chunk_sent = False

    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True
    )

    async for chunk in response:
        if chunk.content:
            if not first_chunk_sent:
                ttfc = (time.time() - start) * 1000
                logger.info(f"Time to first chunk: {ttfc:.0f}ms")
                first_chunk_sent = True
            yield chunk.content
```
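Per-call log lines are hard to act on in aggregate. If you want rough latency percentiles across many calls, a small in-process tracker is enough. This is a sketch under stated assumptions: `LatencyTracker` is a hypothetical helper, not part of any framework shown here, and the nearest-rank p95 is an approximation:

```python
import statistics

class LatencyTracker:
    """Collect time-to-first-chunk samples and report rough percentiles.

    Hypothetical helper for illustration; not part of any framework.
    """

    def __init__(self):
        self.samples_ms: list[float] = []

    def record(self, ttfc_ms: float) -> None:
        self.samples_ms.append(ttfc_ms)

    def summary(self) -> dict:
        if not self.samples_ms:
            return {}
        ordered = sorted(self.samples_ms)
        # Nearest-rank approximation for p95
        p95_index = round(0.95 * (len(ordered) - 1))
        return {
            "count": len(ordered),
            "p50_ms": statistics.median(ordered),
            "p95_ms": ordered[p95_index],
            "max_ms": ordered[-1],
        }


# Usage: feed it the ttfc value computed in the example above
tracker = LatencyTracker()
tracker.record(240.0)
tracker.record(310.0)
print(tracker.summary())
```

In a long-running process, consider windowing the samples (for example, per hour) so the summary reflects current latency rather than the process's entire history.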