LLM Settings


The OpenAIClient is the unified interface for calling LLMs. It works with OpenAI by default, but supports any OpenAI-compatible endpoint by changing the base_url.

Basic Configuration

```python
import os

from smallestai.atoms.agent.clients.openai import OpenAIClient

self.llm = OpenAIClient(
    model="gpt-4o-mini",
    temperature=0.7,
    api_key=os.getenv("OPENAI_API_KEY"),
)
```

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | str | (required) | The model identifier (e.g., `gpt-4o-mini`, `llama-3.1-70b-versatile`) |
| `api_key` | str | `OPENAI_API_KEY` env | Your provider's API key |
| `base_url` | str | OpenAI's endpoint | Custom endpoint URL for other providers |
| `temperature` | float | 0.7 | Controls randomness. Lower = consistent, higher = creative |
| `max_tokens` | int | 1024 | Maximum tokens in the response |

Temperature

Temperature controls how creative versus predictable the model's output is:

  • 0.0–0.3: Consistent, factual. Best for support, FAQ bots.
  • 0.4–0.6: Balanced. Good for general conversation.
  • 0.7–1.0: Creative, varied. Better for sales, engagement.

```python
# Consistent support agent
self.llm = OpenAIClient(model="gpt-4o-mini", temperature=0.3)

# Engaging sales agent
self.llm = OpenAIClient(model="gpt-4o-mini", temperature=0.8)
```

Using Other Providers

Any provider with an OpenAI-compatible API works by setting base_url. This includes Groq, Together.ai, Fireworks, Anyscale, OpenRouter, and Azure OpenAI.

```python
import os

from smallestai.atoms.agent.clients.openai import OpenAIClient

# Example: Using Groq
self.llm = OpenAIClient(
    model="llama-3.1-70b-versatile",
    base_url="https://api.groq.com/openai/v1",
    api_key=os.getenv("GROQ_API_KEY"),
)
```

Just swap the base_url and api_key—your agent code stays the same.

Streaming

Voice agents must use streaming. Without it, users wait for the entire response before hearing anything.

```python
async def generate_response(self):
    response = await self.llm.chat(
        messages=self.context.messages,
        stream=True,
    )

    async for chunk in response:
        if chunk.content:
            yield chunk.content
```
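
Since the exact shape of `OpenAIClient`'s streamed response isn't documented here, this self-contained sketch substitutes a stub async generator for `self.llm.chat` to show the consumption pattern in isolation:

```python
import asyncio

# Stub async stream standing in for the client's streamed response,
# so the consumption pattern can be run without any SDK installed.
async def fake_stream():
    for piece in ["Hel", "lo", "!"]:
        yield piece

async def speak():
    parts = []
    async for chunk in fake_stream():
        parts.append(chunk)  # in a real agent, each chunk would be sent to TTS
    return "".join(parts)

print(asyncio.run(speak()))  # Hello!
```

Because chunks arrive incrementally, the TTS engine can begin speaking after the first chunk rather than waiting for the full response.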

Tool Calling

To enable function calling, pass tool schemas:

```python
response = await self.llm.chat(
    messages=self.context.messages,
    stream=True,
    tools=self.tool_registry.get_schemas(),
)
```
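
The exact format returned by `tool_registry.get_schemas()` isn't shown here; assuming it follows the standard OpenAI function-calling format, each entry would look roughly like this (the `get_weather` tool is purely illustrative):

```python
# Hypothetical example of an OpenAI-style function-calling schema --
# illustrative only, not the actual output of tool_registry.get_schemas().
get_weather_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}
```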

Error Handling

```python
import logging

from openai import RateLimitError, APIError

logger = logging.getLogger(__name__)

async def generate_response(self):
    try:
        response = await self.llm.chat(
            messages=self.context.messages,
            stream=True,
        )
        async for chunk in response:
            if chunk.content:
                yield chunk.content

    except RateLimitError:
        yield "I'm experiencing high demand. Please try again."

    except APIError as e:
        logger.error(f"LLM error: {e}")
        yield "I'm having trouble. Let me try again."
```
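
Beyond user-facing error messages, you may want to retry transient failures. A minimal, SDK-independent sketch of capped exponential backoff that could wrap the `chat` call:

```python
# Capped exponential backoff: a pure helper you could use between
# retries of self.llm.chat(). The base and cap values are illustrative.
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed)."""
    return min(cap, base * (2 ** attempt))
```

For example, attempts 0, 1, 2 wait 0.5s, 1s, 2s, and the delay never exceeds the cap regardless of attempt count.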

Voice Configuration

Agents also need a voice for text-to-speech. Waves is our recommended TTS engine—ultra-low latency, optimized for real-time telephony.

Basic Voice Setup

```python
from smallestai.atoms.models import CreateAgentRequest

agent = CreateAgentRequest(
    name="SupportAgent",
    synthesizer={
        "provider": "smallest",  # Use Waves
        "model": "waves_lightning_v2",
        "voice_id": "zorin",  # Male, professional
        "speed": 1.0,
    },
    # ... other config
)
```

Waves Voice IDs

| Voice ID | Description |
|---|---|
| `zorin` | Male, professional (recommended) |
| `emily` | Female, warm |
| `raj` | Male, Indian English |
| `aria` | Female, neutral |

For the complete Waves voice library with audio samples: → Waves Voice Models

Voice Cloning

Create custom voices from audio samples: → Waves Voice Cloning Guide

Third-Party Providers

OpenAI and ElevenLabs voices are also supported. Set provider to openai or elevenlabs and use their respective voice IDs.
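
Assuming the synthesizer config takes the same shape as the Waves example above, a third-party setup might look like this (the `voice_id` value is a placeholder, not a real ElevenLabs ID):

```python
# Hypothetical third-party synthesizer config -- field names mirror the
# Waves example; substitute a real voice ID from your ElevenLabs account.
synthesizer = {
    "provider": "elevenlabs",
    "voice_id": "YOUR_ELEVENLABS_VOICE_ID",
    "speed": 1.0,
}
```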


Tips

Set max_tokens to 100-200. Shorter responses mean faster audio playback. Guide conciseness in your prompt too.

The first LLM call has connection overhead. Send a tiny request in start() to warm up before the user speaks.

Configure a backup provider (e.g., Groq primary, OpenAI fallback) to handle rate limits or outages gracefully.
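
One way to sketch that fallback: keep an ordered provider list and build the client from the first provider that hasn't failed. Field names and the selection helper here are illustrative, not SDK API:

```python
# Ordered provider preference: Groq primary, OpenAI fallback.
# Each entry holds the kwargs you'd pass to OpenAIClient.
PROVIDERS = [
    {"name": "groq", "model": "llama-3.1-70b-versatile",
     "base_url": "https://api.groq.com/openai/v1"},
    {"name": "openai", "model": "gpt-4o-mini", "base_url": None},
]

def next_provider(failed):
    """Return the first provider whose name isn't in the `failed` set."""
    for provider in PROVIDERS:
        if provider["name"] not in failed:
            return provider
    return None
```

On a `RateLimitError` from the primary, add its name to the failed set and rebuild the client from `next_provider`; since every provider speaks the OpenAI-compatible API, the rest of the agent code is unchanged.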