Bring Your Own Model


For privacy, cost control, or specialized models, you can run LLMs locally or on your own servers. The OpenAIClient works with any endpoint that implements the OpenAI chat completions API.

Complete Example

Here’s a full agent using a local Ollama model:

```python
from smallestai.atoms.agent.nodes import OutputAgentNode
from smallestai.atoms.agent.clients.openai import OpenAIClient
from smallestai.atoms.agent.server import AtomsApp
from smallestai.atoms.agent.session import AgentSession


class LocalAgent(OutputAgentNode):
    def __init__(self):
        super().__init__(name="local-agent")

        # Connect to your local model
        self.llm = OpenAIClient(
            model="llama3",
            base_url="http://localhost:11434/v1",
            api_key="ollama"  # Not required for Ollama
        )

        self.context.add_message(
            "system",
            "You are a helpful assistant running on local hardware."
        )

    async def generate_response(self):
        response = await self.llm.chat(
            messages=self.context.messages,
            stream=True
        )
        async for chunk in response:
            if chunk.content:
                yield chunk.content


async def on_start(session: AgentSession):
    session.add_node(LocalAgent())
    await session.start()
    await session.wait_until_complete()


if __name__ == "__main__":
    app = AtomsApp(setup_handler=on_start)
    app.run()
```

Requirements

Your model server must implement the OpenAI Chat Completions API:

| Feature | Required | Notes |
| --- | --- | --- |
| `/chat/completions` endpoint | Yes | Standard OpenAI format |
| Streaming | Yes | `stream=True` must work |
| Tool calling | For tools | OpenAI-format function calling |
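
Before wiring an endpoint into an agent, it can be worth sending one streaming request by hand to confirm the first two rows. A minimal stdlib-only probe sketch (the base URL and model name are placeholders for your server):

```python
import json
import urllib.request


def chat_completions_url(base_url: str) -> str:
    """Join a base URL like http://localhost:11434/v1 to the chat endpoint."""
    return base_url.rstrip("/") + "/chat/completions"


def probe_payload(model: str) -> dict:
    """Minimal streaming request body in the OpenAI chat completions format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "stream": True,
    }


if __name__ == "__main__":
    url = chat_completions_url("http://localhost:11434/v1")
    req = urllib.request.Request(
        url,
        data=json.dumps(probe_payload("llama3")).encode(),
        headers={"Content-Type": "application/json"},
    )
    # A streaming-capable server answers with "data: {...}" SSE lines
    with urllib.request.urlopen(req, timeout=10) as resp:
        for line in resp:
            print(line.decode().strip())
```

If the response is a single JSON body instead of `data:` lines, the server is ignoring `stream=True` and streaming won't work in the agent either.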

Custom Endpoints

Connect to any custom model server:

```python
import os

from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="your-model-name",
    base_url="https://your-server.example.com/v1",
    api_key=os.getenv("YOUR_API_KEY")
)
```

Ollama

Ollama is the easiest way to run models locally. It handles model downloads and serving automatically.

Setup

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3

# Start the server (runs on port 11434)
ollama serve
```

Usage

```python
from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="llama3",
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Doesn't require a real key
)
```

vLLM

vLLM is a high-performance inference server for production workloads.

Setup

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 8000
```

Usage

```python
from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="meta-llama/Llama-3-8B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="vllm"
)
```

LM Studio

LM Studio provides a desktop UI for running models locally.

  1. Download from lmstudio.ai
  2. Load a model
  3. Start the local server (Settings → Local Server)

```python
from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="local-model",
    base_url="http://localhost:1234/v1",
    api_key="lmstudio"
)
```

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Connection refused | Server not running | Start Ollama/vLLM |
| Model not found | Wrong name | Check `ollama list` or server logs |
| No streaming | Server config | Ensure streaming is enabled |
| Tool calls ignored | Model limitation | Use a larger model or cloud fallback |
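
The first row can be caught before the agent even starts. A hypothetical pre-flight check (stdlib only, the function name is illustrative) that turns a dead server into a readable hint:

```python
import socket
from urllib.parse import urlparse


def preflight_hint(base_url: str) -> str:
    """Return 'ok' if something is listening at base_url, else a hint."""
    parts = urlparse(base_url)
    host = parts.hostname or "localhost"
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        # Only checks TCP reachability, not API correctness
        with socket.create_connection((host, port), timeout=2):
            return "ok"
    except OSError:
        return f"Connection refused on {host}:{port} - is the server running?"
```

Running this against `http://localhost:11434/v1` before constructing the client gives a clearer error than a failed chat call mid-conversation.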

Tips

Ollama is great for local development. For production, consider vLLM or a cloud provider for reliability.

Local models can fail. Configure a cloud fallback (e.g., OpenAI) to catch errors and keep the conversation going.
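
One way to sketch that fallback, with stand-in async clients (in a real agent the two callables would wrap your local and cloud `OpenAIClient` instances; all names here are illustrative):

```python
import asyncio


async def chat_with_fallback(primary, fallback, messages):
    """Try the local model first; on any error, retry against the cloud model."""
    try:
        return await primary(messages)
    except Exception as exc:
        print(f"local model failed ({exc!r}), falling back")
        return await fallback(messages)


# Stand-in clients for illustration
async def flaky_local(messages):
    raise ConnectionError("Ollama not running")


async def cloud(messages):
    return "cloud reply"


if __name__ == "__main__":
    reply = asyncio.run(chat_with_fallback(flaky_local, cloud, []))
    print(reply)
```

Catching a narrower exception type than `Exception` (e.g. connection and timeout errors only) avoids masking bugs in your own code as "local model down".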

Not all local models support function calling. Test your tools or use a model known to support them.
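
A quick way to test is to send one tool-enabled request and check whether the assistant message comes back with OpenAI-format `tool_calls`. A sketch over the raw response message (the dict shapes below follow the standard chat completions format; the weather tool is a made-up example):

```python
def made_tool_call(message: dict) -> bool:
    """True if an OpenAI-format assistant message invoked a tool."""
    return bool(message.get("tool_calls"))


# A model that supports tools answers with a tool_calls list...
with_tool = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }],
}

# ...while a model that ignores tools answers with plain text
plain = {"role": "assistant", "content": "It is probably sunny."}
```

If a model keeps answering in plain text even for obviously tool-shaped prompts, switch to a larger local model or route tool-using turns to a cloud provider.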