Bring Your Own Model


For privacy, cost control, or specialized models, you can run LLMs locally or on your own servers. The OpenAIClient works with any endpoint that implements the OpenAI chat completions API.

Complete Example

Here’s a full agent using a local Ollama model:

```python
from smallestai.atoms.agent.nodes import OutputAgentNode
from smallestai.atoms.agent.clients.openai import OpenAIClient
from smallestai.atoms.agent.server import AtomsApp
from smallestai.atoms.agent.session import AgentSession


class LocalAgent(OutputAgentNode):
    def __init__(self):
        super().__init__(name="local-agent")

        # Connect to your local model
        self.llm = OpenAIClient(
            model="llama3",
            base_url="http://localhost:11434/v1",
            api_key="ollama"  # Not required for Ollama
        )

        self.context.add_message(
            "system",
            "You are a helpful assistant running on local hardware."
        )

    async def generate_response(self):
        response = await self.llm.chat(
            messages=self.context.messages,
            stream=True
        )
        async for chunk in response:
            if chunk.content:
                yield chunk.content


async def on_start(session: AgentSession):
    session.add_node(LocalAgent())
    await session.start()
    await session.wait_until_complete()


if __name__ == "__main__":
    app = AtomsApp(setup_handler=on_start)
    app.run()
```

Requirements

Your model server must implement the OpenAI Chat Completions API:

| Feature | Required | Notes |
| --- | --- | --- |
| `/chat/completions` endpoint | Yes | Standard OpenAI format |
| Streaming | Yes | `stream=True` must work |
| Tool calling | For tools | OpenAI-format function calling |
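
Before wiring an endpoint into an agent, it can be worth sending one streaming request by hand to confirm the first two rows. A minimal stdlib-only probe sketch (the base URL and model name are placeholders for your server):

```python
import json
import urllib.request


def chat_completions_url(base_url: str) -> str:
    """Join a base URL like http://localhost:11434/v1 to the chat endpoint."""
    return base_url.rstrip("/") + "/chat/completions"


def probe_payload(model: str) -> dict:
    """Minimal streaming request body in the OpenAI chat completions format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "stream": True,
    }


if __name__ == "__main__":
    url = chat_completions_url("http://localhost:11434/v1")
    req = urllib.request.Request(
        url,
        data=json.dumps(probe_payload("llama3")).encode(),
        headers={"Content-Type": "application/json"},
    )
    # A streaming-capable server answers with "data: {...}" SSE lines
    with urllib.request.urlopen(req, timeout=10) as resp:
        for line in resp:
            print(line.decode().strip())
```

If the response is a single JSON body instead of `data:` lines, the server is ignoring `stream=True` and streaming won't work in the agent either.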

Custom Endpoints

Connect to any custom model server:

```python
import os

from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="your-model-name",
    base_url="https://your-server.example.com/v1",
    api_key=os.getenv("YOUR_API_KEY")
)
```

Ollama

Ollama is the easiest way to run models locally. It handles model downloads and serving automatically.

Setup

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3

# Start the server (runs on port 11434)
ollama serve
```

Usage

```python
from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="llama3",
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Doesn't require a real key
)
```

vLLM

vLLM is a high-performance inference server for production workloads.

Setup

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 8000
```

Usage

```python
from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="meta-llama/Llama-3-8B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="vllm"
)
```

LM Studio

LM Studio provides a desktop UI for running models locally.

  1. Download from lmstudio.ai
  2. Load a model
  3. Start the local server (Settings → Local Server)

```python
from smallestai.atoms.agent.clients.openai import OpenAIClient

llm = OpenAIClient(
    model="local-model",
    base_url="http://localhost:1234/v1",
    api_key="lmstudio"
)
```

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Connection refused | Server not running | Start Ollama/vLLM |
| Model not found | Wrong name | Check `ollama list` or server logs |
| No streaming | Server config | Ensure streaming is enabled |
| Tool calls ignored | Model limitation | Use a larger model or cloud fallback |
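
The first row can be caught before the agent even starts. A hypothetical pre-flight check (stdlib only, the function name is illustrative) that turns a dead server into a readable hint:

```python
import socket
from urllib.parse import urlparse


def preflight_hint(base_url: str) -> str:
    """Return 'ok' if something is listening at base_url, else a hint."""
    parts = urlparse(base_url)
    host = parts.hostname or "localhost"
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        # Only checks TCP reachability, not API correctness
        with socket.create_connection((host, port), timeout=2):
            return "ok"
    except OSError:
        return f"Connection refused on {host}:{port} - is the server running?"
```

Running this against `http://localhost:11434/v1` before constructing the client gives a clearer error than a failed chat call mid-conversation.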

Tips

Ollama is great for local development. For production, consider vLLM or a cloud provider for reliability.

Local models can fail. Configure a cloud fallback (e.g., OpenAI) to catch errors and keep the conversation going.
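
One way to sketch that fallback, with stand-in async clients (in a real agent the two callables would wrap your local and cloud `OpenAIClient` instances; all names here are illustrative):

```python
import asyncio


async def chat_with_fallback(primary, fallback, messages):
    """Try the local model first; on any error, retry against the cloud model."""
    try:
        return await primary(messages)
    except Exception as exc:
        print(f"local model failed ({exc!r}), falling back")
        return await fallback(messages)


# Stand-in clients for illustration
async def flaky_local(messages):
    raise ConnectionError("Ollama not running")


async def cloud(messages):
    return "cloud reply"


if __name__ == "__main__":
    reply = asyncio.run(chat_with_fallback(flaky_local, cloud, []))
    print(reply)
```

Catching a narrower exception type than `Exception` (e.g. connection and timeout errors only) avoids masking bugs in your own code as "local model down".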

Not all local models support function calling. Test your tools or use a model known to support them.
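
A quick way to test is to send one tool-enabled request and check whether the assistant message comes back with OpenAI-format `tool_calls`. A sketch over the raw response message (the dict shapes below follow the standard chat completions format; the weather tool is a made-up example):

```python
def made_tool_call(message: dict) -> bool:
    """True if an OpenAI-format assistant message invoked a tool."""
    return bool(message.get("tool_calls"))


# A model that supports tools answers with a tool_calls list...
with_tool = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }],
}

# ...while a model that ignores tools answers with plain text
plain = {"role": "assistant", "content": "It is probably sunny."}
```

If a model keeps answering in plain text even for obviously tool-shaped prompts, switch to a larger local model or route tool-using turns to a cloud provider.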