Best Practices | Smallest AI Docs

A short, opinionated guide to using Electron well.

Prompt structure for cache hits

Stable content first, variable content last. The prefix cache matches identical token prefixes — if you put a per-request value early in the prompt, nothing after it can match the cache.

# ✅ stable prefix → caches well
messages = [
    {"role": "system", "content": SYSTEM_PROMPT_VERBATIM},     # identical across users
    {"role": "user",   "content": RAG_CONTEXT_BLOCK},          # identical across the session
    {"role": "user",   "content": user_question},              # varies, so it goes last
]

# ❌ per-request value in system prompt → no cache hits
messages = [
    {"role": "system", "content": f"You're helping {user_name}. {SYSTEM_PROMPT}"},
    ...
]

See Prefix Caching for the full guide.
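
To make the ordering rule concrete, here is a minimal sketch that counts how many leading messages two requests have in common. It compares whole messages only as a stand-in: the real cache matches token prefixes, and `shared_prefix_len`, the two builder functions, and the sample prompt are hypothetical names used for illustration.

```python
SYSTEM_PROMPT = "You are a helpful support assistant."  # hypothetical prompt


def shared_prefix_len(a, b):
    """Count the leading messages that are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stable_first(user_name, question):
    # Per-user details live in the last message, so the system prompt stays identical.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"(My name is {user_name}.) {question}"},
    ]


def variable_first(user_name, question):
    # The user name changes the very first message, so nothing after it can match.
    return [
        {"role": "system", "content": f"You're helping {user_name}. {SYSTEM_PROMPT}"},
        {"role": "user", "content": question},
    ]


good = shared_prefix_len(
    stable_first("Asha", "Where is my order?"),
    stable_first("Ben", "How do I reset my password?"),
)
bad = shared_prefix_len(
    variable_first("Asha", "Where is my order?"),
    variable_first("Ben", "How do I reset my password?"),
)
print(good, bad)
```

With the stable layout the two requests share the system message; with the variable layout they share nothing.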

Always pass `stream_options.include_usage` when streaming

client.chat.completions.create(
    model="electron",
    messages=[...],
    stream=True,
    stream_options={"include_usage": True},
)

Without it, you can’t tell how many tokens were generated when the client disconnects mid-stream. With it, the server appends a final usage chunk so your accounting is exact.

Tool calls in voice agents — speak the filler

If you’re streaming Electron’s output into a TTS engine for a voice agent:

1. As soon as `delta.content` chunks arrive, feed them to TTS.
2. When `delta.tool_calls` chunks arrive, kick off the actual tool execution in parallel with TTS; don’t wait for the filler to finish speaking.
3. Append `tool` role messages with results and continue.

The user hears “Let me check the weather for you…” while your weather API resolves in the background. Perceived latency drops by hundreds of milliseconds. See Tool Calling: voice-agent pattern.
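
The pattern can be sketched with `asyncio` roughly as follows. This is a simplified illustration, not Electron's SDK: the event dicts, `speak`, `run_tool`, and `handle_turn` are all hypothetical, and each tool call arrives here as one complete event, whereas real `delta.tool_calls` chunks may deliver arguments in fragments that need accumulating first.

```python
import asyncio
import json


async def speak(text, spoken):
    # Hypothetical TTS stand-in: records the text instead of producing audio.
    await asyncio.sleep(0.01)
    spoken.append(text)


async def run_tool(name, args):
    # Hypothetical tool stand-in for a real API call.
    await asyncio.sleep(0.01)
    return {"tool": name, "args": args, "result": "sunny"}


async def handle_turn(events):
    spoken = []
    tts_queue = asyncio.Queue()
    tool_tasks = {}

    async def tts_worker():
        # Speaks queued text in arrival order until it sees the None sentinel.
        while True:
            text = await tts_queue.get()
            if text is None:
                return
            await speak(text, spoken)

    worker = asyncio.create_task(tts_worker())
    for event in events:
        if "content" in event:
            tts_queue.put_nowait(event["content"])
        elif "tool_call" in event:
            call = event["tool_call"]
            # Start the tool immediately; it runs while the filler is spoken.
            tool_tasks[call["id"]] = asyncio.create_task(
                run_tool(call["name"], json.loads(call["arguments"]))
            )
    tts_queue.put_nowait(None)
    await worker

    tool_messages = []
    for call_id, task in tool_tasks.items():
        result = await task
        tool_messages.append(
            {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}
        )
    return spoken, tool_messages


events = [
    {"content": "Let me check "},
    {"content": "the weather for you."},
    {"tool_call": {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Pune"}'}},
]
spoken, tool_messages = asyncio.run(handle_turn(events))
```

The single TTS worker keeps speech in order, while tool tasks are scheduled as soon as they are seen rather than after speech finishes.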

Concurrency

Each Electron plan caps the number of concurrent in-flight requests:

Standard plan: 3 concurrent. Bursts up to this limit are safe; requests over it return HTTP 429 with `Concurrency limit reached`.
Enterprise plan: 20 concurrent.

If you need to fire more in parallel, batch with a semaphore or queue:

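import asyncio

# Assumes `async_client` is an async OpenAI-compatible client created elsewhere.
sem = asyncio.Semaphore(3)   # match your plan's concurrency

async def safe_chat(messages):
    # Wait for a free slot so no more than 3 requests are in flight at once.
    async with sem:
        return await async_client.chat.completions.create(model="electron", messages=messages)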
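
To see the semaphore hold the line under fan-out, here is a self-contained sketch that replaces the API call with a hypothetical `fake_chat` stand-in and records the peak number of in-flight calls. `MAX_CONCURRENT`, `fake_chat`, and `run_batch` are illustrative names, not part of any SDK.

```python
import asyncio

MAX_CONCURRENT = 3  # match your plan's concurrency


async def fake_chat(i, state):
    # Hypothetical stand-in for the real API call; tracks in-flight requests.
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)
    state["in_flight"] -= 1
    return f"reply {i}"


async def run_batch(n):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"in_flight": 0, "peak": 0}

    async def bounded(i):
        # Each task waits here until one of the slots frees up.
        async with sem:
            return await fake_chat(i, state)

    # gather returns results in the order the tasks were passed in.
    replies = await asyncio.gather(*(bounded(i) for i in range(n)))
    return replies, state["peak"]


replies, peak = asyncio.run(run_batch(10))
print(len(replies), peak)
```

All ten calls complete, but never more than three run at the same time.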

Retry on `502` / `503` only

| Status | Retry? |
|---|---|
| `400` | ❌ Bad request; fix the input |
| `401` / `403` | ❌ Credential or access issue |
| `429` | ⚠️ Back off, then retry (rate limit or concurrency hit) |
| `502` | ✅ Upstream blip; retry with backoff |
| `503` | ✅ Upstream overloaded; retry with backoff |

Use exponential backoff (start ~250 ms, double up to ~8 s, cap at ~5 attempts). Always carry the request_id from the failed response so support can trace it if you escalate.
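
One way to encode this policy is sketched below. It is an illustration under assumptions: `send` is a hypothetical callable returning `(status, request_id, body)`, `retries` is read as the number of retries after the first attempt, and jitter is omitted for brevity.

```python
import time

RETRYABLE = {429, 502, 503}


def should_retry(status):
    """True for the statuses worth retrying after a backoff."""
    return status in RETRYABLE


def backoff_delays(base=0.25, cap=8.0, retries=5):
    """Seconds to wait before each retry: doubles from `base`, never above `cap`."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def call_with_retry(send, retries=5, sleep=time.sleep):
    """Call `send` until it succeeds, a non-retryable status appears, or retries run out."""
    delays = backoff_delays(retries=retries)
    attempt = 0
    while True:
        status, request_id, body = send()
        if status < 400:
            return body
        if not should_retry(status) or attempt >= len(delays):
            # Keep the request_id in the error so support can trace the failure.
            raise RuntimeError(f"request failed: status={status} request_id={request_id}")
        sleep(delays[attempt])
        attempt += 1


# Demo with canned responses and a recording sleep function instead of real waits.
responses = iter([(503, "req-1", None), (502, "req-2", None), (200, "req-3", "ok")])
slept = []
result = call_with_retry(lambda: next(responses), sleep=slept.append)
print(result, slept)
```

With the defaults, the delays only reach 4 s; the 8 s cap comes into play if you allow more retries.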

Use `request_id` in every log line

Every response sets an `X-Request-Id` header and includes `request_id` in error envelopes. Capture and log it on every call: it’s the only way support can trace a specific request through the system.

import logging
logger = logging.getLogger(__name__)

resp = client.with_raw_response.chat.completions.create(model="electron", messages=[...])
request_id = resp.http_response.headers.get("x-request-id")
logger.info("chat completion", extra={"request_id": request_id, "user_id": ...})

Keep tool descriptions short

Every tool description and parameter schema gets sent as input tokens on every turn. A 200-token tool catalog gets re-billed (at cache rates after the first turn) every time you call the model. Aim for:

One sentence of intent per tool.
Tight parameter description fields — name + brief purpose.
Don’t include examples in the tool definition unless they materially improve calling behavior. Put examples in the system prompt where they cache once.
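
As an illustration of these guidelines, a compact tool definition might look like the sketch below. It assumes Electron accepts the OpenAI-style `tools` schema (the docs describe it as OpenAI-compatible); `get_weather` and its parameters are hypothetical.

```python
import json

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        # One sentence of intent, no examples.
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit",
                },
            },
            "required": ["city"],
        },
    },
}

# Rough size check: serialized length is only a proxy for token count.
approx_chars = len(json.dumps(weather_tool, separators=(",", ":")))
print(approx_chars)
```

Tracking the serialized size of your tool catalog over time is a cheap way to notice when descriptions start to bloat.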

Use `seed` for reproducibility — but don’t over-rely

Electron honors `seed` for best-effort determinism. The same seed with the same input usually yields the same output, but this is not a hard guarantee: model deployments, batching, and version updates can cause drift. For test fixtures, prefer `temperature: 0` over relying solely on `seed`.

Cost control

Prefix caching is the single biggest lever. Put stable content first.
`max_tokens` caps output cost. Set it to the smallest value that still lets the model finish naturally.
`stop` sequences can end generation early when the model emits a known terminator.
For agent workflows, bound tool-call chains: a runaway agent that calls itself in a loop can blow through your budget.
Monitor `usage.prompt_tokens_details.cached_tokens` in your logs to confirm caching is working.
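
For the monitoring point, a small helper can turn the usage fields into a single number to log. This sketch assumes the usage object has already been converted to a plain dict; `cached_fraction` is a hypothetical name.

```python
def cached_fraction(usage):
    """Fraction of prompt tokens served from the prefix cache (0.0 if unknown)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # The details block may be missing or null when nothing was cached.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


sample_usage = {
    "prompt_tokens": 1000,
    "completion_tokens": 50,
    "prompt_tokens_details": {"cached_tokens": 800},
}
print(cached_fraction(sample_usage))
```

A fraction that stays near zero across a session with a fixed system prompt suggests something variable has crept into the start of the prompt.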

When to choose another product

Electron is the right call for:

OpenAI-compatible chat/agent workloads
Voice-agent backends (especially with tool calling)
Indic-language workloads
Cost-conscious migrations from frontier models

Look elsewhere on the Smallest stack for:

Voice transcription → Pulse
Speech synthesis → Lightning
Full voice-agent platform with built-in workflow tooling → Atoms. Electron is the LLM behind many Atoms agents; Atoms is the right choice when you want the platform-level scaffolding (campaigns, knowledge base, telephony) rather than building it yourself.