LLM (Electron)

Best Practices


A short, opinionated guide to using Electron well.

Prompt structure for cache hits

Stable content first, variable content last. The prefix cache matches identical token prefixes — if you put a per-request value early in the prompt, nothing after it can match the cache.

# ✅ stable prefix → caches well
messages = [
    {"role": "system", "content": SYSTEM_PROMPT_VERBATIM},  # identical across users
    {"role": "user", "content": RAG_CONTEXT_BLOCK},         # identical across the session
    {"role": "user", "content": user_question},             # varies, so it goes last
]

# ❌ per-request value in system prompt → no cache hits
messages = [
    {"role": "system", "content": f"You're helping {user_name}. {SYSTEM_PROMPT}"},
    ...
]

See Prefix Caching for the full guide.
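
To make the ordering rule concrete, here is a rough sketch that counts how many leading tokens two prompts share. It assumes whitespace splitting as a stand-in for Electron's real tokenizer, and `shared_prefix_len`, `tokenize`, and the sample prompts are hypothetical names made up for illustration; the actual cache granularity may differ.

```python
def shared_prefix_len(a, b):
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokenize(text):
    # Assumption: whitespace splitting stands in for the real tokenizer.
    return text.split()


SYSTEM_PROMPT = "You are a concise support assistant for Acme."

# Stable content first: the per-user value comes after the shared prompt.
good_a = tokenize(SYSTEM_PROMPT + " User: Asha. Where is my order?")
good_b = tokenize(SYSTEM_PROMPT + " User: Ravi. Cancel my plan.")

# Per-user value first: the prompts diverge almost immediately.
bad_a = tokenize("You're helping Asha. " + SYSTEM_PROMPT)
bad_b = tokenize("You're helping Ravi. " + SYSTEM_PROMPT)

print(shared_prefix_len(good_a, good_b))  # 9
print(shared_prefix_len(bad_a, bad_b))    # 2
```

In the second pair the whole system prompt sits after the point of divergence, so none of it can be reused from the cache.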

Always pass stream_options.include_usage when streaming

client.chat.completions.create(
    model="electron",
    messages=[...],
    stream=True,
    stream_options={"include_usage": True},
)

Without it, you can’t tell how many tokens were generated when the client disconnects mid-stream. With it, the server appends a final usage chunk so your accounting is exact.
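
A minimal sketch of reading that final chunk, assuming Electron follows the OpenAI-compatible streaming shape in which the usage chunk arrives last with an empty `choices` list. `consume_stream` is a hypothetical helper, and it accepts any iterable of chunk-like objects so it can be tried without a network call.

```python
def consume_stream(chunks):
    """Collect streamed text plus the usage object, if one arrives."""
    text_parts = []
    usage = None
    for chunk in chunks:
        # Assumption: only the final chunk carries a non-null `usage`.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        # The usage chunk is expected to have no choices, so this loop skips it.
        for choice in chunk.choices:
            if choice.delta.content:
                text_parts.append(choice.delta.content)
    return "".join(text_parts), usage
```

If `usage` comes back as `None`, the stream ended before the final chunk arrived, which is worth logging separately from a normal completion.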

Tool calls in voice agents — speak the filler

If you’re streaming Electron’s output into a TTS engine for a voice agent:

  1. As soon as delta.content chunks arrive, feed them to TTS.
  2. When delta.tool_calls chunks arrive, kick off the actual tool execution in parallel with TTS — don’t wait for the filler to finish speaking.
  3. Append tool role messages with results and continue.

The user hears “Let me check the weather for you…” while your weather API resolves in the background. Perceived latency drops by hundreds of milliseconds. See Tool Calling: voice-agent pattern.
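
The three steps above can be sketched roughly as follows. This assumes OpenAI-style streaming deltas, where tool-call arguments arrive in fragments keyed by `index` and the choice ends with `finish_reason == "tool_calls"`. `route_stream`, `speak`, and `run_tool` are hypothetical names; `speak` and `run_tool` are coroutines you would supply for your TTS engine and tool backend.

```python
import asyncio
import json


async def route_stream(chunks, speak, run_tool):
    """Send text deltas to TTS as they arrive and run tools concurrently."""
    tts_queue = asyncio.Queue()

    async def tts_worker():
        # Speak fragments in order until the None sentinel arrives.
        while (text := await tts_queue.get()) is not None:
            await speak(text)

    worker = asyncio.create_task(tts_worker())
    calls = {}       # tool-call index -> accumulated id, name, arguments
    tool_tasks = []

    async for chunk in chunks:
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            tts_queue.put_nowait(choice.delta.content)
        for frag in choice.delta.tool_calls or []:
            call = calls.setdefault(frag.index, {"id": None, "name": "", "arguments": ""})
            call["id"] = frag.id or call["id"]
            call["name"] += frag.function.name or ""
            call["arguments"] += frag.function.arguments or ""
        if choice.finish_reason == "tool_calls":
            # Arguments are complete: start the tools now, without waiting
            # for the TTS worker to finish speaking the filler.
            for call in calls.values():
                args = json.loads(call["arguments"] or "{}")
                tool_tasks.append(asyncio.create_task(run_tool(call["name"], args)))

    tts_queue.put_nowait(None)  # tell the TTS worker to stop after the backlog
    results = await asyncio.gather(*tool_tasks)
    await worker
    return [
        {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        for call, result in zip(calls.values(), results)
    ]
```

The returned list holds the `tool` role messages for the next request. In OpenAI-compatible APIs the assistant message that carried the tool calls normally has to be appended to the history before them.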

Concurrency

Electron’s plan limits cap concurrent in-flight requests. If you’re firing many parallel requests:

  • Standard plan: 3 concurrent requests. Bursts up to this limit are safe; requests over it return HTTP 429 with the message "Concurrency limit reached".
  • Enterprise plan: 20 concurrent.

If you need to fire more in parallel, batch with a semaphore or queue:

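import asyncio

# Assumes async_client is an already-configured async OpenAI-compatible client.
sem = asyncio.Semaphore(3)  # match your plan's concurrency

async def safe_chat(messages):
    # Wait for a free slot so no more than 3 requests are in flight at once.
    async with sem:
        return await async_client.chat.completions.create(model="electron", messages=messages)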
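
To fan a batch out through the same kind of gate, something like the following could work. It is a sketch only: `run_bounded` and `fake_chat` are hypothetical names, and `fake_chat` stands in for a real Electron call so the example runs offline.

```python
import asyncio

MAX_CONCURRENT = 3  # Standard plan limit; adjust to your plan


async def run_bounded(jobs, worker, limit=MAX_CONCURRENT):
    """Run worker(job) for every job with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        async with sem:
            return await worker(job)

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(guarded(job) for job in jobs))


stats = {"in_flight": 0, "peak": 0}


async def fake_chat(prompt):
    # Stand-in for a real request; records how many calls overlap.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return f"reply to {prompt}"


replies = asyncio.run(run_bounded([f"q{i}" for i in range(10)], fake_chat))
print(len(replies), stats["peak"])
```

All ten jobs are submitted up front, but the recorded peak never exceeds the limit, which is what keeps the batch clear of the 429 response.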

Retry on 502 / 503 only

| Status | Retry? |
| --- | --- |
| 400 | ❌ Bad request; fix the input |
| 401 / 403 | ❌ Credential or access issue |
| 429 | ⚠️ Back off, then retry (rate limit or concurrency hit) |
| 502 | ✅ Upstream blip; retry with backoff |
| 503 | ✅ Upstream overloaded; retry with backoff |

Use exponential backoff (start ~250 ms, double up to ~8 s, cap at ~5 attempts). Always carry the request_id from the failed response so support can trace it if you escalate.
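
A minimal sketch of that policy, assuming the failed call raises an exception that carries a `status_code` attribute (as the OpenAI Python SDK's status errors do). `call_with_backoff` is a hypothetical helper, and it treats 429 as retryable alongside 502 and 503, following the table above. Jitter is left out to keep the example deterministic.

```python
import time

RETRYABLE = {429, 502, 503}


def call_with_backoff(send, max_attempts=5, base_delay=0.25, max_delay=8.0, sleep=time.sleep):
    """Call send(); retry retryable HTTP failures with exponential backoff."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception as exc:
            # Assumption: the client library exposes the HTTP status here.
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE or attempt == max_attempts:
                raise  # not retryable, or out of attempts
            sleep(delay)
            delay = min(delay * 2, max_delay)  # 0.25, 0.5, 1, 2, ... capped at 8 s
```

If your client library already retries on its own, it is probably worth disabling one of the two layers so the attempts do not multiply.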

Use request_id in every log line

Every response sets an X-Request-Id header and includes request_id in error envelopes. Capture and log it on every call — it’s the only way support can trace a specific request through the system.

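# with_raw_response (OpenAI Python SDK) exposes HTTP headers alongside the parsed body.
resp = client.with_raw_response.chat.completions.create(model="electron", messages=[...])
request_id = resp.http_response.headers.get("x-request-id")
completion = resp.parse()  # the usual chat completion object
logger.info("chat completion", extra={"request_id": request_id, "user_id": ...})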
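
One way to get the ID onto every line without repeating `extra=` is the standard library's `logging.LoggerAdapter`. This is a sketch: the logger name, the format string, and the `logger_for` helper are illustrative choices, and the in-memory stream stands in for a real log sink.

```python
import io
import logging

log_stream = io.StringIO()  # stand-in sink so the example is self-contained
handler = logging.StreamHandler(log_stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)

base_logger = logging.getLogger("electron_calls")
base_logger.addHandler(handler)
base_logger.setLevel(logging.INFO)
base_logger.propagate = False  # keep the demo output out of the root logger


def logger_for(request_id):
    """Return a logger that stamps request_id on every record it emits."""
    return logging.LoggerAdapter(base_logger, {"request_id": request_id or "unknown"})


log = logger_for("req_123")  # in practice, the X-Request-Id header value
log.info("chat completion finished")
print(log_stream.getvalue(), end="")
```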

Keep tool descriptions short

Every tool description and parameter schema gets sent as input tokens on every turn. A 200-token tool catalog gets re-billed (at cache rates after the first turn) every time you call the model. Aim for:

  • One sentence of intent per tool.
  • Tight parameter description fields — name + brief purpose.
  • Don’t include examples in the tool definition unless they materially improve calling behavior. Put examples in the system prompt where they cache once.
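
As an illustration of those guidelines, a terse definition in the OpenAI-style `tools` format might look like this. The `get_weather` tool and its parameters are hypothetical, and `schema_chars` is only a rough size proxy, since billing is counted in tokens and not characters.

```python
import json

get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        # One sentence of intent, no examples.
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def schema_chars(tools):
    """Rough size proxy: characters in the compact JSON the tools serialize to."""
    return len(json.dumps(tools, separators=(",", ":")))


print(schema_chars([get_weather_tool]))
```

Tracking this number as the tool catalog grows gives an early signal that descriptions are getting longer than they need to be.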

Use seed for reproducibility — but don’t over-rely

Electron honors seed for best-effort determinism. Same seed + same input usually yields the same output, but it’s not a hard guarantee — model deployments, batching, and version updates can cause drift. For test fixtures, prefer temperature: 0 over relying solely on seed.

Cost control

  • Prefix caching is the single biggest lever. Put stable content first.
  • max_tokens caps output cost. Set it to the smallest value that still lets the model finish naturally.
  • stop sequences can end generation early when the model emits a known terminator.
  • For agent workflows, bound tool-call chains — a runaway agent that calls itself in a loop can blow through your budget.
  • Monitor usage.prompt_tokens_details.cached_tokens in your logs to confirm caching is working.
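
For the last point, a small helper can turn the usage object into a cache hit ratio for your logs. This is a sketch that assumes the response exposes `usage.prompt_tokens` and `usage.prompt_tokens_details.cached_tokens` as attributes; `cache_hit_ratio` is a hypothetical name, and the sample numbers are made up.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage):
    """Fraction of prompt tokens served from the prefix cache (0.0 if unknown)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    prompt = getattr(usage, "prompt_tokens", 0) or 0
    return cached / prompt if prompt else 0.0


# Stand-in for completion.usage on a real response.
usage = SimpleNamespace(
    prompt_tokens=1200,
    prompt_tokens_details=SimpleNamespace(cached_tokens=900),
)
print(cache_hit_ratio(usage))  # 0.75
```

A ratio that stays near zero on later turns of a session suggests that something variable has crept into the start of the prompt.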

When to choose another product

Electron is the right call for:

  • OpenAI-compatible chat/agent workloads
  • Voice-agent backends (especially with tool calling)
  • Indic-language workloads
  • Cost-conscious migrations from frontier models

Look elsewhere on the Smallest stack for:

  • Voice transcription → Pulse
  • Speech synthesis → Lightning
  • Full voice-agent platform with built-in workflow tooling → Atoms. Electron is the LLM behind many Atoms agents; Atoms is the right choice when you want the platform-level scaffolding (campaigns, knowledge base, telephony) rather than building it yourself.