A short, opinionated guide to using Electron well.
Stable content first, variable content last. The prefix cache matches identical token prefixes — if you put a per-request value early in the prompt, nothing after it can match the cache.
See Prefix Caching for the full guide.
stream_options.include_usage when streamingWithout it, you can’t tell how many tokens were generated when the client disconnects mid-stream. With it, the server appends a final usage chunk so your accounting is exact.
If you’re streaming Electron’s output into a TTS engine for a voice agent:
delta.content chunks arrive, feed them to TTS.delta.tool_calls chunks arrive, kick off the actual tool execution in parallel with TTS — don’t wait for the filler to finish speaking.tool role messages with results and continue.The user hears “Let me check the weather for you…” while your weather API resolves in the background. Perceived latency drops by hundreds of milliseconds. See Tool Calling: voice-agent pattern.
Electron’s plan limits cap concurrent in-flight requests. If you’re firing many parallel requests:
HTTP 429 with Concurrency limit reached.If you need to fire more in parallel, batch with a semaphore or queue:
502 / 503 onlyUse exponential backoff (start ~250 ms, double up to ~8 s, cap at ~5 attempts). Always carry the request_id from the failed response so support can trace it if you escalate.
request_id in every log lineEvery response sets an X-Request-Id header and includes request_id in error envelopes. Capture and log it on every call — it’s the only way support can trace a specific request through the system.
Every tool description and parameter schema gets sent as input tokens on every turn. A 200-token tool catalog gets re-billed (at cache rates after the first turn) every time you call the model. Aim for:
description fields — name + brief purpose.seed for reproducibility — but don’t over-relyElectron honors seed for best-effort determinism. Same seed + same input usually yields the same output, but it’s not a hard guarantee — model deployments, batching, and version updates can cause drift. For test fixtures, prefer temperature: 0 over relying solely on seed.
max_tokens caps output cost. Set it to the smallest value that still lets the model finish naturally.stop sequences can end generation early when the model emits a known terminator.usage.prompt_tokens_details.cached_tokens in your logs to confirm caching is working.Electron is the right call for:
Look elsewhere on the Smallest stack for: