Electron caches stable prompt prefixes across requests. When a new request shares a prefix with a recent one, those tokens are served from cache:
$0.10 / 1M vs $0.40 / 1M for normal input. A 75% discount.No flag or knob — caching is automatic. Your job is to structure prompts so the prefix is stable across requests.
Every response includes usage.prompt_tokens_details.cached_tokens:
In this example, 1024 of 1200 input tokens were served from cache. You’re billed for 176 fresh tokens at the normal rate plus 1024 cached tokens at the discounted rate.
The cache matches identical prefixes — same tokens in the same order. The principle: put stable content first, variable content last.
Putting per-request variability before stable content prevents the cache from matching anything past that point.
If many requests share the same system prompt, place it first verbatim — including whitespace and punctuation. The first request warms the cache; subsequent requests benefit.
Result: ~800 of prompt_tokens show up as cached_tokens on every call after the first.
For a chat session where the user is asking follow-up questions over the same retrieved documents, keep the context block fixed across turns:
The system prompt + context block + earlier conversation history all hit the cache as the conversation grows.
In a multi-turn chat, each new turn’s prefix (system + all previous messages) is identical to the previous turn’s full message list. Sending the full history every turn benefits from caching of all the older turns.
Assume a typical agent turn with a 1000-token system prompt, 500 tokens of conversation history, and a 50-token user question — 1550 input tokens, 200 output tokens.
~48% cheaper per turn on a typical chat workload. The savings get more pronounced as your system prompt + history grows.
If you see cached_tokens lower than expected, check that your system prompt template doesn’t have variable substitutions — even invisible ones like timestamps.
The simplest verification: send the same request twice in a row and inspect cached_tokens on the second response.
You’ll see cached_tokens=0 on the first call and a non-zero value on the second.