Prefix Caching

Electron caches stable prompt prefixes across requests. When a new request shares a prefix with a recent one, those tokens are served from cache:

  • Faster: fewer tokens to process means lower time-to-first-token.
  • Cheaper: cached input tokens are billed at a discounted rate compared with normal input tokens. Contact your Smallest AI account manager for the current rate card.

There is no flag to set or knob to tune: caching is automatic. Your job is to structure prompts so that the prefix stays stable across requests.

How to tell what got cached

Every response includes usage.prompt_tokens_details.cached_tokens:

{
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 80,
    "total_tokens": 1280,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}

In this example, 1024 of 1200 input tokens were served from cache. The fresh 176 tokens bill at the normal input rate; the 1024 cached tokens bill at the discounted rate.
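The split between fresh and cached tokens can be computed directly from the usage object. The following is a minimal sketch, assuming the usage payload has already been parsed into a plain dict shaped like the JSON above; `split_prompt_tokens` is a hypothetical helper name, not part of any SDK.

```python
def split_prompt_tokens(usage: dict) -> tuple[int, int, float]:
    """Return (fresh_tokens, cached_tokens, cache_hit_ratio) for one response.

    Assumes `usage` is a dict shaped like the JSON example above.
    """
    prompt_tokens = usage["prompt_tokens"]
    # Treat a missing details object as "nothing cached" (an assumption).
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    fresh = prompt_tokens - cached
    ratio = cached / prompt_tokens if prompt_tokens else 0.0
    return fresh, cached, ratio


usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 80,
    "total_tokens": 1280,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
fresh, cached, ratio = split_prompt_tokens(usage)
print(fresh, cached, round(ratio, 3))  # 176 1024 0.853
```

Logging the ratio per request is one simple way to notice when a prompt change has started breaking the prefix.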

How to maximize cache hits

The cache matches identical prefixes — same tokens in the same order. The principle: put stable content first, variable content last.

✅ Do this

messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},    # stable: caches
    {"role": "user", "content": LONG_RAG_CONTEXT_BLOCK},  # stable per session
    {"role": "user", "content": user_question},           # varies: goes last
]

❌ Avoid this

messages = [
    # user_name varies per user, which breaks the cache for everyone
    {"role": "system", "content": f"You are helping {user_name}. {SYSTEM_PROMPT}"},
]

Putting per-request variability before stable content prevents the cache from matching anything past that point.
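One way to repair the pattern above is to keep the system prompt identical for every user and move the per-user detail into a later message. This is only a sketch: `build_messages` is a hypothetical helper, the system prompt text is a placeholder, and how the user's name is conveyed to the model is an assumption rather than a documented recommendation.

```python
# Placeholder text standing in for a long, stable system prompt.
SYSTEM_PROMPT = "You are a customer-support agent. Follow the policies below."


def build_messages(user_name: str, user_question: str) -> list[dict]:
    """Keep the system prompt identical for all users; put per-user details after it."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # stable for every user
        # Per-user data lives in a later message, so it cannot disturb the prefix above.
        {"role": "user", "content": f"My name is {user_name}. {user_question}"},
    ]


alice = build_messages("Alice", "Where is my order?")
bob = build_messages("Bob", "How do I reset my password?")
print(alice[0] == bob[0])  # True
```

Both users now send the same first message, so the shared prefix is eligible for caching across them.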

Practical patterns

Pattern 1: shared system prompt across users

If many requests share the same system prompt, place it first verbatim — including whitespace and punctuation. The first request warms the cache; subsequent requests benefit.

SYSTEM_PROMPT = """You are a customer-support agent for Acme Corp. ...
[800 tokens of policies, tone guidelines, escalation rules]
"""

# Every request:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_input},
]

Result: about 800 of the prompt tokens show up as cached_tokens on every call after the first.

Pattern 2: RAG with stable retrieved context

For a chat session where the user is asking follow-up questions over the same retrieved documents, keep the context block fixed across turns:

context_block = "\n\n".join(retrieved_chunks)  # stable for the session

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "system", "content": f"Context:\n{context_block}"},
    *conversation_history,  # grows but never shrinks, so the cache still hits the prefix
    {"role": "user", "content": current_question},
]

The system prompt + context block + earlier conversation history all hit the cache as the conversation grows.
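Keeping the context block fixed usually means building it once per session rather than on every turn, since a retriever may return the same chunks in a different order. A minimal sketch of that idea follows; `SessionContext` and the in-memory dict are illustrative assumptions, not part of the Electron API.

```python
class SessionContext:
    """Builds the context block once per session and reuses it verbatim."""

    def __init__(self) -> None:
        self._blocks: dict[str, str] = {}

    def get_block(self, session_id: str, retrieved_chunks: list[str]) -> str:
        # Only the first retrieval for a session is used; later calls return the
        # frozen block even if the retriever returns chunks in a different order.
        if session_id not in self._blocks:
            self._blocks[session_id] = "\n\n".join(retrieved_chunks)
        return self._blocks[session_id]


ctx = SessionContext()
first = ctx.get_block("s1", ["chunk A", "chunk B"])
second = ctx.get_block("s1", ["chunk B", "chunk A"])  # reordered retrieval
print(first == second)  # True
```

The trade-off is freshness: if a follow-up question genuinely needs different documents, the block has to change, and the request pays for a cache miss from that point in the prompt onward.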

Pattern 3: conversation history

In a multi-turn chat, each new turn’s prefix (the system prompt plus all previous messages) begins with the previous turn’s full message list. Resending the full history on every turn therefore lets all the older turns be served from cache.

# Turn 5:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    # turns 1-4 (already cached after turn 4 was sent)
    *history,
    # turn 5 user message (the only new bit)
    {"role": "user", "content": "..."},
]
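The property that makes this work is that history is only ever appended to, never edited or trimmed. A small sketch of an append-only container illustrates it; `Conversation` is a hypothetical class written for this example.

```python
class Conversation:
    """Append-only message list, so each request extends the previous one."""

    def __init__(self, system_prompt: str) -> None:
        self._messages: list[dict] = [{"role": "system", "content": system_prompt}]

    def add(self, role: str, content: str) -> None:
        # Only ever append; editing or trimming earlier turns changes the prefix.
        self._messages.append({"role": role, "content": content})

    def request_messages(self) -> list[dict]:
        # Return a copy so callers cannot add to or reorder the stored list.
        return list(self._messages)


convo = Conversation("You are a helpful assistant.")
convo.add("user", "What is 2+2?")
turn_1 = convo.request_messages()
convo.add("assistant", "4")
convo.add("user", "And 3+3?")
turn_2 = convo.request_messages()
print(turn_2[: len(turn_1)] == turn_1)  # True
```

Summarizing or truncating old turns to save tokens breaks this property, so it is worth weighing against the cached-token discount.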

Where the savings come from

A typical voice-agent turn carries a long stable prefix (system prompt + retrieved context + conversation history) and a short tail (the new user message). Once that prefix is warm in the cache, every subsequent turn pays the cached rate on the prefix and the normal rate on just the tail. The longer your stable prefix, the bigger the saving.

Contact your Smallest AI account manager for the current rate card and to model the savings against your workload.
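Until you have the real rate card, the saving can still be modeled in relative terms. The sketch below expresses input cost in units of one normally billed input token; `relative_input_cost` is a hypothetical helper, and the 0.5 ratio is a placeholder chosen purely for illustration, not Electron's actual pricing.

```python
def relative_input_cost(prompt_tokens: int, cached_tokens: int, cached_rate_ratio: float) -> float:
    """Input cost in units of "one normally billed input token".

    cached_rate_ratio is the cached price divided by the normal price.
    The real ratio comes from your rate card; any value used here is a placeholder.
    """
    fresh = prompt_tokens - cached_tokens
    return fresh + cached_tokens * cached_rate_ratio


PLACEHOLDER_RATIO = 0.5  # NOT a real price; substitute the ratio from your rate card

with_cache = relative_input_cost(1200, 1024, PLACEHOLDER_RATIO)
without_cache = relative_input_cost(1200, 0, PLACEHOLDER_RATIO)
print(with_cache, without_cache)  # 688.0 1200.0
```

Summing this over the turns of a representative conversation, using the cached_tokens values your responses actually report, gives a workload-level estimate.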

What invalidates the cache

  • Any change in the prefix tokens — including whitespace, punctuation, capitalization.
  • Changes in the order of messages.
  • Cache entries expire after a period of inactivity. Hot prompts stay cached; cold prompts drop out.

If you see cached_tokens lower than expected, check that your system prompt template doesn’t have variable substitutions — even invisible ones like timestamps.
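A quick way to find such a substitution is to render the prompt for two requests and locate the first position where the strings differ. This sketch compares characters as a stand-in for tokens, which is an approximation; `first_divergence` is a hypothetical helper and the prompts are made-up examples.

```python
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if the strings are equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # One string is a strict prefix of the other: they diverge where the shorter ends.
    return None if len(a) == len(b) else min(len(a), len(b))


prompt_1 = "Today is 2024-01-01T10:00:00. You are a support agent."
prompt_2 = "Today is 2024-01-01T10:00:05. You are a support agent."
idx = first_divergence(prompt_1, prompt_2)
# Show a little context around the point where the two prompts part ways.
print(idx, repr(prompt_1[max(idx - 5, 0) : idx + 5]))
```

Everything after the reported index is outside the shared prefix, so a divergence near the start of the system prompt means almost nothing can be cached.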

Verifying savings

The simplest verification: send the same request twice in a row and inspect cached_tokens on the second response.

# `client` is assumed to be an already-configured OpenAI-compatible client.
for i in range(2):
    resp = client.chat.completions.create(
        model="electron",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2+2?"},
        ],
    )
    cached = resp.usage.prompt_tokens_details.cached_tokens
    print(f"call {i+1}: cached_tokens={cached}, prompt_tokens={resp.usage.prompt_tokens}")

You’ll see cached_tokens=0 on the first call and a non-zero value on the second.