Prefix Caching

Electron caches stable prompt prefixes across requests. When a new request shares a prefix with a recent one, those tokens are served from cache:

  • Faster: fewer tokens to process means lower time-to-first-token.
  • Cheaper: cached input tokens are billed at $0.10 per 1M tokens versus $0.40 per 1M for normal input, a 75% discount.

There is no flag or knob to set; caching is automatic. Your job is to structure prompts so the prefix is stable across requests.

How to tell what got cached

Every response includes usage.prompt_tokens_details.cached_tokens:

{
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 80,
    "total_tokens": 1280,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}

In this example, 1024 of 1200 input tokens were served from cache. You’re billed for 176 fresh tokens at the normal rate plus 1024 cached tokens at the discounted rate.
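
As a rough illustration of reading these fields, the sketch below splits a usage payload into fresh and cached input tokens. The helper name `split_prompt_tokens` is hypothetical, and treating a missing `prompt_tokens_details` object as zero cached tokens is an assumption rather than documented behavior.

```python
def split_prompt_tokens(usage: dict) -> tuple[int, int, float]:
    """Return (fresh_tokens, cached_tokens, cached_fraction) for a usage payload."""
    prompt_tokens = usage["prompt_tokens"]
    # Assumption: a missing details object means nothing was served from cache.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    fresh = prompt_tokens - cached
    fraction = cached / prompt_tokens if prompt_tokens else 0.0
    return fresh, cached, fraction


usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 80,
    "total_tokens": 1280,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
fresh, cached, fraction = split_prompt_tokens(usage)
print(f"fresh={fresh}, cached={cached}, cached_fraction={fraction:.0%}")
# fresh=176, cached=1024, cached_fraction=85%
```

Logging the cached fraction per request is one simple way to notice when a prompt change has quietly broken the shared prefix.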

How to maximize cache hits

The cache matches identical prefixes — same tokens in the same order. The principle: put stable content first, variable content last.

✅ Do this

messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},    # stable, so it caches
    {"role": "user", "content": LONG_RAG_CONTEXT_BLOCK},  # stable per session
    {"role": "user", "content": user_question},           # varies, so it goes last
]

❌ Avoid this

messages = [
    {"role": "system", "content": f"You are helping {user_name}. {SYSTEM_PROMPT}"},
    # ^ "You are helping {user_name}." varies per user
    #   and breaks the cache for everyone
]

Putting per-request variability before stable content prevents the cache from matching anything past that point.
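
One possible repair for the "Avoid this" example, sketched under the assumption that the user's name can live in a later message: keep the system prompt identical for everyone and move the per-user detail after it. `build_messages` and the placeholder `SYSTEM_PROMPT` text are illustrative names, not part of the API.

```python
SYSTEM_PROMPT = "You are a customer-support agent. Follow the policies below."  # placeholder text


def build_messages(user_name: str, user_question: str) -> list[dict]:
    """Keep the shared system prompt first; per-user details come after it."""
    return [
        # Identical for every user, so it can be served from cache.
        {"role": "system", "content": SYSTEM_PROMPT},
        # Varies per user and per request, so it goes last.
        {"role": "user", "content": f"My name is {user_name}. {user_question}"},
    ]


alice = build_messages("Alice", "Where is my order?")
bob = build_messages("Bob", "How do I reset my password?")
print(alice[0] == bob[0])  # True: the leading message is shared across users
```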

Practical patterns

Pattern 1: shared system prompt across users

If many requests share the same system prompt, place it first verbatim — including whitespace and punctuation. The first request warms the cache; subsequent requests benefit.

SYSTEM_PROMPT = """You are a customer-support agent for Acme Corp. ...
[800 tokens of policies, tone guidelines, escalation rules]
"""

# Every request:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_input},
]

Result: ~800 of prompt_tokens show up as cached_tokens on every call after the first.

Pattern 2: RAG with stable retrieved context

For a chat session where the user is asking follow-up questions over the same retrieved documents, keep the context block fixed across turns:

context_block = "\n\n".join(retrieved_chunks)  # stable for the session

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "system", "content": f"Context:\n{context_block}"},
    *conversation_history,  # grows but doesn't shrink, so the cache hits the prefix
    {"role": "user", "content": current_question},
]

The system prompt + context block + earlier conversation history all hit the cache as the conversation grows.

Pattern 3: conversation history

In a multi-turn chat, each new turn’s prefix (system + all previous messages) is identical to the previous turn’s full message list. Sending the full history every turn benefits from caching of all the older turns.

# Turn 5:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    # turns 1-4 (already cached after turn 4 was sent)
    *history,
    # turn 5 user message (the only new bit)
    {"role": "user", "content": "..."},
]
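
A minimal sketch of keeping history append-only, so that each request starts with exactly the messages of the previous one. The `Conversation` class and `is_prefix` helper are hypothetical, and the check compares message dicts rather than tokens, so it only approximates what the cache sees.

```python
class Conversation:
    """Append-only message list: earlier messages are never edited or reordered."""

    def __init__(self, system_prompt: str) -> None:
        self._messages = [{"role": "system", "content": system_prompt}]

    def add(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def request_messages(self) -> list[dict]:
        # Return copies so callers cannot mutate the stored history.
        return [dict(m) for m in self._messages]


def is_prefix(earlier: list[dict], later: list[dict]) -> bool:
    """True if `later` starts with exactly the messages in `earlier`."""
    return later[: len(earlier)] == earlier


convo = Conversation("You are a helpful assistant.")
convo.add("user", "What is prefix caching?")
turn_1 = convo.request_messages()

convo.add("assistant", "It reuses work for repeated prompt prefixes.")
convo.add("user", "How do I benefit from it?")
turn_2 = convo.request_messages()

print(is_prefix(turn_1, turn_2))  # True
```

Editing or trimming an earlier message (for example, summarizing old turns) would make `is_prefix` return False from that point on, which is the situation where cached_tokens would be expected to drop.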

Cost math

Assume a typical agent turn with a 1000-token system prompt, 500 tokens of conversation history, and a 50-token user question: 1,550 input tokens in total, with a 200-token response.

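| | Tokens | Rate | Cost |
|---|---|---|---|
| No cache hit (first turn) | | | |
| Input | 1,550 | $0.40/M | $0.00062 |
| Output | 200 | $1.60/M | $0.00032 |
| Total | | | $0.00094 |
| Cache hit on the prefix (subsequent turns) | | | |
| Cached input | 1,500 | $0.10/M | $0.00015 |
| Fresh input | 50 | $0.40/M | $0.00002 |
| Output | 200 | $1.60/M | $0.00032 |
| Total | | | $0.00049 |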

That is about 48% cheaper per turn on a typical chat workload. The savings become more pronounced as your system prompt and history grow.
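
The arithmetic in the table can be reproduced in a few lines. The rates are the ones quoted on this page; `turn_cost` is a hypothetical helper, not an API call.

```python
INPUT_RATE = 0.40 / 1_000_000   # dollars per fresh input token
CACHED_RATE = 0.10 / 1_000_000  # dollars per cached input token
OUTPUT_RATE = 1.60 / 1_000_000  # dollars per output token


def turn_cost(prompt_tokens: int, cached_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one request, given how many prompt tokens hit the cache."""
    fresh_tokens = prompt_tokens - cached_tokens
    return (
        fresh_tokens * INPUT_RATE
        + cached_tokens * CACHED_RATE
        + completion_tokens * OUTPUT_RATE
    )


cold = turn_cost(prompt_tokens=1550, cached_tokens=0, completion_tokens=200)
warm = turn_cost(prompt_tokens=1550, cached_tokens=1500, completion_tokens=200)
savings = 1 - warm / cold

print(f"cold=${cold:.5f} warm=${warm:.5f} savings={savings:.0%}")
# cold=$0.00094 warm=$0.00049 savings=48%
```

Plugging in your own token counts shows how the discount scales: output tokens are never discounted, so workloads with long prompts and short responses benefit the most.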

What invalidates the cache

  • Any change in the prefix tokens — including whitespace, punctuation, capitalization.
  • Changes in the order of messages.
  • Cache entries expire after a period of inactivity. Hot prompts stay cached; cold prompts drop out.

If you see cached_tokens lower than expected, check that your system prompt template doesn’t have variable substitutions — even invisible ones like timestamps.
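
When cached_tokens is lower than expected, comparing the rendered prompts of two requests can show where they first diverge. This sketch works on characters rather than tokens, so it only approximates the cache's view; `first_divergence` and the two templated prompts are hypothetical.

```python
from itertools import zip_longest


def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (ca, cb) in enumerate(zip_longest(a, b)):
        if ca != cb:
            return i
    return -1


# Two renderings of a template that embeds a timestamp near the top.
prompt_a = "Current time: 2024-01-01T10:00:00Z\nYou are a support agent for Acme Corp."
prompt_b = "Current time: 2024-01-01T10:00:07Z\nYou are a support agent for Acme Corp."

idx = first_divergence(prompt_a, prompt_b)
# Everything before idx is shared; everything after it cannot match the cache.
print(idx, repr(prompt_a[:idx]))
```

Here the shared prefix ends inside the timestamp, so the rest of the system prompt is reprocessed on every request even though its text never changes. Moving the timestamp to the end of the prompt, or dropping it, would restore the stable prefix.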

Verifying savings

The simplest verification: send the same request twice in a row and inspect cached_tokens on the second response.

# `client` is assumed to be an already-configured chat-completions client.
for i in range(2):
    resp = client.chat.completions.create(
        model="electron",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2+2?"},
        ],
    )
    cached = resp.usage.prompt_tokens_details.cached_tokens
    print(f"call {i+1}: cached_tokens={cached}, prompt_tokens={resp.usage.prompt_tokens}")

You’ll see cached_tokens=0 on the first call and a non-zero value on the second.