Someone analyzed 3,007 Claude Code sessions and found a ratio that broke my brain: for every fresh token sent to the API, 525 tokens were served from cache. The total? 12.2 billion cached tokens against 10 million fresh ones.
That's not a rounding error. That's a system where prompt structure — not prompt content — is doing most of the economic work.
The Prefix Is the Product
Every major LLM provider now offers some form of prompt caching. The mechanism is consistent across all of them: they store the computed key and value (KV) tensors from the transformer's attention layers so that identical prompt prefixes don't need to be reprocessed. If your first 8,000 tokens match a previous request, the provider skips computation on those 8,000 tokens and bills them at a steep discount; you pay full price only for what's new.
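A minimal sketch of the idea (illustrative only, not any provider's actual implementation): treat the prompt as a token sequence, find the longest prefix it shares with the previous request, and only the tail beyond that point needs fresh computation.

```python
def common_prefix_len(prev, curr):
    """Length of the longest shared prefix between two token sequences."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

# Hypothetical requests: 8 stable "system" tokens, then a variable query.
prev_request = ["sys"] * 8 + ["query", "one"]
curr_request = ["sys"] * 8 + ["query", "two"]

cached = common_prefix_len(prev_request, curr_request)
fresh = len(curr_request) - cached
print(cached, fresh)  # 9 tokens served from cache, 1 computed fresh
```

The moment any token in the middle changes, everything after it falls out of the shared prefix, which is why the ordering rules below matter so much.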
This sounds like a backend optimization that shouldn't concern anyone writing prompts. It's not. The moment you understand that token ordering determines cost, you start writing prompts differently.
The rule is simple: stable content first, variable content last.
```
# BAD — dynamic content breaks the prefix
[user query]          ← changes every request
[system instructions] ← stable
[tool definitions]    ← stable
[few-shot examples]   ← stable

# GOOD — stable prefix maximizes reuse
[system instructions] ← stable, cached
[tool definitions]    ← stable, cached
[few-shot examples]   ← stable, cached
[user query]          ← only this gets computed fresh
```
Written out like that, it seems obvious. In practice, most prompt templates I've audited get it backwards — the user message sits at the top, or timestamps and session metadata pollute the prefix.
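The ordering rule translates directly into how you assemble prompt strings. A hypothetical helper (the function and block names are mine, not from any SDK) that enforces stable-first ordering:

```python
def build_prompt(system_instructions, tool_definitions, examples, user_query):
    """Assemble a prompt with every stable block ahead of the variable one,
    so repeated requests share the longest possible cached prefix."""
    stable_prefix = "\n\n".join([system_instructions, tool_definitions, examples])
    return stable_prefix + "\n\n" + user_query

p1 = build_prompt("You are a helpful agent.", "tools: search, read_file",
                  "Example: ...", "Refactor utils.py")
p2 = build_prompt("You are a helpful agent.", "tools: search, read_file",
                  "Example: ...", "Explain main.py")

# Both prompts are byte-identical up to the user query, which is
# exactly the region a provider's prefix cache can reuse.
shared = len(p1) - len("Refactor utils.py")
assert p1[:shared] == p2[:shared]
```

Put the user message first instead, and the shared prefix shrinks to zero: every request recomputes everything.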
Where Agentic Systems Self-Sabotage
The "Don't Break the Cache" paper published in January tested prompt caching across GPT-5.2, Claude Sonnet 4.5, and Gemini 2.0 Flash on long-horizon agentic tasks. The headline numbers are strong: 45–80% cost reduction and 13–31% latency improvement when caching was done correctly.
But the paper's real contribution is documenting what goes wrong. Naive full-context caching — storing everything including tool results and conversation history — can increase latency. The system wastes time writing cache entries for content that never gets reused. A tool result from one request is almost never identical to the next one. The winning strategy was boring: cache the system prompt and tool definitions only. Leave everything else out.
Three specific patterns that kill your cache in agentic workflows:
Timestamps in system prompts. "The current date is April 2, 2026" sitting at the top of your instructions means every request after midnight invalidates the entire cached prefix. Move temporal context to the end, or pass it as a user message instead.
Dynamic tool sets. Adding or removing tools between requests changes the prefix, torching the cached KV tensors for everything below. The fix: register all tools all the time, even ones not currently active, and use mode-switching tools at runtime to enable or disable capabilities. Bonus — the model can then autonomously enter modes when it detects the need.
Session IDs and UUIDs. Any dynamic identifier embedded in the cached region poisons every block that follows it. One researcher used UUID boundaries specifically to prevent unwanted caching of volatile sections.
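All three failure patterns are mechanically detectable. A rough audit sketch (the regexes and the `audit_prefix` helper are illustrative, not exhaustive) that scans a would-be cached prefix for volatile content:

```python
import re

# Patterns that commonly invalidate a cached prefix (illustrative list).
VOLATILE = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "uuid": re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
                       r"[0-9a-f]{4}-[0-9a-f]{12}\b", re.I),
    "timestamp": re.compile(r"\b\d{2}:\d{2}:\d{2}\b"),
}

def audit_prefix(system_prompt):
    """Return the names of volatile patterns found in a cached-prefix candidate."""
    return [name for name, rx in VOLATILE.items() if rx.search(system_prompt)]

bad = ("The current date is 2026-04-02. "
       "Session 3f2a8c1e-0b4d-4e7a-9c1d-2f3b4a5c6d7e.")
print(audit_prefix(bad))  # ['date', 'uuid']
```

Anything this kind of check flags belongs after the stable blocks, or in the user message, never at the top of the system prompt.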
Two Philosophies of Control
Anthropic and OpenAI handle this completely differently, and the architectural choice matters.
OpenAI caches automatically. You don't mark anything — the system detects repeated prefixes and attempts to route requests to previously cached entries. Zero effort required. But also zero control. Hit rates hover around 50% in independent testing, and cached entries live only 5–10 minutes.
Anthropic exposes explicit cache_control breakpoints — up to four per request. You mark exactly where the cacheable prefix ends. The system then works backwards from your breakpoint, checking up to 20 preceding blocks for the longest matching prefix. Hit rates in controlled testing: 100% when prompts are structured properly.
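Concretely, the breakpoint is a `cache_control` field on a system block. A sketch of the request body shape per Anthropic's Messages API docs (constructed as a plain dict here, no call is made, and the model name and prompt text are placeholders):

```python
# Everything up to and including the block carrying cache_control is the
# cacheable prefix; the user message below it is computed fresh each time.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a code-review agent."},
        {
            "type": "text",
            "text": "Tool schemas and few-shot examples go here.",
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache ends here
        },
    ],
    "messages": [
        {"role": "user", "content": "Review the attached diff."},
    ],
}
```

Note that the breakpoint sits after the last invariant block, not on the user message: marking volatile content cacheable just wastes cache writes.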
For anything making repeated API calls — agents, multi-turn conversations, RAG pipelines — explicit control compounds. The gap between 50% and 95%+ hit rates gets expensive fast when you're making forty sequential calls per task.
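The compounding is easy to put numbers on. A back-of-envelope model (the per-token price and the 10x cached discount are illustrative assumptions, not quoted rates):

```python
def task_cost(prefix_tokens, fresh_tokens, calls, hit_rate,
              price_per_token=3e-6, cached_discount=0.1):
    """Approximate input cost for one multi-call task. On a cache hit the
    prefix bills at a fraction of the base price; on a miss, in full."""
    hit_cost = prefix_tokens * price_per_token * cached_discount
    miss_cost = prefix_tokens * price_per_token
    prefix_cost = calls * (hit_rate * hit_cost + (1 - hit_rate) * miss_cost)
    return prefix_cost + calls * fresh_tokens * price_per_token

# 40 sequential calls, a 10k-token stable prefix, 500 fresh tokens per call.
auto = task_cost(10_000, 500, calls=40, hit_rate=0.50)      # automatic caching
explicit = task_cost(10_000, 500, calls=40, hit_rate=0.95)  # explicit breakpoints
print(f"${auto:.2f} vs ${explicit:.2f} per task")
```

Under these assumptions the 50% hit rate costs roughly three times what the 95% rate does, and the gap scales linearly with call volume.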
What 12.2 Billion Cached Tokens Look Like
Back to that Claude Code dataset. The story unfolds in three phases.
During the first three weeks, the developer was generating code at scale — 197,831 lines of Java. Cache rates dipped to 88.8% because each request carried substantial novel context. Expected behavior for a generative workload.
Then the workflow pivoted to comprehension: reading, reviewing, navigating existing code. 673 sessions in a single week, minimal new output. Cache rates climbed to 93–95% as the stable prefix — project context, skill definitions, memory files — dominated each request.
By the final phase, sustained 95%+ rates. Cache reads had doubled while fresh input dropped. Estimated API cost without caching: around $40,000. With it: $8,900. The user's actual subscription cost for those 55 days was roughly $300.
The author's framing sticks with me: "Treat your prompt context as infrastructure, not scaffolding." System prompts, tool schemas, persona definitions — those aren't disposable wrappers you rewrite per conversation. They're compiled artifacts amortized across thousands of interactions.
So What Do You Actually Do
If your application makes more than one API call per user session, cache-awareness should be a design constraint, not an afterthought. Concretely:
Audit your system prompt for anything that changes between requests. Timestamps, session data, user-specific metadata — move all of it after the stable blocks. Place your longest, most stable context at the very beginning: reference documents, tool schemas, persona instructions, few-shot examples.
On Anthropic, drop a cache_control breakpoint right after the last invariant block. On OpenAI, keep your prefix byte-identical across requests and accept the probabilistic hit rate. On Google, consider their persistent cache objects for contexts exceeding 100k tokens; those persist for an hour by default, with a configurable TTL.
The irony of prompt engineering in 2026 might be this: the highest-ROI optimization isn't about what you tell the model. It's about where you put it.