Prompt Caching

Pattern

A reusable solution you can apply to your work.

Pin the unchanging part of your prompt at the front so the provider can reuse its computed state, and pay a fraction of the cost on every reuse.

Also known as: Context Caching (Google), Implicit Caching, Explicit Caching

Understand This First

  • Prompt — what gets cached is a prefix of the prompt sent to the model.
  • Context Window — caching operates inside the window; it does not extend it.
  • Context Engineering — the discipline that produces the stable structure caching needs.

Context

Agentic workloads have a peculiar shape. The same long preamble (an instruction file, a tool catalog, retrieved documents, a conversation transcript) gets sent to the model again and again, with only the last few turns changing between calls. The first call to the provider sees a 50,000-token prompt. The next call sees the same 50,000 tokens plus 200 new ones. Without help, the provider charges full price for the whole thing every time.

Prompt caching is the help. The provider remembers the model’s internal state for a prefix it has seen before, and on the next call, if the new prompt starts with that same prefix byte-for-byte, the provider skips the recomputation and bills the cached portion at a steep discount. The mechanism is now standard across OpenAI (automatic), Anthropic (explicit cache_control breakpoints), Google (implicit and explicit context caching), and the cross-provider routing layers that wrap them.

For agent builders, prompt caching is the lever that turns long-context workflows from “we cannot afford to ship this” into “this is fine.” Anthropic publishes up to 90% cost reduction and 85% latency reduction on long prompts. OpenAI’s automatic caching reports up to 90% input-token savings and up to 80% latency reduction on cached prefixes once a prompt crosses the 1,024-token threshold. ProjectDiscovery published a case study reporting a 59% cut in their LLM bill after adopting it across their pipeline.

Problem

How do you afford to send a long, mostly-stable prompt on every turn of an agent loop, when the per-token cost of input scales linearly with prompt length?

The naive math is brutal. A 50,000-token system prompt at $3 per million input tokens costs 15 cents per call. A Ralph Wiggum loop calling 50 times a day at that price costs $7.50 per day per agent, before any output tokens, before any tool turns, before any retries. Multiply by a fleet of agents and the bill is real money. The whole prompt is also redundant: the first 49,800 tokens are identical to last call’s first 49,800. Paying full price to recompute the same KV-cache the provider just discarded is an avoidable tax.
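The same math in a few lines of plain Python, using the rates from the paragraph above:

```python
# Uncached input cost for a 50,000-token prompt at $3 per million input tokens.
cost_per_call = 50_000 / 1_000_000 * 3.00   # $0.15 per call
cost_per_day = cost_per_call * 50           # $7.50 per day at 50 calls/day
print(f"${cost_per_call:.2f} per call, ${cost_per_day:.2f} per day")
```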

Forces

  • Recomputation cost grows with prompt length. Every input token costs both money and latency at full rate, even when the provider just computed the same prefix one second ago.
  • Stable prefixes are what production agents actually have. Instruction files, tool catalogs, system prompts, and conversation transcripts that grow only at the tail are the dominant prompt shape — exactly the shape caching rewards.
  • Cache invalidation is byte-exact. A single character change anywhere in the prefix throws away every downstream token’s cached state. Reorder two paragraphs in the system prompt and you start paying full price again.
  • Provider TTLs are short. Most cached entries expire in five minutes to an hour. An agent that runs once an hour rarely sees a cache hit; an agent that runs every minute almost always does.
  • Caching does not improve quality. It only makes the prompt cheaper. A bad prompt cached is still a bad prompt — and a stale, accumulating context cached is still suffering from context rot, just at a discount.

Solution

Architect the prompt as stable-prefix-first, variable-suffix-last, and let the provider cache the prefix. The prefix should be everything that does not change between calls in this session: the system prompt, the instruction file, the tool catalog, retrieved documents that survive across turns, and the part of the conversation transcript that is now fixed history. The suffix is whatever changed since last call: the new user turn, the new tool result, the latest streaming output.

Three implementation styles are common, and most production stacks use the one that matches their provider:

Implicit caching (OpenAI, Google). The provider hashes the prompt prefix automatically and matches against its cache without any annotation. There is nothing to configure; if the prompt is long enough (OpenAI requires 1,024 tokens; Google’s threshold varies by model) and starts with a prefix the provider has seen recently, you get the discount. OpenAI bills cached input at a deep discount (their docs report up to 90% off) and Google’s implicit tier offers similar discounts on cached portions. The price you pay for zero configuration is zero control: the cache is opaque, and you cannot force a hit or guarantee one.
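A minimal sketch of the implicit style with the OpenAI Python SDK. The model name and instruction-file path are placeholders; nothing cache-specific goes into the request, the ordering of the messages is the entire integration, and the response’s usage block reports how much of the input was served from cache.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stable prefix: loaded once, sent byte-identical on every call this session.
SYSTEM_PROMPT = open("agent_instructions.md").read()  # placeholder path

def ask(history: list[dict], user_turn: str) -> str:
    # Stable content first, variable content last; the provider hashes and
    # matches the prefix on its own, no annotation required.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *history,
            {"role": "user", "content": user_turn},
        ],
    )
    # Cached vs. total input tokens: the hit-rate signal worth logging.
    usage = response.usage
    cached = (usage.prompt_tokens_details.cached_tokens or 0) if usage.prompt_tokens_details else 0
    print(f"cached {cached} of {usage.prompt_tokens} input tokens")
    return response.choices[0].message.content
```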

Explicit caching (Anthropic, Google CachedContent). The caller marks cache breakpoints in the request. Anthropic uses cache_control: { type: "ephemeral" } on specific content blocks; Google uses a CachedContent resource you create and reference by name. The provider commits to caching at exactly those points. On Anthropic, cache reads bill at about 10% of the normal input rate (90% off), with a 25% write premium on the 5-minute TTL and a 100% write premium on the 1-hour TTL. Google’s explicit tier follows a similar shape. Explicit caching is the right choice when you know which prefix is hot and want a guaranteed hit.
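The explicit style, sketched with the Anthropic Python SDK. The model name and file path are placeholders; the single breakpoint sits on the last stable block, so everything up to and including it is cached.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = open("CLAUDE.md").read()  # long, stable for the whole session

def ask(history: list[dict], user_turn: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Breakpoint: everything up to and including this block is cached
                # (5-minute TTL by default; reads bill at ~10% of the input rate).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[*history, {"role": "user", "content": user_turn}],
    )
    u = response.usage
    # cache_creation_input_tokens: written this call (25% write premium).
    # cache_read_input_tokens: served from cache (the discount you came for).
    print(f"cache wrote {u.cache_creation_input_tokens}, read {u.cache_read_input_tokens}")
    return response.content[0].text
```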

Cross-provider abstractions (LiteLLM, OpenRouter). Both expose a single caching surface that maps to whichever underlying provider is in use. The semantics flatten to the lowest common denominator (you give up some Anthropic-specific TTL controls when going through OpenRouter, for example), but you get to write the agent once and switch providers without rewriting the cache integration.
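The same breakpoint expressed through LiteLLM’s OpenAI-style interface, as a sketch. LiteLLM forwards the cache_control annotation to providers that support it; exactly what survives the translation depends on the provider behind the model string.

```python
import litellm

SYSTEM_PROMPT = open("agent_instructions.md").read()  # placeholder path

response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",  # swap providers by changing this string
    messages=[
        {
            "role": "system",
            # Content-block form so the cache_control annotation has somewhere to live.
            "content": [
                {
                    "type": "text",
                    "text": SYSTEM_PROMPT,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the open TODOs."},
    ],
)
print(response.usage)
```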

The cross-cutting discipline is the same in all three: stable parts first, never reorder, never edit in place. If a fact in the instruction file turns out to be wrong mid-session, the temptation is to fix it in place. Don’t: that edit invalidates every byte downstream, and you’ll pay full price on the next call. Either let the session finish on the stale fact, or accept the cache miss as the cost of correctness.

Tip

Order your prompt by stability. Put the stuff that never changes (system prompt, role description) first. Then the stuff that changes per session (project instruction file, retrieved documents). Then the stuff that changes per turn (conversation history, current user message). The earlier in the prompt a token is, the more cache hits it earns over the session’s lifetime.
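One way to keep that ordering from regressing is to assemble the prompt from explicitly tiered segments, most stable first. The helper below is a sketch of that shape, not any particular framework’s API.

```python
def build_messages(
    role_prompt: str,            # never changes across sessions
    project_instructions: str,   # changes per session
    retrieved_docs: str,         # changes per session
    history: list[dict],         # fixed transcript prefix, grows only at the tail
    user_turn: str,              # changes every call
) -> list[dict]:
    """Assemble messages most-stable-first so the cacheable prefix is maximal."""
    system_text = "\n\n".join([role_prompt, project_instructions, retrieved_docs])
    return [
        {"role": "system", "content": system_text},
        *history,
        {"role": "user", "content": user_turn},
    ]
```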

How It Plays Out

A team runs a coding agent with a 30,000-token CLAUDE.md and a tool catalog of about 8,000 tokens. Every turn ships those 38,000 tokens plus the conversation so far. Without caching, the bill works out to about $0.11 per turn just on input, and a typical session has 40 turns. They switch on Anthropic’s cache_control with a breakpoint after the tool catalog. The first turn pays the 25% write premium on the prefix. Every subsequent turn within the 5-minute TTL bills the prefix at 10% of normal, around $0.011 instead of $0.114. The prefix that used to cost $4.56 over the session now costs around $0.59. They extend the TTL to 1 hour for the long-running agents that idle between user turns, and the savings compound.
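The arithmetic behind those numbers, using the rates quoted above ($3 per million input tokens, 10% cached reads, 25% write premium on the 5-minute TTL):

```python
RATE = 3.00 / 1_000_000              # dollars per input token
PREFIX_TOKENS = 38_000               # CLAUDE.md plus tool catalog
TURNS = 40

uncached = PREFIX_TOKENS * RATE * TURNS                    # $4.56 for the prefix alone
first_write = PREFIX_TOKENS * RATE * 1.25                  # $0.14, the write-premium turn
cached_reads = PREFIX_TOKENS * RATE * 0.10 * (TURNS - 1)   # $0.44 across 39 cached turns
print(f"uncached prefix: ${uncached:.2f}, cached prefix: ${first_write + cached_reads:.2f}")
```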

A multi-tenant RAG system runs hundreds of concurrent users, each with a different retrieved document set. The naive shape (system prompt, then user-specific documents, then user query) gets a cache hit only on the system prompt because every user’s document set differs. The team restructures: system prompt first, then a stable user-tier description (“free” / “pro” / “enterprise”), then the documents, then the query. The first two segments cache cleanly across all users in a tier. The documents cache per-user within their session. The query never caches. Total cost drops 40%, and latency on cached prefixes drops more than half.

A developer building a long-form research agent notices that every turn of the agent’s reflection loop is sending the same 60-page paper as context. The paper hasn’t changed; only the agent’s question about it has. Switching to explicit caching with a breakpoint after the paper ends turns the per-turn cost from prohibitive to nearly free. The 1-hour TTL covers a typical research session end-to-end, so the only full-price call is the first one.

Warning

The most expensive cache is the one that never hits. Common sources of silent invalidation: a timestamp injected into the system prompt (“today is 2026-04-27”), a non-deterministic tool catalog ordering, and rewriting earlier turns to “clean up” the conversation history. Appending is fine: the cache extends through the turns the model has already seen, and the next request simply adds the assistant’s latest response and the new user turn after the cached prefix. Editing anything already in the transcript is not. The arXiv paper “Don’t Break the Cache” documents how easy it is to inadvertently miss the cache by reordering equivalent content, especially in long-horizon agentic loops.
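A sketch of the most common self-inflicted miss, the injected timestamp, next to the cache-friendly alternative of pushing volatile facts to the variable suffix (the prompt text here is a placeholder):

```python
from datetime import date

SYSTEM_PROMPT = "You are the release-notes agent. Follow the style guide below. ..."  # stable text

# Cache-hostile: the prefix changes every day (or every call, with a full timestamp),
# so every token after the injection point is recomputed at full price.
bad_system = f"Today is {date.today().isoformat()}.\n\n{SYSTEM_PROMPT}"

# Cache-friendly: keep the prefix byte-identical and push the volatile fact to the
# end of the prompt, where no cached content sits downstream of it.
good_messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"(Today is {date.today().isoformat()}.) Draft today's release notes."},
]
```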

Consequences

The wins are real and measured. Long-context agentic workflows that would otherwise be uneconomic become routine. Latency drops because the provider skips recomputation: Anthropic publishes 85% latency reduction on long prompts; on conversational agents this shows up as the response starting visibly faster after the first turn warms the cache. Costs drop in lockstep with hit rates: an agent that consistently hits the cache pays roughly 10–25% of the no-cache rate on its prefix, depending on provider.

The cost is architectural discipline. Once a prompt is cached, every byte upstream of any change is a sunk cost. This forces a clear separation between what’s stable and what’s variable, and it punishes mid-session edits to anything early in the prompt. Some teams find this a useful constraint: it nudges them toward keeping configuration in stable files and accumulating volatile state outside the prompt entirely, which is the externalized state discipline. Other teams find it a footgun: every refactor of the prompt template invalidates the cache for every running agent.

A second cost is operational: you have to monitor hit rates. Cached input is billed and reported separately from uncached input, and the ratio is the lever you actually care about. A hit rate above 80% on a long prefix means the architecture is working. A hit rate near zero means something is invalidating silently (a timestamp, a non-deterministic ordering, an over-eager template change), and you’ll see it on the bill before you see it in code review.
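A sketch of the per-call bookkeeping that surfaces a silently broken cache. The usage field names in the comments follow the OpenAI-style response; other providers report the same information under different names.

```python
class CacheHitTracker:
    """Accumulate cached vs. total input tokens across calls and expose the hit rate."""

    def __init__(self) -> None:
        self.cached = 0
        self.total = 0

    def record(self, prompt_tokens: int, cached_tokens: int) -> None:
        self.total += prompt_tokens
        self.cached += cached_tokens

    @property
    def hit_rate(self) -> float:
        return self.cached / self.total if self.total else 0.0


tracker = CacheHitTracker()
# After each call, feed in the usage numbers, e.g. for an OpenAI-style response:
#   tracker.record(usage.prompt_tokens, usage.prompt_tokens_details.cached_tokens or 0)
# A long-prefix agent whose hit_rate sits near zero is paying full price for its preamble.
```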

A third is provider lock-in pressure. Each provider’s exact semantics differ: TTLs, breakpoint placement rules, minimum cacheable lengths, and discount tiers all vary. A workload tuned for Anthropic’s 90% discount on a 1-hour TTL will not see the same economics on OpenAI’s automatic cache, whose discount tier varies by model, and switching providers without recalibrating the prompt structure costs more than it saves. Cross-provider abstractions help, but always at the cost of the lowest-common-denominator feature set.

Finally, caching is a cost lever, not a quality lever. It does not extend the context window, does not slow context rot, and does not improve the model’s reasoning on the cached content. A long, stale, repetitive context cached is exactly as bad for output quality as the same context uncached, just much cheaper to keep shipping. Use compaction when the prompt is too long for the model to reason over, and use prompt caching when the prompt is the right length but is repeated across many calls. They solve different problems and compose well.

Sources

The mechanism is a direct application of decades of work on transformer KV-cache reuse, applied to the inference-API setting. What’s new is the commercial productization: providers exposing the cache as a first-class billing tier with developer-controlled breakpoints.

Anthropic’s prompt caching feature, launched in 2024 and stabilized through 2025, established the explicit cache_control breakpoint model that other providers have since adopted variants of. The Anthropic documentation is the canonical reference for the explicit-caching shape.

OpenAI introduced automatic prompt caching in 2024, prioritizing zero-configuration adoption over caller control. Their published cache-hit pricing (up to 90% off cached input tokens above the 1,024-token threshold, with up to 80% latency reduction) is the reference point for implicit caching’s economics.

Google’s context caching, available through Gemini, splits the surface into implicit and explicit modes. The explicit CachedContent resource model is closer to a named cache entry than to inline breakpoints, which is a different ergonomic choice from Anthropic’s but the same underlying mechanism.

The arXiv paper Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks provides an academic evaluation of cache stability under exactly the workload pattern this article addresses. Its central finding, that small changes in prompt construction order produce large swings in cache hit rate, is the failure mode every production team rediscovers.

The cross-provider abstractions LiteLLM and OpenRouter document the lowest-common-denominator caching surface across providers and are the most concise inventory of which provider supports which feature.

Further Reading