Prompt Caching and KV Cache: Speeding Up LLM Responses
If you've ever built an application with LLMs, you've felt the pain: every API call costs money, and long prompts take forever to process. Now imagine sending the same 10,000 token document to the model over and over for each user question. That's a lot of wasted computation.
Prompt caching solves this problem by storing frequently used prompt content between API calls. Instead of reprocessing your system instructions or reference documents each time, the model reuses its previous work. The result? Costs drop by up to 90%, and response times shrink by up to 85%.
But here's what most explanations miss: prompt caching isn't about saving your actual text somewhere. It's about caching the model's internal computational state, specifically something called the KV cache. Understanding this distinction helps you structure prompts better and troubleshoot when caching doesn't work as expected.
Let's break down exactly what happens under the hood.
What Is KV Cache and Why Does It Matter?
Before we can understand prompt caching, we need to talk about how tokens work in LLMs. When you send text to an LLM, it first gets split into tokens, then converted into numerical representations the model can process.
The magic happens in the attention mechanism. During attention, each token gets transformed into three vectors: a query (Q), a key (K), and a value (V). The model uses these to figure out which parts of the input are relevant to each other.
Here's the problem: LLMs generate text one token at a time. For each new token, the model needs to compute attention against all previous tokens. Without optimization, this means recalculating the same key and value vectors over and over again.
KV caching fixes this by storing the K and V vectors after they're computed the first time. When generating the next token, the model just looks up the cached vectors instead of recalculating them. This transforms what would be quadratic scaling into linear scaling.
Think of it this way: imagine reading a 500 page book and taking detailed notes as you go. Without KV caching, you'd have to re-read and re-note the entire book every time someone asked you a question. With KV caching, you keep your notes and just flip to the relevant page.
This is why ChatGPT takes a moment before the first token appears, then streams responses quickly. That initial pause is the model building its KV cache for your prompt.
How Prompt Caching Builds on KV Cache
LLM providers realized something clever: if many API requests share the same prompt prefix (like the same system instructions), why not cache that computational work between requests too?
That's essentially what prompt caching does. When you enable it, the provider stores the KV cache state for your prompt prefix on their servers. The next request that uses the same prefix can skip straight to processing the new content.
The key requirements for cache hits:
- Exact prefix match: The cached portion must be identical, down to the character and whitespace
- Minimum token threshold: Most providers require at least 1,024 tokens before caching kicks in
- Cache lifetime (TTL): Caches typically expire after 5 to 10 minutes of inactivity, though some providers offer longer durations
- Consistent ordering: For tools and examples, the order must match exactly
This is why understanding context window and token limits matters. The cached portion counts against your context window, but you're not paying full price for those tokens on subsequent requests.
Comparing Provider Implementations
The three major providers handle prompt caching differently. Here's how to work with each.
OpenAI: Automatic Caching
OpenAI takes a hands-off approach. Prompt caching activates automatically for any prompt over 1,024 tokens. There's no API changes required and no extra fees for cache writes or storage.
The system caches in 128 token increments beyond the initial 1,024 tokens. Cached tokens cost 50% of normal input token prices. Cache entries typically last 5 to 10 minutes of inactivity, though they can persist up to an hour during off-peak times.
With GPT-5.1, OpenAI introduced extended caching that can retain prompts for up to 24 hours. This is particularly useful for long-running coding sessions or multi-turn conversations.
One quirk: OpenAI routes requests to servers based on a hash of your prompt's first ~256 tokens. If you want consistent cache hits, keep that initial prefix stable. You can also pass a prompt_cache_key parameter to influence routing.
Anthropic: Explicit Control
Anthropic gives you more control but charges for it. You explicitly mark cache boundaries using cache_control breakpoints in your API requests.
The pricing model:
- Cache writes cost 125% of base input token price (25% premium)
- Cache reads cost just 10% of base input token price (90% discount)
- Standard cache TTL is 5 minutes, with a 1-hour option available at 2x base price
This means Anthropic's approach pays off if you're hitting the cache frequently. If your prompts change often, those write premiums add up.
Minimum cacheable tokens vary by model: 1,024 for most Claude models, but 4,096 for Claude Haiku 4.5. You can place up to 4 cache breakpoints per request, letting you cache different sections independently.
Google Gemini: Hybrid Approach
Google offers both implicit and explicit caching for Gemini 2.5 models.
Implicit caching works automatically, similar to OpenAI. The minimum is 1,024 tokens for Gemini 2.5 Flash and 2,048 tokens for Gemini 2.5 Pro. Cached tokens get a 90% discount with no additional setup.
Explicit caching through the Context Caching API gives you more control. You create a named cache object with a configurable TTL (default 60 minutes) and reference it in subsequent requests. This is useful when you want guaranteed cost savings rather than hoping for cache hits.
Google also charges for cache storage based on TTL duration, so factor that into your cost calculations for long-lived caches.
The Real Cost Savings: A Quick Calculation
Let's say you're building a document Q&A app. Users upload a 50,000 token contract and ask multiple questions about it.
Without caching (Anthropic Claude Sonnet 4.5 pricing):
- Each question: 50,000 input tokens at $3.00 per million = $0.15
- 10 questions: $1.50 total
With caching:
- First question (cache write): 50,000 tokens at $3.75 per million = $0.1875
- Subsequent questions (cache read): 50,000 tokens at $0.30 per million = $0.015 each
- 10 questions: $0.1875 + ($0.015 x 9) = $0.3225 total
That's a 78% cost reduction, and it gets better with more questions.
When you're thinking about strategies for AI cost reduction, prompt caching should be near the top of your list. Combine it with batch requests for efficiency and you can dramatically lower your API bills.
Best Use Cases for Prompt Caching
Prompt caching works best when you have repetitive, static content that gets reused across multiple requests. Here are the scenarios where it shines.
Chatbots and Conversational Agents
Your chatbot probably has a detailed system prompt defining its persona, capabilities, and rules. Without caching, the model reprocesses these instructions on every message.
With prompt caching, you cache the system prompt once. Each user message only requires processing the new content. This is especially valuable for AI workflow automation tools that maintain long conversations.
Document Q&A
Embed your entire document in the prompt, cache it, then let users ask unlimited questions. The model only processes each new question, not the full document each time.
This pattern works great for legal contracts, technical documentation, research papers, or any long-form content that needs repeated querying.
Coding Assistants
Cache your codebase summary, project context, or coding standards. As developers ask questions or request changes, the model already understands the project context without re-reading everything.
Few-Shot Learning
Including many high-quality examples improves model performance but increases prompt length. With caching, you can include 20 or even 100 examples without paying full price for each request.
Agentic Workflows
AI agents typically have complex system prompts with detailed tool definitions. When using multiple rounds of tool calls, caching these definitions dramatically speeds up each step.
Understanding what is AI inference exactly helps you appreciate why this matters. Every inference call has compute costs, and caching lets you amortize those costs across many requests.
How to Structure Prompts for Maximum Cache Hits
The golden rule: put static content first, dynamic content last.
Cache matching works on prefixes. If anything changes in the cached portion, you lose the cache hit. So structure your prompts like this:
- [TOOLS - static, cache this]
- [SYSTEM PROMPT - static, cache this]
- [FEW-SHOT EXAMPLES - static, cache this]
- [DOCUMENT CONTEXT - static for this session, cache this]
- [CONVERSATION HISTORY - semi-dynamic]
- [USER QUERY - dynamic, don't cache]
Some practical tips:
Keep tool definitions stable. If you're using function calling, define tools in a consistent order. Changing the order invalidates the cache.
Don't put user-specific data in cached sections. If you include the user's name or preferences in your system prompt, every user gets a different cache entry.
Use consistent formatting. Extra whitespace or reformatting text can break cache matches. Normalize your prompts.
Consider cache breakpoints strategically. With Anthropic, you can set multiple breakpoints to cache sections independently. If your tool definitions change rarely but your examples update weekly, cache them separately.
For more on LLM parameters and configuration, see our dedicated guide.
Prompt Caching vs Semantic Caching
You might hear about semantic caching as another option. These are different techniques that can work together.
Prompt caching (what we've been discussing):
- Happens inside the LLM provider's infrastructure
- Requires exact prefix matches
- Caches the model's intermediate computational state
- Still calls the LLM, just faster and cheaper
Semantic caching:
- Happens at the application layer
- Uses embedding similarity to match similar queries
- Caches the actual output text
- Completely skips the LLM call on cache hits
Semantic caching is useful when users ask the same questions in different ways. If one user asks "How do I reset my password?" and another asks "I forgot my password, what do I do?", semantic caching can return the same cached answer.
The best approach often combines both. Use prompt caching to make each LLM call efficient, and semantic caching to eliminate redundant LLM calls entirely.
Advanced KV Cache Optimizations
While prompt caching is what you control as an API user, it's worth understanding the techniques providers use to manage KV cache at scale.
PagedAttention
Traditional KV cache allocation is wasteful. Memory gets reserved for the maximum possible sequence length, even if most requests are shorter.
PagedAttention, developed for the vLLM framework, borrows concepts from operating system memory management. It splits the KV cache into fixed-size blocks that can be allocated non-contiguously, just like virtual memory pages. This dramatically reduces memory waste and enables higher throughput.
Grouped Query Attention (GQA)
Standard multi-head attention uses separate K and V vectors for each attention head. GQA shares K and V vectors across groups of query heads, reducing KV cache size by 4x to 8x while maintaining most of the model quality.
Models like Llama 3, Mistral, and Gemini use GQA. This is why newer models can handle longer contexts without running out of GPU memory.
Sliding Window Attention
Instead of attending to all previous tokens, sliding window attention focuses on a fixed window of recent tokens. This caps KV cache growth regardless of sequence length.
Models like Mistral use a 4,096 token sliding window, enabling efficient processing of very long sequences.
Relationship to Other LLM Concepts
Prompt caching connects to several other optimization techniques worth understanding.
Streaming for faster first tokens: While prompt caching reduces the time before the first token (TTFT), streaming lets you display tokens as they're generated. Use both for the most responsive experience.
Temperature for output variation: Caching affects input processing, not output generation. You can still vary temperature, top-p, and other sampling parameters without invalidating your cache.
Training vs inference: KV caching is purely an inference optimization. During training, the model processes full sequences in parallel and doesn't need to cache intermediate states.
Managing API rate limits: Cache hits typically don't count against rate limits (Anthropic explicitly excludes them). This means caching can also improve your effective rate limit utilization.
Latency vs throughput optimization: Prompt caching primarily reduces latency. For throughput optimization, combine it with batching and other techniques.
Common Pitfalls and How to Avoid Them
Expecting cache hits immediately. Caches need to be warmed. The first request with a given prefix always pays full price. Build this into your cost projections.
Changing prefix content frequently. If your system prompt or tool definitions change often, you'll rarely get cache hits. Stabilize your prefixes.
Not meeting minimum token thresholds. Prompts under 1,024 tokens (or higher for some models) won't cache at all. Sometimes it's worth padding prompts to hit the threshold.
Forgetting about cache expiration. Standard TTL is 5 minutes. For intermittent workloads with gaps longer than 5 minutes, you'll frequently recache. Consider Anthropic's 1-hour option or Google's explicit caching with longer TTL.
Putting dynamic content in the wrong place. User IDs, timestamps, or personalized content in your cached prefix break cache matches. Move anything dynamic to the end of your prompt.
What's Next for LLM Caching
The trend is toward smarter, more flexible caching. We're seeing:
- Longer cache retention: OpenAI's 24-hour option for GPT-5.1 signals that providers recognize the value of persistent caches
- Automatic prefix matching: Instead of requiring exact matches, future systems might identify common subsequences and cache those
- Cross-request optimization: Techniques like RadixAttention already enable automatic prefix caching within inference engines like SGLang
- Lower minimum thresholds: Google recently dropped Gemini's minimum from 32K to 4K tokens, making caching accessible to more use cases
As transformer architecture continues evolving with techniques like MLA (Multi-Head Latent Attention), we'll see even more efficient KV representations that enable better caching.
Getting Started
Ready to implement prompt caching in your application? Here's a quick action plan:
- Audit your current prompts. Identify which parts are static (system instructions, tool definitions, reference documents) vs dynamic (user queries, conversation history).
- Restructure for caching. Move all static content to the beginning of your prompts.
- Check if you meet minimums. Ensure your static prefix is at least 1,024 tokens.
- Enable caching per your provider. OpenAI is automatic. Anthropic requires cache_control breakpoints. Google works automatically for implicit caching.
- Monitor cache hit rates. Track cached_tokens (OpenAI) or cache_read_input_tokens (Anthropic) in API responses. If hit rates are low, investigate why.
- Calculate your actual savings. Compare costs before and after caching to validate the impact.
If you're exploring AI tools for your workflow, prompt caching is one of those optimizations that pays dividends immediately. A few hours of implementation work can save thousands of dollars monthly at scale.



