Prompt Caching and KV Cache: Speeding Up LLM Responses
LLM APIs & Developer Tools
Prompt Caching and KV Cache: Speeding Up LLM Responses
SStackviv Team
13 min read

Key takeaways

  • Prompt caching stores frequently used prompt content (like system instructions or documents) so LLMs don't reprocess them on every request, cutting costs by up to 90% and latency by up to 85%
  • KV cache is the underlying mechanism that makes this possible, storing key and value tensors from the transformer's attention layers during inference
  • OpenAI caches automatically for prompts over 1,024 tokens with no extra charge. Anthropic gives you more control with explicit cache breakpoints but charges 25% more for cache writes. Google offers both implicit and explicit caching.
  • Best use cases include chatbots with fixed system prompts, document Q&A, coding assistants, and agentic workflows with tool definitions
  • Structure your prompts with static content first and dynamic content last to maximize cache hits

Prompt Caching and KV Cache: Speeding Up LLM Responses

If you've ever built an application with LLMs, you've felt the pain: every API call costs money, and long prompts take forever to process. Now imagine sending the same 10,000 token document to the model over and over for each user question. That's a lot of wasted computation.

Prompt caching solves this problem by storing frequently used prompt content between API calls. Instead of reprocessing your system instructions or reference documents each time, the model reuses its previous work. The result? Costs drop by up to 90%, and response times shrink by up to 85%.

But here's what most explanations miss: prompt caching isn't about saving your actual text somewhere. It's about caching the model's internal computational state, specifically something called the KV cache. Understanding this distinction helps you structure prompts better and troubleshoot when caching doesn't work as expected.

Let's break down exactly what happens under the hood.

What Is KV Cache and Why Does It Matter?

Before we can understand prompt caching, we need to talk about how tokens work in LLMs. When you send text to an LLM, it first gets split into tokens, then converted into numerical representations the model can process.

The magic happens in the attention mechanism. During attention, each token gets transformed into three vectors: a query (Q), a key (K), and a value (V). The model uses these to figure out which parts of the input are relevant to each other.

Here's the problem: LLMs generate text one token at a time. For each new token, the model needs to compute attention against all previous tokens. Without optimization, this means recalculating the same key and value vectors over and over again.

KV caching fixes this by storing the K and V vectors after they're computed the first time. When generating the next token, the model just looks up the cached vectors instead of recalculating them. This transforms what would be quadratic scaling into linear scaling.

Think of it this way: imagine reading a 500 page book and taking detailed notes as you go. Without KV caching, you'd have to re-read and re-note the entire book every time someone asked you a question. With KV caching, you keep your notes and just flip to the relevant page.

This is why ChatGPT takes a moment before the first token appears, then streams responses quickly. That initial pause is the model building its KV cache for your prompt.

How Prompt Caching Builds on KV Cache

LLM providers realized something clever: if many API requests share the same prompt prefix (like the same system instructions), why not cache that computational work between requests too?

That's essentially what prompt caching does. When you enable it, the provider stores the KV cache state for your prompt prefix on their servers. The next request that uses the same prefix can skip straight to processing the new content.

The key requirements for cache hits:

  • Exact prefix match: The cached portion must be identical, down to the character and whitespace
  • Minimum token threshold: Most providers require at least 1,024 tokens before caching kicks in
  • Cache lifetime (TTL): Caches typically expire after 5 to 10 minutes of inactivity, though some providers offer longer durations
  • Consistent ordering: For tools and examples, the order must match exactly

This is why understanding context window and token limits matters. The cached portion counts against your context window, but you're not paying full price for those tokens on subsequent requests.

Comparing Provider Implementations

The three major providers handle prompt caching differently. Here's how to work with each.

OpenAI: Automatic Caching

OpenAI takes a hands-off approach. Prompt caching activates automatically for any prompt over 1,024 tokens. There's no API changes required and no extra fees for cache writes or storage.

The system caches in 128 token increments beyond the initial 1,024 tokens. Cached tokens cost 50% of normal input token prices. Cache entries typically last 5 to 10 minutes of inactivity, though they can persist up to an hour during off-peak times.

With GPT-5.1, OpenAI introduced extended caching that can retain prompts for up to 24 hours. This is particularly useful for long-running coding sessions or multi-turn conversations.

One quirk: OpenAI routes requests to servers based on a hash of your prompt's first ~256 tokens. If you want consistent cache hits, keep that initial prefix stable. You can also pass a prompt_cache_key parameter to influence routing.

Anthropic: Explicit Control

Anthropic gives you more control but charges for it. You explicitly mark cache boundaries using cache_control breakpoints in your API requests.

The pricing model:

  • Cache writes cost 125% of base input token price (25% premium)
  • Cache reads cost just 10% of base input token price (90% discount)
  • Standard cache TTL is 5 minutes, with a 1-hour option available at 2x base price

This means Anthropic's approach pays off if you're hitting the cache frequently. If your prompts change often, those write premiums add up.

Minimum cacheable tokens vary by model: 1,024 for most Claude models, but 4,096 for Claude Haiku 4.5. You can place up to 4 cache breakpoints per request, letting you cache different sections independently.

Google Gemini: Hybrid Approach

Google offers both implicit and explicit caching for Gemini 2.5 models.

Implicit caching works automatically, similar to OpenAI. The minimum is 1,024 tokens for Gemini 2.5 Flash and 2,048 tokens for Gemini 2.5 Pro. Cached tokens get a 90% discount with no additional setup.

Explicit caching through the Context Caching API gives you more control. You create a named cache object with a configurable TTL (default 60 minutes) and reference it in subsequent requests. This is useful when you want guaranteed cost savings rather than hoping for cache hits.

Google also charges for cache storage based on TTL duration, so factor that into your cost calculations for long-lived caches.

The Real Cost Savings: A Quick Calculation

Let's say you're building a document Q&A app. Users upload a 50,000 token contract and ask multiple questions about it.

Without caching (Anthropic Claude Sonnet 4.5 pricing):

  • Each question: 50,000 input tokens at $3.00 per million = $0.15
  • 10 questions: $1.50 total

With caching:

  • First question (cache write): 50,000 tokens at $3.75 per million = $0.1875
  • Subsequent questions (cache read): 50,000 tokens at $0.30 per million = $0.015 each
  • 10 questions: $0.1875 + ($0.015 x 9) = $0.3225 total

That's a 78% cost reduction, and it gets better with more questions.

When you're thinking about strategies for AI cost reduction, prompt caching should be near the top of your list. Combine it with batch requests for efficiency and you can dramatically lower your API bills.

Best Use Cases for Prompt Caching

Prompt caching works best when you have repetitive, static content that gets reused across multiple requests. Here are the scenarios where it shines.

Chatbots and Conversational Agents

Your chatbot probably has a detailed system prompt defining its persona, capabilities, and rules. Without caching, the model reprocesses these instructions on every message.

With prompt caching, you cache the system prompt once. Each user message only requires processing the new content. This is especially valuable for AI workflow automation tools that maintain long conversations.

Document Q&A

Embed your entire document in the prompt, cache it, then let users ask unlimited questions. The model only processes each new question, not the full document each time.

This pattern works great for legal contracts, technical documentation, research papers, or any long-form content that needs repeated querying.

Coding Assistants

Cache your codebase summary, project context, or coding standards. As developers ask questions or request changes, the model already understands the project context without re-reading everything.

Few-Shot Learning

Including many high-quality examples improves model performance but increases prompt length. With caching, you can include 20 or even 100 examples without paying full price for each request.

Agentic Workflows

AI agents typically have complex system prompts with detailed tool definitions. When using multiple rounds of tool calls, caching these definitions dramatically speeds up each step.

Understanding what is AI inference exactly helps you appreciate why this matters. Every inference call has compute costs, and caching lets you amortize those costs across many requests.

How to Structure Prompts for Maximum Cache Hits

The golden rule: put static content first, dynamic content last.

Cache matching works on prefixes. If anything changes in the cached portion, you lose the cache hit. So structure your prompts like this:

  • [TOOLS - static, cache this]
  • [SYSTEM PROMPT - static, cache this]
  • [FEW-SHOT EXAMPLES - static, cache this]
  • [DOCUMENT CONTEXT - static for this session, cache this]
  • [CONVERSATION HISTORY - semi-dynamic]
  • [USER QUERY - dynamic, don't cache]

Some practical tips:

Keep tool definitions stable. If you're using function calling, define tools in a consistent order. Changing the order invalidates the cache.

Don't put user-specific data in cached sections. If you include the user's name or preferences in your system prompt, every user gets a different cache entry.

Use consistent formatting. Extra whitespace or reformatting text can break cache matches. Normalize your prompts.

Consider cache breakpoints strategically. With Anthropic, you can set multiple breakpoints to cache sections independently. If your tool definitions change rarely but your examples update weekly, cache them separately.

For more on LLM parameters and configuration, see our dedicated guide.

Prompt Caching vs Semantic Caching

You might hear about semantic caching as another option. These are different techniques that can work together.

Prompt caching (what we've been discussing):

  • Happens inside the LLM provider's infrastructure
  • Requires exact prefix matches
  • Caches the model's intermediate computational state
  • Still calls the LLM, just faster and cheaper

Semantic caching:

  • Happens at the application layer
  • Uses embedding similarity to match similar queries
  • Caches the actual output text
  • Completely skips the LLM call on cache hits

Semantic caching is useful when users ask the same questions in different ways. If one user asks "How do I reset my password?" and another asks "I forgot my password, what do I do?", semantic caching can return the same cached answer.

The best approach often combines both. Use prompt caching to make each LLM call efficient, and semantic caching to eliminate redundant LLM calls entirely.

Advanced KV Cache Optimizations

While prompt caching is what you control as an API user, it's worth understanding the techniques providers use to manage KV cache at scale.

PagedAttention

Traditional KV cache allocation is wasteful. Memory gets reserved for the maximum possible sequence length, even if most requests are shorter.

PagedAttention, developed for the vLLM framework, borrows concepts from operating system memory management. It splits the KV cache into fixed-size blocks that can be allocated non-contiguously, just like virtual memory pages. This dramatically reduces memory waste and enables higher throughput.

Grouped Query Attention (GQA)

Standard multi-head attention uses separate K and V vectors for each attention head. GQA shares K and V vectors across groups of query heads, reducing KV cache size by 4x to 8x while maintaining most of the model quality.

Models like Llama 3, Mistral, and Gemini use GQA. This is why newer models can handle longer contexts without running out of GPU memory.

Sliding Window Attention

Instead of attending to all previous tokens, sliding window attention focuses on a fixed window of recent tokens. This caps KV cache growth regardless of sequence length.

Models like Mistral use a 4,096 token sliding window, enabling efficient processing of very long sequences.

Relationship to Other LLM Concepts

Prompt caching connects to several other optimization techniques worth understanding.

Streaming for faster first tokens: While prompt caching reduces the time before the first token (TTFT), streaming lets you display tokens as they're generated. Use both for the most responsive experience.

Temperature for output variation: Caching affects input processing, not output generation. You can still vary temperature, top-p, and other sampling parameters without invalidating your cache.

Training vs inference: KV caching is purely an inference optimization. During training, the model processes full sequences in parallel and doesn't need to cache intermediate states.

Managing API rate limits: Cache hits typically don't count against rate limits (Anthropic explicitly excludes them). This means caching can also improve your effective rate limit utilization.

Latency vs throughput optimization: Prompt caching primarily reduces latency. For throughput optimization, combine it with batching and other techniques.

Common Pitfalls and How to Avoid Them

Expecting cache hits immediately. Caches need to be warmed. The first request with a given prefix always pays full price. Build this into your cost projections.

Changing prefix content frequently. If your system prompt or tool definitions change often, you'll rarely get cache hits. Stabilize your prefixes.

Not meeting minimum token thresholds. Prompts under 1,024 tokens (or higher for some models) won't cache at all. Sometimes it's worth padding prompts to hit the threshold.

Forgetting about cache expiration. Standard TTL is 5 minutes. For intermittent workloads with gaps longer than 5 minutes, you'll frequently recache. Consider Anthropic's 1-hour option or Google's explicit caching with longer TTL.

Putting dynamic content in the wrong place. User IDs, timestamps, or personalized content in your cached prefix break cache matches. Move anything dynamic to the end of your prompt.

What's Next for LLM Caching

The trend is toward smarter, more flexible caching. We're seeing:

  • Longer cache retention: OpenAI's 24-hour option for GPT-5.1 signals that providers recognize the value of persistent caches
  • Automatic prefix matching: Instead of requiring exact matches, future systems might identify common subsequences and cache those
  • Cross-request optimization: Techniques like RadixAttention already enable automatic prefix caching within inference engines like SGLang
  • Lower minimum thresholds: Google recently dropped Gemini's minimum from 32K to 4K tokens, making caching accessible to more use cases

As transformer architecture continues evolving with techniques like MLA (Multi-Head Latent Attention), we'll see even more efficient KV representations that enable better caching.

Getting Started

Ready to implement prompt caching in your application? Here's a quick action plan:

  1. Audit your current prompts. Identify which parts are static (system instructions, tool definitions, reference documents) vs dynamic (user queries, conversation history).
  2. Restructure for caching. Move all static content to the beginning of your prompts.
  3. Check if you meet minimums. Ensure your static prefix is at least 1,024 tokens.
  4. Enable caching per your provider. OpenAI is automatic. Anthropic requires cache_control breakpoints. Google works automatically for implicit caching.
  5. Monitor cache hit rates. Track cached_tokens (OpenAI) or cache_read_input_tokens (Anthropic) in API responses. If hit rates are low, investigate why.
  6. Calculate your actual savings. Compare costs before and after caching to validate the impact.

If you're exploring AI tools for your workflow, prompt caching is one of those optimizations that pays dividends immediately. A few hours of implementation work can save thousands of dollars monthly at scale.

Frequently Asked Questions

Does prompt caching affect output quality?

No. The model processes cached tokens identically to fresh tokens. Prompt caching only skips redundant computation, it doesn't change what the model outputs.

Can I cache across different models?

No. Caches are model-specific. A cache for GPT-5.1 won't work with GPT-4.1, and a Claude Sonnet cache won't work with Claude Opus.

What happens if the cache expires mid-conversation?

The next request pays full price to rebuild the cache. For long-running conversations, consider extended cache durations or accept occasional cache misses.

How do I know if caching is working?

Check the API response metadata. OpenAI returns cached_tokens in usage.prompt_tokens_details. Anthropic returns cache_read_input_tokens in the usage object. Google returns cached_content_token_count.

Is prompt caching the same as the cached context window?

Related but distinct. The cached context window refers to how much of your prompt prefix is cached. Prompt caching is the feature that enables this. A cached context window can significantly reduce costs for subsequent requests using that cached content.
Stackviv Team

Stackviv Team

Author

Stackviv Team is our editorial crew of AI enthusiasts and tech researchers dedicated to helping you discover the best AI tools. We test, compare, and review AI software across every category to bring you honest insights and practical guides. Our mission: make AI accessible and useful for everyone - from beginners to professionals.

Related Articles

View All
Streaming vs Non-streaming API Responses
LLM APIs & Developer Tools

Streaming vs Non-streaming API Responses

Understanding when to use streaming APIs for real-time AI output versus non-streaming batch responses, including implementation details for SSE, chunked responses, and performance optimization.

SStackviv Team
14 min
Read: Streaming vs Non-streaming API Responses
Batching API Requests: Optimizing for Cost and Speed
LLM APIs & Developer Tools

Batching API Requests: Optimizing for Cost and Speed

Learn how to batch API requests to cut LLM costs by 50% and dramatically boost throughput. Complete guide covering OpenAI, Anthropic Claude, and Google Gemini batch processing implementations for 2026.

SStackviv Team
11 min
Read: Batching API Requests: Optimizing for Cost and Speed
LLM Temperature Explained: Controlling AI Creativity
LLM APIs & Developer Tools

LLM Temperature Explained: Controlling AI Creativity

Learn how LLM temperature controls AI output randomness, from predictable responses at temperature 0 to creative outputs at temperature 1, with practical use case recommendations and API examples.

SStackviv Team
12 min
Read: LLM Temperature Explained: Controlling AI Creativity