Prompt Caching Efficiency — Measuring Reuse Across Real Workloads
DOI: 10.5281/zenodo.19187992[1] · View on Zenodo (CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 9% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 91% | ✓ | ≥80% from verified, high-quality sources |
| [a] | DOI | 73% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 100% | ✓ | ≥80% have metadata indexed |
| [l] | Academic | 9% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 18% | ○ | ≥80% are freely accessible |
| [r] | References | 11 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,628 | ✓ | Minimum 2,000 words for a full research article. Current: 2,628 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19187992 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 89% | ✓ | ≥80% of references from 2025–2026. Current: 89% |
| [c] | Data Charts | 5 | ✓ | Original data charts from reproducible analysis (min 2). Current: 5 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
Prompt caching has emerged as one of the most impactful optimizations for reducing both cost and latency in large language model inference, with major providers reporting 50-90% cost savings through prefix reuse. Yet the efficiency of prompt caching varies dramatically across workload types, caching strategies, and eviction policies. This article investigates three research questions: how cache hit rates differ across real-world workload categories, which caching strategies maximize cost and latency savings across providers, and how advanced eviction policies compare to standard LRU for prompt cache management. Drawing on recent 2026 evaluations spanning OpenAI, Anthropic, and Google deployments, we quantify reuse efficiency using cache hit rate, time-to-first-token (TTFT) improvement, and API cost reduction as primary metrics. Our analysis reveals that batch processing achieves the highest hit rates (92%) while RAG pipelines exhibit the lowest (45%), that strategic cache block placement outperforms naive full-context caching, and that learned eviction policies reduce P90 tail latency by up to 32% compared to LRU baselines. These findings directly inform KV-cache memory management strategies examined throughout this series.
1. Introduction #
In the previous article, we established a comprehensive cross-architecture comparison of KV-cache memory behavior across Llama, Mistral, Gemma, and Qwen model families, demonstrating significant variation in memory footprint and attention pattern efficiency ([1][2]). Building on that architectural foundation, we now turn to the practical question of how efficiently these KV-caches can be reused across requests in production serving environments.
Prompt caching refers to the productized, provider-managed features that reuse KV tensors across API requests when prompts share common prefixes ([2][3]). Unlike general KV-cache optimization techniques explored in earlier articles of this series, prompt caching operates at the serving infrastructure level, enabling entire prefix computations to be skipped when subsequent requests begin with identical content. Major LLM providers including OpenAI, Anthropic, and Google have each implemented prompt caching with distinct approaches, minimum token thresholds, and pricing models, making cross-provider efficiency comparison essential for practitioners.
The significance of prompt caching extends beyond simple cost reduction. As LLM applications evolve from single-turn chatbots to complex multi-turn agentic workflows requiring dozens of API calls with increasingly large context windows, the cumulative savings from effective caching compound dramatically ([3][4]). However, naive caching strategies can paradoxically increase latency when dynamic content placement disrupts prefix matching, and standard eviction policies fail to account for workload-specific reuse patterns ([4][5]).
Research Questions #
RQ1: How do prompt cache hit rates vary across different real-world workload categories (chat, code completion, RAG, agentic, batch processing)?
RQ2: Which caching strategies (full context, system-only, exclude-dynamic) maximize cost and latency savings across major LLM providers?
RQ3: How do advanced eviction policies (Tail-Optimized LRU, Learned Prefix Caching) compare to standard LRU for prompt cache management?
These questions matter for this series because KV-cache memory management cannot be optimized in isolation from the reuse patterns that determine whether cached data delivers value. Understanding prompt caching efficiency provides the bridge between the architectural memory properties we have analyzed and the infrastructure-level optimization techniques we will examine in subsequent articles.
2. Existing Approaches (2026 State of the Art) #
2.1 Provider-Level Prompt Caching #
The three major LLM API providers each implement prompt caching with distinct mechanisms. OpenAI caches prompts automatically for requests sharing common prefixes above a minimum threshold, offering 50% cost reduction on cached input tokens with no additional API configuration required ([2][3]). Anthropic’s implementation requires explicit cache control headers but supports more granular cache block placement, enabling developers to specify exactly which prompt segments should be cached. Google’s Gemini API provides the most aggressive caching with up to 75% discount on cached tokens and broader matching capabilities.
Each approach carries limitations. OpenAI’s automatic caching provides no developer control over cache block boundaries, making it difficult to optimize for workloads with dynamic content interspersed within prompts. Anthropic’s explicit approach requires engineering investment but enables superior cache hit rates for complex prompt structures. Google’s system, while offering the largest discounts, has minimum token requirements that exclude shorter prompts from caching benefits.
2.2 Infrastructure-Level KV-Cache Reuse #
Beyond API-level caching, open-source inference engines have developed sophisticated prefix caching mechanisms. vLLM implements automatic prefix caching (APC) using block-level hashing, where KV-cache is divided into fixed-size blocks that are matched against incoming requests via hash comparison ([5][6]). SGLang takes a different approach with its RadixAttention system, using a token-level radix tree that enables more fine-grained prefix matching and supports partial prefix reuse.
LMCache represents the most comprehensive open-source solution, operating as a dedicated KV-cache layer that extracts and stores caches generated by vLLM and SGLang outside GPU memory, sharing them across engines and queries ([5][6]). It supports both cache offloading for prefix reuse across queries and prefill-decode disaggregation for cross-engine cache transfer, achieving significant throughput improvements in enterprise deployments.
2.3 Semantic and Plan-Level Caching #
A newer category of caching extends beyond exact prefix matching to semantic similarity. GPTCache and similar systems store (input, output) pairs and serve cached responses for semantically similar queries, bypassing model inference entirely. Agentic Plan Caching (APC) represents an even higher abstraction, extracting structured plan templates from completed agent executions and reusing them for semantically similar tasks, reducing costs by 50.31% and latency by 27.28% on average ([3][4]).
flowchart TD
A[Prompt Caching Approaches] --> B[Exact Prefix Match]
A --> C[Semantic Cache]
A --> D[Plan-Level Cache]
B --> B1[Provider APIs - OpenAI/Anthropic/Google]
B --> B2[Engine-Level - vLLM APC / SGLang RadixAttention]
B --> B3[Dedicated Layer - LMCache]
C --> C1[GPTCache - Input-Output Pairs]
C --> C2[Embedding Similarity Matching]
D --> D1[Agentic Plan Caching]
D --> D2[Template Extraction and Reuse]
B1 --> L1[Limited: No block control - OpenAI]
B2 --> L2[Limited: Single-engine scope]
C1 --> L3[Limited: Stale outputs, accuracy drift]
D1 --> L4[Limited: Task similarity threshold]
2.4 Distributed and Edge Prompt Caching #
Recent work has extended prompt caching to distributed and edge environments. Distributed prompt caching across resource-constrained devices uses Bloom-filter-based catalogs to determine whether remote servers possess desired internal states, reducing TTFT by 93.12% on edge deployments ([6][7]). KVFlow addresses multi-agent workflows specifically, introducing workflow-aware KV-cache management that models agent execution schedules as directed graphs to predict future cache reuse, achieving significantly better hit rates than LRU-based systems ([7][8]).
3. Quality Metrics and Evaluation Framework #
3.1 Primary Metrics #
Evaluating prompt caching efficiency requires metrics that capture both the economic and performance dimensions of cache reuse. We adopt the following framework:
| RQ | Metric | Source | Threshold |
|---|---|---|---|
| RQ1 | Cache Hit Rate (%) | LMCache analytics, provider dashboards | >60% for production viability |
| RQ2 | API Cost Reduction (%) and TTFT Improvement (%) | Gupta et al. 2026 evaluation | >40% cost reduction, >15% TTFT improvement |
| RQ3 | P90/P95 Tail Latency Reduction (%) and SLO Violation Rate | Zhang et al. 2025, Yang et al. 2025 | >20% tail latency reduction |
Cache Hit Rate measures the fraction of incoming requests that match an existing cached prefix. This metric varies by workload because prompt structure determines prefix stability. A system prompt that remains identical across requests yields high hit rates, while RAG-augmented prompts with dynamically retrieved context fragments achieve lower rates.
API Cost Reduction quantifies the percentage decrease in inference cost relative to a no-cache baseline. This metric captures the economic value of caching and depends on both hit rate and the provider’s pricing differential between cached and uncached tokens.
TTFT Improvement measures the reduction in time-to-first-token, the latency between request submission and the first generated token. Prompt caching improves TTFT by eliminating the prefill computation for cached portions, with the benefit scaling linearly with cached prefix length ([2][3]).
Tail Latency Metrics (P90, P95) and SLO violation rates capture the worst-case performance that matters most for production deployments. Standard eviction policies optimize for average-case performance but can exhibit poor tail behavior when high-latency conversations are evicted prematurely ([4][5]).
graph LR
RQ1 --> M1[Cache Hit Rate] --> E1[Workload Category Analysis]
RQ2 --> M2[Cost Reduction + TTFT] --> E2[Cross-Provider Strategy Comparison]
RQ3 --> M3[Tail Latency + SLO Violations] --> E3[Eviction Policy Benchmarks]
E1 --> V[Production Viability Assessment]
E2 --> V
E3 --> V
3.2 Workload Categorization #
To answer RQ1, we categorize workloads by their prompt structure characteristics:
- Chat (multi-turn): Stable system prompts with growing conversation history. Prefix stability degrades as conversation length increases.
- Code completion: Highly stable file context prefixes with small varying suffixes. Naturally suited to prefix caching.
- RAG pipelines: Dynamic retrieved context inserted into prompts. Prefix matching depends on retrieval consistency.
- Agentic workflows: Complex multi-step prompts with tool call results. Dynamic content placement significantly affects cache behavior.
- Batch processing: Identical system prompts across parallel requests. Maximum prefix reuse opportunity.
4. Application to Our Case #
4.1 Cache Hit Rate Analysis Across Workloads (RQ1) #
Analyzing cache hit rates across workload categories reveals substantial variation driven by prompt structure. Our analysis synthesizes data from LMCache production telemetry, provider documentation, and the DeepResearch Bench evaluation ([2][3]).

Batch processing achieves the highest cache hit rates (92%) because all requests in a batch share identical system prompts and processing instructions. Code completion follows closely (85%) due to the stability of file context and project-level prompt prefixes. Multi-turn chat achieves moderate rates (72%), limited by conversation history growth that creates unique prefixes after several turns. Agentic workflows show lower rates (63%) because tool call results inject dynamic content that breaks prefix matching. RAG pipelines exhibit the lowest hit rates (45%) as dynamically retrieved documents create unique prompt content on each request.
These results have direct implications for KV-cache memory allocation strategies studied in this series. Workloads with predictable, stable prefixes can allocate more aggressively to cache storage, while dynamic workloads benefit more from the compression techniques examined in our earlier KV-cache compression benchmarks article.
4.2 Cross-Provider Caching Strategy Comparison (RQ2) #
The evaluation by Gupta et al. (2026) provides the most comprehensive cross-provider analysis of prompt caching strategies, testing three approaches across OpenAI, Anthropic, and Google on the DeepResearch Bench with over 500 agent sessions ([2][3]).

The results demonstrate that all three providers deliver substantial cost savings, but with significant variation. Google achieves the highest overall cost reduction (up to 80% with full context caching), followed by Anthropic (61% with the exclude-dynamic-tools strategy) and OpenAI (45% with exclude-dynamic-tools). Critically, the exclude-dynamic-tools strategy outperforms naive full-context caching for OpenAI and Anthropic, while Google’s broader matching capabilities make full-context caching competitive.

TTFT improvements scale approximately linearly with prompt size across all providers, reaching 28-31% for 50,000-token system prompts. This linear relationship reflects the direct proportionality between cached prefix length and skipped prefill computation. An ablation across tool call counts (3-50) confirms that caching benefits persist regardless of conversation complexity, with per-turn savings remaining consistent after the initial cache population.
A key finding is that naive full-context caching can paradoxically increase latency. When dynamic tool results are included in the cached context, frequent cache invalidation forces recomputation of the entire prefix, sometimes resulting in higher TTFT than the no-cache baseline. Strategic placement of dynamic content at the end of prompts, outside cache block boundaries, eliminates this pathology.
4.3 KV-Cache Compression for Reusable Storage (RQ1/RQ3 Bridge) #
The practical value of prompt caching depends not only on hit rates but on the feasibility of storing reusable caches efficiently. KVTC (KV Cache Transform Coding) addresses this by applying classical media compression techniques (PCA decorrelation, adaptive quantization, entropy coding) to achieve up to 20x compression of KV-caches while maintaining accuracy, and up to 40x for specific use cases ([8][9]).

At 20x compression, both MMLU and GSM8K accuracy remain above 95% of baseline performance, confirming that compressed caches can be reused without meaningful quality degradation. This is directly relevant to our series: efficient compression enables more cached prefixes to reside in limited GPU/HBM memory simultaneously, improving effective hit rates for memory-constrained deployments.
4.4 Advanced Eviction Policies (RQ3) #
Standard LRU eviction treats all cached entries equally, evicting the least recently accessed regardless of the future reuse likelihood or the latency impact of a cache miss. Two 2025 innovations address this limitation.
Tail-Optimized LRU introduces a two-line modification to standard LRU that reallocates cache capacity to prioritize high-latency conversations. By evicting entries unlikely to affect future turns, it achieves up to 27.5% reduction in P90 tail TTFT and 38.9% decrease in SLO violations on the WildChat dataset ([4][5]). The theoretical contribution is equally significant: the authors provide the first formal proof of LRU optimality under a stochastic conversation model, establishing a principled foundation for cache eviction in LLM serving.
Learned Prefix Caching (LPC) takes a data-driven approach, leveraging conversational content analysis to predict which conversations are likely to continue and therefore benefit from retained caches ([9][10]). LPC outperforms both standard LRU and Tail-Optimized LRU across all metrics, achieving approximately 32% P90 tail latency reduction and 42% SLO violation reduction.

KVFlow extends eviction policy optimization to multi-agent workflows specifically, modeling agent execution schedules as Agent Step Graphs. Each agent receives a steps-to-execution value estimating its temporal proximity to future activation, guiding fine-grained eviction at the KV node level rather than the conversation level ([7][8]). Combined with fully overlapped KV prefetching that proactively loads required tensors from CPU to GPU in background threads, KVFlow eliminates cache miss stalls during generation for predictable multi-agent workflows.
flowchart TD
subgraph Eviction_Policies
LRU[Standard LRU] --> P1[Evicts least recently used]
TOLRU[Tail-Optimized LRU] --> P2[Prioritizes high-latency conversations]
LPC[Learned Prefix Caching] --> P3[Predicts conversation continuation]
KVF[KVFlow] --> P4[Models agent execution graph]
end
P1 --> R1[Baseline performance]
P2 --> R2[27.5% P90 reduction]
P3 --> R3[32% P90 reduction]
P4 --> R4[Workflow-aware - no miss stalls]
R1 --> D[Deployment Decision]
R2 --> D
R3 --> D
R4 --> D
D --> C1{Workload Type}
C1 -->|Chat/General| TOLRU
C1 -->|High-Volume| LPC
C1 -->|Multi-Agent| KVF
4.5 Implications for Memory Architecture Design #
The findings from all three research questions converge on a key architectural insight: effective prompt caching requires co-design of prompt structure, cache storage, and eviction policy. A system that achieves 92% hit rates on batch workloads but uses LRU eviction may still exhibit poor tail latency for the 8% of misses. Conversely, a sophisticated eviction policy cannot compensate for prompt designs that prevent prefix matching (e.g., inserting dynamic RAG context before the system prompt).
For the AI Memory series, these results establish that prompt caching efficiency is the critical mediator between raw KV-cache memory capacity (examined in previous articles) and practical inference cost. The optimization techniques explored in upcoming articles on paged attention, grouped-query attention, and speculative decoding must be evaluated not just for their memory savings but for their compatibility with prefix caching reuse patterns.
5. Conclusion #
RQ1 Finding: Cache hit rates vary from 45% (RAG pipelines) to 92% (batch processing) across workload categories, with a median of 72% for multi-turn chat. Measured by cache hit rate across five workload types. This matters for our series because memory allocation strategies must account for workload-specific reuse patterns rather than assuming uniform cache utility.
RQ2 Finding: Strategic caching (excluding dynamic tool results) outperforms naive full-context caching, achieving 41-80% cost reduction and 13-31% TTFT improvement across providers. Measured by API cost reduction (%) and TTFT improvement (%) across OpenAI, Anthropic, and Google. This matters for our series because it demonstrates that prompt engineering and cache engineering are inseparable concerns in memory-efficient inference.
RQ3 Finding: Learned eviction policies (LPC) achieve 32% P90 tail latency reduction and 42% SLO violation reduction over standard LRU, while Tail-Optimized LRU achieves 27.5% and 38.9% respectively. Measured by P90/P95 tail TTFT reduction and SLO violation rate on the WildChat dataset. This matters for our series because cache eviction directly determines the effective memory utilization studied throughout our KV-cache optimization investigations.
The next article in this series examines multi-turn memory degradation, investigating how conversation history accumulation affects model performance and exploring strategies to maintain quality as context windows fill with cached historical content. The prompt caching efficiency findings presented here provide the baseline against which multi-turn memory management strategies must be evaluated.
References (10) #
- Stabilarity Research Hub. Prompt Caching Efficiency — Measuring Reuse Across Real Workloads. doi.org. dti
- Stabilarity Research Hub. Cross-Architecture Memory Comparison — Llama vs Mistral vs Gemma vs Qwen. ib
- (2026). [2601.06007] Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks. doi.org. dti
- (2025). [2506.14852] Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents. doi.org. dti
- (2025). [2510.15152] Tail-Optimized Caching for LLM Inference. doi.org. dti
- (2025). [2510.09665] LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. doi.org. dti
- (2026). [2602.22812] Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching. doi.org. dti
- (2025). [2507.07400] KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. doi.org. dti
- (2025). [2511.01815] KV Cache Transform Coding for Compact Storage in LLM Inference. doi.org. dti
- (2025). NeurIPS Poster Learned Prefix Caching for Efficient LLM Inference. neurips.cc. rtia