Speculative Decoding and Cache Reuse
DOI: 10.5281/zenodo.19210815[1]
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 0% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 94% | ✓ | ≥80% from verified, high-quality sources |
| [a] | DOI | 6% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 100% | ✓ | ≥80% have metadata indexed |
| [l] | Academic | 83% | ✓ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 94% | ✓ | ≥80% are freely accessible |
| [r] | References | 18 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,662 | ✓ | Minimum 2,000 words for a full research article. Current: 2,662 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19210815 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 13% | ✗ | ≥80% of references from 2025–2026. Current: 13% |
| [c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2). Current: 4 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
Speculative decoding has emerged as a transformative inference optimization that breaks the sequential bottleneck of autoregressive generation by drafting multiple tokens in parallel and verifying them in a single forward pass. This article examines three research questions at the intersection of speculative decoding and KV cache management: how draft-then-verify architectures interact with cache memory hierarchies, what acceptance rate thresholds determine practical speedup boundaries, and how cache reuse strategies amplify speculative decoding gains in multi-agent and multi-turn settings. We analyze eight speculative decoding frameworks — including EAGLE-3, Medusa, QuantSpec, PEARL, and RelayCaching — quantifying their acceptance rates, memory footprints, and throughput characteristics across model sizes from 7B to 70B parameters. Our findings show that feature-aware draft models achieve 3.1x speedup with 82% acceptance rates, that hierarchical quantized KV caches reduce memory requirements by 60% while maintaining competitive acceptance, and that systematic cache relay between collaborative LLM agents yields over 80% KV cache reuse with 4.7x reduction in time-to-first-token. These results establish speculative decoding as a memory-system co-design problem rather than a purely algorithmic optimization.
1. Introduction #
In the previous article, we demonstrated that grouped-query attention (GQA) achieves substantial KV cache compression by sharing key-value heads across query groups, reducing memory bandwidth requirements by up to 8x compared to multi-head attention while preserving generation quality (Ivchenko, 2026[2]). This architectural optimization addresses the storage dimension of the KV cache problem — but the computational dimension remains: autoregressive decoding generates tokens one at a time, leaving GPU compute units severely underutilized during the decode phase.
Speculative decoding attacks this inefficiency by introducing a draft-then-verify paradigm. A lightweight draft model proposes multiple candidate tokens, and the full target model verifies them in a single batched forward pass. When the draft model’s predictions align with the target model’s distribution, multiple tokens are accepted simultaneously, achieving wallclock speedups of 2-3x without any change to output quality (Leviathan et al., 2023[3]). The critical insight is that verification is parallelizable while generation is not — and the KV cache sits at the center of this interaction, serving both as the memory substrate for verification and as a potential source of reuse across speculative rounds.
The relationship between speculative decoding and KV cache management creates three fundamental tensions. First, draft models must maintain their own KV caches alongside the target model’s cache, increasing total memory pressure. Second, rejected draft tokens waste cache entries that must be rolled back. Third, in multi-turn and multi-agent settings, the prefix-sharing properties of speculative workloads create opportunities for cache reuse that standard serving systems fail to exploit.
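The accept/rollback interaction described above can be made concrete with a toy sketch. Everything here is illustrative: `draft_next` and `target_accept` are hypothetical stand-ins for real model forward passes, the list `cache` stands in for committed KV entries, and real verification happens in one batched target pass rather than a token-by-token loop.

```python
import random

def speculative_step(draft_next, target_accept, cache, gamma=5):
    """One draft-then-verify round over a toy token cache.

    draft_next(tokens)        -> next candidate token (stand-in for a draft model)
    target_accept(cache, tok) -> would the target model accept this token?
    cache                     -> list standing in for committed KV entries
    """
    drafts = []
    for _ in range(gamma):                  # draft gamma candidate tokens
        drafts.append(draft_next(cache + drafts))
    accepted = 0
    for tok in drafts:                      # verification (batched in reality)
        if target_accept(cache, tok):
            cache.append(tok)               # commit the accepted cache entry
            accepted += 1
        else:
            break                           # reject: later drafts rolled back
    return accepted

random.seed(0)
cache = [1, 2, 3]
n = speculative_step(draft_next=lambda toks: toks[-1] + 1,
                     target_accept=lambda c, t: random.random() < 0.9,
                     cache=cache)
print(n, cache)
```

The key point the sketch preserves is the third tension above: a single rejection invalidates every later draft entry, so the cache write for each rejected suffix is pure rollback cost.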
Research Questions #
RQ1: How do different speculative decoding architectures interact with KV cache memory hierarchies, and what are the memory overhead tradeoffs between independent-draft, self-speculative, and feature-aware approaches?
RQ2: What acceptance rate thresholds and draft length configurations maximize end-to-end throughput across model sizes, and how do these interact with cache quantization strategies?
RQ3: How can systematic KV cache reuse between speculative decoding rounds and across collaborative LLM agents amplify speedup beyond single-request optimization?
These questions matter for our AI Memory series because speculative decoding represents the most significant deployment-time interaction between compute scheduling and cache management. Understanding these dynamics is essential for the infrastructure-level cache design topics we address in subsequent articles.
2. Existing Approaches (2026 State of the Art) #
The speculative decoding landscape in 2026 spans three architectural families, each with distinct KV cache implications.
Independent Draft Models. The original speculative decoding formulation by Leviathan et al. (2023)[3] and Chen et al. (2023)[4] uses a separate smaller model as the draft. This approach maintains two entirely separate KV caches — one for the draft model and one for the target — roughly doubling memory requirements during inference. The draft model operates independently, seeing only the token sequence without access to the target model’s internal representations. Acceptance rates typically range from 50-65% depending on task difficulty and model pair alignment.
Self-Speculative Methods. Medusa (Cai et al., 2024[5]) and related approaches eliminate the separate draft model by adding lightweight prediction heads to the target model itself. Each Medusa head predicts tokens at different future positions, constructing a tree of candidate continuations verified through tree attention. This halves the KV cache overhead since no separate draft model cache exists, but the tree structure creates branching cache entries that must be managed carefully. QuantSpec (Hu et al., 2025[6]) takes the self-speculative concept further by using a 4-bit quantized version of the target model as the draft, with a hierarchical quantized KV cache that shares the architectural structure while dramatically reducing per-entry memory.
Feature-Aware Draft Models. EAGLE (Li et al., 2024[7]) introduced a paradigm shift by allowing the draft model to access the target model’s hidden representations. Rather than predicting tokens from the sequence alone, EAGLE’s draft model takes the target model’s feature vectors as input, achieving significantly higher acceptance rates. EAGLE-2 added dynamic draft tree construction based on confidence scores, and EAGLE-3 (Li et al., 2025[8]) further refined training through top-k KL divergence loss. This family achieves the highest acceptance rates (78-82%) but requires architectural coupling between draft and target models, creating KV cache dependencies across the two models.
Parallel Verification. PEARL (Li et al., 2025[9]) introduced pre-verify and post-verify stages that enable adaptive draft length selection. By performing preliminary verification before full target model forward passes, PEARL avoids wasting compute on obviously wrong drafts while extending successful speculation chains. GliDe (Du et al., 2024[10]) enables draft models to perform cross-attention on the target model’s KV cache directly, further blurring the boundary between draft and target cache management.
Variational Training. The most recent advance, Variational Speculative Decoding (VSD) (Zhang et al., 2026[11]), reframes draft model training from token-level likelihood maximization to sequence-level acceptance rate optimization. Published at ICLR 2026, VSD treats the acceptance rate as the objective function directly, producing draft models that better align with the verification criterion rather than merely predicting likely tokens.
flowchart TD
SD[Speculative Decoding] --> IND[Independent Draft]
SD --> SELF[Self-Speculative]
SD --> FA[Feature-Aware]
SD --> PAR[Parallel Verification]
IND --> IND_L[2x KV Cache Overhead]
SELF --> SELF_L[Tree Cache Management]
FA --> FA_L[Cross-Model Cache Coupling]
PAR --> PAR_L[Adaptive Cache Allocation]
IND_L --> ACC1[Acceptance: 50-65%]
SELF_L --> ACC2[Acceptance: 60-76%]
FA_L --> ACC3[Acceptance: 72-82%]
PAR_L --> ACC4[Acceptance: 78-80%]
3. Quality Metrics and Evaluation Framework #
To rigorously evaluate speculative decoding’s interaction with KV cache systems, we define metrics aligned with each research question.
RQ1 — Memory Overhead Ratio (MOR). We define MOR as the ratio of total KV cache memory consumed by the speculative system (draft + target + overhead) to the memory consumed by standard autoregressive decoding. An MOR of 1.0 means no additional memory; values above 1.0 indicate overhead. This metric captures the fundamental memory tradeoff: faster inference through speculation versus increased cache pressure. Prior work (Hu et al., 2025[6]) reports MOR values ranging from 0.4 for QuantSpec to 2.0 for independent draft models.
RQ2 — Effective Tokens Per Second (ETPS). Rather than measuring raw generation speed, ETPS accounts for both accepted and rejected tokens: ETPS = accepted_tokens / wallclock_time. This metric, combined with acceptance rate alpha and draft length gamma, follows the analytical speedup formula: speedup = (1 - alpha^(gamma+1)) / ((1 - alpha)(gamma * c + 1)), where c is the cost ratio of draft to target forward passes (Leviathan et al., 2023[3]). The threshold for practical deployment is ETPS improvement greater than 1.5x over baseline.
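The analytical speedup formula is easy to sanity-check numerically. The operating point below is illustrative: alpha = 0.82 and gamma = 5 mirror the EAGLE-3 figures discussed in this article, while c = 0.05 is an assumed (very cheap) draft-pass cost, not a measured value.

```python
def speculative_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wallclock speedup (Leviathan et al., 2023).

    alpha: per-token acceptance rate
    gamma: draft length (tokens proposed per round)
    c:     cost of one draft pass relative to one target pass
    """
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# An EAGLE-3-like operating point with a cheap draft pass.
print(round(speculative_speedup(0.82, 5, 0.05), 2))
```

With these parameters the formula yields roughly 3.1x, consistent with the EAGLE-3 speedup reported elsewhere in this article.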
RQ3 — Cache Reuse Rate (CRR). For multi-agent and multi-turn scenarios, CRR measures the fraction of KV cache entries computed in a prior round or by a prior agent that are successfully reused without recomputation. RelayCaching (Chen et al., 2026[12]) demonstrates CRR values exceeding 80% in collaborative LLM pipelines, with direct correlation to time-to-first-token (TTFT) reduction.
| RQ | Metric | Source | Threshold |
|---|---|---|---|
| RQ1 | Memory Overhead Ratio (MOR) | Hu et al., 2025 | MOR less than 1.5 for practical deployment |
| RQ2 | Effective Tokens/Second (ETPS) | Leviathan et al., 2023 | greater than 1.5x improvement over baseline |
| RQ3 | Cache Reuse Rate (CRR) | Chen et al., 2026 | greater than 70% reuse for pipeline efficiency |
graph LR
RQ1 --> MOR[Memory Overhead Ratio]
RQ2 --> ETPS[Effective Tokens/Sec]
RQ3 --> CRR[Cache Reuse Rate]
MOR --> E1[Draft vs Target Cache Size]
ETPS --> E2[Acceptance Rate x Draft Length]
CRR --> E3[Prefix Match + Relay Hit Rate]
4. Application to AI Memory Systems #
4.1 Memory Hierarchy Interactions (RQ1) #
The KV cache memory implications of speculative decoding vary dramatically across architectural families. Figure 1 illustrates the memory footprint comparison across model sizes.
[Figure 1: KV cache memory footprint by speculative decoding method and model size]
Independent draft models impose the highest memory overhead because they maintain a completely separate KV cache. For a 70B target model at 128K context length, the baseline KV cache consumes approximately 140 GB. Adding a 7B draft model increases total cache memory to approximately 154 GB (MOR = 1.10), which seems modest — but the actual overhead is higher when accounting for the separate memory allocation patterns that prevent the draft cache from sharing HBM bandwidth with the target cache.
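The MOR arithmetic above can be sketched directly. The `kv_cache_gb` helper is the standard 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes formula; the exact gigabyte figure depends on the layer/head configuration, so the 140 GB, 14 GB, and 56 GB values from the text are taken as given rather than derived.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Illustrative config: 80 layers, 8 KV heads (GQA), head_dim 128, 128K tokens, fp16.
print(round(kv_cache_gb(80, 8, 128, 128 * 1024), 1))

# Memory Overhead Ratio from the figures in the text (GB, taken as given):
target_gb, draft_gb = 140.0, 14.0
mor_independent = (target_gb + draft_gb) / target_gb   # separate 7B draft cache
mor_quantspec = 56.0 / target_gb                       # 4-bit hierarchical cache
print(round(mor_independent, 2), round(mor_quantspec, 2))
```

The independent-draft case lands at MOR = 1.10 and the quantized self-speculative case at MOR = 0.40, matching the ratios discussed in this section.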
Self-speculative approaches like QuantSpec (Hu et al., 2025[6]) achieve dramatic memory reductions by using 4-bit quantized KV caches for the draft path. At 70B scale, QuantSpec reduces total cache memory to 56 GB — a 60% reduction from baseline — while maintaining a 76% acceptance rate. The hierarchical quantization scheme applies different precision levels to different attention layers based on their sensitivity to quantization error, preserving the most information-dense cache entries at higher precision.
Feature-aware methods like EAGLE-3 (Li et al., 2025[8]) create an interesting cache coupling pattern. The draft model reads from the target model’s KV cache through cross-attention or feature injection, meaning the target cache must remain resident and accessible during draft generation. This shared-read pattern is more memory-efficient than maintaining two independent caches (MOR approximately 1.15) but introduces synchronization requirements between draft and target cache management.
4.2 Acceptance Rate and Draft Length Optimization (RQ2) #
Figure 2 shows the relationship between acceptance rates and inference speedup across speculative decoding methods.
[Figure 2: Acceptance rate vs. inference speedup across speculative decoding methods]
The data reveals a strong correlation between acceptance rate and speedup, but with an important nonlinearity. Methods achieving acceptance rates above 75% — EAGLE-3 at 82%, PEARL at 80% — deliver speedups exceeding 2.9x, while methods below 65% (independent draft at 55%, Medusa at 62%) plateau around 2x. This nonlinearity arises because longer draft sequences become viable only at high acceptance rates; at low acceptance rates, longer drafts simply waste more rejected tokens.
Figure 3 illustrates this draft length tradeoff directly.
[Figure 3: Throughput vs. draft length by method]
The optimal draft length varies by method: EAGLE-3 peaks at 5 tokens per draft round, PEARL extends slightly further to 5-6 tokens due to its pre-verification filtering, and Medusa’s tree-structured drafting falls off sharply beyond 4 tokens. The throughput curves demonstrate that draft length is not a free parameter — each additional draft token increases the probability of at least one rejection, triggering cache rollback for all subsequent draft entries. The optimal operating point balances the cost of draft generation (proportional to gamma) against the expected number of accepted tokens (dependent on alpha^gamma).
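Sweeping the analytical speedup formula over draft lengths reproduces this tradeoff qualitatively. The cost ratio c = 0.15 below is an assumption chosen for illustration, not a measured value; with it, the optimal draft length grows from roughly 2 tokens at 55% acceptance to 5 tokens at 82%, mirroring the pattern described above.

```python
def speedup(alpha, gamma, c=0.15):
    """Analytical speedup for acceptance rate alpha and draft length gamma."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

def best_gamma(alpha, c=0.15, max_gamma=12):
    """Draft length maximizing expected speedup for a given acceptance rate."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))

# Acceptance rates roughly matching independent draft, Medusa, and EAGLE-3.
for alpha in (0.55, 0.62, 0.82):
    print(alpha, best_gamma(alpha))
```

The sweep makes the nonlinearity concrete: at low alpha the expected accepted prefix saturates quickly, so extra draft tokens only add rejected-entry rollback cost.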
VSD (Zhang et al., 2026[11]) addresses this optimization directly by training draft models to maximize sequence-level acceptance rather than per-token likelihood. On the DeepSeek-R1-Distill-LLaMA-8B model, VSD improves acceptance rates by 4-8 percentage points over EAGLE-3’s default training, translating to 0.3-0.5x additional speedup.
4.3 Cache Reuse in Multi-Agent Pipelines (RQ3) #
The most significant recent development for our AI Memory series is the extension of cache reuse beyond single-request optimization. RelayCaching (Chen et al., 2026[12]) demonstrates that in collaborative LLM pipelines — where multiple agents process overlapping context — KV cache entries from upstream agents can be relayed to downstream agents, eliminating redundant prefill computation.
Figure 4 shows cache reuse rates across task types.
[Figure 4: Cache reuse rate by task type, standard pipeline vs. RelayCaching]
Standard pipelines achieve only 12-25% cache reuse because each agent in the chain recomputes the full context prefix. RelayCaching restructures the pipeline to pass decode-phase KV cache entries between agents, achieving 78-88% reuse rates. The highest reuse occurs in multi-turn dialogue (88%) where conversational context is heavily shared, while code generation shows lower reuse (78%) due to more diverse token distributions between generation steps.
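At its core, relaying cache entries between agents reduces to longest-shared-prefix matching between an upstream agent's committed cache and a downstream agent's input. The sketch below is a simplification: real systems match at block granularity and must handle positional encodings, and `relay_reuse` is a hypothetical helper, not RelayCaching's actual API.

```python
def relay_reuse(upstream_tokens, downstream_tokens):
    """Count KV entries relayable from an upstream agent's cache.

    Entries in the longest shared prefix are reused; the remainder of the
    downstream input must be prefilled. Returns (reused, total, crr).
    """
    reused = 0
    for a, b in zip(upstream_tokens, downstream_tokens):
        if a != b:
            break
        reused += 1
    total = len(downstream_tokens)
    return reused, total, reused / total

# Toy pipeline: agent 2 extends agent 1's context with its own instruction.
agent1 = list(range(100))                        # 100 cached tokens from agent 1
agent2 = list(range(100)) + [900, 901, 902, 903, 904]
r, t, crr = relay_reuse(agent1, agent2)
print(r, t, round(crr, 2))
```

In this toy example 100 of 105 downstream entries are relayed (CRR = 0.95); the task-type differences reported above come down to how much of each downstream prompt actually shares such a prefix.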
KVLink (Cai et al., 2025[13]) complements RelayCaching by enabling KV cache reuse across non-contiguous document segments. In retrieval-augmented generation workloads where multiple retrieved passages share partial overlaps, KVLink identifies reusable cache segments through positional encoding alignment, reducing redundant computation by 35-50%.
FreeKV (Ma et al., 2025[14]) introduces speculative retrieval within the KV cache itself — predicting which cache entries will be needed in the next decoding step and prefetching them from slower memory tiers. This creates a nested speculation pattern: speculative decoding predicts future tokens while speculative retrieval predicts future cache accesses. The combination achieves 15-20% additional throughput improvement over speculative decoding alone.
The chunk-level caching study by Yang et al. (2026)[15] provides important cautionary evidence: independently computed chunk caches miss cross-chunk attention dependencies, and naive chunk reuse can degrade generation quality by 3-8% on long-context benchmarks. Effective cache reuse requires preserving the attention state relationships between cached segments, not merely their key-value tensors.
flowchart LR
subgraph Single_Request
D[Draft Model] --> V[Verify]
V --> A{Accept?}
A -->|Yes| C[Commit Cache]
A -->|No| R[Rollback Cache]
end
subgraph Multi_Agent
A1[Agent 1 Cache] --> RELAY[Cache Relay]
RELAY --> A2[Agent 2]
A2 --> RELAY2[Cache Relay]
RELAY2 --> A3[Agent 3]
end
Single_Request --> OPT[Per-Request Optimization]
Multi_Agent --> SYS[System-Level Optimization]
4.4 Infrastructure Implications #
The convergence of speculative decoding and cache reuse has direct implications for the infrastructure topics in our AI Memory series. Efficient remote prefix fetching (Liu et al., 2026[16]) demonstrates that KV caches can be transferred between GPU nodes via RDMA with latency low enough to support cross-machine speculative decoding. This opens the possibility of disaggregated speculative inference where draft generation happens on different hardware than verification — a pattern that fundamentally changes how KV cache memory is provisioned across a serving cluster.
The EntropyCache framework (Wu et al., 2026[17]), while developed for diffusion language models, introduces the principle of entropy-guided cache reuse: tokens with low decoded entropy can safely reuse cached KV states from previous steps, while high-entropy tokens require fresh computation. This selectivity principle generalizes beyond diffusion models to any setting where cache reuse quality can be estimated before committing to recomputation.
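The selectivity principle can be sketched as a simple entropy gate over the next-token distribution. The 0.5-nat threshold and the `reuse_cached_kv` helper are illustrative assumptions, not EntropyCache's actual interface.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reuse_cached_kv(probs, threshold=0.5):
    """Entropy-gated reuse rule (sketch): low-entropy tokens reuse the
    cached KV state; high-entropy tokens trigger fresh computation."""
    return token_entropy(probs) < threshold

confident = [0.97, 0.01, 0.01, 0.01]   # entropy ~0.17 nats -> reuse
uncertain = [0.25, 0.25, 0.25, 0.25]   # entropy ~1.39 nats -> recompute
print(reuse_cached_kv(confident), reuse_cached_kv(uncertain))
```

The gate captures the general idea: an inexpensive, per-token estimate of reuse quality decides between cache reuse and recomputation before any recomputation is committed.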
5. Conclusion #
RQ1 Finding: Speculative decoding architectures impose memory overhead ratios (MOR) ranging from 0.40 for QuantSpec’s 4-bit hierarchical KV cache to 2.0 for independent draft models. Measured by MOR = total_cache_memory / baseline_cache_memory, the optimal regime is MOR between 0.4 and 1.15, achieved by self-speculative and feature-aware methods that share or compress the KV cache rather than duplicating it. This matters for our series because it demonstrates that cache compression (covered in Article 6) and speculative decoding are complementary optimizations — combining GQA architecture (Article 12) with quantized speculative caching yields multiplicative memory savings.
RQ2 Finding: Feature-aware draft models (EAGLE-3) achieve the highest practical throughput at 82% acceptance rate and 3.1x speedup, with optimal draft length of 5 tokens. Measured by Effective Tokens Per Second, the critical threshold is alpha greater than 0.75, below which speedup gains plateau at approximately 2x. VSD’s sequence-level training objective further improves acceptance by 4-8 percentage points. This matters for our series because the optimal draft length and acceptance rate are fundamentally constrained by KV cache rollback costs — each rejected draft token wastes a cache write that must be unwound, making cache-efficient speculation a memory management problem.
RQ3 Finding: Systematic KV cache reuse through RelayCaching achieves over 80% cache reuse rate across collaborative LLM tasks, reducing time-to-first-token by up to 4.7x. Measured by Cache Reuse Rate = reused_entries / total_entries, the practical threshold for pipeline efficiency is CRR greater than 70%. However, naive chunk-level reuse degrades quality by 3-8% when cross-chunk attention dependencies are not preserved. This matters for our series because cache reuse transforms speculative decoding from a single-request optimization into a system-level memory architecture concern, directly connecting to our upcoming articles on distributed KV cache (Article 19) and cache-aware scheduling (Article 21).
The next article in our series examines semantic prompt caching — extending cache reuse beyond exact prefix matching to semantically similar but lexically different prompts, where the challenge shifts from memory management to representation similarity.
References (17) #
- Stabilarity Research Hub. Speculative Decoding and Cache Reuse. DOI: 10.5281/zenodo.19210815.
- Ivchenko (2026). Grouped-Query Attention — Cache-Efficient Architecture Design. Stabilarity Research Hub.
- Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192.
- Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318.
- Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774.
- Hu et al. (2025). QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache. arXiv:2502.10424.
- Li et al. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv:2401.15077.
- Li et al. (2025). EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv:2503.01840.
- Li et al. (2025). PEARL: Parallel Speculative Decoding with Adaptive Draft Length. arXiv preprint.
- Du et al. (2024). GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding. arXiv:2402.02082.
- Zhang et al. (2026). Variational Speculative Decoding. ICLR 2026.
- Chen et al. (2026). RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse. arXiv:2603.13289.
- Cai et al. (2025). KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. arXiv:2502.16002.
- Ma et al. (2025). FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference. arXiv:2505.13109.
- Yang et al. (2026). An Experimental Study of KV Cache Reuse Strategies in Chunk-Level Caching Systems. arXiv:2603.20218.
- Liu et al. (2026). Efficient Remote Prefix Fetching with GPU-native Media ASICs. arXiv:2602.09725.
- Wu et al. (2026). EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models. arXiv:2603.18489.