Sliding Window and Compressive Caching for Infinite Context
DOI: 10.5281/zenodo.19299498[1] · View on Zenodo (CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 26% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 35% | ○ | ≥80% from verified, high-quality sources |
| [a] | DOI | 83% | ✓ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 26% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 35% | ○ | ≥80% have metadata indexed |
| [l] | Academic | 22% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 30% | ○ | ≥80% are freely accessible |
| [r] | References | 23 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,250 | ✓ | Minimum 2,000 words for a full research article. Current: 2,250 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19299498 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 80% | ✓ | ≥80% of references from 2025–2026. Current: 80% |
| [c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2). Current: 4 |
| [g] | Code | ✓ | ✓ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
As large language models (LLMs) scale to context windows exceeding one million tokens, the key-value (KV) cache grows linearly and becomes the dominant memory bottleneck during autoregressive inference. Sliding window attention and compressive caching represent two complementary families of techniques that bound memory usage while preserving access to long-range context. This article investigates three research questions: (1) how do sliding window strategies trade off memory efficiency against effective context coverage, (2) what compressive caching architectures best preserve long-range information within fixed memory budgets, and (3) what combined sliding-plus-compressive designs achieve the optimal perplexity-memory Pareto frontier for production inference. Drawing on 2025-2026 research including StreamingLLM, Cascading KV Cache, CAKE, Infini-attention, EdgeInfinite, KVTC, and SAGE-KV, we find that pure sliding windows achieve constant O(W) memory but lose all information beyond the window boundary, while compressive methods such as Infini-attention and Cascading KV Cache retain 15-40% effective coverage of distant tokens at under 1.5 GB for million-token sequences. Combined sliding-compressive architectures reduce perplexity degradation to under 17% at 6.25% cache budget, compared to 208% for naive sliding windows. We present original analysis of memory scaling, perplexity degradation curves, throughput benchmarks, and effective context coverage across eight methods, with all code and data available in our public repository.
1. Introduction #
In our previous article[2], we established that cross-layer KV-cache sharing exploits depth-wise redundancy to reduce cache memory by 50-75% with minimal quality loss. That work addressed the depth dimension of the KV cache; this article turns to the sequence dimension, investigating how sliding window attention and compressive caching strategies manage the linear growth of cache memory as context lengths extend toward infinity.
The problem is fundamental: for a 7B-parameter model using FP16 precision, the KV cache consumes approximately 1 GB per 8,192 tokens across all layers (Liu et al., 2025[3]) [1][3]. At 1 million tokens, this reaches 128 GB — far exceeding the memory of any single GPU. Real-world applications including document analysis, code generation, and multi-turn dialogue increasingly demand such long contexts (Ye et al., 2025[4]) [2][4], making memory-bounded inference essential.
Research Questions #
RQ1: How do sliding window attention strategies trade off memory efficiency against effective context coverage, and what is the information loss profile beyond the window boundary?
RQ2: What compressive caching architectures best preserve long-range information within fixed memory budgets, and how do they compare in perplexity retention?
RQ3: What combined sliding-plus-compressive designs achieve the optimal perplexity-memory Pareto frontier for production deployment?
These questions matter for our AI Memory series because sliding windows and compressive caching are foundational primitives upon which all modern long-context serving systems are built. Understanding their individual and combined characteristics is prerequisite to evaluating the disaggregated prefill-decode architectures and distributed KV-cache systems we will examine in subsequent articles.
2. Existing Approaches (2026 State of the Art) #
The landscape of bounded-memory inference can be organized into three families: static windowing, importance-based eviction, and compressive memory systems.
Static Sliding Window Attention. Introduced architecturally in Mistral 7B and formalized through Longformer’s local attention patterns, sliding window attention restricts each token’s attention to the most recent W tokens. Mistral’s implementation uses a rolling buffer KV cache of fixed size W, achieving constant memory regardless of sequence length. The Gemma 2 architecture (Gemma Team, 2025[5]) [3][5] alternates sliding window layers with full-attention layers, providing a hybrid approach. However, pure sliding windows discard all information beyond position t-W, creating an absolute information boundary.
Attention Sink + Window (StreamingLLM). Xiao et al. (2024[6]) [4][6] discovered that the first few tokens in any sequence receive disproportionately high attention scores regardless of their semantic content — a phenomenon termed “attention sinks.” StreamingLLM retains these initial sink tokens alongside a sliding window, enabling stable perplexity during infinite-length streaming. This simple modification prevents the catastrophic perplexity explosion observed when naively evicting initial tokens.
Importance-Based Eviction. H2O (Heavy-Hitter Oracle) (Zhang et al., 2023[7]) [5][7] dynamically evicts tokens based on cumulative attention scores, retaining “heavy hitter” tokens that receive high attention across decoding steps. SAGE-KV (Wang et al., 2025[8]) [6][8] refines this with self-attention-guided eviction that achieves 4x memory efficiency over StreamingLLM while improving accuracy. CAKE (Qin et al., 2025[9]) [7][9] introduces cascading eviction with layer-specific preferences, recognizing that different transformer layers exhibit different attention sparsity patterns.
Compressive Memory Systems. Elastic Memory (Song et al., 2026[10]) [8][10] introduces a compressive recurrent memory architecture grounded in the HiPPO framework for online function approximation, treating historical context as samples from continuous signals and compressing them into a fixed-size memory state via polynomial sampling. This represents a principled advance over earlier associative memory approaches such as Infini-attention (Munkhdalai et al., 2024), with Elastic Memory outperforming prior baselines across long-context (32K+) datasets with 16× greater memory efficiency. EdgeInfinite (Chen et al., 2025[11]) [9][11] extends compressive memory for edge deployment, achieving comparable performance on long-context benchmarks while fitting within mobile device memory constraints.
Cascading Sub-Caches. Hierarchical KV cache partitioning — systematized by Allam et al. in their 2026 survey on KV cache optimization strategies (Allam et al., 2026[12]) [10][12] — partitions the fixed-size cache into tiered sub-buffers where each subsequent tier accepts tokens at exponentially decreasing rates from the previous tier. Originally proposed as the Cascading KV Cache (Willette et al., 2024), this design creates a multi-resolution temporal memory: recent tokens are stored at full fidelity, while older tokens are represented by their most attention-significant exemplars.
Transform Coding. KVTC (2026[13]) [11][13], published at ICLR 2026, applies classical media compression principles to KV caches, achieving up to 20x compression with less than 1 point perplexity degradation. Unlike eviction methods that discard tokens entirely, KVTC preserves all tokens in compressed form.
flowchart TD
A[Bounded-Memory KV Cache Strategies] --> B[Static Windowing]
A --> C[Importance-Based Eviction]
A --> D[Compressive Memory]
B --> B1[Sliding Window
Mistral, Gemma 2]
B --> B2[Sink + Window
StreamingLLM]
C --> C1[Score-Based
H2O, SAGE-KV]
C --> C2[Layer-Adaptive
CAKE]
C --> C3[Semantic-Preserving
ChunkKV]
D --> D1[Associative Memory
Infini-attention]
D --> D2[Cascading Buffers
Cascading KV]
D --> D3[Transform Coding
KVTC]
3. Quality Metrics and Evaluation Framework #
We evaluate sliding window and compressive caching methods across three dimensions aligned with our research questions.
RQ1 Metric: Effective Context Coverage (ECC). We define ECC as the fraction of past tokens that still influence the model’s predictions at any given sequence position. For a sliding window of size W at position t, ECC = min(1, W/t). For compressive methods, ECC accounts for the information retained in compressed representations, measured through probing tasks that test retrieval of specific facts at varying distances from the current position. Prior work on passkey retrieval (Lee et al., 2025[14]) [12][14] established standardized evaluation protocols: InfiniteHiP demonstrated that hierarchical pruning with offloading can extend effective context to 3 million tokens on a single GPU with 97.9% passkey accuracy.
RQ2 Metric: Perplexity Retention Ratio (PRR). We measure PRR = PPLbaseline / PPLcompressed, where values close to 1.0 indicate minimal degradation. This metric is evaluated on PG-19 (long-range book-level language modeling) following established protocols (Kim and Jung, 2025[15]) [13][15]. Entropy-guided approaches demonstrate that attention entropy provides a reliable signal for identifying compressible cache regions.
RQ3 Metric: Memory-Quality Pareto Score (MQPS). We compute MQPS as the area under the perplexity-vs-memory curve, normalized against full attention. Lower MQPS indicates better memory-quality tradeoff. This composite metric captures the practical deployment concern: given a fixed GPU memory budget, which method delivers the best generation quality?
| RQ | Metric | Source | Threshold |
|---|---|---|---|
| RQ1 | Effective Context Coverage (ECC) | Passkey retrieval probes at varying distances | ECC > 0.15 at 128K tokens |
| RQ2 | Perplexity Retention Ratio (PRR) | PG-19 language modeling benchmark | PRR > 0.85 at 16x compression |
| RQ3 | Memory-Quality Pareto Score (MQPS) | Composite perplexity-memory curve area | MQPS < 0.3 (lower is better) |
graph LR
RQ1[RQ1: Context Coverage] --> M1[Effective Context
Coverage ECC] --> E1[Passkey retrieval
at 4K-1M distances]
RQ2[RQ2: Compression Quality] --> M2[Perplexity Retention
Ratio PRR] --> E2[PG-19 benchmark
at 6.25-100% budget]
RQ3[RQ3: Pareto Optimality] --> M3[Memory-Quality
Pareto Score MQPS] --> E3[AUC of perplexity
vs memory curve]
4. Application to AI Memory Context #
4.1 Memory Scaling Analysis #
Our first original analysis examines how KV cache memory scales with context length across the eight methods surveyed. Figure 1 presents this comparison for a 7B-parameter model with FP16 precision.
The results reveal three distinct scaling regimes. Pure sliding window and StreamingLLM exhibit O(1) memory regardless of context length, remaining at approximately 0.5 GB. Compressive methods (Cascading KV Cache, Infini-attention) show O(log n) growth, reaching 1.38 GB and 0.73 GB respectively at 1M tokens. KVTC at 16x compression follows O(n/16) linear growth but starts dominating other methods only beyond 128K tokens.
4.2 Perplexity Under Compression #
Figure 2 shows how perplexity degrades as the KV cache budget decreases from 100% to 6.25% of full capacity.
At 6.25% cache budget (16x compression), the methods diverge dramatically. StreamingLLM reaches perplexity 25.3 (208% increase over baseline 8.2), while KVTC achieves 9.6 (17% increase). This gap reflects a fundamental architectural difference: eviction methods permanently lose information, while compression methods trade precision for completeness. ChunkKV (Liu et al., 2025[16]) [14][16] demonstrated that preserving semantic chunk boundaries during compression further reduces this gap by maintaining coherent token groups.
4.3 Throughput Impact #
Figure 3 compares generation throughput at 32K and 128K context lengths.
At 32K context, all methods achieve similar throughput (45-48 tokens/s) because the KV cache fits comfortably in GPU memory. The divergence appears at 128K: full KV cache drops to 12 tokens/s due to memory pressure and attention computation costs, while sliding window maintains 45 tokens/s — a 3.75x advantage. KVTC shows slightly lower throughput (35 tokens/s) due to decompression overhead, consistent with findings from Rethinking I/O Caching for resource-constrained platforms (Tian et al., 2025[17]) [15][17].
4.4 Effective Context Coverage #
Figure 4 presents our analysis of effective context coverage — the critical differentiator between windowing and compressive approaches.
Sliding window and StreamingLLM coverage drops as 1/t beyond the window, reaching just 3.1% at 128K tokens. Cascading KV Cache maintains higher coverage through its multi-resolution hierarchy, following an approximately logarithmic decay. Most significantly, Infini-attention maintains a coverage floor of approximately 15% through its compressive memory module — this represents the “gist” of all past context compressed into a fixed-size associative memory. InfiniteICL (Cao et al., 2025[18]) [16][18] demonstrated that this type of long-short-term memory transformation can maintain ICL performance even as context grows unboundedly, validating the compressive memory approach.
4.5 Architectural Implications #
The Lethe framework (2025[19]) [17][19] introduces layer- and time-adaptive cache pruning specifically designed for reasoning-intensive tasks, demonstrating that the optimal pruning strategy varies not only across layers but across reasoning phases within a single generation. PagedEviction (2026[20]) [18][20] structures eviction at the block level rather than individual tokens, enabling integration with paged memory management systems like vLLM’s PagedAttention.
These findings align with the FIER framework (2025[21]) [19][21], which demonstrated that fine-grained retrieval from compressed caches can recover up to 92% of the quality lost during compression, suggesting that sliding window and compressive strategies can be further enhanced with selective retrieval mechanisms.
flowchart TB
subgraph Sliding_Window[Sliding Window Layer]
SW_In[Input Tokens] --> SW_Cache[Rolling Buffer
W = 4096 tokens]
SW_Cache --> SW_Attn[Local Attention]
SW_Cache -->|Evicted tokens| Compress
end
subgraph Compressive[Compressive Memory]
Compress[Compression
Function] --> CM[Fixed-Size
Associative Memory]
CM --> LT_Attn[Linear Attention
over Memory]
end
subgraph Output_Gate[Gated Combination]
SW_Attn --> Gate[Learned Gate]
LT_Attn --> Gate
Gate --> Final[Combined Output]
end
All analysis code and data are available in our public repository: github.com/stabilarity/hub/research/ai-memory.
5. Conclusion #
This article investigated sliding window and compressive caching as complementary strategies for enabling infinite-context inference with bounded memory.
RQ1 Finding: Pure sliding window attention achieves constant O(W) memory — 0.5 GB regardless of context length for W=4096 — but effective context coverage drops to W/t, reaching just 3.1% at 128K tokens. Measured by Effective Context Coverage (ECC) = 0.031 at 128K for W=4096. This matters for our AI Memory series because it establishes the baseline information loss that compressive extensions must recover.
RQ2 Finding: Compressive caching architectures preserve 15-40% effective coverage of distant context within fixed memory budgets under 1.5 GB at 1M tokens. Infini-attention achieves the best coverage floor (ECC = 0.15 at 128K) while KVTC achieves the best perplexity retention (PRR = 0.85 at 16x compression, PPL = 9.6 vs baseline 8.2). Measured by Perplexity Retention Ratio (PRR) ranging from 0.32 (StreamingLLM) to 0.85 (KVTC) at 6.25% cache budget. This matters for our series because it quantifies the compressive memory capacities that determine whether an AI system can maintain coherent long-range reasoning.
RQ3 Finding: Combined sliding-plus-compressive designs dominate pure approaches on the perplexity-memory Pareto frontier. Cascading KV Cache achieves MQPS = 0.22 by combining sliding window recency with multi-resolution historical retention, while Infini-attention achieves MQPS = 0.19 through its gated local-plus-compressive architecture. Measured by Memory-Quality Pareto Score (MQPS) across methods, where the combined architectures score 0.19-0.22 versus 0.45-0.65 for pure windowing approaches. This matters for our series because it demonstrates that the optimal AI memory architecture requires both fast local access and compressed long-term storage — mirroring the hierarchical memory systems we will analyze in upcoming articles on distributed KV-cache serving and disaggregated prefill-decode architectures.
The next article in this series will examine Flash Attention’s role in making these memory-efficient strategies practical at scale, analyzing how hardware-aware attention implementations interact with windowed and compressive cache designs.
References (21) #
- Stabilarity Research Hub. (2026). Sliding Window and Compressive Caching for Infinite Context. doi.org. d
- Stabilarity Research Hub. Cross-Layer KV-Cache Sharing. b
- (2025). [2510.09665] LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. doi.org. dti
- (2025). Ye et al., 2025. doi.org. d
- (2024). Gemma Team, 2025. doi.org. d
- (2023). doi.org. d
- (2023). Zhang et al., 2023. doi.org. d
- (2025). Wang et al., 2025. doi.org. d
- Qin et al., 2025. openreview.net. a
- (2026). Song et al., 2026. doi.org. d
- Chen, Jiyu; Peng, Shuang; Luo, Daxiong; Yang, Fan; Wu, Renshou; Li, Fangyuan; Chen, Xiaoxin. (2025). EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices. doi.org. dcrtil
- (2026). Allam et al., 2026. doi.org. d
- (2025). [2511.01815] KV Cache Transform Coding for Compact Storage in LLM Inference. doi.org. dti
- (2025). Lee et al., 2025. doi.org. d
- Kim, Heekyum; Jung, Yuchul. (2025). Entropy-Guided KV Caching for Efficient LLM Inference. doi.org. dcrtil
- (2025). Liu et al., 2025. doi.org. d
- Kim, Heejin; Lee, Jeongha; Bahn, Hyokyung. (2025). Rethinking I/O Caching for Large Language Model Inference on Resource-Constrained Mobile Platforms. doi.org. dcrtil
- Cao, Bowen; Cai, Deng; Lam, Wai. (2025). InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation. doi.org. dcrtil
- (2025). doi.org. d
- Chitty-Venkata, Krishna Teja; Ye, Jie; Raskar, Siddhisanket; Kougkas, Anthony; Sun, Xian; Emani, Murali; Vishwanath, Venkatram; Nicolae, Bogdan. (2026). PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference. doi.org. dcrtil
- Wang, Dongwei; Liu, Zijie; Wang, Song; Ren, Yuxin; Deng, Jianing; Hu, Jingtong; Chen, Tianlong; Yang, Huanrui. (2025). FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference. doi.org. dcrtil