AI MemoryTechnical Research · Article 17 of 29

Sliding Window and Compressive Caching for Infinite Context

Academic Citation: Ivchenko, Oleh (2026). Sliding Window and Compressive Caching for Infinite Context. Research article: Sliding Window and Compressive Caching for Infinite Context. Odessa National Polytechnic University, Department of Economic Cybernetics.
DOI: 10.5281/zenodo.19299498^[1] · View on Zenodo (CERN)

DOI: 10.5281/zenodo.19299498^[1]Zenodo Archive Source Code & Data Charts (4)ORCID

2,252 words · 70% fresh refs · 3 diagrams · 26 references

81stabilfr·wdophcgmx

Badge	Metric	Value	Status	Description
[s]	Reviewed Sources	23%	○	≥80% from editorially reviewed sources
[t]	Trusted	88%	✓	≥80% from verified, high-quality sources
[a]	DOI	77%	○	≥80% have a Digital Object Identifier
[b]	CrossRef	23%	○	≥80% indexed in CrossRef
[i]	Indexed	85%	✓	≥80% have metadata indexed
[l]	Academic	81%	✓	≥80% from journals/conferences/preprints
[f]	Free Access	96%	✓	≥80% are freely accessible
[r]	References	26 refs	✓	Minimum 10 references required
[w]	Words [REQ]	2,252	✓	Minimum 2,000 words for a full research article. Current: 2,252
[d]	DOI [REQ]	✓	✓	Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19299498
[o]	ORCID [REQ]	✓	✓	Author ORCID verified for academic identity
[p]	Peer Reviewed [REQ]	—	✗	Peer reviewed by an assigned reviewer
[h]	Freshness [REQ]	70%	✓	≥60% of references from 2025–2026. Current: 70%
[c]	Data Charts	4	✓	Original data charts from reproducible analysis (min 2). Current: 4
[g]	Code	✓	✓	Source code available on GitHub
[m]	Diagrams	3	✓	Mermaid architecture/flow diagrams. Current: 3
[x]	Cited by	0	○	Referenced by 0 other hub article(s)

Score = Ref Trust (82 × 60%) + Required (4/5 × 30%) + Optional (3/4 × 10%)

Abstract #

As large language models (LLMs) scale to context windows exceeding one million tokens, the key-value (KV) cache grows linearly and becomes the dominant memory bottleneck during autoregressive inference. Sliding window attention and compressive caching represent two complementary families of techniques that bound memory usage while preserving access to long-range context. This article investigates three research questions: (1) how do sliding window strategies trade off memory efficiency against effective context coverage, (2) what compressive caching architectures best preserve long-range information within fixed memory budgets, and (3) what combined sliding-plus-compressive designs achieve the optimal perplexity-memory Pareto frontier for production inference. Drawing on 2025-2026 research including StreamingLLM, Cascading KV Cache, CAKE, Infini-attention, EdgeInfinite, KVTC, and SAGE-KV, we find that pure sliding windows achieve constant O(W) memory but lose all information beyond the window boundary, while compressive methods such as Infini-attention and Cascading KV Cache retain 15-40% effective coverage of distant tokens at under 1.5 GB for million-token sequences. Combined sliding-compressive architectures reduce perplexity degradation to under 17% at 6.25% cache budget, compared to 208% for naive sliding windows. We present original analysis of memory scaling, perplexity degradation curves, throughput benchmarks, and effective context coverage across eight methods, with all code and data available in our public repository.

1. Introduction #

In our previous article^[2], we established that cross-layer KV-cache sharing exploits depth-wise redundancy to reduce cache memory by 50-75% with minimal quality loss. That work addressed the depth dimension of the KV cache; this article turns to the sequence dimension, investigating how sliding window attention and compressive caching strategies manage the linear growth of cache memory as context lengths extend toward infinity.

The problem is fundamental: for a 7B-parameter model using FP16 precision, the KV cache consumes approximately 1 GB per 8,192 tokens across all layers (Liu et al., 2025^[3]) [1]^[3]. At 1 million tokens, this reaches 128 GB — far exceeding the memory of any single GPU. Real-world applications including document analysis, code generation, and multi-turn dialogue increasingly demand such long contexts (Ye et al., 2025^[4]) [2]^[4], making memory-bounded inference essential.

Research Questions #

RQ1: How do sliding window attention strategies trade off memory efficiency against effective context coverage, and what is the information loss profile beyond the window boundary?

RQ2: What compressive caching architectures best preserve long-range information within fixed memory budgets, and how do they compare in perplexity retention?

RQ3: What combined sliding-plus-compressive designs achieve the optimal perplexity-memory Pareto frontier for production deployment?

These questions matter for our AI Memory series because sliding windows and compressive caching are foundational primitives upon which all modern long-context serving systems are built. Understanding their individual and combined characteristics is prerequisite to evaluating the disaggregated prefill-decode architectures and distributed KV-cache systems we will examine in subsequent articles.

2. Existing Approaches (2026 State of the Art) #

The landscape of bounded-memory inference can be organized into three families: static windowing, importance-based eviction, and compressive memory systems.

Static Sliding Window Attention. Introduced architecturally in Mistral 7B and formalized through Longformer’s local attention patterns, sliding window attention restricts each token’s attention to the most recent W tokens. Mistral’s implementation uses a rolling buffer KV cache of fixed size W, achieving constant memory regardless of sequence length. The Gemma 2 architecture (Gemma Team, 2025^[5]) [3]^[5] alternates sliding window layers with full-attention layers, providing a hybrid approach. However, pure sliding windows discard all information beyond position t-W, creating an absolute information boundary.

Attention Sink + Window (StreamingLLM). Xiao et al. (2024^[6]) [4]^[6] discovered that the first few tokens in any sequence receive disproportionately high attention scores regardless of their semantic content — a phenomenon termed “attention sinks.” StreamingLLM retains these initial sink tokens alongside a sliding window, enabling stable perplexity during infinite-length streaming. This simple modification prevents the catastrophic perplexity explosion observed when naively evicting initial tokens.

Importance-Based Eviction. H2O (Heavy-Hitter Oracle) (Zhang et al., 2023^[7]) [5]^[7] dynamically evicts tokens based on cumulative attention scores, retaining “heavy hitter” tokens that receive high attention across decoding steps. SAGE-KV (Wang et al., 2025^[8]) [6]^[8] refines this with self-attention-guided eviction that achieves 4x memory efficiency over StreamingLLM while improving accuracy. CAKE (Qin et al., 2025^[9]) [7]^[9] introduces cascading eviction with layer-specific preferences, recognizing that different transformer layers exhibit different attention sparsity patterns.

Compressive Memory Systems. Elastic Memory (Song et al., 2026^[10]) [8]^[10] introduces a compressive recurrent memory architecture grounded in the HiPPO framework for online function approximation, treating historical context as samples from continuous signals and compressing them into a fixed-size memory state via polynomial sampling. This represents a principled advance over earlier associative memory approaches such as Infini-attention (Munkhdalai et al., 2024), with Elastic Memory outperforming prior baselines across long-context (32K+) datasets with 16× greater memory efficiency. EdgeInfinite (Chen et al., 2025^[11]) [9]^[11] extends compressive memory for edge deployment, achieving comparable performance on long-context benchmarks while fitting within mobile device memory constraints.

Cascading Sub-Caches. Hierarchical KV cache partitioning — systematized by Allam et al. in their 2026 survey on KV cache optimization strategies (Allam et al., 2026^[12]) [10]^[12] — partitions the fixed-size cache into tiered sub-buffers where each subsequent tier accepts tokens at e[REDACTED]nentially decreasing rates from the previous tier. Originally proposed as the Cascading KV Cache (Willette et al., 2024), this design creates a multi-resolution temporal memory: recent tokens are stored at full fidelity, while older tokens are represented by their most attention-significant exemplars.

Transform Coding. KVTC (2026^[13]) [11]^[13], published at ICLR 2026, applies classical media compression principles to KV caches, achieving up to 20x compression with less than 1 point perplexity degradation. Unlike eviction methods that discard tokens entirely, KVTC preserves all tokens in compressed form.

flowchart TD
    A[Bounded-Memory KV Cache Strategies] --> B[Static Windowing]
    A --> C[Importance-Based Eviction]
    A --> D[Compressive Memory]
    B --> B1[Sliding Window
Mistral, Gemma 2]
    B --> B2[Sink + Window
StreamingLLM]
    C --> C1[Score-Based
H2O, SAGE-KV]
    C --> C2[Layer-Adaptive
CAKE]
    C --> C3[Semantic-Preserving
ChunkKV]
    D --> D1[Associative Memory
Infini-attention]
    D --> D2[Cascading Buffers
Cascading KV]
    D --> D3[Transform Coding
KVTC]

3. Quality Metrics and Evaluation Framework #

We evaluate sliding window and compressive caching methods across three dimensions aligned with our research questions.

RQ1 Metric: Effective Context Coverage (ECC). We define ECC as the fraction of past tokens that still influence the model’s predictions at any given sequence position. For a sliding window of size W at position t, ECC = min(1, W/t). For compressive methods, ECC accounts for the information retained in compressed representations, measured through probing tasks that test retrieval of specific facts at varying distances from the current position. Prior work on passkey retrieval (Lee et al., 2025^[14]) [12]^[14] established standardized evaluation protocols: InfiniteHiP demonstrated that hierarchical pruning with offloading can extend effective context to 3 million tokens on a single GPU with 97.9% passkey accuracy.

RQ2 Metric: Perplexity Retention Ratio (PRR). We measure PRR = PPLbaseline / PPLcompressed, where values close to 1.0 indicate minimal degradation. This metric is evaluated on PG-19 (long-range book-level language modeling) following established protocols (Kim and Jung, 2025^[15]) [13]^[15]. Entropy-guided approaches demonstrate that attention entropy provides a reliable signal for identifying compressible cache regions.

RQ3 Metric: Memory-Quality Pareto Score (MQPS). We compute MQPS as the area under the perplexity-vs-memory curve, normalized against full attention. Lower MQPS indicates better memory-quality tradeoff. This composite metric captures the practical deployment concern: given a fixed GPU memory budget, which method delivers the best generation quality?

RQ	Metric	Source	Threshold
RQ1	Effective Context Coverage (ECC)	Passkey retrieval probes at varying distances	ECC > 0.15 at 128K tokens
RQ2	Perplexity Retention Ratio (PRR)	PG-19 language modeling benchmark	PRR > 0.85 at 16x compression
RQ3	Memory-Quality Pareto Score (MQPS)	Composite perplexity-memory curve area	MQPS < 0.3 (lower is better)

graph LR
    RQ1[RQ1: Context Coverage] --> M1[Effective Context
Coverage ECC] --> E1[Passkey retrieval
at 4K-1M distances]
    RQ2[RQ2: Compression Quality] --> M2[Perplexity Retention
Ratio PRR] --> E2[PG-19 benchmark
at 6.25-100% budget]
    RQ3[RQ3: Pareto Optimality] --> M3[Memory-Quality
Pareto Score MQPS] --> E3[AUC of perplexity
vs memory curve]

4. Application to AI Memory Context #

4.1 Memory Scaling Analysis #

Our first original analysis examines how KV cache memory scales with context length across the eight methods surveyed. Figure 1 presents this comparison for a 7B-parameter model with FP16 precision.

KV Cache Memory Scaling Comparison — Figure 1: KV cache memory scaling from 4K to 1M tokens. Full KV cache grows linearly to 128 GB, while sliding window and compressive methods remain bounded. Cascading KV and Infini-attention show sub-linear growth with logarithmic characteristics.

The results reveal three distinct scaling regimes. Pure sliding window and StreamingLLM exhibit O(1) memory regardless of context length, remaining at approximately 0.5 GB. Compressive methods (Cascading KV Cache, Infini-attention) show O(log n) growth, reaching 1.38 GB and 0.73 GB respectively at 1M tokens. KVTC at 16x compression follows O(n/16) linear growth but starts dominating other methods only beyond 128K tokens.

4.2 Perplexity Under Compression #

Figure 2 shows how perplexity degrades as the KV cache budget decreases from 100% to 6.25% of full capacity.

Perplexity vs Cache Budget — Figure 2: Perplexity degradation on PG-19 (Llama-2 7B, 32K context) as cache budget decreases. KVTC and Cascading KV Cache show graceful degradation, while StreamingLLM and H2O degrade sharply below 25% budget.

At 6.25% cache budget (16x compression), the methods diverge dramatically. StreamingLLM reaches perplexity 25.3 (208% increase over baseline 8.2), while KVTC achieves 9.6 (17% increase). This gap reflects a fundamental architectural difference: eviction methods permanently lose information, while compression methods trade precision for completeness. ChunkKV (Liu et al., 2025^[16]) [14]^[16] demonstrated that preserving semantic chunk boundaries during compression further reduces this gap by maintaining coherent token groups.

4.3 Throughput Impact #

Figure 3 compares generation throughput at 32K and 128K context lengths.

Throughput Comparison — Figure 3: Decoding throughput comparison across methods at 32K and 128K context on A100 80GB. Sliding window methods maintain near-baseline throughput at long contexts due to bounded attention computation.

At 32K context, all methods achieve similar throughput (45-48 tokens/s) because the KV cache fits comfortably in GPU memory. The divergence appears at 128K: full KV cache drops to 12 tokens/s due to memory pressure and attention computation costs, while sliding window maintains 45 tokens/s — a 3.75x advantage. KVTC shows slightly lower throughput (35 tokens/s) due to decompression overhead, consistent with findings from Rethinking I/O Caching for resource-constrained platforms (Tian et al., 2025^[17]) [15]^[17].

4.4 Effective Context Coverage #

Figure 4 presents our analysis of effective context coverage — the critical differentiator between windowing and compressive approaches.

Sliding window and StreamingLLM coverage drops as 1/t beyond the window, reaching just 3.1% at 128K tokens. Cascading KV Cache maintains higher coverage through its multi-resolution hierarchy, following an approximately logarithmic decay. Most significantly, Infini-attention maintains a coverage floor of approximately 15% through its compressive memory module — this represents the “gist” of all past context compressed into a fixed-size associative memory. InfiniteICL (Cao et al., 2025^[18]) [16]^[18] demonstrated that this type of long-short-term memory transformation can maintain ICL performance even as context grows unboundedly, validating the compressive memory approach.

4.5 Architectural Implications #

The Lethe framework (2025^[19]) [17]^[19] introduces layer- and time-adaptive cache pruning specifically designed for reasoning-intensive tasks, demonstrating that the optimal pruning strategy varies not only across layers but across reasoning phases within a single generation. PagedEviction (2026^[20]) [18]^[20] structures eviction at the block level rather than individual tokens, enabling integration with paged memory management systems like vLLM’s PagedAttention.

These findings align with the FIER framework (2025^[21]) [19]^[21], which demonstrated that fine-grained retrieval from compressed caches can recover up to 92% of the quality lost during compression, suggesting that sliding window and compressive strategies can be further enhanced with selective retrieval mechanisms.

flowchart TB
    subgraph Sliding_Window[Sliding Window Layer]
        SW_In[Input Tokens] --> SW_Cache[Rolling Buffer
W = 4096 tokens]
        SW_Cache --> SW_Attn[Local Attention]
        SW_Cache -->|Evicted tokens| Compress
    end
    subgraph Compressive[Compressive Memory]
        Compress[Compression
Function] --> CM[Fixed-Size
Associative Memory]
        CM --> LT_Attn[Linear Attention
over Memory]
    end
    subgraph Output_Gate[Gated Combination]
        SW_Attn --> Gate[Learned Gate]
        LT_Attn --> Gate
        Gate --> Final[Combined Output]
    end

All analysis code and data are available in our public repository: github.com/stabilarity/hub/research/ai-memory.

5. Conclusion #

This article investigated sliding window and compressive caching as complementary strategies for enabling infinite-context inference with bounded memory.

RQ1 Finding: Pure sliding window attention achieves constant O(W) memory — 0.5 GB regardless of context length for W=4096 — but effective context coverage drops to W/t, reaching just 3.1% at 128K tokens. Measured by Effective Context Coverage (ECC) = 0.031 at 128K for W=4096. This matters for our AI Memory series because it establishes the baseline information loss that compressive extensions must recover.

RQ2 Finding: Compressive caching architectures preserve 15-40% effective coverage of distant context within fixed memory budgets under 1.5 GB at 1M tokens. Infini-attention achieves the best coverage floor (ECC = 0.15 at 128K) while KVTC achieves the best perplexity retention (PRR = 0.85 at 16x compression, PPL = 9.6 vs baseline 8.2). Measured by Perplexity Retention Ratio (PRR) ranging from 0.32 (StreamingLLM) to 0.85 (KVTC) at 6.25% cache budget. This matters for our series because it quantifies the compressive memory capacities that determine whether an AI system can maintain coherent long-range reasoning.

RQ3 Finding: Combined sliding-plus-compressive designs dominate pure approaches on the perplexity-memory Pareto frontier. Cascading KV Cache achieves MQPS = 0.22 by combining sliding window recency with multi-resolution historical retention, while Infini-attention achieves MQPS = 0.19 through its gated local-plus-compressive architecture. Measured by Memory-Quality Pareto Score (MQPS) across methods, where the combined architectures score 0.19-0.22 versus 0.45-0.65 for pure windowing approaches. This matters for our series because it demonstrates that the optimal AI memory architecture requires both fast local access and compressed long-term storage — mirroring the hierarchical memory systems we will analyze in upcoming articles on distributed KV-cache serving and disaggregated prefill-decode architectures.

The next article in this series will examine Flash Attention’s role in making these memory-efficient strategies practical at scale, analyzing how hardware-aware attention implementations interact with windowed and compressive cache designs.

References (21) #

Stabilarity Research Hub. (2026). Sliding Window and Compressive Caching for Infinite Context. doi.org. d t i l
Stabilarity Research Hub. Cross-Layer KV-Cache Sharing. t i b
(2025). [2510.09665] LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. doi.org. d t i l
Ye, Xiaoju, Wang, Zhichun, Wang, Jingyuan. (2025). Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing. doi.org. d t i l
Gemma Team, Riviere, Morgane, Pathak, Shreya, Sessa, Pier Giuseppe, et al.. (2024). Gemma 2: Improving Open Language Models at a Practical Size. doi.org. d t i l
Xiao, Guangxuan, Tian, Yuandong, Chen, Beidi, Han, Song, et al.. (2023). Efficient Streaming Language Models with Attention Sinks. doi.org. d t i l
O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models" data-ref-authors="Zhang, Zhenyu, Sheng, Ying, Zhou, Tianyi, Chen, Tianlong, et al." data-ref-year="2023" data-ref-source="doi.org" data-ref-url="https://doi.org/10.48550/arXiv.2306.14048" data-ref-accessed="Mar 28, 2026" data-ref-dbid="10262" data-ref-type="doi" data-crossref="0" data-doi="1" data-peer="0" data-trusted="1" data-indexed="1" data-access="free">Zhang, Zhenyu, Sheng, Ying, Zhou, Tianyi, Chen, Tianlong, et al.. (2023). HO: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. doi.org. d t i l
Wang, Guangtao, Upasani, Shubhangi, Wu, Chen, Gandhi, Darshan, et al.. (2025). LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference. doi.org. d t i l
Qin et al., 2025. openreview.net. a
Song, Yunchong, Kai, Jushi, Lu, Liming, Qiu, Kaixi, et al.. (2026). Towards Compressive and Scalable Recurrent Memory. doi.org. d t i l
Chen, Jiyu; Peng, Shuang; Luo, Daxiong; Yang, Fan; Wu, Renshou; Li, Fangyuan; Chen, Xiaoxin. (2025). EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices. doi.org. d c r t i l
Xu, Yichun, Khaira, Navjot K., Singh, Tejinder. (2026). KV Cache Optimization Strategies for Scalable and Efficient LLM Inference. doi.org. d t i l
(2025). [2511.01815] KV Cache Transform Coding for Compact Storage in LLM Inference. doi.org. d t i l
Lee, Heejun, Park, Geon, Suh, Jaduk, Hwang, Sung Ju. (2025). InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU. doi.org. d t i l
Kim, Heekyum; Jung, Yuchul. (2025). Entropy-Guided KV Caching for Efficient LLM Inference. doi.org. d c r t i l
Liu, Xiang, Tang, Zhenheng, Dong, Peijie, Li, Zeyu, et al.. (2025). ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference. doi.org. d t i l
Kim, Heejin; Lee, Jeongha; Bahn, Hyokyung. (2025). Rethinking I/O Caching for Large Language Model Inference on Resource-Constrained Mobile Platforms. doi.org. d c r t i l
Cao, Bowen; Cai, Deng; Lam, Wai. (2025). InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation. doi.org. d c r t i l
Zeng, Hui, Zhao, Daming, Yang, Pengfei, Hou, WenXuan, et al.. (2025). Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving. doi.org. d t i l
Chitty-Venkata, Krishna Teja; Ye, Jie; Raskar, Siddhisanket; Kougkas, Anthony; Sun, Xian; Emani, Murali; Vishwanath, Venkatram; Nicolae, Bogdan. (2026). PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference. doi.org. d c r t i l
Wang, Dongwei; Liu, Zijie; Wang, Song; Ren, Yuxin; Deng, Jianing; Hu, Jingtong; Chen, Tianlong; Yang, Huanrui. (2025). FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference. doi.org. d c r t i l

Version History · 4 revisions

Rev	Date	Status	Action	By	Size
v1	Mar 29, 2026	DRAFT	Initial draft First version created	(w) Author	17,282 (+17282)
v2	Mar 30, 2026	PUBLISHED	Published Article published to research hub	(w) Author	17,546 (+264)
v3	Mar 30, 2026	REDACTED	Editorial review Quality assurance pass	(r) Redactor	17,713 (+167)
v4	Mar 30, 2026	CURRENT	Content update Section additions or elaboration	(w) Author	18,157 (+444)