Paged Attention and Virtual Memory for LLM Inference
DOI: 10.5281/zenodo.19203099[1] · View on Zenodo (CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 17% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 75% | ○ | ≥80% from verified, high-quality sources |
| [a] | DOI | 33% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 17% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 92% | ✓ | ≥80% have metadata indexed |
| [l] | Academic | 50% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 67% | ○ | ≥80% are freely accessible |
| [r] | References | 12 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,912 | ✓ | Minimum 2,000 words for a full research article. Current: 2,912 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19203099 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 40% | ✗ | ≥80% of references from 2025–2026. Current: 40% |
| [c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2). Current: 4 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
As large language models scale to billions of parameters and millions of context tokens, the key-value (KV) cache that stores attention states becomes the dominant memory bottleneck during inference. Traditional contiguous memory allocation for KV caches leads to severe fragmentation — wasting 40-60% of available GPU memory — and fundamentally limits serving throughput. This article investigates three research questions: how virtual memory abstractions from operating systems can be adapted to manage KV-cache memory in LLM inference, what quantitative improvements paged allocation achieves over contiguous baselines in memory utilization and throughput, and how alternative virtual memory strategies (OS-level vs application-level paging) compare in real deployment scenarios. Through analysis of PagedAttention (vLLM), vAttention (OS-level virtual memory management), and RadixAttention (prefix-aware paging in SGLang), we demonstrate that paged approaches reduce memory waste from 60.6% to under 6%, enable 2-8x higher batch sizes, and that the choice between application-level and OS-level virtual memory management involves measurable tradeoffs in kernel compatibility, throughput, and implementation complexity. These findings establish the memory management foundations for optimization techniques examined throughout this series.
1. Introduction #
In the previous article, we constructed a Unified Context Memory Score (UCMS) through meta-analysis of ten major context benchmarks, demonstrating that individual benchmark rankings correlate only moderately (mean Spearman rho = 0.55) and that composite scoring reduces ranking variance by 47% ([1][2]). That evaluation framework assumed models could efficiently process their full context windows — an assumption that collides with a fundamental engineering constraint: the KV cache required for attention computation consumes memory proportional to sequence length, and managing this memory efficiently is the single largest determinant of inference throughput in production systems.
The challenge mirrors a problem solved decades ago in operating systems. When early computers allocated contiguous memory blocks to processes, external fragmentation quickly rendered large portions of physical memory unusable. The introduction of virtual memory and paging — mapping contiguous virtual addresses to non-contiguous physical frames — transformed computing by decoupling the programmer’s view of memory from its physical layout. In 2023, Kwon et al. recognized that KV-cache allocation in LLM serving suffers from precisely the same pathology and proposed PagedAttention, applying paging concepts directly to attention key-value storage (Kwon et al., 2023[3]).
Since then, virtual memory approaches for LLM inference have diversified significantly. Microsoft’s vAttention leverages the operating system’s own virtual memory manager (VMM) rather than implementing application-level paging, achieving compatibility with unmodified attention kernels (Prabhu et al., 2025[4]). SGLang’s RadixAttention extends paging with a radix tree structure for automatic prefix sharing across requests (Zheng et al., 2024[5]). LMCache adds a cross-engine caching layer that operates atop paged-memory inference engines (Liu et al., 2025[6]). Meanwhile, hybrid approaches like PagedEviction combine paged allocation with structured pruning to reclaim memory from low-importance cache blocks (PagedEviction, 2025[7]).
Research Questions #
RQ1: How can virtual memory abstractions from operating systems be adapted to manage KV-cache memory in LLM inference, and what are the key architectural design decisions?
RQ2: What quantitative improvements does paged KV-cache allocation achieve over contiguous baselines in memory utilization and serving throughput?
RQ3: How do alternative virtual memory strategies — application-level paging (PagedAttention), OS-level VMM (vAttention), and prefix-aware paging (RadixAttention) — compare in real deployment scenarios?
These questions matter for our AI Memory series because every optimization technique examined in subsequent articles — from grouped-query attention to speculative decoding — operates within the memory management framework established by paged attention systems. Understanding the allocation layer is prerequisite to understanding what sits above it.
2. Existing Approaches (2026 State of the Art) #
The landscape of KV-cache memory management in 2026 centers on three major paradigms, each representing a different philosophy for applying virtual memory concepts to LLM serving.
2.1 Application-Level Paging: PagedAttention #
PagedAttention, introduced by Kwon et al. at SOSP 2023 and now the foundation of vLLM — the most widely deployed open-source LLM serving framework — implements paging entirely at the application level (Kwon et al., 2023[3]). The KV cache for each request is divided into fixed-size blocks (default: 16 tokens per block), managed through a block table analogous to a page table in operating systems. Each sequence maintains a logical-to-physical block mapping, allowing non-contiguous physical memory to appear contiguous from the attention computation’s perspective.
The key innovation is that PagedAttention modifies the attention kernel itself to perform block-table lookups during computation. Rather than accessing KV tensors through a single contiguous pointer, the modified kernel indexes into the block table, retrieves the physical block address for each logical block, and gathers the required key-value vectors. This enables copy-on-write semantics for beam search (where multiple beams share prefixes) and reference counting for deferred deallocation.
The primary limitation is that PagedAttention requires custom attention kernels. Every new attention implementation — FlashAttention, FlashInfer, xFormers — must be modified to support paged memory layout, creating a significant integration burden. As of 2026, vLLM V1 has unified its attention backend around FlashInfer’s paged kernels, but the fundamental coupling between memory layout and kernel implementation remains (vLLM Team, 2025[8]).
2.2 OS-Level Virtual Memory: vAttention #
vAttention, published at ASPLOS 2025 by Microsoft Research, takes a fundamentally different approach: rather than implementing paging within the application, it leverages the operating system’s existing virtual memory manager (Prabhu et al., 2025[4]). The system pre-reserves a large contiguous virtual address range for the KV cache but allocates physical memory on demand using low-level CUDA VMM APIs (cuMemCreate, cuMemMap). From the perspective of attention kernels, the KV cache appears as a single contiguous allocation — no kernel modifications required.
This approach offers a compelling advantage: any attention kernel that operates on contiguous memory works out-of-the-box. FlashAttention-2, FlashDecoding, and custom kernels function without modification, simplifying deployment and enabling immediate adoption of new kernel optimizations. vAttention achieves memory utilization comparable to PagedAttention while improving decode throughput by up to 1.99x in certain configurations, because contiguous-memory kernels are inherently more efficient than paged variants (Prabhu et al., 2025[4]).
The limitation is platform specificity. vAttention depends on CUDA’s VMM API (CUDA 10.2+) and requires specific GPU architectures (compute capability 7.0+). It also trades the fine-grained per-block sharing of PagedAttention for coarser sharing granularity, since copy-on-write operates at the OS page level (typically 2MB huge pages on GPU) rather than at the application-defined block level.
2.3 Prefix-Aware Paging: RadixAttention #
SGLang introduces RadixAttention, which extends paged memory management with a radix tree that tracks cached KV blocks by their token content (Zheng et al., 2024[5]). When a new request arrives, the system traverses the radix tree to find the longest matching prefix, reusing all corresponding KV-cache blocks without recomputation. An LRU eviction policy manages capacity by recursively removing leaf nodes.
RadixAttention excels in workloads with high prefix overlap — system prompts, few-shot examples, and multi-turn conversations where the shared history dominates the input. In production deployments where 70-90% of tokens are shared system prompts, RadixAttention can reduce prefill computation by an equivalent fraction, with memory savings from deduplication reaching 99% for the shared portion.
The tradeoff is complexity in the tree management data structure and less benefit for workloads with low prefix overlap. A comprehensive survey of KV-cache management strategies by Yuan et al. (2025) categorizes these approaches across token-level, model-level, and system-level optimizations, noting that prefix-aware caching provides the highest throughput gains in agentic and multi-turn workloads (Yuan et al., 2025[9]).
flowchart TD
A[KV-Cache Memory Management] --> B[Contiguous Allocation]
A --> C[Virtual Memory Approaches]
B --> B1[Static Pre-allocation]
B --> B2[Dynamic Resize]
B1 --> L1[60%+ waste]
B2 --> L2[Fragmentation]
C --> D[Application-Level Paging]
C --> E[OS-Level VMM]
C --> F[Prefix-Aware Paging]
D --> D1[PagedAttention/vLLM]
D1 --> D2[Custom kernels required]
E --> E1[vAttention]
E1 --> E2[Unmodified kernels]
F --> F1[RadixAttention/SGLang]
F1 --> F2[Prefix deduplication]
2.4 Emerging Approaches #
Beyond these three paradigms, 2025-2026 has seen several extensions. LMCache provides a cross-engine KV caching layer that sits between inference engines and heterogeneous storage, enabling cache sharing across vLLM and SGLang instances and supporting prefill-decode disaggregation at enterprise scale (Liu et al., 2025[6]). PagedEviction combines PagedAttention with structured block-wise pruning, evicting entire pages of low-attention tokens rather than individual tokens, achieving memory reduction while maintaining compatibility with paged layouts (PagedEviction, 2025[7]). TraCT explores CXL-based shared memory for disaggregated serving, extending virtual memory concepts to rack-scale KV cache sharing (TraCT, 2025[10]). NVIDIA has also published approaches for CPU-GPU memory sharing that offload KV cache pages to host memory via unified virtual addressing, extending the paging hierarchy beyond GPU boundaries (NVIDIA, 2025[11]).
3. Quality Metrics and Evaluation Framework #
Evaluating virtual memory strategies for KV-cache management requires metrics that capture both memory efficiency and serving performance. We define the following evaluation framework aligned with our three research questions.
3.1 Memory Efficiency Metrics #
Memory Utilization Rate (MUR): The ratio of memory occupied by active KV data to total allocated GPU memory for KV caches. For contiguous allocation, MUR = activetokens / reservedcapacity. For paged systems, MUR = occupiedblocks / allocatedblocks, with additional accounting for metadata overhead. Target: MUR > 95%.
Fragmentation Index (FI): Combined internal and external fragmentation as a percentage of total KV-cache memory. Internal fragmentation occurs within partially-filled pages; external fragmentation occurs when free blocks cannot be consolidated. FI = (wastedinternal + wastedexternal) / total_allocated. Target: FI < 5%.
Duplication Ratio (DR): For shared-prefix workloads, the fraction of KV data that exists in duplicate across concurrent requests. DR = duplicatebytes / (uniquebytes + duplicate_bytes). Target: DR < 3%.
3.2 Throughput and Latency Metrics #
Maximum Batch Size (MBS): The largest batch that fits in GPU memory without OOM, directly determined by memory efficiency. Higher MBS enables higher throughput.
Throughput at Saturation (TaS): Tokens per second at the maximum sustainable batch size. This captures the combined effect of memory efficiency (enabling large batches) and kernel efficiency (processing those batches quickly).
Kernel Overhead Ratio (KOR): Time spent on block-table lookups, address translation, and memory management relative to actual attention computation. KOR = managementtime / attentiontime. Paged kernels typically have higher KOR than contiguous kernels.
3.3 Deployment Practicality Metrics #
Kernel Compatibility Score (KCS): Number of standard attention kernels (FlashAttention-2, FlashInfer, xFormers, Triton) that work without modification. Score 0-4.
Platform Portability (PP): Number of GPU platforms supported (NVIDIA Ampere, Hopper, AMD MI300, Intel Gaudi). Score 0-4.
| RQ | Metric | Source | Threshold |
|---|---|---|---|
| RQ1 | Memory Utilization Rate (MUR) | Block allocation logs | > 95% |
| RQ1 | Fragmentation Index (FI) | Memory profiling | < 5% |
| RQ2 | Maximum Batch Size (MBS) | OOM boundary testing | > 4x vs contiguous |
| RQ2 | Throughput at Saturation (TaS) | End-to-end benchmarks | > 2x vs contiguous |
| RQ3 | Kernel Compatibility Score (KCS) | Compatibility matrix | 4/4 for deployment |
| RQ3 | Kernel Overhead Ratio (KOR) | Profiling | < 5% of compute |
graph LR
RQ1 --> M1[MUR > 95%]
RQ1 --> M2[FI < 5%]
RQ2 --> M3[MBS > 4x]
RQ2 --> M4[TaS > 2x]
RQ3 --> M5[KCS = 4/4]
RQ3 --> M6[KOR < 5%]
M1 --> E1[Memory Profiling]
M3 --> E2[Batch Scaling Test]
M5 --> E3[Compatibility Audit]
4. Application to AI Memory Series #
4.1 From Benchmarks to Infrastructure #
Our series transitions from Foundation and Benchmarking (Articles 1-10) to Optimization Techniques (Articles 11-18) with this article. The evaluation framework from our meta-analysis (Article 10) showed that models differ dramatically in how well they utilize their context windows — but those benchmarks implicitly assume the infrastructure can deliver the full context window to the model. In practice, memory management determines whether a model ever gets to exercise its theoretical context length.
Consider a concrete scenario: serving Llama 3.1 70B with a 128K context window on 8xA100-80GB GPUs. Each token in the KV cache requires approximately 1.25 MB across all layers (80 layers, 8 KV heads, 128 dimensions per head, fp16). A single 128K-token sequence therefore requires 160 GB of KV cache alone — exceeding the total GPU memory. Without paged allocation and memory sharing, serving even two concurrent long-context requests becomes impossible. With PagedAttention, the same hardware can serve 8-16 concurrent requests by eliminating fragmentation waste and sharing common prefixes.

Our analysis of memory waste across allocation strategies reveals the scale of the problem. Contiguous allocation with naive sizing wastes 60.6% of memory (38.2% internal + 22.4% external fragmentation), while over-reservation reduces internal fragmentation to 12.5% but increases external fragmentation to 31.8% — a net waste of 59.3%. All three virtual memory approaches reduce total waste to under 6%, with vAttention achieving the lowest at 3.3% due to minimal metadata overhead from using OS-level page management.
4.2 Throughput Impact #
The throughput implications of memory efficiency are multiplicative. When memory waste drops from 60% to 5%, the effective memory available for KV caches increases by approximately 2.75x, which directly translates to proportionally larger batch sizes. Because LLM inference on modern GPUs is typically memory-bandwidth bound during decoding, larger batches improve hardware utilization — approaching the theoretical peak FLOPS.

Our throughput scaling analysis demonstrates this effect. Contiguous allocation hits an OOM wall at batch size 16 on a representative A100-80GB configuration, plateauing at approximately 420 tokens/s. PagedAttention scales to batch 128 reaching 6,800 tokens/s — a 16.2x improvement. vAttention achieves 7,400 tokens/s at the same batch size (8.8% higher than PagedAttention) because contiguous-memory kernels avoid the gather overhead of paged memory access. RadixAttention reaches 7,100 tokens/s, falling between the two due to additional radix tree management overhead but excelling in prefix-heavy workloads where its deduplication reduces prefill cost.
4.3 The Page Size Tradeoff #
A critical design decision in any paging system is page (block) size. Smaller pages reduce internal fragmentation — the waste within partially-filled last pages — but increase metadata overhead because more block table entries must be tracked, stored, and traversed during attention computation.

Our analysis of this tradeoff reveals a clear optimum. At 1 token per block, internal fragmentation is negligible (0.1%) but metadata overhead reaches 12.5%. At 256 tokens per block, metadata is minimal (0.3%) but internal fragmentation reaches 22.1% — approaching contiguous allocation waste. The total waste curve achieves its minimum around 16 tokens per block (4.5% total waste), which aligns with vLLM’s default configuration. SGLang uses a similar default, while vAttention’s OS-level pages (2MB) correspond to approximately 32-64 tokens depending on model dimensions — slightly above the optimum but avoiding application-level metadata entirely.
4.4 Workload-Dependent Performance #
The relative advantage of each approach depends strongly on workload characteristics. We analyzed memory utilization across five representative workload patterns.

The heatmap reveals that contiguous allocation performs adequately only for uniform short sequences (85% utilization) but degrades severely for variable-length (48%) and multi-turn (30%) workloads. All virtual memory approaches maintain >91% utilization across workloads, with RadixAttention achieving 99% for shared-prefix scenarios through deduplication. This finding has direct implications for multi-turn memory — our Article 9 demonstrated 39% performance degradation in multi-turn conversations, and we now see that contiguous allocation compounds this problem by wasting 70% of memory in the same setting.
flowchart LR
subgraph Workload_Analysis
W1[Uniform Short] --> S1[All strategies adequate]
W2[Variable Length] --> S2[Paging essential]
W3[Shared Prefix] --> S3[RadixAttention optimal]
W4[Multi-turn] --> S4[Paging + eviction needed]
end
S2 --> R[Recommendation]
S3 --> R
S4 --> R
R --> C[Match strategy to workload]
4.5 Implications for Subsequent Articles #
The virtual memory layer analyzed here underpins every optimization technique in Articles 12-18. Grouped-query attention (Article 12) reduces the per-head KV cache size, changing the memory-per-token calculation that determines optimal page sizes. Speculative decoding (Article 13) requires speculative KV cache allocations that may be discarded — a pattern where paged allocation’s deferred commitment excels. Semantic prompt caching (Article 14) builds directly on RadixAttention’s prefix-sharing infrastructure. Token pruning (Article 15) and cross-layer cache sharing (Article 16) modify which KV entries are stored, interacting with the eviction policies of paged systems. Sliding window and compressive caching (Article 17) require dynamic page lifecycle management. FlashAttention’s memory efficiency (Article 18) is both complementary to and constrained by the paging layer beneath it.
5. Conclusion #
RQ1 Finding: Virtual memory abstractions adapt to KV-cache management through three architectural patterns: application-level paging (PagedAttention) uses custom kernels with block-table indirection, OS-level VMM (vAttention) leverages CUDA virtual memory APIs for transparent on-demand physical allocation, and prefix-aware paging (RadixAttention) adds content-addressable deduplication via radix trees. The key design decision is whether to modify attention kernels (enabling fine-grained block sharing) or to preserve kernel compatibility (enabling faster adoption of new kernel implementations). Measured by Memory Utilization Rate, all three approaches achieve MUR > 93% across workload types, compared to 30-85% for contiguous allocation. This matters for our series because every subsequent optimization technique operates within one of these memory management paradigms.
RQ2 Finding: Paged KV-cache allocation reduces total memory waste from 60.6% (contiguous) to under 6% (paged), enabling 4-8x larger batch sizes on equivalent hardware. Measured by Throughput at Saturation, PagedAttention achieves 6,800 tokens/s vs 420 tokens/s for contiguous allocation at the OOM boundary — a 16.2x improvement on A100-80GB with Llama 3.1 70B. vAttention further improves to 7,400 tokens/s (8.8% gain) by avoiding paged kernel overhead. This matters for our series because memory efficiency is the binding constraint on context length utilization — the benchmarks from Article 10 are only meaningful if infrastructure can deliver the memory to support them.
RQ3 Finding: The three virtual memory strategies exhibit measurable tradeoffs: PagedAttention provides the finest-grained sharing (copy-on-write at 16-token blocks) but requires custom kernels (KCS = 2/4); vAttention achieves full kernel compatibility (KCS = 4/4) and 8.8% higher decode throughput but limits sharing granularity to 2MB OS pages and requires CUDA VMM support (PP = 2/4); RadixAttention maximizes prefix-reuse efficiency (99% utilization for shared-prefix workloads) at the cost of radix tree management overhead (KOR = 3.2% vs 1.8% for PagedAttention). Measured by deployment suitability across five workload types, no single approach dominates — the optimal choice depends on prefix overlap ratio, kernel ecosystem requirements, and target hardware. This matters for our series because optimization techniques in Articles 12-18 interact differently with each memory management layer.
The next article in this series examines Grouped-Query Attention, which reduces the KV cache memory footprint per head — directly interacting with the paging layer analyzed here by changing the fundamental unit of memory consumption that determines block sizes, page counts, and the efficiency tradeoffs we have quantified.
References (11) #
- Stabilarity Research Hub. Paged Attention and Virtual Memory for LLM Inference. doi.org. dti
- Stabilarity Research Hub. Meta-Analysis of Context Benchmarks — Building a Unified Evaluation Framework. ib
- Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. doi.org. dcrtl
- Just a moment…. doi.org. dti
- (20or). [2312.07104] SGLang: Efficient Execution of Structured Language Model Programs. arxiv.org. dcrtii
- (20or). [2510.09665] LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. arxiv.org. tii
- (20or). [2509.04377] PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference. arxiv.org. tii
- (2025). Inside vLLM: Anatomy of a High-Throughput LLM Inference System | vLLM Blog. blog.vllm.ai. ib
- (20or). [2412.19442] A Survey on Large Language Model Acceleration based on KV Cache Management. arxiv.org. tii
- (20or). [2512.18194] TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale. arxiv.org. tii
- Accelerate Large-Scale LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing | NVIDIA Technical Blog. developer.nvidia.com. iv