
KV-Cache Fundamentals — How Transformers Remember (and Forget)

Posted on March 19, 2026 by Oleh Ivchenko

AI Memory — Technical Research · Article 1 of 30

Academic Citation: Ivchenko, O. (2026). KV-Cache Fundamentals — How Transformers Remember (and Forget). AI Memory Series. Stabilarity Research Hub, ONPU. DOI: 10.5281/zenodo.19112532
2,794 words · 57% fresh refs · 3 diagrams · 14 references



Abstract

The key-value (KV) cache is the dominant memory structure enabling efficient autoregressive inference in transformer-based large language models (LLMs). While the self-attention mechanism requires quadratic computation over the full sequence during training, the KV-cache reduces each inference step to a linear-time operation by retaining previously computed key and value projections. This article provides a rigorous treatment of KV-cache fundamentals: the mathematical basis for caching in self-attention, memory growth characteristics, cache size calculations for contemporary architectures, and the impact of attention variants — multi-head (MHA), multi-query (MQA), and grouped-query attention (GQA) — on cache footprint. We examine the memory bottleneck that emerges at long context lengths, discuss flash attention’s relationship to caching, and present real-world cache size estimates for models including Llama 3 and GPT-4-class systems. This first article in the AI Memory series establishes the theoretical and practical foundation upon which subsequent articles on cache compression, eviction, and distributed caching will build.

1. Introduction

The transformer architecture has become the de facto backbone of modern language models, from billion-parameter chatbots to multimodal reasoning systems. At the heart of this architecture lies the self-attention mechanism, which allows each token to attend to every other token in the sequence. During training, this produces an O(n²) computation over sequence length n — a cost that is amortized over the full batch. During inference, however, the autoregressive generation pattern introduces a critical optimization opportunity: the key-value cache.

The KV-cache stores previously computed key (K) and value (V) projection matrices so that, when generating token t+1, the model need not recompute attention over all prior tokens from scratch. Instead, it appends new K and V vectors to the cache and computes attention only for the new query against the full cached history. This converts what would be O(n²) per-token computation into O(n) per step, yielding a total of O(n²) over an entire sequence generation rather than O(n³) without caching [1].

Yet this efficiency comes at a steep memory cost. For a 70-billion-parameter model operating at 128K context length, the KV-cache alone can consume over 40 GB of GPU memory — rivaling the model weights themselves [2]. Understanding this trade-off between computational savings and memory pressure is essential for anyone deploying, optimizing, or researching LLM inference systems.

flowchart TD
    A[Input Token x_t] --> B[Linear Projections]
    B --> Q[Query Q_t]
    B --> K[Key K_t]
    B --> V[Value V_t]
    K --> Cache[KV-Cache: append K_t]
    V --> Cache
    Cache --> |All cached K₁..ₜ| Attn[Scaled Dot-Product Attention]
    Cache --> |All cached V₁..ₜ| Attn
    Q --> Attn
    Attn --> O[Output o_t]
    O --> Next[Next Token Prediction]

    style Cache fill:#f9d71c,stroke:#333,stroke-width:2px
    style Attn fill:#87ceeb,stroke:#333,stroke-width:2px

Figure 1. Autoregressive inference with KV-cache. At each generation step, only the new token’s key and value are computed and appended to the cache, while the full cached history participates in the attention computation.

2. Self-Attention and the Case for Caching

2.1 The Attention Mechanism

The standard scaled dot-product attention is defined as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q ∈ ℝ^(n×d_k), K ∈ ℝ^(n×d_k), and V ∈ ℝ^(n×d_v) are the query, key, and value matrices respectively, and d_k is the key dimension used for scaling. In multi-head attention (MHA), the input is projected h times into different subspaces, attention is computed independently in each head, and results are concatenated and projected back.

During training on a full sequence of length n, all n tokens are processed simultaneously. The attention score matrix QKᵀ has dimensions n × n, requiring O(n²·d_k) computation and O(n²) memory for the attention weights. This cost is unavoidable for learning dependencies across the full sequence.

2.2 Autoregressive Generation Without Cache

During autoregressive inference, the model generates one token at a time. Without caching, generating the t-th token requires recomputing the key and value projections for all t tokens seen so far from their embeddings and re-evaluating attention over them. Summed over the generation of n tokens, this redundant work drives the total cost to O(n³) in sequence length: each forward pass through L layers, each with h heads, recomputes everything from scratch.

2.3 The KV-Cache Optimization

The fundamental insight is that key and value projections for token i do not change when generating token j > i (in decoder-only architectures with causal masking). Therefore, we can cache K_i and V_i after they are first computed and reuse them for all subsequent generation steps.

With the KV-cache, generating token t+1 requires: (1) computing Q_{t+1}, K_{t+1}, V_{t+1} from the new token embedding at O(d_model²) per layer, independent of t; (2) appending K_{t+1} and V_{t+1} to the cache; and (3) computing attention for Q_{t+1} against all cached K_{1..t+1} at O(t · d_k) per head per layer. The per-step cost is now O(t) rather than requiring full recomputation, and the overall generation cost remains O(n²) but with a much smaller constant factor [3].
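For illustration, the per-step procedure above can be sketched for a single attention head. This is a minimal numpy sketch under simplifying assumptions: `attend_step` is a hypothetical name, the cache is a plain Python list rather than the preallocated per-layer, per-head tensors real systems use.

```python
import numpy as np

def attend_step(q, k_new, v_new, k_cache, v_cache):
    """One decode step for a single head: append the new token's K and V to the
    cache, then attend the lone query against the full cached history.
    Per-step cost is O(t * d_k) instead of recomputing all projections."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)                     # (t, d_k) cached keys
    V = np.stack(v_cache)                     # (t, d_v) cached values
    scores = K @ q / np.sqrt(q.shape[-1])     # (t,) scaled dot products
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over cached positions
    return w @ V                              # (d_v,) attention output
```

Each call performs only the new token's projections and one pass over the cache; entries appended at earlier steps are never recomputed.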

graph LR
    subgraph Without_KV_Cache[Without KV-Cache]
        A1[Step 1: Compute K,V for token 1] --> A2[Step 2: Recompute K,V for tokens 1-2]
        A2 --> A3[Step 3: Recompute K,V for tokens 1-3]
        A3 --> A4["Step n: Recompute K,V for tokens 1..n ✗"]
    end

    subgraph With_KV_Cache[With KV-Cache]
        B1["Step 1: Compute & cache K₁,V₁"] --> B2["Step 2: Compute & cache K₂,V₂"]
        B2 --> B3["Step 3: Compute & cache K₃,V₃"]
        B3 --> B4["Step n: Compute & cache Kₙ,Vₙ ✓"]
    end

    style A4 fill:#ff6b6b,stroke:#333
    style B4 fill:#51cf66,stroke:#333

Figure 2. Computational comparison: without caching, each step redundantly recomputes all previous projections; with caching, each step only computes the new token’s projections and reuses cached values.

3. Cache Size Calculations

3.1 The Formula

The KV-cache size for a single sequence can be precisely calculated as:

Cache Size = 2 × L × n_kv × d_head × s × b

where 2 accounts for both key and value tensors; L is the number of transformer layers; n_kv is the number of key-value heads (equal to h for MHA, 1 for MQA, and h/g for GQA with g groups); d_head is the per-head dimension (typically d_model / h); s is the sequence length (context window); and b is the bytes per element (2 for FP16/BF16, 1 for INT8).

3.2 Real-World Examples #

| Model | Layers | KV Heads | d_head | Cache @ 8K (FP16) | Cache @ 128K (FP16) |
| --- | --- | --- | --- | --- | --- |
| Llama 3 8B (GQA-8) | 32 | 8 | 128 | 1.07 GB | 17.2 GB |
| Llama 3 70B (GQA-8) | 80 | 8 | 128 | 2.68 GB | 42.9 GB |
| GPT-4 class (est. MHA-96) | ~120 | 96 | 128 | ~48.3 GB | ~770 GB |

These calculations reveal a fundamental tension: long contexts are desirable for quality but impose severe memory costs. A single 128K-context request on Llama 3 70B consumes more KV-cache memory than the model weights stored in INT4 quantization [4].
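The table's values follow directly from the formula in §3.1 and are easy to check numerically. A minimal sketch (the function name is illustrative; decimal gigabytes, as in the table):

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, bytes_per_elem=2):
    """Cache Size = 2 (K and V) x L x n_kv x d_head x s x b, for one sequence."""
    return 2 * layers * kv_heads * d_head * seq_len * bytes_per_elem

GB = 1e9  # decimal gigabytes
print(round(kv_cache_bytes(32, 8, 128, 8_192) / GB, 2))    # Llama 3 8B @ 8K  -> 1.07
print(round(kv_cache_bytes(80, 8, 128, 131_072) / GB, 1))  # Llama 3 70B @ 128K -> 42.9
```

The same function scales linearly in each argument, which is why doubling context length or batch size doubles cache memory.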

4. The Memory Bottleneck at Long Contexts

4.1 Linear Growth, Quadratic Consequences

While the KV-cache grows linearly with sequence length — O(n) memory — the practical consequences are multiplicative in the system context. Each concurrent request maintains its own cache, meaning a serving system handling B concurrent requests requires B × CacheSize memory. For a 70B-parameter model serving 32 concurrent 128K-context requests, cache memory alone would require 32 × 42.9 GB ≈ 1.37 TB — far exceeding any single GPU’s capacity [5].

This memory pressure has driven multiple lines of research: (1) cache compression via quantization, reducing from FP16 to INT4 or lower [6]; (2) cache eviction strategies that selectively discard less-important cached entries [7]; (3) architectural innovations like GQA that structurally reduce cache size [8]; (4) paged memory management that enables efficient sharing and scheduling [9]; and (5) hardware disaggregation that exploits CXL interconnects for expanded memory pools [10].

4.2 Memory vs. Compute Bound

During the autoregressive decode phase, the operation is typically memory-bandwidth bound rather than compute bound. Each generated token requires reading the entire KV-cache from GPU memory to compute attention, but performs relatively little arithmetic. The arithmetic intensity (FLOPs per byte transferred) is low, meaning GPU compute units sit idle waiting for memory transfers [11].
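A back-of-envelope calculation makes the bandwidth bound concrete. This is a rough sketch under stated assumptions: ~4 FLOPs per cached element (multiply–add for the q·K score plus multiply–add for the weights·V sum) and approximate A100-class figures of ~312 TFLOPS (BF16) and ~2 TB/s of HBM bandwidth.

```python
def decode_attention_intensity(bytes_per_elem=2):
    """Arithmetic intensity (FLOPs per byte) of one decode step's attention:
    per cached element, ~2 FLOPs for the q.K score and ~2 FLOPs for the
    weights.V sum, against reading that K element and V element once each."""
    flops_per_elem = 4
    bytes_moved_per_elem = 2 * bytes_per_elem   # one K and one V element
    return flops_per_elem / bytes_moved_per_elem

intensity = decode_attention_intensity()   # ~1 FLOP/byte at FP16
machine_balance = 312e12 / 2e12            # ~156 FLOPs/byte for an A100-class GPU
print(intensity, machine_balance)
```

At roughly 1 FLOP per byte against a machine balance near 156, decode-phase attention sits far below the compute-bound threshold: the GPU spends most of each step waiting on cache reads.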

This characteristic has profound implications for system design. Techniques like speculative decoding attempt to generate multiple tokens in parallel to improve compute utilization, but they further increase KV-cache pressure since speculative branches require their own cached states [12].

5. Attention Variants and Cache Impact

5.1 Multi-Head Attention (MHA)

Standard MHA maintains separate K and V projections for each attention head. With h heads, the cache stores 2 × L × h × d_head × s elements. This is the maximum cache footprint for a given architecture.

5.2 Multi-Query Attention (MQA) #

MQA shares a single set of K and V projections across all attention heads, while maintaining separate Q projections. This reduces the KV-cache by a factor of h (the number of heads). For a model with 64 heads, this represents a 64× reduction in cache size — a dramatic saving that enables significantly higher throughput [13].

However, MQA can degrade model quality because all heads attend to identical key-value representations, limiting the model’s capacity to capture diverse attention patterns.

5.3 Grouped-Query Attention (GQA) #

GQA provides a middle ground by grouping heads into g groups, where each group shares one set of K and V projections. The cache reduction factor is h/g compared to MHA. Llama 3 uses GQA with 8 KV heads across its models, yielding a 4× cache reduction for the 8B model (32 query heads) and an 8× reduction for the 70B (64 query heads) versus MHA [8].

Recent work has further optimized GQA through graph-based query clustering, dynamically assigning queries to groups based on similarity rather than using fixed groupings. This achieves better quality-efficiency trade-offs than static GQA configurations [8].

graph TB
    subgraph MHA[Multi-Head Attention - MHA]
        MHA_Q1[Q₁] --> MHA_KV1[K₁,V₁]
        MHA_Q2[Q₂] --> MHA_KV2[K₂,V₂]
        MHA_Q3[Q₃] --> MHA_KV3[K₃,V₃]
        MHA_Q4[Q₄] --> MHA_KV4[K₄,V₄]
    end

    subgraph MQA[Multi-Query Attention - MQA]
        MQA_Q1[Q₁] --> MQA_KV[K,V shared]
        MQA_Q2[Q₂] --> MQA_KV
        MQA_Q3[Q₃] --> MQA_KV
        MQA_Q4[Q₄] --> MQA_KV
    end

    subgraph GQA[Grouped-Query Attention - GQA]
        GQA_Q1[Q₁] --> GQA_KV1[K₁,V₁ Group A]
        GQA_Q2[Q₂] --> GQA_KV1
        GQA_Q3[Q₃] --> GQA_KV2[K₂,V₂ Group B]
        GQA_Q4[Q₄] --> GQA_KV2
    end

    style MHA_KV1 fill:#ff6b6b
    style MHA_KV2 fill:#ff6b6b
    style MHA_KV3 fill:#ff6b6b
    style MHA_KV4 fill:#ff6b6b
    style MQA_KV fill:#51cf66
    style GQA_KV1 fill:#ffd43b
    style GQA_KV2 fill:#ffd43b

Figure 3. Comparison of attention variants and their KV-cache requirements. MHA (red) maintains separate K,V per head — maximum cache. MQA (green) shares one K,V across all heads — minimum cache. GQA (yellow) groups heads — balanced trade-off.
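The three variants differ only in n_kv in the size formula of §3.1, so their cache footprints can be compared directly. A minimal sketch for a hypothetical 80-layer, 64-query-head model at 128K context in FP16 (the helper name and configuration are illustrative):

```python
def cache_gb(layers, n_kv, d_head=128, seq_len=131_072, bytes_per_elem=2):
    """2 x L x n_kv x d_head x s x b, in decimal GB."""
    return 2 * layers * n_kv * d_head * seq_len * bytes_per_elem / 1e9

h = 64  # query heads
for name, n_kv in [("MHA", h), ("GQA-8", 8), ("MQA", 1)]:
    # MHA caches one K,V set per head; GQA one per group; MQA a single shared set
    print(f"{name}: n_kv={n_kv}, cache = {cache_gb(80, n_kv):.1f} GB")
```

The spread (roughly 343 GB for MHA, 42.9 GB for GQA-8, 5.4 GB for MQA under these assumptions) shows why n_kv is the single most effective architectural lever on cache size.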

6. Flash Attention and Its Relationship to Caching

Flash attention is an I/O-aware attention algorithm that reduces memory reads and writes by tiling the attention computation to exploit GPU SRAM (on-chip memory) rather than HBM (off-chip high-bandwidth memory). Rather than materializing the full n × n attention matrix in HBM, flash attention computes attention in blocks, keeping intermediate results in fast SRAM [3].

It is critical to distinguish flash attention’s role from the KV-cache:

  • Flash attention optimizes the computation of attention during a single forward pass — it addresses the O(n²) memory required for the attention score matrix
  • The KV-cache optimizes across multiple forward passes during autoregressive generation — it addresses redundant recomputation of key and value projections

These optimizations are complementary. Flash attention reduces the memory and time cost of computing attention for a given sequence length, while the KV-cache eliminates redundant computation across generation steps. Modern inference systems employ both simultaneously: flash attention computes the attention operation efficiently within each step, and the KV-cache ensures previous key-value projections are reused across steps.
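The block-wise trick at the core of flash attention is the online softmax: attention can be accumulated block by block using only running max and sum statistics, never holding all scores at once. A minimal single-query numpy sketch of the principle (an illustration, not the fused GPU kernel; the function name is ours):

```python
import numpy as np

def streaming_attention(q, K, V, block=64):
    """Single-query attention over K,V processed in blocks with running
    (max, denominator, accumulator) statistics -- numerically equal to
    softmax(K @ q / sqrt(d_k)) @ V without materializing all scores."""
    m, s = -np.inf, 0.0                   # running max and softmax denominator
    acc = np.zeros(V.shape[-1])           # running weighted-value accumulator
    scale = 1.0 / np.sqrt(q.shape[-1])
    for i in range(0, len(K), block):
        scores = K[i:i + block] @ q * scale
        m_new = max(m, float(scores.max()))
        c = np.exp(m - m_new)             # rescale previous statistics
        w = np.exp(scores - m_new)
        s = s * c + w.sum()
        acc = acc * c + w @ V[i:i + block]
        m = m_new
    return acc / s
```

Only one block of scores exists at any time; in the real kernel that block lives in SRAM, which is exactly the I/O saving described above.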

7. Emerging Approaches to Cache Management

The memory pressure of KV-caching has sparked a rich landscape of optimization techniques that subsequent articles in this series will explore in depth:

Cache Quantization: Reducing the precision of cached keys and values from FP16 to INT8, INT4, or even lower bit-widths. Dynamic quantization frameworks adapt precision based on token importance, achieving compression ratios of 4–8× with minimal quality degradation [4][6].
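As an illustration of the simplest point on this spectrum, symmetric per-token INT8 quantization of a cached tensor can be sketched as follows (a toy sketch with hypothetical function names; the cited frameworks add importance-aware precision selection on top of this basic scheme):

```python
import numpy as np

def quantize_kv(x):
    """Symmetric per-token INT8 quantization of a cached K or V tensor (t, d):
    each token row gets its own scale, halving memory versus FP16."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)            # guard all-zero rows
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    """Reconstruct an approximate FP32 tensor for use in attention."""
    return q.astype(np.float32) * scale
```

Per-token scales keep the rounding error of each row bounded by half its own scale, which is why outlier tokens do not corrupt the rest of the cache.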

Cache Eviction and Compression: Rather than storing all tokens, selective eviction discards less-important entries based on attention scores or other importance metrics. Research has shown that value projections, not just attention scores, carry critical importance signals for determining which cache entries to retain [7].

Paged Attention: Borrowed from operating system virtual memory concepts, paged attention manages KV-cache memory in fixed-size blocks, enabling efficient allocation, deallocation, and sharing across requests. The vLLM system demonstrated that paged attention eliminates memory fragmentation and enables cache sharing across requests with common prefixes [9].
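In the spirit of that design, block-based allocation can be sketched with a free list and per-sequence block tables. This is a toy sketch (class and method names are ours); vLLM's real allocator additionally handles prefix sharing, copy-on-write, and GPU-resident block tables.

```python
class PagedKVAllocator:
    """Toy paged KV-cache allocator: physical blocks of `block_size` token
    slots are handed out on demand and mapped per sequence, so memory is
    reserved page by page rather than as one worst-case contiguous region."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block free list
        self.block_table = {}                 # seq_id -> [physical block ids]
        self.length = {}                      # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve one token slot, allocating a new block on a page boundary."""
        n = self.length.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or none yet)
            if not self.free:
                raise MemoryError("KV-cache pool exhausted")
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free.extend(self.block_table.pop(seq_id, []))
        self.length.pop(seq_id, None)
```

Because unused slots exist only in a sequence's final partially filled block, internal fragmentation is bounded by one block per sequence instead of by the full context window.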

Sparse and Structured Attention: Computing attention over only a subset of cached positions reduces both computation and effective cache utilization. Multi-context sparse attention operates across shared KV-caches, enabling efficient serving of batched requests that share context [7].

Hardware Disaggregation: CXL-based disaggregated memory systems enable KV-caches to reside in a separate memory pool accessed over high-bandwidth interconnects, decoupling cache capacity from local GPU memory [10].

8. Discussion

The KV-cache represents a fundamental trade-off in transformer inference: exchanging memory for computation. As context windows have expanded from 2K tokens in early GPT models to 128K+ in modern systems, this trade-off has become increasingly acute. The cache that was once a minor bookkeeping structure now dominates GPU memory allocation during inference.

Several trends will shape the future of KV-caching: (1) Context lengths will continue to grow, as applications in code generation, document analysis, and multi-turn dialogue demand ever-longer contexts. (2) Architectural innovations such as multi-head latent attention mechanisms will reduce cache footprint without the quality trade-offs of MQA [5]. (3) System-level optimization including paged attention, prefix caching, and cross-request sharing will move from research to standard infrastructure [9]. (4) The combination of GQA with INT4 quantization yields multiplicative savings: a 70B model’s 128K cache drops from roughly 343 GB (MHA, FP16) to 42.9 GB with GQA-8, and to approximately 10.7 GB with INT4 quantization on top [4][6].

9. Conclusion

The KV-cache is not merely an optimization — it is a foundational mechanism that makes transformer-based autoregressive generation practical at scale. Without it, the computational cost of generating even short sequences from billion-parameter models would be prohibitive. Yet as models grow larger and contexts longer, the memory cost of maintaining these caches has become a primary bottleneck in LLM deployment.

Understanding KV-cache fundamentals — how caches grow, what determines their size, how attention variants affect footprint, and where flash attention fits in — is prerequisite knowledge for the increasingly sophisticated optimization techniques being developed by the research community. The subsequent articles in this series will build on this foundation to explore cache compression, eviction policies, distributed caching architectures, and the system-level innovations that make modern long-context LLM serving possible.

References

[1] T. Sandholm, L. Cheng, and B. A. Huberman, “SkyMemory: A LEO Edge Cache for Transformer Inference Optimization and Scale Out,” in 2026 IEEE 23rd Consumer Communications & Networking Conference (CCNC), IEEE, 2026. DOI: 10.1109/ccnc65079.2026.11366312

[2] K. Chen, Z. Zhou, and Y. Chen, “Area- and Utilization-Efficient LLM Accelerator With Fused Speculative Decoding for Edge-Side Inference,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2026. DOI: 10.1109/tvlsi.2026.3659893

[3] Z. Li, L. Wu, and Y. Yang, “A 16×16 High-Utilization Systolic Array Hardware Accelerator for Long-Sequence Flash-Attention Computation in Transformer,” in 2025 IEEE 16th International Conference on ASIC (ASICON), IEEE, 2025. DOI: 10.1109/asicon66040.2025.11326298

[4] Y. Wu, R. Lin, and J. Que, “KVC-Q: A high-fidelity and dynamic KV Cache quantization framework for long-context large language models,” Journal of Systems Architecture, Elsevier, 2026. DOI: 10.1016/j.sysarc.2026.103699

[5] N. Kim, J. Lee, and S. Lee, “MDLA: Multi-Head Latent Linear Attention for Low-Complexity and Memory-Efficient KV Cache,” in 2026 International Conference on Electronics, Information, and Communication (ICEIC), IEEE, 2026. DOI: 10.1109/iceic69189.2026.11386168

[6] J. Fan and C.-M. Chen, “Efficient Multimodal Large Language Model via Dynamic KV Cache Quantization,” in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2026. DOI: 10.1609/aaai.v40i25.39241

[7] Z. Cao, Q. Si, and J. Zhang, “Sparse Attention Across Multiple-Context KV Cache,” in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2026. DOI: 10.1609/aaai.v40i36.40266

[8] L. Zheng, Y. Zhang, and L. Shen, “Grouped query attention supported with graph-based query clustering,” Knowledge-Based Systems, Elsevier, 2026. DOI: 10.1016/j.knosys.2026.115311

[9] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient Memory Management for Large Language Model Serving with PagedAttention,” in Proceedings of the 29th Symposium on Operating Systems Principles, ACM, 2023. DOI: 10.1145/3600006.3613165

[10] D. Liu and Y. Yu, “CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving,” in Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ACM, 2026. DOI: 10.1145/3748173.3779188

[11] M. Michalec, S. Tannu, and G. Sohi, “Reducing LLM Inference Memory Bandwidth Via Frequent Exponent Value Encoding,” IEEE Computer Architecture Letters, IEEE, 2026. DOI: 10.1109/lca.2026.3671166

[12] L. Shi, Z. Li, and L. Zhang, “Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios,” in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2026. DOI: 10.1609/aaai.v40i39.40576

[13] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai, “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, ACL, 2023. DOI: 10.18653/v1/2023.emnlp-main.298

