Cross-Architecture Memory Comparison — Llama vs Mistral vs Gemma vs Qwen
DOI: 10.5281/zenodo.19183148[1] · View on Zenodo (CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 0% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 79% | ○ | ≥80% from verified, high-quality sources |
| [a] | DOI | 64% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 100% | ✓ | ≥80% have metadata indexed |
| [l] | Academic | 7% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 36% | ○ | ≥80% are freely accessible |
| [r] | References | 14 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,222 | ✓ | Minimum 2,000 words for a full research article. Current: 2,222 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19183148 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 58% | ✗ | ≥80% of references from 2025–2026. Current: 58% |
| [c] | Data Charts | 5 | ✓ | Original data charts from reproducible analysis (min 2). Current: 5 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
The proliferation of open-source large language model families in 2026 — each adopting distinct attention mechanisms and KV-cache configurations — creates a fragmented landscape where memory footprint varies by up to 4.6x across architectures at identical context lengths. This article provides a systematic cross-architecture comparison of KV-cache memory behavior across four dominant model families: Llama (Meta), Mistral (Mistral AI), Gemma (Google), and Qwen (Alibaba). We formulate three research questions addressing raw cache dimensionality differences, the impact of hybrid attention strategies on effective memory utilization, and architecture-specific tolerance to cache quantization. Our analysis, grounded in architecture specifications and recent empirical studies, reveals that Gemma 3’s hybrid sliding-window design achieves the highest memory efficiency at 23,273 tokens per GB of KV-cache — 3.1x more efficient than Llama 3.1 — while Qwen 3’s aggressive 8:1 query-to-KV head ratio delivers the smallest raw cache footprint among standard GQA models. Q4 cache quantization shows architecture-dependent tolerance, with Gemma 3 exhibiting a 0.7% perplexity improvement while Llama 3.1 degrades by 2.8%. These findings establish quantitative selection criteria for deployment-constrained environments and directly inform the optimization techniques examined in subsequent articles of the AI Memory series.
1. Introduction #
In the previous article, we established that KV-cache compression techniques — quantization, eviction, and pruning — achieve varying trade-offs between memory savings and accuracy preservation, with hybrid approaches like KVTC achieving up to 20x compression while maintaining reasoning accuracy (Ivchenko, 2026[2]). However, those benchmarks treated architecture as a controlled variable, evaluating compression techniques within single model families. In practice, the choice of base architecture determines the starting KV-cache footprint before any compression is applied — and this baseline varies dramatically across model families.
The four dominant open-source LLM families as of Q1 2026 — Llama, Mistral, Gemma, and Qwen — have each converged on grouped-query attention (GQA) as the standard replacement for multi-head attention, yet their implementations diverge significantly in the number of KV heads, head dimensions, and the addition of architectural innovations like sliding-window attention and hybrid attention patterns (Chen et al., 2025[3]). These differences are not cosmetic: they determine how much GPU memory is consumed by the KV-cache during inference, how cache size scales with context length, and how amenable each architecture is to post-training compression.
Research Questions #
RQ1: How do raw KV-cache dimensions (number of KV heads, head dimension, layer count) translate into concrete memory footprint differences across Llama, Mistral, Gemma, and Qwen architectures at equivalent model sizes?
RQ2: To what extent do hybrid attention strategies (sliding-window + global attention) reduce effective KV-cache memory compared to uniform global attention, and at what context lengths does this advantage become significant?
RQ3: How does architecture-specific attention design influence tolerance to KV-cache quantization (FP16 to Q4), and which architectural features predict better quantization resilience?
These questions matter for the AI Memory series because understanding the architectural starting point is a prerequisite to selecting appropriate compression and optimization strategies — the subject of the next eleven articles in this series.
2. Existing Approaches (2026 State of the Art) #
2.1 Grouped-Query Attention as the Universal Standard #
By Q1 2026, GQA has become the de facto attention mechanism across all major open-source LLM families. The original GQA proposal demonstrated that sharing key-value heads across multiple query heads preserves model quality while reducing KV-cache size proportionally to the grouping ratio (Ainslie et al., 2023[4]). Current implementations vary in their grouping strategy: Llama and Mistral use 4:1 ratios (8 KV heads for 32 query heads), while Qwen 3 and Gemma 3 adopt more aggressive 8:1 ratios (4 KV heads for 32 query heads), halving KV-cache per layer compared to the 4:1 approach (Qwen Team, 2025[5]).
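The grouping arithmetic can be sketched as follows; the function name and head counts are illustrative, not drawn from any cited codebase:

```python
# KV-cache size under GQA scales with the number of KV heads, not query
# heads, so the grouping ratio directly sets the cache reduction vs MHA.
def kv_cache_fraction(n_query_heads: int, n_kv_heads: int) -> float:
    """KV-cache size relative to full multi-head attention (MHA)."""
    return n_kv_heads / n_query_heads

print(kv_cache_fraction(32, 8))  # 0.25  -> 4:1 GQA (Llama, Mistral)
print(kv_cache_fraction(32, 4))  # 0.125 -> 8:1 GQA (Qwen 3, Gemma 3)
```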
Recent work on cost-optimal GQA configurations demonstrates that commonly used ratios are suboptimal for long-context scenarios. Chen et al. show that jointly optimizing model size and GQA configuration can reduce both memory usage and FLOPs by more than 50% compared to Llama 3’s GQA with no degradation in capabilities, suggesting that the field is still converging on optimal attention resource allocation (Chen et al., 2025[3]).
2.2 Hybrid Attention Architectures #
Gemma 3 introduced a hybrid attention pattern that alternates between sliding-window attention (SWA) layers and global attention layers at a 5:1 ratio — 40 SWA layers with a 1,024-token window and 8 global attention layers in the 12B variant. This design caps KV-cache growth for 83% of layers at the window size regardless of input length, producing sub-linear cache scaling (Gemma Team, 2025[6]). The Modular team categorizes this as part of the “hybrid era” of KV-cache design, where models maintain multiple cache types within a single architecture (Modular, 2026[7]).
Mistral previously pioneered sliding-window attention in Mistral 7B v0.1 but notably removed it in subsequent versions (v0.3, Small 3), reverting to full global attention — suggesting that the engineering complexity and prefix-caching incompatibility of SWA outweighed its memory benefits for their deployment targets (MLX-LM, 2026[8]).
2.3 Multi-Head Latent Attention (MLA) #
DeepSeek V3 introduced an alternative approach — multi-head latent attention — that compresses key-value representations through low-rank projections rather than head sharing. Architecture-aware benchmarking on AMD MI325X GPUs reveals that MLA and GQA require fundamentally different optimization strategies: MLA models cannot use KV-cache offloading and require block size 1, while GQA models benefit from both offloading and larger block sizes (Georgiou et al., 2026[9]). This architectural divergence means deployment infrastructure must be tailored to attention mechanism type.
2.4 Llama 4’s MoE Evolution #
Meta’s Llama 4 Scout maintains GQA with 8 KV heads but shifts to a Mixture-of-Experts architecture with 109B total parameters and 17B active per token. While MoE does not directly reduce KV-cache size (cache depends on attention architecture, not FFN routing), the interplay between expert routing overhead and cache memory creates unique deployment constraints — Scout is 35-40% slower than dense models of similar active parameter count despite identical KV-cache mechanics (Meta, 2026[10]).
```mermaid
flowchart TD
    A[Attention Mechanism Landscape 2026] --> B[GQA Family]
    A --> C[MLA Family]
    B --> D[Standard GQA\nLlama, Mistral, Qwen]
    B --> E[Hybrid GQA + SWA\nGemma 2/3]
    B --> F[GQA + MoE\nLlama 4, Qwen 3 MoE]
    C --> G[Low-Rank KV\nDeepSeek V3/R1]
    D --> H[4:1 ratio\n8 KV heads]
    D --> I[8:1 ratio\n4 KV heads]
    E --> J[Sub-linear\ncache scaling]
    G --> K[Asymmetric\nK/V dims]
```
3. Quality Metrics & Evaluation Framework #
To evaluate our research questions, we define three complementary metrics grounded in recent literature:
KV-Cache Bytes per Token (CBT). The raw memory consumed per cached token, calculated as CBT = 2 × n_kv_heads × head_dim × n_layers × bytes_per_element, where the leading 2 accounts for storing both keys and values. This metric captures architectural decisions independent of input length and serves as the primary comparator for RQ1. We compute CBT in FP16 (2 bytes per element) as the baseline representation.
Memory Efficiency Ratio (MER). Defined as the maximum supported context length divided by the total KV-cache memory at that length (tokens per GB). This composite metric captures how effectively an architecture converts memory investment into context capacity, directly addressing RQ2. For hybrid architectures, MER accounts for sliding-window layers that cap their cache contribution.
Quantization Resilience Score (QRS). The percentage change in perplexity when transitioning from FP16 to Q4 KV-cache quantization. Negative values indicate improvement (better), positive values indicate degradation. This metric addresses RQ3 and draws on empirical measurements from Shkolnikov et al. across three architectures (Shkolnikov et al., 2026[11]).
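As a minimal sketch, the three metrics can be computed as follows; the function names and the example parameters are ours, not taken from the cited studies:

```python
def cache_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_elem=2):
    """CBT: the leading 2 covers both K and V tensors; FP16 = 2 bytes/element."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem

def memory_efficiency_ratio(context_len, total_cache_bytes):
    """MER: tokens of supported context per GB of KV-cache memory."""
    return context_len / (total_cache_bytes / 1e9)

def quantization_resilience_score(ppl_fp16, ppl_q4):
    """QRS: % perplexity change from FP16 to Q4 (negative = improvement)."""
    return 100.0 * (ppl_q4 - ppl_fp16) / ppl_fp16

# An 8-KV-head, 128-dim, 32-layer model caches 128 KB per token in FP16.
print(cache_bytes_per_token(8, 128, 32) // 1024)  # 128
```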
| RQ | Metric | Source | Threshold |
|---|---|---|---|
| RQ1 | Cache Bytes per Token (CBT) | Architecture specifications | Reported in KB |
| RQ2 | Memory Efficiency Ratio (MER) | Computed from specs + context limits | >10K tokens/GB = efficient |
| RQ3 | Quantization Resilience Score (QRS) | Shkolnikov et al., 2026 | <3% degradation = resilient |
```mermaid
graph LR
    RQ1 --> M1[CBT: Cache Bytes/Token] --> E1[Raw footprint\ncomparison]
    RQ2 --> M2[MER: Tokens per GB] --> E2[Effective memory\nutilization]
    RQ3 --> M3[QRS: Perplexity delta] --> E3[Quantization\ntolerance]
    M1 --> M2
    M2 --> M3
```
4. Application to Our Case #
4.1 Raw KV-Cache Footprint Analysis (RQ1) #
We compute CBT for eight model variants spanning the four families. Figure 1 presents both raw and effective (sliding-window-adjusted) cache sizes.

The raw CBT ranges from 56 KB (Qwen 2.5 7B) to 192 KB (Llama 4 Scout and Gemma 3 12B) — a 3.4x spread. This variation stems from three independent architectural parameters:
KV head count is the dominant factor. Qwen 3 and Gemma 3 use only 4 KV heads versus 8 for Llama and Mistral, immediately halving the per-layer cache contribution. However, Gemma 3 compensates with a larger head dimension (256 vs 128), so its per-layer cost actually exceeds Qwen’s despite having the same KV head count.
Layer count amplifies head-level differences across the full model. Gemma 3’s 48 layers versus Qwen 3’s 36 layers means 33% more cache accumulation, explaining why Gemma 3 has higher raw CBT despite fewer KV heads.
Head dimension creates a hidden multiplier. Gemma’s 256-dimensional heads store 2x the information per head compared to the 128-dimensional heads used by Llama, Mistral, and Qwen, partially offsetting its KV-head reduction.
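The three parameters combine multiplicatively, as the following sketch shows; the (kv_heads, head_dim, layers) tuples are taken from the published configs and should be treated as assumptions rather than values verified here:

```python
# (kv_heads, head_dim, layers) per variant; FP16 = 2 bytes per element.
specs = {
    "Qwen 2.5 7B":  (4, 128, 28),  # 8:1 GQA, shallowest stack
    "Qwen 3 8B":    (4, 128, 36),
    "Llama 3.1 8B": (8, 128, 32),
    "Gemma 3 12B":  (4, 256, 48),  # few KV heads, but 256-dim heads
}
for name, (kv_heads, head_dim, layers) in specs.items():
    cbt_kb = 2 * kv_heads * head_dim * layers * 2 // 1024
    print(f"{name}: {cbt_kb} KB/token")
# Qwen 2.5 7B lands at 56 KB and Gemma 3 12B at 192 KB, matching the
# endpoints of the 3.4x spread reported above.
```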

4.2 Hybrid Attention Memory Savings (RQ2) #
The effective KV-cache at 8K context tells a dramatically different story. Gemma 3’s hybrid architecture reduces effective cache by 78.6% compared to its raw footprint because 40 of 48 layers only cache a 1,024-token window regardless of input length.

At short contexts (1K tokens), the sliding-window advantage vanishes — all tokens fit within the window, so hybrid and global architectures consume identical memory. The crossover occurs precisely at the window size (1,024 tokens for Gemma 3). Beyond this threshold, Gemma 3’s cache grows sub-linearly while competitors scale linearly.
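The crossover can be sketched with the Gemma 3 12B layout described earlier (40 SWA layers with a 1,024-token window, 8 global layers, 4 KV heads of dimension 256, FP16); exact published figures depend on implementation details not modeled here:

```python
def effective_cache_bytes(ctx, swa_layers=40, global_layers=8,
                          window=1024, kv_heads=4, head_dim=256):
    """Hybrid cache: SWA layers cap at the window; global layers grow with ctx."""
    per_layer_token = 2 * kv_heads * head_dim * 2  # K+V in FP16
    return (swa_layers * min(ctx, window) + global_layers * ctx) * per_layer_token

# Compare against a hypothetical all-global stack of the same 48 layers.
raw_per_layer_token = 2 * 4 * 256 * 2
for ctx in (1024, 32768, 131072):
    hybrid = effective_cache_bytes(ctx)
    global_only = 48 * ctx * raw_per_layer_token
    print(ctx, round(hybrid / global_only, 2))
```

At the window size the ratio is exactly 1.0; beyond it, only the 8 global layers keep growing, so the ratio falls toward 8/48.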
At 128K context, the Memory Efficiency Ratio reveals the full impact:

Gemma 3 achieves 23,273 tokens per GB — 3.1x more efficient than Llama 3.1 (7,619 tokens/GB) and 3.8x more efficient than Llama 4 Scout (5,079 tokens/GB). Qwen 3’s aggressive 8:1 GQA ratio without sliding window places it at 13,617 tokens/GB — competitive but unable to match hybrid attention’s sub-linear scaling.
This has direct deployment implications. The Sparse Frontier analysis demonstrates that longer sequences tolerate higher attention sparsity, and Gemma’s lower attention-to-total-computation ratio at 64K tokens means sparse prefilling reduces a smaller fraction of total FLOPs compared to Qwen (Nawrot et al., 2026[12]). In other words, Gemma’s architectural efficiency at the cache level comes with a trade-off: less room for additional sparsity-based optimization because the architecture already exploits locality through sliding windows.
4.3 Quantization Tolerance by Architecture (RQ3) #
The Q4 KV-cache quantization study by Shkolnikov et al. provides the first controlled cross-architecture comparison of cache quantization impact. Testing on Gemma 3 12B, Llama 3.1 8B, and DeepSeek-Coder-V2-Lite 16B reveals striking architectural differences:

Gemma 3 shows a -0.7% perplexity change (slight improvement) under Q4 quantization, while Llama 3.1 degrades by +2.8% and DeepSeek by +3.0%. This divergence likely reflects Gemma 3’s architectural characteristics: the sliding-window layers cache only local context, where adjacent token representations exhibit higher redundancy and compress more cleanly under quantization. Global attention layers, which must preserve long-range dependencies in the cache, are more sensitive to precision reduction.
The practical implication is significant: Q4 quantization provides a 4x memory reduction for all architectures, but Gemma 3 achieves this essentially for free while Llama and DeepSeek pay a measurable accuracy cost. Combined with Gemma 3’s already-superior MER, Q4 quantization amplifies the efficiency gap — Gemma 3 at Q4 achieves approximately 93,000 tokens per GB, compared to Llama 3.1 at Q4 achieving approximately 30,500 tokens per GB.
The FlowQKV work on sliding-window-aware cache scheduling further confirms that SWA layers benefit from specialized dataflow optimization, where bandwidth-aware scheduling can overlap data movement with computation across compute tiles, minimizing the memory bottleneck that quantization also addresses (FlowQKV, 2026[13]).
```mermaid
graph TB
    subgraph Architecture_Selection
        A[Deployment Constraint] --> B{Memory Budget}
        B -->|Tight < 8GB| C[Qwen 3 8B\nSmallest raw CBT]
        B -->|Medium 8-24GB| D{Context Need}
        B -->|Large > 24GB| E[Llama 4 Scout\nMax capability]
        D -->|Long > 32K| F[Gemma 3 12B\nBest MER]
        D -->|Short < 8K| G[Any GQA model\nSimilar effective cost]
    end
    subgraph Optimization
        F --> H[Q4 quantization\n-0.7% perplexity]
        C --> I[Q4 quantization\n+2-3% perplexity]
        G --> J[FP16 sufficient\nat short context]
    end
```
5. Conclusion #
RQ1 Finding: Raw KV-cache footprint varies by 3.4x across architectures at equivalent model sizes, driven primarily by KV head count (4 vs 8), head dimension (128 vs 256), and layer depth (28-48). Measured by Cache Bytes per Token (CBT) = 56-192 KB in FP16. This matters for our series because the starting cache footprint determines the compression headroom available to techniques evaluated in Articles 11-18 — architectures with smaller raw CBT (Qwen) have less to gain from compression, while those with larger CBT (Gemma, Llama 4) benefit proportionally more.
RQ2 Finding: Hybrid sliding-window attention reduces effective KV-cache by up to 78.6% compared to global-only GQA at 8K context, and the advantage grows with context length, asymptotically approaching the 83% share of sliding-window layers. Measured by Memory Efficiency Ratio (MER): Gemma 3 = 23,273 tokens/GB vs Llama 3.1 = 7,619 tokens/GB (3.1x advantage). This matters for our series because Article 11 (Paged Attention) and Article 17 (Sliding Window and Compressive Caching) will examine how infrastructure-level optimizations interact with architecture-level memory designs — and our finding that hybrid attention already exploits locality suggests diminishing returns from additional locality-based compression on Gemma-family models.
RQ3 Finding: Q4 KV-cache quantization shows architecture-dependent tolerance, with Gemma 3 exhibiting -0.7% perplexity change (improvement) versus +2.8% for Llama 3.1 and +3.0% for DeepSeek. Measured by Quantization Resilience Score (QRS) at the Q4 compression level. This matters for our series because Article 12 (Grouped-Query Attention design) and Article 16 (Cross-Layer KV-Cache Sharing) will need to account for the fact that architectures with built-in locality (sliding window) are inherently more quantization-friendly, suggesting that attention design and compression strategy should be co-optimized rather than treated independently.
The next article in this series examines Prompt Caching Efficiency — measuring reuse patterns across real workloads and how architectural differences in KV-cache structure affect cache hit rates in production serving systems.
References (13) #
- Stabilarity Research Hub. Cross-Architecture Memory Comparison — Llama vs Mistral vs Gemma vs Qwen. DOI: 10.5281/zenodo.19183148.
- Ivchenko (2026). KV-Cache Compression Benchmarks — Quantization vs Eviction vs Pruning. Stabilarity Research Hub.
- Chen et al. (2025). Cost-Optimal Grouped-Query Attention for Long-Context Modeling. arXiv:2503.09579.
- Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
- Qwen Team (2025). Qwen3 Technical Report. arXiv:2505.09388.
- Gemma Team (2025). Gemma 3 Technical Report. arXiv:2503.19786.
- Modular (2026). The Five Eras of KVCache. modular.com.
- MLX-LM (2026). Prefix cache reuse is broken for all hybrid-architecture models (sliding window, SSM/Mamba). Issue #980, ml-explore/mlx-lm, github.com.
- Georgiou et al. (2026). Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study. arXiv:2603.10031.
- Meta (2026). The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes. arXiv:2601.11659.
- Shkolnikov et al. (2026). Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices. arXiv:2603.04428.
- Nawrot et al. (2025). The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs. arXiv:2504.17768.
- FlowQKV (2026). Mapping Gemma3 onto an Edge Dataflow Architecture. arXiv:2602.06063.