Distributed KV-Cache in Multi-GPU Serving
DOI: 10.5281/zenodo.19310103[1] · View on Zenodo (CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 65% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 65% | ○ | ≥80% from verified, high-quality sources |
| [a] | DOI | 82% | ✓ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 65% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 65% | ○ | ≥80% have metadata indexed |
| [l] | Academic | 47% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 47% | ○ | ≥80% are freely accessible |
| [r] | References | 17 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,267 | ✓ | Minimum 2,000 words for a full research article. Current: 2,267 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19310103 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 86% | ✓ | ≥80% of references from 2025–2026. Current: 86% |
| [c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2). Current: 4 |
| [g] | Code | ✓ | ✓ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
As large language models scale beyond the memory capacity of individual accelerators, distributing inference across multiple GPUs introduces fundamental challenges for key-value cache management. This article examines how tensor parallelism, pipeline parallelism, and emerging hybrid strategies partition KV-cache state across devices, analyzing the communication overhead, memory efficiency, and throughput implications of each approach. We evaluate three research questions concerning KV-cache sharding under tensor parallelism, communication bottlenecks in multi-node configurations, and the emerging paradigm of KV parallelism as a dedicated distribution dimension. Drawing on 2025-2026 systems research from NVIDIA Dynamo, Mooncake, LMCache, and Helix Parallelism, we quantify the trade-offs that production serving systems face when distributing caches across 2 to 72 GPUs. Our analysis demonstrates that hybrid parallelism strategies combining tensor and KV parallelism achieve up to 1.86x higher throughput than pure tensor parallelism on long-context workloads while maintaining sub-30-microsecond communication latency on NVLink interconnects.
1. Introduction #
In the previous article, we examined how Flash Attention restructures memory access patterns to reduce the I/O complexity of attention computation ([hub][2]). While Flash Attention optimizes single-GPU memory efficiency, it does not address the fundamental capacity limitation: when a 70-billion-parameter model requires 320 KB of KV-cache per token across 80 layers, a 128K-context window demands approximately 40 GB of cache alone, exceeding the practical working memory of a single accelerator once model weights and activations are accounted for.
Distributed inference has become the default deployment mode for production LLM serving. Systems such as vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo all employ multi-GPU parallelism strategies that fundamentally reshape how KV-caches are stored, accessed, and communicated. The choice of parallelism strategy directly determines cache memory per device, inter-device communication volume, and ultimately the latency and throughput of every generated token.
Research Questions #
RQ1: How does tensor parallelism partition KV-cache state across GPUs, and what are the memory scaling characteristics as the parallelism degree increases?
RQ2: What communication overhead does distributed KV-cache introduce during decode steps, and how do interconnect technologies (NVLink, PCIe, Ethernet) bound achievable performance?
RQ3: Can dedicated KV parallelism dimensions, such as Helix Parallelism’s sequence-sharded approach, overcome the fundamental limitations of tensor-parallel KV distribution for long-context workloads?
These questions matter because the AI Memory series is building toward a comprehensive understanding of how cache systems determine the practical capability envelope of LLMs. Prior articles addressed compression, eviction, and single-GPU optimizations; distributed serving represents the production reality where these techniques must operate.
2. Existing Approaches (2026 State of the Art) #
Three principal strategies currently govern how KV-caches distribute across GPUs in production systems.
Tensor Parallelism (TP) shards each transformer layer’s weight matrices across GPUs, with each device computing a subset of attention heads. Under grouped-query attention (GQA), the KV heads are divided among devices: with Llama-3-70B’s 8 KV heads and TP=8, each GPU holds exactly one KV head’s cache. This approach is the default in vLLM and SGLang for intra-node deployment ([1][3]). The SPECTRA system further optimizes internal parallelism within this framework, demonstrating that careful scheduling of attention and FFN operations can reduce idle time by up to 25% on TP-sharded deployments ([2][4]).
Pipeline Parallelism (PP) assigns different transformer layers to different GPUs, creating a sequential pipeline. Each GPU stores the full KV-cache for its assigned layers, avoiding cache communication during attention but introducing pipeline bubbles and requiring micro-batching. The RingX system demonstrates scalable parallel attention for long contexts on HPC clusters using ring-based communication patterns that overlap KV-cache transfer with computation ([3][5]).
Hybrid TP+PP combines both strategies: tensor parallelism within a node (typically TP=4 or TP=8 over NVLink) and pipeline parallelism across nodes. This is the standard multi-node configuration in production. NVIDIA Dynamo implements this as its default serving strategy, with a KV-cache manager that coordinates cache state across the hybrid topology.
Emerging: KV Parallelism (KVP) is a new dimension introduced by Helix Parallelism that shards the KV-cache along the sequence dimension rather than the head dimension. Unlike TP, which replicates KV-cache when TP exceeds the number of KV heads, KVP distributes the sequence across GPUs so each device stores a fraction of the total context. This approach addresses a specific failure mode of tensor parallelism on models with few KV heads, such as GQA architectures where Llama-3-70B has only 8 KV heads but may be served on 72 GPUs.
flowchart TD
A[Parallelism Strategies] --> B[Tensor Parallelism]
A --> C[Pipeline Parallelism]
A --> D[Hybrid TP+PP]
A --> E[KV Parallelism]
B --> B1[Shard KV by heads]
B --> B2[All-Reduce per layer]
C --> C1[Full KV per layer subset]
C --> C2[Pipeline bubbles]
D --> D1[TP intra-node + PP inter-node]
D --> D2[Standard production config]
E --> E1[Shard KV by sequence]
E --> E2[All-to-All communication]
The DynamicAttention system represents a further evolution, implementing dynamic KV-cache management specifically designed for disaggregated inference architectures where prefill and decode run on separate GPU pools ([4][6]). Mooncake extends this disaggregation pattern to production scale, serving as the inference backbone for Moonshot AI’s Kimi chatbot with a KV-cache-centric architecture that separates not only prefill and decode but also distributes cache state across CPU, DRAM, SSD, and NIC resources ([5][7]).
For external KV-cache storage and sharing across engines, LMCache provides the first dedicated caching layer that sits between inference engines and heterogeneous storage tiers. It supports GPU-to-CPU offloading, cross-engine cache sharing, and hierarchical storage with up to 500 GB of CPU-side cache capacity.
3. Quality Metrics and Evaluation Framework #
To evaluate distributed KV-cache strategies, we define three metrics aligned with our research questions.
| RQ | Metric | Source | Threshold |
|---|---|---|---|
| RQ1 | KV Cache Memory per GPU (GB) | Calculated from model architecture | Less than 80 GB (A100 limit) |
| RQ2 | All-Reduce Latency per Layer (us) | NCCL benchmarks, published measurements | Less than 50 us (NVLink target) |
| RQ3 | Throughput Scaling Efficiency (%) | Tokens/s/GPU vs. single-GPU baseline | Greater than 70% at TP=8 |
KV Cache Memory per GPU measures the raw storage requirement for cache state on each device. For a model with L layers, K KV-heads, D head dimension, and FP16 precision, the total KV-cache for sequence length S is: 2 x L x K x D x 2 x S bytes. Under TP=T, each GPU stores 1/T of this if T is less than or equal to K; otherwise, GPUs must replicate KV heads ([6][8]).
All-Reduce Latency captures the communication overhead per transformer layer during decode. Each TP step requires an all-reduce of the attention output (hidden_size x 2 bytes) and FFN output, with latency determined by message size, interconnect bandwidth, and ring/tree topology. The SlimInfer system demonstrates that dynamic token pruning can reduce effective communication volume by eliminating tokens before the all-reduce step ([7][9]).
Throughput Scaling Efficiency measures how close the system approaches linear speedup. Ideal scaling gives N times the single-GPU throughput on N GPUs; real systems achieve less due to communication overhead, synchronization barriers, and memory bandwidth contention.
graph LR
RQ1[RQ1: Memory] --> M1[KV GB per GPU]
M1 --> E1[Must fit in HBM]
RQ2[RQ2: Communication] --> M2[All-Reduce Latency]
M2 --> E2[Interconnect bound]
RQ3[RQ3: KV Parallelism] --> M3[Scaling Efficiency]
M3 --> E3[vs Linear Ideal]
The layer-wise analysis framework from LAVa provides additional granularity, demonstrating that not all transformer layers require equal cache capacity and that dynamic budget allocation across layers can reduce total distributed cache size by 30-40% without quality degradation ([8][10]). The PagedEviction system extends this insight with structured block-wise pruning that is compatible with distributed PagedAttention layouts ([9][11]).
4. Application to Our Case #
RQ1: Memory Scaling Under Tensor Parallelism #
For Llama-3-70B with 80 layers, 8 GQA KV-heads, and 128-dimensional heads in FP16, the per-token KV-cache is exactly 320 KB. The following chart shows how tensor parallelism reduces per-GPU cache as the TP degree increases:

At 128K context with no parallelism, the KV-cache alone requires approximately 40 GB. With TP=8, this drops to 5 GB per GPU, well within the A100’s 80 GB HBM capacity even when accounting for model weights (approximately 35 GB in FP16 for the model shard). However, when TP exceeds the number of KV heads (8 in this case), additional GPUs cannot further shard the KV-cache by heads. With TP=16 on a GQA model with 8 KV heads, each pair of GPUs would share one KV head, requiring replication rather than sharding, which eliminates the memory benefit of additional parallelism.
This limitation is precisely what drives the development of alternative distribution strategies. The Lethe system demonstrates layer-and-time-adaptive pruning that can reduce the effective cache size by 2-3x before distribution, effectively extending the useful range of tensor parallelism ([10][12]).
RQ2: Communication Overhead Analysis #
Every decode step under tensor parallelism requires two all-reduce operations per transformer layer: one after the attention projection and one after the FFN. The latency of these operations depends critically on the interconnect:

On NVLink (900 GB/s bidirectional on H100), the all-reduce latency remains below 30 microseconds even at TP=8, adding less than 2.4 milliseconds of overhead across 80 layers. On PCIe Gen5 (64 GB/s), this overhead grows to approximately 25 milliseconds, and on 400G Ethernet for cross-node TP, the overhead reaches 112 milliseconds per decode step, making cross-node tensor parallelism impractical for latency-sensitive serving.
This communication profile explains why production systems universally use NVLink for tensor parallelism and reserve cross-node communication for pipeline parallelism, where only activations (not all-reduces) cross the node boundary. The FIER system addresses this by implementing fine-grained KV-cache retrieval that minimizes the data volume transferred during distributed attention operations ([11][13]).
The throughput scaling chart illustrates the compound effect of communication overhead on achievable performance:

Tensor parallelism achieves approximately 86% scaling efficiency at TP=8 (intra-node) but drops to 46% at 32 GPUs when crossing node boundaries. Pipeline parallelism shows consistently lower but more predictable scaling. The hybrid TP+PP approach achieves the best scaling at 32 GPUs (58% efficiency) by keeping TP within NVLink domains.
RQ3: KV Parallelism as a New Dimension #
Helix Parallelism introduces a fundamentally different approach: rather than distributing KV-cache by attention heads (TP) or by layers (PP), it distributes by sequence position. In a KVP=4 configuration, each GPU stores one quarter of the sequence’s KV-cache for all heads. During attention, each GPU computes partial attention over its sequence shard, followed by an all-to-all communication to combine results.
The critical insight is that KVP scales independently of the number of KV heads. For a model with 8 KV heads, TP is limited to 8-way sharding before cache replication begins. KVP faces no such limit: with a 1-million-token context, KVP=72 distributes approximately 14K tokens per GPU, regardless of head count. On NVIDIA’s GB200 NVL72 platform, Helix achieves 1.86x higher throughput than pure TP for Llama-3-405B at 1M context length, with the performance gap widening as context grows.
The KV transfer latency for disaggregated architectures further motivates distributed caching strategies:

At 128K context, transferring the full Llama-3-70B KV-cache (40 GB) takes approximately 800 milliseconds over RDMA at 400 Gb/s, but only 44 milliseconds over NVLink. This latency budget directly determines the feasibility of disaggregated prefill-decode architectures and motivates systems like Mooncake’s hierarchical cache, where frequently accessed cache blocks are kept in GPU memory or NVLink-connected pools while cold cache migrates to DRAM or SSD ([5][7]).
The I/O-aware caching approach from recent work on resource-constrained settings demonstrates that intelligent placement of KV-cache across the memory hierarchy (HBM, DRAM, SSD) can extend effective cache capacity by 8-16x with less than 5% latency overhead when combined with speculative prefetching ([12][14]). The KVPR system specifically optimizes the I/O patterns for KV-cache retrieval in multi-tier storage configurations, achieving 2.3x throughput improvement on CPU-offloaded inference ([13][15]).
graph TB
subgraph Distributed_KV_Cache
direction TB
A[Input Sequence] --> B{Parallelism Strategy}
B --> C[TP: Shard by Head]
B --> D[PP: Shard by Layer]
B --> E[KVP: Shard by Sequence]
C --> F[All-Reduce per Layer]
D --> G[Activation Transfer]
E --> H[All-to-All Attention]
F --> I[Combined Output]
G --> I
H --> I
end
Repository: The analysis code and data for this article are available at github.com/stabilarity/hub.
5. Conclusion #
RQ1 Finding: Tensor parallelism linearly reduces per-GPU KV-cache memory proportional to the TP degree, up to the number of KV heads. Measured by KV Cache Memory per GPU = 40 GB / TP for Llama-3-70B at 128K context (from 40 GB at TP=1 to 5 GB at TP=8). This matters for our series because it establishes the hard upper bound on how many concurrent long-context sequences a single node can serve, directly linking to the memory hierarchy economics examined in subsequent articles.
RQ2 Finding: Communication overhead during distributed KV-cache access is bounded by interconnect technology, with NVLink enabling sub-30-microsecond all-reduce latency per layer while cross-node Ethernet inflates this to over 1 millisecond. Measured by All-Reduce Latency = 28 us at TP=8 on NVLink versus 1400 us on 400G Ethernet. This matters for our series because it demonstrates that KV-cache distribution is fundamentally an interconnect problem, making memory placement strategies (local vs. remote, HBM vs. DRAM) as important as compression techniques covered in earlier articles.
RQ3 Finding: KV parallelism (sequence-sharded distribution) overcomes the head-count ceiling of tensor parallelism, achieving 1.86x throughput improvement on long-context workloads. Measured by Throughput Scaling Efficiency = 86% at TP=8 for standard TP versus approximately 92% for Helix KVP on GB200 NVL72 at 1M context. This matters for our series because it introduces a new dimension of cache distribution that will be essential as context windows expand to millions of tokens, directly informing the cache coherence and scheduling strategies examined in the next articles.
The next article in the series will examine disaggregated prefill and decode architectures, where the KV-cache must not only be distributed within a serving cluster but transferred between specialized compute pools optimized for different phases of inference.
References (15) #
- Stabilarity Research Hub. Distributed KV-Cache in Multi-GPU Serving. doi.org. d
- Stabilarity Research Hub. Flash Attention’s Role in Memory-Efficient Inference. b
- Zhao, Dan; Samsi, Siddharth; McDonald, Joseph; Li, Baolin; Bestor, David; Jones, Michael; Tiwari, Devesh; Gadepally, Vijay. (2023). Sustainable Supercomputing for AI. doi.org. dcrtil
- Le, Nguyen-Khang; Do, Truong Dinh; Nguyen, Le-Minh. (2025). SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation. doi.org. dcrtil
- Yin, Junqi; Palash, Mijanur; Shankar, Mallikarjun; Wang, Feiyi. (2025). RingX: Scalable Parallel Attention for Long-Context Learning on HPC. doi.org. dcrtil
- Ding, Zhiqiang; Yang, Tongkai. (2025). DynamicAttention: Dynamic KV Cache for Disaggregate LLM Inference. doi.org. dcrtil
- Qin, Ruoyu; Li, Zheming; He, Weiran; Cui, Jialei; Tang, Heyi; Ren, Feng; Ma, Teng; Cai, Shangming; Zhang, Yineng; Zhang, Mingxing; Wu, Yongwei; Zheng, Weimin; Xu, Xinran. (2025). Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. doi.org. dcrtil
- Wang, Junliang; Hu, Jiaqi; Cao, Qingping; Zhu, Yuanrui; Lin, Xiancheng. (2026). Multi-tier dynamic storage of KV cache for LLM inference under resource-constrained conditions. doi.org. dcrtil
- (2025). doi.org. d
- Shen, Yiqun; Yuan, Song; Zhang, Zhengze; Wang, Xiaoliang; Jiang, Daxin; Cam-Tu, Nguyen. (2025). LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation. doi.org. dcrtil
- Chitty-Venkata, Krishna Teja; Ye, Jie; Raskar, Siddhisanket; Kougkas, Anthony; Sun, Xian; Emani, Murali; Vishwanath, Venkatram; Nicolae, Bogdan. (2026). PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference. doi.org. dcrtil
- (2025). doi.org. d
- Wang, Dongwei; Liu, Zijie; Wang, Song; Ren, Yuxin; Deng, Jianing; Hu, Jingtong; Chen, Tianlong; Yang, Huanrui. (2025). FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference. doi.org. dcrtil
- Kim, Heejin; Lee, Jeongha; Bahn, Hyokyung. (2025). Rethinking I/O Caching for Large Language Model Inference on Resource-Constrained Mobile Platforms. doi.org. dcrtil
- Jiang, Chaoyi; Gao, Lei; Zarch, Hossein Entezari; Annavaram, Murali. (2025). KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation. doi.org. dcrtil