Stabilarity Hub

Category: AI Memory

Research series on AI memory systems — KV-cache, context windows, attention memory, retrieval-augmented memory, and memory-efficient inference architectures

Distributed KV-Cache in Multi-GPU Serving

Posted on March 29, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19310103 · Score: 83
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 58% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 89% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 79% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 58% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 84% | ✓ | ≥80% have metadata indexed
[l] | Academic | 79% | ○ | ≥80% from journals/conferences/preprints
[f] | Free Access | 84% | ✓ | ≥80% are freely accessible
[r] | References | 19 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,267 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 71% | ✓ | ≥60% of references from 2025–2026
[c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | ✓ | ✓ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (86 × 60%) + Required (4/5 × 30%) + Optional (3/4 × 10%)
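The score formula above can be checked in a few lines. This is a sketch of the stated weighting only; the function name and the rounding to the displayed integer are assumptions, not part of the hub's published code.

```python
def article_score(ref_trust, required_met, required_total,
                  optional_met, optional_total):
    """Composite quality score as stated on the page:
    60% reference trust, 30% required badges, 10% optional badges."""
    return (ref_trust * 0.60
            + (required_met / required_total) * 30
            + (optional_met / optional_total) * 10)

# Ref Trust 86, 4/5 required, 3/4 optional, as in the table above:
print(round(article_score(86, 4, 5, 3, 4)))  # → 83
```

Applying the same arithmetic to the other listings on this page reproduces each article's displayed score, which suggests the formula is applied uniformly.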

As large language models scale beyond the memory capacity of individual accelerators, distributing inference across multiple GPUs introduces fundamental challenges for key-value cache management. This article examines how tensor parallelism, pipeline parallelism, and emerging hybrid strategies partition KV-cache state across devices, analyzing the communication overhead, memory efficiency, and ...
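The tensor-parallel partitioning the abstract describes can be illustrated with back-of-envelope memory arithmetic: sharding KV heads across devices divides the per-GPU cache footprint by the parallelism degree. A minimal sketch, assuming a Llama-2-70B-like configuration; the function name and numbers are illustrative, not taken from the article.

```python
def kv_cache_bytes_per_gpu(layers, kv_heads, head_dim, seq_len,
                           batch, tp_degree, dtype_bytes=2):
    """Bytes of keys and values (factor 2) resident on one GPU when
    tensor parallelism shards KV heads evenly across tp_degree devices."""
    assert kv_heads % tp_degree == 0, "heads must shard evenly across GPUs"
    local_heads = kv_heads // tp_degree
    return 2 * layers * local_heads * head_dim * seq_len * batch * dtype_bytes

# 80 layers, 8 KV heads, head_dim 128, fp16, batch 8, 4K context, 4 GPUs:
per_gpu = kv_cache_bytes_per_gpu(80, 8, 128, 4096, 8, tp_degree=4)
print(per_gpu / 2**30, "GiB per GPU")  # → 2.5 GiB per GPU
```

The communication overhead the article analyzes is the price of this division: attention outputs for sharded heads must be all-reduced every layer.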


Flash Attention’s Role in Memory-Efficient Inference

Posted on March 29, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19303451 · Score: 81
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 48% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 91% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 70% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 48% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 83% | ✓ | ≥80% have metadata indexed
[l] | Academic | 70% | ○ | ≥80% from journals/conferences/preprints
[f] | Free Access | 96% | ✓ | ≥80% are freely accessible
[r] | References | 23 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,895 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 67% | ✓ | ≥60% of references from 2025–2026
[c] | Data Charts | 5 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | ✓ | ✓ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (82 × 60%) + Required (4/5 × 30%) + Optional (3/4 × 10%)

Flash Attention has become the foundational kernel technology enabling memory-efficient inference in large language models (LLMs), transforming how attention computation interacts with GPU memory hierarchies. This article investigates three research questions: (1) how does Flash Attention's tiling strategy reduce peak memory consumption compared to standard attention, and what are the theoretic...
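The tiling strategy the abstract refers to can be sketched with the online-softmax recurrence at the heart of Flash Attention: K/V are processed in tiles and running max/denominator statistics are rescaled, so the full N×N score matrix is never materialized. A NumPy sketch for intuition only; real Flash Attention fuses this into a single GPU kernel.

```python
import numpy as np

def tiled_attention(q, k, v, tile=64):
    """Online-softmax attention over K/V tiles; peak extra memory is
    O(n * tile) instead of O(n^2)."""
    n, d = q.shape
    out = np.zeros_like(v)
    m = np.full(n, -np.inf)   # running row-wise max of scores
    denom = np.zeros(n)       # running softmax denominator
    for s in range(0, k.shape[0], tile):
        scores = q @ k[s:s+tile].T / np.sqrt(d)      # only (n, tile) at a time
        m_new = np.maximum(m, scores.max(axis=1))
        rescale = np.exp(m - m_new)                  # correct earlier tiles
        p = np.exp(scores - m_new[:, None])
        denom = denom * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ v[s:s+tile]
        m = m_new
    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
s = q @ k.T / np.sqrt(32)                            # reference (materializes n×n)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), ref))    # → True
```

The rescaling step is what makes the computation exact rather than approximate: earlier tiles are corrected whenever a later tile raises the running maximum.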


Sliding Window and Compressive Caching for Infinite Context

Posted on March 28, 2026 (updated March 30, 2026) · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19299498 · Score: 81
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 23% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 88% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 77% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 23% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 85% | ✓ | ≥80% have metadata indexed
[l] | Academic | 81% | ✓ | ≥80% from journals/conferences/preprints
[f] | Free Access | 96% | ✓ | ≥80% are freely accessible
[r] | References | 26 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,252 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 70% | ✓ | ≥60% of references from 2025–2026
[c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | ✓ | ✓ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (82 × 60%) + Required (4/5 × 30%) + Optional (3/4 × 10%)

As large language models (LLMs) scale to context windows exceeding one million tokens, the key-value (KV) cache grows linearly and becomes the dominant memory bottleneck during autoregressive inference. Sliding window attention and compressive caching represent two complementary families of techniques that bound memory usage while preserving access to long-range context. This article investigat...
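The memory-bounding idea behind sliding-window caching can be sketched in a few lines: keep a fixed-size rolling window of recent K/V entries, optionally pinning a few initial "attention sink" tokens as in StreamingLLM. A toy illustration with strings standing in for tensors; class and parameter names are hypothetical.

```python
from collections import deque

class SlidingWindowKVCache:
    """Retains only the most recent `window` tokens' K/V pairs plus the
    first `sinks` tokens, so memory stays bounded as the sequence grows."""
    def __init__(self, window, sinks=4):
        self.sinks = sinks
        self.sink_kv = []                     # initial tokens, kept forever
        self.recent = deque(maxlen=window)    # rolling window of (k, v)

    def append(self, k, v):
        if len(self.sink_kv) < self.sinks:
            self.sink_kv.append((k, v))
        else:
            self.recent.append((k, v))        # deque evicts the oldest entry

    def __len__(self):
        return len(self.sink_kv) + len(self.recent)

cache = SlidingWindowKVCache(window=8, sinks=4)
for t in range(100):                          # 100 tokens in...
    cache.append(f"k{t}", f"v{t}")
print(len(cache))                             # → 12  (4 sinks + 8 recent)
```

Compressive caching, the complementary family in the article, would replace the hard eviction in `append` with a summarization step that folds evicted entries into a compressed memory instead of discarding them.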


Cross-Layer KV-Cache Sharing

Posted on March 28, 2026 (updated March 29, 2026) · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19291014 · Score: 80
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 13% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 91% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 78% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 13% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 83% | ✓ | ≥80% have metadata indexed
[l] | Academic | 78% | ○ | ≥80% from journals/conferences/preprints
[f] | Free Access | 96% | ✓ | ≥80% are freely accessible
[r] | References | 23 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,141 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 65% | ✓ | ≥60% of references from 2025–2026
[c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | ✓ | ✓ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (81 × 60%) + Required (4/5 × 30%) + Optional (3/4 × 10%)

As large language models (LLMs) scale to billions of parameters and context windows stretch beyond 128K tokens, the key-value (KV) cache becomes the dominant memory bottleneck during inference. Cross-layer KV-cache sharing represents a family of techniques that exploit redundancy in key and value representations across transformer layers to reduce cache memory without retraining. This article i...
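The core bookkeeping of cross-layer sharing is a mapping from layers to cache groups: layers in one group read and write a single K/V cache, so cache memory shrinks by roughly the sharing factor. A minimal sketch under the assumption of uniform contiguous grouping; real schemes may group non-uniformly or share only within the upper layers.

```python
def shared_cache_layout(n_layers, share_factor):
    """Assign each transformer layer to a cache group; layers in the same
    group reuse one K/V cache."""
    assert n_layers % share_factor == 0, "layers must divide evenly into groups"
    return [layer // share_factor for layer in range(n_layers)]

layout = shared_cache_layout(n_layers=8, share_factor=2)
print(layout)            # → [0, 0, 1, 1, 2, 2, 3, 3]
print(len(set(layout)))  # → 4 distinct caches instead of 8, a 2x reduction
```

The article's empirical question is how far `share_factor` can be pushed before the redundancy assumption breaks and quality degrades.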


Token Pruning and Attention Sparsity

Posted on March 28, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19269070 · Score: 79
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 63% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 89% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 74% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 63% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 84% | ✓ | ≥80% have metadata indexed
[l] | Academic | 74% | ○ | ≥80% from journals/conferences/preprints
[f] | Free Access | 89% | ✓ | ≥80% are freely accessible
[r] | References | 19 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,304 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 75% | ✓ | ≥60% of references from 2025–2026
[c] | Data Charts | 0 | ○ | Original data charts from reproducible analysis (min 2)
[g] | Code | ✓ | ✓ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (84 × 60%) + Required (4/5 × 30%) + Optional (2/4 × 10%)

This article investigates token pruning and attention sparsity as complementary strategies for reducing KV-cache memory consumption during large language model inference. Building on our series analysis of semantic prompt caching, we examine how selective token removal and sparse attention patterns can achieve 50-80% memory reduction while preserving generation quality. Three research questions...
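Selective token removal can be sketched as a top-k filter over cumulative attention mass, in the spirit of "heavy hitter" eviction heuristics such as H2O. A NumPy illustration; the scoring signal and keep ratio are assumptions for demonstration, not the article's specific method.

```python
import numpy as np

def prune_kv(keys, values, attn_mass, keep_ratio=0.5):
    """Keep only the cache entries with the highest cumulative attention
    mass, preserving their original order; returns pruned K/V and indices."""
    n = keys.shape[0]
    k_keep = max(1, int(n * keep_ratio))
    idx = np.sort(np.argsort(attn_mass)[-k_keep:])  # top-k, original order
    return keys[idx], values[idx], idx

rng = np.random.default_rng(1)
K, V = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))
mass = np.arange(16.0)  # toy signal: later tokens accumulated more attention
Kp, Vp, idx = prune_kv(K, V, mass, keep_ratio=0.25)
print(idx)              # → [12 13 14 15], a 75% cache reduction
```

Attention sparsity, the complementary strategy, attacks the same cost from the compute side: instead of deleting entries, each query attends to only a structured subset of them.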


Semantic Prompt Caching — Beyond Exact Match

Posted on March 24, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19211071 · Score: 59
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 0% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 86% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 7% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 86% | ✓ | ≥80% have metadata indexed
[l] | Academic | 71% | ○ | ≥80% from journals/conferences/preprints
[f] | Free Access | 100% | ✓ | ≥80% are freely accessible
[r] | References | 14 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,336 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 33% | ✗ | ≥60% of references from 2025–2026
[c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | — | ○ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (60 × 60%) + Required (3/5 × 30%) + Optional (2/4 × 10%)

Prompt caching has emerged as a critical optimization for large language model (LLM) serving, yet production systems overwhelmingly rely on exact-match strategies that miss semantically equivalent queries. This article investigates semantic prompt caching — systems that identify and serve cached responses for semantically similar (but not identical) prompts using embedding-based similarity dete...
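Embedding-based similarity lookup, the mechanism the abstract describes, can be sketched as a cosine-similarity threshold over stored prompt embeddings. A toy in-memory version; a production system would use an approximate-nearest-neighbor index and a learned embedder, and the class name and threshold here are illustrative.

```python
import numpy as np

class SemanticCache:
    """Toy semantic prompt cache: store (unit embedding, response) pairs
    and serve a hit when cosine similarity exceeds `threshold`."""
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (unit_vector, response)

    def _unit(self, e):
        e = np.asarray(e, dtype=float)
        return e / np.linalg.norm(e)

    def put(self, emb, response):
        self.entries.append((self._unit(emb), response))

    def get(self, emb):
        u = self._unit(emb)
        best = max(self.entries, key=lambda kv: u @ kv[0], default=None)
        if best is not None and u @ best[0] >= self.threshold:
            return best[1]          # semantic hit: similar but not identical
        return None                 # miss: fall through to the LLM

cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.1], "cached answer")
print(cache.get([1.0, 0.02, 0.1]))  # near-duplicate embedding → cached answer
print(cache.get([0.0, 1.0, 0.0]))   # unrelated embedding → None
```

The threshold is the policy knob the article's analysis turns on: set too low it serves wrong answers, set too high it degenerates to exact-match behavior.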


Speculative Decoding and Cache Reuse

Posted on March 24, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19210815 · Score: 61
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 0% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 90% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 5% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 90% | ✓ | ≥80% have metadata indexed
[l] | Academic | 81% | ✓ | ≥80% from journals/conferences/preprints
[f] | Free Access | 100% | ✓ | ≥80% are freely accessible
[r] | References | 21 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,662 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 21% | ✗ | ≥60% of references from 2025–2026
[c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | — | ○ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (63 × 60%) + Required (3/5 × 30%) + Optional (2/4 × 10%)

Speculative decoding has emerged as a transformative inference optimization that breaks the sequential bottleneck of autoregressive generation by drafting multiple tokens in parallel and verifying them in a single forward pass. This article examines three research questions at the intersection of speculative decoding and KV cache management: how draft-then-verify architectures interact with cac...
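The cache interaction the abstract highlights can be sketched at the bookkeeping level: drafted tokens are appended to the KV cache optimistically, and the cache is rolled back past the first token the verification pass rejects. A toy sketch with a list standing in for the cache and a hypothetical verifier callback.

```python
def speculative_step(draft_tokens, verify, cache):
    """Draft-then-verify sketch: optimistically extend the KV cache with
    drafted tokens, then truncate it after the first rejection."""
    start = len(cache)
    cache.extend(draft_tokens)           # optimistic cache append
    accepted = verify(draft_tokens)      # per-token agreement of target model
    n_ok = next((i for i, ok in enumerate(accepted) if not ok), len(accepted))
    del cache[start + n_ok:]             # roll back the rejected suffix
    return n_ok                          # number of tokens kept this step

cache = ["the", "cat"]
# Hypothetical verifier: the target model agrees with the first two drafts.
n = speculative_step(["sat", "on", "mat"], lambda ts: [True, True, False], cache)
print(n, cache)                          # → 2 ['the', 'cat', 'sat', 'on']
```

The rollback is cheap here because the cache is a list; with paged or compressed caches, reclaiming the rejected suffix efficiently is exactly the design problem the article examines.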


Grouped-Query Attention — Cache-Efficient Architecture Design

Posted on March 24, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19209159 · Score: 73
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 4% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 92% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 79% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 4% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 88% | ✓ | ≥80% have metadata indexed
[l] | Academic | 83% | ✓ | ≥80% from journals/conferences/preprints
[f] | Free Access | 100% | ✓ | ≥80% are freely accessible
[r] | References | 24 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,403 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 36% | ✗ | ≥60% of references from 2025–2026
[c] | Data Charts | 5 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | — | ○ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (83 × 60%) + Required (3/5 × 30%) + Optional (2/4 × 10%)

As large language models scale beyond hundreds of billions of parameters and context windows extend to millions of tokens, the key-value (KV) cache required for attention computation becomes the dominant memory bottleneck during inference. Grouped-Query Attention (GQA) addresses this by allowing multiple query heads to share fewer key-value heads, reducing cache footprint while preserving model...
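The cache saving from head sharing is directly proportional to the grouping ratio, since KV-cache size scales with the number of KV heads rather than query heads. A small sketch of that arithmetic; the Llama-2-70B head counts are a commonly cited public configuration, used here for illustration.

```python
def kv_cache_ratio(n_query_heads, n_kv_heads):
    """KV-cache size relative to full multi-head attention, where every
    query head has its own KV head."""
    assert n_query_heads % n_kv_heads == 0, "query heads must group evenly"
    return n_kv_heads / n_query_heads

# Llama-2-70B groups 64 query heads onto 8 KV heads:
print(kv_cache_ratio(64, 8))    # → 0.125, i.e. an 8x cache reduction
print(kv_cache_ratio(64, 64))   # MHA baseline → 1.0
print(kv_cache_ratio(64, 1))    # MQA extreme → 0.015625
```

GQA thus interpolates between MHA (maximum quality, maximum cache) and MQA (minimum cache, larger quality risk), which is the trade-off space the article maps.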


Paged Attention and Virtual Memory for LLM Inference

Posted on March 24, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19203099 · Score: 59
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 13% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 73% | ○ | ≥80% from verified, high-quality sources
[a] | DOI | 27% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 13% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 80% | ✓ | ≥80% have metadata indexed
[l] | Academic | 60% | ○ | ≥80% from journals/conferences/preprints
[f] | Free Access | 87% | ✓ | ≥80% are freely accessible
[r] | References | 15 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,912 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 31% | ✗ | ≥60% of references from 2025–2026
[c] | Data Charts | 4 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | — | ○ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (60 × 60%) + Required (3/5 × 30%) + Optional (2/4 × 10%)

As large language models scale to billions of parameters and millions of context tokens, the key-value (KV) cache that stores attention states becomes the dominant memory bottleneck during inference. Traditional contiguous memory allocation for KV caches leads to severe fragmentation — wasting 40-60% of available GPU memory — and fundamentally limits serving throughput. This article investigate...
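The virtual-memory analogy can be sketched as a block-table allocator: each sequence holds a table of fixed-size pages drawn from a shared free pool, so internal waste is bounded by one partial block per sequence instead of a large contiguous over-allocation. A toy allocator in the spirit of vLLM's PagedAttention; the class and method names are hypothetical, not the vLLM API.

```python
class PagedKVAllocator:
    """Toy paged KV allocation: sequences grow a block table one fixed-size
    page at a time from a shared free pool."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # last block is full: grab a page
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

alloc = PagedKVAllocator(num_blocks=64, block_size=16)
for _ in range(40):
    alloc.append_token("req-0")
print(len(alloc.tables["req-0"]))  # 40 tokens → ceil(40/16) = 3 blocks
print(len(alloc.free))             # 61 blocks still free for other requests
```

Because blocks are interchangeable, finished sequences return pages to the pool with no compaction, which is how this scheme eliminates the fragmentation losses described above.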


Meta-Analysis of Context Benchmarks — Building a Unified Evaluation Framework

Posted on March 24, 2026 · Technical Research by Oleh Ivchenko · DOI: 10.5281/zenodo.19199439 · Score: 61
Badge | Metric | Value | Status | Description
[s] | Reviewed Sources | 16% | ○ | ≥80% from editorially reviewed sources
[t] | Trusted | 89% | ✓ | ≥80% from verified, high-quality sources
[a] | DOI | 5% | ○ | ≥80% have a Digital Object Identifier
[b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef
[i] | Indexed | 89% | ✓ | ≥80% have metadata indexed
[l] | Academic | 79% | ○ | ≥80% from journals/conferences/preprints
[f] | Free Access | 84% | ✓ | ≥80% are freely accessible
[r] | References | 19 refs | ✓ | Minimum 10 references required
[w] | Words [REQ] | 2,528 | ✓ | Minimum 2,000 words for a full research article
[d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation
[o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity
[p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer
[h] | Freshness [REQ] | 29% | ✗ | ≥60% of references from 2025–2026
[c] | Data Charts | 5 | ✓ | Original data charts from reproducible analysis (min 2)
[g] | Code | — | ○ | Source code available on GitHub
[m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams
[x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s)
Score = Ref Trust (63 × 60%) + Required (3/5 × 30%) + Optional (2/4 × 10%)

The rapid expansion of context windows — from 4K tokens to 10M tokens in models like Llama 4 — has produced a proliferation of evaluation benchmarks, yet no unified framework exists for comparing long-context capabilities across these disparate tests. This article presents a meta-analysis of ten major context benchmarks (NIAH, RULER, LongBench v2, InfiniteBench, BABILong, NoLiMa, LongGenBench, ...


About

Stabilarity Research Hub is dedicated to advancing the frontiers of AI, from Medical ML to Anticipatory Intelligence. Our mission is to build robust and efficient AI systems for a safer future.


© 2026 Stabilarity Research Hub