As large language models scale beyond the memory capacity of individual accelerators, distributing inference across multiple GPUs introduces fundamental challenges for key-value cache management. This article examines how tensor parallelism, pipeline parallelism, and emerging hybrid strategies partition KV-cache state across devices, analyzing the communication overhead, memory efficiency, and ...
Category: AI Memory
Research series on AI memory systems — KV-cache, context windows, attention memory, retrieval-augmented memory, and memory-efficient inference architectures
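The distributed-inference abstract above notes that tensor parallelism partitions KV-cache state across devices. A minimal sketch of that sharding arithmetic, assuming KV heads divide evenly across the tensor-parallel group (function name and signature are illustrative, not from any particular framework):

```python
def kv_heads_for_rank(num_kv_heads: int, tp_size: int, rank: int) -> list[int]:
    """Return the KV-head indices a given tensor-parallel rank caches.

    Each GPU holds num_kv_heads / tp_size heads, so per-device KV-cache
    memory shrinks by the TP degree (paid for with collective communication
    on the attention outputs).
    """
    assert num_kv_heads % tp_size == 0, "KV heads must divide evenly across ranks"
    per_rank = num_kv_heads // tp_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))
```

With 8 KV heads and TP degree 4, rank 1 caches heads 2 and 3; the union over all ranks covers every head exactly once.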
Flash Attention’s Role in Memory-Efficient Inference
Flash Attention has become the foundational kernel technology enabling memory-efficient inference in large language models (LLMs), transforming how attention computation interacts with GPU memory hierarchies. This article investigates three research questions: (1) how does Flash Attention's tiling strategy reduce peak memory consumption compared to standard attention, and what are the theoretic...
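The tiling strategy the abstract refers to can be illustrated with an online-softmax sketch: keys and values are processed in tiles while a running max, normalizer, and output accumulator are carried forward, so the full score row is never materialized. This is a simplified single-query, single-head model of the idea, not the fused CUDA kernel:

```python
import numpy as np

def standard_attention(q, K, V):
    # Materializes the full length-T score row: O(T) extra memory per query.
    s = q @ K.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def tiled_attention(q, K, V, tile=4):
    # Flash-Attention-style online softmax: visit K/V one tile at a time,
    # carrying running max m, normalizer l, and an output accumulator.
    d = q.shape[-1]
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[-1])
    for i in range(0, K.shape[0], tile):
        s = q @ K[i:i + tile].T / np.sqrt(d)  # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)             # rescale previous accumulator
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + tile]
        m = m_new
    return acc / l
```

Both functions compute the same softmax-weighted output; only peak memory differs, which is the core of the first research question above.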
Sliding Window and Compressive Caching for Infinite Context
As large language models (LLMs) scale to context windows exceeding one million tokens, the key-value (KV) cache grows linearly and becomes the dominant memory bottleneck during autoregressive inference. Sliding window attention and compressive caching represent two complementary families of techniques that bound memory usage while preserving access to long-range context. This article investigat...
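The sliding-window half of the technique family above bounds cache growth by evicting the oldest entries. A minimal sketch using a fixed-capacity deque (class name and shape are illustrative; real implementations store per-layer tensors):

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps at most `window` recent (key, value) pairs, so cache memory
    is O(window) rather than O(sequence_length)."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # Once the window is full, the oldest entry is evicted automatically.
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)
```

Compressive caching, the complementary family, would instead summarize evicted entries rather than discard them outright.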
Cross-Layer KV-Cache Sharing
As large language models (LLMs) scale to billions of parameters and context windows stretch beyond 128K tokens, the key-value (KV) cache becomes the dominant memory bottleneck during inference. Cross-layer KV-cache sharing represents a family of techniques that exploit redundancy in key and value representations across transformer layers to reduce cache memory without retraining. This article i...
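One simple grouping scheme for the cross-layer sharing described above: consecutive layers map to a single shared cache slot, cutting slots from num_layers to num_layers / share_factor. This is a sketch of the bookkeeping only, with hypothetical names; published methods differ in which layers share and how representations are reused:

```python
class CrossLayerKVCache:
    """Layers in the same group read and write one shared KV entry."""

    def __init__(self, num_layers: int, share_factor: int):
        assert num_layers % share_factor == 0
        self.share_factor = share_factor
        self.slots = [None] * (num_layers // share_factor)

    def slot_for_layer(self, layer_idx: int) -> int:
        return layer_idx // self.share_factor

    def write(self, layer_idx: int, kv):
        self.slots[self.slot_for_layer(layer_idx)] = kv

    def read(self, layer_idx: int):
        return self.slots[self.slot_for_layer(layer_idx)]
```

With 8 layers and a share factor of 2, layers 0 and 1 share slot 0, so the cache holds 4 entries instead of 8.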
Token Pruning and Attention Sparsity
This article investigates token pruning and attention sparsity as complementary strategies for reducing KV-cache memory consumption during large language model inference. Building on our series analysis of semantic prompt caching, we examine how selective token removal and sparse attention patterns can achieve 50-80% memory reduction while preserving generation quality. Three research questions...
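Selective token removal, as described above, typically ranks cached tokens by an importance signal and keeps the top scorers. A minimal sketch, assuming importance scores (e.g., accumulated attention mass per token) are supplied externally:

```python
import numpy as np

def prune_kv_cache(keys, values, importance, keep: int):
    # Keep the `keep` highest-importance tokens, preserving original order
    # so cached positions stay consistent with the sequence.
    kept = np.sort(np.argsort(importance)[-keep:])
    return keys[kept], values[kept], kept
```

Keeping half the tokens this way halves cache memory; whether generation quality survives depends on how well the importance signal identifies disposable tokens, which is exactly what the abstract's research questions probe.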
Semantic Prompt Caching — Beyond Exact Match
Prompt caching has emerged as a critical optimization for large language model (LLM) serving, yet production systems overwhelmingly rely on exact-match strategies that miss semantically equivalent queries. This article investigates semantic prompt caching — systems that identify and serve cached responses for semantically similar (but not identical) prompts using embedding-based similarity dete...
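The embedding-based similarity detection mentioned above can be sketched as a cosine-similarity lookup with a threshold. The embedding function is assumed to exist externally; class and parameter names are illustrative:

```python
import numpy as np

class SemanticCache:
    """Serve a cached response when a query embedding is within `threshold`
    cosine similarity of a stored prompt's embedding."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.embeddings = []   # unit-normalized prompt embeddings
        self.responses = []

    def put(self, emb, response):
        self.embeddings.append(emb / np.linalg.norm(emb))
        self.responses.append(response)

    def get(self, emb):
        if not self.embeddings:
            return None  # cold cache
        sims = np.stack(self.embeddings) @ (emb / np.linalg.norm(emb))
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None
```

The threshold is the core tuning knob: too low and semantically different prompts collide (wrong answers served); too high and the system degenerates toward exact match.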
Speculative Decoding and Cache Reuse
Speculative decoding has emerged as a transformative inference optimization that breaks the sequential bottleneck of autoregressive generation by drafting multiple tokens in parallel and verifying them in a single forward pass. This article examines three research questions at the intersection of speculative decoding and KV cache management: how draft-then-verify architectures interact with cac...
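The draft-then-verify interaction with the cache reduces to one invariant: KV entries written for rejected draft tokens must be rolled back. A minimal sketch of the accept/trim step (real systems also append the target model's correction token, omitted here):

```python
def accept_and_rollback(draft, target, prefix_cache_len: int):
    """Accept the longest prefix where drafted tokens match the target
    model's verified tokens; cache entries past that point are invalid,
    so the valid cache length is prefix + accepted."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted, prefix_cache_len + accepted
```

If 3 tokens are drafted and the first 2 match, the cache is truncated to its pre-draft length plus 2, and decoding resumes from there.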
Grouped-Query Attention — Cache-Efficient Architecture Design
As large language models scale beyond hundreds of billions of parameters and context windows extend to millions of tokens, the key-value (KV) cache required for attention computation becomes the dominant memory bottleneck during inference. Grouped-Query Attention (GQA) addresses this by allowing multiple query heads to share fewer key-value heads, reducing cache footprint while preserving model...
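The cache-footprint reduction from sharing KV heads is straightforward arithmetic: the cache scales with the number of KV heads, not query heads. A sketch using a Llama-2-70B-like shape (80 layers, 64 query heads, head dimension 128, fp16) as an illustrative assumption:

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 covers keys and values; each layer stores
    # seq_len x num_kv_heads x head_dim elements for each.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

# MHA: one KV head per query head (64 KV heads).
mha_bytes = kv_cache_bytes(80, 4096, 64, 128)
# GQA: 8 query heads share each KV head (8 KV heads).
gqa_bytes = kv_cache_bytes(80, 4096, 8, 128)
```

At 4K context this works out to 10 GiB for MHA versus 1.25 GiB for GQA, an 8x reduction that matches the head-sharing ratio.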
Paged Attention and Virtual Memory for LLM Inference
As large language models scale to billions of parameters and millions of context tokens, the key-value (KV) cache that stores attention states becomes the dominant memory bottleneck during inference. Traditional contiguous memory allocation for KV caches leads to severe fragmentation — wasting 40-60% of available GPU memory — and fundamentally limits serving throughput. This article investigate...
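The virtual-memory analogy at the heart of paged attention is a block table: logical token positions map to fixed-size physical blocks drawn on demand from a shared pool, so no contiguous region is ever reserved. A minimal sketch of that indirection (class shape is illustrative, not vLLM's actual data structure):

```python
class BlockTable:
    """Maps a sequence's logical token positions to fixed-size physical
    blocks allocated lazily from a shared free pool."""

    def __init__(self, free_pool: list[int], block_size: int = 16):
        self.free_pool = free_pool   # available physical block ids
        self.block_size = block_size
        self.blocks = []             # logical block index -> physical id
        self.length = 0              # tokens cached so far

    def append_token(self):
        if self.length % self.block_size == 0:
            self.blocks.append(self.free_pool.pop())  # allocate on demand
        self.length += 1

    def physical_slot(self, pos: int):
        # (physical block id, offset within block) for a logical position.
        return self.blocks[pos // self.block_size], pos % self.block_size
```

Because blocks are allocated only as tokens arrive, the worst-case waste per sequence is one partially filled block, rather than the 40-60% fragmentation cited above for contiguous allocation.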
Meta-Analysis of Context Benchmarks — Building a Unified Evaluation Framework
The rapid expansion of context windows — from 4K tokens to 10M tokens in models like Llama 4 — has produced a proliferation of evaluation benchmarks, yet no unified framework exists for comparing long-context capabilities across these disparate tests. This article presents a meta-analysis of ten major context benchmarks (NIAH, RULER, LongBench v2, InfiniteBench, BABILong, NoLiMa, LongGenBench, ...