The dominant paradigm for AI memory — fixed-size context windows processed through self-attention — faces fundamental scalability barriers as large language models are deployed in long-horizon agentic tasks requiring hundreds of interaction sessions. This article investigates the transition from fixed context windows to persistent memory architectures through three research questions addressing...
Research series on AI memory systems — KV-cache, context windows, attention memory, retrieval-augmented memory, and memory-efficient inference architectures
Biological Memory Models and Their AI Analogues
The rapid expansion of AI memory architectures — from KV-caches and retrieval-augmented generation to parametric weight storage — has proceeded largely without systematic reference to the biological memory systems that inspired them. This article investigates three research questions about the structural and functional parallels between biological memory systems (hippocampal-cortical consolidation...
Retrieval-Augmented Memory vs Pure Attention Memory
The expansion of large language model context windows to 128K+ tokens has reopened a fundamental architectural question: should AI systems remember through retrieval from external stores or through attention over internally maintained representations? This article investigates three research questions about the comparative performance of retrieval-augmented memory (RAM) and pure attention memory...
Cache-Augmented Retrieval — RAG Meets KV-Cache
Retrieval-Augmented Generation (RAG) has become the dominant paradigm for grounding large language models in external knowledge, yet its runtime retrieval overhead imposes latency and consistency penalties that limit production deployability. Cache-Augmented Generation (CAG) proposes an inversion of this paradigm: preload all relevant documents into the model's key-value (KV) cache before querying...
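The preload-then-reuse idea behind CAG can be illustrated with a counting toy (every name below is an illustrative stub, not a real serving API): RAG pays the document prefill on every query, while CAG pays it once and amortizes it.

```python
# Toy comparison of per-query retrieval (RAG) vs. one-time preload (CAG).
# All functions and sizes here are illustrative stubs, not a real serving API.

prefill_calls = {"rag": 0, "cag": 0}

def prefill(tokens, mode):
    """Stand-in for running the model's prefill pass over `tokens`."""
    prefill_calls[mode] += len(tokens)
    return f"kv_cache({len(tokens)} tokens)"

DOCS = ["doc_a"] * 4000      # pretend corpus: 4000 document tokens
QUERIES = [["q"] * 20] * 50  # 50 queries of 20 tokens each

# RAG: retrieve, then re-prefill the retrieved context for every query.
for q in QUERIES:
    retrieved = DOCS[:800]           # stub retriever returns 800 tokens
    prefill(retrieved + q, "rag")

# CAG: prefill the whole corpus once, then extend the cached state per query.
shared_cache = prefill(DOCS, "cag")  # one-time cost
for q in QUERIES:
    prefill(q, "cag")                # only query tokens hit prefill

print(prefill_calls)  # {'rag': 41000, 'cag': 5000}
```

Under these assumed sizes, CAG prefills roughly an eighth of the tokens RAG does, at the price of holding the whole corpus resident in the KV cache.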
The Economics of Context Caching — Cost Models and Break-Even
Context caching has emerged as the primary mechanism for reducing inference costs in large language model (LLM) deployments, yet the economics governing when caching becomes cost-effective remain poorly formalized. This article investigates three research questions addressing (1) how key-value (KV) cache storage costs scale with model architecture and context length, (2) at what request reuse frequency...
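Both questions reduce to back-of-envelope arithmetic. The sketch below uses a Llama-2-7B-like geometry for the cache-size formula; the per-token cost figures are assumptions for illustration, not published provider prices.

```python
# Back-of-envelope KV-cache break-even sketch. Cost figures below are
# illustrative assumptions, not published provider prices.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Keys and values (factor 2) are stored for every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-2-7B-like geometry, fp16:
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
print(per_token)           # 524288 bytes = 512 KiB per cached token

# Assumed costs: recomputing prefill for a token vs. storing its KV entry.
PREFILL_COST = 2e-7        # $ per token recomputed (assumption)
STORAGE_COST = 1e-8        # $ per token per hour kept cached (assumption)

def break_even_hours(prefill_cost=PREFILL_COST, storage_cost=STORAGE_COST):
    """Longest reuse interval at which caching still beats recomputation."""
    return prefill_cost / storage_cost

print(break_even_hours())  # ~20: cache pays off if reused within ~20 hours
```

The same two-line model extends naturally to batch effects and tiered storage prices; the article's formalization presumably refines exactly these terms.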
Production Cache Monitoring — Metrics and Capacity Planning
As key-value (KV) cache systems become the dominant memory consumer in production large language model (LLM) inference, the ability to monitor cache behavior and plan capacity proactively determines whether deployments meet service-level objectives (SLOs) or suffer unpredictable degradation. This article investigates three research questions addressing (1) which monitoring metrics most reliably...
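The core metrics at issue (hit rate, miss rate, eviction rate) can be made concrete with a minimal instrumented LRU cache; the class below is an illustrative sketch of the counters a production monitor would export, not a serving framework's API.

```python
from collections import OrderedDict

# Minimal instrumented LRU cache: a sketch of the hit/miss/eviction
# counters a production cache monitor would export.

class MonitoredCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = self.evictions = 0

    def get_or_insert(self, key, value):
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)     # LRU bookkeeping
            return self.store[key]
        self.misses += 1
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict least recently used
            self.evictions += 1
        self.store[key] = value
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = MonitoredCache(capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get_or_insert(key, key.upper())
print(cache.hit_rate(), cache.evictions)  # 0.2 2
```

A rising eviction count at a steady hit rate is the classic early-warning signal that working-set size has outgrown capacity, which is where capacity planning enters.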
Cache Coherence in Multi-Tenant Deployments
As large language model (LLM) inference platforms scale to serve dozens or hundreds of concurrent tenants on shared GPU clusters, the key-value (KV) cache — the dominant consumer of GPU memory — becomes both a performance bottleneck and a security surface. This article investigates cache coherence challenges that arise when multiple tenants share KV-cache state in production LLM serving systems. We...
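One simple isolation scheme for that security surface, sketched below under the assumption that cached prefixes are looked up by hash, namespaces cache keys by tenant so identical prompts never share entries across tenants.

```python
import hashlib

# Sketch of tenant-scoped cache keys: folding the tenant id into the
# prefix hash prevents one tenant's cached KV blocks from being served
# to another. Illustrative scheme, not a specific framework's API.

def cache_key(tenant_id: str, prefix_tokens: list) -> str:
    h = hashlib.sha256()
    h.update(tenant_id.encode())
    for tok in prefix_tokens:
        h.update(int(tok).to_bytes(4, "little"))
    return h.hexdigest()

shared_prefix = [101, 2023, 2003]  # identical system-prompt tokens

key_a = cache_key("tenant-a", shared_prefix)
key_b = cache_key("tenant-b", shared_prefix)

# Same prefix, different tenants -> different cache entries, so a timing
# side channel on cache hits cannot reveal another tenant's prompts.
print(key_a != key_b)  # True
```

The trade-off is exactly the coherence-versus-isolation tension the article names: namespacing forfeits cross-tenant reuse of genuinely shared prefixes.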
Memory Hierarchy — DRAM, HBM, and SSD-Backed Caches
Large language model inference demands massive key-value (KV) cache storage that frequently exceeds GPU high-bandwidth memory (HBM) capacity, forcing system designers to exploit multi-tier memory hierarchies spanning HBM, host DRAM, and NVMe SSDs. This article investigates three research questions: how bandwidth and latency characteristics of each memory tier constrain KV cache serving throughput...
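The tier gap can be made concrete with rough numbers: loading a 4K-token cache for a 7B-class model takes under a millisecond from HBM but hundreds of milliseconds from SSD. The geometry and bandwidth figures below are ballpark assumptions (HBM3-class memory, a PCIe 5.0 x16 link, a fast NVMe drive), not measurements.

```python
# Rough tier comparison: time to load a 4K-token KV cache from each tier.
# Bandwidth figures are ballpark assumptions (HBM3, PCIe 5.0 x16, NVMe).

KV_BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2   # fp16, Llama-2-7B-like geometry
cache_bytes = 4096 * KV_BYTES_PER_TOKEN      # 2 GiB for a 4K-token prefix

BANDWIDTH_GBS = {          # assumed sustained bandwidth per tier, GB/s
    "HBM":       3000,
    "host DRAM":   64,     # limited by the PCIe link, not DRAM itself
    "NVMe SSD":     7,
}

for tier, bw in BANDWIDTH_GBS.items():
    ms = cache_bytes / (bw * 1e9) * 1e3
    print(f"{tier:>9}: {ms:8.2f} ms to load the cache")
```

The three-orders-of-magnitude spread is why tier placement and prefetching, not raw capacity, dominate hierarchy design.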
Cache-Aware Request Scheduling and Batching
Efficient large language model (LLM) inference depends critically on how requests are scheduled and batched relative to the key-value (KV) cache state across GPU memory. Traditional scheduling strategies — round-robin, least-loaded, and even continuous batching — treat the KV cache as a passive byproduct of inference rather than an active scheduling constraint. This article investigates three research questions...
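Treating the cache as an active constraint can be sketched as longest-shared-prefix routing with a load tie-break; the router below is a hypothetical toy, not any particular scheduler's implementation.

```python
# Toy cache-aware router: send each request to the worker whose resident
# cache shares the longest token prefix, tie-breaking on lighter load.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, workers):
    """workers: {name: {"cached": [prefix tokens], "load": int}}"""
    return max(
        workers,
        key=lambda w: (
            shared_prefix_len(request_tokens, workers[w]["cached"]),
            -workers[w]["load"],          # prefer lighter load on ties
        ),
    )

workers = {
    "gpu0": {"cached": [1, 2, 3, 4], "load": 3},
    "gpu1": {"cached": [1, 2, 9],    "load": 1},
    "gpu2": {"cached": [],           "load": 0},
}

print(route([1, 2, 3, 7], workers))  # gpu0: 3 cached tokens reusable
print(route([5, 5],       workers))  # gpu2: no prefix match, lightest load
```

Even this two-term priority captures the central tension the article studies: cache affinity pulls toward hot workers while load balance pulls away from them.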
Disaggregated Prefill and Decode Architectures
Large language model inference comprises two computationally distinct phases — prefill and decode — that exhibit fundamentally different hardware utilization profiles. Colocating both phases on the same GPU leads to resource contention and suboptimal utilization, a problem that disaggregated architectures address by separating prefill and decode onto dedicated hardware pools. This article investigates...
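The different utilization profiles follow from arithmetic intensity: prefill batches many tokens against one streaming of the weights, while decode reads the full weights for a single token. A rough estimate under illustrative assumptions (7B-parameter dense model, fp16 weights, ~2 FLOPs per parameter per token):

```python
# Why prefill and decode want different hardware: rough arithmetic
# intensity (FLOPs per byte of weights read) for each phase.
# Model numbers are illustrative (7B dense parameters, fp16 weights).

PARAMS = 7e9
WEIGHT_BYTES = PARAMS * 2            # fp16
FLOPS_PER_TOKEN = 2 * PARAMS         # ~2 FLOPs per parameter per token

def arithmetic_intensity(tokens_per_pass):
    # Weights are streamed once per forward pass regardless of token
    # count, so batching tokens (prefill) raises FLOPs per byte.
    return tokens_per_pass * FLOPS_PER_TOKEN / WEIGHT_BYTES

print(arithmetic_intensity(2048))    # prefill: 2048 tokens per pass
print(arithmetic_intensity(1))       # decode: one token per pass
```

A ~2000x intensity gap means prefill saturates compute while decode saturates memory bandwidth, which is precisely the mismatch disaggregation exploits by giving each phase its own hardware pool.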