The gap between what AI can do and what organizations actually deploy continues to widen in 2026. While previous articles in this series quantified the magnitude of this gap across sectors, the underlying friction mechanisms remain poorly categorized. This article introduces a four-quadrant Adoption Friction Taxonomy (AFT) that classifies eight empirically identified barrier categories along tw...
Fresh Repositories Watch: Healthcare AI — Emerging Open-Source Tools Under 60 Days Old
The healthcare AI open-source ecosystem is experiencing unprecedented growth in early 2026, driven by federated learning platforms, foundation models for medical imaging, and synthetic data generators that enable privacy-preserving research collaboration. This article applies the Trusted Open Source Index methodology established in our previous work to evaluate nine prominent healthcare AI ...
Semantic Prompt Caching — Beyond Exact Match
Prompt caching has emerged as a critical optimization for large language model (LLM) serving, yet production systems overwhelmingly rely on exact-match strategies that miss semantically equivalent queries. This article investigates semantic prompt caching — systems that identify and serve cached responses for semantically similar (but not identical) prompts using embedding-based similarity dete...
Speculative Decoding and Cache Reuse
Speculative decoding has emerged as a transformative inference optimization that breaks the sequential bottleneck of autoregressive generation by drafting multiple tokens in parallel and verifying them in a single forward pass. This article examines three research questions at the intersection of speculative decoding and KV cache management: how draft-then-verify architectures interact with cac...
Social and Collaborative Intelligence as a UIB Dimension: Why Theory of Mind Remains the Hardest Benchmark
Current AI evaluation overwhelmingly focuses on individual cognitive tasks — reasoning, coding, mathematics — while neglecting the social and collaborative capabilities that define human intelligence in practice. This article introduces the UIB-Social dimension, a formal evaluation framework for measuring social intelligence in large language models across four sub-dimensions: Theory of Mind (T...
Grouped-Query Attention — Cache-Efficient Architecture Design
As large language models scale beyond hundreds of billions of parameters and context windows extend to millions of tokens, the key-value (KV) cache required for attention computation becomes the dominant memory bottleneck during inference. Grouped-Query Attention (GQA) addresses this by allowing multiple query heads to share fewer key-value heads, reducing cache footprint while preserving model...
Temporal and Planning Intelligence as a UIB Dimension: Why Horizon Length Breaks Modern Reasoning Models
Temporal reasoning and long-horizon planning represent perhaps the most consequential gap between current large language models and human cognitive capability. While frontier models achieve near-human performance on short planning tasks (under 15 steps), their accuracy degrades catastrophically beyond 25 planning steps — a phenomenon we term the horizon collapse. This article examines three res...
Paged Attention and Virtual Memory for LLM Inference
As large language models scale to billions of parameters and millions of context tokens, the key-value (KV) cache that stores attention states becomes the dominant memory bottleneck during inference. Traditional contiguous memory allocation for KV caches leads to severe fragmentation — wasting 40-60% of available GPU memory — and fundamentally limits serving throughput. This article investigate...
Meta-Analysis of Context Benchmarks — Building a Unified Evaluation Framework
The rapid expansion of context windows — from 4K tokens to 10M tokens in models like Llama 4 — has produced a proliferation of evaluation benchmarks, yet no unified framework exists for comparing long-context capabilities across these disparate tests. This article presents a meta-analysis of ten major context benchmarks (NIAH, RULER, LongBench v2, InfiniteBench, BABILong, NoLiMa, LongGenBench, ...
Multi-Turn Memory — How Conversation History Degrades Model Performance
Multi-turn conversation represents the dominant interaction mode for deployed large language models, yet mounting evidence reveals that model performance degrades severely as conversation history accumulates in the KV-cache. This article investigates three research questions: how rapidly task accuracy declines across conversation turns, what mechanisms drive this degradation at the attention an...