The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured
Article 1 of 11 — Universal Intelligence Benchmark Series
Oleh Ivchenko · Odesa National Polytechnic University · ORCID 0000-0002-9540-1637
Cite as: Ivchenko, O. (2026). The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured. Stabilarity Research. DOI: 10.5281/zenodo.19001033
Series: Universal Intelligence Benchmark (UIB), Article 1 of 11
License: CC BY 4.0 · Data: Available at stabilarity.com
Abstract
We present a meta-meta-analysis of 217 benchmark evaluation studies published between 2020 and 2026, examining not the benchmarks themselves but the systematic reviews that assess them. Our coverage matrix reveals a profound structural bias: 78.3% of surveyed studies evaluate text-based capabilities, while causal reasoning (4.1%), embodied intelligence (1.8%), and social cognition (0.9%) remain nearly unmeasured. We trace this imbalance through the theoretical frameworks of Schmidhuber’s compression-based intelligence, Chollet’s algorithmic reasoning paradigm, and Legg and Hutter’s universal intelligence formalism, arguing that the field requires an inference-agnostic, multi-dimensional evaluation architecture. This article establishes the theoretical foundation for the Universal Intelligence Benchmark (UIB), a forthcoming open framework that measures intelligence across eight orthogonal dimensions via OpenRouter’s unified model interface.
```mermaid
pie title Intelligence Dimension Coverage (n=217 Studies)
    "Text Comprehension & Generation (78.3%)" : 170
    "Visual Recognition & Reasoning (15.2%)" : 33
    "Mathematical & Formal Reasoning (12.9%)" : 28
    "Code Generation (10.1%)" : 22
    "Causal Reasoning (4.1%)" : 9
    "Temporal Planning (3.2%)" : 7
    "Embodied Intelligence (1.8%)" : 4
    "Social Cognition (0.9%)" : 2
```
The State of Benchmark Meta-Research
The AI evaluation ecosystem has entered a paradoxical phase: models saturate existing benchmarks faster than new ones can be designed. MMLU, once considered a robust measure of multitask language understanding, now sees frontier models exceed 92% accuracy (Rein et al., 2026; pricepertoken.com). GPQA-Diamond, designed to be “Google-proof” even for PhD experts (who themselves score only 65%), is approaching saturation, with Gemini 3.1 Pro at 94.1% (IntuitionLabs, 2026). A systematic study of benchmark saturation patterns (Kiela et al., arXiv:2602.16763, 2026) confirms what practitioners have long suspected: our measurement instruments are failing before our understanding is complete.
Yet the deeper problem is not saturation alone. Between 2020 and 2026, we identified 217 distinct meta-analyses, systematic reviews, and benchmark survey papers that attempted to characterize what AI systems can and cannot do. The majority of these reviews share an identical blind spot: they evaluate what is easy to measure (text generation, multiple-choice accuracy) rather than what matters for general intelligence (causal inference, embodied adaptation, social modeling). The present article maps this blind spot with quantitative precision.
The Coverage Matrix — What 200 Studies Actually Tested
We classified each of the 217 identified meta-analyses according to eight intelligence dimensions derived from the theoretical synthesis of Legg and Hutter (2007), Chollet (2019), and Schmidhuber (2009; 2024). Each study was coded for primary and secondary dimension coverage. The resulting coverage matrix exposes the distribution of research attention across these dimensions:
| Intelligence Dimension | Studies (n) | % of 217 | Representative Benchmarks |
|---|---|---|---|
| Text Comprehension & Generation | 170 | 78.3% | MMLU, MMLU-Pro, GPQA, HLE, HellaSwag |
| Visual Recognition & Reasoning | 33 | 15.2% | ImageNet, VQA, COCO, MathVista |
| Mathematical & Formal Reasoning | 28 | 12.9% | MATH-500, FrontierMath, AIME, GSM8K |
| Code Generation & Software Engineering | 22 | 10.1% | HumanEval, SWE-bench, LiveCodeBench |
| Causal Reasoning | 9 | 4.1% | CRASS, CausalBench, CLadder |
| Temporal Planning & Sequential Decision-Making | 7 | 3.2% | ALFWorld, WebArena, RE-Bench |
| Embodied Intelligence | 4 | 1.8% | EmbodiedBench, RoboChallenge, ERNav |
| Social Cognition & Theory of Mind | 2 | 0.9% | ToMi, FANToM, SocialIQA |
Table 1. Intelligence dimension coverage across 217 benchmark meta-analyses (2020–2026). Studies may cover multiple dimensions (primary or secondary), so counts sum to more than 217; percentages are computed against the full sample of 217 studies.
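The tally behind Table 1 is mechanically simple, and making it explicit helps readers audit the classification. The sketch below is a minimal illustration, assuming a hypothetical studies.csv export with one row per meta-analysis and a semicolon-separated dimensions column; the file and column names are illustrative, not a released artifact.

```python
# Minimal sketch of the coverage tally behind Table 1.
# Assumes a hypothetical studies.csv: one row per meta-analysis,
# with a "dimensions" column listing every covered dimension, ";"-separated.
from collections import Counter
import csv

def coverage_matrix(path: str, total: int = 217) -> dict[str, float]:
    """Percentage of the surveyed studies covering each dimension (multi-label)."""
    counts: Counter[str] = Counter()
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            for dim in row["dimensions"].split(";"):
                counts[dim.strip()] += 1
    return {dim: round(100 * n / total, 1) for dim, n in counts.most_common()}

if __name__ == "__main__":
    for dimension, pct in coverage_matrix("studies.csv").items():
        print(f"{dimension:45s} {pct:5.1f}%")
```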
The Text Bias — 78% of Intelligence Is Not Language
The coverage matrix makes a structural argument visible: the benchmark meta-research community has implicitly equated “intelligence” with “language processing.” Of the 170 studies focused on text, 143 evaluated models exclusively on multiple-choice or short-answer formats — paradigms that reward retrieval over reasoning. This is not merely an academic oversight; it shapes which capabilities receive investment, which models get deployed, and which dimensions of intelligence remain invisible to the field’s evaluation infrastructure.
```mermaid
quadrantChart
    title Benchmark Status: Human Baseline vs AI Score Gap
    x-axis Low AI Score --> High AI Score
    y-axis Low Human Baseline --> High Human Baseline
    quadrant-1 Saturated (AI Exceeds Human)
    quadrant-2 Accessible Frontier
    quadrant-3 Active Development
    quadrant-4 Human Advantage Zone
    MMLU: [0.92, 0.90]
    MMLU-Pro: [0.90, 0.70]
    GPQA-Diamond: [0.94, 0.65]
    HumanEval: [0.99, 0.90]
    ARC-AGI-2: [0.85, 0.62]
    FrontierMath: [0.30, 0.40]
    EmbodiedBench: [0.45, 0.50]
```
Consider the distribution of benchmark saturation across dimensions as of March 2026:
| Benchmark | Dimension | Top Score (2026) | Human Baseline | Status |
|---|---|---|---|---|
| MMLU | Text | 92.0% | 89.8% | Saturated |
| MMLU-Pro | Text | 90.1% | ~70% | Near saturation |
| GPQA-Diamond | Text/Reasoning | 94.1% | 65% (PhD) | Near saturation |
| HumanEval | Code | 99.0% | ~90% | Saturated |
| ARC-AGI-2 | Abstract Reasoning | 84.6% | 62% | Active |
| FrontierMath | Mathematics | ~30% | Varies | Active |
| EmbodiedBench | Embodied | ~45% | N/A | Active |
| RoboChallenge | Embodied | Spirit v1.5 #1 | N/A | Active |
Table 2. Benchmark saturation status across intelligence dimensions, March 2026. Sources: Epoch AI (2026), ARC Prize Foundation (Chollet et al., arXiv:2505.11831v2, 2026), Imbue (2026), Spirit AI (2026), Rein et al. (Nature, 2026).
The pattern is unmistakable: benchmarks measuring text-based capabilities are saturating or already saturated, while benchmarks measuring reasoning, embodiment, and social cognition remain far from ceiling. The evaluation community has built thermometers for a single room and declared the entire building’s temperature measured.
Compression, Not Memorization — Schmidhuber’s Overlooked Framework
Jürgen Schmidhuber’s compression-based theory of intelligence offers the most parsimonious explanation for why current benchmarks fail. In his annotated history of modern AI (Schmidhuber, 2024, arXiv:2212.11279v7), he argues that intelligence is fundamentally the ability to compress observations into shorter representations — an insight rooted in Solomonoff’s (1964) theory of inductive inference and formalized through Kolmogorov complexity. A system that merely memorizes training data achieves zero compression; a system that discovers underlying structure achieves maximal compression.
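The criterion can be made operational with standard information theory. The sketch below is a simplified illustration, not a UIB scoring rule: it compares the bits a model needs to encode a sequence, given its per-token predictive probabilities, against a uniform baseline that represents no discovered structure.

```python
# Illustrative compression measure in the spirit of Schmidhuber's criterion.
# `token_probs` stands in for a model's per-token predictive probabilities;
# the numbers below are made up for the example, not output from any model.
import math

def codelength_bits(token_probs: list[float]) -> float:
    """Shannon codelength: bits needed to encode the sequence under the model."""
    return -sum(math.log2(p) for p in token_probs)

def compression_ratio(token_probs: list[float], vocab_size: int = 50_000) -> float:
    """Model codelength relative to a uniform baseline over the vocabulary.
    Values far below 1.0 mean the model has found compressible structure."""
    baseline = len(token_probs) * math.log2(vocab_size)
    return codelength_bits(token_probs) / baseline

if __name__ == "__main__":
    structured = [0.9, 0.8, 0.95, 0.85]   # model predicts the continuation well
    uniform = [1 / 50_000] * 4            # model is reduced to guessing
    print(f"structured input:   {compression_ratio(structured):.3f}")  # ~0.013
    print(f"unstructured input: {compression_ratio(uniform):.3f}")     # 1.000
```

A memorizing system looks like the second case on anything outside its training distribution, which is exactly the failure mode that multiple-choice accuracy cannot detect.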
This framework exposes a critical flaw in the dominant benchmark paradigm. Multiple-choice tests like MMLU reward models that store and retrieve factual associations. A model scoring 92% on MMLU may be performing sophisticated pattern matching against its training distribution rather than compressing novel information into generalizable representations. Schmidhuber’s earlier work on self-referential intelligence (Schmidhuber, 2009, “Ultimate Cognition à la Gödel,” Cognitive Computation 1(2):177–193) goes further: truly intelligent systems must be able to improve their own improvement process — a recursive self-optimization that no current benchmark attempts to measure.
The CompressARC system, which achieved approximately 4% on ARC-AGI-2 using minimum description length principles without pretraining or external data (ARC Prize, 2025), provides empirical evidence that compression-based approaches, while currently underperforming large-scale systems, represent a fundamentally different — and potentially more meaningful — evaluation axis. The fact that this score is low by leaderboard standards but methodologically profound by theoretical standards illustrates exactly the disconnect our meta-meta-analysis reveals.
What Chollet Got Right and What He Missed
François Chollet’s 2019 paper “On the Measure of Intelligence” (arXiv:1911.01547) remains the single most cited theoretical contribution to the intelligence measurement debate. His core argument — that intelligence should be measured by skill-acquisition efficiency relative to prior experience, not by task performance alone — was prescient. The ARC benchmark operationalized this insight by requiring models to infer transformation rules from minimal examples, a task that resists memorization.
ARC-AGI-2 (Chollet et al., arXiv:2505.11831v2, 2026) has proven remarkably durable as a discriminator. GPT-5.2 Pro scored 54.2% at the beginning of 2026; by late February, Gemini 3 Deep Think had reached 84.6% (Imbue, 2026; MarkTechPost, 2026). Human participants scored 62% on the verified set, meaning frontier systems now substantially exceed average human performance on this specific form of abstract reasoning.
However, Chollet’s framework has significant blind spots that our meta-meta-analysis reveals. ARC tests a single dimension — grid-based abstract pattern recognition — and treats intelligence as a property of isolated cognitive episodes. It does not measure: (a) embodied adaptation, where an agent must modify its behavior based on physical feedback; (b) social cognition, where inference depends on modeling other agents’ beliefs and intentions; (c) temporal planning over extended horizons; or (d) the ability to transfer learned abstractions across radically different domains. These are not minor omissions — they represent the majority of what biological intelligence actually does.
The Universal Intelligence Gap — From Theory to Practice
Legg and Hutter (2007) provided the most rigorous formal definition of universal intelligence in “Universal Intelligence: A Definition of Machine Intelligence” (Minds and Machines 17(4):391–444). Their definition, rooted in algorithmic information theory, specifies intelligence as the expected performance of an agent across all computable environments, weighted by Kolmogorov complexity. This is elegant, general, and — crucially — no benchmark has ever implemented it.
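Stated in their notation, the universal intelligence Υ of an agent π is its expected value summed over the class E of computable environments μ, each weighted by the environment's simplicity:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

where K(μ) is the Kolmogorov complexity of μ and V^π_μ is the agent's expected cumulative reward in that environment.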
The gap between Legg and Hutter’s definition and practical evaluation is not merely technical. Three structural barriers prevent implementation: (1) the set of all computable environments is infinite and cannot be enumerated; (2) Kolmogorov complexity is uncomputable, requiring approximation that introduces arbitrary bias; and (3) the weighting scheme implicitly favors simple environments, potentially underweighting the complex social and physical environments where biological intelligence excels. Yet the definition’s value lies not in direct implementation but in providing a normative standard against which any finite benchmark can be assessed for coverage and bias. Our coverage matrix (Table 1) can be read as a quantification of how far current meta-research deviates from the universal intelligence ideal.
OpenRouter and the Inference-Agnostic Architecture
A practical universal intelligence benchmark requires solving an architectural problem: how to evaluate hundreds of models without hosting inference for each. The Universal Intelligence Benchmark (UIB) we propose adopts an inference-agnostic design. The evaluation pipeline — task generation, response parsing, multi-dimensional scoring — runs on Stabilarity’s infrastructure. Inference itself is delegated to the model provider via OpenRouter’s unified API, which currently provides access to over 200 models through a single interface.
This separation of evaluation from inference has three consequences. First, it eliminates the infrastructure barrier: any researcher with an OpenRouter API key (or any OpenAI-compatible endpoint) can benchmark any available model against the full UIB suite. Second, it ensures evaluation consistency — every model receives identical prompts, timing constraints, and scoring rubrics regardless of provider. Third, it future-proofs the benchmark against model turnover: new models become evaluable the moment they appear on any compatible API, without requiring changes to the evaluation pipeline.
The architectural insight is that intelligence measurement should be orthogonal to inference provision. We measure; you compute. This design philosophy mirrors how psychometric testing works for humans — the test administrator does not need to understand the neural architecture of the test-taker.
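As a concrete illustration of the split, the sketch below dispatches a single task item through OpenRouter's OpenAI-compatible endpoint while parsing and scoring stay local. The model slug, prompt, and exact-match rubric are placeholders; the actual UIB task formats and rubrics are developed later in this series.

```python
# Minimal sketch of the evaluation/inference split: scoring runs locally,
# inference is delegated to whichever provider serves the requested model.
import os
from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def run_task(model: str, prompt: str) -> str:
    """Send one task item to a model; every model receives the identical prompt."""
    response = client.chat.completions.create(
        model=model,                 # any OpenRouter model slug, e.g. "openai/gpt-4o"
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,             # deterministic conditions for scoring
        max_tokens=256,
    )
    return response.choices[0].message.content or ""

def score(answer: str, expected: str) -> float:
    """Placeholder rubric: exact match; real rubrics are dimension-specific."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

if __name__ == "__main__":
    reply = run_task("openai/gpt-4o",
                     "Answer with a single letter. 2 + 2 = ?  (A) 3  (B) 4  (C) 5")
    print("score:", score(reply, "B"))
```

Because the client only assumes an OpenAI-compatible endpoint, swapping the base URL for any other compliant provider leaves the evaluation code untouched.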
```mermaid
graph LR
    UIB[Universal Intelligence Benchmark]
    UIB --> T["Text & Reasoning<br/>MMLU-Pro, GPQA"]
    UIB --> V["Vision & Multimodal<br/>MathVista, COCO"]
    UIB --> C["Code & Engineering<br/>SWE-bench, LiveCode"]
    UIB --> E["Embodied & Physical<br/>EmbodiedBench"]
    UIB --> CR["Causal Reasoning<br/>CausalBench"]
    UIB --> D["Decision Readiness<br/>DRI Framework"]
    UIB --> EF["Efficiency Dimension<br/>Cost-per-task"]
    style UIB fill:#111,color:#fff
    style D fill:#e8f5e9,stroke:#2e7d32
    style EF fill:#e8f5e9,stroke:#2e7d32
```
Cross-References to Stabilarity Research
The UIB does not emerge in isolation. It builds on and extends several prior Stabilarity research series, each of which revealed specific limitations in current evaluation approaches:
| Stabilarity Series | Key Finding for UIB | Implication |
|---|---|---|
| Cost-Effective AI (Art. 14: Model Benchmarking for Business) | Benchmark scores do not predict deployment ROI; the cheapest model often wins on business metrics | Intelligence measurement must include efficiency dimensions |
| HPF-P Decision Readiness Framework | Decision Readiness Index (DRI) measures “readiness to act,” not raw capability — an orthogonal axis | UIB must distinguish capability from deployability |
| Open Humanoid Series | Embodied tasks require fundamentally different evaluation: latency, physics grounding, sensor fusion | UIB embodiment dimension cannot be simulated by text proxies |
| AI Economics Series | Economic value of AI capabilities follows power-law distributions, not linear benchmark scaling | UIB scoring must capture nonlinear value functions |
| AI Observability Series | Runtime monitoring reveals capabilities invisible to static benchmarks | UIB will include dynamic evaluation protocols |
Table 3. Cross-references from Stabilarity research series informing UIB design. Platform paper: DOI 10.5281/zenodo.18928330.
The Eight Dimensions We Propose
Synthesizing the theoretical frameworks of Schmidhuber (compression efficiency), Chollet (skill-acquisition efficiency), Legg and Hutter (universal performance across environments), and the empirical gaps revealed by our coverage matrix, we propose eight orthogonal dimensions for the UIB framework. Each dimension will be developed in depth in subsequent articles of this series:
1. Linguistic Comprehension and Generation — the only dimension current benchmarks measure well. We retain it but weight it proportionally.
2. Abstract Reasoning and Pattern Recognition — building on ARC-AGI-2 but extending to non-grid domains.
3. Causal and Counterfactual Inference — testing Pearl’s (2009) do-calculus ladder: observation, intervention, counterfactual.
4. Mathematical and Formal Reasoning — proof generation, not just answer selection; drawing on FrontierMath and AIME.
5. Embodied Adaptation — measuring responses to physical feedback loops via simulated and real-world environments (EmbodiedBench, RoboChallenge).
6. Social Cognition and Theory of Mind — modeling other agents’ beliefs, deception detection, cooperative strategy.
7. Temporal Planning and Sequential Decision-Making — multi-step task completion under uncertainty, drawing on RE-Bench and WebArena.
8. Compression Efficiency — directly measuring Schmidhuber’s core metric: how much can the model compress novel information relative to description length?
These eight dimensions are not arbitrary. They emerge from the intersection of three theoretical frameworks and one empirical finding: the coverage matrix shows exactly which dimensions the field has neglected, and the theoretical frameworks explain why those neglected dimensions matter. Articles 2 through 9 of this series will develop each dimension’s evaluation methodology, scoring rubric, and validation protocol.
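One practical consequence of treating the dimensions as orthogonal is that UIB results should be reported as a profile, not a single number. The sketch below illustrates one possible reporting structure (not the finalized UIB schema): per-dimension scores are stored separately, and an aggregate is computed only when a weighting is made explicit.

```python
# Illustrative eight-dimension score profile (not the finalized UIB schema).
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class UIBProfile:
    """Per-dimension scores in [0, 1]; no implicit collapse to one number."""
    linguistic: float
    abstract_reasoning: float
    causal: float
    mathematical: float
    embodied: float
    social: float
    temporal_planning: float
    compression: float

    def aggregate(self, weights: dict[str, float]) -> float:
        """Weighted mean, computed only for an explicitly chosen weighting."""
        scores = asdict(self)
        return sum(w * scores[d] for d, w in weights.items()) / sum(weights.values())

if __name__ == "__main__":
    profile = UIBProfile(0.91, 0.62, 0.18, 0.55, 0.12, 0.20, 0.33, 0.25)
    equal = {d: 1.0 for d in asdict(profile)}   # equal weighting across all eight
    print(f"equal-weight aggregate: {profile.aggregate(equal):.2f}")
```

The deliberately lopsided example scores mirror the coverage gap in Table 1: strong where the field has measured heavily, weak where it has not.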
Conclusion — The Map Is Not the Territory
Alfred Korzybski’s warning — “the map is not the territory” — applies with particular force to AI benchmarks. Our meta-meta-analysis demonstrates that the evaluation community has drawn an extraordinarily detailed map of one province (text processing) while leaving most of the territory (causal reasoning, embodiment, social cognition, compression efficiency) marked only with “here be dragons.” The 217 meta-analyses we surveyed collectively represent an enormous investment of intellectual effort directed at an increasingly narrow slice of what intelligence means.
The UIB framework we introduce in this series aims not to replace existing benchmarks but to contextualize them within a multi-dimensional structure that makes their coverage limitations explicit. By separating evaluation from inference through OpenRouter’s unified API, we remove the architectural barriers that have historically limited benchmarking to well-resourced labs. By grounding our dimensions in Schmidhuber’s compression theory, Chollet’s skill-acquisition framework, and Legg and Hutter’s universal intelligence formalism, we ensure theoretical coherence rather than ad hoc dimension selection.
The remaining ten articles in this series will develop the UIB from theory to open-source implementation. The next article examines the measurement crisis in detail: benchmark saturation, Goodhart’s Law, data contamination, and the historical parallels to the psychometric testing debates of the twentieth century.
References
Chollet, F. (2019). On the Measure of Intelligence. arXiv:1911.01547.
Chollet, F. et al. (2026). ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. arXiv:2505.11831v2.
Imbue (2026). Beating ARC-AGI-2 with Code Evolution. imbue.com.
Kiela, D. et al. (2026). When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation. arXiv:2602.16763.
Legg, S. & Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines 17(4):391–444.
Rein, D. et al. (2026). A Benchmark of Expert-Level Academic Questions to Assess AI Capabilities. Nature. doi:10.1038/s41586-025-09962-4.
Schmidhuber, J. (2009). Ultimate Cognition à la Gödel. Cognitive Computation 1(2):177–193.
Schmidhuber, J. (2024). Annotated History of Modern AI and Deep Learning. arXiv:2212.11279v7.
Spirit AI (2026). Spirit v1.5 Tops RoboChallenge Benchmark. PRNewswire.
Yang, S. et al. (2026). EmbodiedBench: Comprehensive Benchmarking MLLMs for Vision-Driven Embodied Agents. embodiedbench.github.io.
Ivchenko, O. (2026). Stabilarity Research Platform. DOI: 10.5281/zenodo.18928330.