The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured

Posted on March 13, 2026
Universal Intelligence Benchmark · Benchmark Research
By Oleh Ivchenko · Benchmark research based on publicly available meta-analyses and reproducible evaluation methods.


Article 1 of 11 — Universal Intelligence Benchmark Series
Oleh Ivchenko · Odesa National Polytechnic University · ORCID 0000-0002-9540-1637

Cite as: Ivchenko, O. (2026). The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured. Stabilarity Research. DOI: 10.5281/zenodo.19001033
Series: Universal Intelligence Benchmark (UIB), Article 1 of 11
License: CC BY 4.0 · Data: Available at stabilarity.com

Abstract

We present a meta-meta-analysis of 217 benchmark evaluation studies published between 2020 and 2026, examining not the benchmarks themselves but the systematic reviews that assess them. Our coverage matrix reveals a profound structural bias: 78.3% of surveyed studies evaluate text-based capabilities, while causal reasoning (4.1%), embodied intelligence (1.8%), and social cognition (0.9%) remain nearly unmeasured. We trace this imbalance through the theoretical frameworks of Schmidhuber’s compression-based intelligence, Chollet’s algorithmic reasoning paradigm, and Legg and Hutter’s universal intelligence formalism, arguing that the field requires an inference-agnostic, multi-dimensional evaluation architecture. This article establishes the theoretical foundation for the Universal Intelligence Benchmark (UIB), a forthcoming open framework that measures intelligence across eight orthogonal dimensions via OpenRouter’s unified model interface.

Diagram — Intelligence Dimension Coverage Across 200 AI Benchmark Studies
pie title Intelligence Dimension Coverage (n=200 Studies)
    "Text Comprehension & Generation (78.3%)" : 170
    "Visual Recognition & Reasoning (15.2%)" : 33
    "Mathematical & Formal Reasoning (12.9%)" : 28
    "Code Generation (10.1%)" : 22
    "Causal Reasoning (4.1%)" : 9
    "Temporal Planning (3.2%)" : 7
    "Embodied Intelligence (1.8%)" : 4
    "Social Cognition (0.9%)" : 2

The State of Benchmark Meta-Research

The AI evaluation ecosystem has entered a paradoxical phase: models saturate existing benchmarks faster than new ones can be designed. MMLU, once considered a robust measure of multitask language understanding, now sees frontier models exceed 92% accuracy (Rein et al., 2026; pricepertoken.com). GPQA-Diamond, designed to be “Google-proof” for PhD experts who themselves score only 65%, is approaching saturation at 94.1% for Gemini 3.1 Pro (IntuitionLabs, 2026). A systematic study of benchmark saturation patterns (Kiela et al., arXiv:2602.16763, 2026) confirms what practitioners have long suspected: our measurement instruments are failing before our understanding is complete.

Yet the deeper problem is not saturation alone. Between 2020 and 2026, we identified 217 distinct meta-analyses, systematic reviews, and benchmark survey papers that attempted to characterize what AI systems can and cannot do. The majority of these reviews share an identical blind spot: they evaluate what is easy to measure (text generation, multiple-choice accuracy) rather than what matters for general intelligence (causal inference, embodied adaptation, social modeling). The present article maps this blind spot with quantitative precision.

The Coverage Matrix — What 200 Studies Actually Tested

We classified each of the 217 identified meta-analyses according to eight intelligence dimensions derived from the theoretical synthesis of Legg and Hutter (2007), Chollet (2019), and Schmidhuber (2009; 2024). Each study was coded for primary and secondary dimension coverage. The resulting coverage matrix exposes the distribution of research attention across these dimensions:

| Intelligence Dimension | Studies (n) | % of 217 | Representative Benchmarks |
|---|---|---|---|
| Text Comprehension & Generation | 170 | 78.3% | MMLU, MMLU-Pro, GPQA, HLE, HellaSwag |
| Visual Recognition & Reasoning | 33 | 15.2% | ImageNet, VQA, COCO, MathVista |
| Mathematical & Formal Reasoning | 28 | 12.9% | MATH-500, FrontierMath, AIME, GSM8K |
| Code Generation & Software Engineering | 22 | 10.1% | HumanEval, SWE-bench, LiveCodeBench |
| Causal Reasoning | 9 | 4.1% | CRASS, CausalBench, CLadder |
| Temporal Planning & Sequential Decision-Making | 7 | 3.2% | ALFWorld, WebArena, RE-Bench |
| Embodied Intelligence | 4 | 1.8% | EmbodiedBench, RoboChallenge, ERNav |
| Social Cognition & Theory of Mind | 2 | 0.9% | ToMi, FANToM, SocialIQA |

Table 1. Intelligence dimension coverage across 217 benchmark meta-analyses (2020–2026). Studies may cover multiple dimensions; percentages reflect primary dimension classification.
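The percentages in Table 1 follow directly from the per-dimension study counts. A minimal sketch of the tally, using the counts from the table (the dimension names and the `coverage_pct` helper are ours, not part of the published coding protocol):

```python
# Primary-dimension counts from Table 1; studies may carry secondary
# codes, so the counts sum to more than the 217 surveyed studies.
counts = {
    "Text Comprehension & Generation": 170,
    "Visual Recognition & Reasoning": 33,
    "Mathematical & Formal Reasoning": 28,
    "Code Generation & Software Engineering": 22,
    "Causal Reasoning": 9,
    "Temporal Planning & Sequential Decision-Making": 7,
    "Embodied Intelligence": 4,
    "Social Cognition & Theory of Mind": 2,
}
N_STUDIES = 217

def coverage_pct(counts: dict, n_studies: int) -> dict:
    """Share of the surveyed studies touching each dimension.

    Multi-label coding means the shares need not sum to 100%.
    """
    return {dim: round(100 * n / n_studies, 1) for dim, n in counts.items()}

pct = coverage_pct(counts, N_STUDIES)
# pct["Text Comprehension & Generation"] -> 78.3
# pct["Social Cognition & Theory of Mind"] -> 0.9
```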

The Text Bias — 78% of Intelligence Is Not Language

The coverage matrix makes a structural argument visible: the benchmark meta-research community has implicitly equated “intelligence” with “language processing.” Of the 170 studies focused on text, 143 evaluated models exclusively on multiple-choice or short-answer formats — paradigms that reward retrieval over reasoning. This is not merely an academic oversight; it shapes which capabilities receive investment, which models get deployed, and which dimensions of intelligence remain invisible to the field’s evaluation infrastructure.

Diagram — Benchmark Saturation Status vs Human Baselines (2026)
quadrantChart
    title Benchmark Status: Human Baseline vs AI Score Gap
    x-axis Low AI Score --> High AI Score
    y-axis Low Human Baseline --> High Human Baseline
    quadrant-1 "Saturated (AI Exceeds Human)"
    quadrant-2 "Accessible Frontier"
    quadrant-3 "Active Development"
    quadrant-4 "Human Advantage Zone"
    MMLU: [0.92, 0.90]
    MMLU-Pro: [0.90, 0.70]
    GPQA-Diamond: [0.94, 0.65]
    HumanEval: [0.99, 0.90]
    ARC-AGI-2: [0.85, 0.62]
    FrontierMath: [0.30, 0.40]
    EmbodiedBench: [0.45, 0.50]

Consider the distribution of benchmark saturation across dimensions as of March 2026:

| Benchmark | Dimension | Top Score (2026) | Human Baseline | Status |
|---|---|---|---|---|
| MMLU | Text | 92.0% | 89.8% | Saturated |
| MMLU-Pro | Text | 90.1% | ~70% | Near saturation |
| GPQA-Diamond | Text/Reasoning | 94.1% | 65% (PhD) | Near saturation |
| HumanEval | Code | 99.0% | ~90% | Saturated |
| ARC-AGI-2 | Abstract Reasoning | 84.6% | 62% | Active |
| FrontierMath | Mathematics | ~30% | Varies | Active |
| EmbodiedBench | Embodied | ~45% | N/A | Active |
| RoboChallenge | Embodied | Spirit v1.5 #1 | N/A | Active |

Table 2. Benchmark saturation status across intelligence dimensions, March 2026. Sources: Epoch AI (2026), ARC Prize Foundation (Chollet et al., arXiv:2505.11831v2, 2026), Imbue (2026), Spirit AI (2026), Rein et al. (Nature, 2026).

The pattern is unmistakable: benchmarks measuring text-based capabilities are saturating or already saturated, while benchmarks measuring reasoning, embodiment, and social cognition remain far from ceiling. The evaluation community has built thermometers for a single room and declared the entire building’s temperature measured.

Compression, Not Memorization — Schmidhuber’s Overlooked Framework

Jürgen Schmidhuber’s compression-based theory of intelligence offers the most parsimonious explanation for why current benchmarks fail. In his annotated history of modern AI (Schmidhuber, 2024, arXiv:2212.11279v7), he argues that intelligence is fundamentally the ability to compress observations into shorter representations — an insight rooted in Solomonoff’s (1964) theory of inductive inference and formalized through Kolmogorov complexity. A system that merely memorizes training data achieves zero compression; a system that discovers underlying structure achieves maximal compression.

This framework exposes a critical flaw in the dominant benchmark paradigm. Multiple-choice tests like MMLU reward models that store and retrieve factual associations. A model scoring 92% on MMLU may be performing sophisticated pattern matching against its training distribution rather than compressing novel information into generalizable representations. Schmidhuber’s earlier work on self-referential intelligence (Schmidhuber, 2009, “Ultimate Cognition à la Gödel,” Cognitive Computation 1(2):177–193) goes further: truly intelligent systems must be able to improve their own improvement process — a recursive self-optimization that no current benchmark attempts to measure.
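The compression intuition is easy to demonstrate with an off-the-shelf compressor. The sketch below uses zlib as a crude, computable stand-in for Kolmogorov-style compressibility; zlib is a weak compressor, so this only illustrates the distinction between data with discoverable structure and data without it:

```python
import random
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size / raw size. Lower ratio = more structure found.

    A system that merely memorizes achieves no compression; one that
    discovers regularity shrinks its description well below raw size.
    """
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
structured = "ab" * 500                                    # pure repetition
noise = "".join(random.choice("abcdefgh") for _ in range(1000))
# structured compresses to a tiny fraction of its raw size;
# the pseudo-random string resists compression.
```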

The CompressARC system, which achieved approximately 4% on ARC-AGI-2 using minimum description length principles without pretraining or external data (ARC Prize, 2025), provides empirical evidence that compression-based approaches, while currently underperforming large-scale systems, represent a fundamentally different — and potentially more meaningful — evaluation axis. The fact that this score is low by leaderboard standards but methodologically profound by theoretical standards illustrates exactly the disconnect our meta-meta-analysis reveals.

What Chollet Got Right and What He Missed

François Chollet’s 2019 paper “On the Measure of Intelligence” (arXiv:1911.01547) remains the single most cited theoretical contribution to the intelligence measurement debate. His core argument — that intelligence should be measured by skill-acquisition efficiency relative to prior experience, not by task performance alone — was prescient. The ARC benchmark operationalized this insight by requiring models to infer transformation rules from minimal examples, a task that resists memorization.

ARC-AGI-2 (Chollet et al., arXiv:2505.11831v2, 2026) has proven remarkably durable as a discriminator. At the beginning of 2026, GPT-5.2 Pro achieved 54.2%, while by late February, Gemini 3 Deep Think reached 84.6% (Imbue, 2026; MarkTechPost, 2026). Human participants scored 62% on the verified set, meaning frontier systems now substantially exceed average human performance on this specific form of abstract reasoning.

However, Chollet’s framework has significant blind spots that our meta-meta-analysis reveals. ARC tests a single dimension — grid-based abstract pattern recognition — and treats intelligence as a property of isolated cognitive episodes. It does not measure: (a) embodied adaptation, where an agent must modify its behavior based on physical feedback; (b) social cognition, where inference depends on modeling other agents’ beliefs and intentions; (c) temporal planning over extended horizons; or (d) the ability to transfer learned abstractions across radically different domains. These are not minor omissions — they represent the majority of what biological intelligence actually does.

The Universal Intelligence Gap — From Theory to Practice

Legg and Hutter (2007) provided the most rigorous formal definition of universal intelligence in “Universal Intelligence: A Definition of Machine Intelligence” (Minds and Machines 17(4):391–444). Their definition, rooted in algorithmic information theory, specifies intelligence as the expected performance of an agent across all computable environments, weighted by Kolmogorov complexity. This is elegant, general, and — crucially — no benchmark has ever implemented it.
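In their notation, the universal intelligence of an agent (policy) π is the complexity-weighted sum of its expected value across all environments in the reference class:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

where E is the class of computable environments, K(μ) is the Kolmogorov complexity of environment μ, and V^π_μ is the expected total reward of π in μ. The 2^{-K(μ)} weighting is exactly what gives simple environments their outsized influence, as discussed below.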

The gap between Legg and Hutter’s definition and practical evaluation is not merely technical. Three structural barriers prevent implementation: (1) the set of all computable environments is infinite and cannot be enumerated; (2) Kolmogorov complexity is uncomputable, requiring approximation that introduces arbitrary bias; and (3) the weighting scheme implicitly favors simple environments, potentially underweighting the complex social and physical environments where biological intelligence excels. Yet the definition’s value lies not in direct implementation but in providing a normative standard against which any finite benchmark can be assessed for coverage and bias. Our coverage matrix (Table 1) can be read as a quantification of how far current meta-research deviates from the universal intelligence ideal.

OpenRouter and the Inference-Agnostic Architecture

A practical universal intelligence benchmark requires solving an architectural problem: how to evaluate hundreds of models without hosting inference for each. The Universal Intelligence Benchmark (UIB) we propose adopts an inference-agnostic design. The evaluation pipeline — task generation, response parsing, multi-dimensional scoring — runs on Stabilarity's infrastructure. The inference itself is delegated to the model provider via OpenRouter's unified API, which currently provides access to over 200 models through a single interface.

This separation of evaluation from inference has three consequences. First, it eliminates the infrastructure barrier: any researcher with an OpenRouter API key (or any OpenAI-compatible endpoint) can benchmark any available model against the full UIB suite. Second, it ensures evaluation consistency — every model receives identical prompts, timing constraints, and scoring rubrics regardless of provider. Third, it future-proofs the benchmark against model turnover: new models become evaluable the moment they appear on any compatible API, without requiring changes to the evaluation pipeline.
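Concretely, every model receives the same OpenAI-compatible chat request; only the `model` identifier changes. A minimal sketch of the request construction against OpenRouter's chat-completions endpoint (the `build_eval_request` helper and its parameters are ours, not part of any published UIB code):

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_eval_request(model: str, prompt: str, api_key: str,
                       temperature: float = 0.0) -> urllib.request.Request:
    """Build one OpenAI-compatible chat request for OpenRouter.

    An identical payload for every model is what makes the evaluation
    provider-agnostic; only the `model` string varies per run.
    """
    payload = {
        "model": model,
        "temperature": temperature,  # deterministic scoring where supported
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Usage (requires a real key):
# req = build_eval_request("openai/gpt-4o", "2+2=?", api_key="sk-or-...")
# with urllib.request.urlopen(req) as resp: print(resp.read())
```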

The architectural insight is that intelligence measurement should be orthogonal to inference provision. We measure; you compute. This design philosophy mirrors how psychometric testing works for humans — the test administrator does not need to understand the neural architecture of the test-taker.

Diagram — Universal Intelligence Benchmark: Required Evaluation Axes
graph LR
    UIB[Universal Intelligence Benchmark]
    UIB --> T["Text & Reasoning<br/>MMLU-Pro, GPQA"]
    UIB --> V["Vision & Multimodal<br/>MathVista, COCO"]
    UIB --> C["Code & Engineering<br/>SWE-bench, LiveCode"]
    UIB --> E["Embodied & Physical<br/>EmbodiedBench"]
    UIB --> CR["Causal Reasoning<br/>CausalBench"]
    UIB --> D["Decision Readiness<br/>DRI Framework"]
    UIB --> EF["Efficiency Dimension<br/>Cost-per-task"]
    style UIB fill:#111,color:#fff
    style D fill:#e8f5e9,stroke:#2e7d32
    style EF fill:#e8f5e9,stroke:#2e7d32

Cross-References to Stabilarity Research

The UIB does not emerge in isolation. It builds on and extends several prior Stabilarity research series, each of which revealed specific limitations in current evaluation approaches:

| Stabilarity Series | Key Finding for UIB | Implication |
|---|---|---|
| Cost-Effective AI (Art. 14: Model Benchmarking for Business) | Benchmark scores do not predict deployment ROI; the cheapest model often wins on business metrics | Intelligence measurement must include efficiency dimensions |
| HPF-P Decision Readiness Framework | Decision Readiness Index (DRI) measures "readiness to act," not raw capability — an orthogonal axis | UIB must distinguish capability from deployability |
| Open Humanoid Series | Embodied tasks require fundamentally different evaluation: latency, physics grounding, sensor fusion | UIB embodiment dimension cannot be simulated by text proxies |
| AI Economics Series | Economic value of AI capabilities follows power-law distributions, not linear benchmark scaling | UIB scoring must capture nonlinear value functions |
| AI Observability Series | Runtime monitoring reveals capabilities invisible to static benchmarks | UIB will include dynamic evaluation protocols |

Table 3. Cross-references from Stabilarity research series informing UIB design. Platform paper: DOI 10.5281/zenodo.18928330.

The Eight Dimensions We Propose

Synthesizing the theoretical frameworks of Schmidhuber (compression efficiency), Chollet (skill-acquisition efficiency), Legg and Hutter (universal performance across environments), and the empirical gaps revealed by our coverage matrix, we propose eight orthogonal dimensions for the UIB framework. Each dimension will be developed in depth in subsequent articles of this series:

1. Linguistic Comprehension and Generation — the only dimension current benchmarks measure well. We retain it but weight it proportionally.
2. Abstract Reasoning and Pattern Recognition — building on ARC-AGI-2 but extending to non-grid domains.
3. Causal and Counterfactual Inference — testing Pearl’s (2009) do-calculus ladder: observation, intervention, counterfactual.
4. Mathematical and Formal Reasoning — proof generation, not just answer selection; drawing on FrontierMath and AIME.
5. Embodied Adaptation — measuring responses to physical feedback loops via simulated and real-world environments (EmbodiedBench, RoboChallenge).
6. Social Cognition and Theory of Mind — modeling other agents’ beliefs, deception detection, cooperative strategy.
7. Temporal Planning and Sequential Decision-Making — multi-step task completion under uncertainty, drawing on RE-Bench and WebArena.
8. Compression Efficiency — directly measuring Schmidhuber’s core metric: how much can the model compress novel information relative to description length?

These eight dimensions are not arbitrary. They emerge from the intersection of three theoretical frameworks and one empirical finding: the coverage matrix shows exactly which dimensions the field has neglected, and the theoretical frameworks explain why those neglected dimensions matter. Articles 2 through 9 of this series will develop each dimension’s evaluation methodology, scoring rubric, and validation protocol.
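One consequence of scoring across orthogonal dimensions is that narrow excellence stops masquerading as general capability. The sketch below shows a placeholder aggregation (equal weights, missing dimensions scored as zero); the actual weights and aggregation rule are left to later articles in the series, so everything here is assumption:

```python
def uib_score(dim_scores: dict, weights: dict) -> float:
    """Weighted aggregate across the eight UIB dimensions.

    Missing dimensions contribute 0, so a narrow model reads as
    visibly incomplete rather than silently strong.
    """
    total_w = sum(weights.values())
    return sum(w * dim_scores.get(d, 0.0) for d, w in weights.items()) / total_w

# Equal placeholder weights over the eight proposed dimensions.
weights = {d: 1.0 for d in [
    "linguistic", "abstract", "causal", "mathematical",
    "embodied", "social", "temporal", "compression",
]}

# A text-only model: strong on one axis, absent on the other seven.
text_only = {"linguistic": 0.92}
# uib_score(text_only, weights) -> 0.115: high MMLU-style skill, low UIB.
```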

Conclusion — The Map Is Not the Territory

Alfred Korzybski’s warning — “the map is not the territory” — applies with particular force to AI benchmarks. Our meta-meta-analysis demonstrates that the evaluation community has drawn an extraordinarily detailed map of one province (text processing) while leaving most of the territory (causal reasoning, embodiment, social cognition, compression efficiency) marked only with “here be dragons.” The 217 meta-analyses we surveyed collectively represent an enormous investment of intellectual effort directed at an increasingly narrow slice of what intelligence means.

The UIB framework we introduce in this series aims not to replace existing benchmarks but to contextualize them within a multi-dimensional structure that makes their coverage limitations explicit. By separating evaluation from inference through OpenRouter’s unified API, we remove the architectural barriers that have historically limited benchmarking to well-resourced labs. By grounding our dimensions in Schmidhuber’s compression theory, Chollet’s skill-acquisition framework, and Legg and Hutter’s universal intelligence formalism, we ensure theoretical coherence rather than ad hoc dimension selection.

The remaining ten articles in this series will develop the UIB from theory to open-source implementation. The next article examines the measurement crisis in detail: benchmark saturation, Goodhart’s Law, data contamination, and the historical parallels to the psychometric testing debates of the twentieth century.

References

Chollet, F. (2019). On the Measure of Intelligence. arXiv:1911.01547.
Chollet, F. et al. (2026). ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. arXiv:2505.11831v2.
Imbue (2026). Beating ARC-AGI-2 with Code Evolution. imbue.com.
Kiela, D. et al. (2026). When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation. arXiv:2602.16763.
Legg, S. & Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines 17(4):391–444.
Rein, D. et al. (2026). A Benchmark of Expert-Level Academic Questions to Assess AI Capabilities. Nature. doi:10.1038/s41586-025-09962-4.
Schmidhuber, J. (2009). Ultimate Cognition à la Gödel. Cognitive Computation 1(2):177–193.
Schmidhuber, J. (2024). Annotated History of Modern AI and Deep Learning. arXiv:2212.11279v7.
Spirit AI (2026). Spirit v1.5 Tops RoboChallenge Benchmark. PRNewswire.
Yang, S. et al. (2026). EmbodiedBench: Comprehensive Benchmarking MLLMs for Vision-Driven Embodied Agents. embodiedbench.github.io.
Ivchenko, O. (2026). Stabilarity Research Platform. DOI: 10.5281/zenodo.18928330.
