The AI evaluation ecosystem is in crisis. Frontier models now exceed 90% accuracy on MMLU, 95% on HumanEval, and 93% on HellaSwag — scores that were considered unattainable three years ago. This saturation is not evidence of intelligence; it is evidence that our instruments have failed. We argue that three convergent forces have rendered current AI leaderboards meaningless: (1) benchmark saturation...
Category: Universal Intelligence Benchmark
Inference-agnostic intelligence measurement framework. Meta-research and novel benchmarks for AI beyond text generation.
The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured
We present a meta-meta-analysis of 217 benchmark evaluation studies published between 2020 and 2026, examining not the benchmarks themselves but the systematic reviews that assess them. Our coverage matrix reveals a profound structural bias: 78.3% of surveyed studies evaluate text-based capabilities, while causal reasoning (4.1%), embodied intelligence (1.8%), and social cognition (0.9%) remain...
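The coverage matrix described above can be sketched as a simple tally: each surveyed study is tagged with the capability categories it evaluates, and the matrix reports the fraction of studies touching each category. The records and category names below are toy illustrations, not the paper's actual corpus of 217 studies.

```python
from collections import Counter

# Hypothetical study records: each surveyed review is tagged with the
# capability categories it covers (toy data for illustration only).
studies = [
    {"id": "S1", "categories": ["text"]},
    {"id": "S2", "categories": ["text", "causal"]},
    {"id": "S3", "categories": ["text"]},
    {"id": "S4", "categories": ["embodied"]},
    {"id": "S5", "categories": ["text", "social"]},
]

def coverage_matrix(studies):
    """Return the fraction of studies that touch each capability category."""
    counts = Counter(cat for s in studies for cat in s["categories"])
    n = len(studies)
    return {cat: counts[cat] / n for cat in sorted(counts)}

print(coverage_matrix(studies))
# "text" appears in 4 of 5 toy studies, so its coverage is 0.8
```

A skew like the one the abstract reports (text-based evaluation dominating, causal and embodied categories near zero) would show up directly as a lopsided distribution in this dictionary.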