Current AI benchmarks predominantly measure pattern recognition and statistical correlation — capabilities that, while impressive, fall short of genuine understanding. This article introduces Causal Intelligence as a formal dimension within the Universal Intelligence Benchmark (UIB) framework, arguing that any credible measure of machine intelligence must evaluate whether systems can reason abo...
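To make the distinction concrete, consider the canonical gap between observational and interventional statistics. The Python sketch below is illustrative only: the structural causal model, its probabilities, and the scoring framing are assumptions for exposition, not the UIB item format. It shows how a confounder makes P(Y|X) and P(Y|do(X)) diverge; a causal-intelligence item rewards the second quantity, which pattern matching over observational data cannot supply.

```python
# Minimal sketch of a causal-reasoning probe: does a system's answer track
# the interventional quantity P(Y|do(X)) or merely the confounded
# conditional P(Y|X)? The SCM below is an illustrative assumption.
import random

random.seed(0)

def sample_observational(n=10_000):
    """Tiny SCM: confounder Z -> X and Z -> Y; X has no real effect on Y."""
    data = []
    for _ in range(n):
        z = random.random() < 0.5
        x = random.random() < (0.8 if z else 0.2)   # Z raises P(X)
        y = random.random() < (0.9 if z else 0.1)   # Y driven only by Z
        data.append((z, x, y))
    return data

def p_y_given_x(data, x_val):
    """Observational conditional, estimated from the sampled data."""
    rows = [y for _, x, y in data if x == x_val]
    return sum(rows) / len(rows)

def p_y_do_x(n=10_000):
    """Intervene: set X by fiat, cutting the Z -> X edge."""
    hits = 0
    for _ in range(n):
        z = random.random() < 0.5
        y = random.random() < (0.9 if z else 0.1)   # Y unchanged by do(X)
        hits += y
    return hits / n

data = sample_observational()
print(f"P(Y=1 | X=1)     = {p_y_given_x(data, True):.2f}")  # ~0.74, confounded
print(f"P(Y=1 | do(X=1)) = {p_y_do_x():.2f}")               # ~0.50, causal
# A causal-intelligence item asks for the second quantity; a system that
# regurgitates the observational conditional fails the item.
```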
Category: Universal Intelligence Benchmark
An inference-agnostic framework for measuring machine intelligence: meta-research and novel benchmarks for AI capabilities beyond text generation.
Inference-Agnostic Intelligence: The UIB Theoretical Framework
Current AI benchmarks measure narrow task performance — accuracy on question sets, code-generation pass rates, or image recognition scores. They rarely ask the deeper question: what is intelligence, and how should we measure it independently of the hardware, API, or inference provider running the model? This article proposes the Universal Intelligence Benchmark (UIB) theoretical framework: an eig...
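As a rough illustration of what "inference-agnostic" could mean operationally, the sketch below scores any system that exposes a bare answer(prompt) interface, so a local model, a hosted API, or a human baseline is measured identically. The protocol name, items, and dimension labels here are hypothetical placeholders, not the framework's actual specification.

```python
# A minimal sketch of inference-agnostic scoring: the harness depends only
# on an abstract answer() callable; how the answer is produced is opaque.
# InferenceProvider, ITEMS, and the dimensions are illustrative assumptions.
from typing import Protocol

class InferenceProvider(Protocol):
    def answer(self, prompt: str) -> str:
        """Return the system's answer; provider internals are invisible."""
        ...

ITEMS = {
    "causal":  [("If the rooster is silenced, does the sun still rise?", "yes")],
    "spatial": [("Facing north, turn right twice. Which way do you face?", "south")],
}

def score(provider: InferenceProvider) -> dict[str, float]:
    """Per-dimension accuracy; no reference to hardware or API details."""
    results = {}
    for dim, items in ITEMS.items():
        correct = sum(
            expected in provider.answer(prompt).lower()
            for prompt, expected in items
        )
        results[dim] = correct / len(items)
    return results

class EchoBaseline:
    def answer(self, prompt: str) -> str:
        return "yes"  # trivial baseline, used only to exercise the harness

print(score(EchoBaseline()))  # {'causal': 1.0, 'spatial': 0.0}
```

Because the harness never touches provider internals, swapping one inference backend for another changes nothing about what is measured; that separation is the point of the "inference-agnostic" framing.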
The Measurement Crisis: Saturation, Goodhart’s Law, and the End of AI Leaderboards
The AI evaluation ecosystem is in crisis. Frontier models now exceed 90% accuracy on MMLU, 95% on HumanEval, and 93% on HellaSwag — scores that were considered unattainable three years ago. This saturation is not evidence of intelligence; it is evidence that our instruments have failed. We argue that three convergent forces have rendered current AI leaderboards meaningless: (1) benchmark satura...
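The statistical core of the saturation argument is easy to verify. Under a binomial approximation, the standard error of an accuracy score is sqrt(p(1-p)/n), so as accuracy approaches the ceiling on a fixed item pool, the benchmark's resolving power collapses. The sketch below runs the arithmetic; the accuracies are the figures quoted above, and the item counts are approximate published sizes.

```python
# Why saturation kills discrimination: near the ceiling, the gap between
# frontier models shrinks toward the sampling noise of the benchmark itself.
# Item counts are approximate; the binomial SE is a standard approximation.
from math import sqrt

def standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n independent items."""
    return sqrt(p * (1 - p) / n)

for name, p, n in [("MMLU", 0.90, 14_000), ("HumanEval", 0.95, 164)]:
    se = standard_error(p, n)
    # Two scores must differ by roughly 1.96 * sqrt(2) * SE to separate
    # at the 95% level; call that the benchmark's resolution.
    resolution = 1.96 * sqrt(2) * se
    print(f"{name}: SE = {se:.4f}, ~95% resolution = {resolution * 100:.1f} pts")
# HumanEval at 95% accuracy on 164 items cannot distinguish models less
# than ~4.7 points apart; at that point the leaderboard ordering is noise.
```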
The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured
We present a meta-meta-analysis of 217 benchmark evaluation studies published between 2020 and 2026, examining not the benchmarks themselves but the systematic reviews that assess them. Our coverage matrix reveals a profound structural bias: 78.3% of surveyed studies evaluate text-based capabilities, while causal reasoning (4.1%), embodied intelligence (1.8%), and social cognition (0.9%) remain...
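For readers who want the mechanics rather than the findings, the coverage computation is simple to reproduce in miniature. The sketch below tallies capability tags over a handful of invented placeholder records; the actual analysis runs the same tally over the 217-study corpus, which is not reproduced here.

```python
# A minimal sketch of the coverage matrix described above: count which
# capability families each surveyed review evaluates, then report each
# family's share of all reviews. Records and tags are invented placeholders.
from collections import Counter

# Each record: (study_id, set of capability families it evaluates)
SURVEYED = [
    ("rev-001", {"text"}),
    ("rev-002", {"text", "code"}),
    ("rev-003", {"causal"}),
    ("rev-004", {"text"}),
    ("rev-005", {"embodied"}),
]

def coverage(studies) -> dict[str, float]:
    """Share of surveyed reviews covering each capability family."""
    counts = Counter(tag for _, tags in studies for tag in tags)
    total = len(studies)
    return {tag: counts[tag] / total for tag in sorted(counts)}

for tag, share in coverage(SURVEYED).items():
    print(f"{tag:>8}: {share:.1%} of surveyed reviews")
# Over the real corpus, the same tally yields the skew reported above:
# text-based capabilities dominate while causal, embodied, and social
# evaluation are nearly absent.
```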