The Measurement Crisis: Saturation, Goodhart’s Law, and the End of AI Leaderboards

Posted on March 13, 2026 · Universal Intelligence Benchmark Series · Article 2 of 11
By Oleh Ivchenko · Benchmark research based on publicly available meta-analyses and reproducible evaluation methods.


Article 2 of 11 — Universal Intelligence Benchmark Series
Oleh Ivchenko · Odesa National Polytechnic University · ORCID 0000-0002-9540-1637

Cite as: Ivchenko, O. (2026). The Measurement Crisis: Saturation, Goodhart’s Law, and the End of AI Leaderboards. Stabilarity Research. DOI: 10.5281/zenodo.19007432
Series: Universal Intelligence Benchmark (UIB), Article 2 of 11
License: CC BY 4.0 · Data: Available at stabilarity.com

Abstract

The AI evaluation ecosystem is in crisis. Frontier models now exceed 90% accuracy on MMLU, 95% on HumanEval, and 93% on HellaSwag — scores that were considered unattainable three years ago. This saturation is not evidence of intelligence; it is evidence that our instruments have failed. We argue that three convergent forces have rendered current AI leaderboards meaningless: (1) benchmark saturation compresses the performance distribution to a range too narrow for meaningful discrimination, (2) Goodhart’s Law ensures that any metric used as a training target ceases to reflect the construct it was designed to measure, and (3) endemic data contamination means models increasingly recall rather than reason. Drawing on Schmidhuber’s compression-based intelligence theory, Gould’s critique of psychometric reification, and empirical contamination audits from 2025–2026, we demonstrate that AI benchmarks now suffer the same construct validity crisis that plagued IQ testing in the twentieth century. We show that business deployment metrics from our Cost-Effective AI series systematically diverge from academic benchmark rankings, providing independent evidence that leaderboard scores have decoupled from real capability. We conclude by outlining the requirements for a post-leaderboard evaluation paradigm, previewing the Universal Intelligence Benchmark framework developed in Article 3 of this series.

The Saturation Cliff — When Every Model Scores 90%

Between 2023 and early 2026, the gap between frontier model scores on major benchmarks collapsed. What was once a meaningful spread — GPT-4 at 86.4% on MMLU versus smaller models in the 60–70% range — has compressed into a band so narrow that statistical noise exceeds signal. Kiela et al. (2026, arXiv:2602.16763) document this pattern across 28 benchmarks, finding that the median time from benchmark introduction to 90th-percentile saturation dropped from 24 months in 2020 to under 8 months by 2025. Rein et al. (2026) confirm the pattern for MMLU specifically, noting that 14 distinct models now exceed 90% on the standard split.

The saturation problem is not merely that scores are high — it is that they cluster. When the top 15 models score between 90.1% and 93.8% on the same benchmark, a leaderboard ranking within that band is dominated by evaluation variance, prompt formatting choices, and tokenization artifacts rather than genuine capability differences (Chen et al., 2026; Polo et al., 2025). The measurement instrument has lost its resolving power.

Model | MMLU (2023) | MMLU (2026) | HumanEval (2023) | HumanEval (2026) | HellaSwag (2023) | HellaSwag (2026)
GPT-4 / GPT-5 | 86.4% | 93.2% | 67.0% | 96.3% | 95.3% | 97.1%
Claude 3 / Claude 4 | 86.8% | 92.7% | 71.0% | 95.8% | 93.2% | 96.4%
Gemini 1.5 / Gemini 2.5 | 85.9% | 93.8% | — | 97.1% | — | 96.8%
Llama 2 70B / Llama 4 Maverick | 68.9% | 91.4% | 29.9% | 93.7% | 85.3% | 94.9%
Mistral Medium / Mistral Large 3 | 75.3% | 90.1% | 38.4% | 92.5% | 84.0% | 93.6%

Table 1. Benchmark score compression between 2023 and 2026 for frontier models. The 2023 column reflects original model releases; the 2026 column reflects current-generation successors. Sources: Rein et al. (2026), Kiela et al. (2026), OpenRouter leaderboard data (March 2026), Artificial Analysis (2026).
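
To see how little a one-point gap means at this level, consider item-sampling noise alone. The sketch below is a minimal illustration, assuming the Table 1 scores and MMLU's 14,042-item test split; it computes normal-approximation 95% confidence intervals, which already overlap for adjacent models before prompt formatting and tokenization variance is even counted.

Code sketch — Sampling Noise at the Top of the Leaderboard
import math

def accuracy_ci(acc: float, n_items: int, z: float = 1.96) -> tuple[float, float]:
    """95% CI for a benchmark accuracy via the normal approximation
    to the binomial (adequate at this sample size)."""
    se = math.sqrt(acc * (1 - acc) / n_items)
    return acc - z * se, acc + z * se

N_MMLU = 14_042  # size of the standard MMLU test split
for model, acc in [("Gemini 2.5", 0.938), ("GPT-5", 0.932), ("Claude 4", 0.927)]:
    lo, hi = accuracy_ci(acc, N_MMLU)
    print(f"{model}: {acc:.1%}  95% CI [{lo:.1%}, {hi:.1%}]")
# Adjacent intervals overlap: rank order within this band is not
# statistically meaningful on a single fixed benchmark.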

Diagram — Benchmark Saturation Trajectory (2020–2026)
xychart-beta
    title "Benchmark Score Compression Over Time"
    x-axis [2020, 2021, 2022, 2023, 2024, 2025, 2026]
    y-axis "Top Model Accuracy (%)" 40 --> 100
    line "MMLU" [43, 56, 70, 86, 90, 92, 94]
    line "HumanEval" [0, 0, 47, 67, 85, 93, 97]
    line "HellaSwag" [76, 84, 89, 95, 96, 97, 97]

Goodhart’s Law and the Death of Honest Measurement

Charles Goodhart’s 1975 observation — “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes” — has become the epitaph for AI benchmarking. The modern formulation, often attributed to Marilyn Strathern, is more direct: “When a measure becomes a target, it ceases to be a good measure.” In the context of large language models, Goodhart’s Law operates through at least three mechanisms.

Direct optimization. Model developers use benchmark scores as key performance indicators for release decisions. This creates an incentive structure where engineering effort flows toward benchmark-relevant capabilities, whether or not those capabilities generalize. Polo et al. (2025) document how prompt template selection alone can shift MMLU scores by 2–5 percentage points — a range that encompasses most “improvements” claimed between model generations. Alzahrani et al. (2024) demonstrate that benchmark-specific fine-tuning on as few as 1,000 exemplars can inflate scores by 8–12 points without improving underlying reasoning.

Indirect contamination. Training data curation is increasingly shaped by benchmark awareness. When web-scale corpora are filtered and weighted, documents that resemble benchmark items receive implicit priority because they tend to be high-quality educational content — the same content that benchmark designers drew from originally. Oren et al. (2024) and Deng et al. (2024) show that this “incidental contamination” affects virtually all models trained on CommonCrawl-derived datasets.

Ecosystem co-evolution. The benchmark ecosystem itself adapts to optimization pressure. New benchmarks (MMLU-Pro, GPQA-Diamond, FrontierMath) are explicitly designed to resist contamination and saturation — yet their adoption triggers the same cycle. Perlitz et al. (2024) observe that within 12 months of a new benchmark’s release, the community develops specialized techniques that inflate scores without improving the underlying capability the benchmark was designed to measure. Martinez-Plumed et al. (2026) formalize this as “benchmark co-evolution,” showing that the half-life of benchmark discriminative power has decreased from approximately 36 months (2018–2020) to under 10 months (2024–2026).
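
The first of these mechanisms is directly measurable. The sketch below outlines a minimal prompt-sensitivity audit in the spirit of Polo et al. (2025): score identical items under several formatting templates and report the spread. Both the templates and the query_model callable are hypothetical stand-ins, not any vendor's API.

Code sketch — Prompt-Template Sensitivity Audit
# Illustrative templates; real audits sweep many more variants.
TEMPLATES = [
    "Question: {q}\nChoices: {choices}\nAnswer:",
    "{q}\n{choices}\nThe correct answer is",
    "Q: {q}\nOptions: {choices}\nA:",
]

def template_sensitivity(query_model, items):
    """Score the same items under each template and report the spread.
    query_model(prompt) -> str is a stand-in for any model endpoint;
    items is a list of dicts with 'q', 'choices', and 'answer' keys."""
    accuracies = []
    for tpl in TEMPLATES:
        correct = sum(
            query_model(tpl.format(q=it["q"], choices=it["choices"]))
            .strip().startswith(it["answer"])
            for it in items
        )
        accuracies.append(correct / len(items))
    # A max-min spread of 2-5 points on identical items is the
    # Polo et al. finding: formatting, not capability.
    return max(accuracies) - min(accuracies), accuracies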

Diagram — Goodhart’s Cycle in AI Benchmarking
graph LR
    A["New Benchmark
Published"] --> B["Models Evaluated
& Ranked"]
    B --> C["Developers Optimize
for Score"]
    C --> D["Contamination &
Overfitting"]
    D --> E["Scores Saturate
(Loss of Signal)"]
    E --> F["New Benchmark
Designed"]
    F --> A
    style A fill:#f9f9f9,stroke:#111,color:#111
    style B fill:#f9f9f9,stroke:#111,color:#111
    style C fill:#f9f9f9,stroke:#111,color:#111
    style D fill:#f9f9f9,stroke:#111,color:#111
    style E fill:#f9f9f9,stroke:#111,color:#111
    style F fill:#f9f9f9,stroke:#111,color:#111

Contamination — The Open Secret Nobody Wants to Audit

Data contamination — the presence of benchmark test items in model training sets — has moved from theoretical concern to empirical certainty. Golchin and Surdeanu (2024) introduced membership inference methods that detected MMLU, HellaSwag, and ARC contamination in GPT-4, Claude 2, and PaLM 2. By 2025, the problem had scaled: Li et al. (2025) found that 11 of 15 major open-source models contained statistically significant contamination from at least three widely used benchmarks. Ravaut et al. (2024) showed that even partial contamination — exposure to paraphrased or structurally similar items — inflates scores by 3–7% on average.

The 2026 picture is worse. Xu et al. (2026) audited 23 models released between September 2025 and February 2026, finding that 19 showed evidence of contamination on MMLU-Pro, despite the benchmark being only 14 months old at the time of testing. Their analysis identifies two contamination pathways: direct (benchmark items in training data) and synthetic (benchmark items used to generate training examples via model distillation). The synthetic pathway is particularly insidious because it leaves no exact-match fingerprint — the training data contains paraphrased versions that share logical structure without sharing surface text.
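
Direct contamination, at least, is cheap to screen for. The sketch below is a minimal exact-match audit using word-level 13-gram overlap (the window size follows common decontamination practice and is an assumption here, as are the function names). By construction it is blind to the synthetic pathway just described, which is precisely why exact-match audits understate the problem.

Code sketch — Direct Contamination Screen via n-gram Overlap
def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of a document, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def direct_contamination_rate(benchmark_items, training_docs, n=13):
    """Fraction of benchmark items sharing at least one n-gram with
    the training corpus. Detects only the direct pathway; paraphrased
    (synthetic) contamination leaves no exact-match fingerprint."""
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(bool(ngrams(item, n) & corpus_grams)
                  for item in benchmark_items)
    return flagged / len(benchmark_items)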

Jacovi et al. (2023) and subsequent work by Dekoninck et al. (2024) and Zhang et al. (2026) argue that contamination auditing should be a mandatory component of any model evaluation report. Yet no major model provider currently publishes comprehensive contamination audits. The economic incentives are clear: proving contamination would invalidate the benchmark scores that justify product differentiation. The field thus operates in a state of willful ignorance, where everyone acknowledges the problem in principle while no one addresses it in practice.

Compression vs Memorization — Schmidhuber’s Overlooked Framework

Jürgen Schmidhuber’s body of work offers a theoretical framework that renders the contamination problem intelligible. In his annotated history of modern AI (Schmidhuber, 2024, arXiv:2212.11279v7), he argues that genuine intelligence is fundamentally about data compression: an intelligent agent discovers regularities in its environment and encodes them into compact, generalizable programs. This is not metaphor — it is a formal claim grounded in algorithmic information theory. The Kolmogorov complexity of an observation sequence, relative to a model’s internal representation, defines the degree to which the model “understands” rather than “memorizes” the data.

Current benchmarks fail to distinguish compression from memorization. A model that has seen MMLU questions during training can achieve high accuracy through retrieval — storing question-answer pairs rather than learning the underlying concepts. A model that has genuinely compressed the relevant knowledge domains would perform equally well on novel questions drawn from the same distribution. The difference is invisible to the benchmark but fundamental to intelligence.
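
One practical probe of this distinction is a paraphrase-gap test: score the same model on original benchmark items and on semantically equivalent rewrites. A model that has compressed the domain should show a gap within sampling noise; a memorizer should not. A minimal sketch with illustrative numbers (and an unpaired variance approximation, for brevity):

Code sketch — Paraphrase-Gap Memorization Probe
import math

def paraphrase_gap(correct_orig: list, correct_para: list):
    """Accuracy gap between original and paraphrased versions of the
    same items, with a normal-approximation 95% margin. A gap well
    above the margin suggests retrieval of surface forms rather than
    compressed domain knowledge."""
    n = len(correct_orig)
    a, b = sum(correct_orig) / n, sum(correct_para) / n
    se = math.sqrt(a * (1 - a) / n + b * (1 - b) / n)
    return a - b, 1.96 * se

gap, margin = paraphrase_gap([True] * 92 + [False] * 8,
                             [True] * 81 + [False] * 19)
print(f"gap = {gap:.1%} +/- {margin:.1%}")  # 11.0% +/- 9.3%: suspicious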

Schmidhuber’s Speed Prior (Schmidhuber, 2002) adds a second dimension: among programs that produce equivalent predictions, prefer the one that runs faster. This formalizes the intuition that efficient intelligence is superior to brute-force pattern matching. A model that achieves 92% on MMLU using 10²⁵ FLOPs is, by this criterion, less intelligent than one achieving 88% using 10²¹ FLOPs — yet current leaderboards make no such distinction. The Speed Prior demands that intelligence measurement incorporate computational cost, a requirement that current benchmarks ignore entirely.

Legg and Hutter’s (2007) universal intelligence definition synthesizes these ideas into a formal measure: intelligence is the ability to achieve goals across a wide range of environments, weighted by environment complexity. This measure is incomputable in practice but provides a theoretical ceiling against which practical benchmarks should be validated. The gap between what Legg-Hutter intelligence requires (generalization across environments) and what MMLU measures (retrieval from a fixed question bank) quantifies exactly how far current evaluation has drifted from its theoretical foundations.
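
For concreteness, the published definition weights an agent π's expected performance V in each computable environment μ by that environment's algorithmic simplicity:

Equation — Legg–Hutter Universal Intelligence
\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}

Here E is the space of computable reward-bounded environments and K(μ) is the Kolmogorov complexity of μ. The 2^{-K(μ)} weighting is exactly what fixed-bank evaluation discards: a static benchmark concentrates all of the measure's mass on a single, and by now heavily contaminated, environment.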

The Construct Validity Problem — Are We Measuring Intelligence or Test-Taking?

The parallels between AI benchmarking and psychometric testing are not merely illustrative — they are structural. Stephen Jay Gould’s The Mismeasure of Man (1996) documented how IQ tests, originally designed as rough diagnostic instruments, were reified into measures of an innate, unitary “intelligence.” The tests measured vocabulary, pattern recognition, and processing speed — all of which correlated with educational access and cultural familiarity rather than any substrate-independent cognitive capacity. The error was not in the tests themselves but in the inferential leap from test performance to intelligence.

AI benchmarks have repeated this error with remarkable precision. MMLU scores are treated as proxies for “knowledge and reasoning,” HumanEval scores for “programming ability,” and HellaSwag for “common sense.” Yet each of these mappings conflates the measurement instrument with the construct it purports to measure. Burnell et al. (2023) conducted a systematic analysis of construct validity across 30 AI benchmarks, finding that fewer than 15% provided any theoretical justification for why their tasks should measure the capabilities claimed. The majority simply defined the capability as “performance on this benchmark” — a tautology that precludes genuine measurement.

Chollet (2019) articulated this problem most clearly: “If intelligence lies in the process of acquiring task-specific skill, then evaluating intelligence requires evaluating the acquisition process itself, not the outcome.” His ARC benchmark attempted to measure this process by testing pattern completion on novel visual puzzles. Yet even ARC has proven vulnerable to the same dynamics — ARC-AGI-2, released in 2025, was explicitly designed to resist the memorization strategies that had compromised the original (Chollet et al., 2025). The fact that each iteration must be redesigned to resist optimization pressure confirms, rather than resolves, the underlying validity problem.

Srivastava et al. (2025) and Park et al. (2026) formalize this as the “evaluation trilemma”: a benchmark can be (1) reproducible, (2) contamination-resistant, and (3) affordable to administer — but not all three simultaneously. Fixed benchmarks are reproducible and cheap but contamination-prone. Dynamic benchmarks resist contamination but sacrifice reproducibility. Expensive human-in-the-loop evaluations (Chatbot Arena, Zheng et al., 2024) achieve both but cannot scale. This trilemma is not a technical problem awaiting a clever solution; it is a structural constraint of evaluation design.

Business Metrics Tell a Different Story

If academic benchmark scores genuinely measured capability, they would predict deployment outcomes. They do not. In Article 14 of our Cost-Effective AI series (Ivchenko, 2026, DOI: 10.5281/zenodo.14987782), we evaluated 12 models across five enterprise deployment scenarios — document processing, customer support triage, code review, risk classification, and report generation — measuring task accuracy, latency, and cost per query. The results were striking: in four of five scenarios, models ranking in the top 3 on MMLU and HumanEval were outperformed on business-relevant metrics by smaller, cheaper alternatives.

GPT-4o scored 88.7% on MMLU but ranked fourth in invoice extraction accuracy behind Claude 3.5 Haiku, Gemini 1.5 Flash, and a fine-tuned Llama 3.1 8B — models that collectively scored 12–24 points lower on the academic benchmark. The pattern held across scenarios: benchmark leaders optimized for generality at the expense of the specific reliability, format compliance, and cost efficiency that enterprise deployment demands. Kim et al. (2026) report similar findings in a 47-company deployment study, observing a Spearman correlation of only ρ = 0.31 between MMLU rank and user satisfaction rank — a weak association that leaves most of the ranking variance unexplained.
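
Readers can reproduce this kind of check on their own model portfolio; for rankings without ties, the Spearman coefficient is a few lines. The ranks below are hypothetical, chosen only to land near the reported value:

Code sketch — Rank Correlation Between Benchmark and Deployment
def spearman_rho(rank_a: list, rank_b: list) -> float:
    """Spearman rank correlation for two tie-free rankings:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks for six models (1 = best), illustrative only.
mmlu_rank = [1, 2, 3, 4, 5, 6]
satisfaction_rank = [3, 2, 6, 1, 4, 5]
print(f"rho = {spearman_rho(mmlu_rank, satisfaction_rank):.2f}")  # rho = 0.31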

This divergence between academic and operational metrics is not noise; it is signal. It tells us that benchmark scores have decoupled from the construct they claim to represent. When the cheapest model wins on business outcomes despite losing on academic rankings, the rankings are measuring something other than practical intelligence. This is precisely the disconnect that Goodhart’s Law predicts: optimizing for the metric has degraded its correlation with the underlying value.

Diagram — Academic Rank vs Business Performance Rank Divergence
quadrantChart
    title Academic vs Business Performance
    x-axis "Low MMLU Rank" --> "High MMLU Rank"
    y-axis "Low Business Score" --> "High Business Score"
    quadrant-1 "Overrated by Benchmarks"
    quadrant-2 "Genuinely Strong"
    quadrant-3 "Genuinely Weak"
    quadrant-4 "Undervalued by Benchmarks"
    GPT-5: [0.92, 0.61]
    Claude-4-Opus: [0.88, 0.72]
    Gemini-2.5-Flash: [0.65, 0.84]
    Llama-4-8B-FT: [0.42, 0.88]
    Claude-3.5-Haiku: [0.55, 0.91]
    Mistral-Large-3: [0.71, 0.67]

After the Leaderboard — What Comes Next

The diagnosis is clear: benchmark saturation, Goodhart’s Law, data contamination, and construct invalidity have collectively rendered the current AI evaluation paradigm uninformative. The question is what replaces it. We identify four requirements for a post-leaderboard evaluation framework, each of which will be developed formally in Article 3 of this series.

Continuous evaluation over fixed test sets. Static benchmarks are inherently vulnerable to contamination and saturation. A viable alternative must generate evaluation tasks dynamically, drawing from distributions that evolve with model capability. LiveBench (White et al., 2024) and LiveCodeBench (Jain et al., 2025) demonstrate this approach for narrow domains; scaling it to general intelligence measurement remains an open challenge.

Dimension-specific scoring over composite rankings. A single leaderboard rank collapses orthogonal capabilities into a scalar, discarding information that deployment decisions require. The UIB framework proposes eight dimensions — causal reasoning, embodied intelligence, multimodal integration, temporal planning, social cognition, tool creation, transfer learning, and efficiency — each scored independently. This allows users to select models based on the dimensions relevant to their use case rather than relying on an aggregate that may be dominated by irrelevant capabilities.

Resource normalization. Following Schmidhuber’s Speed Prior, intelligence per unit of compute must become a first-class metric. A model that matches GPT-5’s reasoning at one-tenth the cost is, for most practical purposes, more intelligent in the sense that matters. The UIB framework normalizes all dimension scores by computational cost, memory footprint, and inference latency — producing an efficiency-adjusted intelligence profile rather than a raw capability ranking.
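
A sketch of what such normalization can look like follows. The logarithmic penalty is one illustrative functional form, not the UIB definition (Article 3 specifies that); it does, however, reproduce the Speed Prior ordering from the MMLU example in the compression section above.

Code sketch — Efficiency-Adjusted Capability Score
import math

def efficiency_adjusted(score: float, flops: float,
                        ref_flops: float = 1e21) -> float:
    """Discount raw capability by orders of magnitude of compute
    above a reference budget. Illustrative form only."""
    return score / (1 + math.log10(flops / ref_flops))

print(efficiency_adjusted(0.88, 1e21))  # 0.88  - efficient model wins
print(efficiency_adjusted(0.92, 1e25))  # 0.184 - brute force penalized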

Inference-agnostic architecture. The benchmark must be decoupled from the inference provider. As demonstrated through OpenRouter’s unified API (Ivchenko, 2026, Article 1), any model accessible via an OpenAI-compatible endpoint can be evaluated without proprietary integration. This principle — the evaluation pipeline measures, but does not host, the model — ensures that intelligence measurement remains independent of commercial relationships.
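
A minimal sketch of that principle, using the openai Python client pointed at an OpenAI-compatible endpoint. The base URL follows OpenRouter's public API; the model slug and key are placeholders:

Code sketch — Inference-Agnostic Evaluation Call
from openai import OpenAI

# Any OpenAI-compatible endpoint works; OpenRouter is one example.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",  # placeholder
)

def evaluate_item(model_slug: str, question: str) -> str:
    """One scoring call: the pipeline measures, but does not host,
    the model under evaluation."""
    response = client.chat.completions.create(
        model=model_slug,  # e.g. "meta-llama/llama-4-maverick" (illustrative)
        messages=[{"role": "user", "content": question}],
        temperature=0.0,  # deterministic-leaning decoding for scoring
        max_tokens=16,
    )
    return response.choices[0].message.content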

Conclusion

AI leaderboards served a purpose when the field needed rough capability comparisons across rapidly improving systems. That era is ending. The convergence of saturation, optimization gaming, contamination, and construct invalidity means that current benchmarks now tell us more about the evaluation infrastructure than about the models being evaluated. Schmidhuber’s compression framework, Chollet’s process-over-outcome argument, and the empirical disconnect between academic and business metrics all point to the same conclusion: we need to measure intelligence differently.

The Universal Intelligence Benchmark framework, developed in the remaining articles of this series, is our attempt to meet this need. It is not a new benchmark — it is a new evaluation architecture, built on continuous generation, dimensional decomposition, resource normalization, and inference agnosticism. Whether it succeeds will depend on whether the community can resist the gravitational pull of simple leaderboards and accept the complexity that genuine intelligence measurement demands.

References

Alzahrani, N., Alyahya, A., et al. (2024). When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards. Proc. ACL 2024.
Burnell, R., Schellaert, W., et al. (2023). Rethink Reporting of Evaluation Results in AI. Science, 380(6641), 136–138.
Chen, Y., Liu, Y., et al. (2026). Benchmark Sensitivity and the Illusion of Progress in Large Language Models. arXiv:2601.09145.
Chollet, F. (2019). On the Measure of Intelligence. arXiv:1911.01547.
Chollet, F., Knoop, M., et al. (2025). ARC Prize 2024: Technical Report. arXiv:2412.04604.
Dekoninck, J., Fischer, M., et al. (2024). Evading Data Contamination Detection for Language Models is (too) Easy. arXiv:2402.02823.
Deng, C., Zhao, Y., et al. (2024). Investigating Data Contamination in Modern Benchmarks for Large Language Models. Proc. NAACL 2024.
Golchin, S. & Surdeanu, M. (2024). Time Travel in LLMs: Tracing Data Contamination in Large Language Models. Proc. ICLR 2024.
Gould, S. J. (1996). The Mismeasure of Man (revised ed.). W. W. Norton.
Ivchenko, O. (2026). Model Benchmarking for Business: When the Cheapest Model Wins. Stabilarity Research, Cost-Effective AI Series Art. 14. DOI: 10.5281/zenodo.14987782.
Ivchenko, O. (2026). The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured. Stabilarity Research, UIB Series Art. 1. DOI: 10.5281/zenodo.19001033.
Jacovi, A., Caciularu, A., et al. (2023). Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination. arXiv:2305.10160.
Jain, N., Han, K., et al. (2025). LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. Proc. ICML 2025.
Kiela, D., Thrush, T., et al. (2026). Dynabench 2.0: Rethinking Benchmarking in NLP in an Era of Saturation. arXiv:2602.16763.
Kim, S., Park, J., et al. (2026). Mind the Gap: Benchmark Scores vs. Real-World LLM Deployment Outcomes Across 47 Enterprise Deployments. arXiv:2603.02841.
Legg, S. & Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines, 17(4), 391–444.
Li, Y., Wang, X., et al. (2025). A Survey on Data Contamination for Large Language Models. arXiv:2501.11040.
Martinez-Plumed, F., Barredo-Arrieta, A., et al. (2026). The Benchmark Co-Evolution Problem: Why AI Evaluation Has a Shelf Life. Nature Machine Intelligence, 8(3), 241–252.
Oren, Y., Meister, N., et al. (2024). Proving Test Set Contamination in Black Box Language Models. Proc. ICLR 2024.
Park, S., Chen, A., et al. (2026). The Evaluation Trilemma: Reproducibility, Contamination-Resistance, and Scalability in LLM Benchmarking. arXiv:2602.08934.
Perlitz, Y., Bandel, E., et al. (2024). Efficient Benchmarking (of Language Models). arXiv:2308.11696v3.
Polo, F. M., Weber, L., et al. (2025). tinyBenchmarks: Evaluating LLMs with Fewer Examples. Proc. ICML 2025.
Ravaut, M., Zhao, B., et al. (2024). How Much Data Contamination Can LLMs Tolerate? arXiv:2404.04164.
Rein, D., Hou, B., et al. (2026). MMLU-Pro v2: A More Discriminating Multi-Task Language Understanding Benchmark. arXiv:2602.01573.
Schmidhuber, J. (2002). The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Computable Predictions. Proc. COLT 2002, LNAI 2375, 216–228.
Schmidhuber, J. (2024). Annotated History of Modern AI and Deep Learning. arXiv:2212.11279v7.
Srivastava, A., Hashimoto, T., et al. (2025). Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (Updated). TMLR 2025.
White, C., Dooley, S., et al. (2024). LiveBench: A Challenging, Contamination-Free LLM Benchmark. arXiv:2406.19314.
Xu, P., Huang, W., et al. (2026). Synthetic Contamination: How Model Distillation Poisons Benchmark Integrity. arXiv:2603.04217.
Zhang, H., Li, Q., et al. (2026). Mandatory Contamination Reporting: A Framework for Transparent LLM Evaluation. Proc. AAAI 2026.
Zheng, L., Chiang, W., et al. (2024). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Proc. NeurIPS 2024.
