Inference-Agnostic Intelligence Measurement for the Post-Text Era
Odesa National Polytechnic University (ONPU)
- Type: Meta-Research
- Status: Ongoing · 0/11 articles · 2026–ongoing
- Links: GitHub
Current AI benchmarks measure a narrow slice of intelligence — predominantly text comprehension and generation. As AI systems evolve into embodied agents, multimodal reasoners, and autonomous planners, the measurement instruments have not kept pace. This series conducts a systematic meta-meta-analysis of 200+ benchmark studies, exposes the dimensional blind spots in current evaluation frameworks, and proposes the Universal Intelligence Benchmark (UIB): an inference-agnostic, eight-dimensional measurement framework covering causal reasoning, embodied task completion, temporal planning, social cognition, tool creation, cross-domain transfer, multimodal synthesis, and resource-normalized efficiency. The goal is not another leaderboard — it is a fundamental rethinking of what “intelligence” means when the system under test is no longer just a language model.
Run benchmark evaluations, explore the eight intelligence dimensions, and compare model scores on the live leaderboard.
Open UIB Benchmark API Documentation
Idea and Motivation
Every frontier AI model now scores above 90% on MMLU, HumanEval, and HellaSwag. The benchmarks are saturated. Meanwhile, these same models fail at causal reasoning, long-horizon planning, and embodied tasks. The measurement instruments have become the bottleneck — not the systems being measured.
This series begins from a simple observation: when every leading system aces the test, the test is no longer measuring what matters. Goodhart’s Law has taken hold — models are optimised for benchmark performance rather than genuine cognitive capability. We need benchmarks that are agnostic to inference modality and test genuine cognitive capabilities across dimensions that current frameworks ignore entirely.
Goal
Develop and validate a universal, inference-agnostic intelligence measurement framework (UIB) through systematic meta-research, dimensional analysis, and open-source implementation. The framework must be applicable to any AI system — text-based, multimodal, embodied, or hybrid — without privileging any particular inference modality or architectural paradigm.
The end product is not a single paper but a complete research programme: theoretical foundations, per-dimension measurement instruments, a composite scoring methodology, and an open-source benchmark suite that the research community can adopt, critique, and extend.
Scope
The series covers 11 articles across three research phases:
| Phase | Focus Area | Key Topics |
|---|---|---|
| 1 — Foundation | Measurement Crisis | Meta-meta-analysis of 200+ benchmark studies, benchmark saturation diagnosis, Goodhart’s Law in AI evaluation, construct validity analysis, theoretical UIB framework proposal |
| 2 — Dimension Deep-Dives | Eight UIB Dimensions | Causal reasoning vs pattern matching, embodied task completion, temporal planning and long-horizon goals, social cognition, tool creation, cross-domain transfer, multimodal synthesis, resource-normalized efficiency |
| 3 — Synthesis | Integration and Implementation | Composite scoring methodology, dimensional weighting, open-source benchmark suite, empirical validation protocol, 10-year measurement obsolescence projections |
The Eight UIB Dimensions
The UIB framework measures intelligence across eight orthogonal dimensions. The radar chart below visualises placeholder scores across all dimensions, representing the measurement space the benchmark covers.
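As a minimal sketch of the measurement space, the eight dimensions can be held as a score vector. The numeric values below are illustrative placeholders standing in for the radar chart's, not actual measurements; the dimension names follow the list above.

```python
# Hypothetical placeholder scores (0 to 1) for the eight UIB dimensions,
# standing in for the radar chart. Values are illustrative, not measurements.
PLACEHOLDER_SCORES = {
    "causal_reasoning": 0.42,
    "embodied_task_completion": 0.18,
    "temporal_planning": 0.35,
    "social_cognition": 0.51,
    "tool_creation": 0.22,
    "cross_domain_transfer": 0.47,
    "multimodal_synthesis": 0.60,
    "resource_normalized_efficiency": 0.30,
}

def weakest_dimension(scores: dict[str, float]) -> str:
    """Return the lowest-scoring dimension. A radar profile exists to expose
    imbalance: the weakest dimension is often more informative than the mean."""
    return min(scores, key=scores.get)

print(weakest_dimension(PLACEHOLDER_SCORES))
```

Reporting the weakest dimension alongside any aggregate is one way a multi-dimensional profile resists collapsing back into a single leaderboard number.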
Focus
The primary analytical focus is on the gap between what current benchmarks measure and what constitutes genuine intelligence. Six areas receive sustained attention throughout the series:
- Benchmark saturation and Goodhart’s Law — documenting how optimisation pressure has rendered major benchmarks uninformative.
- Construct validity of current AI evaluations — examining whether benchmarks actually measure the constructs they claim to measure.
- Causal reasoning vs pattern matching — distinguishing genuine causal understanding from statistical correlation exploitation.
- Embodied and multimodal intelligence — measuring capabilities that require physical or cross-modal reasoning.
- Resource-normalized efficiency scoring — evaluating intelligence per unit of compute, data, and energy.
- Open-source benchmark implementation — delivering usable evaluation tools, not just theoretical frameworks.
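To make the resource-normalized efficiency idea concrete, one plausible form is a raw capability score discounted by log-compute relative to a reference budget. The formula, the FLOP figures, and the reference constant below are assumptions for illustration only, not the series' final methodology.

```python
import math

def efficiency_normalized_score(raw_score: float, inference_flops: float,
                                reference_flops: float = 1e21) -> float:
    """Discount a raw capability score by log-compute relative to a
    reference budget, so brute-force inference does not score for free."""
    penalty = math.log10(inference_flops) / math.log10(reference_flops)
    return raw_score / penalty

# Two models with identical raw scores but very different inference budgets:
frugal = efficiency_normalized_score(0.80, inference_flops=1e18)
lavish = efficiency_normalized_score(0.80, inference_flops=1e24)
```

Under this sketch the frugal model outscores the lavish one despite identical raw capability, which is precisely the behaviour a resource-normalized dimension is meant to reward.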
Scientific Value
The series makes five contributions to the field. First, it provides the first systematic meta-meta-analysis of AI benchmark research — examining not individual benchmarks but the research practices and assumptions underlying benchmark design itself. Second, it proposes the novel eight-dimensional UIB framework as an alternative to single-score leaderboard evaluation. Third, it delivers an open-source benchmark suite designed for community adoption, replication, and extension.
Fourth, it introduces a resource-efficiency normalization methodology that evaluates intelligence relative to computational cost — addressing the growing concern that raw capability scores mask enormous differences in inference expense. Fifth, it produces 10-year measurement obsolescence projections, offering the research community a structured forecast of when current evaluation instruments will lose discriminative power.
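A composite score with explicit dimensional weighting might look like the following sketch. The weights here are arbitrary placeholders; the actual weighting methodology is the subject of Phase 3.

```python
def weighted_composite(scores: dict[str, float],
                       weights: dict[str, float]) -> float:
    """Weighted mean over dimensions; weights need not sum to one."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total

# Illustrative two-dimension example with causal reasoning weighted double:
score = weighted_composite(
    {"causal_reasoning": 0.4, "embodied_task_completion": 0.2},
    {"causal_reasoning": 2.0, "embodied_task_completion": 1.0},
)
```

Keeping the weights as explicit, inspectable inputs, rather than baked into the score, is what lets the community critique and re-run the aggregation with different priorities.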
Cross-Series Integration
This series draws on and feeds back into the entire Stabilarity research ecosystem:
| Series | Connection | API Endpoint |
|---|---|---|
| AI Economics | ROI vs benchmark score correlation | /v1/tools/roi |
| Cost-Effective AI | Model efficiency scoring | /v1/tools/risk |
| HPF-P Framework | Decision Readiness as intelligence proxy | /v1/hpf/analyze |
| AI Observability | Runtime benchmark monitoring | /v1/uib/status |
| Capability-Adoption Gap | Gap between scores and deployment | /v1/tools/classify |
| Open Humanoid | Embodied dimension validation | — |
| Future of AI | Benchmark obsolescence prediction | — |
| Geopolitical Risk | AI capability distribution by nation | /v1/geo-risk/data/countries |
| ScanLab | Domain-specific medical intelligence | /v1/scanlab/predict |
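The endpoint paths in the table can be resolved programmatically. The base URL below is a hypothetical placeholder, since this overview does not specify the API host; only the paths themselves come from the table.

```python
# Hypothetical host; only the endpoint paths come from the table above.
BASE_URL = "https://api.stabilarity.example"

ENDPOINTS = {
    "uib_status": "/v1/uib/status",
    "hpf_analyze": "/v1/hpf/analyze",
    "tools_roi": "/v1/tools/roi",
    "tools_risk": "/v1/tools/risk",
    "tools_classify": "/v1/tools/classify",
    "geo_countries": "/v1/geo-risk/data/countries",
    "scanlab_predict": "/v1/scanlab/predict",
}

def endpoint_url(name: str) -> str:
    """Resolve a named cross-series endpoint to a full URL."""
    return BASE_URL + ENDPOINTS[name]
```

Centralising the paths in one mapping keeps cross-series integrations in sync if an endpoint is versioned or renamed.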
Key References
- Schmidhuber, J. (2024). “Annotated History of Modern AI and Deep Learning.” arXiv:2212.11279v7.
- Schmidhuber, J. (2009). “Ultimate Cognition à la Gödel.” Cognitive Computation 1(2):177–193.
- Legg, S. & Hutter, M. (2007). “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines 17(4):391–444.
- Chollet, F. (2019). “On the Measure of Intelligence.” arXiv:1911.01547.
- Ivchenko, O. (2026). “Model Benchmarking for Business.” Stabilarity Research Hub.
Resources
- GitHub Repository
- Stabilarity Research Hub
- API — status, run, leaderboard, dimensions
- Interactive UIB Benchmark Tool
- Jupyter Notebooks — coming soon
Status
In progress. 0 of 11 articles published. Series launched March 2026. Phase 1 (Foundation) is in active development. Articles will be published sequentially and listed below as they become available.
Contribution Opportunities
Researchers wishing to engage with or build on this work are encouraged to consider the following directions:
- Benchmark archaeology: Contribute to the meta-meta-analysis by identifying benchmark studies not covered in the initial 200+ survey, particularly from non-English-language research communities.
- Dimension proposals: Suggest additional intelligence dimensions not covered by the eight-dimensional UIB framework, with supporting psychometric or cognitive science literature.
- Empirical validation: Run UIB evaluation protocols against frontier models once the Phase 3 benchmark suite is released, contributing results to the open dataset.
- Efficiency measurement: Develop or refine resource-normalization metrics that account for hardware heterogeneity, energy costs, and inference latency across deployment contexts.
- Human baselines: Design and conduct psychometric studies establishing human performance baselines on novel UIB dimensions, particularly tool creation and cross-domain transfer.