Inference-Agnostic Intelligence Measurement for the Post-Text Era
Odesa National Polytechnic University (ONPU)
- Type: Meta-Research
- Status: Ongoing · 0/11 articles · 2026–ongoing
- Links: GitHub
Current AI benchmarks measure a narrow slice of intelligence — predominantly text comprehension and generation. As AI systems evolve into embodied agents, multimodal reasoners, and autonomous planners, the measurement instruments have not kept pace. This series conducts a systematic meta-meta-analysis of 200+ benchmark studies, exposes the dimensional blind spots in current evaluation frameworks, and proposes the Universal Intelligence Benchmark (UIB): an inference-agnostic, eight-dimensional measurement framework covering causal reasoning, embodied task completion, temporal planning, social cognition, tool creation, cross-domain transfer, multimodal synthesis, and resource-normalized efficiency. The goal is not another leaderboard — it is a fundamental rethinking of what “intelligence” means when the system under test is no longer just a language model.
Run benchmark evaluations, explore the eight intelligence dimensions, and compare model scores on the live leaderboard.
Open UIB Benchmark API Documentation
Idea and Motivation
Every frontier AI model now scores above 90% on MMLU, HumanEval, and HellaSwag. The benchmarks are saturated. Meanwhile, these same models fail at causal reasoning, long-horizon planning, and embodied tasks. The measurement instruments have become the bottleneck — not the systems being measured.
This series begins from a simple observation: when every leading system aces the test, the test is no longer measuring what matters. Goodhart’s Law has taken hold — models are optimised for benchmark performance rather than genuine cognitive capability. We need benchmarks that are agnostic to inference modality and test genuine cognitive capabilities across dimensions that current frameworks ignore entirely.
Goal
Develop and validate a universal, inference-agnostic intelligence measurement framework (UIB) through systematic meta-research, dimensional analysis, and open-source implementation. The framework must be applicable to any AI system — text-based, multimodal, embodied, or hybrid — without privileging any particular inference modality or architectural paradigm.
The end product is not a single paper but a complete research programme: theoretical foundations, per-dimension measurement instruments, a composite scoring methodology, and an open-source benchmark suite that the research community can adopt, critique, and extend.
Scope
The series covers 11 articles across three research phases:
| Phase | Focus Area | Key Topics |
|---|---|---|
| 1 — Foundation | Measurement Crisis | Meta-meta-analysis of 200+ benchmark studies, benchmark saturation diagnosis, Goodhart’s Law in AI evaluation, construct validity analysis, theoretical UIB framework proposal |
| 2 — Dimension Deep-Dives | Eight UIB Dimensions | Causal reasoning vs pattern matching, embodied task completion, temporal planning and long-horizon goals, social cognition, tool creation, cross-domain transfer, multimodal synthesis, resource-normalized efficiency |
| 3 — Synthesis | Integration and Implementation | Composite scoring methodology, dimensional weighting, open-source benchmark suite, empirical validation protocol, 10-year measurement obsolescence projections |
The Eight UIB Dimensions
The UIB framework measures intelligence across eight orthogonal dimensions. The radar chart below visualises placeholder scores across all dimensions, representing the measurement space the benchmark covers.
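As a minimal sketch of the measurement space, the eight dimensions can be held as a score vector. The numeric values below are illustrative placeholders standing in for the radar chart's, not actual measurements; the dimension names follow the list above.

```python
# Hypothetical placeholder scores (0 to 1) for the eight UIB dimensions,
# standing in for the radar chart. Values are illustrative, not measurements.
PLACEHOLDER_SCORES = {
    "causal_reasoning": 0.42,
    "embodied_task_completion": 0.18,
    "temporal_planning": 0.35,
    "social_cognition": 0.51,
    "tool_creation": 0.22,
    "cross_domain_transfer": 0.47,
    "multimodal_synthesis": 0.60,
    "resource_normalized_efficiency": 0.30,
}

def weakest_dimension(scores: dict[str, float]) -> str:
    """Return the lowest-scoring dimension. A radar profile exists to expose
    imbalance: the weakest dimension is often more informative than the mean."""
    return min(scores, key=scores.get)

print(weakest_dimension(PLACEHOLDER_SCORES))
```

Reporting the weakest dimension alongside any aggregate is one way a multi-dimensional profile resists collapsing back into a single leaderboard number.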
Focus
The primary analytical focus is on the gap between what current benchmarks measure and what constitutes genuine intelligence. Six areas receive sustained attention throughout the series:
- Benchmark saturation and Goodhart’s Law — documenting how optimisation pressure has rendered major benchmarks uninformative.
- Construct validity of current AI evaluations — examining whether benchmarks actually measure the constructs they claim to measure.
- Causal reasoning vs pattern matching — distinguishing genuine causal understanding from statistical correlation exploitation.
- Embodied and multimodal intelligence — measuring capabilities that require physical or cross-modal reasoning.
- Resource-normalized efficiency scoring — evaluating intelligence per unit of compute, data, and energy.
- Open-source benchmark implementation — delivering usable evaluation tools, not just theoretical frameworks.
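To make the resource-normalized efficiency idea concrete, one plausible form is a raw capability score discounted by log-compute relative to a reference budget. The formula, the FLOP figures, and the reference constant below are assumptions for illustration only, not the series' final methodology.

```python
import math

def efficiency_normalized_score(raw_score: float, inference_flops: float,
                                reference_flops: float = 1e21) -> float:
    """Discount a raw capability score by log-compute relative to a
    reference budget, so brute-force inference does not score for free."""
    penalty = math.log10(inference_flops) / math.log10(reference_flops)
    return raw_score / penalty

# Two models with identical raw scores but very different inference budgets:
frugal = efficiency_normalized_score(0.80, inference_flops=1e18)
lavish = efficiency_normalized_score(0.80, inference_flops=1e24)
```

Under this sketch the frugal model outscores the lavish one despite identical raw capability, which is precisely the behaviour a resource-normalized dimension is meant to reward.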
Scientific Value
The series makes five contributions to the field. First, it provides the first systematic meta-meta-analysis of AI benchmark research — examining not individual benchmarks but the research practices and assumptions underlying benchmark design itself. Second, it proposes the novel eight-dimensional UIB framework as an alternative to single-score leaderboard evaluation. Third, it delivers an open-source benchmark suite designed for community adoption, replication, and extension.
Fourth, it introduces a resource-efficiency normalization methodology that evaluates intelligence relative to computational cost — addressing the growing concern that raw capability scores mask enormous differences in inference expense. Fifth, it produces 10-year measurement obsolescence projections, offering the research community a structured forecast of when current evaluation instruments will lose discriminative power.
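A composite score with explicit dimensional weighting might look like the following sketch. The weights here are arbitrary placeholders; the actual weighting methodology is the subject of Phase 3.

```python
def weighted_composite(scores: dict[str, float],
                       weights: dict[str, float]) -> float:
    """Weighted mean over dimensions; weights need not sum to one."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total

# Illustrative two-dimension example with causal reasoning weighted double:
score = weighted_composite(
    {"causal_reasoning": 0.4, "embodied_task_completion": 0.2},
    {"causal_reasoning": 2.0, "embodied_task_completion": 1.0},
)
```

Keeping the weights as explicit, inspectable inputs, rather than baked into the score, is what lets the community critique and re-run the aggregation with different priorities.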
Cross-Series Integration
This series draws on and feeds back into the entire Stabilarity research ecosystem:
| Series | Connection | API Endpoint |
|---|---|---|
| AI Economics | ROI vs benchmark score correlation | /v1/tools/roi |
| Cost-Effective AI | Model efficiency scoring | /v1/tools/risk |
| HPF-P Framework | Decision Readiness as intelligence proxy | /v1/hpf/analyze |
| AI Observability | Runtime benchmark monitoring | /v1/uib/status |
| Capability-Adoption Gap | Gap between scores and deployment | /v1/tools/classify |
| Open Humanoid | Embodied dimension validation | — |
| Future of AI | Benchmark obsolescence prediction | — |
| Geopolitical Risk | AI capability distribution by nation | /v1/geo-risk/data/countries |
| ScanLab | Domain-specific medical intelligence | /v1/scanlab/predict |
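The endpoint paths in the table can be resolved programmatically. The base URL below is a hypothetical placeholder, since this overview does not specify the API host; only the paths themselves come from the table.

```python
# Hypothetical host; only the endpoint paths come from the table above.
BASE_URL = "https://api.stabilarity.example"

ENDPOINTS = {
    "uib_status": "/v1/uib/status",
    "hpf_analyze": "/v1/hpf/analyze",
    "tools_roi": "/v1/tools/roi",
    "tools_risk": "/v1/tools/risk",
    "tools_classify": "/v1/tools/classify",
    "geo_countries": "/v1/geo-risk/data/countries",
    "scanlab_predict": "/v1/scanlab/predict",
}

def endpoint_url(name: str) -> str:
    """Resolve a named cross-series endpoint to a full URL."""
    return BASE_URL + ENDPOINTS[name]
```

Centralising the paths in one mapping keeps cross-series integrations in sync if an endpoint is versioned or renamed.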
Key References
- Schmidhuber, J. (2024). “Annotated History of Modern AI and Deep Learning.” arXiv:2212.11279v7.
- Schmidhuber, J. (2009). “Ultimate Cognition à la Gödel.” Cognitive Computation 1(2):177–193.
- Legg, S. & Hutter, M. (2007). “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines 17(4):391–444.
- Chollet, F. (2019). “On the Measure of Intelligence.” arXiv:1911.01547.
- Ivchenko, O. (2026). “Model Benchmarking for Business.” Stabilarity Research Hub.
Resources
- GitHub Repository
- Stabilarity Research Hub
- API — status, run, leaderboard, dimensions
- Interactive UIB Benchmark Tool
- Jupyter Notebooks — coming soon
Status
In progress. 0 of 11 articles published. Series launched March 2026. Phase 1 (Foundation) is in active development. Articles will be published sequentially and listed below as they become available.
Contribution Opportunities
Researchers wishing to engage with or build on this work are encouraged to consider the following directions:
- Benchmark archaeology: Contribute to the meta-meta-analysis by identifying benchmark studies not covered in the initial 200+ survey, particularly from non-English-language research communities.
- Dimension proposals: Suggest additional intelligence dimensions not covered by the eight-dimensional UIB framework, with supporting psychometric or cognitive science literature.
- Empirical validation: Run UIB evaluation protocols against frontier models once the Phase 3 benchmark suite is released, contributing results to the open dataset.
- Efficiency measurement: Develop or refine resource-normalization metrics that account for hardware heterogeneity, energy costs, and inference latency across deployment contexts.
- Human baselines: Design and conduct psychometric studies establishing human performance baselines on novel UIB dimensions, particularly tool creation and cross-domain transfer.