# Inference-Agnostic Intelligence: The UIB Theoretical Framework
DOI: 10.5281/zenodo.19064304[1] (Zenodo, CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 13% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 81% | ✓ | ≥80% from verified, high-quality sources |
| [a] | DOI | 50% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 6% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 63% | ○ | ≥80% have metadata indexed |
| [l] | Academic | 31% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 69% | ○ | ≥80% are freely accessible |
| [r] | References | 16 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,086 | ✓ | Minimum 2,000 words for a full research article. Current: 2,086 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19064304 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 40% | ✗ | ≥80% of references from 2025–2026. Current: 40% |
| [c] | Data Charts | 0 | ○ | Original data charts from reproducible analysis (min 2). Current: 0 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
## Abstract
Current AI benchmarks measure narrow task performance — accuracy on question sets, code generation pass rates, or image recognition scores. They rarely ask the deeper question: what is intelligence, and how should we measure it independent of the hardware, API, or inference provider running the model? This article proposes the Universal Intelligence Benchmark (UIB) theoretical framework: an eight-dimensional, cost-normalized measurement system designed to evaluate intelligence across any model accessible via standardized API. Drawing on formal definitions from Legg and Hutter (2007)[2], the compression-based intelligence theory of Schmidhuber (2024)[3], and the algorithmic information-theoretic approach of Chollet (2019)[4], UIB introduces a composite score that normalizes capability by computational cost — measuring intelligence per dollar, not intelligence per benchmark leak.
## The Problem with Current Benchmarks
The AI evaluation landscape in 2026 faces a fundamental crisis. As documented in our previous analysis of the measurement crisis (Ivchenko, 2026)[5], benchmark saturation has rendered traditional leaderboards unreliable. MMLU scores cluster above 90%. HumanEval pass rates exceed 95%. GPQA-Diamond, once considered resistant to saturation, now sees frontier models scoring above 80% — as documented by Rein et al. (2026)[6] in their Nature study of expert-level academic benchmarks.
The NIST AI evaluation report (Keller et al., 2026)[7] highlights a critical gap: standard benchmarks report point estimates without proper uncertainty quantification. Their proposed Generalized Linear Mixed Models (GLMMs) approach demonstrates that benchmark scores often mask significant variance in latent capabilities. A model scoring 85% on GPQA-Diamond may perform anywhere from 78% to 92% depending on question sampling — yet leaderboards present this as a single definitive number.
Meanwhile, the MLPerf inference benchmark gap (Morgan, 2026)[8] reveals another blind spot: even hardware benchmarks omit pricing and power consumption data, making it impossible to calculate the true cost of intelligence.
```mermaid
graph TD
    A[Current Benchmark Landscape] --> B[Task-Specific Accuracy]
    A --> C[Leaderboard Rankings]
    A --> D[Single-Score Summaries]
    B --> E[Saturation Problem]
    C --> F[Contamination Problem]
    D --> G[Dimensionality Collapse]
    E --> H[UIB Multi-Dimensional Framework]
    F --> H
    G --> H
    H --> I[Eight Intelligence Dimensions]
    H --> J[Cost Normalization]
    H --> K[Uncertainty Quantification]
```
## Theoretical Foundations

### Formal Intelligence Definitions
The UIB framework rests on three theoretical pillars.
Pillar 1: Universal Intelligence. Legg and Hutter (2007)[2] defined universal intelligence as an agent’s ability to achieve goals across a wide range of environments, formalized as the expected reward over all computable environments weighted by Kolmogorov complexity. Their definition — Υ(π) = Σ_μ 2^(−K(μ)) · V^π_μ, summing over computable environments μ — provides the theoretical ceiling for any intelligence measure. No finite benchmark can fully capture universal intelligence, but UIB aims to approximate it across eight operationally distinct dimensions.
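In display form, the Legg–Hutter measure reads (π a policy, μ ranging over the set E of computable environments, K the Kolmogorov complexity, and V^π_μ the expected value π achieves in μ):

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

The 2^(−K(μ)) term is the complexity weighting: simple environments contribute most, so an agent cannot inflate its score by excelling only in exotic, hard-to-describe worlds.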
Pillar 2: Compression as Intelligence. Schmidhuber (2024)[3] argues that intelligence fundamentally reduces to compression efficiency: the ability to find short programs that explain observed data. His speed prior framework adds a critical resource constraint — not just any compression, but fast compression. This directly informs UIB’s cost normalization: a model that achieves the same compression with fewer computational resources is, by Schmidhuber’s framework, more intelligent. The speed prior assigns a computable object a probability that decays exponentially with the length of the shortest program computing it and shrinks with that program’s runtime.
Pillar 3: Algorithmic Information Theory. Chollet (2019)[4] operationalized intelligence measurement through the lens of skill-acquisition efficiency — how quickly a system can learn new skills from few examples, given a fixed set of priors. The ARC benchmark implements this by requiring novel pattern recognition that resists memorization. The ARC Prize 2025 Technical Report (Chollet et al., 2026)[9] finds that even frontier reasoning models still lean on knowledge coverage rather than genuine fluid intelligence, yet ARC-AGI-2 scores jumped from below 5% to over 80% in early 2026 through code-evolution approaches (Imbue, 2026)[10] — raising questions about whether even novel benchmarks can resist optimization pressure.
```mermaid
graph LR
    subgraph "Theoretical Foundations"
        LH["Legg-Hutter<br/>Universal Intelligence"]
        JS["Schmidhuber<br/>Compression + Speed Prior"]
        FC["Chollet<br/>Skill-Acquisition Efficiency"]
    end
    LH -->|Theoretical Ceiling| UIB[UIB Framework]
    JS -->|Cost Normalization| UIB
    FC -->|Dimension Design| UIB
    UIB --> D1[8 Dimensions]
    UIB --> CN["$/Intelligence"]
    UIB --> UQ[Uncertainty Bounds]
```
### The Decision Readiness Connection
Our own Decision Readiness Index (DRI) framework (Ivchenko, 2026)[11] provides a complementary lens. DRI measures not raw capability but decision readiness — the point at which a system has sufficient information to make a reliable decision. This maps directly to UIB’s temporal and planning dimension: an intelligent system should recognize when it has enough information to act, rather than either acting prematurely or gathering data indefinitely. The HPF framework (Ivchenko, 2026)[12] formalizes this as a holistic decision-readiness measure for portfolio management, but the principle generalizes to any intelligence assessment.
## The Eight UIB Dimensions
UIB decomposes intelligence into eight operationally distinct dimensions, each corresponding to a measurable capability that existing benchmarks partially capture but none integrate.
### Dimension 1: Causal Intelligence (D_causal)
The ability to reason about cause and effect beyond statistical correlation. Bareinboim et al. (2022)[13] formalized Pearl’s Causal Hierarchy (PCH) into three levels: association (seeing), intervention (doing), and counterfactuals (imagining). UIB-Causal tests all three levels, measuring whether a model can distinguish correlation from causation, predict intervention outcomes, and reason about counterfactual scenarios.
### Dimension 2: Embodied Intelligence (D_embodied)
Intelligence grounded in physical interaction — understanding spatial relationships, physics, and sensorimotor coordination. This dimension connects directly to our Open Humanoid research on proprioception (Ivchenko, 2026)[14], where we demonstrated that internal state estimation is fundamental to embodied intelligence.
### Dimension 3: Multimodal Integration (D_multimodal)
The ability to synthesize information across modalities — text, image, audio, video, structured data — into coherent understanding. This goes beyond simple modality recognition to cross-modal reasoning: using visual context to disambiguate text, or combining audio cues with textual descriptions to make inferences neither modality supports alone.
### Dimension 4: Temporal and Planning Intelligence (D_temporal)
Long-horizon reasoning, planning under uncertainty, and temporal abstraction. This is where the DRI concept becomes directly operationally relevant: an intelligent system must determine when it has sufficient information to commit to a plan, and when to continue gathering evidence.
### Dimension 5: Social Intelligence (D_social)
Theory of mind, pragmatic communication, negotiation, and collaborative reasoning. This dimension measures whether a system can represent other agents’ beliefs, predict their actions, and coordinate effectively.
### Dimension 6: Tool Creation (D_tool)
Not merely tool use (calling APIs, writing code) but the ability to create new tools — abstractions, libraries, frameworks — that extend the system’s own capabilities. This dimension tests Schmidhuber’s insight that truly intelligent systems are self-improving: they compress their own problem-solving process.
### Dimension 7: Transfer Intelligence (D_transfer)
The ability to apply knowledge from one domain to solve problems in another, with minimal additional training. This is Chollet’s skill-acquisition efficiency operationalized: how many examples does the system need to learn a genuinely new task?
### Dimension 8: Efficiency (D_efficiency)
Intelligence per unit of compute, energy, or cost. This is not a separate capability but a normalizing dimension that transforms raw scores into resource-adjusted intelligence. Following Schmidhuber’s speed prior, a model achieving 80% performance at 1/10th the cost is more intelligent than one achieving 85% at full cost.
```mermaid
graph TB
    UIB[UIB Composite Score] --> C["D_causal<br/>Causal Reasoning"]
    UIB --> E["D_embodied<br/>Physical Intelligence"]
    UIB --> M["D_multimodal<br/>Cross-Modal Synthesis"]
    UIB --> T["D_temporal<br/>Planning & Horizon"]
    UIB --> S["D_social<br/>Theory of Mind"]
    UIB --> TC["D_tool<br/>Tool Creation"]
    UIB --> TR["D_transfer<br/>Domain Transfer"]
    UIB --> EF["D_efficiency<br/>Intelligence per Dollar"]
    C -.-> PCH[Pearl Causal Hierarchy]
    T -.-> DRI[Decision Readiness Index]
    EF -.-> SP[Schmidhuber Speed Prior]
    TR -.-> ARC[Chollet ARC Framework]
```
## Mathematical Formulation
The UIB composite score for a model M is defined as:
UIB(M) = (Σᵢ wᵢ · Dᵢ(M)) / C(M)
Where:
- Dᵢ(M) is the normalized score on dimension i ∈ {1,…,8}, scaled to [0,1]
- wᵢ is the information-theoretic weight for dimension i, derived from the dimension’s discriminative power across the model population
- C(M) is the compute cost normalization factor: total inference cost (in USD) for the complete evaluation suite
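The definition above translates directly into code. A minimal Python sketch follows; the short dimension keys and the input-validation behavior are illustrative choices, not part of the framework specification:

```python
from typing import Dict

# The eight UIB dimensions from the text; these short keys are illustrative.
DIMENSIONS = [
    "causal", "embodied", "multimodal", "temporal",
    "social", "tool", "transfer", "efficiency",
]

def uib_score(dim_scores: Dict[str, float],
              weights: Dict[str, float],
              cost_usd: float) -> float:
    """UIB(M): weighted sum of dimension scores divided by evaluation cost.

    dim_scores: normalized D_i(M) in [0, 1] for each dimension.
    weights:    non-negative w_i summing to 1.
    cost_usd:   C(M), total inference cost of the full suite in USD.
    """
    if cost_usd <= 0:
        raise ValueError("evaluation cost must be positive")
    raw = sum(weights[d] * dim_scores[d] for d in DIMENSIONS)
    return raw / cost_usd
```

With every dimension at 0.75, uniform weights, and a $0.50 suite cost this yields 1.50, consistent with the worked example in the Cost Normalization section.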
### Weight Determination
Rather than imposing arbitrary dimension weights, UIB derives them empirically. Let σ²ᵢ be the variance of dimension i scores across all evaluated models. The weight for dimension i is:
wᵢ = σ²ᵢ / Σⱼ σ²ⱼ
Dimensions that discriminate more between models receive higher weight. This prevents dimensions where all models score similarly (and thus provide little discriminative information) from dominating the composite. This approach follows the principle from information theory that the value of a measurement is proportional to its entropy — a dimension where all models score 0.95 ± 0.02 carries less information than one where scores range from 0.3 to 0.9.
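The variance-based weighting can be sketched as follows, assuming each dimension’s scores across the evaluated model population are already collected (function names are illustrative; the uniform fallback for the zero-variance edge case is an added assumption not specified in the text):

```python
import statistics
from typing import Dict, List

def dimension_weights(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """w_i = var_i / sum_j var_j over the evaluated model population.

    scores maps each dimension name to the list of scores that all
    evaluated models achieved on that dimension.
    """
    variances = {d: statistics.pvariance(v) for d, v in scores.items()}
    total = sum(variances.values())
    if total == 0:
        # Degenerate case: no dimension discriminates at all.
        # Fall back to uniform weights (an assumption, not from the text).
        return {d: 1.0 / len(scores) for d in scores}
    return {d: var / total for d, var in variances.items()}
```

A dimension where all models score 0.95 gets zero weight here, while one with scores spread from 0.3 to 0.9 absorbs nearly all of it, matching the entropy argument above.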
### Cost Normalization
The cost function C(M) captures total inference expenditure across the evaluation suite:
C(M) = Σₜ (tokens_in(t) · price_in(M) + tokens_out(t) · price_out(M))
For open-source models run on custom hardware, C(M) substitutes energy cost: watts × hours × electricity rate. This ensures fair comparison between API-accessed and self-hosted models.
The normalization means UIB explicitly rewards efficiency. A model achieving UIB_raw = 0.75 at C = $0.50 scores UIB = 1.50, while a model achieving UIB_raw = 0.80 at C = $5.00 scores UIB = 0.16. The cheaper model is nearly 10× more “intelligent” per dollar.
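Both cost paths can be sketched in a few lines; the per-token prices, wattage, and electricity rate used in the comments are placeholders, not real provider figures:

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class Pricing:
    """Per-token prices in USD (values supplied by the user, per provider)."""
    usd_per_input_token: float
    usd_per_output_token: float

def evaluation_cost(pricing: Pricing,
                    task_tokens: Iterable[Tuple[int, int]]) -> float:
    """C(M): input/output token cost summed over every task t in the suite."""
    return sum(t_in * pricing.usd_per_input_token +
               t_out * pricing.usd_per_output_token
               for t_in, t_out in task_tokens)

def self_hosted_cost(watts: float, hours: float, usd_per_kwh: float) -> float:
    """Energy substitute for self-hosted models: kWh consumed x electricity rate."""
    return (watts / 1000.0) * hours * usd_per_kwh
```

For example, at $1/M input tokens and $2/M output tokens, a task consuming 1,000 input and 500 output tokens costs $0.002; a 500 W rig running 2 hours at $0.30/kWh costs $0.30.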
### Confidence Intervals
Following the NIST GLMM approach, each dimension score includes bootstrapped confidence intervals. The composite UIB score reports:
UIB(M) = μ ± 1.96 · SE (95% CI)
This prevents the false precision that plagues current leaderboards. When UIB reports that Model A scores 1.45 ± 0.12 and Model B scores 1.38 ± 0.15, the overlapping confidence intervals honestly communicate that these models may not be meaningfully different.
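A minimal bootstrap sketch for these intervals, assuming per-item correctness scores are available; the normal-approximation μ ± 1.96·SE form follows the formula above, while the resample count and seed are arbitrary choices:

```python
import random
import statistics
from typing import List, Tuple

def bootstrap_ci(per_item_scores: List[float],
                 n_resamples: int = 2000,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) with a 95% normal-approximation CI.

    Resamples the per-item scores with replacement and uses the standard
    deviation of the resampled means as the bootstrap SE estimate.
    """
    rng = random.Random(seed)
    n = len(per_item_scores)
    means = []
    for _ in range(n_resamples):
        sample = [per_item_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(per_item_scores) / n
    se = statistics.stdev(means)
    return mu, mu - 1.96 * se, mu + 1.96 * se
```

Reporting the interval rather than the point estimate is what lets two models with overlapping CIs be honestly declared statistically indistinguishable.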
## Inference Agnosticism: The OpenRouter Principle
A critical design decision separates UIB from existing benchmarks: the evaluation pipeline is inference-agnostic. Users provide their own API key (via OpenRouter or any OpenAI-compatible endpoint), and the UIB pipeline orchestrates the evaluation tasks. Stabilarity hosts the evaluation logic, not the inference compute.
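As a sketch of what inference agnosticism means in practice, the following builds a chat-completion request against any OpenAI-compatible endpoint using only the Python standard library. The base URL, environment-variable name, and model identifier are illustrative assumptions, not fixed parts of the UIB pipeline:

```python
import json
import os
import urllib.request

# Hypothetical default; any OpenAI-compatible base URL can be substituted.
BASE_URL = os.environ.get("UIB_BASE_URL", "https://openrouter.ai/api/v1")

def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat-completion request for an OpenAI-compatible endpoint.

    The pipeline only orchestrates tasks; inference runs wherever the
    user's API key points, so swapping providers is a one-line URL change.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # as deterministic as the provider allows, for scoring
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request (e.g. via `urllib.request.urlopen`) and parsing the response is left out; the point is that nothing here is provider-specific.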
This has several advantages. First, it eliminates the vendor lock-in problem — any model accessible via API can be evaluated. Second, it naturally captures real-world inference costs, since users pay actual market rates rather than subsidized benchmark pricing. Third, it allows continuous evaluation as model pricing changes: the same model’s UIB score changes when its API price drops, reflecting the real-world improvement in intelligence accessibility.
Our meta-meta-analysis (Ivchenko, 2026)[15] identified 200+ benchmark studies, most of which were locked to specific evaluation infrastructure. UIB’s inference-agnostic design ensures that any researcher with an API key can reproduce our results — the evaluation pipeline is deterministic given the same model outputs.
## From Theory to Implementation
The UIB framework outlined here provides the theoretical foundation. Subsequent articles in this series will operationalize each dimension with specific evaluation protocols:
- Article 4 will define the UIB-Causal evaluation suite, building on the Pearl Causal Hierarchy and existing causal benchmarks like CLadder and CausalBench
- Articles 5-8 will specify embodied, temporal, social, and efficiency evaluation protocols
- Article 9 will present the first UIB composite scores for 20+ frontier models evaluated via OpenRouter
- Article 10 will release the complete open-source benchmark suite
The framework deliberately avoids prescribing which models are “best.” Instead, it provides the measurement apparatus — a telescope, not a verdict. Different applications will weight dimensions differently: an autonomous vehicle system cares deeply about embodied and causal intelligence, while a customer service agent prioritizes social intelligence and efficiency.
## Limitations and Open Questions
Several challenges remain unresolved. First, the dimension decomposition is not provably complete — there may be aspects of intelligence not captured by these eight dimensions. Second, cost normalization assumes that API pricing reflects true computational cost, which may not hold for subsidized endpoints. Third, the weight determination method requires a sufficiently large model population to produce stable variance estimates. Fourth, as demonstrated by the rapid improvement on ARC-AGI-2, any fixed benchmark faces optimization pressure — UIB’s multi-dimensional design mitigates but does not eliminate this risk.
The framework also inherits a fundamental philosophical tension: Legg-Hutter universal intelligence is defined over all computable environments, but any practical benchmark must sample a finite subset. UIB’s eight dimensions represent our best current decomposition of the intelligence space, informed by cognitive science and AI research, but they are not mathematically guaranteed to span the full space of possible intelligence.
## Conclusion
The Universal Intelligence Benchmark theoretical framework addresses three critical gaps in current AI evaluation: dimensionality collapse (reducing intelligence to a single score), cost blindness (ignoring the resources required to achieve that score), and false precision (reporting point estimates without uncertainty). By grounding the framework in formal intelligence theory — Legg-Hutter universality, Schmidhuber compression, and Chollet skill-acquisition efficiency — and normalizing by real-world inference cost, UIB provides a measurement system that answers the question enterprises and researchers actually care about: not “which model scored highest on a leaderboard,” but “which model delivers the most intelligence per dollar for my specific use case?”
The next article in this series will operationalize the first UIB dimension — causal intelligence — with a concrete evaluation protocol and preliminary results.
## References (15)

1. Stabilarity Research Hub. Inference-Agnostic Intelligence: The UIB Theoretical Framework. doi.org.
2. Legg, S., & Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines. link.springer.com.
3. Schmidhuber, J. Annotated History of Modern AI and Deep Learning. arXiv:2212.11279. arxiv.org.
4. Chollet, F. (2019). On the Measure of Intelligence. arXiv:1911.01547. arxiv.org.
5. Stabilarity Research Hub. (2026). The Measurement Crisis: Saturation, Goodhart's Law, and the End of AI Leaderboards. doi.org.
6. Rein et al. (2026). Nature. nature.com.
7. Keller et al. (2026). Expanding the AI Evaluation Toolbox with Statistical Models. NIST. nist.gov.
8. Morgan (2026). We Need A Proper AI Inference Benchmark Test. nextplatform.com.
9. Chollet et al. (2026). ARC Prize 2025: Technical Report. arXiv:2601.10904. arxiv.org.
10. Imbue (2026). Beating ARC-AGI-2 with Code Evolution. imbue.com.
11. Stabilarity Research Hub. (2026). Decision Readiness Index (DRI): Measuring Information Sufficiency for Portfolio Decisions. doi.org.
12. Stabilarity Research Hub. (2026). HPF: A Holistic Framework for Decision-Readiness in Pharmaceutical Portfolio Management. doi.org.
13. Bareinboim et al. (2022). causalai.net.
14. Stabilarity Research Hub. Proprioception and Internal State Estimation: Joint Encoders, Torque Sensing, and Body Schema for Humanoid Robots. doi.org.
15. Stabilarity Research Hub. (2026). The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured. doi.org.