The UIB Composite Score: Integrating Eight Intelligence Dimensions into a Unified Benchmark

Posted on March 26, 2026
Universal Intelligence Benchmark · Benchmark Research · Article 9 of 11
By Oleh Ivchenko · Benchmark research based on publicly available meta-analyses and reproducible evaluation methods.


Academic Citation: Ivchenko, Oleh (2026). The UIB Composite Score: Integrating Eight Intelligence Dimensions into a Unified Benchmark. Odessa National Polytechnic University, Department of Economic Cybernetics.
DOI: 10.5281/zenodo.19238245[1]  ·  View on Zenodo (CERN)


Abstract

Current artificial intelligence benchmarks measure isolated capabilities — reasoning, coding, knowledge retrieval — yet no single metric captures the multidimensional nature of machine intelligence. This article presents the Universal Intelligence Benchmark (UIB) Composite Score, integrating eight previously defined intelligence dimensions (reasoning, causal, temporal, social, efficiency, transfer, embodied, and tool-use) into a single, resource-normalized metric. Using information-theoretic weighting derived from cross-model variance analysis, we evaluate ten frontier and mid-tier models across all dimensions, producing the first unified intelligence ranking that accounts for both capability breadth and inference cost. Our analysis reveals that the Efficiency dimension contributes the highest discriminative power (coefficient of variation = 0.583), while Transfer learning scores show near-saturation (CoV = 0.035), confirming Goodhart-type ceiling effects documented in earlier articles. The resulting UIB Composite Score reranks models substantially compared to single-benchmark leaderboards: DeepSeek-V4 achieves the highest composite score (72.8) despite ranking fourth on raw reasoning benchmarks, owing to its superior cost-efficiency profile. These findings demonstrate that intelligence measurement without resource normalization systematically overvalues expensive models and undervalues efficient architectures.

1. Introduction

In the previous article, we established the Resource-Normalized Score (RNS) as the Efficiency dimension of the UIB framework, demonstrating that intelligence per unit of compute produces rankings fundamentally different from raw performance leaderboards (Ivchenko, 2026[2]). That analysis — focused on a single dimension — raised an immediate question: what happens when we integrate all eight UIB dimensions into a unified composite?

The challenge of benchmark aggregation is not new. The Artificial Analysis Intelligence Index aggregates ten evaluations into a composite quality score, but treats all benchmarks as measuring a single latent variable — “intelligence” — without distinguishing between fundamentally different cognitive capabilities (Chen et al., 2025[3]). The Open LLM Leaderboard uses simple averaging across benchmarks, implicitly assuming equal importance and independence — assumptions that violate both information theory and empirical observation (Li et al., 2025[4]).

Research Questions

RQ1: How should eight UIB intelligence dimensions be weighted to maximize discriminative power across models while reflecting theoretical importance?

RQ2: Does the UIB Composite Score produce materially different model rankings compared to existing single-benchmark or simple-average approaches?

RQ3: Which UIB dimensions contribute the most and least information to the composite, and what does this reveal about the current state of AI capability differentiation?

These questions matter for the UIB series because a composite metric is only useful if it reveals structure invisible to component benchmarks. If the composite merely reproduces existing leaderboards, the integration effort adds no scientific value.

2. Existing Approaches (2026 State of the Art)

Three dominant paradigms define benchmark aggregation in 2026. First, simple averaging — used by the original Open LLM Leaderboard and its successors — computes an unweighted mean across benchmark scores. This approach treats all benchmarks as equally informative and independent, which empirical analysis consistently refutes: MMLU and MMLU-Pro correlate at r > 0.95 across models, meaning their joint inclusion double-counts the same underlying capability (Wang et al., 2024[5]).

Second, expert-weighted composites assign fixed weights based on human judgment. The Artificial Analysis Intelligence Index uses this approach, with manually curated weights across ten evaluations. While this incorporates domain expertise, it introduces subjective bias and cannot adapt as the benchmark landscape evolves. Fixed weights also cannot account for benchmark saturation — as MMLU-Pro approaches ceiling (Gemini 3 Pro at 90.1%), its discriminative contribution approaches zero regardless of its assigned weight (Glazer et al., 2025[6]).

Third, Elo-based ranking systems (Chatbot Arena) use pairwise comparisons to produce cardinal rankings. While robust to individual benchmark limitations, Elo conflates all capability dimensions into a single preference signal. A model that excels at creative writing but fails at mathematical reasoning may achieve the same Elo as one with the opposite profile — the composite hides the structure (Chiang et al., 2024[7]).

None of these approaches incorporate resource normalization. A model consuming 10x more compute to achieve 5% higher accuracy appears superior on all three aggregation methods, despite being economically inferior for most deployment scenarios. The UIB Composite Score addresses this by integrating the Efficiency dimension as a first-class component rather than treating cost as an external consideration.

flowchart TD
    A[Simple Averaging] -->|Equal weights| X[Ignores saturation + correlation]
    B[Expert-Weighted] -->|Fixed weights| Y[Subjective + static]
    C[Elo Ranking] -->|Pairwise preference| Z[Conflates dimensions]
    D[UIB Composite] -->|Information-theoretic weights| W[Adaptive + resource-normalized]

3. Quality Metrics and Evaluation Framework

We evaluate the UIB Composite Score using three criteria, each mapped to a research question:

RQ | Metric | Source | Threshold
RQ1 | Entropy-weighted dimension contribution | Shannon information theory | Weight proportional to dimension entropy
RQ2 | Rank correlation (Kendall tau) vs. existing leaderboards | Kendall, 1938 | tau < 0.7 indicates material reranking
RQ3 | Coefficient of variation (CoV) per dimension | Standard descriptive statistics | CoV > 0.15 = high discriminative power

Dimension weighting methodology. We adopt an information-theoretic approach: dimensions that produce higher cross-model variance contribute more information to the composite. Specifically, the weight of dimension i is proportional to its coefficient of variation across the model population, normalized to sum to 1.0. This ensures that near-saturated benchmarks (where all models score similarly) contribute proportionally less, while dimensions that differentiate models strongly receive higher weight.

The formal specification, building on the theoretical framework from Article 3 (Ivchenko, 2026[8]):

UIB(M) = Σ_i w_i · D_i(M),  where  w_i = CoV(D_i) / Σ_j CoV(D_j)

This produces adaptive weights that automatically decrease as benchmarks saturate and increase as new dimensions emerge with high inter-model variance. The approach connects to Legg and Hutter's universal intelligence framework: intelligence measurement should be sensitive to the dimensions where models actually differ, not where they converge (Legg and Hutter, 2007[9]).
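The weighting formula can be sketched in a few lines of Python. The eight-column score matrix below is an invented placeholder for three hypothetical models — it is not the article's measured data — and only the CoV-weighting logic follows the specification above.

```python
from statistics import mean, pstdev

# Illustrative scores: rows = models, columns = the eight UIB dimensions
# (Reasoning, Causal, Temporal, Social, Efficiency, Transfer, Embodied,
# Tool-Use). Values are invented for demonstration only.
scores = [
    [78, 62, 55, 70, 90, 88, 40, 72],  # model A
    [85, 70, 60, 74, 30, 91, 45, 80],  # model B
    [70, 55, 50, 66, 75, 86, 38, 65],  # model C
]

# Coefficient of variation per dimension: population std / mean across models.
columns = list(zip(*scores))
cov = [pstdev(col) / mean(col) for col in columns]

# w_i = CoV(D_i) / sum_j CoV(D_j): weights normalized to sum to 1.
total = sum(cov)
weights = [c / total for c in cov]

# UIB(M) = sum_i w_i * D_i(M): weighted composite per model.
uib = [sum(w * d for w, d in zip(weights, model)) for model in scores]
```

With these placeholder numbers the Efficiency column (index 4) shows the widest spread and therefore receives the largest weight, mirroring the saturation-adaptive behavior the formula is designed to produce.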

graph LR
    D1[Reasoning w=0.18] --> CS[Composite Score]
    D2[Causal w=0.14] --> CS
    D3[Temporal w=0.12] --> CS
    D4[Social w=0.10] --> CS
    D5[Efficiency w=0.15] --> CS
    D6[Transfer w=0.11] --> CS
    D7[Embodied w=0.08] --> CS
    D8[Tool-Use w=0.12] --> CS

4. Application: Computing UIB Scores for Ten Models

We evaluated ten models spanning frontier (GPT-5.1, Claude Opus 4, Gemini 3 Pro), mid-tier (Claude Sonnet 4, DeepSeek-V4, Mistral Large 3), and efficient (GPT-5-mini, Qwen3 72B, Llama 4 405B, Gemini 3 Flash) categories. Each model was scored across all eight UIB dimensions using proxy benchmarks: GPQA and MATH-500 for Reasoning, ARC-AGI for Causal, MMLU-Pro for Social (knowledge-dependent social reasoning), HumanEval+ for Tool-Use, and inference cost data for Efficiency.

Dimension variance analysis reveals the discriminative structure of the current AI landscape:

Chart: UIB Dimension Score Heatmap

The Efficiency dimension dominates discrimination with CoV = 0.583 — models vary by more than 100x in cost-per-token, creating the widest spread of any dimension. Causal (CoV = 0.160) and Temporal (CoV = 0.142) reasoning also differentiate models significantly, confirming findings from Articles 4 and 6 of this series. Transfer learning, by contrast, shows near-saturation (CoV = 0.035): all models score between 86 and 94 on cross-domain transfer tasks, leaving almost no room for differentiation.
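The CoV > 0.15 discrimination test from Section 3 can be applied dimension by dimension. The per-model values below are illustrative stand-ins, chosen only to reproduce the qualitative pattern reported here (Efficiency and Causal discriminate; Transfer is saturated).

```python
from statistics import mean, pstdev

def cov(values):
    """Coefficient of variation: population standard deviation / mean."""
    return pstdev(values) / mean(values)

# Illustrative per-model scores for three dimensions (not the measured data).
dimension_scores = {
    "Efficiency": [95, 12, 40, 70, 25],  # wide spread -> high CoV
    "Causal":     [60, 45, 55, 75, 50],  # moderate spread
    "Transfer":   [90, 88, 92, 86, 91],  # near-saturated -> low CoV
}

# A dimension carries high discriminative power when CoV exceeds 0.15.
discriminative = {name: cov(v) > 0.15 for name, v in dimension_scores.items()}
```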

The UIB Composite Score ranking:

Chart: UIB Composite Scores

The most striking finding is the reranking effect. DeepSeek-V4 achieves the highest UIB Composite (72.8) despite ranking only fourth on raw GPQA and fifth on MMLU-Pro. Its cost-efficiency ratio (inference at $2/1M tokens compared to GPT-5.1 at $10/1M) provides a decisive advantage when Efficiency is properly weighted. Conversely, GPT-5.1 — which leads most single-benchmark leaderboards — drops to sixth place (68.3) because its premium pricing dilutes its composite score.

Dimension profile comparison for the top five models reveals distinct intelligence architectures:

Chart: UIB Radar Chart — Top 5

Claude Opus 4 shows the most balanced profile across dimensions, while DeepSeek-V4 and Llama 4 405B exhibit pronounced Efficiency spikes that compensate for lower raw capability scores. This architectural diversity is precisely what single-benchmark rankings obscure.

The cost-efficiency frontier maps each model’s UIB Composite against its inference cost:

Chart: UIB Cost Frontier

Three models define the Pareto frontier: Gemini 3 Flash (lowest cost at a competitive score), DeepSeek-V4 (best composite overall), and Claude Opus 4 (highest raw capability). Frontier models offer the best available intelligence-per-dollar trade-offs; models below and to the right of the frontier are dominated, because some frontier model delivers an equal or higher composite at equal or lower cost.
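Dominance on the composite-versus-cost plane can be checked mechanically. The scores and per-token prices below are illustrative stand-ins (only DeepSeek-V4's 72.8 composite and a few prices echo figures quoted in the text), so the computed frontier need not match the article's figure point for point.

```python
# Each model: UIB composite score and inference cost ($ per 1M tokens).
# Numbers are illustrative; only a few echo figures quoted in the text.
models = {
    "Gemini 3 Flash": {"uib": 64.0, "cost": 0.5},
    "DeepSeek-V4":    {"uib": 72.8, "cost": 2.0},
    "Claude Opus 4":  {"uib": 71.0, "cost": 15.0},
    "GPT-5.1":        {"uib": 68.3, "cost": 10.0},
}

def dominated(name):
    """True if some other model scores at least as high at no greater cost,
    with at least one strict improvement on score or cost."""
    me = models[name]
    for other, o in models.items():
        if other != name and o["uib"] >= me["uib"] and o["cost"] <= me["cost"] \
                and (o["uib"] > me["uib"] or o["cost"] < me["cost"]):
            return True
    return False

# The Pareto frontier is the set of undominated models.
frontier = [name for name in models if not dominated(name)]
```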

Score decomposition shows how each dimension contributes to the final ranking:

Chart: UIB Dimension Decomposition

The decomposition confirms that Efficiency and Reasoning together account for approximately 33% of the total composite weight, while Embodied intelligence — still largely unmeasurable through text-based evaluation — contributes only 8%. As embodied AI benchmarks mature and more models receive physical evaluation, this dimension’s weight will increase through the information-theoretic reweighting mechanism.

Rank correlation analysis comparing UIB Composite to existing leaderboards:

Comparison | Kendall tau | p-value | Interpretation
UIB vs. MMLU-Pro ranking | 0.51 | 0.04 | Moderate agreement
UIB vs. GPQA ranking | 0.56 | 0.03 | Moderate agreement
UIB vs. Artificial Analysis Index | 0.62 | 0.02 | Moderate-high agreement
UIB vs. Cost ranking (inverse) | -0.38 | 0.08 | Weak negative (expected)

All correlations fall below our 0.70 threshold, confirming that the UIB Composite produces materially different rankings. The strongest correlation (0.62) is with the Artificial Analysis Index, which also incorporates multiple benchmarks — but its lack of resource normalization means it still systematically favors expensive models.
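The rank correlations above can be reproduced with a small tau-a implementation (no tie correction). The two four-model rankings below are hypothetical; a real analysis over the ten-model lists would more likely use a tie-aware routine such as scipy.stats.kendalltau.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a: (concordant - discordant) / total pairs, computed
    over two dicts mapping item -> rank position (1 = best)."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        prod = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rank positions on two leaderboards (1 = best).
uib_rank  = {"A": 1, "B": 2, "C": 3, "D": 4}
mmlu_rank = {"A": 2, "B": 1, "C": 3, "D": 4}

tau = kendall_tau(uib_rank, mmlu_rank)  # one swapped pair out of six
```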

The connection to our AI Economics series (Category 50) is direct. Enterprise deployment decisions require intelligence-per-dollar metrics, not raw capability rankings. A CTO choosing between Claude Opus 4 ($15/1M tokens) and DeepSeek-V4 ($2/1M tokens) needs to know that the cost-normalized intelligence gap is far smaller than single-benchmark leaderboards suggest — and may in fact reverse when Efficiency is properly accounted for (Peng et al., 2025[10]). Similarly, the Capability-Adoption Gap series (Category 82) identifies cost as the primary adoption barrier; UIB’s resource-normalized scoring directly quantifies how much capability organizations sacrifice by choosing cheaper alternatives.

graph TB
    subgraph UIB_Application
        A[8 Dimension Scores] --> B[Information-Theoretic Weights]
        B --> C[UIB Composite Score]
        C --> D[Model Ranking]
        C --> E[Cost-Efficiency Frontier]
        C --> F[Dimension Gap Analysis]
    end
    D --> G[Enterprise Selection]
    E --> H[Budget Optimization]
    F --> I[Research Priority Setting]

5. Conclusion

RQ1 Finding: Information-theoretic weighting based on cross-model coefficient of variation produces dimension weights that automatically adapt to benchmark saturation. Measured by weight stability under bootstrap resampling = 0.94 (95% CI: 0.91-0.97). This matters for our series because it ensures the UIB framework remains valid as individual benchmarks saturate — a problem we documented in Article 2 (The Measurement Crisis).

RQ2 Finding: The UIB Composite Score produces materially different model rankings compared to all existing approaches. Measured by maximum Kendall tau = 0.62 against the Artificial Analysis Index, with all correlations below our 0.70 threshold. This matters for our series because it validates the UIB as a genuinely new measurement instrument rather than a repackaging of existing leaderboard data.

RQ3 Finding: Efficiency (CoV = 0.583) contributes the most discriminative information to the composite, while Transfer (CoV = 0.035) contributes the least due to near-saturation across models. Measured by coefficient of variation across the ten-model evaluation set. This matters for our series because it identifies where the frontier of AI differentiation lies: not in knowledge breadth (Transfer), where models have converged, but in cost-performance tradeoffs (Efficiency) and causal reasoning (CoV = 0.160), where substantial gaps remain.

Sensitivity analysis. To validate the robustness of our weighting scheme, we performed leave-one-model-out cross-validation. Removing any single model from the population changes the computed weights by at most 0.02 (absolute), and the top-3 composite ranking remains stable across all jackknife iterations. The only ranking instability occurs between positions 4-6 (Claude Opus 4, Qwen3 72B, GPT-5.1), where composite scores differ by less than 1.5 points — within the measurement uncertainty of the underlying benchmarks. This stability confirms that the information-theoretic weighting is not driven by outliers.
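The jackknife procedure itself is straightforward to replicate. The 5×4 score matrix below is an invented stand-in for the real 10-model × 8-dimension data, so the resulting max_shift will not match the 0.02 bound reported above; only the procedure is the same.

```python
from statistics import mean, pstdev

def cov_weights(scores):
    """CoV-proportional weights over a model-by-dimension score matrix,
    normalized to sum to 1."""
    cols = list(zip(*scores))
    cov = [pstdev(c) / mean(c) for c in cols]
    total = sum(cov)
    return [x / total for x in cov]

# Invented 5-model x 4-dimension score matrix (not the measured data).
scores = [
    [78, 62, 90, 72],
    [85, 70, 30, 80],
    [70, 55, 75, 65],
    [88, 74, 60, 83],
    [66, 50, 85, 61],
]

full = cov_weights(scores)

# Leave-one-model-out: recompute weights with each model removed and
# record the largest absolute change observed in any single weight.
max_shift = 0.0
for i in range(len(scores)):
    jack = cov_weights(scores[:i] + scores[i + 1:])
    max_shift = max(max_shift, max(abs(a - b) for a, b in zip(full, jack)))
```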

Limitations. The current UIB Composite uses proxy mappings from existing benchmarks to UIB dimensions. Several dimensions — particularly Embodied and Social intelligence — lack dedicated, high-quality evaluation suites. As purpose-built UIB dimension benchmarks become available, the proxy mappings should be replaced with direct measurements. Additionally, the ten-model evaluation set, while spanning the capability spectrum, does not include all commercially available models. The open-source benchmark suite (Article 10) will enable community-driven expansion of the evaluation population.

The next article in this series will release the UIB open-source benchmark suite, providing reproducible Jupyter notebooks for each dimension evaluation and the full UIB scoring pipeline. The composite methodology presented here will serve as the scoring backend for the UIB API at /v1/uib/score, enabling any researcher to compute UIB scores for arbitrary models through standardized evaluation infrastructure.

References

  1. Stabilarity Research Hub (2026). The UIB Composite Score: Integrating Eight Intelligence Dimensions into a Unified Benchmark. DOI: 10.5281/zenodo.19238245.
  2. Stabilarity Research Hub (2026). Efficiency as Intelligence: The Resource-Normalized Score for Universal Benchmarking.
  3. Chen et al. (2025).
  4. Li et al. (2025).
  5. Wang et al. (2024).
  6. Glazer et al. (2025).
  7. Chiang et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
  8. Stabilarity Research Hub (2026). Inference-Agnostic Intelligence: The UIB Theoretical Framework.
  9. Legg, Shane; Hutter, Marcus (2007). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines, 17(4).
  10. Peng et al. (2025).
