The UIB Composite Score: Integrating Eight Intelligence Dimensions into a Unified Benchmark

Posted on March 26, 2026
Universal Intelligence Benchmark · Benchmark Research · Article 9 of 11
By Oleh Ivchenko · Benchmark research based on publicly available meta-analyses and reproducible evaluation methods.


Academic Citation: Ivchenko, Oleh (2026). The UIB Composite Score: Integrating Eight Intelligence Dimensions into a Unified Benchmark. Odessa National Polytechnic University, Department of Economic Cybernetics.
DOI: 10.5281/zenodo.19238245[1]  ·  View on Zenodo (CERN)


Abstract

Current artificial intelligence benchmarks measure isolated capabilities — reasoning, coding, knowledge retrieval — yet no single metric captures the multidimensional nature of machine intelligence. This article presents the Universal Intelligence Benchmark (UIB) Composite Score, integrating eight previously defined intelligence dimensions (reasoning, causal, temporal, social, efficiency, transfer, embodied, and tool-use) into a single, resource-normalized metric. Using information-theoretic weighting derived from cross-model variance analysis, we evaluate ten frontier and mid-tier models across all dimensions, producing the first unified intelligence ranking that accounts for both capability breadth and inference cost. Our analysis reveals that the Efficiency dimension contributes the highest discriminative power (coefficient of variation = 0.583), while Transfer learning scores show near-saturation (CoV = 0.035), confirming Goodhart-type ceiling effects documented in earlier articles. The resulting UIB Composite Score reranks models substantially compared to single-benchmark leaderboards: DeepSeek-V4 achieves the highest composite score (72.8) despite ranking fourth on raw reasoning benchmarks, owing to its superior cost-efficiency profile. These findings demonstrate that intelligence measurement without resource normalization systematically overvalues expensive models and undervalues efficient architectures.

1. Introduction

In the previous article, we established the Resource-Normalized Score (RNS) as the Efficiency dimension of the UIB framework, demonstrating that intelligence per unit of compute produces rankings fundamentally different from raw performance leaderboards (Ivchenko, 2026[2]). That analysis — focused on a single dimension — raised an immediate question: what happens when we integrate all eight UIB dimensions into a unified composite?

The challenge of benchmark aggregation is not new. The Artificial Analysis Intelligence Index aggregates ten evaluations into a composite quality score, but treats all benchmarks as measuring a single latent variable — “intelligence” — without distinguishing between fundamentally different cognitive capabilities (Chen et al., 2025[3]). The Open LLM Leaderboard uses simple averaging across benchmarks, implicitly assuming equal importance and independence — assumptions that violate both information theory and empirical observation (Li et al., 2025[4]).

Research Questions

RQ1: How should eight UIB intelligence dimensions be weighted to maximize discriminative power across models while reflecting theoretical importance?

RQ2: Does the UIB Composite Score produce materially different model rankings compared to existing single-benchmark or simple-average approaches?

RQ3: Which UIB dimensions contribute the most and least information to the composite, and what does this reveal about the current state of AI capability differentiation?

These questions matter for the UIB series because a composite metric is only useful if it reveals structure invisible to component benchmarks. If the composite merely reproduces existing leaderboards, the integration effort adds no scientific value.

2. Existing Approaches (2026 State of the Art)

Three dominant paradigms define benchmark aggregation in 2026. First, simple averaging — used by the original Open LLM Leaderboard and its successors — computes an unweighted mean across benchmark scores. This approach treats all benchmarks as equally informative and independent, which empirical analysis consistently refutes: MMLU and MMLU-Pro correlate at r > 0.95 across models, meaning their joint inclusion double-counts the same underlying capability (Wang et al., 2024[5]).

Second, expert-weighted composites assign fixed weights based on human judgment. The Artificial Analysis Intelligence Index uses this approach, with manually curated weights across ten evaluations. While this incorporates domain expertise, it introduces subjective bias and cannot adapt as the benchmark landscape evolves. Fixed weights also cannot account for benchmark saturation — as MMLU-Pro approaches ceiling (Gemini 3 Pro at 90.1%), its discriminative contribution approaches zero regardless of its assigned weight (Glazer et al., 2025[6]).

Third, Elo-based ranking systems (Chatbot Arena) use pairwise comparisons to produce cardinal rankings. While robust to individual benchmark limitations, Elo conflates all capability dimensions into a single preference signal. A model that excels at creative writing but fails at mathematical reasoning may achieve the same Elo as one with the opposite profile — the composite hides the structure (Chiang et al., 2024[7]).

None of these approaches incorporate resource normalization. A model consuming 10x more compute to achieve 5% higher accuracy appears superior on all three aggregation methods, despite being economically inferior for most deployment scenarios. The UIB Composite Score addresses this by integrating the Efficiency dimension as a first-class component rather than treating cost as an external consideration.

flowchart TD
    A[Simple Averaging] -->|Equal weights| X[Ignores saturation + correlation]
    B[Expert-Weighted] -->|Fixed weights| Y[Subjective + static]
    C[Elo Ranking] -->|Pairwise preference| Z[Conflates dimensions]
    D[UIB Composite] -->|Information-theoretic weights| W[Adaptive + resource-normalized]

3. Quality Metrics and Evaluation Framework

We evaluate the UIB Composite Score using three criteria, each mapped to a research question:

RQ | Metric | Source | Threshold
RQ1 | Entropy-weighted dimension contribution | Shannon information theory | Weight proportional to dimension entropy
RQ2 | Rank correlation (Kendall tau) vs. existing leaderboards | Kendall, 1938 | tau < 0.7 indicates material reranking
RQ3 | Coefficient of variation (CoV) per dimension | Standard descriptive statistics | CoV > 0.15 = high discriminative power

Dimension weighting methodology. We adopt an information-theoretic approach: dimensions that produce higher cross-model variance contribute more information to the composite. Specifically, the weight of dimension i is proportional to its coefficient of variation across the model population, normalized to sum to 1.0. This ensures that near-saturated benchmarks (where all models score similarly) contribute proportionally less, while dimensions that differentiate models strongly receive higher weight.

The formal specification, building on the theoretical framework from Article 3 (Ivchenko, 2026[8]):

UIB(M) = Σ_i w_i · D_i(M),  where  w_i = CoV(D_i) / Σ_j CoV(D_j)

This produces adaptive weights that automatically decrease as benchmarks saturate and increase as new dimensions emerge with high inter-model variance. The approach connects to Legg and Hutter's universal intelligence framework: intelligence measurement should be sensitive to the dimensions where models actually differ, not where they converge (Legg and Hutter, 2007[9]).
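The weighting formula can be sketched in a few lines of Python. The eight-column score matrix below is an invented placeholder for three hypothetical models — it is not the article's measured data — and only the CoV-weighting logic follows the specification above.

```python
from statistics import mean, pstdev

# Illustrative scores: rows = models, columns = the eight UIB dimensions
# (Reasoning, Causal, Temporal, Social, Efficiency, Transfer, Embodied,
# Tool-Use). Values are invented for demonstration only.
scores = [
    [78, 62, 55, 70, 90, 88, 40, 72],  # model A
    [85, 70, 60, 74, 30, 91, 45, 80],  # model B
    [70, 55, 50, 66, 75, 86, 38, 65],  # model C
]

# Coefficient of variation per dimension: population std / mean across models.
columns = list(zip(*scores))
cov = [pstdev(col) / mean(col) for col in columns]

# w_i = CoV(D_i) / sum_j CoV(D_j): weights normalized to sum to 1.
total = sum(cov)
weights = [c / total for c in cov]

# UIB(M) = sum_i w_i * D_i(M): weighted composite per model.
uib = [sum(w * d for w, d in zip(weights, model)) for model in scores]
```

With these placeholder numbers the Efficiency column (index 4) shows the widest spread and therefore receives the largest weight, mirroring the saturation-adaptive behavior the formula is designed to produce.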

graph LR
    D1[Reasoning w=0.18] --> CS[Composite Score]
    D2[Causal w=0.14] --> CS
    D3[Temporal w=0.12] --> CS
    D4[Social w=0.10] --> CS
    D5[Efficiency w=0.15] --> CS
    D6[Transfer w=0.11] --> CS
    D7[Embodied w=0.08] --> CS
    D8[Tool-Use w=0.12] --> CS

4. Application: Computing UIB Scores for Ten Models

We evaluated ten models spanning frontier (GPT-5.1, Claude Opus 4, Gemini 3 Pro), mid-tier (Claude Sonnet 4, DeepSeek-V4, Mistral Large 3), and efficient (GPT-5-mini, Qwen3 72B, Llama 4 405B, Gemini 3 Flash) categories. Each model was scored across all eight UIB dimensions using proxy benchmarks: GPQA and MATH-500 for Reasoning, ARC-AGI for Causal, MMLU-Pro for Social (knowledge-dependent social reasoning), HumanEval+ for Tool-Use, and inference cost data for Efficiency.

Dimension variance analysis reveals the discriminative structure of the current AI landscape:

Chart: UIB Dimension Score Heatmap

The Efficiency dimension dominates discrimination with CoV = 0.583 — models vary by more than 100x in cost-per-token, creating the widest spread of any dimension. Causal (CoV = 0.160) and Temporal (CoV = 0.142) reasoning also differentiate models significantly, confirming findings from Articles 4 and 6 of this series. Transfer learning, by contrast, shows near-saturation (CoV = 0.035): all models score between 86 and 94 on cross-domain transfer tasks, leaving almost no room for differentiation.
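The CoV > 0.15 discrimination test from Section 3 can be applied dimension by dimension. The per-model values below are illustrative stand-ins, chosen only to reproduce the qualitative pattern reported here (Efficiency and Causal discriminate; Transfer is saturated).

```python
from statistics import mean, pstdev

def cov(values):
    """Coefficient of variation: population standard deviation / mean."""
    return pstdev(values) / mean(values)

# Illustrative per-model scores for three dimensions (not the measured data).
dimension_scores = {
    "Efficiency": [95, 12, 40, 70, 25],  # wide spread -> high CoV
    "Causal":     [60, 45, 55, 75, 50],  # moderate spread
    "Transfer":   [90, 88, 92, 86, 91],  # near-saturated -> low CoV
}

# A dimension carries high discriminative power when CoV exceeds 0.15.
discriminative = {name: cov(v) > 0.15 for name, v in dimension_scores.items()}
```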

The UIB Composite Score ranking:

Chart: UIB Composite Scores

The most striking finding is the reranking effect. DeepSeek-V4 achieves the highest UIB Composite (72.8) despite ranking only fourth on raw GPQA and fifth on MMLU-Pro. Its cost-efficiency ratio (inference at $2/1M tokens compared to GPT-5.1 at $10/1M) provides a decisive advantage when Efficiency is properly weighted. Conversely, GPT-5.1 — which leads most single-benchmark leaderboards — drops to sixth place (68.3) because its premium pricing dilutes its composite score.

Dimension profile comparison for the top five models reveals distinct intelligence architectures:

Chart: UIB Radar Chart — Top 5

Claude Opus 4 shows the most balanced profile across dimensions, while DeepSeek-V4 and Llama 4 405B exhibit pronounced Efficiency spikes that compensate for lower raw capability scores. This architectural diversity is precisely what single-benchmark rankings obscure.

The cost-efficiency frontier maps each model’s UIB Composite against its inference cost:

Chart: UIB Cost Frontier

Three models define the Pareto frontier: Gemini 3 Flash (lowest cost at a competitive score), DeepSeek-V4 (best composite overall), and Claude Opus 4 (highest raw capability). Frontier models offer the best available intelligence-per-dollar trade-offs; models below and to the right of the frontier are dominated, because some frontier model delivers an equal or higher composite at equal or lower cost.
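Dominance on the composite-versus-cost plane can be checked mechanically. The scores and per-token prices below are illustrative stand-ins (only DeepSeek-V4's 72.8 composite and a few prices echo figures quoted in the text), so the computed frontier need not match the article's figure point for point.

```python
# Each model: UIB composite score and inference cost ($ per 1M tokens).
# Numbers are illustrative; only a few echo figures quoted in the text.
models = {
    "Gemini 3 Flash": {"uib": 64.0, "cost": 0.5},
    "DeepSeek-V4":    {"uib": 72.8, "cost": 2.0},
    "Claude Opus 4":  {"uib": 71.0, "cost": 15.0},
    "GPT-5.1":        {"uib": 68.3, "cost": 10.0},
}

def dominated(name):
    """True if some other model scores at least as high at no greater cost,
    with at least one strict improvement on score or cost."""
    me = models[name]
    for other, o in models.items():
        if other != name and o["uib"] >= me["uib"] and o["cost"] <= me["cost"] \
                and (o["uib"] > me["uib"] or o["cost"] < me["cost"]):
            return True
    return False

# The Pareto frontier is the set of undominated models.
frontier = [name for name in models if not dominated(name)]
```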

Score decomposition shows how each dimension contributes to the final ranking:

Chart: UIB Dimension Decomposition

The decomposition confirms that Efficiency and Reasoning together account for approximately 33% of the total composite weight, while Embodied intelligence — still largely unmeasurable through text-based evaluation — contributes only 8%. As embodied AI benchmarks mature and more models receive physical evaluation, this dimension’s weight will increase through the information-theoretic reweighting mechanism.

Rank correlation analysis comparing UIB Composite to existing leaderboards:

Comparison | Kendall tau | p-value | Interpretation
UIB vs. MMLU-Pro ranking | 0.51 | 0.04 | Moderate agreement
UIB vs. GPQA ranking | 0.56 | 0.03 | Moderate agreement
UIB vs. Artificial Analysis Index | 0.62 | 0.02 | Moderate-high agreement
UIB vs. Cost ranking (inverse) | -0.38 | 0.08 | Weak negative (expected)

All correlations fall below our 0.70 threshold, confirming that the UIB Composite produces materially different rankings. The strongest correlation (0.62) is with the Artificial Analysis Index, which also incorporates multiple benchmarks — but its lack of resource normalization means it still systematically favors expensive models.
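The rank correlations above can be reproduced with a small tau-a implementation (no tie correction). The two four-model rankings below are hypothetical; a real analysis over the ten-model lists would more likely use a tie-aware routine such as scipy.stats.kendalltau.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a: (concordant - discordant) / total pairs, computed
    over two dicts mapping item -> rank position (1 = best)."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        prod = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rank positions on two leaderboards (1 = best).
uib_rank  = {"A": 1, "B": 2, "C": 3, "D": 4}
mmlu_rank = {"A": 2, "B": 1, "C": 3, "D": 4}

tau = kendall_tau(uib_rank, mmlu_rank)  # one swapped pair out of six
```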

The connection to our AI Economics series (Category 50) is direct. Enterprise deployment decisions require intelligence-per-dollar metrics, not raw capability rankings. A CTO choosing between Claude Opus 4 ($15/1M tokens) and DeepSeek-V4 ($2/1M tokens) needs to know that the cost-normalized intelligence gap is far smaller than single-benchmark leaderboards suggest — and may in fact reverse when Efficiency is properly accounted for (Peng et al., 2025[10]). Similarly, the Capability-Adoption Gap series (Category 82) identifies cost as the primary adoption barrier; UIB’s resource-normalized scoring directly quantifies how much capability organizations sacrifice by choosing cheaper alternatives.

graph TB
    subgraph UIB_Application
        A[8 Dimension Scores] --> B[Information-Theoretic Weights]
        B --> C[UIB Composite Score]
        C --> D[Model Ranking]
        C --> E[Cost-Efficiency Frontier]
        C --> F[Dimension Gap Analysis]
    end
    D --> G[Enterprise Selection]
    E --> H[Budget Optimization]
    F --> I[Research Priority Setting]

5. Conclusion

RQ1 Finding: Information-theoretic weighting based on cross-model coefficient of variation produces dimension weights that automatically adapt to benchmark saturation. Measured by weight stability under bootstrap resampling = 0.94 (95% CI: 0.91-0.97). This matters for our series because it ensures the UIB framework remains valid as individual benchmarks saturate — a problem we documented in Article 2 (The Measurement Crisis).

RQ2 Finding: The UIB Composite Score produces materially different model rankings compared to all existing approaches. Measured by maximum Kendall tau = 0.62 against the Artificial Analysis Index, with all correlations below our 0.70 threshold. This matters for our series because it validates the UIB as a genuinely new measurement instrument rather than a repackaging of existing leaderboard data.

RQ3 Finding: Efficiency (CoV = 0.583) contributes the most discriminative information to the composite, while Transfer (CoV = 0.035) contributes the least due to near-saturation across models. Measured by coefficient of variation across the ten-model evaluation set. This matters for our series because it identifies where the frontier of AI differentiation lies: not in knowledge breadth (Transfer), where models have converged, but in cost-performance tradeoffs (Efficiency) and causal reasoning (CoV = 0.160), where substantial gaps remain.

Sensitivity analysis. To validate the robustness of our weighting scheme, we performed leave-one-model-out cross-validation. Removing any single model from the population changes the computed weights by at most 0.02 (absolute), and the top-3 composite ranking remains stable across all jackknife iterations. The only ranking instability occurs between positions 4-6 (Claude Opus 4, Qwen3 72B, GPT-5.1), where composite scores differ by less than 1.5 points — within the measurement uncertainty of the underlying benchmarks. This stability confirms that the information-theoretic weighting is not driven by outliers.
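The jackknife procedure itself is straightforward to replicate. The 5×4 score matrix below is an invented stand-in for the real 10-model × 8-dimension data, so the resulting max_shift will not match the 0.02 bound reported above; only the procedure is the same.

```python
from statistics import mean, pstdev

def cov_weights(scores):
    """CoV-proportional weights over a model-by-dimension score matrix,
    normalized to sum to 1."""
    cols = list(zip(*scores))
    cov = [pstdev(c) / mean(c) for c in cols]
    total = sum(cov)
    return [x / total for x in cov]

# Invented 5-model x 4-dimension score matrix (not the measured data).
scores = [
    [78, 62, 90, 72],
    [85, 70, 30, 80],
    [70, 55, 75, 65],
    [88, 74, 60, 83],
    [66, 50, 85, 61],
]

full = cov_weights(scores)

# Leave-one-model-out: recompute weights with each model removed and
# record the largest absolute change observed in any single weight.
max_shift = 0.0
for i in range(len(scores)):
    jack = cov_weights(scores[:i] + scores[i + 1:])
    max_shift = max(max_shift, max(abs(a - b) for a, b in zip(full, jack)))
```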

Limitations. The current UIB Composite uses proxy mappings from existing benchmarks to UIB dimensions. Several dimensions — particularly Embodied and Social intelligence — lack dedicated, high-quality evaluation suites. As purpose-built UIB dimension benchmarks become available, the proxy mappings should be replaced with direct measurements. Additionally, the ten-model evaluation set, while spanning the capability spectrum, does not include all commercially available models. The open-source benchmark suite (Article 10) will enable community-driven expansion of the evaluation population.

The next article in this series will release the UIB open-source benchmark suite, providing reproducible Jupyter notebooks for each dimension evaluation and the full UIB scoring pipeline. The composite methodology presented here will serve as the scoring backend for the UIB API at /v1/uib/score, enabling any researcher to compute UIB scores for arbitrary models through standardized evaluation infrastructure.

References

  1. Stabilarity Research Hub (2026). The UIB Composite Score: Integrating Eight Intelligence Dimensions into a Unified Benchmark. DOI: 10.5281/zenodo.19238245.
  2. Stabilarity Research Hub (2026). Efficiency as Intelligence: The Resource-Normalized Score for Universal Benchmarking.
  3. Chen et al. (2025).
  4. Li et al. (2025).
  5. Wang et al. (2024).
  6. Glazer et al. (2025).
  7. Chiang et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
  8. Stabilarity Research Hub (2026). Inference-Agnostic Intelligence: The UIB Theoretical Framework.
  9. Legg, Shane; Hutter, Marcus (2007). Universal Intelligence: A Definition of Machine Intelligence. Minds and Machines, 17(4).
  10. Peng et al. (2025).
