
Causal Intelligence as a UIB Dimension: Measuring What Models Actually Understand

Posted on March 18, 2026 · Universal Intelligence Benchmark · Benchmark Research · Article 4 of 11
By Oleh Ivchenko · Benchmark research based on publicly available meta-analyses and reproducible evaluation methods.

Academic Citation: Ivchenko, Oleh (2026). Causal Intelligence as a UIB Dimension: Measuring What Models Actually Understand. Odessa National Polytechnic University, Department of Economic Cybernetics.
DOI: 10.5281/zenodo.19102383[1] · View on Zenodo (CERN) · ORCID


Abstract #

Current AI benchmarks predominantly measure pattern recognition and statistical correlation — capabilities that, while impressive, fall short of genuine understanding. This article introduces Causal Intelligence as a formal dimension within the Universal Intelligence Benchmark (UIB) framework, arguing that any credible measure of machine intelligence must evaluate whether systems can reason about cause and effect across Pearl’s three-level causal hierarchy: association, intervention, and counterfactual reasoning. We survey emerging causal benchmarks (CausalBench, CausalReasoningBenchmark, InterveneBench, EconCausal), identify their structural limitations, and propose a unified scoring methodology that integrates causal reasoning assessment into the UIB’s multi-dimensional evaluation architecture. Our analysis reveals that even frontier models achieve less than 10% accuracy on null-effect recognition in causal tasks — a fundamental gap that current leaderboards entirely obscure.

The Correlation Trap in AI Evaluation #

The dominant paradigm in AI evaluation rewards systems that excel at finding statistical regularities in data. Models that achieve 90% on MMLU or score highly on HumanEval are celebrated as approaching human-level intelligence, yet these benchmarks fundamentally test associational reasoning — the lowest rung of Pearl’s Causal Hierarchy (Bareinboim et al., 2022)[2]. A system that can predict “patients who take aspirin tend to have fewer headaches” operates at Level 1 (association). Understanding why aspirin reduces headaches requires Level 2 (intervention). Asking “would this patient’s headache have resolved without aspirin?” demands Level 3 (counterfactual reasoning).

This distinction matters enormously for any benchmark claiming to measure intelligence. As our UIB theoretical framework (Ivchenko, 2026)[3] established, genuine intelligence is inference-agnostic — it should not privilege systems that merely memorize distributional patterns over those that grasp underlying causal mechanisms. The question is not whether a model produces the right answer, but whether it arrives there through a reasoning process that would generalize to novel causal structures.

graph TD
    A[Pearl's Causal Hierarchy] --> L1[Level 1: Association]
    A --> L2[Level 2: Intervention]
    A --> L3[Level 3: Counterfactual]
    L1 --> L1D["P(Y|X) — Seeing"]
    L2 --> L2D["P(Y|do(X)) — Doing"]
    L3 --> L3D["P(Y_x|X',Y') — Imagining"]
    L1D --> B1["Current Benchmarks:<br/>MMLU, GPQA, HellaSwag"]
    L2D --> B2["Emerging Benchmarks:<br/>InterveneBench, CausalBench"]
    L3D --> B3["Frontier Challenge:<br/>Counterfactual reasoning tasks"]
    style L1 fill:#f3f3f3
    style L2 fill:#fff3e0
    style L3 fill:#ffebee

The hierarchy is not merely theoretical. Ibeling et al. (2025)[4] demonstrated formally that the computational complexity of satisfiability increases strictly across the three levels — counterfactual reasoning is provably harder than interventional reasoning, which is provably harder than associational reasoning. Any benchmark that conflates these levels produces misleading intelligence assessments.
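The Level 1/Level 2 gap can be made concrete with a toy structural causal model. The sketch below (plain Python; all coefficients are illustrative choices, not estimates from any real study) simulates the aspirin example with illness severity as a hidden confounder: conditioning on observed aspirin use gives a systematically different answer than forcing aspirin use with the do() operator, because the intervention severs the severity-to-aspirin edge.

```python
import random

def simulate(n=100_000, do_aspirin=None, seed=0):
    """Toy SCM: severity U confounds aspirin use X and headache relief Y.
    Passing do_aspirin forces X, modelling Pearl's do() operator."""
    rng = random.Random(seed)
    relieved = []
    for _ in range(n):
        u = rng.random()                      # latent severity in [0, 1)
        if do_aspirin is None:
            x = rng.random() < 0.2 + 0.6 * u  # sicker people take aspirin more often
        else:
            x = do_aspirin                    # intervention cuts the U -> X edge
        # Aspirin causally helps relief, but high severity works against it
        y = rng.random() < (0.3 + 0.4 * x) * (1.0 - 0.5 * u)
        if x:
            relieved.append(y)
    return sum(relieved) / len(relieved)

p_obs = simulate()                  # Level 1: P(Y=1 | X=1), confounded by severity
p_do = simulate(do_aspirin=True)    # Level 2: P(Y=1 | do(X=1))
print(f"P(relief | aspirin)     = {p_obs:.3f}")  # lower: observed takers are sicker
print(f"P(relief | do(aspirin)) = {p_do:.3f}")   # higher: severity no longer selects
```

A purely associational evaluator would accept the first number as "the effect of aspirin"; a benchmark operating at Level 2 must check whether the model knows the two quantities differ, and why.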

The Current Landscape of Causal Benchmarks #

A new generation of causal evaluation tools has emerged in 2025–2026, each addressing different aspects of the problem. Understanding their structure reveals both progress and persistent gaps.

CausalProfiler: Synthetic Benchmark Generation #

Cai et al. (2026)[5] introduced CausalProfiler, the first random generator of synthetic causal benchmarks with coverage guarantees operating across all three levels of causal reasoning. The tool generates benchmarks with transparent assumptions, addressing a critical limitation of hand-crafted causal test suites: lack of systematic coverage. By generating tests across observation, intervention, and counterfactual levels, CausalProfiler enables researchers to profile a model’s causal capabilities with statistical rigor rather than relying on curated examples that may not represent the full difficulty spectrum.

CausalReasoningBenchmark: Disentangling Identification and Estimation #

Sawarni et al. (2026)[6] constructed a real-world benchmark that separates two distinct components of causal reasoning that previous evaluations conflated: causal identification (determining which causal effect is estimable from observed data) and causal estimation (computing the magnitude of the effect). This disentanglement is crucial because a model might correctly identify that a causal relationship exists while completely failing to estimate its strength — or vice versa. Their benchmark uses real-world datasets rather than synthetic scenarios, grounding evaluation in empirically relevant causal structures.

InterveneBench: From Observation to Action #

InterveneBench (2026)[7] pushes evaluation into the interventional domain by requiring models to design causal studies in real social systems. Rather than asking models to answer questions about pre-specified causal graphs, InterveneBench tests whether models can reason about what interventions would be needed to establish causality — a capability essential for scientific reasoning and policy design. This represents a significant advance: the benchmark evaluates not just causal understanding but causal agency.

graph LR
    subgraph "Benchmark Coverage Map"
    CP[CausalProfiler] --> SYN["Synthetic<br/>All 3 Levels"]
    CRB[CausalReasoningBenchmark] --> RW["Real-World<br/>Identification + Estimation"]
    IB[InterveneBench] --> INT["Social Systems<br/>Intervention Design"]
    EC[EconCausal] --> DOM["Domain-Specific<br/>Economics Context"]
    CB[CausalBench] --> MD["Multi-Domain<br/>4 Perspectives"]
    end
    SYN --> GAP1["No domain grounding"]
    RW --> GAP2["No counterfactual level"]
    INT --> GAP3["No estimation tasks"]
    EC --> GAP4["Single domain only"]
    CB --> GAP5["Limited to LLMs"]
    style GAP1 fill:#ffcdd2
    style GAP2 fill:#ffcdd2
    style GAP3 fill:#ffcdd2
    style GAP4 fill:#ffcdd2
    style GAP5 fill:#ffcdd2

EconCausal: Domain-Specific Failure Modes #

Perhaps the most revealing benchmark is EconCausal (Zhou et al., 2026)[8], which tests causal reasoning in economics — a domain where causal inference has direct policy consequences. Their findings expose a critical failure mode: frontier models achieve only 9.5% accuracy on null-effect recognition. When there is no causal effect, models overwhelmingly hallucinate one. This “over-commitment bias” — the tendency to assert causal relationships where none exist — is arguably more dangerous than failing to detect real effects, because it generates confident but wrong policy recommendations.
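Null-effect recognition is straightforward to score once ground truth is available. The sketch below uses made-up verdict records, not EconCausal's actual data or protocol: for each case it pairs the true effect with the model's verdict, then computes null-effect accuracy and the complementary over-commitment rate.

```python
# Hypothetical evaluation records (illustrative only): each entry pairs the
# ground-truth effect ("positive", "negative", or "null") with the model's verdict.
results = [
    ("null", "positive"), ("null", "positive"), ("null", "null"),
    ("null", "negative"), ("positive", "positive"), ("negative", "negative"),
    ("null", "positive"), ("null", "positive"), ("positive", "positive"),
    ("null", "positive"),
]

null_cases = [(truth, pred) for truth, pred in results if truth == "null"]
null_accuracy = sum(pred == "null" for _, pred in null_cases) / len(null_cases)
# Over-commitment: asserting some causal effect where none exists
over_commitment = sum(pred != "null" for _, pred in null_cases) / len(null_cases)

print(f"null-effect accuracy: {null_accuracy:.1%}")   # 14.3%
print(f"over-commitment rate: {over_commitment:.1%}")  # 85.7%
```

Note that the toy model above never predicts "null" incorrectly on real effects; the failure is one-sided, which is exactly the over-commitment pattern EconCausal reports.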

CausalBench: Multi-Perspective Evaluation #

Zhou et al. (2025)[9] developed CausalBench, which evaluates causal reasoning from four simultaneous perspectives per question, requiring consistent answers across all viewpoints. This multi-dimensional evaluation forces models to demonstrate genuine understanding of causal structures rather than succeeding through pattern matching or lucky guesses. Their approach also reveals a significant correlation between causal reasoning failures and hallucination propensity — models that struggle with causality are more likely to generate factually incorrect outputs.

Why Causal Intelligence Belongs in UIB #

The UIB framework, as established in our systematic map of benchmark studies (Ivchenko, 2026)[10], identified that most evaluation instruments measure overlapping statistical capabilities while leaving fundamental cognitive dimensions untested. Causal intelligence is precisely such a dimension — it is:

Orthogonal to existing metrics. A model can achieve perfect scores on MMLU, GPQA, and coding benchmarks while fundamentally failing at causal reasoning. The EconCausal results demonstrate this empirically: models that score well on general knowledge benchmarks still cannot distinguish causal effects from null effects.

Hierarchically structured. Pearl’s three-level hierarchy provides a natural scoring framework: Level 1 (association) acts as a baseline, Level 2 (intervention) represents operational capability, and Level 3 (counterfactual) captures the deepest form of causal understanding. This maps directly to UIB’s multi-dimensional scoring architecture.

Domain-transferable. Unlike benchmarks tied to specific tasks (coding, math, language), causal reasoning is a meta-capability that applies across all domains. A model that genuinely reasons causally about pharmacological interactions should also reason causally about economic policy or mechanical systems.

Measurably distinct from memorization. CausalProfiler’s synthetic generation approach ensures that causal tasks cannot be solved through training data memorization — a persistent confound in static benchmarks that our meta-analysis (Ivchenko, 2026)[10] identified as a systemic problem.
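CausalProfiler's actual generator is not reproduced here, but the core idea (sampling tasks from freshly generated random causal structures so that answers cannot be memorized) can be sketched. The details below, such as linear-Gaussian mechanisms and hard interventions, are illustrative assumptions, not the tool's design.

```python
import random

def random_dag(n_vars, edge_prob=0.4, seed=42):
    """Random DAG keyed by node -> parent list. Edges only point from a
    lower to a higher index, so the graph is acyclic by construction."""
    rng = random.Random(seed)
    return {j: [i for i in range(j) if rng.random() < edge_prob]
            for j in range(n_vars)}

def sample(dag, weights, do=None, seed=7):
    """One draw from a linear-Gaussian SCM on the DAG. `do` maps a node
    index to a forced value (a hard intervention)."""
    rng = random.Random(seed)
    values = {}
    for j in sorted(dag):
        if do is not None and j in do:
            values[j] = do[j]  # node severed from its parents
        else:
            values[j] = (sum(weights[(i, j)] * values[i] for i in dag[j])
                         + rng.gauss(0, 1))
    return values

dag = random_dag(4)
weights = {(i, j): 1.0 for j in dag for i in dag[j]}
observed = sample(dag, weights)                    # observational query
intervened = sample(dag, weights, do={0: 10.0})    # interventional query
```

Because the graph and weights are drawn fresh for each task, ground-truth answers at all three levels can be computed from the structure itself, and a model that merely memorized training distributions has nothing to retrieve.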

Proposed UIB Causal Intelligence Scoring #

We propose a three-tier scoring methodology aligned with Pearl’s hierarchy, incorporating insights from each of the benchmarks surveyed:

graph TD
    subgraph "UIB Causal Intelligence Score"
    CS["Causal Score<br/>0-100"] --> T1["Tier 1: Associational<br/>Weight: 20%"]
    CS --> T2["Tier 2: Interventional<br/>Weight: 35%"]
    CS --> T3["Tier 3: Counterfactual<br/>Weight: 45%"]
    T1 --> T1M["Metrics:<br/>• Correlation identification<br/>• Confound awareness<br/>• Spurious pattern rejection"]
    T2 --> T2M["Metrics:<br/>• Do-calculus application<br/>• Intervention design<br/>• Effect estimation accuracy"]
    T3 --> T3M["Metrics:<br/>• Counterfactual consistency<br/>• Null-effect recognition<br/>• Cross-domain transfer"]
    end
    T3M --> CRIT["Critical Threshold:<br/>Null-effect accuracy > 50%<br/>required for Tier 3 credit"]
    style CS fill:#e3f2fd
    style CRIT fill:#fff9c4

Tier 1 — Associational (20% weight). Can the model correctly identify correlational patterns and, critically, distinguish them from causal claims? This tier draws on CausalBench’s multi-perspective evaluation: the model must consistently recognize that “X correlates with Y” does not imply “X causes Y” across multiple framings of the same relationship. A model that passes Tier 1 demonstrates statistical literacy.

Tier 2 — Interventional (35% weight). Can the model reason about the effects of interventions? This tier incorporates InterveneBench’s study-design tasks and CausalReasoningBenchmark’s separation of identification from estimation. The model must correctly apply do-calculus reasoning: determining what would happen if we changed a variable, rather than merely observing its current state. A model that passes Tier 2 can support decision-making.

Tier 3 — Counterfactual (45% weight). Can the model reason about alternative histories? This is the most demanding tier, requiring the model to evaluate statements of the form “if X had been different, Y would have been Z” while maintaining consistency with observed evidence. Critically, this tier includes EconCausal’s null-effect recognition: the model must achieve above 50% accuracy on cases where no causal effect exists. A model that passes Tier 3 demonstrates something approaching genuine causal understanding.

The asymmetric weighting reflects a fundamental insight from Imbens (2026)[11]: the potential outcomes framework and Pearl’s structural framework converge at the counterfactual level, making it the most rigorous test of causal capability regardless of theoretical orientation.
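The tier weights and the null-effect gate combine into a single number. A minimal sketch of the aggregation follows; the model profiles are hypothetical inputs chosen to echo the EconCausal finding, not measured scores.

```python
def uib_causal_score(tier1, tier2, tier3, null_effect_acc):
    """Illustrative aggregation of the three tier scores (each in [0, 1]).
    Tier 3 credit is gated on >50% null-effect accuracy, per the
    critical threshold in the proposed methodology."""
    if null_effect_acc <= 0.5:
        tier3 = 0.0  # counterfactual credit withheld entirely
    return 100 * (0.20 * tier1 + 0.35 * tier2 + 0.45 * tier3)

# Hypothetical frontier-model profile: strong association, decent
# intervention, near-random counterfactuals, ~10% null-effect accuracy.
gated = uib_causal_score(0.95, 0.70, 0.40, null_effect_acc=0.095)
passed = uib_causal_score(0.95, 0.70, 0.40, null_effect_acc=0.60)
print(round(gated, 1))   # 43.5 (Tier 3 zeroed by the gate)
print(round(passed, 1))  # 61.5 (same tier scores, gate cleared)
```

The hard gate is a deliberate design choice: partial counterfactual credit for a model that hallucinates effects where none exist would reward exactly the over-commitment bias the tier is meant to detect.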

Integration with Existing UIB Dimensions #

The UIB theoretical framework (Ivchenko, 2026)[3] defines intelligence as multi-dimensional and inference-agnostic. Causal Intelligence integrates with the existing dimension architecture as follows:

The causal dimension interacts with but remains distinct from logical reasoning (which tests formal deduction without causal semantics), mathematical ability (which tests computational procedures without causal interpretation), and domain knowledge (which tests factual recall without causal structure). A model could score perfectly on all three while completely failing the causal dimension — and this failure would reveal a critical limitation invisible to current evaluation approaches.

Furthermore, the causal dimension provides a natural bridge to the UIB’s planned embodied intelligence dimension. Research on causal reasoning in military decision-making (MDPI, 2026)[12] demonstrates that real-world deployment requires models to reason causally about interventions with physical consequences — precisely the capability that separates simulation from embodiment.

Limitations and Open Problems #

Several challenges remain in operationalizing causal intelligence measurement:

Ground truth ambiguity. Unlike mathematical problems with unique solutions, causal questions sometimes admit multiple valid causal models. The UIB scoring methodology must account for this by evaluating reasoning process quality alongside answer correctness.

Scalability of synthetic generation. While CausalProfiler addresses coverage, the computational cost of generating comprehensive causal benchmarks at scale remains substantial. The UIB implementation must balance coverage guarantees against practical evaluation time.

Cultural and domain bias. Causal intuitions vary across cultures and domains. An economic causal model that is valid in one regulatory environment may not transfer to another. The UIB must explicitly test cross-domain transfer rather than assuming causal reasoning generalizes automatically.

The identification-estimation gap. CausalReasoningBenchmark’s disentanglement reveals that identification and estimation are largely independent capabilities. The UIB scoring must reflect this — a model that can identify causal structures but cannot estimate effect sizes has a fundamentally different capability profile than one that can estimate but not identify.

Conclusion #

The incorporation of Causal Intelligence as a formal UIB dimension addresses what is arguably the largest blind spot in current AI evaluation. The evidence is clear: frontier models that appear highly capable on conventional benchmarks exhibit catastrophic failures when tested on genuine causal reasoning — particularly in recognizing the absence of causal effects. By structuring evaluation around Pearl’s three-level hierarchy and incorporating insights from the emerging generation of causal benchmarks, the UIB provides a framework for measuring what models actually understand about the world, rather than what patterns they have memorized from it. The 9.5% null-effect accuracy finding from EconCausal should serve as a wake-up call: without causal intelligence measurement, we are flying blind on the question that matters most — do these systems understand anything at all?

References (12) #

  1. Stabilarity Research Hub. Causal Intelligence as a UIB Dimension: Measuring What Models Actually Understand. doi.org.
  2. Bareinboim et al. (2022). Pearl's Causal Hierarchy. causalai.net.
  3. Stabilarity Research Hub. Inference-Agnostic Intelligence: The UIB Theoretical Framework. doi.org.
  4. Ibeling et al. (2025). arxiv.org.
  5. Cai et al. (2026). arxiv.org.
  6. Sawarni et al. (2026). arxiv.org.
  7. InterveneBench (2026). arxiv.org.
  8. EconCausal (Zhou et al., 2026). arxiv.org.
  9. Zhou et al. (2025). papers.ssrn.com.
  10. Stabilarity Research Hub (2026). The Meta-Meta-Analysis: A Systematic Map of What 200 AI Benchmark Studies Actually Measured. doi.org.
  11. Imbens (2026). arxiv.org.
  12. Research on causal reasoning in military decision-making (MDPI, 2026). mdpi.com.