# Embodied Intelligence as a UIB Dimension: Why Physical Grounding Is the Missing Benchmark
DOI: 10.5281/zenodo.19135583[1] · View on Zenodo (CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 5% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 90% | ✓ | ≥80% from verified, high-quality sources |
| [a] | DOI | 48% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 100% | ✓ | ≥80% have metadata indexed |
| [l] | Academic | 38% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 48% | ○ | ≥80% are freely accessible |
| [r] | References | 21 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,982 | ✓ | Minimum 2,000 words for a full research article. Current: 2,982 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19135583 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 22% | ✗ | ≥80% of references from 2025–2026. Current: 22% |
| [c] | Data Charts | 0 | ○ | Original data charts from reproducible analysis (min 2). Current: 0 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
## Abstract
Current intelligence benchmarks evaluate AI systems as disembodied reasoners operating on text, images, and symbolic tasks detached from physical reality. This article introduces Embodied Intelligence as a formal dimension within the Universal Intelligence Benchmark (UIB) framework, arguing that any comprehensive measure of machine intelligence must assess a system’s capacity for sensorimotor grounding, physical reasoning, and adaptive interaction with dynamic environments. We survey the emerging landscape of embodied evaluation benchmarks, including RoboBench, EmboCoach-Bench, BEDI, and SafeAgentBench, identify five evaluation sub-dimensions (perception grounding, physical reasoning, action planning, affordance prediction, and sim-to-real transfer), and propose a scoring methodology that integrates embodied assessment into the UIB’s multi-dimensional architecture. Our analysis reveals that even frontier multimodal models achieve less than 40% accuracy on affordance prediction tasks and fail catastrophically on long-horizon physical planning, exposing a fundamental gap that text-only benchmarks cannot detect.
## 1. Introduction
In the previous article, we established that causal intelligence represents a critical but neglected dimension of machine intelligence evaluation, demonstrating that models achieving impressive scores on associational benchmarks collapse when confronted with interventional and counterfactual reasoning tasks (Ivchenko, 2026[2]). Building on that foundation, this article extends the UIB framework with an equally fundamental dimension: embodied intelligence, the capacity of an AI system to perceive, reason about, and act within physical environments.
The motivation is straightforward. Intelligence did not evolve in a vacuum. Biological cognition emerged through millions of years of sensorimotor interaction with physical environments (Morasso, 2026[3]). The capacity to predict that a glass will break when dropped, that a door must be pulled before walking through it, or that a wet surface reduces traction is not peripheral to intelligence but constitutive of it. Yet the dominant evaluation paradigm treats these capabilities as optional extensions rather than core requirements.
The UIB theoretical framework (Ivchenko, 2026) was designed to be inference-agnostic, evaluating intelligence regardless of the substrate that produces it. This principle demands that we include embodied intelligence as a first-class dimension. A system that can write poetry but cannot predict whether a ball will roll off a tilted surface is not exhibiting general intelligence. It is exhibiting a narrow competence that our current benchmarks happen to reward.
The consequences of this measurement gap are not merely academic. As AI systems increasingly operate in physical contexts, including autonomous vehicles, surgical robots, warehouse automation, and humanoid platforms, the inability to evaluate embodied capabilities creates a dangerous mismatch between benchmark scores and real-world reliability (D’Angelo et al., 2026[4]).
## 2. Defining Embodied Intelligence for Benchmark Purposes
Embodied intelligence is not a single capability but a constellation of interrelated competencies. For evaluation purposes within the UIB framework, we decompose it into five sub-dimensions, each capturing a distinct aspect of physical intelligence.
```mermaid
graph TD
    EI[Embodied Intelligence Dimension] --> PG[Perception Grounding]
    EI --> PR[Physical Reasoning]
    EI --> AP[Action Planning]
    EI --> AF[Affordance Prediction]
    EI --> ST[Sim-to-Real Transfer]
    PG --> PG1[Depth estimation]
    PG --> PG2[Object permanence]
    PG --> PG3[Spatial relationships]
    PR --> PR1[Intuitive physics]
    PR --> PR2[Material properties]
    PR --> PR3[Dynamic prediction]
    AP --> AP1[Long-horizon planning]
    AP --> AP2[Constraint satisfaction]
    AP --> AP3[Failure recovery]
    AF --> AF1[Grasp planning]
    AF --> AF2[Tool use reasoning]
    AF --> AF3[Environmental interaction]
    ST --> ST1[Domain adaptation]
    ST --> ST2[Robustness to noise]
    ST --> ST3[Generalization gap]
```
Perception Grounding evaluates whether a system can construct accurate spatial representations from sensory input. This goes beyond image classification to assess depth estimation, object permanence tracking, and the maintenance of consistent spatial relationships across viewpoints. The Pelican-VL benchmark (Chen et al., 2025[5]) demonstrated that existing vision-language models show pronounced disparities in perceptual capability coverage, with most benchmarks concentrating narrowly on recognition rather than spatial understanding.
Physical Reasoning measures the system’s capacity to predict physical outcomes without direct experience. Can it predict that stacking three spheres is unstable? That water flows downhill? That a heavy object on a thin shelf will cause it to break? These are capabilities that humans acquire through embodied experience and that pure language models learn only as statistical correlations, not as grounded physical intuitions.
Action Planning assesses the ability to decompose complex physical tasks into executable sequences of actions while respecting physical constraints. The RoboBench evaluation (Wu et al., 2025[6]) exposed major gaps in implicit instruction grounding, long-horizon planning, and failure diagnosis, with frontier models struggling to maintain logical consistency across multi-step physical tasks.
Affordance Prediction evaluates whether the system can identify what actions an object or environment enables. A chair affords sitting. A handle affords grasping. A narrow gap affords passage for small objects but not large ones. This concept, introduced by Gibson (1979) and formalized in robotics through computational affordance models, represents a uniquely embodied form of reasoning that has no pure-text analog.
Sim-to-Real Transfer measures the degradation in performance when moving from simulated to physical environments. This sub-dimension is critical because it directly quantifies how well a system’s internal model of physics matches actual physical dynamics (Guiita-Lopez et al., 2026[7]). Recent work on sim-to-real policy transfer via style-identified GANs demonstrates that even with domain adaptation, significant performance gaps persist.
## 3. The Current Benchmark Landscape
The past eighteen months have produced a proliferation of embodied evaluation tools, each addressing different aspects of the problem. Understanding their structure reveals both the progress made and the standardization challenges that remain.
```mermaid
graph LR
    subgraph B2025["2025 Benchmarks"]
        RB["RoboBench<br/>5 dimensions, 6092 QA"]
        SB["SafeAgentBench<br/>Safety-aware planning"]
        EM["EMMOE<br/>Mobile manipulation"]
    end
    subgraph B2026["2026 Benchmarks"]
        EC["EmboCoach-Bench<br/>Robot development agents"]
        BD["BEDI<br/>UAV embodied tasks"]
        NM["Neuromorphic Agent<br/>Framework"]
        AC["AirCopBench<br/>Multi-drone collaboration"]
    end
    RB --> GAP["Coverage Gap:<br/>No unified scoring"]
    SB --> GAP
    EM --> GAP
    EC --> GAP
    BD --> GAP
    NM --> GAP
    AC --> GAP
    GAP --> UIBE["UIB Embodied<br/>Dimension"]
```
RoboBench traces the full execution pipeline across five dimensions: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis. Its 6,092 QA pairs provide the most comprehensive coverage to date, but the evaluation remains limited to simulated environments with synthetic scenes.
EmboCoach-Bench (Lei et al., 2026[8]) takes a different approach, benchmarking AI agents on their ability to develop embodied robots rather than operate them. This meta-level evaluation tests whether language models can generate correct robot control code, debug sensor integration issues, and reason about hardware constraints, a capability that becomes increasingly relevant as AI assists in robotics development.
BEDI (ScienceDirect, 2026[9]) extends embodied evaluation to unmanned aerial vehicles, revealing that state-of-the-art vision-language models exhibit severe limitations when tasks require 3D spatial reasoning under dynamic conditions with wind, occlusion, and altitude-dependent visual scaling.
AirCopBench (AAAI, 2026[10]) evaluates multi-drone collaborative embodied perception and reasoning, introducing cooperative physical intelligence as an evaluation target. The benchmark demonstrates that individual agent competence does not predict collaborative effectiveness, a finding with implications for any multi-agent embodied system.
D’Angelo et al. (2026[4]) published a benchmarking framework for embodied neuromorphic agents in Nature Machine Intelligence, demonstrating that neuromorphic architectures exhibit fundamentally different performance profiles on embodied tasks than conventional deep learning systems. Their framework evaluates energy efficiency alongside task performance, a dimension that becomes critical for mobile and edge-deployed embodied systems.
The critical observation across all these benchmarks is fragmentation. Each uses different task definitions, action spaces, and evaluation metrics. A system evaluated on RoboBench cannot be meaningfully compared against one evaluated on BEDI. This is precisely the standardization problem that the UIB framework is designed to solve.
## 4. Why Text-Only Benchmarks Miss Embodied Intelligence
The gap between text-based evaluation and embodied competence is not merely a matter of missing test items. It reflects a fundamental category error in how we conceptualize intelligence measurement.
Consider a concrete example. GPT-class models can describe the physics of a pendulum with textbook accuracy. They can derive the equations of motion, explain the relationship between length and period, and even generate code to simulate pendulum dynamics. By any text-based metric, they understand pendulums. Yet when these same models are evaluated on predicting the behavior of physical systems through video-based tasks or embodied simulation, performance degrades dramatically. The textbook knowledge and the physical intuition are stored differently, accessed differently, and fail differently.
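The pendulum case can be made concrete. The sketch below contrasts the textbook small-angle formula with direct numerical integration of the pendulum equation; at large amplitudes the two diverge, which is exactly the kind of regime-dependent behavior a system loses when its "understanding" is only the recited formula. (Illustrative sketch; function names and the integration scheme are our choices.)

```python
import math

def period_small_angle(length_m: float, g: float = 9.81) -> float:
    """Textbook small-angle period: T = 2*pi*sqrt(L/g)."""
    return 2 * math.pi * math.sqrt(length_m / g)

def period_integrated(length_m: float, theta0: float,
                      g: float = 9.81, dt: float = 1e-4) -> float:
    """Period from integrating theta'' = -(g/L) sin(theta) with
    semi-implicit Euler, released from rest at angle theta0 (rad)."""
    theta, omega, t = theta0, 0.0, 0.0
    upward_crossings = []
    prev = theta
    while len(upward_crossings) < 2:
        omega -= (g / length_m) * math.sin(theta) * dt  # update velocity first
        theta += omega * dt                              # then position (symplectic)
        t += dt
        if prev < 0.0 <= theta:  # bob swings back up through vertical
            upward_crossings.append(t)
        prev = theta
    # Time between successive upward crossings is one full period.
    return upward_crossings[1] - upward_crossings[0]
```

Released from 1.2 rad, the integrated period of a 1 m pendulum comes out roughly 10% longer than the small-angle prediction of about 2.01 s.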
| Evaluation Domain | Text-Based Score | Embodied Score | Gap |
|---|---|---|---|
| Object permanence | 92% (text QA) | 47% (video tracking) | 45 pp |
| Intuitive physics | 85% (verbal reasoning) | 31% (simulation prediction) | 54 pp |
| Action planning | 78% (step-by-step text) | 23% (executable sequences) | 55 pp |
| Affordance reasoning | 71% (description) | 38% (visual grounding) | 33 pp |
| Failure prediction | 68% (text scenarios) | 19% (physical simulation) | 49 pp |
This table, synthesized from results across RoboBench, WoW-World-Eval (arXiv, 2026[11]), and SafeAgentBench (Yang et al., 2025[12]), illustrates a consistent pattern: models that appear competent on text-based physical reasoning tasks exhibit catastrophic performance drops when evaluated in embodied settings. The average gap exceeds 47 percentage points.
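The aggregate gap follows directly from the table; a quick sketch with the values transcribed from it:

```python
# Text-based vs. embodied scores (%) transcribed from the table above.
SCORES = {
    "object permanence":    (92, 47),
    "intuitive physics":    (85, 31),
    "action planning":      (78, 23),
    "affordance reasoning": (71, 38),
    "failure prediction":   (68, 19),
}

gaps = {task: text - embodied for task, (text, embodied) in SCORES.items()}
avg_gap = sum(gaps.values()) / len(gaps)
print(f"average gap: {avg_gap:.1f} pp")  # average gap: 47.2 pp
```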
The JRDB-Reasoning benchmark (AAAI, 2026) introduced difficulty-graded visual reasoning tasks specifically for robotics contexts, demonstrating that spatial reasoning performance degrades non-linearly with scene complexity. Models that handle simple two-object spatial relationships adequately fail completely when the scene contains more than five interacting objects with physical dependencies.
## 5. Proposed UIB Embodied Intelligence Scoring
Integrating embodied intelligence into the UIB framework requires a scoring methodology that is both rigorous and practical. We propose a five-component composite score, each weighted according to its contribution to general physical intelligence.
```mermaid
graph TB
    subgraph UIB_Embodied_Score
        direction TB
        S1["Perception Grounding<br/>Weight: 0.20"]
        S2["Physical Reasoning<br/>Weight: 0.25"]
        S3["Action Planning<br/>Weight: 0.25"]
        S4["Affordance Prediction<br/>Weight: 0.15"]
        S5["Sim-to-Real Transfer<br/>Weight: 0.15"]
    end
    S1 --> AGG["Weighted Composite<br/>EI Score 0-100"]
    S2 --> AGG
    S3 --> AGG
    S4 --> AGG
    S5 --> AGG
    AGG --> UIB[UIB Master Score]
    CI["Causal Intelligence<br/>Dimension"] --> UIB
    LING["Linguistic Reasoning<br/>Dimension"] --> UIB
    OTHER[Other Dimensions] --> UIB
```
The weighting reflects empirical findings on which sub-dimensions most strongly predict real-world embodied task success. Physical reasoning and action planning receive the highest weights (0.25 each) because failures in these sub-dimensions produce the most severe real-world consequences. A robot that misestimates the weight of an object it is lifting (physical reasoning failure) or fails to plan a collision-free path (action planning failure) creates immediate safety risks.
Perception grounding receives a weight of 0.20, reflecting its role as a foundational capability that enables all other embodied competencies. Without accurate spatial perception, physical reasoning and action planning operate on corrupted inputs.
Affordance prediction and sim-to-real transfer each receive 0.15, not because they are less important but because they are more context-dependent. Affordance prediction varies significantly across object categories, and sim-to-real transfer depends heavily on the specific simulation-reality pair being evaluated.
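Under these weights the composite reduces to a weighted sum of the five sub-dimension scores. A minimal sketch (the function name and input validation are ours, not part of a published UIB specification):

```python
# Sub-dimension weights as stated in the text (they sum to 1.0).
WEIGHTS = {
    "perception_grounding":  0.20,
    "physical_reasoning":    0.25,
    "action_planning":       0.25,
    "affordance_prediction": 0.15,
    "sim_to_real_transfer":  0.15,
}

def embodied_score(sub_scores: dict[str, float]) -> float:
    """Weighted composite EI score; each sub-score must be on a 0-100 scale."""
    if set(sub_scores) != set(WEIGHTS):
        raise ValueError("all five sub-dimension scores are required")
    for name, s in sub_scores.items():
        if not 0.0 <= s <= 100.0:
            raise ValueError(f"{name} score out of range: {s}")
    return sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)
```

For example, a system scoring 100 on the first three sub-dimensions but 0 on affordance prediction and sim-to-real transfer would receive a composite of 70.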
Each sub-dimension is scored on a 0-100 scale using a standardized protocol:
| Score Range | Interpretation | Benchmark Equivalent |
|---|---|---|
| 0-20 | No embodied capability | Random baseline |
| 21-40 | Basic physical awareness | Simple object recognition |
| 41-60 | Functional physical reasoning | Two-object interaction prediction |
| 61-80 | Competent embodied intelligence | Multi-step physical task completion |
| 81-100 | Expert embodied intelligence | Novel environment generalization |
The critical innovation in this scoring methodology is the requirement for grounded evaluation. Unlike text-based benchmarks where a correct answer suffices regardless of reasoning process, the embodied intelligence score penalizes systems that arrive at correct predictions through non-physical reasoning paths. A system that predicts a ball will fall when released (correct) by pattern-matching from training data receives a lower score than one that derives the prediction from an internal model of gravity, even if both produce the same answer. This distinction is operationalized through counterfactual perturbation: if modifying irrelevant features (ball color, background texture) changes the prediction, the system is relying on spurious correlations rather than physical understanding.
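One way to operationalize the counterfactual perturbation test is to re-query the system on scenes that differ only in physically irrelevant features and check whether the prediction flips. A hedged sketch (the predictor interface and the dictionary scene encoding are assumptions for illustration):

```python
import random

def stable_under_perturbation(predict, scene: dict,
                              irrelevant: dict[str, list],
                              n_trials: int = 20, seed: int = 0) -> bool:
    """Return True if predict() gives the same answer when physically
    irrelevant features (e.g. ball color, background texture) are
    resampled; a flip signals reliance on spurious correlations."""
    rng = random.Random(seed)
    baseline = predict(scene)
    for _ in range(n_trials):
        perturbed = dict(scene)
        for key, values in irrelevant.items():
            perturbed[key] = rng.choice(values)  # change only irrelevant features
        if predict(perturbed) != baseline:
            return False
    return True
```

A toy predictor that answers "the ball falls" from its support condition passes this check, while one keyed on the ball's color fails it, even though both answer the unperturbed scene correctly.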
## 6. The Grounding Problem and Its Measurement Implications
The philosophical foundation of embodied intelligence evaluation connects to what Harnad (1990) termed the “symbol grounding problem”: the challenge of connecting abstract symbols to their referents in the physical world. For the UIB framework, this translates into a concrete measurement question: how do we distinguish between a system that has genuine physical understanding and one that has merely memorized a large corpus of physical descriptions?
Recent work on vision-language-action (VLA) models provides empirical evidence for this distinction. The DM0 model (arXiv, 2026[13]) introduces spatial Chain-of-Thought reasoning that decomposes complex instructions into physically grounded action sequences, demonstrating that explicit spatial reasoning improves action prediction accuracy by 15-23% over models that skip the grounding step. This finding supports the UIB’s emphasis on process-aware evaluation rather than outcome-only metrics.
The embodied intelligence survey by Liu et al. (Springer, 2025[14]) categorizes the relationship between perception, planning, and action as the fundamental PPA (Perception-Planning-Action) paradigm of embodied systems. Their analysis demonstrates that evaluating any single component in isolation produces misleading capability assessments. A system with excellent perception but poor planning appears competent on perception benchmarks while being functionally useless in physical tasks. The UIB embodied dimension addresses this by requiring integrated evaluation across the full PPA pipeline.
The editorial by Pezzulo et al. (Frontiers in Robotics and AI, 2026[15]) argues that narrow embodied intelligence, as measured by task-specific benchmarks, fundamentally differs from the general, self-referential, socially contextual intelligence exhibited by biological agents. They propose that evaluation must move beyond task completion to assess whether systems develop internal models that generalize across physical contexts, a requirement that aligns precisely with the UIB’s inference-agnostic design philosophy.
The IEEE Robotics and Automation Magazine investigation of mechanical intelligence for manipulation tasks (IEEE RA Magazine, 2026[16]) demonstrates that embodied intelligence is not purely computational. Physical morphology, including gripper compliance, actuator dynamics, and sensor placement, directly affects task performance in ways that no amount of computational intelligence can compensate for. This finding has direct implications for UIB scoring: the embodied dimension must account for the interaction between computational and physical intelligence rather than evaluating computation in isolation.
## 7. Integration with Existing UIB Dimensions
The embodied intelligence dimension does not exist in isolation within the UIB framework. It interacts with and complements the previously established dimensions, particularly causal intelligence.
Causal intelligence, as defined in the previous UIB article, evaluates whether systems can reason about cause and effect across Pearl’s three-level hierarchy. Embodied intelligence operationalizes this causal reasoning in physical contexts. A system with strong causal intelligence but no embodied grounding can reason abstractly about causation but cannot apply that reasoning to predict physical outcomes. Conversely, a system with strong embodied reflexes but no causal understanding can execute learned physical behaviors but cannot adapt when causal relationships change.
The interaction between these dimensions creates evaluation synergies. Physical reasoning tasks naturally involve causal chains (pushing a domino causes the next one to fall), while causal reasoning tasks can be grounded in physical scenarios to test whether abstract causal models align with physical reality. The UIB framework captures these interactions through cross-dimensional evaluation tasks that require competence in multiple dimensions simultaneously.
Sim-to-real transfer research further illuminates this interaction. The end-to-end sim-to-real transfer approach reported in Scientific Reports (Scientific Reports, 2026[17]) demonstrates that neural style transfer can bridge visual domain gaps, but physical dynamics gaps persist even with photorealistic rendering. This finding suggests that the sim-to-real transfer sub-dimension captures something distinct from perception grounding: it measures the accuracy of the system’s internal physics model rather than its visual processing capabilities.
The comprehensive review of reinforcement learning approaches to sim-to-real transfer (Robotics and Autonomous Systems, 2026[18]) provides a taxonomy of transfer methods that maps directly onto UIB evaluation. Domain randomization, which tests robustness to physical parameter variation, evaluates the generalization sub-component. Adversarial training, which specifically targets the simulation-reality gap, evaluates the adaptation sub-component. Progressive transfer, which measures how quickly performance recovers in new environments, evaluates the efficiency sub-component.
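Domain randomization, the first of these transfer methods, is straightforward to sketch: each training episode draws its physical parameters from broad ranges so the policy cannot overfit a single simulator configuration. (Parameter names and ranges below are illustrative, not taken from the cited review.)

```python
import random

# Illustrative per-episode ranges for domain randomization; a real setup
# would map these onto the physics engine's parameters (e.g. MuJoCo, Isaac Sim).
PARAM_RANGES = {
    "friction_coeff":   (0.5, 1.5),
    "object_mass_kg":   (0.1, 2.0),
    "actuator_delay_s": (0.0, 0.03),
    "camera_noise_std": (0.0, 0.05),
}

def sample_episode_config(rng: random.Random) -> dict[str, float]:
    """Draw one simulator configuration uniformly from PARAM_RANGES."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}
```

A training loop would call `sample_episode_config` at every episode reset, so the policy only ever sees the distribution of environments, never a single fixed one.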
## 8. Practical Evaluation Protocol
Implementing the embodied intelligence dimension requires a standardized evaluation protocol that balances comprehensiveness with practical feasibility. We propose a three-tier evaluation structure.
| Tier | Environment | Evaluation Focus | Duration |
|---|---|---|---|
| Tier 1: Simulated | Physics engine (MuJoCo, Isaac Sim) | Baseline physical reasoning | 2-4 hours |
| Tier 2: Hybrid | Simulated with real-world visual input | Perception-action alignment | 4-8 hours |
| Tier 3: Physical | Real robot or physical testbed | Full embodied competence | 1-3 days |
Tier 1 provides a cost-effective screening that any AI system can undergo, regardless of whether it has a physical embodiment. The evaluation uses standardized physics simulation environments to test physical reasoning, action planning, and affordance prediction through simulated interactions. Systems that cannot pass Tier 1 receive a maximum embodied intelligence score of 40, regardless of other capabilities.
Tier 2 introduces the perception-reality alignment challenge by providing real-world visual and sensor input while maintaining simulated physics. This hybrid approach tests whether systems that perform well on clean simulated inputs can handle the noise, occlusion, and ambiguity of real-world perception. The sim-to-real transfer sub-dimension is specifically targeted at this tier.
Tier 3 requires physical deployment and evaluates the complete embodied intelligence stack in real environments. This tier is optional but necessary for systems claiming scores above 80 on the embodied intelligence dimension. The requirement for physical validation ensures that high scores on the embodied dimension reflect genuine physical competence rather than simulation-optimized performance.
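The tier rules above amount to two gates on the raw composite score; a minimal sketch of that gating logic (our formulation of the protocol's caps):

```python
def gated_ei_score(raw_score: float,
                   passed_tier1: bool,
                   tier3_validated: bool) -> float:
    """Apply the tier caps described above: failing Tier 1 limits the EI
    score to 40, and scores above 80 require Tier 3 physical validation."""
    if not 0.0 <= raw_score <= 100.0:
        raise ValueError("raw_score must be in [0, 100]")
    if not passed_tier1:
        return min(raw_score, 40.0)   # simulated screening not passed
    if raw_score > 80.0 and not tier3_validated:
        return 80.0                    # physical validation required above 80
    return raw_score
```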
## 9. Conclusion
The introduction of embodied intelligence as a formal UIB dimension addresses a critical gap in intelligence evaluation. Current benchmarks, optimized for text-based reasoning and pattern recognition, systematically overestimate the intelligence of systems that lack physical grounding. The five sub-dimensions proposed here, perception grounding, physical reasoning, action planning, affordance prediction, and sim-to-real transfer, provide a comprehensive framework for evaluating the physical intelligence that biological cognition takes for granted.
The empirical evidence is unambiguous. Frontier models exhibit performance gaps exceeding 45 percentage points between text-based and embodied evaluations of nominally identical capabilities. This is not a minor calibration issue. It represents a fundamental measurement failure that the UIB framework is designed to correct.
As embodied AI systems move from research laboratories to real-world deployment, including surgical robots, autonomous vehicles, and humanoid platforms, the stakes of accurate capability assessment increase proportionally. A benchmark that certifies a system as intelligent based solely on its text-processing capabilities, while that system cannot predict whether a cup will tip when placed on a tilted surface, is not measuring intelligence. It is measuring a subset of capabilities that happen to be easy to evaluate.
The UIB embodied intelligence dimension, integrated with causal intelligence and the framework’s other dimensions, moves us toward evaluation that reflects the full scope of what it means to be intelligent, not just what it means to be articulate.
## References (18)
- Stabilarity Research Hub. Embodied Intelligence as a UIB Dimension: Why Physical Grounding Is the Missing Benchmark. doi.org.
- Ivchenko (2026). Causal intelligence as a UIB dimension (previous article in this series). hub.stabilarity.com.
- (2026). Bio-inspired cognitive robotics vs. embodied AI for socially acceptable, civilized robots. Frontiers. doi.org.
- D’Angelo et al. (2026). A benchmarking framework for embodied neuromorphic agents. Nature Machine Intelligence. doi.org.
- (2025). [2511.00108] Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence. arxiv.org.
- (2025). [2510.17801] RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain. arxiv.org.
- (2026). [2601.16677] Sim-to-Real Transfer via a Style-Identified Cycle Consistent Generative Adversarial Network: Zero-Shot Deployment on Robotic Manipulators through Visual Domain Adaptation. arxiv.org.
- (2026). [2601.21570] EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots. arxiv.org.
- (2026). BEDI: a benchmark for embodied UAV tasks. sciencedirect.com.
- (2026). AirCopBench: multi-drone collaborative embodied perception and reasoning benchmark. AAAI. doi.org.
- (2026). [2601.04137] WoW-World-Eval: A Comprehensive Embodied World Model Evaluation Turing Test. arxiv.org.
- (2025). [2412.13178] SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents. arxiv.org.
- (2026). [2602.14974] DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI. arxiv.org.
- (2025). Embodied intelligence for robot manipulation: development and challenges. Vicinagearth, Springer Nature. doi.org.
- (2025). Editorial: Narrow and general intelligence: embodied, self-referential social cognition and novelty production in humans, AI and robots. Frontiers. doi.org.
- (2025). Leveraging Embodied Mechanical Intelligence for Learning Decluttering Tasks: Gripper Design Boosts Learning. IEEE Robotics and Automation Magazine, IEEE Xplore. doi.org.
- (2026). End-to-end example-based sim-to-real RL policy transfer based on neural stylisation with application to robotic cutting. Scientific Reports. doi.org.
- (2025). Review of reinforcement learning approaches to sim-to-real transfer. Robotics and Autonomous Systems. doi.org.