Specifying the Impossible: A Complete Engineering Specification for an Autonomous Humanoid Robot
DOI: 10.5281/zenodo.18946974 · View on Zenodo (CERN)
The Specification Challenge
A humanoid robot is a system of perhaps 500 interdependent requirements. The locomotion subsystem demands actuators with specific torque curves, which constrain motor selection, which determines power draw, which sizes the battery, which adds mass to the structure, which increases the torque requirements for locomotion. Every specification decision cascades through the system.

How do you specify something this complex? The conventional answer is iteration: make initial estimates, design the system, discover the estimates were wrong, revise, repeat. This works but obscures the engineering logic. The final specification appears as if it emerged fully formed, disconnected from the tradeoffs that shaped it.

We take a different approach: explicit constraint propagation. We define the hard constraints first (mass limit, battery life, emergency stop response), then allocate budgets to subsystems, then verify the allocations sum to less than the total. When constraints conflict, we document the conflict and the resolution. The specification becomes a living record of engineering reasoning, not just a frozen parameter list.

This article presents the complete high-level specification for the Open Humanoid. By the end, every subsystem has defined interfaces, budgets, and performance targets. The remaining eighteen articles fill in the detailed designs.
How Industry Approaches Specification
Before presenting our specification, we examine how existing platforms handle the problem.
Boston Dynamics: Capability-First Design
Boston Dynamics appears to practice capability-first design: identify a desired behavior (backflip, stair descent, push recovery), then engineer systems to achieve it. The specification emerges from capability targets rather than preceding them. This approach produces impressive demonstrations but resists systematic documentation. Each capability may require custom solutions that do not generalize. The 2024-2025 transition from hydraulic to electric actuation suggests a fundamental architecture revision that a specification-first approach might have anticipated.
Unitree: Platform Scaling
Unitree demonstrates platform scaling: the G1 (127cm, 35kg) and H1 (180cm, 47kg) share architectural approaches while targeting different applications. The specification discipline manifests in consistent actuator interfaces and software frameworks across platforms. The H1’s world-record 3.3 m/s running speed indicates aggressive performance optimization within a stable specification envelope. Research institutions report that basic locomotion can be achieved in 1-2 weeks with the provided SDK, suggesting well-documented interfaces.
Automotive Approach: Cost-Target Design
Tesla and automotive-adjacent programs practice cost-target design: the specification begins with a price point ($20,000-$30,000 for Optimus), then derives technical requirements that fit the cost envelope. This inverts the traditional engineering sequence where performance requirements precede cost estimation. Cost-target design produces manufacturable systems but may sacrifice capability margins. Reports questioning Optimus’s autonomous capability suggest the cost constraints may have compressed the compute and sensing budgets.
Master Constraint Set
The Open Humanoid specification begins with seven non-negotiable constraints:

| Constraint | Value | Rationale |
|---|---|---|
| Total mass | 80 kg maximum | Two-person handling, standard doorway passage |
| Height | 160-180 cm | Human-scale environment compatibility |
| Battery life | >60 min | Useful work cycles without recharging |
| Operating temperature | 0-40 °C | Indoor environments, moderate outdoor |
| IP rating | IP54 minimum | Dust protection, splash resistance |
| Emergency stop response | <100 ms | Industrial safety compliance |
| Onboard communication | WiFi 6 + Bluetooth 5.2 | Standard industrial connectivity |

These constraints derive from practical deployment requirements, not arbitrary targets. An 80-kilogram limit allows two technicians to handle the robot manually during maintenance or emergency recovery. The 60-minute battery target enables an 8-hour workday with battery swaps, accounting for an 85% duty cycle. IP54 protection handles the splash events common in industrial environments without requiring the cost and weight of full waterproofing.
Subsystem Mass Budget
The 80-kilogram mass limit must be allocated across subsystems. Based on analysis of existing platforms and engineering estimates:

| Subsystem | Mass Allocation (kg) | Percentage |
|---|---|---|
| Structure (skeleton, housing) | 18.0 | 22.5% |
| Lower body actuators | 16.0 | 20.0% |
| Upper body actuators | 10.0 | 12.5% |
| Battery pack | 12.0 | 15.0% |
| Compute and electronics | 4.0 | 5.0% |
| Sensors (vision, IMU, force) | 3.0 | 3.75% |
| Wiring and connectors | 4.0 | 5.0% |
| Hands and end effectors | 3.0 | 3.75% |
| Head assembly (sensors, speakers) | 2.5 | 3.125% |
| Thermal management | 2.5 | 3.125% |
| Margin | 5.0 | 6.25% |
| Total | 80.0 | 100% |

The 5-kilogram margin (6.25%) provides buffer for integration hardware, cable routing adjustments, and specification changes during detailed design. Without margin, any subsystem overrun would require system-wide redesign. Lower body actuators receive the largest allocation (20%) because bipedal locomotion requires high torque at the hip, knee, and ankle. The Unitree H1 achieves 189 N·m/kg peak torque density in its actuators; we budget for similar performance.
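Budget discipline of this kind is easy to verify mechanically. A minimal sketch, with values copied from the table above (the dictionary keys are illustrative, not part of the spec schema):

```python
# Mass budget check: subsystem allocations (kg) from the table above.
MASS_LIMIT_KG = 80.0

mass_budget_kg = {
    "structure": 18.0,
    "lower_body_actuators": 16.0,
    "upper_body_actuators": 10.0,
    "battery_pack": 12.0,
    "compute_and_electronics": 4.0,
    "sensors": 3.0,
    "wiring_and_connectors": 4.0,
    "hands_and_end_effectors": 3.0,
    "head_assembly": 2.5,
    "thermal_management": 2.5,
    "margin": 5.0,
}

total_kg = sum(mass_budget_kg.values())
# Any overrun forces a system-wide renegotiation, so fail loudly.
assert total_kg <= MASS_LIMIT_KG, f"over budget: {total_kg} kg"
print(f"allocated {total_kg} kg of {MASS_LIMIT_KG} kg limit")
```

A check like this, run in continuous integration against the spec files, turns the budget table from documentation into an enforced invariant.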
Power Budget
The 60-minute battery life constraint combined with the 12-kilogram battery mass determines available energy. Modern lithium-ion cells achieve approximately 250 Wh/kg at the cell level, degrading to approximately 200 Wh/kg at the pack level after accounting for battery management systems, structural housing, and thermal management.

12 kg battery × 200 Wh/kg = 2,400 Wh total capacity

For 60-minute operation: 2,400 Wh ÷ 1 hour = 2,400 W average power budget
pie title Power Budget Allocation (2400W Total)
"Locomotion Actuators" : 1200
"Upper Body Actuators" : 400
"Onboard Compute" : 300
"Sensors & Perception" : 150
"Communication" : 50
"Thermal Management" : 200
"Margin" : 100
| Subsystem | Power Allocation (W) | Percentage |
|---|---|---|
| Locomotion actuators | 1,200 | 50.0% |
| Upper body actuators | 400 | 16.7% |
| Onboard compute | 300 | 12.5% |
| Sensors and perception | 150 | 6.25% |
| Communication | 50 | 2.1% |
| Thermal management | 200 | 8.3% |
| Margin | 100 | 4.2% |
| Total | 2,400 | 100% |
Locomotion consumes 50% of the power budget because bipedal walking requires continuous torque production at multiple joints. This allocation assumes moderate walking speed (1.0-1.5 m/s) on flat terrain; running gaits or stair climbing would exceed it temporarily, drawing on the battery's peak-discharge headroom.

The 300 W compute budget constrains onboard AI capabilities. For reference, an NVIDIA Jetson AGX Orin consumes 15-60 W depending on workload; a 300 W budget allows multiple accelerator modules or higher-power discrete GPUs.
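The energy arithmetic and the allocation table can be folded into a single check; a sketch under the stated 200 Wh/kg pack-level assumption:

```python
# Pack energy from mass and specific energy, then verify the power allocations.
PACK_MASS_KG = 12.0
PACK_WH_PER_KG = 200.0   # ~250 Wh/kg cell-level, derated for BMS, housing, thermal
RUNTIME_H = 1.0          # 60-minute constraint

capacity_wh = PACK_MASS_KG * PACK_WH_PER_KG   # 2400 Wh
avg_power_budget_w = capacity_wh / RUNTIME_H  # 2400 W

power_budget_w = {
    "locomotion_actuators": 1200,
    "upper_body_actuators": 400,
    "onboard_compute": 300,
    "sensors_and_perception": 150,
    "communication": 50,
    "thermal_management": 200,
    "margin": 100,
}
# The allocations must close the budget exactly; any change here must be
# compensated elsewhere in the table.
assert sum(power_budget_w.values()) == avg_power_budget_w
```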
Subsystem Specifications
Locomotion Subsystem
subsystem: locomotion
version: 0.1
dependencies: [actuation, structure, power, control]
constraints:
mass_budget_kg: 16.0 (lower body actuators)
power_budget_w: 1200
volume_mm: distributed across legs
cost_usd: 8000 target
performance_targets:
gait_speed_ms: 1.5 minimum, 2.5 target
degrees_of_freedom: 12 (6 per leg)
balance_recovery_ms: <500
step_height_mm: 150
ground_clearance_swing_mm: 30
slope_capability_deg: 15
open_challenges:
- Dynamic stability during turning
- Energy-efficient gait generation
- Uneven terrain adaptation
references:
- J R Soc Interface 23(235):20250662 (human-inspired bipedal locomotion)
The 12-DOF lower body provides 3 DOF per hip (flexion/extension, abduction/adduction, rotation), 1 DOF per knee (flexion/extension), and 2 DOF per ankle (flexion/extension, inversion/eversion). This matches the minimal kinematic chain for human-like walking while constraining actuator count. The 500ms balance recovery target requires active center-of-mass adjustment. Research on deep reinforcement learning for locomotion demonstrates that simulation-trained policies can achieve robust recovery using only proprioceptive feedback when trained with appropriate randomization curricula.
Manipulation Subsystem
subsystem: manipulation
version: 0.1
dependencies: [actuation, structure, control, vision]
constraints:
mass_budget_kg: 13.0 (upper body actuators + hands)
power_budget_w: 400
volume_mm: distributed across arms and torso
cost_usd: 6000 target
performance_targets:
arm_dof: 14 (7 per arm)
hand_dof: 24 (12 per hand)
grip_force_n: 40
payload_kg: 5 (per hand), 10 (two-handed)
positioning_accuracy_mm: 5
reach_mm: 700
open_challenges:
- Dexterous manipulation with compliant grasp
- Contact-rich task planning
- Tool use adaptation
references:
- Figure AI BMW pilot data (2025)
- Unitree H1 manipulation specifications
The 7-DOF arm configuration (shoulder 3 DOF, elbow 1 DOF, wrist 3 DOF) provides kinematic redundancy for obstacle avoidance. The 12-DOF hand configuration (4 fingers x 3 DOF each) enables power grasp, precision grasp, and basic in-hand manipulation. The 40 N grip force allows secure handling of objects up to approximately 4 kg in a friction grip (assuming a friction coefficient of 0.5), with higher capacity in form-closure grasps.
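The friction-grip estimate follows from a standard two-contact pinch model; a quick check under the assumed friction coefficient of 0.5:

```python
# Friction grasp: two opposing contacts each resist mu * N tangentially.
MU = 0.5             # assumed coefficient of friction (from the text)
GRIP_FORCE_N = 40.0  # normal force per contact
G = 9.81             # m/s^2

max_tangential_n = 2 * MU * GRIP_FORCE_N   # 40 N of support against gravity
max_payload_kg = max_tangential_n / G      # ~4.1 kg

print(f"friction-grip payload limit ~ {max_payload_kg:.1f} kg")
```

Form-closure grasps (wrapping the fingers around the object) bypass the friction limit entirely, which is why the two-handed payload target can exceed this figure.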
Vision Subsystem
subsystem: vision
version: 0.1
dependencies: [compute, power]
constraints:
mass_budget_kg: 1.5
power_budget_w: 80
volume_mm: head-mounted, 150x100x80
cost_usd: 2000 target
performance_targets:
rgb_resolution: 1920x1080
depth_resolution: 640x480
field_of_view_deg: 90 horizontal
frame_rate_hz: 30
depth_range_m: 0.3-10
latency_ms: <50
open_challenges:
- Real-time object detection at 30 fps
- Robust depth estimation in varied lighting
- SLAM in dynamic environments
references:
- Unitree sensor specifications
- Intel RealSense D455 benchmarks
The vision subsystem combines RGB camera for appearance processing with depth sensor for spatial understanding. The 50ms latency target requires tight integration between sensing and compute; typical USB-connected depth cameras add 30-50ms latency before processing.
Speech Subsystem
subsystem: speech
version: 0.1
dependencies: [compute, power]
constraints:
mass_budget_kg: 0.5
power_budget_w: 30
volume_mm: head-mounted, 80x60x40
cost_usd: 500 target
performance_targets:
asr_latency_ms: <200
tts_latency_ms: <100
wake_word_detection: always-on, <10mW
language_support: en, de, zh minimum
noise_robustness_snr_db: 5
open_challenges:
- Onboard LLM inference within power budget
- Real-time conversation with <500ms response
- Multi-speaker disambiguation
references:
- Whisper model specifications
- Edge LLM benchmarks 2026
The speech subsystem faces the most significant compute constraints. Running a language model on-device within a 30W allocation requires quantized models and specialized inference hardware. Cloud fallback may be necessary for complex reasoning while keeping basic interaction local.
Compute Subsystem
subsystem: compute
version: 0.1
dependencies: [power, thermal]
constraints:
mass_budget_kg: 4.0
power_budget_w: 300
volume_mm: torso-mounted, 200x150x100
cost_usd: 5000 target
performance_targets:
flops_inference: 200 TOPS (INT8)
flops_float: 50 TFLOPS (FP16)
memory_gb: 32
control_loop_hz: 1000
perception_latency_ms: <50
open_challenges:
- Real-time control + perception on shared hardware
- Thermal management within enclosure
- Deterministic scheduling for safety-critical loops
references:
- NVIDIA Jetson specifications
- Real-time OS benchmarks
flowchart LR
subgraph Sensors
IMU[IMU 1kHz]
Encoders[Joint Encoders 1kHz]
Force[Force Sensors 500Hz]
Vision[Vision 30Hz]
Audio[Audio 16kHz]
end
subgraph Perception["Perception Pipeline"]
StateEst[State Estimation]
ObjDet[Object Detection]
SLAM[SLAM]
ASR[Speech Recognition]
end
subgraph Planning["Planning Layer"]
MotionPlan[Motion Planning]
TaskPlan[Task Planning]
NavPlan[Navigation]
end
subgraph Control["Control Layer"]
WholeBody[Whole-Body Control]
JointCtrl[Joint Controllers]
SafetyMon[Safety Monitor]
end
subgraph Actuation
Motors[Motor Drivers]
Speakers[Audio Output]
end
IMU --> StateEst
Encoders --> StateEst
Force --> StateEst
Vision --> ObjDet
Vision --> SLAM
Audio --> ASR
StateEst --> WholeBody
ObjDet --> MotionPlan
SLAM --> NavPlan
ASR --> TaskPlan
MotionPlan --> WholeBody
TaskPlan --> MotionPlan
NavPlan --> MotionPlan
WholeBody --> JointCtrl
JointCtrl --> Motors
SafetyMon --> Motors
TaskPlan --> Speakers
The compute architecture separates real-time control (1kHz joint control, 100Hz whole-body control) from perception (30Hz vision, streaming audio). A real-time operating system partition handles control while a Linux partition handles perception and planning. Safety monitoring operates independently with hardware watchdog timers.
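The rate separation described above can be sketched as a single base-rate tick with per-task divisors. The task names are illustrative; a real implementation would use RTOS scheduling primitives, not a Python loop:

```python
# Toy rate partition: each layer runs at base_rate / divisor off one 1 kHz tick.
BASE_RATE_HZ = 1000

divisors = {
    "joint_control": 1,        # 1000 Hz hard real-time loop
    "whole_body_control": 10,  # 100 Hz
    "vision_pipeline": 33,     # ~30 Hz perception
}

runs = {name: 0 for name in divisors}
for tick in range(BASE_RATE_HZ):   # simulate one second of ticks
    for name, divisor in divisors.items():
        if tick % divisor == 0:
            runs[name] += 1

assert runs["joint_control"] == 1000
assert runs["whole_body_control"] == 100
assert runs["vision_pipeline"] == 31   # ~30 Hz (ticks 0, 33, ..., 990)
```

The point of the partition is that a stalled vision frame can never delay the 1 kHz joint loop; in the real system this isolation comes from the RTOS/Linux split and the hardware watchdog, not from loop ordering.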
Power Subsystem
subsystem: power
version: 0.1
dependencies: [thermal, structure]
constraints:
mass_budget_kg: 12.0
volume_mm: torso-mounted, 300x200x100
cost_usd: 3000 target
performance_targets:
capacity_wh: 2400
voltage_v: 48 nominal
peak_discharge_a: 100
charging_time_hr: 2 (fast charge)
cycle_life: 1000 cycles to 80%
hot_swap: supported
open_challenges:
- Thermal runaway prevention
- Cell balancing during high-current discharge
- Weight distribution for balance
references:
- LG Chem cell specifications
- BMS design guidelines
The 48V nominal voltage balances actuator efficiency (higher voltage = lower current = thinner cables) against safety (lower voltage = reduced shock hazard). Hot-swap capability enables continuous operation across battery changes.
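The cabling argument is simple Ohm's-law arithmetic; a quick check against the spec values:

```python
# Average bus current at 48 V, and the headroom the 100 A discharge limit provides.
BUS_VOLTAGE_V = 48.0
AVG_POWER_W = 2400.0       # from the power budget
PEAK_DISCHARGE_A = 100.0   # pack limit from the spec

avg_current_a = AVG_POWER_W / BUS_VOLTAGE_V      # 50 A average draw
peak_power_w = BUS_VOLTAGE_V * PEAK_DISCHARGE_A  # 4800 W transient capability

assert avg_current_a == 50.0
assert peak_power_w == 4800.0   # 2x headroom over the average budget
```

The 2x peak-to-average ratio is what absorbs the transient loads mentioned earlier (stair climbing, balance recovery) without violating the pack's discharge limit.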
Structure Subsystem
subsystem: structure
version: 0.1
dependencies: [all - provides mounting]
constraints:
mass_budget_kg: 18.0
cost_usd: 4000 target
performance_targets:
materials: carbon fiber (limbs), aluminum 6061 (joints), TPU (covers)
factor_of_safety: 2.5 static, 4.0 fatigue
joint_stiffness_nm_deg: 1000 minimum
environmental: IP54
open_challenges:
- Impact absorption during falls
- Cable routing through joints
- Maintenance accessibility
references:
- Carbon fiber layup standards
- IP54 sealing guidelines
The structural subsystem provides mounting for all other subsystems while maintaining stiffness under dynamic loads. Carbon fiber offers the best strength-to-weight ratio for limb segments; aluminum provides manufacturability at joint housings; TPU covers protect electronics while allowing compliance.
System Dependencies
graph TD
subgraph Core["Core Systems"]
Power[Power]
Compute[Compute]
Structure[Structure]
end
subgraph Mobility["Mobility Systems"]
Locomotion[Locomotion]
Navigation[Navigation]
end
subgraph Perception["Perception Systems"]
Vision[Vision]
Sensors[Sensor Fusion]
end
subgraph Interaction["Interaction Systems"]
Manipulation[Manipulation]
Speech[Speech]
end
subgraph Safety["Safety Systems"]
SafetySys[Safety & E-Stop]
Thermal[Thermal Management]
end
Power --> Compute
Power --> Locomotion
Power --> Manipulation
Power --> Vision
Power --> Speech
Structure --> Locomotion
Structure --> Manipulation
Structure --> Power
Compute --> Locomotion
Compute --> Navigation
Compute --> Vision
Compute --> Sensors
Compute --> Manipulation
Compute --> Speech
Vision --> Sensors
Vision --> Navigation
Vision --> Manipulation
Sensors --> Locomotion
Sensors --> Safety
Navigation --> Locomotion
SafetySys --> Power
SafetySys --> Locomotion
SafetySys --> Manipulation
Thermal --> Power
Thermal --> Compute
The dependency graph reveals that Power, Compute, and Structure are foundational: every other subsystem depends on them. This suggests the design sequence should finalize these three subsystems first, providing stable interfaces for dependent systems.
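The design-sequence claim can be checked by topologically sorting the dependency graph; a sketch over a simplified subset of the edges above (nodes with no prerequisites come out first):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# node -> subsystems it depends on, simplified from the diagram above
depends_on = {
    "Structure": [],
    "Power": ["Structure"],
    "Compute": ["Power"],
    "Vision": ["Power", "Compute"],
    "Sensors": ["Compute", "Vision"],
    "Navigation": ["Compute", "Vision"],
    "Speech": ["Power", "Compute"],
    "Manipulation": ["Power", "Structure", "Compute", "Vision"],
    "Locomotion": ["Power", "Structure", "Compute", "Sensors", "Navigation"],
}

order = list(TopologicalSorter(depends_on).static_order())
# Foundational subsystems must precede everything that depends on them.
assert order.index("Structure") < order.index("Power") < order.index("Compute")
assert order.index("Compute") < order.index("Locomotion")
print(order)
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which makes it a cheap guard against accidentally introducing a dependency loop as the spec evolves.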
The Twenty-Article Decomposition
Each subsystem specification above becomes the foundation for a detailed design article:

| Article | Subsystem Focus | Key Deliverable |
|---|---|---|
| 3 | Locomotion | Gait generation algorithm |
| 4 | Actuation | Motor selection and torque analysis |
| 5 | Structure | Material selection and stress analysis |
| 6 | Power | Battery cell selection and BMS design |
| 7 | Compute | Hardware selection and OS configuration |
| 8 | Vision | Camera selection and perception pipeline |
| 9 | Sensors | IMU selection and fusion algorithm |
| 10 | Manipulation | Hand design and grasp planning |
| 11 | Speech | ASR/TTS integration and LLM deployment |
| 12 | Navigation | SLAM algorithm and path planning |
| 13 | Force Control | Impedance control implementation |
| 14 | Safety | Emergency stop and fall detection |
| 15 | Control | Real-time architecture and scheduling |
| 16 | Communication | Multi-robot protocol design |
| 17 | Simulation | Physics engine integration |
| 18 | Integration | System testing methodology |
| 19 | Assembly | Bill of materials and sourcing |
| 20 | Demonstration | Two-robot simulation room |

Each article will reference this specification, verify compliance with budgets, and update constraint allocations as detailed design reveals opportunities or conflicts.
Specification Management
The specification lives in a GitHub repository alongside simulation code:
open-humanoid/
specs/
MASTER_SCHEMA.md
locomotion.yaml
manipulation.yaml
vision.yaml
...
simulation/
index.html
src/
assets/
articles/
ROADMAP.md
docs/
As each article completes, its corresponding specification file updates from status: specified to status: validated (design complete) or status: simulated (simulation confirms performance). Version control provides complete traceability. If a later article discovers that the manipulation mass budget is insufficient, the commit history shows exactly what changed and why.
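The status lifecycle can be enforced with a trivial transition table. The helper name and the "specified may jump straight to simulated" reading are assumptions drawn from the text, not an existing tool:

```python
# Allowed spec-status transitions, per the lifecycle described above.
ALLOWED_TRANSITIONS = {
    "specified": {"validated", "simulated"},
    "validated": {"simulated"},
    "simulated": set(),  # terminal for now
}

def may_transition(old: str, new: str) -> bool:
    """True iff a spec file may move from status `old` to status `new`."""
    return new in ALLOWED_TRANSITIONS.get(old, set())

assert may_transition("specified", "validated")
assert may_transition("validated", "simulated")
assert not may_transition("simulated", "specified")
```

Wired into a pre-commit hook, a check like this prevents a spec file from silently regressing from `validated` back to `specified` without an explicit, reviewed change.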
Conclusion
Specifying a humanoid robot requires disciplined constraint management: allocating budgets across subsystems, tracking dependencies, and maintaining margin for integration. The specification presented here provides the foundation for the remaining eighteen articles. Every number in this specification is provisional. The mass budgets will shift as detailed design proceeds. The power allocations will adjust as actuator selections finalize. The interface definitions will evolve as integration reveals missing signals. But the structure remains stable: explicit constraints, explicit allocations, explicit rationale. When the final simulation room demonstrates two walking, communicating robots, every design decision will trace back to this specification. The next article addresses the first motion challenge: bipedal gait and balance control.
References
- Ivchenko, O. (2026). The Open Humanoid: Why We Are Building a Robot From First Principles. Stabilarity Research Hub, Article 1 of 20.
- arXiv. (2026). Robust humanoid walking on compliant and uneven terrain with deep reinforcement learning. arXiv:2504.13619. Available: https://arxiv.org/abs/2504.13619
- Koseki, S., Hayashibe, M., & Owaki, D. (2026). Human-inspired bipedal locomotion: from neuromechanics to mathematical modelling and robotic applications. Journal of the Royal Society Interface, 23(235), 20250662.
- arXiv. (2025). Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning. arXiv:2501.02116. Available: https://arxiv.org/abs/2501.02116
- Unitree Robotics. (2026). H1 and G1 Technical Specifications.
- Figure AI. (2025). Figure 02 BMW deployment technical summary.
- NVIDIA. (2026). Jetson AGX Orin Technical Reference Manual.
- Radosavovic, I., et al. (2024). Humanoid locomotion as next token prediction. arXiv:2402.19469. Available: https://arxiv.org/abs/2402.19469
- Kim, D., et al. (2023). Torque-based deep reinforcement learning for task-and-robot agnostic learning on bipedal robots using sim-to-real transfer. IEEE Robotics and Automation Letters. https://doi.org/10.1109/LRA.2023.3234044
- arXiv. (2025). Deep reinforcement learning for robotic bipedal locomotion: A brief survey. arXiv:2404.17070. Available: https://arxiv.org/abs/2404.17070