Human-Robot Interaction: Gesture Recognition, Emotion Detection, and Social Behaviour for Humanoid Robots
DOI: 10.5281/zenodo.19154329 · Zenodo (CERN)
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 0% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 93% | ✓ | ≥80% from verified, high-quality sources |
| [a] | DOI | 67% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 0% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 100% | ✓ | ≥80% have metadata indexed |
| [l] | Academic | 20% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 33% | ○ | ≥80% are freely accessible |
| [r] | References | 15 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 2,701 | ✓ | Minimum 2,000 words for a full research article. Current: 2,701 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19154329 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 54% | ✗ | ≥80% of references from 2025–2026. Current: 54% |
| [c] | Data Charts | 0 | ○ | Original data charts from reproducible analysis (min 2). Current: 0 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 3 | ✓ | Mermaid architecture/flow diagrams. Current: 3 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
For a humanoid robot to operate alongside humans in domestic, healthcare, and industrial settings, it must perceive and respond to the non-verbal cues that govern human social interaction. This article examines three pillars of human-robot interaction (HRI) for open-source humanoid platforms: gesture recognition through vision and inertial sensing, emotion detection via facial expression analysis and multimodal affective computing, and the generation of socially appropriate robot behaviour grounded in large language model (LLM) reasoning. We present sensor selection criteria, real-time inference architectures suitable for edge deployment, and a control-loop design that closes the perception-action gap for social responsiveness. Each subsystem is specified with sufficient engineering detail to guide implementation on the Open Humanoid reference platform, continuing our commitment to reproducible, first-principles robotics design.
1. Introduction #
In the previous article, we specified the communication protocols, middleware topology, and real-time networking stack that allow every subsystem of the Open Humanoid to exchange data deterministically ([1][2]). With reliable low-latency transport in place, we now turn to the highest layer of the robot’s software architecture: the perception and generation of social signals that enable natural human-robot interaction.
Human communication is heavily non-verbal. Mehrabian's classic decomposition, derived from narrow experiments on affective messages and often over-generalised, attributes only seven percent of emotional meaning to words, thirty-eight percent to vocal tone, and fifty-five percent to facial expression and body language. A humanoid robot that ignores these channels is, in practice, socially deaf. Recent systematic reviews confirm that physical gestures produced by robots are as identifiable as those of humans, underscoring the importance of bidirectional non-verbal communication in HRI ([2][3]). The convergence of lightweight deep learning, edge accelerators, and foundation models has made real-time social perception feasible on embedded hardware for the first time ([3][4]).
This article specifies three interdependent subsystems for the Open Humanoid: (i) a gesture recognition pipeline that interprets human hand and body poses, (ii) an emotion detection module that fuses facial expression analysis with vocal prosody, and (iii) a social behaviour planner that uses LLM-based reasoning to select contextually appropriate robot responses. Together, these subsystems close the social interaction loop, transforming the Open Humanoid from a capable manipulator into a collaborative partner.
2. Gesture Recognition Pipeline #
2.1 Sensor Modalities for Gesture Capture #
Robust gesture recognition requires redundant sensing across multiple modalities. The Open Humanoid’s existing perception stack, specified in Article 8 of this series, provides stereo RGB-D cameras and an IMU array. For HRI, we augment these with a dedicated wide-angle RGB camera mounted at chest height, angled upward to capture the human interlocutor’s upper body within conversational distance (0.5-2.0 m).
The primary vision pipeline uses MediaPipe Holistic or its open-source successors to extract 33 body landmarks, 21 per-hand landmarks, and 468 facial mesh points at 30 fps on edge hardware. The landmark extraction stage runs on a dedicated NPU partition, consuming approximately 2.1 TOPS of the available compute budget. A secondary depth channel from the Intel RealSense D456 provides metric hand distance and resolves scale ambiguity for pointing gestures.
Research on deep learning-based gesture-driven robot control demonstrates that combining static pose classification with dynamic temporal modelling yields recognition accuracies above 95% on standard benchmarks ([4][5]). The key architectural choice is the separation of spatial feature extraction (handled by a lightweight CNN) from temporal sequence modelling (handled by a transformer or LSTM head).
2.2 Classification Architecture #
We adopt a two-stage architecture for gesture classification:
Stage 1 — Spatial Encoding: A MobileNetV3-Small backbone extracts spatial features from the landmark tensor. The input is not raw RGB but a normalised skeleton representation (joint angles and distances), which provides illumination invariance and reduces model size to under 3 MB.
Stage 2 — Temporal Decoding: A single-layer temporal transformer with 4 attention heads processes a sliding window of 16 frames (approximately 530 ms at 30 fps). This captures the dynamics of gestures such as waving, beckoning, and pointing while remaining causal (no future-frame lookahead) for real-time operation.
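Stage 1's illumination-invariant input can be sketched as a plain landmark normalisation: translate to a hip-centred origin and scale by torso length. The function name and landmark indices below are illustrative (they follow MediaPipe's 33-point body convention), not a library API:

```python
import math

def normalise_skeleton(landmarks, hip_idx=(23, 24), shoulder_idx=(11, 12)):
    """Normalise (x, y) landmarks: hip-centred origin, torso-length scale.

    `landmarks` is a list of (x, y) tuples; indices follow MediaPipe's
    33-point body convention, but any consistent indexing works.
    """
    hip = [(landmarks[hip_idx[0]][i] + landmarks[hip_idx[1]][i]) / 2 for i in (0, 1)]
    shoulder = [(landmarks[shoulder_idx[0]][i] + landmarks[shoulder_idx[1]][i]) / 2 for i in (0, 1)]
    torso = math.dist(hip, shoulder) or 1.0  # guard against degenerate poses
    return [((x - hip[0]) / torso, (y - hip[1]) / torso) for x, y in landmarks]
```

Because the representation is relative and scale-free, the downstream encoder is invariant to camera exposure, subject height, and distance within the conversational range.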
The gesture vocabulary is organised into three tiers. Tier 1 comprises safety-critical commands: stop, come, go away, and emergency point. These are trained with maximum recall and trigger immediate motor responses. Tier 2 covers task-relevant gestures: grasp-this, follow-me, look-there, and handover-ready. Tier 3 encompasses social gestures: wave, thumbs-up, nod, and head-shake. A comprehensive review of visual hand gesture recognition with deep learning catalogues over 40 distinct gesture classes used in HRI research, from which our vocabulary is derived ([5][6]).
```mermaid
flowchart TD
A[RGB Camera 30fps] --> B[MediaPipe Holistic]
B --> C[33 Body + 42 Hand Landmarks]
C --> D[Skeleton Normalisation]
D --> E[MobileNetV3-Small Encoder]
E --> F[Temporal Transformer 16-frame]
F --> G{Gesture Class}
G -->|Tier 1: Safety| H[Immediate Motor Command]
G -->|Tier 2: Task| I[Task Planner Queue]
G -->|Tier 3: Social| J[Social Behaviour Planner]
K[Depth Camera] --> L[Metric Distance]
L --> D
```
2.3 Latency and Reliability Requirements #
For safety-critical Tier 1 gestures, the end-to-end latency budget from photon capture to motor command is 150 ms, comfortably within typical human visual reaction times. This is allocated as: 33 ms camera capture, 25 ms landmark extraction, 15 ms skeleton normalisation, 40 ms classification inference, and 37 ms command dispatch via the EtherCAT bus specified in Article 17. Tier 2 and Tier 3 gestures operate on a relaxed 300 ms budget, allowing ensemble voting across multiple frames for higher accuracy.
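The Tier 1 budget allocation can be written down as a checked table, with figures taken directly from this section:

```python
TIER1_BUDGET_MS = 150  # photon capture -> motor command, safety-critical path

STAGE_MS = {
    "camera_capture": 33,
    "landmark_extraction": 25,
    "skeleton_normalisation": 15,
    "classification_inference": 40,
    "command_dispatch": 37,  # EtherCAT bus, per Article 17
}

# The stage allocations must exactly consume the 150 ms budget.
assert sum(STAGE_MS.values()) == TIER1_BUDGET_MS
```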
Recent work on dexterous humanoid hands highlights that gesture semantics extend beyond recognition to the robot’s own expressive capacity, enabling bidirectional gestural communication ([6][7]). This motivates the integration of gesture generation (Section 4) alongside recognition.
3. Emotion Detection Module #
3.1 Facial Expression Recognition #
Facial expression recognition (FER) forms the primary channel for emotion detection. The Open Humanoid uses the chest-mounted wide-angle camera to capture the interlocutor’s face, with a secondary feed from the head-mounted stereo pair for close-range interaction.
We specify a hybrid attention model for FER that combines channel and spatial attention mechanisms within a ResNet-18 backbone. Recent work on the FER-HA architecture achieves 73.2% accuracy on AffectNet (8 classes) while maintaining inference times under 12 ms on mobile GPUs ([7][8]). For edge deployment, lightweight residual CNNs with knowledge distillation from larger teacher networks achieve 58-65% accuracy on AffectNet with models under 1.5 MB, suitable for continuous background inference ([8][9]).
The emotion classification maps facial expressions to Ekman’s six basic emotions (happiness, sadness, anger, fear, surprise, disgust) plus a neutral state. However, for HRI purposes, we collapse these into four actionable states: positive (happiness, surprise), negative (sadness, anger, fear, disgust), neutral, and confused (detected via compound expression analysis). This pragmatic reduction improves classification reliability and simplifies downstream behaviour selection.
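The collapse from the seven FER classes to the four actionable states is a fixed mapping; a minimal sketch (the "confused" compound-expression path is handled separately and omitted here):

```python
# Collapse Ekman's six basic emotions plus neutral into actionable HRI states.
ACTIONABLE_STATE = {
    "happiness": "positive", "surprise": "positive",
    "sadness": "negative", "anger": "negative",
    "fear": "negative", "disgust": "negative",
    "neutral": "neutral",
}

def collapse(probs):
    """Sum 7-class FER probabilities into the actionable states and
    return the most likely one. `probs` maps Ekman class -> probability."""
    out = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    for emotion, p in probs.items():
        out[ACTIONABLE_STATE[emotion]] += p
    return max(out, key=out.get)
```

Summing probabilities before the argmax is what makes the reduction more reliable than the raw 7-way decision: borderline mass spread across happiness and surprise still registers as a confident positive state.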
3.2 Multimodal Fusion #
Unimodal facial expression analysis is insufficient for robust emotion detection due to occlusion, cultural variation, and individual differences in expressiveness. We fuse three modalities:
Visual channel: FER from the camera pipeline described above, producing a 7-class probability vector at 15 Hz.
Acoustic channel: Vocal prosody features (pitch contour, speech rate, energy envelope) extracted from the robot’s microphone array using a Wav2Vec 2.0 encoder fine-tuned on emotional speech corpora. This produces a 4-class emotion embedding at utterance boundaries.
Contextual channel: Dialogue history and task state from the LLM-based social planner (Section 4), providing prior expectations about likely emotional states.
The fusion architecture uses a late-fusion strategy with learned attention weights. Each modality produces an independent emotion estimate, and an attention layer learns to weight the modalities based on signal quality (e.g., suppressing the visual channel when the face is occluded). Research on multimodal perception-driven decision-making for HRI confirms that late fusion with quality-aware gating outperforms early fusion by 8-12 percentage points in naturalistic settings ([3][4]).
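A minimal sketch of quality-aware late fusion, with a softmax over per-modality quality scores standing in for the learned attention weights (illustrative only, not the trained gating network):

```python
import math

def fuse(estimates, qualities):
    """Quality-aware late fusion over per-modality emotion estimates.

    `estimates`: one probability vector per modality (equal lengths).
    `qualities`: per-modality signal-quality scores; a softmax over them
    approximates the learned attention described in the text, so a badly
    occluded channel (very low quality) is effectively suppressed.
    """
    exp_q = [math.exp(q) for q in qualities]
    total = sum(exp_q)
    weights = [e / total for e in exp_q]
    n = len(estimates[0])
    return [sum(w * est[i] for w, est in zip(weights, estimates)) for i in range(n)]
```

With equal qualities the fusion reduces to a plain average; driving one quality score far negative (e.g. an occluded face) hands the decision to the remaining modalities.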
```mermaid
flowchart LR
subgraph Visual
A1[Camera] --> A2[Face Detection]
A2 --> A3[FER-HA Model]
A3 --> A4[7-class Prob Vector]
end
subgraph Acoustic
B1[Mic Array] --> B2[VAD + Segment]
B2 --> B3[Wav2Vec 2.0]
B3 --> B4[4-class Embedding]
end
subgraph Context
C1[Dialogue History] --> C2[LLM Prior]
C2 --> C3[Expected Emotion]
end
A4 --> D[Quality-Aware Attention Fusion]
B4 --> D
C3 --> D
D --> E[Fused Emotion State]
E --> F[Social Behaviour Planner]
```
3.3 Temporal Smoothing and Confidence Thresholds #
Raw emotion predictions oscillate rapidly between frames. We apply an exponential moving average with a decay constant of 0.85 over a 2-second window, ensuring that transient micro-expressions do not trigger abrupt behavioural changes. A confidence threshold of 0.65 gates emotion-driven behaviour: below this threshold, the robot defaults to neutral social posture rather than risking inappropriate responses to misclassified emotions.
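The smoothing and gating logic follows directly from the constants above; a minimal sketch (class layout is illustrative):

```python
class EmotionSmoother:
    """Exponential moving average over raw emotion scores with a
    confidence gate, using the 0.85 decay and 0.65 threshold above."""

    def __init__(self, n_classes, decay=0.85, threshold=0.65):
        self.decay = decay
        self.threshold = threshold
        self.state = [0.0] * n_classes

    def update(self, probs, labels):
        """Blend a new per-class probability vector into the running state
        and return the gated label (or 'neutral' below threshold)."""
        self.state = [self.decay * s + (1 - self.decay) * p
                      for s, p in zip(self.state, probs)]
        best = max(range(len(self.state)), key=self.state.__getitem__)
        if self.state[best] < self.threshold:
            return "neutral"  # default social posture below the confidence gate
        return labels[best]
```

A single confident frame moves the state by only 15%, so a transient micro-expression cannot clear the 0.65 gate; a sustained expression does so within a couple of seconds at 15 Hz.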
The continual learning challenge in FER, where a robot must adapt to new users without forgetting prior knowledge, is addressed through experience replay buffers that store representative exemplars from each emotion class. Recent work on continual facial feature transfer demonstrates that this approach preserves accuracy on base classes while adapting to domain shift across individuals ([9][10]).
4. Social Behaviour Planner #
4.1 LLM-Driven Interaction Architecture #
The social behaviour planner is the cognitive layer that transforms perceived gestures and emotions into contextually appropriate robot actions. Recent surveys on LLMs in HRI reveal that foundation models are reshaping how robots sense context and generate socially grounded interactions ([10][11]). However, directly querying a cloud-hosted LLM for every social decision introduces unacceptable latency (200-500 ms round-trip) and dependency on network availability.
We specify a hybrid architecture with two inference paths:
Fast path: A lightweight, locally-deployed language model (1-3B parameters, quantised to INT4) handles routine social responses. This model is fine-tuned on a curated dataset of HRI dialogue transcripts and mapped gesture-response pairs. Inference runs on the robot’s edge GPU in under 80 ms. Research on simultaneous text and gesture generation for social robots with small language models demonstrates that models in this parameter range can produce coherent verbal and gestural responses simultaneously ([11][12]).
Slow path: A cloud-hosted large model (70B+ parameters) handles complex, ambiguous, or novel social situations. The fast path routes to the slow path when its confidence score falls below 0.5 or when the interaction involves multi-turn reasoning about user intent. A local cache stores recent slow-path responses to accelerate repeated queries.
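A minimal sketch of the fast/slow routing policy with the cloud model stubbed out (function names and the cache shape are placeholders, not a specific API):

```python
def cloud_llm(query):
    # Stub: a real deployment would call the cloud-hosted 70B+ model here.
    return f"[slow-path response to: {query}]"

def route(fast_response, fast_confidence, multi_turn, cache, query,
          confidence_floor=0.5):
    """Hybrid routing between the local SLM and the cloud LLM.

    Returns (path, response). The fast path wins when its confidence is
    at or above the floor and no multi-turn intent reasoning is needed;
    cached slow-path answers short-circuit the network round-trip.
    """
    if fast_confidence >= confidence_floor and not multi_turn:
        return "fast", fast_response
    if query in cache:
        return "cache", cache[query]
    response = cloud_llm(query)
    cache[query] = response
    return "slow", response
```

The cache check before the network call is what bounds worst-case latency for repeated ambiguous queries to a single slow-path round-trip.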
4.2 Theory of Mind Integration #
Effective social behaviour requires the robot to maintain a model of the human’s mental state, beliefs, and intentions — a capacity known as Theory of Mind (ToM). We integrate ToM reasoning into the social planner through a structured state representation:
The ToM module maintains a per-person state vector that tracks: (i) inferred emotional state (from Section 3), (ii) estimated task goal (from dialogue and gesture context), (iii) attention focus (from gaze tracking), (iv) engagement level (from proximity, orientation, and interaction frequency), and (v) familiarity score (from face recognition across sessions).
Research on infusing Theory of Mind into socially intelligent LLM agents demonstrates that explicit ToM representations enable more strategic, goal-oriented reasoning and better relationship maintenance with human partners ([12][13]). The Open Humanoid’s ToM module updates the state vector at 5 Hz and passes it as structured context to the language model, enabling responses that account for the human’s likely perspective.
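The five-slot state vector can be sketched as a plain dataclass that serialises into prompt context for the language model (field names are illustrative):

```python
from dataclasses import dataclass, field
import time

@dataclass
class ToMState:
    """Per-person Theory of Mind state vector, updated at 5 Hz.
    The five slots follow Section 4.2; names are illustrative."""
    person_id: str
    emotion: str = "neutral"        # fused estimate from Section 3
    task_goal: str = "unknown"      # inferred from dialogue and gesture
    attention_focus: str = "robot"  # from gaze tracking
    engagement: float = 0.0         # 0..1, proximity/orientation/frequency
    familiarity: float = 0.0        # cross-session face recognition score
    updated_at: float = field(default_factory=time.time)

    def as_prompt_context(self):
        """Compact structured context passed to the language model."""
        return (f"{self.person_id}: emotion={self.emotion}, "
                f"goal={self.task_goal}, attending={self.attention_focus}, "
                f"engagement={self.engagement:.2f}, "
                f"familiarity={self.familiarity:.2f}")
```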
4.3 Behaviour Generation and Motor Mapping #
The social planner produces three types of output:
Verbal responses: Text passed to the speech synthesis module for vocalisation. Responses are constrained to a maximum of 30 words for conversational turns to maintain natural pacing.
Gestural responses: Parameterised gesture commands from a library of 24 pre-defined social gestures (nod, head-tilt, wave, point, shrug, open-palm, etc.). Each gesture is defined as a joint-space trajectory with configurable amplitude and speed parameters. The gesture library is implemented as motion primitives that blend smoothly with the robot’s current posture via the impedance controller specified in Article 12.
Postural adjustments: Continuous, low-amplitude body orientation changes that maintain social spatial norms. The robot adjusts its torso orientation to face the speaker (within 15 degrees of the interlocutor’s centroid), maintains appropriate interpersonal distance (0.6-1.2 m, corresponding to Hall’s personal proxemic zone and typical of conversational interaction), and modulates head tilt to signal attention or confusion.
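The postural rules reduce to a small control sketch (the signature, gains, and return convention are illustrative, not the impedance controller's actual interface):

```python
import math

def postural_command(bearing_deg, distance_m,
                     max_offset_deg=15.0, zone=(0.6, 1.2)):
    """Map interlocutor bearing and distance to posture adjustments.

    Keeps the torso within +/-15 degrees of the interlocutor and nudges
    the base back toward the 0.6-1.2 m conversational band. Returns
    (torso_turn_deg, base_step_m); positive step means approach.
    """
    # Rotate only the excess beyond the allowed facing offset.
    if abs(bearing_deg) <= max_offset_deg:
        torso_turn = 0.0
    else:
        torso_turn = bearing_deg - math.copysign(max_offset_deg, bearing_deg)
    if distance_m < zone[0]:
        step = distance_m - zone[0]   # negative: back away
    elif distance_m > zone[1]:
        step = distance_m - zone[1]   # positive: approach
    else:
        step = 0.0                    # already in the conversational band
    return torso_turn, step
```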
```mermaid
flowchart TD
A[Gesture Recognition] --> D[Social Behaviour Planner]
B[Emotion Detection] --> D
C[Speech Recognition] --> D
D --> E{Complexity Assessment}
E -->|Simple| F[Local SLM 1-3B]
E -->|Complex| G[Cloud LLM 70B+]
F --> H[Response Generator]
G --> H
H --> I[Verbal: Speech Synthesis]
H --> J[Gestural: Motion Primitives]
H --> K[Postural: Proxemic Control]
L[ToM State Vector] --> D
B --> L
A --> L
M[Gaze Tracker] --> L
```
5. System Integration and Safety Considerations #
5.1 Interaction State Machine #
The complete HRI subsystem operates as a finite state machine with five states:
Idle: The robot is not engaged in social interaction. Gesture recognition runs at reduced frame rate (10 fps) to conserve power. The robot enters Approach state when a human is detected within 3 m and oriented toward the robot.
Approach: The robot adjusts posture to face the approaching human and transitions to Engaged when the human enters the social zone (1.2 m) or initiates verbal/gestural contact.
Engaged: Full perception pipeline active. All three modalities (gesture, emotion, speech) feed the social planner. The robot maintains this state as long as the human remains in the social zone and exhibits engagement cues (gaze contact, speech, gesture).
Disengaged: Triggered when the human turns away, retreats beyond 2 m, or 15 seconds elapse without interaction. The robot executes a closing gesture (nod, slight wave) and returns to Idle.
Emergency: Triggered by Tier 1 safety gestures. Overrides all other states and executes the corresponding safety behaviour immediately.
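The transition guards above can be sketched as a pure function over the perception cues (the signature is illustrative; `dist_m` is None when no human is detected):

```python
def next_state(state, dist_m, facing, engaged_cue, idle_s, tier1_gesture):
    """One step of the five-state interaction machine.

    Distances and the 15 s timeout mirror the guards in Section 5.1;
    `engaged_cue` covers verbal or gestural contact initiation.
    """
    if tier1_gesture:
        return "Emergency"  # overrides all other states
    if state == "Idle" and dist_m is not None and dist_m <= 3.0 and facing:
        return "Approach"
    if state == "Approach" and dist_m is not None and (dist_m <= 1.2 or engaged_cue):
        return "Engaged"
    if state == "Engaged" and (dist_m is None or dist_m > 2.0
                               or not facing or idle_s > 15.0):
        return "Disengaged"
    if state == "Disengaged":
        return "Idle"  # after the closing gesture completes
    return state
```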
5.2 Ethical and Privacy Safeguards #
Emotion detection in HRI raises significant ethical concerns. The Open Humanoid implements the following safeguards by design:
No persistent emotion logging: Emotion state vectors are maintained only in volatile memory and are discarded when the interaction session ends. No emotional profiles are stored to disk.
Opt-out signalling: A visible LED on the robot’s chest indicates when emotion detection is active (amber) versus inactive (off). Users can disable emotion detection via a voice command (“stop reading my emotions”) or a physical gesture (palm-out stop).
Cultural calibration: The FER model ships with region-specific fine-tuning profiles that account for documented cultural differences in emotional expression intensity. The default profile is conservative, biasing toward neutral classification when confidence is marginal.
Transparency: When the robot’s behaviour is influenced by detected emotion (e.g., slowing speech rate in response to detected confusion), it can optionally verbalise its reasoning (“I notice you might be confused — would you like me to explain differently?”).
Recent work on whole-body tactile interaction for humanoid robots emphasises that physical safety must be maintained even during social interaction, with force limits enforced regardless of the social planner’s output ([13][14]). The Open Humanoid’s impedance controller maintains a 10 N force ceiling on all social gestures, ensuring that no gesture can cause injury even in the event of software failure.
5.3 Computational Budget #
The complete HRI stack requires the following compute allocation:
| Subsystem | Processor | TOPS / TFLOPS | Latency | Power |
|---|---|---|---|---|
| Landmark extraction | NPU Partition A | 2.1 TOPS | 25 ms | 1.2 W |
| Gesture classification | NPU Partition B | 0.8 TOPS | 40 ms | 0.5 W |
| Facial expression recognition | GPU Slice 1 | 1.4 TFLOPS | 12 ms | 2.1 W |
| Vocal prosody analysis | CPU + DSP | 0.3 TOPS | 50 ms | 0.8 W |
| Local SLM inference | GPU Slice 2 | 4.2 TFLOPS | 80 ms | 6.5 W |
| ToM state update | CPU | 0.1 TOPS | 5 ms | 0.3 W |
| Total HRI stack | — | — | — | 11.4 W |
The 11.4 W total fits within the 15 W allocation budgeted for high-level cognition in Article 16’s power budget, leaving 3.6 W of headroom for future expansion.
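As a quick consistency check, the table's per-subsystem figures do sum to the stated total and headroom (values copied from the table; key names are illustrative):

```python
HRI_POWER_W = {
    "landmark_extraction": 1.2,
    "gesture_classification": 0.5,
    "facial_expression_recognition": 2.1,
    "vocal_prosody": 0.8,
    "local_slm": 6.5,
    "tom_state_update": 0.3,
}
COGNITION_BUDGET_W = 15.0  # high-level cognition allocation, per Article 16

total = round(sum(HRI_POWER_W.values()), 1)      # 11.4 W
headroom = round(COGNITION_BUDGET_W - total, 1)  # 3.6 W
```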
6. Conclusion #
This article has specified the human-robot interaction subsystem for the Open Humanoid, covering gesture recognition, emotion detection, and LLM-driven social behaviour planning. The architecture is designed around three principles: real-time responsiveness through edge-first inference, robustness through multimodal fusion with quality-aware gating, and ethical responsibility through privacy-preserving design choices.
The gesture recognition pipeline processes body and hand landmarks through a two-stage spatial-temporal architecture, achieving sub-150 ms latency for safety-critical commands. The emotion detection module fuses facial expression, vocal prosody, and contextual signals through a late-fusion attention mechanism, with temporal smoothing to prevent erratic behaviour. The social behaviour planner uses a hybrid local/cloud LLM architecture with explicit Theory of Mind state tracking to generate verbal, gestural, and postural responses that are contextually appropriate and socially fluent.
Critically, all components are specified with open-source implementations and edge-deployable models, maintaining the Open Humanoid project’s commitment to accessible, reproducible robotics. The interaction subsystem transforms the platform from a collection of capable mechanical and perceptual systems into an entity that can participate in the social world alongside humans.
The next article in this series will address system integration and testing, covering full-body commissioning, regression testing, and validation frameworks for bringing all subsystems together into a functioning humanoid robot.
References (14) #
- Stabilarity Research Hub. Human-Robot Interaction: Gesture Recognition, Emotion Detection, and Social Behaviour for Humanoid Robots. DOI: 10.5281/zenodo.19154329.
- Stabilarity Research Hub. Communication Protocols: ROS 2, EtherCAT, and Real-Time Networking for Humanoid Robot Subsystems.
- (2025). Unresolved DOI reference (doi.org).
- (2025). Multimodal perception-driven decision-making for human-robot interaction: a survey. Frontiers.
- Unresolved DOI reference (doi.org).
- (2025). Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions. arXiv:2507.04465.
- (2025). Unresolved DOI reference (doi.org).
- FER-HA: a hybrid attention model for facial emotion recognition. The Journal of Supercomputing, Springer.
- (2026). Development of Lightweight Residual Convolutional Neural Network for Efficient Facial Emotion Recognition.
- (2025). Continual Facial Features Transfer for Facial Expression Recognition. IEEE.
- (2026). How Do We Research Human-Robot Interaction in the Age of Large Language Models? A Systematic Review. arXiv:2602.15063.
- (2025). Simultaneous text and gesture generation for social robots with small language models. Frontiers.
- (2025). Infusing Theory of Mind into Socially Intelligent LLM Agents. arXiv:2509.22887.
- Unresolved DOI reference (doi.org).