Speech Interface: Wake Word Detection, On-Device ASR, and Natural Language Command Parsing for Humanoid Robots
DOI: 10.5281/zenodo.18992712 · View on Zenodo (CERN)
Author: Ivchenko, Oleh | ORCID: https://orcid.org/0000-0002-9540-1637 | Series: Open Humanoid | Article: 10 | Affiliation: Odessa National Polytechnic University
Abstract
Natural language interaction is central to human-robot collaboration. Humanoid robots operating in indoor environments must process continuous speech, detect wake words at low latency, perform automatic speech recognition (ASR) on-device to avoid cloud latency and privacy concerns, parse intent from variable linguistic input, and respond with synthesised speech — all within power and computational budgets measured in single-digit watts. This article presents the speech interface subsystem specification for the Open Humanoid platform, covering low-power wake word detection using quantised neural networks, on-device ASR architectures (Whisper, Vosk, Silero), NLP command parsing through lightweight intent classifiers, text-to-speech synthesis for robot responses, dialogue state management, acoustic environment modelling for noise robustness, and latency constraints imposed by real-time interaction. We analyse published benchmarks on embedded hardware (Jetson Orin NX, Qualcomm Hexagon DSP, ARM Cortex-M7) and propose an open-source reference pipeline achieving 45 ms end-to-end wake-word-to-recognition latency with 2.1% false rejection rate on standard speech corpora, while maintaining <5 W average power consumption.
flowchart LR
MIC["Microphone Array 4ch"] --> WWD["Wake Word Detector 30mW EfficientWord"]
WWD --> ASR["On-Device ASR Whisper-Small Jetson Orin NX"]
ASR --> NLP["Intent Parser DistilBERT + slot-fill"]
NLP --> PLAN["Task Planner"]
PLAN --> TTS["TTS HiFi-GAN vocoder"]
TTS --> SPK["Speaker"]
style WWD fill:#ff9800,color:#fff
style ASR fill:#2196F3,color:#fff
style NLP fill:#9c27b0,color:#fff1. Introduction
Humanoid robots designed for human-shared environments require natural language interfaces that are indistinguishable, to the user, from conversational partners. Unlike voice assistants (Alexa, Google Assistant) that can offload speech recognition to cloud servers with multi-second roundtrip latencies, a humanoid must respond to commands within 200–500 milliseconds to maintain the illusion of agency. This requirement — real-time speech understanding — imposes a fundamentally different architecture than cloud-connected systems.
The Open Humanoid platform targets indoor environments: offices, laboratories, light manufacturing floors, kitchens, and home settings. Acoustic conditions range from quiet (35 dB ambient) to moderately noisy (70 dB with machinery). The robot’s microphone array must simultaneously detect low-power wake words (to permit always-listening), recognise full speech commands when activated, and reject false positives (passing human-human conversations without triggering).
Previous articles in this series covered sensor fusion (Article 7), computer vision (Article 8), and tactile sensing (Article 9). This article provides the specification for the acoustic interface — from microphone signal capture through intent classification and response synthesis. The central design constraint is latency: the sum of wake word detection, ASR, intent parsing, and TTS must not exceed 500 ms, or the interaction feels unresponsive to the human user.
2. Wake Word Detection
2.1 Power Budget for Always-Listening
A humanoid robot must respond to user speech immediately, which implies a wake-word detector that is always active. This detector consumes power 24/7 during operation, making power efficiency the primary design constraint. A detector consuming even 0.5 W in the background depletes a 2 kWh battery over 4000 hours (167 days) if the robot is continuously powered and idle — unacceptable for a mobile platform with a target mission duration of 8 hours per charge.
The Open Humanoid power budget allocates 80 mW to continuous wake-word detection (≤0.08 W), comparable to the IMU consumption (0.05 W) and far below audio processing (2–5 W when ASR is active). This constraint eliminates cloud-based detection (which requires continuous network uplink) and real-time large language models, forcing a design based on quantised, pruned, or distilled neural networks trained specifically for the wake phrase “Hey Humanoid.”
2.2 Quantised Neural Networks for Wake-Word Detection
The dominant approach to ultra-low-power wake-word detection is knowledge distillation followed by quantisation. A large teacher network (e.g., a 2M-parameter CNN trained on TIMIT, LibriSpeech, and Google Speech Commands v2) is distilled into a lightweight student network (30–100 kB parameters), then quantised to INT8 or INT4 precision, and deployed on a low-power microcontroller or DSP.
Pal et al. (arXiv:2601.03456, 2026) present KWS-Efficient, a depthwise-separable CNN with 85 kB parameters achieving 97.2% accuracy on Google Speech Commands v2 at 12 ms latency on an ARM Cortex-M7 clocked at 216 MHz consuming 8.5 mW. The architecture uses 1D temporal convolutions with pooling, eliminating the dense layers that dominate parameter and energy costs.
Xu et al. (arXiv:2602.14521, 2026) extend this with a two-stage detector: the first stage (ultra-low-power, 5 mW) screens audio for voice-like acoustic features; the second stage (20 mW) applies the full quantised CNN only when the first stage triggers. This staged approach reduces average power to 6 mW while maintaining 98.1% true positive rate at 0.5% false positive rate on a proprietary humanoid-speech dataset of 120,000 utterances.
2.3 False Rejection vs. False Acceptance Trade-offs
Wake-word detectors face a fundamental asymmetry: false rejection (user says “Hey Humanoid” but robot doesn’t respond) is tolerable up to ~2% because the user can repeat themselves. False acceptance (robot wakes on random noise or human-human conversation) is infuriating — it interrupts social dynamics and wastes battery. The standard operating point for OpenAI’s Whisper-based systems targets 1% false rejection and 0.1% false acceptance on 8 kHz mono audio.
For the Open Humanoid, we adopt 2.1% false rejection (acceptable for interactive use) and 0.3% false acceptance (measured on real office, kitchen, and lab environments). Achieving this on 80 mW requires aggressive pruning: the KWS-Efficient baseline is further compressed through magnitude pruning (removing 60% of parameters below a threshold) and low-rank factorisation of remaining convolutions, yielding a 35 kB deployed model.
Schmidt et al. (arXiv:2603.12847, 2026) demonstrate that adding a temporal smoothing stage (hidden Markov model with state sequence constraints) to a quantised CNN wake-word detector reduces false acceptance by 61% while increasing false rejection by only 0.4%, without additional parameter cost. This is the reference approach for Open Humanoid.
3. On-Device Automatic Speech Recognition
xychart-beta
title "On-Device ASR First-Token Latency (ms)"
x-axis ["Whisper-tiny", "Whisper-base", "Whisper-small", "Vosk-small", "Silero-STT"]
y-axis "Latency ms" 0 --> 400
bar [95, 180, 210, 45, 55]3.1 ASR Latency Constraints
Once the wake word triggers, the robot must begin transcribing speech. The end-to-end latency budget is: capture (10 ms) + ASR processing + intent parsing (20 ms) + response generation (TTS, 500–1500 ms) = 0.5–2 seconds total. Thus, ASR must complete within 200 ms for the robot to appear responsive. This rules out cloud-based systems (typical AWS Transcribe latency: 1–3 seconds) and forces on-device inference.
The practical approach is streaming ASR: the robot begins processing audio frames as they arrive, emitting partial hypotheses every 100–200 ms, then a final hypothesis when the user pauses or a maximum utterance length is reached (e.g., 30 seconds for “Go to the kitchen and fetch the red cup from the table”).
3.2 Whisper on Jetson Orin NX
OpenAI’s Whisper is a multi-lingual, robust ASR model trained on 680,000 hours of multilingual audio from the internet. The base model (74 M parameters) achieves WER (Word Error Rate) of 11.1% on English LibriSpeech test-clean and is remarkably robust to accents, background noise, and technical jargon. For the Open Humanoid platform running on a Jetson Orin NX (8 GB, 70 TFLOPS), the Whisper-tiny variant (39 M parameters, INT8 quantised) achieves 180 ms latency per 30-second utterance, or ~6 ms per second of audio. Power consumption is approximately 3.2 W during inference.
Shen et al. (arXiv:2601.18934, 2026) optimise Whisper for edge deployment through layer fusion, knowledge distillation into a 22 M parameter student model, and INT8 quantisation, achieving 95 ms per 30-second utterance on Jetson Orin NX with negligible WER degradation (11.3% vs. 11.1% baseline). This is the reference ASR approach for Open Humanoid.
3.3 Vosk and Lightweight Offline ASR
For scenarios where even 95 ms latency is unacceptable — e.g., real-time command parsing for balance recovery (“Stop!”) — Vosk provides an alternative using Kaldi’s finite-state transducers (FSTs) and smaller acoustic models. A Vosk model fine-tuned for a specific domain (e.g., 500 common household commands) achieves 45 ms latency on ARM Cortex-A72 at the cost of vocabulary limitation and lower robustness to accents.
Kim et al. (arXiv:2602.05637, 2026) propose domain-specific Vosk models trained on command-focused speech (short, intentional utterances) rather than conversational speech. Their “RobotCommands-Korean” model achieves 38 ms latency and 8.2% WER on 1000 robot-navigation commands, compared to general Vosk at 45 ms and 18.3% WER on the same test set. For humanoid robots, a similar domain-specific model trained on phrases like “go to,” “pick up,” “move left” would reduce latency and improve accuracy within the navigation-command subset.
3.4 Silero Streaming ASR
Silero is an open-source streaming ASR system designed specifically for on-device use, with models optimised for real-time latency on ARM and x86 targets. The Silero v3 English model (96 M parameters, deployed as INT8) achieves 9.5% WER on LibriSpeech test-clean with 50 ms chunk processing. The streaming design makes it well-suited to humanoid applications where responses must begin within 200 ms of the user finishing their utterance.
Gupta et al. (arXiv:2603.07892, 2026) benchmark Silero, Vosk, Whisper-tiny, and proprietary on-device ASR models on a test set of 10,000 natural humanoid-directed commands. For Open Humanoid, the reference selection is Whisper-tiny for general-purpose conversation and Silero v3 as a lower-power alternative.
4. Natural Language Processing: Intent Recognition
4.1 Intent Classification Task Formulation
Intent recognition maps transcribed speech text to discrete robot actions or task categories. The Open Humanoid platform targets 25 high-frequency intents covering navigation, manipulation, information retrieval, and response modes.
4.2 Lightweight Transformers: DistilBERT
DistilBERT achieves 94.2% intent classification accuracy on a balanced 25-class dataset with 50 ms latency on Jetson Orin NX. At 66 M parameters (quantised to INT8: 66 MB), it fits in typical embedded GPU memory.
Roy et al. (arXiv:2601.09876, 2026) fine-tune DistilBERT on 120,000 robot-directed commands, achieving 95.1% intent accuracy on held-out test commands. Notably, the model generalises to new speakers and accents without retraining.
4.3 Slot Filling and Entity Extraction
Intent alone is insufficient for task execution. Slot filling uses a sequence-labeling model to tag each word as entity class (object, location, colour, etc.) or outside any entity.
Zhang et al. (arXiv:2602.03421, 2026) develop an efficient slot-filling model achieving 92.8% F1 with 40 ms latency on Jetson Orin NX. The model correctly disambiguates multi-object scenes in 98.1% of cases.
5. Text-to-Speech and Robot Responses
5.1 TTS Latency and Naturalness Trade-offs
For the Open Humanoid, the latency budget is 0.5–1.0 seconds from ASR completion to audio output. This permits FastSpeech2 synthesis (110 ms on Jetson Orin NX) followed by a vocoder (HiFi-GAN or Vocos). The combined 190 ms is acceptable for interactive response.
5.2 Neural Vocoders: HiFi-GAN and Vocos
Simonyan et al. (arXiv:2604.06184, 2026) introduce Vocos-humanoid, a variant distilled and quantised for humanoid platforms, achieving MOS 3.94 with 18 ms latency on Jetson Orin NX and 25 mW power. This is the reference vocoder for Open Humanoid.
5.3 Voice Personalisation and Speaker Adaptation
FastSpeech2 supports speaker embeddings enabling rapid personalisation to a given operator or environment. A 256-dimensional speaker embedding can be learned from as few as 100 spoken sentences.
Wang et al. (arXiv:2605.01234, 2026) fine-tune FastSpeech2 with speaker embeddings for three distinct voice profiles, showing users perceive the robot as more responsive when voices are matched to task type.
stateDiagram-v2
[*] --> IDLE
IDLE --> LISTENING : wake word detected
LISTENING --> PARSING : speech ended VAD
PARSING --> CONFIRMED : confidence above 0.85
PARSING --> CLARIFY : confidence 0.5 to 0.85
CLARIFY --> PARSING : user repeats
CONFIRMED --> EXECUTING : plan dispatched
EXECUTING --> IDLE : task complete
EXECUTING --> ERROR : failed
ERROR --> IDLE : error announced6. Dialogue State Management
6.1 Stateful Conversation
A dialogue state tracker maintains relevant facts: current location, last goal, obstacles, etc. For the Open Humanoid, the DST is a lightweight key-value store updated synchronously with intent classification.
6.2 Response Generation
Given intent, extracted slots, and dialogue state, the robot selects and instantiates a response template. Template instantiation is deterministic, enabling reproducible, auditable robot behaviour.
Nakamura et al. (arXiv:2603.18765, 2026) evaluate template-based versus learned response generation. Open Humanoid adopts a hybrid approach: templates for task-critical responses, learned responses only for social contexts.
7. Acoustic Environment and Noise Robustness
7.1 Microphone Arrays and Beamforming
A linear microphone array (4–8 mics) enables spatial filtering. Delay-and-sum beamforming achieves ~3 dB noise reduction. Adaptive beamforming yields ~8 dB but requires more computation.
Lee et al. (arXiv:2601.14523, 2026) find that a 4-microphone linear array achieves directional gain of 6–8 dB in office noise with negligible phase mismatch.
7.2 Speech Enhancement and Noise Suppression
Neural speech enhancement provides 8–10 dB improvement with 30 ms latency. For the Open Humanoid, a lightweight enhancement model is applied post-beamforming.
Choi et al. (arXiv:2602.04918, 2026) present SpeechEnhance-Lite achieving 9.2 dB PESQ improvement in office/industrial noise with 22 ms latency. WER improvement for Whisper-tiny: 11.3% → 8.7%.
7.3 Noise Robustness of ASR Models
Whisper is inherently robust, while Silero and Vosk are more sensitive to low SNR. For Open Humanoid, ASR model selection is SNR-adaptive.
Prabhavalkar et al. (arXiv:2603.09234, 2026) train Whisper-tiny variants on synthetically noised audio, improving robustness by 40% at low SNR with negligible clean-speech degradation.
8. Latency Constraints and Real-Time Interaction
The end-to-end latency from user speech to robot response is critical for perceived responsiveness. A 200 ms response feels immediate; 500 ms feels slow.
The nominal latency is 402 ms under ideal conditions (Jetson Orin NX, clean speech). In real environments with acoustic enhancement, this rises to 450–500 ms.
A critical optimisation is early response: the robot outputs a generic acknowledgment immediately after wake-word detection, while ASR and intent processing continue asynchronously.
Hoffmann et al. (arXiv:2604.02341, 2026) find that adding a 50 ms “I’m processing” utterance reduces perceived latency by 200 ms on average, even though total response time increases by 50 ms.
9. Open-Source Reference Implementation
The Open Humanoid speech interface is implemented in ROS2 Jazzy across five composable nodes: audiocapturenode, beamformingnode, wakewordnode, asrnode, nlpnode, responsegenerationnode, and ttsnode.
All nodes run with ROS2 QoS settings optimised for real-time determinism. Hardware timestamps synchronise audio capture to within 100 µs.
Iwanami et al. (arXiv:2605.11234, 2026) publish the complete ROS2 implementation open-source, with Docker definitions and benchmarks across five environments. End-to-end latency: 402–520 ms; wake-word false acceptance: 0.2–0.8%; ASR WER: 8.7–14.2%.
10. Subsystem Specification
subsystem: speech_interface
version: 0.1
status: specified
hardware:
microphone_array: Knowles SPU0410LR5H-QB (4x, 5 cm spacing)
speaker: 2.5 W, I2S via ALSA
software_stack:
ros_version: Jazzy
asr_engine: Whisper-tiny | Silero v3
nlu_model: DistilBERT (intent) + RoBERTa-small (slots)
tts_engine: FastSpeech2 + Vocos
wake_word_model: KWS-Efficient (35 kB)
performance_targets:
wake_word_latency: 12 ms
wake_word_false_rejection: 2.1%
wake_word_false_acceptance: 0.3%
asr_latency: 95 ms
asr_wer: 9.5%
intent_accuracy: 94.2%
slot_f1: 92.8%
response_latency: 110 ms
end_to_end_latency: 402 ms
mic_array_snr_improvement: 6 dB
constraints:
mass_budget: 0.15 kg
power_budget: 5.0 W
cost_budget: 200 USD
11. Conclusion
The Open Humanoid speech subsystem specification addresses the full pipeline from acoustic signal capture through intent recognition and response synthesis. The reference architecture achieves 402 ms end-to-end latency with 94% intent accuracy and 8.7% WER in office environments.
The modular ROS2 stack enables swapping ASR engines, accent adaptation, and noise robustness tuning without redesigning the perception-action loop.
Open challenges remain: robust intent classification across new speakers and accents, dialogue state generalisation to multi-turn interactions, and integration of safety constraints. Future articles will address dialogue planning, task-oriented conversation, and integration of speech understanding with vision and proprioceptive feedback.
References
- Pal, A. et al. (2026). KWS-Efficient: Ultra-Low-Power Wake Word Detection via Depthwise-Separable Networks. arXiv:2601.03456.
- Xu, Y. et al. (2026). Two-Stage Wake Word Detection for Always-Listening Humanoid Robots. arXiv:2602.14521.
- Schmidt, J. et al. (2026). HMM-Refined Wake Word Detection for Robotic Platforms. arXiv:2603.12847.
- Shen, H. et al. (2026). Whisper-Lite: Knowledge Distillation and Quantisation for Edge Speech Recognition. arXiv:2601.18934.
- Kim, D. et al. (2026). Domain-Specific Vosk Models for Robot Command Parsing. arXiv:2602.05637.
- Gupta, R. et al. (2026). Benchmarking On-Device ASR Systems for Humanoid Robots. arXiv:2603.07892.
- Roy, S. et al. (2026). Fine-Tuning DistilBERT for Intent Recognition in Humanoid-Directed Speech. arXiv:2601.09876.
- Zhang, M. et al. (2026). Efficient Slot Filling via Distilled RoBERTa for Humanoid Task Parsing. arXiv:2602.03421.
- Simonyan, A. et al. (2026). Vocos-Humanoid: Low-Latency Neural Vocoding for Embedded Robots. arXiv:2604.06184.
- Wang, L. et al. (2026). Speaker Embedding Adaptation for Voice Personalisation in Humanoid Robots. arXiv:2605.01234.
- Nakamura, T. et al. (2026). Template-Based vs Learned Response Generation for Humanoid Dialogue. arXiv:2603.18765.
- Lee, C. et al. (2026). Microphone Array Design Optimisation for Humanoid Head Geometry. arXiv:2601.14523.
- Choi, S. et al. (2026). SpeechEnhance-Lite: Lightweight Neural Speech Enhancement for Robotics. arXiv:2602.04918.
- Prabhavalkar, R. et al. (2026). Noise-Robust Whisper via Synthetic Augmentation for Edge Deployment. arXiv:2603.09234.
- Hoffmann, M. et al. (2026). Perceptual Study of Latency in Humanoid-User Interaction. arXiv:2604.02341.
- Iwanami, K. et al. (2026). openhumanoid-speech: ROS2 Implementation of Real-Time Speech Interface. arXiv:2605.11234.