HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
Abstract
HomeSafe-Bench presents a comprehensive benchmark for vision-language models to detect unsafe actions in household environments, accompanied by HD-Guard, a hierarchical architecture that balances real-time safety monitoring with detection accuracy through dual-brain processing.
The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
Community
As embodied agents and household robots move from controlled factories into real homes, safety becomes a critical challenge.
Unlike structured environments, households are dynamic and unpredictable: perception delays or missing common-sense reasoning can easily lead to dangerous actions (e.g., placing metal in a microwave).
However, current safety benchmarks mostly focus on text, static images, or general hazards, leaving a key question unanswered:
Can Vision-Language Models detect unsafe robot actions in real household scenarios?
To address this gap, we introduce HomeSafe-Bench, a benchmark for household embodied agent safety, and HD-Guard, a real-time unsafe action detection system.
🚀 TL;DR
• HomeSafe-Bench — a video-based benchmark for unsafe action detection in household environments
• 438 hazard cases across 6 household functional areas with fine-grained annotations
• HD-Guard — a hierarchical dual-brain safety monitor combining fast screening with deep reasoning
• HD-Guard achieves a favorable latency–accuracy trade-off, which matters for real-world deployment
📊 HomeSafe-Bench
HomeSafe-Bench evaluates Vision-Language Models (VLMs) on detecting unsafe behaviors of embodied agents in household scenarios.
Key properties
- 🎥 438 hazard videos
- 🏠 6 household functional areas
- 🧠 multi-dimensional annotations
- ⚠️ diverse unsafe behaviors
The dataset is generated through a hybrid pipeline combining LLM-driven hazard discovery, physical simulation, and video generation, ensuring both physical realism and scenario diversity.
🧠 HD-Guard: Dual-Brain Safety Detection
We further propose HD-Guard, a hierarchical dual-brain architecture for real-time safety monitoring.
FastBrain
- lightweight streaming VLM
- high-frequency frame-level safety screening
SlowBrain
- large-scale VLM
- deep multimodal reasoning for uncertain cases
This design enables continuous monitoring while maintaining strong reasoning capability, achieving an effective latency–performance balance.
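To make the division of labor concrete, here is a minimal sketch of a two-tier monitor in Python. It is not the HD-Guard implementation: the class name `DualBrainMonitor`, the `low`/`high` thresholds, the sliding `window`, and the toy scoring functions are all assumptions made for illustration. It only shows the general pattern described above: a cheap per-frame screen that raises clear hazards immediately and hands uncertain frames to a slower model running asynchronously.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Deque, List, Sequence, Tuple

Frame = str  # stand-in for a video frame; a real system would pass image data


@dataclass(frozen=True)
class Alert:
    frame_index: int
    source: str  # "fast" (immediate) or "slow" (confirmed after escalation)


class DualBrainMonitor:
    """Hypothetical two-tier monitor: per-frame screening plus async escalation."""

    def __init__(self,
                 fast_brain: Callable[[Frame], float],
                 slow_brain: Callable[[Sequence[Frame]], bool],
                 low: float = 0.3, high: float = 0.8, window: int = 4):
        self.fast_brain = fast_brain      # returns a risk score in [0, 1]
        self.slow_brain = slow_brain      # returns True if the clip is unsafe
        self.low, self.high = low, high   # assumed thresholds, not from the paper
        self.recent: Deque[Frame] = deque(maxlen=window)
        self.alerts: List[Alert] = []
        self.pending: List[Tuple[int, Future]] = []
        self.pool = ThreadPoolExecutor(max_workers=1)

    def observe(self, index: int, frame: Frame) -> None:
        """Screen one frame; never blocks on the slow model."""
        self.recent.append(frame)
        score = self.fast_brain(frame)
        if score >= self.high:
            # Clear hazard: alert right away.
            self.alerts.append(Alert(index, "fast"))
        elif score >= self.low:
            # Uncertain: send a snapshot of recent frames to the slow model.
            clip = list(self.recent)
            self.pending.append((index, self.pool.submit(self.slow_brain, clip)))

    def finish(self) -> List[Alert]:
        """Wait for outstanding escalations and return alerts in frame order."""
        for index, future in self.pending:
            if future.result():
                self.alerts.append(Alert(index, "slow"))
        self.pending.clear()
        self.pool.shutdown(wait=True)
        return sorted(self.alerts, key=lambda a: a.frame_index)


# Toy stand-ins for the two models (purely illustrative).
TOY_SCORES = {"knife near edge": 0.5, "wiping floor": 0.5, "metal in microwave": 0.9}


def toy_fast_brain(frame: Frame) -> float:
    return TOY_SCORES.get(frame, 0.0)


def toy_slow_brain(clip: Sequence[Frame]) -> bool:
    return clip[-1] == "knife near edge"


def demo() -> List[Alert]:
    monitor = DualBrainMonitor(toy_fast_brain, toy_slow_brain)
    frames = ["idle", "knife near edge", "wiping floor", "metal in microwave"]
    for i, frame in enumerate(frames):
        monitor.observe(i, frame)
    return monitor.finish()
```

In this toy run, the microwave frame triggers an immediate "fast" alert, while the two mid-score frames are escalated and only the knife frame is confirmed by the slow model. A deployed system would also need to decide how stale a slow verdict may be before it is discarded, which this sketch does not address.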
🔍 Key Findings
Our experiments reveal major limitations in current VLM safety detection:
- models miss critical visual entities
- models struggle with temporal grounding
- models show weak causal reasoning for hazards
HD-Guard mitigates these issues and significantly improves practical safety detection for household embodied agents.
✨ Contributions
• HomeSafe-Bench — first benchmark for unsafe action detection in household embodied agents
• HD-Guard — hierarchical dual-brain safety monitoring architecture
• Comprehensive analysis of VLM limitations in real-world safety detection
💡 HomeSafe-Bench highlights an urgent challenge:
As robots enter everyday homes, reliable safety monitoring becomes essential for embodied AI deployment.