arxiv:2603.11975

HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Published on Mar 12 · Submitted by SunZX on Mar 16

Abstract

HomeSafe-Bench presents a comprehensive benchmark for vision-language models to detect unsafe actions in household environments, accompanied by HD-Guard, a hierarchical architecture that balances real-time safety monitoring with detection accuracy through dual-brain processing.

AI-generated summary

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

Community

Paper author Paper submitter

As embodied agents and household robots move from controlled factories into real homes, safety becomes a critical challenge.
Unlike structured environments, households are dynamic and unpredictable, so perception delays or missing common-sense reasoning can easily lead to dangerous actions (e.g., placing metal in a microwave).

However, current safety benchmarks mostly focus on text, static images, or general hazards, leaving a key question unanswered:

Can Vision-Language Models detect unsafe robot actions in real household scenarios?

To address this gap, we introduce HomeSafe-Bench and HD-Guard, a benchmark and real-time detection system for household embodied agent safety.


🚀 TL;DR

  • HomeSafe-Bench: a video-based benchmark for unsafe action detection in household environments
  • 438 hazard cases across 6 household functional areas with fine-grained annotations
  • HD-Guard: a hierarchical dual-brain safety monitor combining fast screening with deep reasoning
  • Achieves strong latency–accuracy trade-offs for real-world deployment


📊 HomeSafe-Bench

HomeSafe-Bench evaluates Vision-Language Models (VLMs) on detecting unsafe behaviors of embodied agents in household scenarios.

Key properties

  • 🎥 438 hazard videos
  • 🏠 6 household functional areas
  • 🧠 multi-dimensional annotations
  • ⚠️ diverse unsafe behaviors

The dataset is generated through a hybrid pipeline combining LLM-driven hazard discovery, physical simulation, and video generation, ensuring both physical realism and scenario diversity.
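To make the per-area evaluation concrete, here is a minimal sketch of how detection results on hazard videos might be aggregated by functional area. The record layout, the area names (`"kitchen"`, `"bathroom"`), and the `per_area_detection_rate` helper are all assumptions for illustration; the benchmark's actual annotation schema and metrics are not specified in this post.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_area_detection_rate(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (area, detected) pairs into a detection rate per area.

    Each pair is hypothetical: `area` names a household functional area and
    `detected` says whether the model flagged the hazard in that video.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for area, detected in results:
        totals[area] += 1
        hits[area] += int(detected)
    # Fraction of hazard videos in each area that the model flagged.
    return {area: hits[area] / totals[area] for area in totals}
```

Calling it on `[("kitchen", True), ("kitchen", False), ("bathroom", True)]` would give one rate per area, which makes it easy to see where a model's hazard detection is weakest.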


🧠 HD-Guard: Dual-Brain Safety Detection

We further propose HD-Guard, a hierarchical dual-brain architecture for real-time safety monitoring.

FastBrain

  • lightweight streaming VLM
  • high-frequency frame-level safety screening

SlowBrain

  • large-scale VLM
  • deep multimodal reasoning for uncertain cases

This design enables continuous monitoring while maintaining strong reasoning capability, achieving an effective latency–performance balance.
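The fast/slow split described above can be sketched roughly as follows. This is an illustrative toy, not the HD-Guard implementation: the `DualBrainMonitor` class, the `low`/`high` thresholds, the frame window, and the two callables standing in for FastBrain and SlowBrain are all assumptions. The fast check runs on every frame, and only uncertain scores are handed to the slow check on a background thread, so the stream is never blocked.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional

Frame = Dict[str, Any]  # hypothetical stand-in for a decoded video frame


class DualBrainMonitor:
    """Toy two-tier monitor: per-frame fast screening, async slow reasoning."""

    def __init__(
        self,
        fast_brain: Callable[[Frame], float],       # assumed to return P(unsafe) for one frame
        slow_brain: Callable[[List[Frame]], bool],  # assumed to return True if the clip is unsafe
        low: float = 0.3,    # assumed threshold: below this, treat the frame as safe
        high: float = 0.8,   # assumed threshold: at or above this, alarm immediately
        window: int = 4,     # assumed number of recent frames sent to the slow brain
    ) -> None:
        self.fast_brain = fast_brain
        self.slow_brain = slow_brain
        self.low, self.high, self.window = low, high, window
        self._buffer: List[Frame] = []
        self._pending: Optional[Future] = None
        self._pool = ThreadPoolExecutor(max_workers=1)

    def step(self, frame: Frame) -> str:
        """Process one frame and return 'ok', 'escalated', 'alarm_fast' or 'alarm_slow'."""
        self._buffer = (self._buffer + [frame])[-self.window:]
        # Collect a finished slow-brain verdict without blocking the stream.
        if self._pending is not None and self._pending.done():
            unsafe = self._pending.result()
            self._pending = None
            if unsafe:
                return "alarm_slow"
        score = self.fast_brain(frame)
        if score >= self.high:
            return "alarm_fast"
        # Uncertain band: hand the recent clip to the slow brain in the background.
        if score >= self.low and self._pending is None:
            self._pending = self._pool.submit(self.slow_brain, list(self._buffer))
            return "escalated"
        return "ok"

    def drain(self) -> None:
        """Block until any pending slow-brain call finishes (useful in tests)."""
        if self._pending is not None:
            self._pending.result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

In this sketch a slow verdict surfaces on the next `step` call after it completes, so the alarm can lag the triggering frame by however long the slow model takes; that lag is the latency side of the trade-off the design is balancing.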


🔍 Key Findings

Our experiments reveal major limitations in current VLM safety detection:

  • models miss critical visual entities
  • models struggle with temporal grounding
  • models show weak causal reasoning for hazards

HD-Guard mitigates these issues and significantly improves practical safety detection for household embodied agents.


✨ Contributions

  • HomeSafe-Bench: first benchmark for unsafe action detection in household embodied agents
  • HD-Guard: hierarchical dual-brain safety monitoring architecture
  • Comprehensive analysis of VLM limitations in real-world safety detection


💡 HomeSafe-Bench highlights an urgent challenge:
As robots enter everyday homes, reliable safety monitoring becomes essential for embodied AI deployment.

