arxiv:2603.11975

HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Published on Mar 12

Submitted by SunZX on Mar 16

RUC-GSAI-IIRLab


Authors:

Jiayue Pu, Zhongxiang Sun, Zilu Zhang

Abstract

HomeSafe-Bench presents a comprehensive benchmark for vision-language models to detect unsafe actions in household environments, accompanied by HD-Guard, a hierarchical architecture that balances real-time safety monitoring with detection accuracy through dual-brain processing.

AI-generated summary

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.


Community

Jeryi

Paper author Paper submitter about 12 hours ago

As embodied agents and household robots move from controlled factories into real homes, safety becomes a critical challenge.
Unlike structured environments, households are dynamic and unpredictable, where perception delays or missing common-sense reasoning can easily lead to dangerous actions (e.g., placing metal in a microwave).

However, current safety benchmarks mostly focus on text, static images, or general hazards, leaving a key question unanswered:

Can Vision-Language Models detect unsafe robot actions in real household scenarios?

To address this gap, we introduce HomeSafe-Bench and HD-Guard, a benchmark and real-time detection system for household embodied agent safety.

🚀 TL;DR

• HomeSafe-Bench — a video-based benchmark for unsafe action detection in household environments
• 438 hazard cases across 6 household functional areas with fine-grained annotations
• HD-Guard — a hierarchical dual-brain safety monitor combining fast screening with deep reasoning
• HD-Guard achieves a strong latency–accuracy trade-off, which matters for real-world deployment

📊 HomeSafe-Bench

HomeSafe-Bench evaluates Vision-Language Models (VLMs) on detecting unsafe behaviors of embodied agents in household scenarios.

Key properties

🎥 438 hazard videos
🏠 6 household functional areas
🧠 multi-dimensional annotations
⚠️ diverse unsafe behaviors

The dataset is generated through a hybrid pipeline combining LLM-driven hazard discovery, physical simulation, and video generation, ensuring both physical realism and scenario diversity.
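As a rough illustration of how results on a benchmark organized by functional area could be summarized, the sketch below computes a per-area detection rate. It is not the authors' evaluation code: the `CaseResult` fields, the metric, and the area names ("kitchen", "bathroom") are assumptions made for the example, since the post does not list the six areas or the annotation schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One hazard case. Both field names are hypothetical."""
    area: str       # household functional area (illustrative values only)
    detected: bool  # whether the model flagged the unsafe action


def per_area_detection_rate(results):
    """Return {area: fraction of hazard cases the model detected}."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.area] += 1
        hits[r.area] += int(r.detected)
    return {area: hits[area] / total[area] for area in total}


# Toy data, not taken from the benchmark.
results = [
    CaseResult("kitchen", True),
    CaseResult("kitchen", False),
    CaseResult("bathroom", True),
]
print(per_area_detection_rate(results))  # {'kitchen': 0.5, 'bathroom': 1.0}
```

Breaking the score down by area makes it easier to see whether a model fails uniformly or only in particular parts of the home.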

🧠 HD-Guard: Dual-Brain Safety Detection

We further propose HD-Guard, a hierarchical dual-brain architecture for real-time safety monitoring.

FastBrain

• lightweight streaming VLM
• high-frequency frame-level safety screening

SlowBrain

• large-scale VLM
• deep multimodal reasoning for uncertain cases

This design enables continuous monitoring while maintaining strong reasoning capability, achieving an effective latency–performance balance.
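The fast-screening and escalation pattern described above can be sketched roughly as follows. This is a minimal illustration and not the HD-Guard implementation: the two thresholds, the `monitor` function, and the `fast_score` / `slow_verdict` callables are hypothetical stand-ins for FastBrain and SlowBrain, and a thread pool is assumed here as one simple way to keep the slow model off the screening loop.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical thresholds on the fast model's risk score in [0, 1].
SAFE_BELOW = 0.2
UNSAFE_ABOVE = 0.8


def monitor(frames, fast_score, slow_verdict):
    """Screen every frame with the fast model and escalate uncertain ones.

    fast_score(frame) -> float risk score (stand-in for FastBrain).
    slow_verdict(frame) -> bool, True if unsafe (stand-in for SlowBrain).
    Returns the sorted indices of frames flagged as unsafe.
    """
    flagged = []
    pending = {}  # frame index -> Future holding the slow model's verdict
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i, frame in enumerate(frames):
            score = fast_score(frame)
            if score >= UNSAFE_ABOVE:
                flagged.append(i)  # confident alarm, no escalation needed
            elif score > SAFE_BELOW:
                # Uncertain: hand off without blocking the screening loop.
                pending[i] = pool.submit(slow_verdict, frame)
        # Collect the slow verdicts after the stream has been screened.
        for i, future in pending.items():
            if future.result():
                flagged.append(i)
    return sorted(flagged)


# Toy frames carrying a precomputed score and label, for illustration only.
frames = [
    {"score": 0.1, "unsafe": False},
    {"score": 0.5, "unsafe": True},
    {"score": 0.5, "unsafe": False},
    {"score": 0.9, "unsafe": True},
]
print(monitor(frames, lambda f: f["score"], lambda f: f["unsafe"]))  # [1, 3]
```

In a live stream the slow verdicts would be consumed as they complete rather than at the end; the sketch collects them last only to keep the example short and deterministic.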

🔍 Key Findings

Our experiments reveal major limitations in current VLM safety detection:

• models miss critical visual entities
• models struggle with temporal grounding
• models show weak causal reasoning for hazards

HD-Guard mitigates these issues and significantly improves practical safety detection for household embodied agents.

✨ Contributions

• HomeSafe-Bench — first benchmark for unsafe action detection in household embodied agents
• HD-Guard — hierarchical dual-brain safety monitoring architecture
• Comprehensive analysis of VLM limitations in real-world safety detection

💡 HomeSafe-Bench highlights an urgent challenge:
As robots enter everyday homes, reliable safety monitoring becomes essential for embodied AI deployment.
