arXiv:2503.12844

GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Published on Mar 17, 2025
Abstract

Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.

AI-generated summary

A new accessibility-aware dataset and evaluation framework are introduced to improve multimodal large language models (MLLMs) for assisting blind and low-vision individuals with mobility challenges.
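The abstract does not spell out an evaluation interface, but as a rough illustration of the multiple-choice protocol GuideDogQA describes (object recognition and relative depth questions over egocentric images), here is a minimal sketch. The Hub repo id, the column names ("image", "question", "choices", "answer"), and the answer_question() stub are assumptions for illustration, not the authors' released API; swap in an actual MLLM call where indicated.

```python
# Minimal sketch of a multiple-choice evaluation loop in the spirit of GuideDogQA.
# Assumptions (not from the paper): the Hub repo id, the column names
# ("image", "question", "choices", "answer"), and the answer_question() stub.
from datasets import load_dataset


def answer_question(image, question, choices):
    """Stand-in for an MLLM call; should return the index of the chosen option.

    As written it always picks option 0, i.e. a trivial first-choice baseline.
    """
    return 0


def evaluate(repo_id="your-org/GuideDogQA", split="test"):
    ds = load_dataset(repo_id, split=split)  # hypothetical repo id
    correct = 0
    for ex in ds:
        pred = answer_question(ex["image"], ex["question"], ex["choices"])
        correct += int(pred == ex["answer"])
    return correct / len(ds)


if __name__ == "__main__":
    print(f"multiple-choice accuracy: {evaluate():.3f}")
```

Reported accuracy on such questions would reflect the fine-grained spatial understanding the paper argues is essential for BLV guidance.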
