title: Depth-Aware Scene Description
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
hardware: zero-a10g
Depth-Aware Scene Description for Visually Impaired Users
EE654: 3D Vision & Augmented Reality | Maynooth University
An assistive scene description system that injects structured spatial context from monocular depth estimation into a vision-language model, enabling natural-language descriptions that include object distances, physical sizes, horizontal positions, and depth-ordered scene layout. No retraining. No human-written references.
System Overview
Two delivered configurations plus a null baseline:
| Configuration | Models | Description |
|---|---|---|
| Baseline (Stage 1) | VLM only | Flat categorical descriptions; reference point only |
| Stage 2 (core) | VLM + Depth Anything V2 Small | Depth context preamble injected into every prompt |
| Stage 3 (enhanced) | VLM + DA-V2 + YOLOv8n | Per-object depth sampled at YOLO bounding-box centres |
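As a rough illustration of the Stage 3 row, per-object depth can be read from the depth map at the centre of each detection box. The sketch below is not taken from src/pipeline.py: the function name `sample_depth_at_centres` and the `(x1, y1, x2, y2)` pixel box format are assumptions, and the depth map is a plain nested list so the example runs without any dependencies.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2) in pixels


def sample_depth_at_centres(
    depth_map: Sequence[Sequence[float]], boxes: List[Box]
) -> List[float]:
    """Return the depth-map value at the centre pixel of each bounding box."""
    height = len(depth_map)
    width = len(depth_map[0])
    samples = []
    for x1, y1, x2, y2 in boxes:
        # Integer centre of the box, clamped so it always falls inside the map.
        cx = min(max(int((x1 + x2) / 2), 0), width - 1)
        cy = min(max(int((y1 + y2) / 2), 0), height - 1)
        samples.append(depth_map[cy][cx])
    return samples


# Toy 4x4 "depth map" (rows indexed by y, columns by x).
toy_map = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 4.0, 4.0],
    [3.0, 3.0, 4.0, 4.0],
]
toy_boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
centre_depths = sample_depth_at_centres(toy_map, toy_boxes)
```

A single centre pixel can land on background seen through or around an object, so a median over a small window around the centre may be a more robust variant.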
The depth context constructor (src/depth_context.py) is the technical
contribution: it converts a per-pixel disparity map into a structured
natural-language preamble containing per-object distance (cm), physical
width/height (cm), horizontal position (left/centre/right), nearest-first
ordering, and a foreground/midground/background scene-layout summary.
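To make the shape of such a preamble concrete, here is a minimal sketch of the final formatting step. It is not the code in src/depth_context.py: the `DetectedObject` fields, the thirds-based left/centre/right split, and the 100 cm / 300 cm zone boundaries are all illustrative assumptions, and it presumes distances have already been converted to centimetres.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DetectedObject:
    label: str
    centre_x: float      # horizontal centre of the object, in pixels
    distance_cm: float   # assumed to be already converted from the depth map


def horizontal_position(centre_x: float, image_width: int) -> str:
    """Split the image into equal thirds (an assumed convention)."""
    if centre_x < image_width / 3:
        return "left"
    if centre_x < 2 * image_width / 3:
        return "centre"
    return "right"


def distance_zone(distance_cm: float) -> str:
    """Map a distance to a layout zone; both thresholds are hypothetical."""
    if distance_cm < 100:
        return "foreground"
    if distance_cm < 300:
        return "midground"
    return "background"


def build_preamble(objects: List[DetectedObject], image_width: int) -> str:
    """Format objects nearest-first, one line per object."""
    ordered = sorted(objects, key=lambda o: o.distance_cm)
    lines = [
        f"{o.label}: {o.distance_cm:.0f} cm away, "
        f"{horizontal_position(o.centre_x, image_width)}, "
        f"{distance_zone(o.distance_cm)}"
        for o in ordered
    ]
    return "Scene context (nearest first):\n" + "\n".join(lines)


preamble = build_preamble(
    [
        DetectedObject("sofa", centre_x=500, distance_cm=320),
        DetectedObject("chair", centre_x=100, distance_cm=80),
        DetectedObject("table", centre_x=320, distance_cm=150),
    ],
    image_width=640,
)
```

Physical width and height are left out of this sketch because, under a pinhole camera model, they additionally require the camera focal length in pixels.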
Hardware & Models
| Target | GPU | VLM |
|---|---|---|
| Google Colab T4 (≤16 GB) | 15 GB VRAM | Moondream 2B |
| Local RTX 5060 Laptop (≥16 GB) | 16 GB VRAM | Qwen2.5-VL-3B |
The VLM is selected automatically at runtime via torch.cuda.mem_get_info.
Override with --force-model moondream or --force-model qwen.
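The selection logic might look roughly like the sketch below. `torch.cuda.mem_get_info` is the real PyTorch call and returns a `(free_bytes, total_bytes)` tuple; everything else is an assumption made for illustration: the function names, the use of total rather than free memory, and especially the 15.5 GiB cut-off, which is a placeholder and not the project's value.

```python
from typing import Optional

GIB = 1024 ** 3


def select_vlm(
    total_vram_bytes: int,
    force_model: Optional[str] = None,
    threshold_gib: float = 15.5,  # hypothetical cut-off, not from the project
) -> str:
    """Pick "moondream" or "qwen" from total VRAM, unless an override is given."""
    if force_model is not None:
        if force_model not in ("moondream", "qwen"):
            raise ValueError(f"unknown model: {force_model!r}")
        return force_model
    return "qwen" if total_vram_bytes >= threshold_gib * GIB else "moondream"


def detect_total_vram_bytes() -> int:
    """Query the current CUDA device; returns 0 when torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    _free_bytes, total_bytes = torch.cuda.mem_get_info()
    return total_bytes


if __name__ == "__main__":
    print(select_vlm(detect_total_vram_bytes()))
```

Keeping the decision in a pure function of the byte count makes it testable on machines without a GPU, with the override mirroring the `--force-model` flag.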
Evaluation
Dataset: ARKitScenes [Baruch et al., NeurIPS 2021], with 1,000 frames
sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors)
using random.seed(42).
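One way to obtain a reproducible sample like this is sketched below. The even per-scene split is an assumption, since only the totals are given here, and `sample_frames` and the `frames_by_scene` mapping are hypothetical names. A local `random.Random(42)` is used; it produces the same sequence as seeding the global generator with `random.seed(42)` but leaves global state untouched.

```python
import random
from typing import Dict, List


def sample_frames(
    frames_by_scene: Dict[str, List[str]],
    frames_per_scene: int,
    seed: int = 42,
) -> List[str]:
    """Draw the same number of frames from every scene, reproducibly."""
    rng = random.Random(seed)
    selected = []
    # Sort scene IDs so the result does not depend on dict insertion order.
    for scene_id in sorted(frames_by_scene):
        frames = sorted(frames_by_scene[scene_id])
        count = min(frames_per_scene, len(frames))
        selected.extend(rng.sample(frames, count))
    return selected


# Synthetic stand-in for the dataset index: 100 scenes with 30 frames each.
toy_index = {
    f"scene_{s:03d}": [f"scene_{s:03d}_{f:04d}.jpg" for f in range(30)]
    for s in range(100)
}
sampled = sample_frames(toy_index, frames_per_scene=10)
```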
Three reference-free metrics (no human annotations required):
- STD (Spatial Term Density): spatial vocabulary terms per 100 words
- SFS (Spatial Faithfulness Score): label + position + distance zone propagation from preamble to generated description
- Preamble BERTScore F1: generated text vs depth preamble as reference
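The first two metrics can be approximated with plain string matching, as in the sketch below. This is a simplified reading of the definitions above, not the code in src/evaluation/: the `SPATIAL_TERMS` vocabulary, the tokenizer, and the equal weighting of label, position, and zone in `spatial_faithfulness` are all assumptions.

```python
import re
from typing import Dict, List

# Illustrative vocabulary only; the project's actual term list is not shown here.
SPATIAL_TERMS = {
    "left", "right", "centre", "center", "near", "far", "behind",
    "front", "foreground", "midground", "background", "cm",
}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def spatial_term_density(description: str) -> float:
    """Spatial vocabulary terms per 100 words (0.0 for empty text)."""
    words = tokenize(description)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SPATIAL_TERMS)
    return 100.0 * hits / len(words)


def spatial_faithfulness(objects: List[Dict[str, str]], description: str) -> float:
    """Fraction of preamble facts (label, position, zone) found in the description."""
    words = set(tokenize(description))
    facts = [obj[key] for obj in objects for key in ("label", "position", "zone")]
    if not facts:
        return 0.0
    return sum(1 for fact in facts if fact.lower() in words) / len(facts)


description = "A chair is close by on the left in the foreground."
std = spatial_term_density(description)
sfs = spatial_faithfulness(
    [
        {"label": "chair", "position": "left", "zone": "foreground"},
        {"label": "table", "position": "centre", "zone": "midground"},
    ],
    description,
)
```

In this toy case the description contains 2 spatial terms in 11 words, and it carries over 3 of the 6 preamble facts.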
Setup
pip install -r requirements.txt
Model weights are downloaded automatically on first run (YOLOv8n, Depth Anything V2 Small, Moondream 2B or Qwen2.5-VL-3B).
Running the Pipeline
Gradio UI (webcam + AR overlay)
python -m src.ui.gradio_app
Evaluation
# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
python -m src.evaluation.bertscore_ablation \
--output outputs/results/ablation_1k.csv
# Spatial Faithfulness Score (Stage 3, min 1 detection)
python -m src.evaluation.spatial_faithfulness \
--output outputs/results/sfs_s3_1k.csv \
--min-objects 1
# Per-stage latency + VRAM benchmark
python -m src.evaluation.latency_benchmark \
--image data/test_images/arkit_41159529_0000.jpg \
--output outputs/results/latency.csv
Project Structure
src/
├── config.py                      # all constants, paths, model IDs
├── depth_context.py               # CORE CONTRIBUTION
├── pipeline.py                    # run_stage1 / run_stage2 / run_stage3
├── data/
│   └── arkitscenes_loader.py      # downloads ARKitScenes frames
├── models/
│   ├── depth.py                   # Depth Anything V2 loader
│   ├── detector.py                # YOLOv8n loader
│   └── vlm.py                     # hardware-adaptive Moondream/Qwen loader
├── evaluation/
│   ├── bertscore_ablation.py      # STD + Preamble BERTScore F1
│   ├── spatial_faithfulness.py    # SFS metric
│   └── latency_benchmark.py       # per-stage timing + VRAM
└── ui/
    └── gradio_app.py              # Gradio UI with AR overlay
data/
└── test_images/                   # 1,000 ARKitScenes evaluation frames
outputs/
└── results/                       # evaluation CSVs
    ├── ablation_1k.csv            # BERTScore ablation, 1k images
    ├── sfs_s2_1k.csv              # SFS Stage 2, 1k images
    ├── sfs_s3_1k.csv              # SFS Stage 3, 1k images (min 1 object)
    └── latency.csv                # per-stage latency benchmark
Report
See REPORT.md for the full academic write-up including methodology,
results tables, and discussion.