---
title: Depth-Aware Scene Description
emoji: 👁
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
hardware: zero-a10g
---

# Depth-Aware Scene Description for Visually Impaired Users

EE654 — 3D Vision & Augmented Reality | Maynooth University

An assistive scene description system that injects structured spatial
context from monocular depth estimation into a vision-language model,
enabling natural-language descriptions that include object distances,
physical sizes, horizontal positions, and depth-ordered scene layout.
The VLM is used without retraining, and evaluation requires no
human-written reference descriptions.

## System Overview

The project delivers two configurations (Stages 2 and 3) and a VLM-only
baseline (Stage 1) that serves purely as a reference point:

| Configuration | Models | Description |
|---|---|---|
| Baseline (Stage 1) | VLM only | Flat categorical descriptions — reference point only |
| **Stage 2** (core) | VLM + Depth Anything V2 Small | Depth context preamble injected into every prompt |
| **Stage 3** (enhanced) | VLM + DA-V2 + YOLOv8n | Per-object depth sampled at YOLO bounding-box centres |

The **depth context constructor** (`src/depth_context.py`) is the core
technical contribution. It converts a per-pixel disparity map into a
structured natural-language preamble containing, for each object, its
distance (cm), physical width and height (cm), and horizontal position
(left/centre/right). Objects are listed nearest-first, followed by a
foreground/midground/background scene-layout summary.
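
The kind of preamble described above can be sketched as follows. This is a minimal illustration, not the code in `src/depth_context.py`: it assumes the disparity map has already been converted to metric depth in centimetres (the conversion is not shown), and the `Detection` class, the `focal_px` parameter, the thirds-based left/centre/right split, and the 150 cm / 400 cm zone thresholds are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detected object."""
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


def horizontal_position(cx, image_width):
    """Split the image into equal thirds (an assumed convention)."""
    third = image_width / 3
    if cx < third:
        return "left"
    if cx < 2 * third:
        return "centre"
    return "right"


def distance_zone(distance_cm, near_cm=150.0, far_cm=400.0):
    """Map a distance to a layout zone; thresholds are illustrative."""
    if distance_cm < near_cm:
        return "foreground"
    if distance_cm < far_cm:
        return "midground"
    return "background"


def build_preamble(depth_cm, detections, focal_px):
    """Build a nearest-first text preamble.

    depth_cm is a 2D list indexed [row][col] holding metric depth in cm.
    Physical size uses the pinhole relation: size = pixels * depth / focal.
    """
    height, width = len(depth_cm), len(depth_cm[0])
    facts = []
    for det in detections:
        x1, y1, x2, y2 = det.box
        # Sample depth at the bounding-box centre, clamped to the image.
        cx = min(max(int((x1 + x2) / 2), 0), width - 1)
        cy = min(max(int((y1 + y2) / 2), 0), height - 1)
        d = depth_cm[cy][cx]
        w_cm = (x2 - x1) * d / focal_px
        h_cm = (y2 - y1) * d / focal_px
        facts.append((d, det.label, horizontal_position(cx, width), w_cm, h_cm))
    facts.sort(key=lambda f: f[0])  # nearest object first
    return "\n".join(
        f"- {label}: {d:.0f} cm away, {pos}, "
        f"~{w:.0f} cm wide x {h:.0f} cm tall, {distance_zone(d)}"
        for d, label, pos, w, h in facts
    )
```

Sampling a single pixel at the box centre is the simplest option; a median over a small window around the centre would be less sensitive to depth noise, at slightly higher cost.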

## Hardware & Models

| Target | VRAM | VLM |
|---|---|---|
| Google Colab T4 (<16 GB tier) | 15 GB | Moondream 2B |
| Local RTX 5060 Laptop (≥16 GB tier) | 16 GB | Qwen2.5-VL-3B |

The VLM is selected automatically at runtime from the VRAM reported by
`torch.cuda.mem_get_info`. To override the automatic choice, pass
`--force-model moondream` or `--force-model qwen`.
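
A minimal sketch of this kind of VRAM-based selection is shown below. The function names, the 15.5 GiB threshold, the use of total rather than free memory, and the Moondream fallback when CUDA is unavailable are all assumptions for illustration; the project's actual logic lives in `src/models/vlm.py` and may differ. The threshold sits between the two targets in the table because drivers often report slightly less than the nominal capacity.

```python
GIB = 1024 ** 3


def pick_vlm(total_vram_bytes, threshold_gib=15.5, force_model=None):
    """Choose a VLM name from total VRAM (threshold is hypothetical)."""
    if force_model is not None:
        if force_model not in ("moondream", "qwen"):
            raise ValueError(f"unknown model: {force_model!r}")
        return force_model  # an explicit override always wins
    if total_vram_bytes / GIB >= threshold_gib:
        return "qwen"
    return "moondream"


def detect_total_vram_bytes():
    """Return total VRAM of the current CUDA device, or 0 if unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    # mem_get_info returns (free_bytes, total_bytes) for the current device.
    _free_bytes, total_bytes = torch.cuda.mem_get_info()
    return total_bytes
```

Calling `pick_vlm(detect_total_vram_bytes())` would then yield the smaller model on machines without a GPU, which keeps the sketch usable on CPU-only hosts.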

## Evaluation

Dataset: **ARKitScenes** [Baruch et al., NeurIPS 2021] — 1,000 frames
sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors)
with `random.seed(42)`.
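
Seeded sampling of this kind can be sketched as below. The source does not state how frames are distributed across scenes, so the even split (10 frames from each of 100 scenes) is an assumption, as are the function name and the input layout. The sketch also uses a local `random.Random(42)` instance instead of the global `random.seed(42)` so that the result does not depend on other code that draws random numbers.

```python
import random


def sample_frames(frames_by_scene, frames_per_scene, seed=42):
    """Reproducibly sample a fixed number of frames from every scene.

    frames_by_scene maps a scene ID to a list of frame filenames.
    Scenes and frames are sorted first so the result does not depend
    on dictionary or filesystem ordering.
    """
    rng = random.Random(seed)
    sampled = []
    for scene_id in sorted(frames_by_scene):
        frames = sorted(frames_by_scene[scene_id])
        k = min(frames_per_scene, len(frames))  # tolerate short scenes
        sampled.extend(rng.sample(frames, k))
    return sampled
```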

Three reference-free metrics (no human annotations required):

- **STD** — Spatial Term Density: spatial vocabulary terms per 100 words
- **SFS** — Spatial Faithfulness Score: label + position + distance zone
  propagation from preamble to generated description
- **Preamble BERTScore F1** — generated text vs depth preamble as reference
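
The first two metrics can be illustrated with a short sketch. The vocabulary, the tokenizer, and the exact-word matching rule below are assumptions chosen for the example, not the definitions used in `src/evaluation/`. In particular, the faithfulness function here simply counts what fraction of preamble fields (label, position, zone) reappear as single words in the description, so multi-word labels would need extra handling.

```python
import re

# Illustrative spatial vocabulary; the project's actual list may differ.
SPATIAL_TERMS = {
    "left", "right", "centre", "center", "near", "far", "behind", "front",
    "above", "below", "beside", "foreground", "midground", "background",
    "cm", "metres", "meters",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def spatial_term_density(text, vocabulary=SPATIAL_TERMS):
    """Spatial vocabulary terms per 100 words."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in vocabulary)
    return 100.0 * hits / len(words)


def spatial_faithfulness(facts, description):
    """Fraction of preamble fields that reappear in the description.

    facts is a list of (label, position, zone) tuples taken from the
    preamble; each field counts as one check.
    """
    words = set(tokenize(description))
    checks = [field.lower() in words for fact in facts for field in fact]
    if not checks:
        return 0.0
    return sum(checks) / len(checks)
```

Both functions are deliberately string-based, which keeps them cheap enough to run over all 1,000 frames without loading a model.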

## Setup

```bash
pip install -r requirements.txt
```

Model weights (YOLOv8n, Depth Anything V2 Small, and either Moondream 2B
or Qwen2.5-VL-3B) are downloaded automatically on first run.

## Running the Pipeline

### Gradio UI (webcam + AR overlay)

```bash
python -m src.ui.gradio_app
```

### Evaluation

```bash
# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
python -m src.evaluation.bertscore_ablation \
    --output outputs/results/ablation_1k.csv

# Spatial Faithfulness Score (Stage 3, min 1 detection)
python -m src.evaluation.spatial_faithfulness \
    --output outputs/results/sfs_s3_1k.csv \
    --min-objects 1

# Per-stage latency + VRAM benchmark
python -m src.evaluation.latency_benchmark \
    --image data/test_images/arkit_41159529_0000.jpg \
    --output outputs/results/latency.csv
```

## Project Structure

```
src/
├── config.py                      # all constants, paths, model IDs
├── depth_context.py               # CORE CONTRIBUTION
├── pipeline.py                    # run_stage1 / run_stage2 / run_stage3
├── data/
│   └── arkitscenes_loader.py      # downloads ARKitScenes frames
├── models/
│   ├── depth.py                   # Depth Anything V2 loader
│   ├── detector.py                # YOLOv8n loader
│   └── vlm.py                     # hardware-adaptive Moondream/Qwen loader
├── evaluation/
│   ├── bertscore_ablation.py      # STD + Preamble BERTScore F1
│   ├── spatial_faithfulness.py    # SFS metric
│   └── latency_benchmark.py       # per-stage timing + VRAM
└── ui/
    └── gradio_app.py              # Gradio UI with AR overlay
data/
└── test_images/                   # 1,000 ARKitScenes evaluation frames
outputs/
└── results/                       # evaluation CSVs
    ├── ablation_1k.csv            # BERTScore ablation, 1k images
    ├── sfs_s2_1k.csv              # SFS Stage 2, 1k images
    ├── sfs_s3_1k.csv              # SFS Stage 3, 1k images (min 1 object)
    └── latency.csv                # per-stage latency benchmark
```

## Report

See `REPORT.md` for the full academic write-up including methodology,
results tables, and discussion.