Rishabh Jain
---
title: Depth-Aware Scene Description
emoji: 👁
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
hardware: zero-a10g
---
# Depth-Aware Scene Description for Visually Impaired Users
EE654: 3D Vision & Augmented Reality | Maynooth University
An assistive scene description system that injects structured spatial
context from monocular depth estimation into a vision-language model,
enabling natural-language descriptions that include object distances,
physical sizes, horizontal positions, and depth-ordered scene layout.
No retraining. No human-written references.
## System Overview
Two delivered configurations plus a null baseline:
| Configuration | Models | Description |
|---|---|---|
| Baseline (Stage 1) | VLM only | Flat categorical descriptions; reference point only |
| **Stage 2** (core) | VLM + Depth Anything V2 Small | Depth context preamble injected into every prompt |
| **Stage 3** (enhanced) | VLM + DA-V2 + YOLOv8n | Per-object depth sampled at YOLO bounding-box centres |
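The Stage 3 row can be illustrated with a minimal sketch of sampling a depth map at a bounding-box centre. The function name `sample_depth_at_centre`, the `(x1, y1, x2, y2)` box layout, and the small median window are assumptions made for illustration; the actual implementation may read a single pixel or use a different box format.

```python
from statistics import median
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


def sample_depth_at_centre(
    depth_map: Sequence[Sequence[float]], box: Box, radius: int = 1
) -> float:
    """Return the median depth-map value in a small window around the box centre.

    Hypothetical helper: radius=0 reads exactly one pixel, larger values
    smooth over noisy pixels near the centre.
    """
    height, width = len(depth_map), len(depth_map[0])
    x1, y1, x2, y2 = box
    # Box centre, clamped so it always falls inside the map.
    cx = min(max(int((x1 + x2) / 2), 0), width - 1)
    cy = min(max(int((y1 + y2) / 2), 0), height - 1)
    values: List[float] = []
    for y in range(max(cy - radius, 0), min(cy + radius, height - 1) + 1):
        for x in range(max(cx - radius, 0), min(cx + radius, width - 1) + 1):
            values.append(depth_map[y][x])
    return median(values)
```

Whether the sampled value is a disparity or a metric depth depends on how the depth model output is post-processed, which this sketch does not cover.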
The **depth context constructor** (`src/depth_context.py`) is the technical
contribution: it converts a per-pixel disparity map into a structured
natural-language preamble containing per-object distance (cm), physical
width/height (cm), horizontal position (left/centre/right), nearest-first
ordering, and a foreground/midground/background scene-layout summary.
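As a rough illustration of the preamble format described above, the sketch below builds one from a list of detections. It is not the code in `src/depth_context.py`: the `DetectedObject` type, the pinhole-camera size estimate, the image-thirds rule for horizontal position, and the 150 cm / 400 cm zone thresholds are all assumptions, and it presumes the disparity map has already been converted to metric depth.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:  # hypothetical container, not the project's type
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    depth_m: float                          # assumed metric depth at the box centre


def horizontal_position(cx: float, image_width: int) -> str:
    """Map a pixel x-coordinate to left/centre/right using image thirds."""
    third = image_width / 3
    if cx < third:
        return "left"
    if cx < 2 * third:
        return "centre"
    return "right"


def layout_zone(distance_cm: float, near_cm: float = 150.0, far_cm: float = 400.0) -> str:
    """Assign a scene-layout zone; the thresholds are illustrative only."""
    if distance_cm < near_cm:
        return "foreground"
    if distance_cm < far_cm:
        return "midground"
    return "background"


def build_preamble(objects: List[DetectedObject], image_width: int, focal_px: float) -> str:
    lines = []
    zones = {"foreground": [], "midground": [], "background": []}
    for obj in sorted(objects, key=lambda o: o.depth_m):  # nearest first
        x1, y1, x2, y2 = obj.box
        distance_cm = obj.depth_m * 100
        # Pinhole model: physical size = pixel size * depth / focal length.
        width_cm = (x2 - x1) * obj.depth_m / focal_px * 100
        height_cm = (y2 - y1) * obj.depth_m / focal_px * 100
        position = horizontal_position((x1 + x2) / 2, image_width)
        zones[layout_zone(distance_cm)].append(obj.label)
        lines.append(
            f"{obj.label}: {distance_cm:.0f} cm away, {position}, "
            f"about {width_cm:.0f} cm wide and {height_cm:.0f} cm tall"
        )
    summary = "; ".join(
        f"{zone}: {', '.join(labels) or 'none'}" for zone, labels in zones.items()
    )
    return "\n".join(lines + [f"Layout: {summary}"])
```

The returned string would then be prepended to the VLM prompt; the exact wording the real constructor emits may differ.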
## Hardware & Models
| Target | GPU | VLM |
|---|---|---|
| Google Colab T4 (≤16 GB) | 15 GB VRAM | Moondream 2B |
| Local RTX 5060 Laptop (≥16 GB) | 16 GB VRAM | Qwen2.5-VL-3B |
The VLM is selected automatically at runtime via `torch.cuda.mem_get_info`.
Override with `--force-model moondream` or `--force-model qwen`.
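A minimal sketch of how such a selection might work is shown below. The function names, the use of total (rather than free) memory, and the 15.5 GiB threshold are assumptions chosen to separate the two rows of the table; only `torch.cuda.mem_get_info` and the two `--force-model` values come from this README.

```python
from typing import Optional

GIB = 1024 ** 3


def choose_vlm(
    total_vram_bytes: int,
    threshold_gib: float = 15.5,          # hypothetical cut-off, not the project's value
    force_model: Optional[str] = None,    # mirrors --force-model moondream|qwen
) -> str:
    """Pick a VLM name from available VRAM, unless an override is given."""
    if force_model is not None:
        if force_model not in ("moondream", "qwen"):
            raise ValueError(f"unknown model: {force_model!r}")
        return force_model
    return "qwen" if total_vram_bytes >= threshold_gib * GIB else "moondream"


def detect_total_vram_bytes() -> int:
    """Return total VRAM of the current CUDA device, or 0 without CUDA or PyTorch."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    _free_bytes, total_bytes = torch.cuda.mem_get_info()  # returns (free, total) in bytes
    return total_bytes
```

With these helpers, `choose_vlm(detect_total_vram_bytes())` falls back to the smaller model on machines without a GPU.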
## Evaluation
Dataset: **ARKitScenes** [Baruch et al., NeurIPS 2021]. 1,000 frames are
sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors)
after seeding with `random.seed(42)`.
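The seeded sampling can be sketched as follows. The helper name `sample_frames`, the `frames_by_scene` mapping, and the even split of frames per scene are assumptions for illustration; only the seed value is taken from the description above.

```python
import random
from typing import Dict, List


def sample_frames(
    frames_by_scene: Dict[str, List[str]], frames_per_scene: int, seed: int = 42
) -> List[str]:
    """Draw the same random subset of frames on every run.

    Scenes and frames are sorted first so the result does not depend on
    dictionary or filesystem ordering.
    """
    random.seed(seed)
    sampled: List[str] = []
    for scene_id in sorted(frames_by_scene):
        frames = sorted(frames_by_scene[scene_id])
        count = min(frames_per_scene, len(frames))
        sampled.extend(random.sample(frames, count))
    return sampled
```

Calling it twice with the same inputs yields an identical list, which is what makes the evaluation set reproducible.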
Three reference-free metrics (no human annotations required):
- **STD** (Spatial Term Density): spatial vocabulary terms per 100 words
- **SFS** (Spatial Faithfulness Score): label + position + distance zone
  propagation from preamble to generated description
- **Preamble BERTScore F1**: generated text vs depth preamble as reference
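The first two metrics can be approximated with a few lines of Python. This is a simplified sketch, not the code in `src/evaluation/`: the vocabulary in `SPATIAL_TERMS`, the dictionary keys `label`, `position`, and `zone`, and the equal weighting of the three facts are assumptions.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical spatial vocabulary; the project's actual term list is not shown here.
SPATIAL_TERMS = frozenset({
    "left", "right", "centre", "center", "near", "far", "behind", "front",
    "foreground", "midground", "background", "closest", "nearest",
})


def spatial_term_density(text: str, vocabulary: Iterable[str] = SPATIAL_TERMS) -> float:
    """Spatial vocabulary terms per 100 words (0.0 for empty text)."""
    vocab = set(vocabulary)
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in vocab)
    return 100.0 * hits / len(words)


def spatial_faithfulness(objects: List[Dict[str, str]], description: str) -> float:
    """Mean fraction of (label, position, zone) facts that reappear in the description."""
    if not objects:
        return 0.0
    words = set(re.findall(r"[a-z]+", description.lower()))
    per_object = []
    for obj in objects:
        facts = (obj["label"], obj["position"], obj["zone"])
        per_object.append(sum(fact in words for fact in facts) / len(facts))
    return sum(per_object) / len(per_object)
```

Matching on single lowercase words keeps the sketch short, but it would miss multi-word labels and synonyms.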
## Setup
```bash
pip install -r requirements.txt
```
Model weights are downloaded automatically on first run (YOLOv8n, Depth
Anything V2 Small, Moondream 2B or Qwen2.5-VL-3B).
## Running the Pipeline
### Gradio UI (webcam + AR overlay)
```bash
python -m src.ui.gradio_app
```
### Evaluation
```bash
# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
python -m src.evaluation.bertscore_ablation \
    --output outputs/results/ablation_1k.csv

# Spatial Faithfulness Score (Stage 3, min 1 detection)
python -m src.evaluation.spatial_faithfulness \
    --output outputs/results/sfs_s3_1k.csv \
    --min-objects 1

# Per-stage latency + VRAM benchmark
python -m src.evaluation.latency_benchmark \
    --image data/test_images/arkit_41159529_0000.jpg \
    --output outputs/results/latency.csv
```
```
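For readers who want to time a stage by hand, a per-stage latency and peak-VRAM measurement might look like the sketch below. The `benchmark` helper and its return keys are hypothetical and are not the interface of `latency_benchmark.py`; the PyTorch calls are only used when CUDA is available.

```python
import time
from statistics import mean
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> Dict[str, float]:
    """Time a zero-argument callable and report peak CUDA memory if a GPU is present."""
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        torch, use_cuda = None, False

    for _ in range(warmup):  # warm-up runs are excluded from the timings
        fn()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()

    timings_ms = []
    for _ in range(repeats):
        if use_cuda:
            torch.cuda.synchronize()  # make sure earlier GPU work has finished
        start = time.perf_counter()
        fn()
        if use_cuda:
            torch.cuda.synchronize()  # wait for this call's GPU work
        timings_ms.append((time.perf_counter() - start) * 1000)

    peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2 if use_cuda else 0.0
    return {"mean_ms": mean(timings_ms), "min_ms": min(timings_ms), "peak_vram_mb": peak_mb}
```

A stage would be wrapped in a lambda, for example `benchmark(lambda: run_stage2(image))`, where the call signature of `run_stage2` is assumed.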
## Project Structure
```
src/
├── config.py                   # all constants, paths, model IDs
├── depth_context.py            # CORE CONTRIBUTION
├── pipeline.py                 # run_stage1 / run_stage2 / run_stage3
├── data/
│   └── arkitscenes_loader.py   # downloads ARKitScenes frames
├── models/
│   ├── depth.py                # Depth Anything V2 loader
│   ├── detector.py             # YOLOv8n loader
│   └── vlm.py                  # hardware-adaptive Moondream/Qwen loader
├── evaluation/
│   ├── bertscore_ablation.py   # STD + Preamble BERTScore F1
│   ├── spatial_faithfulness.py # SFS metric
│   └── latency_benchmark.py    # per-stage timing + VRAM
└── ui/
    └── gradio_app.py           # Gradio UI with AR overlay
data/
└── test_images/                # 1,000 ARKitScenes evaluation frames
outputs/
└── results/                    # evaluation CSVs
    ├── ablation_1k.csv         # BERTScore ablation, 1k images
    ├── sfs_s2_1k.csv           # SFS Stage 2, 1k images
    ├── sfs_s3_1k.csv           # SFS Stage 3, 1k images (min 1 object)
    └── latency.csv             # per-stage latency benchmark
```
## Report
See `REPORT.md` for the full academic write-up including methodology,
results tables, and discussion.