---
title: Depth-Aware Scene Description
emoji: 👁
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
hardware: zero-a10g
---

# Depth-Aware Scene Description for Visually Impaired Users

EE654 — 3D Vision & Augmented Reality | Maynooth University

An assistive scene description system that injects structured spatial context from monocular depth estimation into a vision-language model, enabling natural-language descriptions that include object distances, physical sizes, horizontal positions, and depth-ordered scene layout. No retraining. No human-written references.

## System Overview

Two delivered configurations plus a null baseline:

| Configuration | Models | Description |
|---|---|---|
| Baseline (Stage 1) | VLM only | Flat categorical descriptions — reference point only |
| **Stage 2** (core) | VLM + Depth Anything V2 Small | Depth context preamble injected into every prompt |
| **Stage 3** (enhanced) | VLM + DA-V2 + YOLOv8n | Per-object depth sampled at YOLO bounding-box centres |

The **depth context constructor** (`src/depth_context.py`) is the technical contribution: it converts a per-pixel disparity map into a structured natural-language preamble containing per-object distance (cm), physical width/height (cm), horizontal position (left/centre/right), nearest-first ordering, and foreground/midground/background scene-layout summary.

## Hardware & Models

| Target | GPU | VLM |
|---|---|---|
| Google Colab T4 (≤16 GB) | 15 GB VRAM | Moondream 2B |
| Local RTX 5060 Laptop (≥16 GB) | 16 GB VRAM | Qwen2.5-VL-3B |

The VLM is selected automatically at runtime via `torch.cuda.mem_get_info`. Override with `--force-model moondream` or `--force-model qwen`.

## Evaluation

Dataset: **ARKitScenes** [Baruch et al., NeurIPS 2021] — 1,000 frames sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors) with `random.seed(42)`.
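
The seeded sampling of evaluation frames could look roughly like the sketch below. It is an illustration only: the helper name `sample_frames`, the `frames_by_scene` mapping, and the even split of 10 frames per scene are assumptions, not the actual logic in `arkitscenes_loader.py`.

```python
import random


def sample_frames(frames_by_scene, n_scenes=100, frames_per_scene=10, seed=42):
    """Pick a reproducible subset of (scene_id, frame) pairs.

    frames_by_scene: dict mapping a scene ID to a list of frame names
    (a hypothetical structure chosen for this sketch).
    """
    # A local generator seeded with 42 gives the same sequence as calling
    # random.seed(42), without touching the global random state.
    rng = random.Random(seed)

    # Sort first so the result does not depend on dict insertion order.
    scene_ids = sorted(frames_by_scene)
    chosen_scenes = rng.sample(scene_ids, min(n_scenes, len(scene_ids)))

    selected = []
    for scene_id in chosen_scenes:
        frames = sorted(frames_by_scene[scene_id])
        k = min(frames_per_scene, len(frames))
        selected.extend((scene_id, frame) for frame in rng.sample(frames, k))
    return selected
```

Because the generator is seeded and the inputs are sorted, two calls on the same mapping return the same frames in the same order, which keeps the evaluation set stable across runs.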

Three reference-free metrics (no human annotations required):

- **STD** — Spatial Term Density: spatial vocabulary terms per 100 words
- **SFS** — Spatial Faithfulness Score: label + position + distance zone propagation from preamble to generated description
- **Preamble BERTScore F1** — generated text vs depth preamble as reference

## Setup

```bash
pip install -r requirements.txt
```

Model weights are downloaded automatically on first run (YOLOv8n, Depth Anything V2 Small, Moondream 2B or Qwen2.5-VL-3B).

## Running the Pipeline

### Gradio UI (webcam + AR overlay)

```bash
python -m src.ui.gradio_app
```

### Evaluation

```bash
# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
python -m src.evaluation.bertscore_ablation \
    --output outputs/results/ablation_1k.csv

# Spatial Faithfulness Score (Stage 3, min 1 detection)
python -m src.evaluation.spatial_faithfulness \
    --output outputs/results/sfs_s3_1k.csv \
    --min-objects 1

# Per-stage latency + VRAM benchmark
python -m src.evaluation.latency_benchmark \
    --image data/test_images/arkit_41159529_0000.jpg \
    --output outputs/results/latency.csv
```

## Project Structure

```
src/
├── config.py                      # all constants, paths, model IDs
├── depth_context.py               # CORE CONTRIBUTION
├── pipeline.py                    # run_stage1 / run_stage2 / run_stage3
├── data/
│   └── arkitscenes_loader.py      # downloads ARKitScenes frames
├── models/
│   ├── depth.py                   # Depth Anything V2 loader
│   ├── detector.py                # YOLOv8n loader
│   └── vlm.py                     # hardware-adaptive Moondream/Qwen loader
├── evaluation/
│   ├── bertscore_ablation.py      # STD + Preamble BERTScore F1
│   ├── spatial_faithfulness.py    # SFS metric
│   └── latency_benchmark.py       # per-stage timing + VRAM
└── ui/
    └── gradio_app.py              # Gradio UI with AR overlay
data/
└── test_images/                   # 1,000 ARKitScenes evaluation frames
outputs/
└── results/                       # evaluation CSVs
    ├── ablation_1k.csv            # BERTScore ablation, 1k images
    ├── sfs_s2_1k.csv              # SFS Stage 2, 1k images
    ├── sfs_s3_1k.csv              # SFS Stage 3, 1k images (min 1 object)
    └── latency.csv                # per-stage latency benchmark
```

## Report

See `REPORT.md` for the full academic write-up including methodology, results tables, and discussion.
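
For a quick feel for the STD metric listed under Evaluation before opening the report, here is a minimal sketch of "spatial vocabulary terms per 100 words". The term list `SPATIAL_TERMS` and the function name `spatial_term_density` are hypothetical and chosen for illustration; the vocabulary and tokenisation used in `bertscore_ablation.py` may differ.

```python
import re

# Hypothetical vocabulary for illustration only; the project's real list may differ.
SPATIAL_TERMS = frozenset({
    "left", "right", "centre", "center", "near", "far", "behind", "front",
    "foreground", "midground", "background", "cm",
})


def spatial_term_density(text, vocab=SPATIAL_TERMS):
    """Return the number of spatial vocabulary terms per 100 words of text."""
    # Simple tokenisation: lowercase alphabetic runs, so digits and punctuation are ignored.
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # avoid division by zero on empty input
    hits = sum(1 for word in words if word in vocab)
    return 100.0 * hits / len(words)
```

A description such as "The chair is on the left and the table is in the background" contains 13 words and 2 terms from this list, so the sketch scores it at 200/13, roughly 15.4.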