Rishabh Jain

title: Depth-Aware Scene Description
emoji: 👁
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
hardware: zero-a10g

Depth-Aware Scene Description for Visually Impaired Users

EE654: 3D Vision & Augmented Reality | Maynooth University

An assistive scene description system that injects structured spatial context from monocular depth estimation into a vision-language model, enabling natural-language descriptions that include object distances, physical sizes, horizontal positions, and depth-ordered scene layout. No retraining. No human-written references.

System Overview

Two delivered configurations plus a null baseline:

| Configuration | Models | Description |
|---|---|---|
| Baseline (Stage 1) | VLM only | Flat categorical descriptions; reference point only |
| Stage 2 (core) | VLM + Depth Anything V2 Small | Depth context preamble injected into every prompt |
| Stage 3 (enhanced) | VLM + DA-V2 + YOLOv8n | Per-object depth sampled at YOLO bounding-box centres |
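
As an illustration of the Stage 3 row, the sketch below samples a depth map at the centre pixel of each bounding box. It is a minimal sketch, not the code in src/pipeline.py: the function name `sample_depth_at_centres`, the `(x1, y1, x2, y2)` box format, and the toy data are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # hypothetical format: (x1, y1, x2, y2) in pixels


def sample_depth_at_centres(depth_map: Sequence[Sequence[float]],
                            boxes: List[Box]) -> List[float]:
    """Return the depth value under the centre pixel of each bounding box."""
    height = len(depth_map)
    width = len(depth_map[0])
    samples = []
    for x1, y1, x2, y2 in boxes:
        # Integer centre, clamped so boxes that overrun the frame stay in range.
        cx = min(max(int((x1 + x2) / 2), 0), width - 1)
        cy = min(max(int((y1 + y2) / 2), 0), height - 1)
        samples.append(depth_map[cy][cx])
    return samples


# Toy 4x4 depth map (arbitrary units) and two hypothetical detections.
toy_depth = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
toy_boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
centre_depths = sample_depth_at_centres(toy_depth, toy_boxes)
```

Indexing with `depth_map[cy][cx]` also works on a NumPy array, so the same function could be fed a real depth prediction. A single centre pixel is sensitive to noise; averaging a small window around the centre is a common alternative, though the README does not say which the project uses.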

The depth context constructor (src/depth_context.py) is the technical contribution: it converts a per-pixel disparity map into a structured natural-language preamble containing per-object distance (cm), physical width/height (cm), horizontal position (left/centre/right), nearest-first ordering, and foreground/midground/background scene-layout summary.
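
To make the shape of such a preamble concrete, here is a minimal sketch under stated assumptions. It is not the code in src/depth_context.py: the `SceneObject` class, the equal-thirds split for horizontal position, the zone thresholds (150 cm and 400 cm), and the output wording are all hypothetical, and physical width/height are omitted for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    label: str
    distance_cm: float  # estimated distance to the object
    x_centre: float     # horizontal centre, normalised to [0, 1]


def horizontal_position(x_centre: float) -> str:
    # Assumption: the frame is split into three equal vertical bands.
    if x_centre < 1 / 3:
        return "left"
    if x_centre < 2 / 3:
        return "centre"
    return "right"


def depth_zone(distance_cm: float) -> str:
    # Assumption: illustrative thresholds, not the project's values.
    if distance_cm < 150:
        return "foreground"
    if distance_cm < 400:
        return "midground"
    return "background"


def build_preamble(objects: List[SceneObject]) -> str:
    """Format objects nearest-first as one line each."""
    ordered = sorted(objects, key=lambda o: o.distance_cm)
    lines = [
        f"{o.label}: {o.distance_cm:.0f} cm away, "
        f"{horizontal_position(o.x_centre)}, {depth_zone(o.distance_cm)}"
        for o in ordered
    ]
    return "Spatial context:\n" + "\n".join(lines)


example = build_preamble([
    SceneObject("table", 320.0, 0.8),
    SceneObject("chair", 120.0, 0.2),
])
```

The resulting string would be prepended to the VLM prompt, which is why no retraining is needed: the spatial information reaches the model as plain text.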

Hardware & Models

| Target | GPU | VLM |
|---|---|---|
| Google Colab T4 (≤16 GB) | 15 GB VRAM | Moondream 2B |
| Local RTX 5060 Laptop (≥16 GB) | 16 GB VRAM | Qwen2.5-VL-3B |

The VLM is selected automatically at runtime via torch.cuda.mem_get_info. Override with --force-model moondream or --force-model qwen.
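
The selection logic might look roughly like the sketch below. The function names `choose_vlm` and `detect_vram_bytes`, the 16 GB threshold in decimal gigabytes, and the use of total rather than free memory are assumptions for illustration; only `torch.cuda.mem_get_info`, which returns a `(free, total)` pair in bytes, comes from the text above.

```python
from typing import Optional

BYTES_PER_GB = 10 ** 9  # decimal gigabytes; an assumption for this sketch


def choose_vlm(vram_bytes: int,
               force_model: Optional[str] = None,
               threshold_gb: float = 16.0) -> str:
    """Pick 'moondream' or 'qwen' from VRAM size, honouring an override."""
    if force_model is not None:
        if force_model not in ("moondream", "qwen"):
            raise ValueError(f"unknown model: {force_model!r}")
        return force_model
    # Assumption: cards at or above the threshold get the larger model.
    return "qwen" if vram_bytes >= threshold_gb * BYTES_PER_GB else "moondream"


def detect_vram_bytes() -> int:
    """Total VRAM of the current CUDA device, or 0 when CUDA is unavailable."""
    import torch  # imported lazily so choose_vlm works without PyTorch installed

    if not torch.cuda.is_available():
        return 0
    _free_bytes, total_bytes = torch.cuda.mem_get_info()
    return total_bytes
```

A caller would combine the two as `choose_vlm(detect_vram_bytes(), force_model=args.force_model)`, where `args.force_model` stands for the parsed `--force-model` value. Keeping the decision in a pure function makes it testable on machines without a GPU.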

Evaluation

Dataset: ARKitScenes [Baruch et al., NeurIPS 2021]; 1,000 frames sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors) with random.seed(42).
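
A reproducible sampling step of this kind can be sketched as follows. This is an illustration only: the function `sample_frames`, the `frames_by_scene` mapping, and the equal number of frames per scene are assumptions, since the README states only the totals and the seed.

```python
import random
from typing import Dict, List


def sample_frames(frames_by_scene: Dict[str, List[str]],
                  n_scenes: int,
                  frames_per_scene: int,
                  seed: int = 42) -> List[str]:
    """Pick scenes, then frames within each scene, deterministically."""
    random.seed(seed)  # fixing the seed makes the selection repeatable
    # Sort first so the result does not depend on dictionary or file order.
    scenes = random.sample(sorted(frames_by_scene), n_scenes)
    selected: List[str] = []
    for scene in scenes:
        frames = sorted(frames_by_scene[scene])
        selected.extend(random.sample(frames, frames_per_scene))
    return selected
```

Under the equal-split assumption, `sample_frames(frames_by_scene, n_scenes=100, frames_per_scene=10)` would yield 1,000 frames. Sorting before sampling matters because a fixed seed only guarantees the same result when the input order is also fixed.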

Three reference-free metrics (no human annotations required):

  • STD (Spatial Term Density): spatial vocabulary terms per 100 words
  • SFS (Spatial Faithfulness Score): label + position + distance zone propagation from preamble to generated description
  • Preamble BERTScore F1: generated text vs depth preamble as reference
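
As an example of how the first metric can be computed, the sketch below counts spatial terms per 100 words. The vocabulary `SPATIAL_TERMS` and the simple lowercase word tokenisation are assumptions for illustration; the project's actual term list and tokeniser may differ.

```python
import re

# Hypothetical vocabulary; the project's actual term list may differ.
SPATIAL_TERMS = {
    "left", "right", "centre", "center", "near", "far", "behind", "front",
    "above", "below", "foreground", "midground", "background",
}


def spatial_term_density(text: str) -> float:
    """Spatial vocabulary terms per 100 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # avoid division by zero on empty descriptions
    hits = sum(1 for word in words if word in SPATIAL_TERMS)
    return 100.0 * hits / len(words)


std_example = spatial_term_density(
    "The chair is on the left and the table behind"
)
```

Here the sentence has 10 words, two of which ("left", "behind") are in the vocabulary, giving a density of 20 per 100 words. Because the metric only counts vocabulary, it rewards spatial language regardless of whether it is correct, which is presumably why it is paired with the faithfulness score.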

Setup

pip install -r requirements.txt

Model weights are downloaded automatically on first run (YOLOv8n, Depth Anything V2 Small, Moondream 2B or Qwen2.5-VL-3B).

Running the Pipeline

Gradio UI (webcam + AR overlay)

python -m src.ui.gradio_app

Evaluation

# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
python -m src.evaluation.bertscore_ablation \
    --output outputs/results/ablation_1k.csv

# Spatial Faithfulness Score (Stage 3, min 1 detection)
python -m src.evaluation.spatial_faithfulness \
    --output outputs/results/sfs_s3_1k.csv \
    --min-objects 1

# Per-stage latency + VRAM benchmark
python -m src.evaluation.latency_benchmark \
    --image data/test_images/arkit_41159529_0000.jpg \
    --output outputs/results/latency.csv

Project Structure

src/
├── config.py                      # all constants, paths, model IDs
├── depth_context.py               # CORE CONTRIBUTION
├── pipeline.py                    # run_stage1 / run_stage2 / run_stage3
├── data/
│   └── arkitscenes_loader.py      # downloads ARKitScenes frames
├── models/
│   ├── depth.py                   # Depth Anything V2 loader
│   ├── detector.py                # YOLOv8n loader
│   └── vlm.py                     # hardware-adaptive Moondream/Qwen loader
├── evaluation/
│   ├── bertscore_ablation.py      # STD + Preamble BERTScore F1
│   ├── spatial_faithfulness.py    # SFS metric
│   └── latency_benchmark.py       # per-stage timing + VRAM
└── ui/
    └── gradio_app.py              # Gradio UI with AR overlay
data/
└── test_images/                   # 1,000 ARKitScenes evaluation frames
outputs/
└── results/                       # evaluation CSVs
    ├── ablation_1k.csv            # BERTScore ablation, 1k images
    ├── sfs_s2_1k.csv              # SFS Stage 2, 1k images
    ├── sfs_s3_1k.csv              # SFS Stage 3, 1k images (min 1 object)
    └── latency.csv                # per-stage latency benchmark

Report

See REPORT.md for the full academic write-up including methodology, results tables, and discussion.