Rishabh Jain

title: Depth-Aware Scene Description
emoji: 👁
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
hardware: zero-a10g

Depth-Aware Scene Description for Visually Impaired Users

EE654 — 3D Vision & Augmented Reality | Maynooth University

An assistive scene description system that injects structured spatial context from monocular depth estimation into a vision-language model (VLM), enabling natural-language descriptions that include object distances, physical sizes, horizontal positions, and depth-ordered scene layout. The VLM is not retrained, and evaluation needs no human-written references.

System Overview

Two delivered configurations plus a null baseline:

Configuration	Models	Description
Baseline (Stage 1)	VLM only	Flat categorical descriptions — reference point only
Stage 2 (core)	VLM + Depth Anything V2 Small	Depth context preamble injected into every prompt
Stage 3 (enhanced)	VLM + DA-V2 + YOLOv8n	Per-object depth sampled at YOLO bounding-box centres

The depth context constructor (src/depth_context.py) is the technical contribution: it converts a per-pixel disparity map into a structured natural-language preamble. For each object the preamble gives the distance (cm), physical width and height (cm), and horizontal position (left/centre/right), with objects listed nearest first, followed by a foreground/midground/background scene-layout summary.
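As a rough illustration of the idea, the sketch below builds such a preamble from a depth map and a list of detections. It is not the code in src/depth_context.py: the names `Detection`, `zone_of`, and `build_preamble`, the zone thresholds, the image-thirds rule for left/centre/right, and the pinhole size estimate are all assumptions made for this example. It also assumes the disparity map has already been converted to metric depth in centimetres, a step the sketch does not cover.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detection record: a label plus a pixel bounding box."""
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


# Hypothetical zone limits in cm; the real thresholds are not stated in this README.
ZONES = ((100.0, "foreground"), (300.0, "midground"), (float("inf"), "background"))


def zone_of(distance_cm):
    """Map a distance to a coarse scene-layout zone."""
    for limit, name in ZONES:
        if distance_cm < limit:
            return name
    return "background"


def build_preamble(depth_cm, detections, focal_px):
    """Build a nearest-first text preamble.

    depth_cm: 2D grid (list of rows or array) of metric depth in cm (assumed input).
    focal_px: camera focal length in pixels (assumed known).
    """
    width = len(depth_cm[0])
    rows = []
    for det in detections:
        x1, y1, x2, y2 = det.box
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        distance = float(depth_cm[cy][cx])  # depth sampled at the box centre
        # Pinhole model: physical size = pixel size * depth / focal length.
        obj_w = (x2 - x1) * distance / focal_px
        obj_h = (y2 - y1) * distance / focal_px
        # Horizontal position from image thirds (an assumption of this sketch).
        if cx < width / 3:
            position = "left"
        elif cx < 2 * width / 3:
            position = "centre"
        else:
            position = "right"
        text = (f"{det.label}: {distance:.0f} cm away, {obj_w:.0f} cm wide, "
                f"{obj_h:.0f} cm tall, {position}")
        rows.append((distance, det.label, text))

    rows.sort(key=lambda row: row[0])  # nearest object first

    # Group labels by zone for the scene-layout summary line.
    layout = {}
    for distance, label, _ in rows:
        layout.setdefault(zone_of(distance), []).append(label)
    summary = "; ".join(f"{zone}: {', '.join(labels)}" for zone, labels in layout.items())

    return "\n".join([text for _, _, text in rows] + [f"Layout: {summary}"])
```

The returned string would then be prepended to the VLM prompt. Sorting before formatting keeps the nearest-first ordering independent of the order in which detections arrive.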

Hardware & Models

Target	GPU memory	VLM
Google Colab T4 (≤16 GB)	15 GB VRAM	Moondream 2B
Local RTX 5060 Laptop (≥16 GB)	16 GB VRAM	Qwen2.5-VL-3B

The VLM is selected automatically at runtime from the GPU memory reported by torch.cuda.mem_get_info. Override with --force-model moondream or --force-model qwen.
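The selection logic might look like the sketch below. The function names, the 15.5 GiB cut-off, and the choice of total rather than free memory are assumptions for illustration. The only facts taken from this README are that torch.cuda.mem_get_info is consulted and that the two options are moondream and qwen.

```python
GIB = 1024 ** 3

# Hypothetical cut-off: below it, fall back to the smaller model.
QWEN_MIN_TOTAL_BYTES = 15.5 * GIB

VALID_MODELS = ("moondream", "qwen")


def choose_vlm(total_bytes, force_model=None):
    """Pick a VLM name from total GPU memory, honouring an explicit override."""
    if force_model is not None:
        if force_model not in VALID_MODELS:
            raise ValueError(f"unknown model: {force_model!r}")
        return force_model
    return "qwen" if total_bytes >= QWEN_MIN_TOTAL_BYTES else "moondream"


def detect_total_vram_bytes():
    """Return total memory of the current CUDA device, or 0 without a GPU."""
    import torch  # imported lazily so choose_vlm can be used without torch

    if not torch.cuda.is_available():
        return 0
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # returns (free, total) in bytes
    return total_bytes
```

A caller would pass `detect_total_vram_bytes()` and the parsed `--force-model` value to `choose_vlm`. Keeping the decision in a pure function makes it testable on machines without a GPU.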

Evaluation

Dataset: ARKitScenes [Baruch et al., NeurIPS 2021] — 1,000 frames sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors) with random.seed(42).
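A seeded draw of this kind can be reproduced along the lines below. This is a sketch only: `sample_frames` is a hypothetical helper, and it assumes the frames are pooled into one list and sampled uniformly, which the README does not confirm.

```python
import random


def sample_frames(frame_paths, k=1000, seed=42):
    """Draw k distinct frames reproducibly from a pool of frame paths."""
    pool = sorted(frame_paths)  # fixed order, so the draw does not depend on listing order
    random.seed(seed)
    return random.sample(pool, k)
```

Sorting before sampling matters because directory listings are not guaranteed to come back in the same order on every machine, and a different input order would change the sample even with the same seed.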

Three reference-free metrics (no human annotations required):

STD — Spatial Term Density: spatial vocabulary terms per 100 words
SFS — Spatial Faithfulness Score: label + position + distance zone propagation from preamble to generated description
Preamble BERTScore F1 — generated text vs depth preamble as reference
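To make the first metric concrete, here is a minimal sketch of a spatial term density computation. The vocabulary and the tokenisation rule are assumptions chosen for this example. The project's actual term list is not given in this README.

```python
import re

# Hypothetical spatial vocabulary; the real list may differ.
SPATIAL_TERMS = frozenset({
    "left", "right", "centre", "center", "near", "far", "behind",
    "front", "above", "below", "foreground", "midground", "background",
})


def spatial_term_density(text, vocabulary=SPATIAL_TERMS):
    """Return the number of spatial vocabulary terms per 100 words."""
    words = re.findall(r"[a-z]+", text.lower())  # simple alphabetic tokenisation
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in vocabulary)
    return 100.0 * hits / len(words)
```

For example, "The chair is on the left and near the door." has 10 words, two of which ("left" and "near") are in the vocabulary, so the density is 20.0.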

Setup

pip install -r requirements.txt

Model weights are downloaded automatically on first run (YOLOv8n, Depth Anything V2 Small, Moondream 2B or Qwen2.5-VL-3B).

Running the Pipeline

Gradio UI (webcam + AR overlay)

python -m src.ui.gradio_app

Evaluation

# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
python -m src.evaluation.bertscore_ablation \
    --output outputs/results/ablation_1k.csv

# Spatial Faithfulness Score (Stage 3, min 1 detection)
python -m src.evaluation.spatial_faithfulness \
    --output outputs/results/sfs_s3_1k.csv \
    --min-objects 1

# Per-stage latency + VRAM benchmark
python -m src.evaluation.latency_benchmark \
    --image data/test_images/arkit_41159529_0000.jpg \
    --output outputs/results/latency.csv

Project Structure

src/
├── config.py                      # all constants, paths, model IDs
├── depth_context.py               # CORE CONTRIBUTION
├── pipeline.py                    # run_stage1 / run_stage2 / run_stage3
├── data/
│   └── arkitscenes_loader.py      # downloads ARKitScenes frames
├── models/
│   ├── depth.py                   # Depth Anything V2 loader
│   ├── detector.py                # YOLOv8n loader
│   └── vlm.py                     # hardware-adaptive Moondream/Qwen loader
├── evaluation/
│   ├── bertscore_ablation.py      # STD + Preamble BERTScore F1
│   ├── spatial_faithfulness.py    # SFS metric
│   └── latency_benchmark.py       # per-stage timing + VRAM
└── ui/
    └── gradio_app.py              # Gradio UI with AR overlay
data/
└── test_images/                   # 1,000 ARKitScenes evaluation frames
outputs/
└── results/                       # evaluation CSVs
    ├── ablation_1k.csv            # BERTScore ablation, 1k images
    ├── sfs_s2_1k.csv              # SFS Stage 2, 1k images
    ├── sfs_s3_1k.csv              # SFS Stage 3, 1k images (min 1 object)
    └── latency.csv                # per-stage latency benchmark

Report

See REPORT.md for the full academic write-up including methodology, results tables, and discussion.