Rishabh Jain

	---
	title: Depth-Aware Scene Description
	emoji: 👁
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: "5.49.1"
	app_file: app.py
	pinned: false
	hardware: zero-a10g
	---

	# Depth-Aware Scene Description for Visually Impaired Users

	EE654: 3D Vision & Augmented Reality | Maynooth University

	An assistive scene description system that injects structured spatial
	context from monocular depth estimation into a vision-language model,
	enabling natural-language descriptions that include object distances,
	physical sizes, horizontal positions, and depth-ordered scene layout.
	No retraining. No human-written references.

	## System Overview

	Two delivered configurations plus a null baseline:

	| Configuration | Models | Description |
	|---|---|---|
	| Baseline (Stage 1) | VLM only | Flat categorical descriptions; reference point only |
	| Stage 2 (core) | VLM + Depth Anything V2 Small | Depth context preamble injected into every prompt |
	| Stage 3 (enhanced) | VLM + DA-V2 + YOLOv8n | Per-object depth sampled at YOLO bounding-box centres |
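
The Stage 3 idea of sampling depth at a bounding-box centre can be illustrated with a minimal sketch. The function name `depth_at_box_centre`, the `(x1, y1, x2, y2)` pixel box format, and the nested-list depth map are assumptions made for this example; they are not taken from `src/pipeline.py`.

```python
from __future__ import annotations

from typing import Sequence


def depth_at_box_centre(
    depth_map: Sequence[Sequence[float]],
    box: tuple[float, float, float, float],
) -> float:
    """Return the depth value at the centre pixel of an (x1, y1, x2, y2) box."""
    height = len(depth_map)
    width = len(depth_map[0])
    x1, y1, x2, y2 = box
    # Centre of the box, clamped so boxes that run past the border stay in range.
    cx = min(max(int((x1 + x2) / 2), 0), width - 1)
    cy = min(max(int((y1 + y2) / 2), 0), height - 1)
    return depth_map[cy][cx]


# Toy 4x4 depth map (arbitrary units) and one hypothetical detection box.
toy_depth = [
    [9.0, 9.0, 9.0, 9.0],
    [9.0, 2.0, 2.5, 9.0],
    [9.0, 2.0, 3.0, 9.0],
    [9.0, 9.0, 9.0, 9.0],
]
centre_depth = depth_at_box_centre(toy_depth, (1, 1, 3, 3))
print(centre_depth)
```

A single centre pixel is the simplest choice; a median over a small window around the centre would be a more noise-tolerant variant.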

	The depth context constructor (`src/depth_context.py`) is the technical
	contribution: it converts a per-pixel disparity map into a structured
	natural-language preamble containing per-object distance (cm), physical
	width/height (cm), horizontal position (left/centre/right), nearest-first
	ordering, and a foreground/midground/background scene-layout summary.
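
A rough sketch of how such a preamble could be assembled is shown here. It is not the code in `src/depth_context.py`. It assumes distances are already available in centimetres, estimates physical size with a pinhole-camera relation, splits the image into thirds for left/centre/right, and uses made-up zone thresholds of 100 cm and 300 cm. All names (`DetectedObject`, `build_preamble`, `focal_px`) are hypothetical, and each object is tagged with its zone rather than producing a separate layout summary.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str
    distance_cm: float  # assumed to be a metric distance already
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def horizontal_position(box, image_width: int) -> str:
    """Classify the box centre into left / centre / right thirds of the image."""
    cx = (box[0] + box[2]) / 2
    if cx < image_width / 3:
        return "left"
    if cx < 2 * image_width / 3:
        return "centre"
    return "right"


def physical_size_cm(box, distance_cm: float, focal_px: float) -> tuple[float, float]:
    """Pinhole estimate: size = pixel extent * distance / focal length (pixels)."""
    width_px = box[2] - box[0]
    height_px = box[3] - box[1]
    return width_px * distance_cm / focal_px, height_px * distance_cm / focal_px


def zone(distance_cm: float) -> str:
    # Hypothetical thresholds, chosen only for this example.
    if distance_cm < 100:
        return "foreground"
    if distance_cm < 300:
        return "midground"
    return "background"


def build_preamble(objects, image_width: int, focal_px: float) -> str:
    lines = []
    # Nearest-first ordering.
    for obj in sorted(objects, key=lambda o: o.distance_cm):
        w, h = physical_size_cm(obj.box, obj.distance_cm, focal_px)
        pos = horizontal_position(obj.box, image_width)
        lines.append(
            f"{obj.label}: {obj.distance_cm:.0f} cm away, "
            f"about {w:.0f} cm wide and {h:.0f} cm tall, {pos}, {zone(obj.distance_cm)}"
        )
    return "\n".join(lines)


objects = [
    DetectedObject("door", 350.0, (500, 40, 620, 440)),
    DetectedObject("chair", 80.0, (50, 200, 250, 450)),
]
preamble = build_preamble(objects, image_width=640, focal_px=500.0)
print(preamble)
```

In this toy input the chair is listed first because it is nearer, even though the door was detected first.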

	## Hardware & Models

	| Target | GPU memory | VLM |
	|---|---|---|
	| Google Colab T4 (≤16 GB) | 15 GB VRAM | Moondream 2B |
	| Local RTX 5060 Laptop (≥16 GB) | 16 GB VRAM | Qwen2.5-VL-3B |

	The VLM is selected automatically at runtime via `torch.cuda.mem_get_info`.
	Override with `--force-model moondream` or `--force-model qwen`.
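
A minimal sketch of memory-based selection follows. `torch.cuda.mem_get_info()` returns a `(free_bytes, total_bytes)` tuple; whether the project compares free or total memory, and the exact cut-off it uses, are not stated here, so the use of total memory and the 16 GiB threshold are placeholders. The function names are hypothetical.

```python
GIB = 1024 ** 3


def choose_vlm(total_vram_bytes: int, threshold_bytes: int) -> str:
    """Pick the larger VLM only when the GPU reports at least `threshold_bytes`."""
    return "qwen" if total_vram_bytes >= threshold_bytes else "moondream"


def detect_total_vram_bytes() -> int:
    """Query the current CUDA device; return 0 when torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    _free_bytes, total_bytes = torch.cuda.mem_get_info()
    return total_bytes


# Placeholder cut-off; the project's real threshold may differ.
ASSUMED_THRESHOLD = 16 * GIB
print(choose_vlm(detect_total_vram_bytes(), ASSUMED_THRESHOLD))
```

Drivers usually report slightly less than the nominal capacity of a card, so a practical cut-off tends to sit a little below the advertised size.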

	## Evaluation

	Dataset: ARKitScenes [Baruch et al., NeurIPS 2021]. 1,000 frames were
	sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors)
	with `random.seed(42)` for reproducibility.
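
Seeded sampling of this kind might look like the sketch below. The even split of frames per scene, the `sample_frames` helper, and the file names are assumptions for illustration; the README gives only the totals and the seed.

```python
from __future__ import annotations

import random


def sample_frames(frames_by_scene: dict[str, list[str]], per_scene: int, seed: int = 42) -> list[str]:
    """Draw up to `per_scene` frames from every scene using a fixed seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    selected: list[str] = []
    for scene_id in sorted(frames_by_scene):  # sorted so the draw order is deterministic
        frames = sorted(frames_by_scene[scene_id])
        selected.extend(rng.sample(frames, min(per_scene, len(frames))))
    return selected


# Hypothetical scene IDs and frame names.
toy_scenes = {
    "scene_a": [f"scene_a_{i:04d}.jpg" for i in range(5)],
    "scene_b": [f"scene_b_{i:04d}.jpg" for i in range(5)],
    "scene_c": [f"scene_c_{i:04d}.jpg" for i in range(5)],
}
first_run = sample_frames(toy_scenes, per_scene=2)
second_run = sample_frames(toy_scenes, per_scene=2)
print(first_run == second_run)
```

Sorting both the scene IDs and the frame lists before sampling keeps the selection stable regardless of the order in which files were listed on disk.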

	Three reference-free metrics (no human annotations required):

	- STD (Spatial Term Density): spatial vocabulary terms per 100 words
	- SFS (Spatial Faithfulness Score): label + position + distance zone
	  propagation from preamble to generated description
	- Preamble BERTScore F1: generated text vs depth preamble as reference
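
As an illustration of the first metric, Spatial Term Density can be computed as spatial-term hits per 100 words. The vocabulary and tokenisation below are assumptions; the project's actual term list is not given in this README.

```python
import re

# Hypothetical vocabulary, for illustration only.
SPATIAL_TERMS = {
    "left", "right", "centre", "center", "near", "far", "behind", "front",
    "above", "below", "beside", "between", "foreground", "midground", "background",
}


def spatial_term_density(text: str, vocabulary=SPATIAL_TERMS) -> float:
    """Spatial vocabulary terms per 100 words; 0.0 for empty text."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in vocabulary)
    return 100.0 * hits / len(words)


description = (
    "A chair is near you on the left, and a door is far away "
    "in the background of the room."
)
std = spatial_term_density(description)
print(f"{std:.1f}")  # 4 spatial terms in 20 words -> 20.0
```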

	## Setup

	```bash
	pip install -r requirements.txt
	```

	Model weights are downloaded automatically on first run (YOLOv8n, Depth
	Anything V2 Small, Moondream 2B or Qwen2.5-VL-3B).

	## Running the Pipeline

	### Gradio UI (webcam + AR overlay)

	```bash
	python -m src.ui.gradio_app
	```

	### Evaluation

	```bash
	# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
	python -m src.evaluation.bertscore_ablation \
	    --output outputs/results/ablation_1k.csv

	# Spatial Faithfulness Score (Stage 3, min 1 detection)
	python -m src.evaluation.spatial_faithfulness \
	    --output outputs/results/sfs_s3_1k.csv \
	    --min-objects 1

	# Per-stage latency + VRAM benchmark
	python -m src.evaluation.latency_benchmark \
	    --image data/test_images/arkit_41159529_0000.jpg \
	    --output outputs/results/latency.csv
	```

	## Project Structure

	```
	src/
	├── config.py                     # all constants, paths, model IDs
	├── depth_context.py              # CORE CONTRIBUTION
	├── pipeline.py                   # run_stage1 / run_stage2 / run_stage3
	├── data/
	│   └── arkitscenes_loader.py     # downloads ARKitScenes frames
	├── models/
	│   ├── depth.py                  # Depth Anything V2 loader
	│   ├── detector.py               # YOLOv8n loader
	│   └── vlm.py                    # hardware-adaptive Moondream/Qwen loader
	├── evaluation/
	│   ├── bertscore_ablation.py     # STD + Preamble BERTScore F1
	│   ├── spatial_faithfulness.py   # SFS metric
	│   └── latency_benchmark.py      # per-stage timing + VRAM
	└── ui/
	    └── gradio_app.py             # Gradio UI with AR overlay
	data/
	└── test_images/                  # 1,000 ARKitScenes evaluation frames
	outputs/
	└── results/                      # evaluation CSVs
	    ├── ablation_1k.csv           # BERTScore ablation, 1k images
	    ├── sfs_s2_1k.csv             # SFS Stage 2, 1k images
	    ├── sfs_s3_1k.csv             # SFS Stage 3, 1k images (min 1 object)
	    └── latency.csv               # per-stage latency benchmark
	```

	## Report

	See `REPORT.md` for the full academic write-up including methodology,
	results tables, and discussion.