---
title: Depth-Aware Scene Description
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
hardware: zero-a10g
---

# Depth-Aware Scene Description for Visually Impaired Users

EE654: 3D Vision & Augmented Reality | Maynooth University

An assistive scene description system that injects structured spatial
context from monocular depth estimation into a vision-language model,
enabling natural-language descriptions that include object distances,
physical sizes, horizontal positions, and depth-ordered scene layout.
No retraining. No human-written references.

## System Overview

Two delivered configurations plus a null baseline:

| Configuration | Models | Description |
|---|---|---|
| Baseline (Stage 1) | VLM only | Flat categorical descriptions; reference point only |
| **Stage 2** (core) | VLM + Depth Anything V2 Small | Depth context preamble injected into every prompt |
| **Stage 3** (enhanced) | VLM + DA-V2 + YOLOv8n | Per-object depth sampled at YOLO bounding-box centres |

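
To illustrate the Stage 3 idea of sampling depth at a bounding-box centre, here is a minimal sketch. It is not taken from `src/pipeline.py`: the function name, the `window` parameter, and the use of a median over a small neighbourhood (rather than a single pixel) are all assumptions made for illustration, and the depth map is a plain nested list so the example needs no third-party packages.

```python
from statistics import median


def sample_depth_at_centre(depth_map, box, window=1):
    """Return the median depth in a small window around a box centre.

    depth_map: 2D list of floats, indexed as depth_map[y][x].
    box: (x1, y1, x2, y2) in pixel coordinates.
    window: half-size of the square sampling window (hypothetical parameter).
    """
    height, width = len(depth_map), len(depth_map[0])
    x1, y1, x2, y2 = box
    # Clamp the centre so boxes touching the image border stay in range.
    cx = min(max(int((x1 + x2) / 2), 0), width - 1)
    cy = min(max(int((y1 + y2) / 2), 0), height - 1)
    values = [
        depth_map[y][x]
        for y in range(max(cy - window, 0), min(cy + window + 1, height))
        for x in range(max(cx - window, 0), min(cx + window + 1, width))
    ]
    return median(values)


# Toy 4x4 map where each value encodes its position as row * 10 + column.
toy_map = [[row * 10 + col for col in range(4)] for row in range(4)]
centre_depth = sample_depth_at_centre(toy_map, (0, 0, 4, 4))
```

Taking a median over a few pixels is one way to reduce sensitivity to a single noisy depth value; setting `window=0` falls back to reading the centre pixel only.
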

The **depth context constructor** (`src/depth_context.py`) is the technical
contribution: it converts a per-pixel disparity map into a structured
natural-language preamble containing per-object distance (cm), physical
width/height (cm), horizontal position (left/centre/right), nearest-first
ordering, and foreground/midground/background scene-layout summary.

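
The kind of preamble described above could be assembled roughly as follows. This is a sketch under stated assumptions, not the contents of `src/depth_context.py`: every function name, the equal-thirds split for left/centre/right, the 150 cm and 400 cm zone thresholds, and the pinhole-camera size formula are illustrative choices. It also assumes distances have already been converted from disparity to centimetres, a step the sketch does not cover.

```python
def horizontal_position(x_centre, image_width):
    """Map a pixel x-coordinate to left / centre / right using equal thirds."""
    third = image_width / 3
    if x_centre < third:
        return "left"
    if x_centre < 2 * third:
        return "centre"
    return "right"


def depth_zone(distance_cm, near_cm=150, far_cm=400):
    """Assign a scene-layout zone; both thresholds are illustrative guesses."""
    if distance_cm < near_cm:
        return "foreground"
    if distance_cm < far_cm:
        return "midground"
    return "background"


def physical_size_cm(size_px, distance_cm, focal_px):
    """Pinhole-camera estimate: real size = pixel size * distance / focal length."""
    return size_px * distance_cm / focal_px


def build_preamble(objects, image_width, focal_px):
    """Render detections as nearest-first lines plus a layout summary."""
    ordered = sorted(objects, key=lambda obj: obj["distance_cm"])
    lines, zones = [], {}
    for obj in ordered:
        x1, y1, x2, y2 = obj["box"]
        dist = obj["distance_cm"]
        width_cm = physical_size_cm(x2 - x1, dist, focal_px)
        height_cm = physical_size_cm(y2 - y1, dist, focal_px)
        position = horizontal_position((x1 + x2) / 2, image_width)
        zones.setdefault(depth_zone(dist), []).append(obj["label"])
        lines.append(
            f"{obj['label']}: {dist:.0f} cm away, {position}, "
            f"about {width_cm:.0f} cm wide and {height_cm:.0f} cm tall."
        )
    summary = "; ".join(
        f"{zone}: {', '.join(zones[zone])}"
        for zone in ("foreground", "midground", "background")
        if zone in zones
    )
    return "\n".join(lines + [f"Layout: {summary}."])


# Hypothetical detections: label, pixel box (x1, y1, x2, y2), distance in cm.
detections = [
    {"label": "chair", "box": (400, 100, 600, 300), "distance_cm": 300.0},
    {"label": "cup", "box": (50, 200, 100, 250), "distance_cm": 80.0},
]
print(build_preamble(detections, image_width=640, focal_px=500))
```

The resulting text would be prepended to the VLM prompt, so the model can reuse the distances, positions, and ordering in its description.
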
## Hardware & Models

| Target | GPU | VLM |
|---|---|---|
| Google Colab T4 (≤16 GB) | 15 GB VRAM | Moondream 2B |
| Local RTX 5060 Laptop (≥16 GB) | 16 GB VRAM | Qwen2.5-VL-3B |

The VLM is selected automatically at runtime via `torch.cuda.mem_get_info`.
Override with `--force-model moondream` or `--force-model qwen`.

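
A hardware-adaptive selection of this kind might look like the sketch below. `torch.cuda.mem_get_info()` is a real PyTorch call that returns `(free_bytes, total_bytes)` for the current device; everything else is assumed for illustration, including the function names, the 16 GiB threshold, and the choice to compare against total rather than free memory. The actual logic in `src/models/vlm.py` may differ.

```python
GIB = 1024 ** 3


def choose_vlm(total_vram_bytes, threshold_gib=16.0, force_model=None):
    """Pick a VLM name from available VRAM, honouring an explicit override.

    threshold_gib is a hypothetical cut-off, not a value taken from the project.
    """
    if force_model is not None:
        if force_model not in ("moondream", "qwen"):
            raise ValueError(f"unknown model: {force_model!r}")
        return force_model
    return "qwen" if total_vram_bytes >= threshold_gib * GIB else "moondream"


def detect_total_vram_bytes():
    """Query the current CUDA device; returns 0 if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    _free_bytes, total_bytes = torch.cuda.mem_get_info()
    return total_bytes


if __name__ == "__main__":
    print(choose_vlm(detect_total_vram_bytes()))
```

Keeping the decision in a pure function that takes a byte count makes it easy to test without a GPU, while the `force_model` argument mirrors the role of the `--force-model` flag.
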
## Evaluation

Dataset: **ARKitScenes** [Baruch et al., NeurIPS 2021], 1,000 frames
sampled from 100 indoor scenes (kitchens, bedrooms, offices, corridors)
with `random.seed(42)`.

Three reference-free metrics (no human annotations required):

- **STD** (Spatial Term Density): spatial vocabulary terms per 100 words
- **SFS** (Spatial Faithfulness Score): label + position + distance zone
  propagation from preamble to generated description
- **Preamble BERTScore F1**: generated text vs depth preamble as reference

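
As a concrete reading of the STD definition above (spatial vocabulary terms per 100 words), the following sketch counts vocabulary hits and normalises by word count. The vocabulary set, the tokenisation rule (lowercase alphabetic runs, so numbers are not counted as words), and the function name are assumptions; the project's own term list in `src/evaluation/bertscore_ablation.py` is not shown in this README.

```python
import re

# Hypothetical spatial vocabulary; the project's actual list may differ.
SPATIAL_TERMS = {
    "left", "right", "centre", "center", "behind", "front", "near", "far",
    "above", "below", "beside", "between",
    "foreground", "midground", "background",
}


def spatial_term_density(text, vocabulary=SPATIAL_TERMS):
    """Return the number of spatial terms per 100 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # avoid division by zero on empty descriptions
    hits = sum(1 for word in words if word in vocabulary)
    return 100.0 * hits / len(words)


description = "A chair is on the left and a table is far behind it."
print(round(spatial_term_density(description), 1))
```

A description with no spatial vocabulary scores 0, and the score rises as more of the generated words carry spatial meaning.
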
## Setup

```bash
pip install -r requirements.txt
```

Model weights are downloaded automatically on first run (YOLOv8n, Depth
Anything V2 Small, Moondream 2B or Qwen2.5-VL-3B).

## Running the Pipeline

### Gradio UI (webcam + AR overlay)

```bash
python -m src.ui.gradio_app
```

### Evaluation

```bash
# BERTScore ablation (Stage 1 / 2 / 3) on 1,000 images
python -m src.evaluation.bertscore_ablation \
    --output outputs/results/ablation_1k.csv

# Spatial Faithfulness Score (Stage 3, min 1 detection)
python -m src.evaluation.spatial_faithfulness \
    --output outputs/results/sfs_s3_1k.csv \
    --min-objects 1

# Per-stage latency + VRAM benchmark
python -m src.evaluation.latency_benchmark \
    --image data/test_images/arkit_41159529_0000.jpg \
    --output outputs/results/latency.csv
```

## Project Structure

```
src/
├── config.py                     # all constants, paths, model IDs
├── depth_context.py              # CORE CONTRIBUTION
├── pipeline.py                   # run_stage1 / run_stage2 / run_stage3
├── data/
│   └── arkitscenes_loader.py     # downloads ARKitScenes frames
├── models/
│   ├── depth.py                  # Depth Anything V2 loader
│   ├── detector.py               # YOLOv8n loader
│   └── vlm.py                    # hardware-adaptive Moondream/Qwen loader
├── evaluation/
│   ├── bertscore_ablation.py     # STD + Preamble BERTScore F1
│   ├── spatial_faithfulness.py   # SFS metric
│   └── latency_benchmark.py      # per-stage timing + VRAM
└── ui/
    └── gradio_app.py             # Gradio UI with AR overlay
data/
└── test_images/                  # 1,000 ARKitScenes evaluation frames
outputs/
└── results/                      # evaluation CSVs
    ├── ablation_1k.csv           # BERTScore ablation, 1k images
    ├── sfs_s2_1k.csv             # SFS Stage 2, 1k images
    ├── sfs_s3_1k.csv             # SFS Stage 3, 1k images (min 1 object)
    └── latency.csv               # per-stage latency benchmark
```

## Report

See `REPORT.md` for the full academic write-up including methodology,
results tables, and discussion.