title: Real Time Image Captioning
emoji: 👁️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
python_version: '3.10'
app_file: app.py
pinned: false

👁 ClearPath — Real-Time Scene Description for Visually-Impaired People

A fully open-source Python system that describes visual scenes in plain language with Qwen2-VL, classifies each description as SAFE or DANGEROUS with a regex engine, and reads the result aloud via text-to-speech.

┌─────────┐    ┌──────────────┐    ┌───────────────┐    ┌────────────────┐    ┌──────┐
│  Input  │───▶│  Qwen2-VL    │───▶│  Regex Safety │───▶│ SAFE/DANGEROUS │───▶│  TTS │
│ (Image/ │    │  Captioning  │    │  Classifier   │    │  + Hazard tags │    │      │
│ Video / │    │  (HuggingFace│    │ (15 categories│    └────────────────┘    └──────┘
│ Camera) │    │   open src)  │    │  ~30 patterns)│
└─────────┘    └──────────────┘    └───────────────┘
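
The flow in the diagram can be sketched as three pluggable steps. This is only an illustration of the data flow, not the project's actual code: `SceneResult`, `run_pipeline`, and the three callables are hypothetical names, and speaking the verdict before the caption is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SceneResult:
    caption: str        # plain-language description of the scene
    label: str          # "SAFE" or "DANGEROUS"
    hazards: List[str]  # hazard tags, empty when the scene is SAFE


def run_pipeline(
    image: object,
    caption_fn: Callable[[object], str],
    classify_fn: Callable[[str], Tuple[str, List[str]]],
    speak_fn: Callable[[str], None],
) -> SceneResult:
    """Caption the image, classify the caption, then speak the result."""
    caption = caption_fn(image)
    label, hazards = classify_fn(caption)
    # Assumption: the verdict is spoken first so a warning is heard early.
    speak_fn(f"{label}. {caption}")
    return SceneResult(caption, label, hazards)
```

Passing the captioner, classifier, and speaker in as functions keeps each stage replaceable, for example swapping the TTS backend or stubbing the model in tests.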

📁 Project Structure

scene_description/
├── app.py                  ← Gradio web UI (main entry point)
├── cli.py                  ← Command-line interface
├── scene_captioner.py      ← Qwen2-VL image captioning module
├── safety_classifier.py    ← Regex-based SAFE/DANGEROUS classifier
├── tts_engine.py           ← Text-to-Speech (pyttsx3 / gTTS)
├── requirements.txt
├── tests/
│   └── test_safety_classifier.py
└── README.md

⚙️ Setup

1. Create a virtual environment

python -m venv venv
source venv/bin/activate        # Linux/Mac
venv\Scripts\activate           # Windows

2. Install dependencies

pip install -r requirements.txt

GPU (recommended): Install the CUDA-enabled PyTorch version first:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
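
To check whether the CUDA build was picked up after installation, a small probe like the following may help. `pick_device` is a hypothetical helper, not part of this project; it falls back to the CPU when PyTorch is missing or sees no GPU.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build sees a GPU, else "cpu"."""
    try:
        import torch  # imported lazily so the check also works before installation
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```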

3. (Optional) HuggingFace login for gated models

huggingface-cli login

Qwen2-VL-2B is not gated — no login required for the default model.


🚀 Running

Web UI (Gradio)

python app.py

Open http://localhost:7860 in your browser.

Supports:

  • 📁 Image upload (drag & drop)
  • 📷 Live webcam capture
  • 🎬 Video file analysis (frame-by-frame)

Command Line

# Single image
python cli.py --image photo.jpg --speak

# Video file (capture every 3 seconds)
python cli.py --video footage.mp4 --interval 3 --speak

# Live webcam loop
python cli.py --camera --speak

# Use larger model for better quality
python cli.py --image photo.jpg --model Qwen/Qwen2-VL-7B-Instruct
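
How `cli.py` chooses frames for `--interval` is not shown here. One plausible approach, assuming the interval is given in seconds, is to convert it to a frame step using the video's frame rate. `sample_frame_indices` is a hypothetical helper written for illustration.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, interval_s: float) -> List[int]:
    """Return the indices of frames spaced roughly `interval_s` seconds apart."""
    if fps <= 0 or interval_s <= 0:
        raise ValueError("fps and interval_s must be positive")
    # Never step by less than one frame, even for very short intervals.
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))
```

For a 10-second clip at 30 fps with a 3-second interval, this selects frames 0, 90, 180, and 270, so only four frames are captioned instead of 300.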

Run Tests

python -m pytest tests/ -v

🧠 Models

| Model | Size | VRAM | Quality |
| --- | --- | --- | --- |
| Qwen/Qwen2-VL-2B-Instruct | ~5 GB | ~5 GB | Good ✅ (default) |
| Qwen/Qwen2-VL-7B-Instruct | ~14 GB | ~14 GB | Better ⭐ |
| Qwen/Qwen2.5-VL-3B-Instruct | ~6 GB | ~6 GB | Good + newer |
| Salesforce/blip2-opt-2.7b | ~5 GB | ~5 GB | Fallback only |

Switch model via environment variable:

QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct python app.py
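
On the Python side, reading that variable could look roughly like this. The `QWEN_MODEL` name and the default model ID come from this README; `resolve_model_id` and its fallback for empty values are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "Qwen/Qwen2-VL-2B-Instruct"


def resolve_model_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model ID from QWEN_MODEL, or the default when unset or blank."""
    env = os.environ if env is None else env
    value = env.get("QWEN_MODEL", "").strip()
    return value or DEFAULT_MODEL
```

Accepting the environment as a parameter makes the lookup easy to test without touching the real process environment.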

🔍 Safety Classifier — Hazard Categories

The regex engine covers 15 hazard categories with ~30 pattern groups:

| Category | Examples |
| --- | --- |
| fire | fire, flames, burning, blaze, smoke |
| flood | flooding, flash flood, submerged |
| storm | tornado, hurricane, lightning |
| traffic | oncoming car, near collision |
| crash | accident, wreck, overturned vehicle |
| weapon | gun, knife, rifle, blade, bomb |
| violence | brawl, riot, shooting, assault |
| fall | cliff, ledge, scaffolding, steep drop |
| collapse | rubble, debris, cave-in |
| electrical | exposed wire, live wire, sparking |
| injury | blood, wound, bleeding, unconscious |
| slip | wet floor, icy road, black ice |
| construction | heavy machinery, crane, unsafe structure |
| chemical | chemical spill, gas leak, toxic fumes |
| crowd | stampede, crowd crush, panic |
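
A minimal sketch of how such a regex classifier might work, using three of the categories listed above. The patterns and the `classify` function are illustrative assumptions; the real rules live in `safety_classifier.py` and are likely more extensive.

```python
import re
from typing import Dict, List, Tuple

# One compiled pattern per hazard category (illustrative subset only).
HAZARD_PATTERNS: Dict[str, re.Pattern] = {
    "fire": re.compile(r"\b(?:fire|flames?|burning|blaze|smoke)\b", re.IGNORECASE),
    "weapon": re.compile(r"\b(?:guns?|knife|knives|rifles?|blades?|bombs?)\b", re.IGNORECASE),
    "slip": re.compile(r"\b(?:wet floor|icy road|black ice)\b", re.IGNORECASE),
}


def classify(caption: str) -> Tuple[str, List[str]]:
    """Return ("DANGEROUS", tags) if any pattern matches, else ("SAFE", [])."""
    hazards = [name for name, pattern in HAZARD_PATTERNS.items() if pattern.search(caption)]
    return ("DANGEROUS" if hazards else "SAFE", hazards)
```

The `\b` word boundaries keep "fire" from matching inside "fireplace", but keyword matching still cannot tell a "fire hydrant" from an actual fire, so false positives are expected with this kind of approach.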

♿ Accessibility Features

  • Auto TTS — every description is read aloud automatically
  • aria-live regions in the web UI for screen reader support
  • High-contrast dark theme with clear visual indicators
  • Keyboard-navigable Gradio interface
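
As an illustration of the aria-live idea, the result could be wrapped in an HTML fragment that screen readers announce when it changes. `status_html` is a hypothetical helper, and the choice of an assertive alert for DANGEROUS results is an assumption; such a fragment could be shown in a Gradio `gr.HTML` component.

```python
import html


def status_html(label: str, caption: str) -> str:
    """Build an aria-live fragment announcing the verdict and the caption."""
    dangerous = label == "DANGEROUS"
    # Assumption: dangerous scenes interrupt the screen reader, safe ones wait.
    role = "alert" if dangerous else "status"
    politeness = "assertive" if dangerous else "polite"
    return (
        f'<div role="{role}" aria-live="{politeness}">'
        f"<strong>{html.escape(label)}</strong>: {html.escape(caption)}"
        "</div>"
    )
```

Escaping the caption matters because it is model-generated text and may contain characters that would otherwise be interpreted as markup.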

📄 License

MIT License — free for personal and commercial use.