title: Real Time Image Captioning
emoji: 👁️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
python_version: '3.10'
app_file: app.py
pinned: false
👁 ClearPath — Real-Time Scene Description for Visually-Impaired People
A fully open-source Python system that describes visual scenes in plain language, classifies each description as SAFE or DANGEROUS with a regex-based classifier, and reads the result aloud.
┌──────────┐    ┌──────────────┐    ┌────────────────┐    ┌────────────────┐    ┌──────┐
│ Input    │───▶│ Qwen2-VL     │───▶│ Regex Safety   │───▶│ SAFE/DANGEROUS │───▶│ TTS  │
│ (Image / │    │ Captioning   │    │ Classifier     │    │ + Hazard tags  │    │      │
│ Video /  │    │ (HuggingFace │    │ (15 categories │    └────────────────┘    └──────┘
│ Camera)  │    │ open source) │    │ ~30 patterns)  │
└──────────┘    └──────────────┘    └────────────────┘
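The flow in the diagram can be sketched as a small piece of glue code. This is only an illustration, not the project's actual `app.py`: the names `SceneResult` and `describe_scene` are hypothetical, and the captioner, classifier, and speaker are passed in as plain callables so the sketch runs without any model or audio backend.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneResult:
    """Outcome of one pass through the pipeline (hypothetical container)."""
    caption: str
    label: str  # "SAFE" or "DANGEROUS"
    hazards: List[str] = field(default_factory=list)


def describe_scene(
    image: object,
    caption_fn: Callable[[object], str],
    classify_fn: Callable[[str], Tuple[str, List[str]]],
    speak_fn: Callable[[str], None],
) -> SceneResult:
    """Run one image through caption -> classify -> speak."""
    caption = caption_fn(image)
    label, hazards = classify_fn(caption)
    # Assumption: the verdict is spoken first so a warning is heard
    # before the longer scene description.
    speak_fn(f"{label}. {caption}")
    return SceneResult(caption=caption, label=label, hazards=hazards)
```

Keeping the three stages as injected callables means each one can be swapped or stubbed independently, for example replacing the speaker with a function that collects messages in a list during tests.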
📁 Project Structure
scene_description/
├── app.py ← Gradio web UI (main entry point)
├── cli.py ← Command-line interface
├── scene_captioner.py ← Qwen2-VL image captioning module
├── safety_classifier.py ← Regex-based SAFE/DANGEROUS classifier
├── tts_engine.py ← Text-to-Speech (pyttsx3 / gTTS)
├── requirements.txt
├── tests/
│ └── test_safety_classifier.py
└── README.md
⚙️ Setup
1. Create a virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
2. Install dependencies
pip install -r requirements.txt
GPU (recommended): Install the CUDA-enabled PyTorch version first:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
3. (Optional) HuggingFace login for gated models
huggingface-cli login
Qwen2-VL-2B is not gated — no login required for the default model.
🚀 Running
Web UI (Gradio)
python app.py
Open http://localhost:7860 in your browser.
Supports:
- 📁 Image upload (drag & drop)
- 📷 Live webcam capture
- 🎬 Video file analysis (frame-by-frame)
Command Line
# Single image
python cli.py --image photo.jpg --speak
# Video file (capture every 3 seconds)
python cli.py --video footage.mp4 --interval 3 --speak
# Live webcam loop
python cli.py --camera --speak
# Use a larger model for better quality
python cli.py --image photo.jpg --model Qwen/Qwen2-VL-7B-Instruct
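The flags used in the commands above could be parsed roughly as follows. This is a sketch and not the real `cli.py`: the flag names come from the examples, while the mutually exclusive input group, the 3-second default for `--interval`, and the helper `frame_indices` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Scene description CLI (sketch)")
    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--image", help="path to a single image")
    source.add_argument("--video", help="path to a video file")
    source.add_argument("--camera", action="store_true", help="use the live webcam")
    parser.add_argument("--interval", type=float, default=3.0,
                        help="seconds between analysed video frames (assumed default)")
    parser.add_argument("--speak", action="store_true", help="read results aloud")
    parser.add_argument("--model", default="Qwen/Qwen2-VL-2B-Instruct",
                        help="Hugging Face model ID")
    return parser


def frame_indices(total_frames: int, fps: float, interval_s: float) -> list[int]:
    """Return the frame numbers to analyse when sampling every interval_s seconds."""
    step = max(1, round(fps * interval_s))  # never step by less than one frame
    return list(range(0, total_frames, step))
```

With a 30 fps video and `--interval 3`, this sampling analyses one frame out of every 90, which keeps a slow captioning model from having to process every frame.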
Run Tests
python -m pytest tests/ -v
🧠 Models
| Model | Size | VRAM | Quality |
|---|---|---|---|
| `Qwen/Qwen2-VL-2B-Instruct` | ~5 GB | ~5 GB | Good ✅ (default) |
| `Qwen/Qwen2-VL-7B-Instruct` | ~14 GB | ~14 GB | Better ⭐ |
| `Qwen/Qwen2.5-VL-3B-Instruct` | ~6 GB | ~6 GB | Good + newer |
| `Salesforce/blip2-opt-2.7b` | ~5 GB | ~5 GB | Fallback only |
Switch model via environment variable:
QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct python app.py
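One way the application might pick up this variable is sketched below. The variable name `QWEN_MODEL` and the default model ID come from this README; the function `resolve_model_id` and its fallback behaviour for empty values are assumptions, not the project's confirmed code.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "Qwen/Qwen2-VL-2B-Instruct"


def resolve_model_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model ID from QWEN_MODEL, or the default when unset or blank."""
    env = os.environ if env is None else env
    value = env.get("QWEN_MODEL", "").strip()
    # Assumption: a blank value is treated the same as an unset variable.
    return value or DEFAULT_MODEL
```

Passing the environment as a parameter keeps the lookup easy to test without modifying the real process environment.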
🔍 Safety Classifier — Hazard Categories
The regex engine covers 15 hazard categories with ~30 pattern groups:
| Category | Examples |
|---|---|
| `fire` | fire, flames, burning, blaze, smoke |
| `flood` | flooding, flash flood, submerged |
| `storm` | tornado, hurricane, lightning |
| `traffic` | oncoming car, near collision |
| `crash` | accident, wreck, overturned vehicle |
| `weapon` | gun, knife, rifle, blade, bomb |
| `violence` | brawl, riot, shooting, assault |
| `fall` | cliff, ledge, scaffolding, steep drop |
| `collapse` | rubble, debris, cave-in |
| `electrical` | exposed wire, live wire, sparking |
| `injury` | blood, wound, bleeding, unconscious |
| `slip` | wet floor, icy road, black ice |
| `construction` | heavy machinery, crane, unsafe structure |
| `chemical` | chemical spill, gas leak, toxic fumes |
| `crowd` | stampede, crowd crush, panic |
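A category-to-pattern classifier of this kind could look roughly like the sketch below. It covers only five of the categories, and the exact regular expressions are guesses built from the example keywords in the table; the real patterns in `safety_classifier.py` may differ. The names `HAZARD_PATTERNS` and `classify` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Illustrative subset: patterns are assumptions derived from the table's examples.
HAZARD_PATTERNS: Dict[str, re.Pattern] = {
    "fire": re.compile(r"\b(?:fire|flames?|burning|blaze|smoke)\b", re.IGNORECASE),
    "flood": re.compile(r"\b(?:flood(?:ing|ed)?|flash flood|submerged)\b", re.IGNORECASE),
    "weapon": re.compile(r"\b(?:guns?|kni(?:fe|ves)|rifles?|blades?|bombs?)\b", re.IGNORECASE),
    "slip": re.compile(r"\b(?:wet floor|icy road|black ice)\b", re.IGNORECASE),
    "electrical": re.compile(r"\b(?:exposed wires?|live wires?|sparking)\b", re.IGNORECASE),
}


def classify(caption: str) -> Tuple[str, List[str]]:
    """Return ("DANGEROUS", matched categories) or ("SAFE", [])."""
    hazards = [
        name for name, pattern in HAZARD_PATTERNS.items()
        if pattern.search(caption)
    ]
    return ("DANGEROUS" if hazards else "SAFE", hazards)
```

The `\b` word boundaries keep a keyword such as "fire" from matching inside "fireplace". Keyword matching still cannot understand context, so a caption like "a fire in the fireplace" would be flagged; that trade-off favours false alarms over missed hazards.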
♿ Accessibility Features
- Auto TTS — every description is read aloud automatically
- `aria-live` regions in the web UI for screen reader support
- High-contrast dark theme with clear visual indicators
- Keyboard-navigable Gradio interface
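The automatic read-aloud step could be sketched as below, assuming the offline `pyttsx3` backend. The function `speak` and the `SpeechEngine` protocol are hypothetical names; only `pyttsx3.init()`, `say()`, and `runAndWait()` are real `pyttsx3` calls, and the actual `tts_engine.py` (which also mentions gTTS) may be organised differently.

```python
from typing import Optional, Protocol


class SpeechEngine(Protocol):
    """Minimal interface matching the pyttsx3 engine methods used here."""
    def say(self, text: str) -> None: ...
    def runAndWait(self) -> None: ...


def speak(text: str, engine: Optional[SpeechEngine] = None) -> bool:
    """Read text aloud; return False if there was nothing to say."""
    text = text.strip()
    if not text:
        return False
    if engine is None:
        # Imported lazily so this module loads even without pyttsx3 installed.
        import pyttsx3
        engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()  # blocks until the utterance has finished
    return True
```

Accepting an optional engine lets callers reuse one initialised engine across many descriptions and lets tests substitute a silent fake.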
📄 License
MIT License — free for personal and commercial use.