---
title: Real Time Image Captioning
emoji: 👁️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: "5.9.1"
python_version: "3.10"
app_file: app.py
pinned: false
---

# 👁 ClearPath: Real-Time Scene Description for Visually Impaired People

A fully open-source Python system that describes visual scenes in plain language
and classifies them as **SAFE** or **DANGEROUS** using a regex engine.

```
┌──────────┐    ┌───────────────┐    ┌────────────────┐    ┌────────────────┐    ┌──────┐
│  Input   │───▶│   Qwen2-VL    │───▶│  Regex Safety  │───▶│ SAFE/DANGEROUS │───▶│ TTS  │
│ (Image / │    │  Captioning   │    │   Classifier   │    │ + Hazard tags  │    │      │
│  Video / │    │ (HuggingFace, │    │ (15 categories,│    └────────────────┘    └──────┘
│  Camera) │    │  open source) │    │  ~30 patterns) │
└──────────┘    └───────────────┘    └────────────────┘
```

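The flow in the diagram can be sketched as a chain of three callables. This is a minimal illustration, not the project's actual code: `SceneReport`, `describe_scene`, and the two stub functions are hypothetical names, and the real modules (`scene_captioner.py`, `safety_classifier.py`, `tts_engine.py`) may expose different interfaces. Passing the stages in as functions keeps the sketch runnable without loading a model or an audio device.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneReport:
    """Result of one pass through the pipeline (hypothetical structure)."""
    caption: str
    label: str                      # "SAFE" or "DANGEROUS"
    hazards: List[str] = field(default_factory=list)


def describe_scene(
    image: object,
    caption_fn: Callable[[object], str],
    classify_fn: Callable[[str], Tuple[str, List[str]]],
    speak_fn: Callable[[str], None],
) -> SceneReport:
    """Caption the image, classify the caption, then speak the result."""
    caption = caption_fn(image)
    label, hazards = classify_fn(caption)
    speak_fn(f"{label}. {caption}")
    return SceneReport(caption=caption, label=label, hazards=hazards)


# Stubs standing in for the real captioner and classifier.
def fake_caption(_image: object) -> str:
    return "A kitchen with smoke rising from a pan on the stove."


def fake_classify(caption: str) -> Tuple[str, List[str]]:
    hazards = ["fire"] if "smoke" in caption.lower() else []
    return ("DANGEROUS" if hazards else "SAFE", hazards)


spoken: List[str] = []  # collects "spoken" text instead of playing audio
report = describe_scene(None, fake_caption, fake_classify, spoken.append)
print(report.label, report.hazards)
```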

---

## 📁 Project Structure

```
scene_description/
├── app.py                  ← Gradio web UI (main entry point)
├── cli.py                  ← Command-line interface
├── scene_captioner.py      ← Qwen2-VL image captioning module
├── safety_classifier.py    ← Regex-based SAFE/DANGEROUS classifier
├── tts_engine.py           ← Text-to-Speech (pyttsx3 / gTTS)
├── requirements.txt
├── tests/
│   └── test_safety_classifier.py
└── README.md
```


---

## ⚙️ Setup

### 1. Create a virtual environment
```bash
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```

### 2. Install dependencies
```bash
pip install -r requirements.txt
```

> **GPU (recommended):** Install the CUDA-enabled PyTorch build before running the command above:
> ```bash
> pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
> ```

### 3. (Optional) HuggingFace login for gated models
```bash
huggingface-cli login
```
Qwen2-VL-2B is **not gated**, so no login is required for the default model.

---


## 🚀 Running

### Web UI (Gradio)
```bash
python app.py
```
Open **http://localhost:7860** in your browser.

The web UI supports:
- 📁 Image upload (drag & drop)
- 📷 Live webcam capture
- 🎬 Video file analysis (frame-by-frame)

### Command Line
```bash
# Single image
python cli.py --image photo.jpg --speak

# Video file (sample one frame every 3 seconds)
python cli.py --video footage.mp4 --interval 3 --speak

# Live webcam loop
python cli.py --camera --speak

# Use a larger model for better quality
python cli.py --image photo.jpg --model Qwen/Qwen2-VL-7B-Instruct
```

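The flags used above could be declared with `argparse` roughly as follows. This is a sketch rather than the contents of `cli.py`: `build_parser` is a hypothetical name, and treating the three input sources as mutually exclusive, the 3-second default for `--interval`, and the default model ID are all assumptions.

```python
import argparse

# Assumed default, taken from the row marked "(default)" in the Models table.
DEFAULT_MODEL = "Qwen/Qwen2-VL-2B-Instruct"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Describe a scene and flag hazards.")

    # Exactly one input source must be chosen (an assumption of this sketch).
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--image", help="path to a single image")
    source.add_argument("--video", help="path to a video file")
    source.add_argument("--camera", action="store_true", help="use the live webcam")

    parser.add_argument("--interval", type=float, default=3.0,
                        help="seconds between sampled video frames")
    parser.add_argument("--speak", action="store_true",
                        help="read each description aloud")
    parser.add_argument("--model", default=DEFAULT_MODEL,
                        help="HuggingFace model ID to load")
    return parser


# Parse the video example from the commands above.
args = build_parser().parse_args(
    ["--video", "footage.mp4", "--interval", "3", "--speak"]
)
print(args.video, args.interval, args.speak, args.model)
```

With a mutually exclusive group, passing both `--image` and `--camera` exits with a usage error instead of silently picking one source.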

### Run Tests
```bash
python -m pytest tests/ -v
```

---

## 🧠 Models

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| `Qwen/Qwen2-VL-2B-Instruct` | ~5 GB | ~5 GB | Good ✅ (default) |
| `Qwen/Qwen2-VL-7B-Instruct` | ~14 GB | ~14 GB | Better ⭐ |
| `Qwen/Qwen2.5-VL-3B-Instruct` | ~6 GB | ~6 GB | Good + newer |
| `Salesforce/blip2-opt-2.7b` | ~5 GB | ~5 GB | Fallback only |

Switch models by setting the `QWEN_MODEL` environment variable:
```bash
QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct python app.py
```

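On the Python side, the override might be read along these lines. This is a sketch only: `resolve_model_id` is a hypothetical helper, and the fallback to `Qwen/Qwen2-VL-2B-Instruct` is inferred from the row marked as default in the table. Treating an empty or whitespace-only value as unset is a design choice of this sketch.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "Qwen/Qwen2-VL-2B-Instruct"


def resolve_model_id(environ: Mapping[str, str] = os.environ) -> str:
    """Return the model ID from QWEN_MODEL, falling back to the default."""
    value = environ.get("QWEN_MODEL", "").strip()
    return value or DEFAULT_MODEL


print(resolve_model_id({}))
# → Qwen/Qwen2-VL-2B-Instruct
print(resolve_model_id({"QWEN_MODEL": "Qwen/Qwen2-VL-7B-Instruct"}))
# → Qwen/Qwen2-VL-7B-Instruct
```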

---

## 🔍 Safety Classifier: Hazard Categories

The regex engine covers **15 hazard categories** with ~30 pattern groups:

| Category | Examples |
|----------|---------|
| `fire` | fire, flames, burning, blaze, smoke |
| `flood` | flooding, flash flood, submerged |
| `storm` | tornado, hurricane, lightning |
| `traffic` | oncoming car, near collision |
| `crash` | accident, wreck, overturned vehicle |
| `weapon` | gun, knife, rifle, blade, bomb |
| `violence` | brawl, riot, shooting, assault |
| `fall` | cliff, ledge, scaffolding, steep drop |
| `collapse` | rubble, debris, cave-in |
| `electrical` | exposed wire, live wire, sparking |
| `injury` | blood, wound, bleeding, unconscious |
| `slip` | wet floor, icy road, black ice |
| `construction` | heavy machinery, crane, unsafe structure |
| `chemical` | chemical spill, gas leak, toxic fumes |
| `crowd` | stampede, crowd crush, panic |

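
A regex classifier of this kind can be sketched in a few lines. The patterns below are built from the example terms for three of the categories above and are illustrative only; the patterns in `safety_classifier.py` are not shown in this README and may differ, and `HAZARD_PATTERNS` and `classify` are hypothetical names. Note that plain keyword matching cannot handle negation: a caption such as "no fire is visible" would still be tagged `fire`.

```python
import re
from typing import Dict, List, Tuple

# One compiled, case-insensitive pattern per category (illustrative subset).
HAZARD_PATTERNS: Dict[str, re.Pattern] = {
    "fire": re.compile(r"\b(fire|flames?|burning|blaze|smoke)\b", re.IGNORECASE),
    "slip": re.compile(r"\b(wet floor|icy road|black ice)\b", re.IGNORECASE),
    "electrical": re.compile(r"\b(exposed wires?|live wires?|sparking)\b", re.IGNORECASE),
}


def classify(caption: str) -> Tuple[str, List[str]]:
    """Return ("DANGEROUS", tags) if any pattern matches, else ("SAFE", [])."""
    tags = [name for name, pattern in HAZARD_PATTERNS.items()
            if pattern.search(caption)]
    return ("DANGEROUS" if tags else "SAFE", tags)


print(classify("Smoke is rising near a wet floor sign."))
# → ('DANGEROUS', ['fire', 'slip'])
print(classify("A quiet park with a wooden bench."))
# → ('SAFE', [])
```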

---

## ♿ Accessibility Features

- **Auto TTS:** every description is read aloud automatically
- `aria-live` regions in the web UI for screen reader support
- High-contrast dark theme with clear visual indicators
- Keyboard-navigable Gradio interface

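
One way the automatic read-aloud step might work is sketched below, using the two backends named in the project structure (pyttsx3 and gTTS). This is not the contents of `tts_engine.py`: `build_announcement` and `speak` are hypothetical names, the wording of the announcement is invented for illustration, and trying the offline pyttsx3 engine before the network-based gTTS is an assumption. The announcement puts the verdict first so that a warning is heard before the full description.

```python
from typing import List


def build_announcement(label: str, hazards: List[str], caption: str) -> str:
    """Compose the spoken text with the safety verdict ahead of the caption."""
    if label == "DANGEROUS":
        return f"Warning. Possible hazard: {', '.join(hazards)}. {caption}"
    return f"Safe. {caption}"


def speak(text: str, fallback_path: str = "announcement.mp3") -> str:
    """Speak with pyttsx3 if installed, otherwise save an MP3 with gTTS.

    Returns the backend used, or "none" if neither library is installed.
    """
    try:
        import pyttsx3  # offline engine
    except ImportError:
        pyttsx3 = None
    if pyttsx3 is not None:
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()
        return "pyttsx3"
    try:
        from gtts import gTTS  # requires network access
    except ImportError:
        return "none"
    gTTS(text=text, lang="en").save(fallback_path)
    return "gtts"


# Only the text is built here; call speak(message) to produce audio.
message = build_announcement("DANGEROUS", ["fire"], "Smoke is rising from a pan.")
print(message)
# → Warning. Possible hazard: fire. Smoke is rising from a pan.
```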

---

## 📄 License

MIT License: free for personal and commercial use.