---
title: Real Time Image Captioning
emoji: 👁️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: "5.9.1"
python_version: "3.10"
app_file: app.py
pinned: false
---

# 👁 ClearPath: Real-Time Scene Description for Visually-Impaired People

A fully open-source Python system that describes visual scenes in plain language and classifies them as **SAFE** or **DANGEROUS** using a regex engine.

```
┌─────────┐    ┌──────────────┐    ┌───────────────┐    ┌────────────────┐    ┌──────┐
│  Input  │───▶│  Qwen2-VL    │───▶│ Regex Safety  │───▶│ SAFE/DANGEROUS │───▶│ TTS  │
│ (Image/ │    │  Captioning  │    │  Classifier   │    │ + Hazard tags  │    │      │
│ Video / │    │  (HuggingFace│    │ (15 categories│    └────────────────┘    └──────┘
│ Camera) │    │   open src)  │    │  ~30 patterns)│
└─────────┘    └──────────────┘    └───────────────┘
```

---

## 📁 Project Structure

```
scene_description/
├── app.py                    ← Gradio web UI (main entry point)
├── cli.py                    ← Command-line interface
├── scene_captioner.py        ← Qwen2-VL image captioning module
├── safety_classifier.py      ← Regex-based SAFE/DANGEROUS classifier
├── tts_engine.py             ← Text-to-Speech (pyttsx3 / gTTS)
├── requirements.txt
├── tests/
│   └── test_safety_classifier.py
└── README.md
```

---

## ⚙️ Setup

### 1. Create a virtual environment

```bash
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

> **GPU (recommended):** Install the CUDA-enabled PyTorch version first:
> ```bash
> pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
> ```

### 3. (Optional) HuggingFace login for gated models

```bash
huggingface-cli login
```

Qwen2-VL-2B is **not gated**, so no login is required for the default model.

---

## 🚀 Running

### Web UI (Gradio)

```bash
python app.py
```

Open **http://localhost:7860** in your browser. Supports:

- 📁 Image upload (drag & drop)
- 📷 Live webcam capture
- 🎬 Video file analysis (frame-by-frame)

### Command Line

```bash
# Single image
python cli.py --image photo.jpg --speak

# Video file (capture every 3 seconds)
python cli.py --video footage.mp4 --interval 3 --speak

# Live webcam loop
python cli.py --camera --speak

# Use larger model for better quality
python cli.py --image photo.jpg --model Qwen/Qwen2-VL-7B-Instruct
```

### Run Tests

```bash
python -m pytest tests/ -v
```

---

## 🧠 Models

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| `Qwen/Qwen2-VL-2B-Instruct` | ~5 GB | ~5 GB | Good ✅ (default) |
| `Qwen/Qwen2-VL-7B-Instruct` | ~14 GB | ~14 GB | Better ⭐ |
| `Qwen/Qwen2.5-VL-3B-Instruct` | ~6 GB | ~6 GB | Good + newer |
| `Salesforce/blip2-opt-2.7b` | ~5 GB | ~5 GB | Fallback only |

Switch model via environment variable:

```bash
QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct python app.py
```

---

## 🔍 Safety Classifier: Hazard Categories

The regex engine covers **15 hazard categories** with ~30 pattern groups:

| Category | Examples |
|----------|---------|
| `fire` | fire, flames, burning, blaze, smoke |
| `flood` | flooding, flash flood, submerged |
| `storm` | tornado, hurricane, lightning |
| `traffic` | oncoming car, near collision |
| `crash` | accident, wreck, overturned vehicle |
| `weapon` | gun, knife, rifle, blade, bomb |
| `violence` | brawl, riot, shooting, assault |
| `fall` | cliff, ledge, scaffolding, steep drop |
| `collapse` | rubble, debris, cave-in |
| `electrical` | exposed wire, live wire, sparking |
| `injury` | blood, wound, bleeding, unconscious |
| `slip` | wet floor, icy road, black ice |
| `construction` | heavy machinery, crane, unsafe structure |
| `chemical` | chemical spill, gas leak, toxic fumes |
| `crowd` | stampede, crowd crush, panic |

---

## ♿ Accessibility Features

- **Auto TTS**: every description is read aloud automatically
- `aria-live` regions in the web UI for screen reader support
- High-contrast dark theme with clear visual indicators
- Keyboard-navigable Gradio interface

---

## 📄 License

MIT License: free for personal and commercial use.
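
---

## 🧪 Appendix: Classifier Sketch

The hazard table in the Safety Classifier section maps category names to example phrases. The snippet below is a minimal sketch of how such a regex-based classifier could work. It is not the code in `safety_classifier.py`: the names `HAZARD_PATTERNS` and `classify` are hypothetical, only four of the 15 categories are shown, and the patterns are built from the example phrases in the table rather than from the project's real pattern groups.

```python
import re

# Hypothetical pattern table (assumption): one compiled regex per category,
# built from the example phrases listed in the hazard table.
HAZARD_PATTERNS = {
    "fire": re.compile(r"\b(fire|flames?|burning|blaze|smoke)\b", re.IGNORECASE),
    "flood": re.compile(r"\b(flooding|flash flood|submerged)\b", re.IGNORECASE),
    "weapon": re.compile(r"\b(gun|knife|rifle|blade|bomb)\b", re.IGNORECASE),
    "slip": re.compile(r"\b(wet floor|icy road|black ice)\b", re.IGNORECASE),
}


def classify(caption):
    """Return (label, tags) where tags lists every category that matched."""
    tags = [name for name, pattern in HAZARD_PATTERNS.items()
            if pattern.search(caption)]
    label = "DANGEROUS" if tags else "SAFE"
    return label, tags


if __name__ == "__main__":
    print(classify("A quiet park with a bench and trees."))
    print(classify("Smoke and flames rise from a car on a wet floor."))
```

The word boundaries (`\b`) keep a caption containing "fireplace" from matching the `fire` pattern, which is one reason a real pattern set tends to need more groups than a plain keyword list.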