---
title: Real Time Image Captioning
emoji: 👁️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: "5.9.1"
python_version: "3.10"
app_file: app.py
pinned: false
---

# 👁 ClearPath — Real-Time Scene Description for Visually-Impaired People

A fully open-source Python system that describes visual scenes in plain language
and classifies each description as **SAFE** or **DANGEROUS** by matching it against
regular-expression hazard patterns.

```
┌─────────┐    ┌──────────────┐    ┌────────────────┐    ┌────────────────┐    ┌──────┐
│  Input  │───▶│  Qwen2-VL    │───▶│  Regex Safety  │───▶│ SAFE/DANGEROUS │───▶│  TTS │
│ (Image/ │    │  Captioning  │    │  Classifier    │    │  + Hazard tags │    │      │
│ Video / │    │  (HuggingFace│    │ (15 categories,│    └────────────────┘    └──────┘
│ Camera) │    │   open src)  │    │  ~30 patterns) │
└─────────┘    └──────────────┘    └────────────────┘
```
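
The flow in the diagram can be sketched as a small function that chains a captioner and a classifier. This is only an illustration of the data flow: `SceneResult`, `describe_scene`, and the two stand-in callables are hypothetical names, not the actual interfaces of `scene_captioner.py` or `safety_classifier.py`.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SceneResult:
    """Hypothetical container for one analyzed frame."""
    caption: str
    label: str            # "SAFE" or "DANGEROUS"
    hazards: List[str]    # hazard category tags, empty when SAFE

    def spoken_text(self) -> str:
        # Put the warning first so a listener hears it before the description.
        if self.label == "DANGEROUS":
            return f"Warning, possible hazard: {', '.join(self.hazards)}. {self.caption}"
        return self.caption


def describe_scene(
    image: object,
    captioner: Callable[[object], str],
    classifier: Callable[[str], Tuple[str, List[str]]],
) -> SceneResult:
    """Caption the image, then classify the caption text."""
    caption = captioner(image)
    label, hazards = classifier(caption)
    return SceneResult(caption, label, hazards)


# Stand-ins so the sketch runs without a model (assumptions, not the real modules).
def fake_captioner(image: object) -> str:
    return "A kitchen with smoke rising from a pan."


def fake_classifier(caption: str) -> Tuple[str, List[str]]:
    hazards = ["fire"] if "smoke" in caption.lower() else []
    return ("DANGEROUS" if hazards else "SAFE", hazards)


result = describe_scene(None, fake_captioner, fake_classifier)
print(result.label, result.hazards)
print(result.spoken_text())
```

Passing the captioner and classifier in as callables keeps the pipeline testable without loading the vision model.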

---

## 📁 Project Structure

```
scene_description/
├── app.py                  ← Gradio web UI (main entry point)
├── cli.py                  ← Command-line interface
├── scene_captioner.py      ← Qwen2-VL image captioning module
├── safety_classifier.py    ← Regex-based SAFE/DANGEROUS classifier
├── tts_engine.py           ← Text-to-Speech (pyttsx3 / gTTS)
├── requirements.txt
├── tests/
│   └── test_safety_classifier.py
└── README.md
```

---

## ⚙️ Setup

### 1. Create a virtual environment
```bash
python -m venv venv
# Activate it with the command for your platform:
source venv/bin/activate        # Linux/macOS
venv\Scripts\activate           # Windows
```

### 2. Install dependencies
```bash
pip install -r requirements.txt
```

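> **GPU (recommended):** Install a CUDA-enabled PyTorch build before running `pip install -r requirements.txt`. The example below targets CUDA 12.1; pick the index URL that matches your CUDA version: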
> ```bash
> pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
> ```

### 3. (Optional) Hugging Face login for gated models
```bash
huggingface-cli login
```
The default model, Qwen2-VL-2B, is **not gated**, so no login is required to use it.

---

## 🚀 Running

### Web UI (Gradio)
```bash
python app.py
```
Open **http://localhost:7860** in your browser.

Supports:
- 📁 Image upload (drag & drop)
- 📷 Live webcam capture
- 🎬 Video file analysis (frame-by-frame)
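
For video input, captioning every single frame with a vision-language model can be slow, so one option is to caption only frames sampled at a fixed interval (as the CLI's `--interval` option suggests). The helper below is a minimal sketch of that idea; `sample_frame_indices` is a hypothetical name, and the project's own sampling logic may differ.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, interval_s: float) -> List[int]:
    """Return the indices of the frames to caption, one every `interval_s` seconds."""
    if total_frames <= 0 or fps <= 0 or interval_s <= 0:
        return []
    # Number of frames to skip between samples; never less than one frame.
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled every 3 seconds.
print(sample_frame_indices(total_frames=300, fps=30.0, interval_s=3.0))
# [0, 90, 180, 270]
```

With OpenCV, the frame count and frame rate would typically come from `cv2.VideoCapture.get` with `cv2.CAP_PROP_FRAME_COUNT` and `cv2.CAP_PROP_FPS`.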

### Command Line
```bash
# Single image
python cli.py --image photo.jpg --speak

# Video file (analyze one frame every 3 seconds)
python cli.py --video footage.mp4 --interval 3 --speak

# Live webcam loop
python cli.py --camera --speak

# Use a larger model for better quality
python cli.py --image photo.jpg --model Qwen/Qwen2-VL-7B-Instruct
```
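
As a rough illustration of how the flags above could be parsed, here is a minimal `argparse` sketch. The flag names come from the examples; the defaults, the help text, and the choice to make the three input sources mutually exclusive are assumptions, not a description of the real `cli.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="cli.py",
        description="Describe a scene and flag possible hazards.",
    )
    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--image", metavar="PATH", help="path to a single image")
    source.add_argument("--video", metavar="PATH", help="path to a video file")
    source.add_argument("--camera", action="store_true", help="use the live webcam")

    # Assumption: a 3-second default, mirroring the example above.
    parser.add_argument("--interval", type=float, default=3.0,
                        help="seconds between analyzed video frames")
    parser.add_argument("--speak", action="store_true",
                        help="read each description aloud")
    parser.add_argument("--model", default="Qwen/Qwen2-VL-2B-Instruct",
                        help="Hugging Face model ID to use for captioning")
    return parser


args = build_parser().parse_args(["--video", "footage.mp4", "--interval", "3", "--speak"])
print(args.video, args.interval, args.speak)
```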

### Run Tests
```bash
python -m pytest tests/ -v
```

---

## 🧠 Models

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| `Qwen/Qwen2-VL-2B-Instruct` | ~5 GB | ~5 GB | Good ✅ (default) |
| `Qwen/Qwen2-VL-7B-Instruct` | ~14 GB | ~14 GB | Better ⭐ |
| `Qwen/Qwen2.5-VL-3B-Instruct` | ~6 GB | ~6 GB | Good + newer |
| `Salesforce/blip2-opt-2.7b` | ~5 GB | ~5 GB | Fallback only |

Switch models by setting the `QWEN_MODEL` environment variable:
```bash
QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct python app.py
```
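
On the Python side, reading that variable with a fallback to the default model might look like the sketch below. `resolve_model_id` and `DEFAULT_MODEL` are hypothetical names; only the `QWEN_MODEL` variable and the model IDs come from this README.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "Qwen/Qwen2-VL-2B-Instruct"


def resolve_model_id(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the model ID from QWEN_MODEL, or the default when unset or blank."""
    env = os.environ if environ is None else environ
    value = env.get("QWEN_MODEL", "").strip()
    return value or DEFAULT_MODEL


print(resolve_model_id({}))
print(resolve_model_id({"QWEN_MODEL": "Qwen/Qwen2-VL-7B-Instruct"}))
```

Accepting the environment as a parameter makes the lookup easy to test without modifying the real process environment.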

---

## 🔍 Safety Classifier — Hazard Categories

The regex engine covers **15 hazard categories** with ~30 pattern groups:

| Category | Examples |
|----------|---------|
| `fire` | fire, flames, burning, blaze, smoke |
| `flood` | flooding, flash flood, submerged |
| `storm` | tornado, hurricane, lightning |
| `traffic` | oncoming car, near collision |
| `crash` | accident, wreck, overturned vehicle |
| `weapon` | gun, knife, rifle, blade, bomb |
| `violence` | brawl, riot, shooting, assault |
| `fall` | cliff, ledge, scaffolding, steep drop |
| `collapse` | rubble, debris, cave-in |
| `electrical` | exposed wire, live wire, sparking |
| `injury` | blood, wound, bleeding, unconscious |
| `slip` | wet floor, icy road, black ice |
| `construction` | heavy machinery, crane, unsafe structure |
| `chemical` | chemical spill, gas leak, toxic fumes |
| `crowd` | stampede, crowd crush, panic |
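
To show how caption text could map to these categories, here is a minimal sketch covering five of them. The patterns are illustrative guesses built from the example words in the table, and `HAZARD_PATTERNS` and `classify` are hypothetical names; the real `safety_classifier.py` may be structured differently.

```python
import re
from typing import Dict, List, Tuple

# Illustrative subset only: one compiled pattern per hazard category.
HAZARD_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "fire": re.compile(r"\b(fire|flames?|burning|blaze|smoke)\b", re.IGNORECASE),
    "flood": re.compile(r"\b(flood(ing|ed)?|submerged)\b", re.IGNORECASE),
    "weapon": re.compile(r"\b(guns?|knife|knives|rifles?|blades?|bombs?)\b", re.IGNORECASE),
    "slip": re.compile(r"\b(wet floor|icy road|black ice)\b", re.IGNORECASE),
    "electrical": re.compile(r"\b(exposed wires?|live wires?|sparking)\b", re.IGNORECASE),
}


def classify(caption: str) -> Tuple[str, List[str]]:
    """Return ("DANGEROUS", tags) if any pattern matches, else ("SAFE", [])."""
    hazards = [name for name, pattern in HAZARD_PATTERNS.items() if pattern.search(caption)]
    return ("DANGEROUS" if hazards else "SAFE", hazards)


print(classify("A man holding a knife near a wet floor"))
print(classify("A dog sleeping next to a fireplace"))
```

The `\b` word boundaries keep "fireplace" from matching `fire`. A keyword approach like this cannot handle negation, so a caption such as "no fire is visible" would still be tagged `fire`.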

---

## ♿ Accessibility Features

- **Auto TTS** — every description is read aloud automatically
- `aria-live` regions in the web UI for screen reader support
- High-contrast dark theme with clear visual indicators
- Keyboard-navigable Gradio interface
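
The project structure lists pyttsx3 and gTTS as speech backends. One way to choose between them at runtime is sketched below, under the assumption that the offline engine (pyttsx3) is preferred and gTTS is the fallback; `pick_tts_backend` and `speak` are hypothetical names, not the actual `tts_engine.py` API.

```python
import importlib.util
from typing import Optional


def pick_tts_backend() -> Optional[str]:
    """Return the first installed backend, preferring offline pyttsx3 over gTTS."""
    for name in ("pyttsx3", "gtts"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def speak(text: str, mp3_path: str = "description.mp3") -> Optional[str]:
    """Speak `text` with whichever backend is available; return its name."""
    backend = pick_tts_backend()
    if backend == "pyttsx3":
        import pyttsx3
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()          # blocks until speech has finished
    elif backend == "gtts":
        from gtts import gTTS
        # gTTS needs network access and writes an MP3 file instead of playing audio.
        gTTS(text=text, lang="en").save(mp3_path)
    return backend


print(pick_tts_backend())
```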

---

## 📄 License

MIT License — free for personal and commercial use.