Spaces:

A7med-Ame3
/

Real_Time_Image_Captioning

Sleeping

App Files Files Community

Real_Time_Image_Captioning / README.md

A7med-Ame3

Update README.md

5cfc384 verified 6 days ago

preview code

raw

history blame contribute delete

4.67 kB

	---
	title: Real Time Image Captioning
	emoji: 👁️
	colorFrom: indigo
	colorTo: purple
	sdk: gradio
	sdk_version: "5.9.1"
	python_version: "3.10"
	app_file: app.py
	pinned: false
	---

	# 👁 ClearPath — Real-Time Scene Description for Visually-Impaired People

	A fully open-source Python system that describes visual scenes in plain language
	and classifies them as SAFE or DANGEROUS using a regex engine.

	```
	┌─────────┐ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ ┌──────┐
	│ Input │───▶│ Qwen2-VL │───▶│ Regex Safety │───▶│ SAFE/DANGEROUS │───▶│ TTS │
	│ (Image/ │ │ Captioning │ │ Classifier │ │ + Hazard tags │ │ │
	│ Video / │ │ (HuggingFace│ │ (15 categories│ └────────────────┘ └──────┘
	│ Camera) │ │ open src) │ │ ~30 patterns)│
	└─────────┘ └──────────────┘ └───────────────┘
	```

	---

	## 📁 Project Structure

	```
	scene_description/
	├── app.py ← Gradio web UI (main entry point)
	├── cli.py ← Command-line interface
	├── scene_captioner.py ← Qwen2-VL image captioning module
	├── safety_classifier.py ← Regex-based SAFE/DANGEROUS classifier
	├── tts_engine.py ← Text-to-Speech (pyttsx3 / gTTS)
	├── requirements.txt
	├── tests/
	│ └── test_safety_classifier.py
	└── README.md
	```

	---

	## ⚙️ Setup

	### 1. Create a virtual environment
	```bash
	python -m venv venv
	source venv/bin/activate # Linux/Mac
	venv\Scripts\activate # Windows
	```

	### 2. Install dependencies
	```bash
	pip install -r requirements.txt
	```

	> GPU (recommended): Install the CUDA-enabled PyTorch version first:
	> ```bash
	> pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
	> ```

	### 3. (Optional) HuggingFace login for gated models
	```bash
	huggingface-cli login
	```
	Qwen2-VL-2B is not gated — no login required for the default model.

	---

	## 🚀 Running

	### Web UI (Gradio)
	```bash
	python app.py
	```
	Open http://localhost:7860 in your browser.

	Supports:
	- 📁 Image upload (drag & drop)
	- 📷 Live webcam capture
	- 🎬 Video file analysis (frame-by-frame)

	### Command Line
	```bash
	# Single image
	python cli.py --image photo.jpg --speak

	# Video file (capture every 3 seconds)
	python cli.py --video footage.mp4 --interval 3 --speak

	# Live webcam loop
	python cli.py --camera --speak

	# Use larger model for better quality
	python cli.py --image photo.jpg --model Qwen/Qwen2-VL-7B-Instruct
	```

	### Run Tests
	```bash
	python -m pytest tests/ -v
	```

	---

	## 🧠 Models

	\| Model \| Size \| VRAM \| Quality \|
	\|-------\|------\|------\|---------\|
	\| `Qwen/Qwen2-VL-2B-Instruct` \| ~5 GB \| ~5 GB \| Good ✅ (default) \|
	\| `Qwen/Qwen2-VL-7B-Instruct` \| ~14 GB \| ~14 GB \| Better ⭐ \|
	\| `Qwen/Qwen2.5-VL-3B-Instruct` \| ~6 GB \| ~6 GB \| Good + newer \|
	\| `Salesforce/blip2-opt-2.7b` \| ~5 GB \| ~5 GB \| Fallback only \|

	Switch model via environment variable:
	```bash
	QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct python app.py
	```

	---

	## 🔍 Safety Classifier — Hazard Categories

	The regex engine covers 15 hazard categories with ~30 pattern groups:

	\| Category \| Examples \|
	\|----------\|---------\|
	\| `fire` \| fire, flames, burning, blaze, smoke \|
	\| `flood` \| flooding, flash flood, submerged \|
	\| `storm` \| tornado, hurricane, lightning \|
	\| `traffic` \| oncoming car, near collision \|
	\| `crash` \| accident, wreck, overturned vehicle \|
	\| `weapon` \| gun, knife, rifle, blade, bomb \|
	\| `violence` \| brawl, riot, shooting, assault \|
	\| `fall` \| cliff, ledge, scaffolding, steep drop \|
	\| `collapse` \| rubble, debris, cave-in \|
	\| `electrical` \| exposed wire, live wire, sparking \|
	\| `injury` \| blood, wound, bleeding, unconscious \|
	\| `slip` \| wet floor, icy road, black ice \|
	\| `construction` \| heavy machinery, crane, unsafe structure \|
	\| `chemical` \| chemical spill, gas leak, toxic fumes \|
	\| `crowd` \| stampede, crowd crush, panic \|

	---

	## ♿ Accessibility Features

	- Auto TTS — every description is read aloud automatically
	- `aria-live` regions in the web UI for screen reader support
	- High-contrast dark theme with clear visual indicators
	- Keyboard-navigable Gradio interface

	---

	## 📄 License

	MIT License — free for personal and commercial use.