title: Real Time Image Captioning
emoji: 👁️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
python_version: '3.10'
app_file: app.py
pinned: false

👁 ClearPath — Real-Time Scene Description for Visually-Impaired People

A fully open-source Python system that describes visual scenes in plain language with Qwen2-VL, classifies each description as SAFE or DANGEROUS using a regex engine, and reads the result aloud.

┌─────────┐    ┌───────────────┐    ┌────────────────┐    ┌────────────────┐    ┌──────┐
│  Input  │───▶│  Qwen2-VL     │───▶│  Regex Safety  │───▶│ SAFE/DANGEROUS │───▶│  TTS │
│ (Image/ │    │  Captioning   │    │  Classifier    │    │  + Hazard tags │    │      │
│ Video / │    │  (HuggingFace │    │  (15 categories│    └────────────────┘    └──────┘
│ Camera) │    │   open src)   │    │   ~30 patterns)│
└─────────┘    └───────────────┘    └────────────────┘
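The flow in the diagram can be sketched as a single function that chains the three stages. This is a minimal illustration, not the project's actual code: `SceneResult`, `describe_scene`, and the three callables are hypothetical names, and the lambdas below are stand-ins for the real captioner, classifier, and TTS engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SceneResult:
    caption: str
    label: str           # "SAFE" or "DANGEROUS"
    hazards: List[str]   # hazard tags, e.g. ["fire"]


def describe_scene(
    image: object,
    caption_fn: Callable[[object], str],
    classify_fn: Callable[[str], Tuple[str, List[str]]],
    speak_fn: Callable[[str], None],
) -> SceneResult:
    """Caption the image, classify the caption, then speak the result."""
    caption = caption_fn(image)
    label, hazards = classify_fn(caption)
    # Put the safety label first so the listener hears it before the details.
    speak_fn(f"{label}. {caption}")
    return SceneResult(caption, label, hazards)


# Stand-in stages (assumptions) so the sketch runs without a model or audio device.
spoken: List[str] = []
result = describe_scene(
    image=None,
    caption_fn=lambda img: "Smoke is rising from a parked car.",
    classify_fn=lambda text: ("DANGEROUS", ["fire"]) if "smoke" in text.lower() else ("SAFE", []),
    speak_fn=spoken.append,
)
```

Passing the stages in as callables keeps each one replaceable, which also makes the classifier easy to test without loading a model.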

📁 Project Structure

scene_description/
├── app.py                  ← Gradio web UI (main entry point)
├── cli.py                  ← Command-line interface
├── scene_captioner.py      ← Qwen2-VL image captioning module
├── safety_classifier.py    ← Regex-based SAFE/DANGEROUS classifier
├── tts_engine.py           ← Text-to-Speech (pyttsx3 / gTTS)
├── requirements.txt
├── tests/
│   └── test_safety_classifier.py
└── README.md

⚙️ Setup

1. Create a virtual environment

python -m venv venv
source venv/bin/activate        # Linux/Mac
venv\Scripts\activate           # Windows

2. Install dependencies

pip install -r requirements.txt

GPU (recommended): install the CUDA-enabled PyTorch build before installing the requirements (this example targets CUDA 12.1):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

3. (Optional) HuggingFace login for gated models

huggingface-cli login

Qwen2-VL-2B is not gated — no login required for the default model.

🚀 Running

Web UI (Gradio)

python app.py

Open http://localhost:7860 in your browser.

Supports:

📁 Image upload (drag & drop)
📷 Live webcam capture
🎬 Video file analysis (frame-by-frame)
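Captioning every single frame of a video would be slow, so a video pass typically samples frames at a fixed interval. The helper below is a rough sketch of that arithmetic; `sample_frame_indices` is a hypothetical name and the project's own sampling logic may differ.

```python
def sample_frame_indices(total_frames: int, fps: float, interval_s: float) -> list[int]:
    """Return the indices of the frames to caption, one every `interval_s` seconds."""
    if fps <= 0 or interval_s <= 0:
        raise ValueError("fps and interval_s must be positive")
    # Number of frames to skip between samples; never less than one frame.
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled every 3 seconds.
indices = sample_frame_indices(total_frames=300, fps=30.0, interval_s=3.0)
```

With these inputs the step is 90 frames, so the clip yields four frames to caption instead of 300.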

Command Line

# Single image
python cli.py --image photo.jpg --speak

# Video file (caption one frame every 3 seconds)
python cli.py --video footage.mp4 --interval 3 --speak

# Live webcam loop
python cli.py --camera --speak

# Use larger model for better quality
python cli.py --image photo.jpg --model Qwen/Qwen2-VL-7B-Instruct
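The flags used above could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the flag names come from the commands above, but the defaults, help texts, and the choice to make the three input sources mutually exclusive are guesses, not the actual contents of `cli.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="cli.py",
        description="Describe a scene and classify it as SAFE or DANGEROUS.",
    )
    # Assumption: exactly one input source is required per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--image", metavar="PATH", help="path to a single image")
    source.add_argument("--video", metavar="PATH", help="path to a video file")
    source.add_argument("--camera", action="store_true", help="use the live webcam")

    # Assumption: default values below are illustrative.
    parser.add_argument("--interval", type=float, default=3.0,
                        help="seconds between sampled video frames")
    parser.add_argument("--speak", action="store_true",
                        help="read each description aloud")
    parser.add_argument("--model", default="Qwen/Qwen2-VL-2B-Instruct",
                        help="Hugging Face model ID to use for captioning")
    return parser


# Parse the video example from the commands above.
args = build_parser().parse_args(["--video", "footage.mp4", "--interval", "3", "--speak"])
```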

Run Tests

python -m pytest tests/ -v

🧠 Models

Model	Size	VRAM	Quality
`Qwen/Qwen2-VL-2B-Instruct`	~5 GB	~5 GB	Good ✅ (default)
`Qwen/Qwen2-VL-7B-Instruct`	~14 GB	~14 GB	Better ⭐
`Qwen/Qwen2.5-VL-3B-Instruct`	~6 GB	~6 GB	Good + newer
`Salesforce/blip2-opt-2.7b`	~5 GB	~5 GB	Fallback only

Switch model via environment variable:

QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct python app.py
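On the Python side, reading that variable with a fallback to the default model might look like the sketch below. `resolve_model_id` and `DEFAULT_MODEL` are hypothetical names; only the `QWEN_MODEL` variable and the model IDs come from this README.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "Qwen/Qwen2-VL-2B-Instruct"


def resolve_model_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model ID from QWEN_MODEL, or the default if unset or blank."""
    env = os.environ if env is None else env
    value = env.get("QWEN_MODEL", "").strip()
    return value or DEFAULT_MODEL


# Passing a plain dict instead of os.environ keeps the example deterministic.
model_id = resolve_model_id({"QWEN_MODEL": "Qwen/Qwen2-VL-7B-Instruct"})
```

The `VAR=value command` form shown above is POSIX shell syntax. In Windows `cmd`, set the variable first with `set QWEN_MODEL=Qwen/Qwen2-VL-7B-Instruct` and then run `python app.py`.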

🔍 Safety Classifier — Hazard Categories

The regex engine covers 15 hazard categories with ~30 pattern groups:

Category	Examples
`fire`	fire, flames, burning, blaze, smoke
`flood`	flooding, flash flood, submerged
`storm`	tornado, hurricane, lightning
`traffic`	oncoming car, near collision
`crash`	accident, wreck, overturned vehicle
`weapon`	gun, knife, rifle, blade, bomb
`violence`	brawl, riot, shooting, assault
`fall`	cliff, ledge, scaffolding, steep drop
`collapse`	rubble, debris, cave-in
`electrical`	exposed wire, live wire, sparking
`injury`	blood, wound, bleeding, unconscious
`slip`	wet floor, icy road, black ice
`construction`	heavy machinery, crane, unsafe structure
`chemical`	chemical spill, gas leak, toxic fumes
`crowd`	stampede, crowd crush, panic
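To make the approach concrete, here is a minimal sketch of a regex classifier covering four of the categories above. The patterns are illustrative reconstructions from the example keywords in the table, not the ones shipped in `safety_classifier.py`, and `HAZARD_PATTERNS` and `classify` are hypothetical names.

```python
import re
from typing import Dict, List, Tuple

# One compiled pattern per category; \b keeps "fire" from matching "fireplace".
HAZARD_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "fire": re.compile(r"\b(fire|flames?|burning|blaze|smoke)\b", re.IGNORECASE),
    "weapon": re.compile(r"\b(guns?|kni(?:fe|ves)|rifles?|blades?|bombs?)\b", re.IGNORECASE),
    "electrical": re.compile(r"\b(exposed wires?|live wires?|sparking)\b", re.IGNORECASE),
    "slip": re.compile(r"\b(wet floor|icy road|black ice)\b", re.IGNORECASE),
}


def classify(caption: str) -> Tuple[str, List[str]]:
    """Return ("DANGEROUS", [matched categories]) or ("SAFE", [])."""
    hazards = [name for name, pattern in HAZARD_PATTERNS.items() if pattern.search(caption)]
    return ("DANGEROUS" if hazards else "SAFE", hazards)


danger = classify("A man holds a knife near a burning car.")
safe = classify("A dog sleeps on a sofa next to a fireplace.")
```

Keyword matching of this kind is fast and easy to audit, but it does not understand negation: a caption such as "there is no smoke" would still be tagged `fire`, so a classifier like this errs toward reporting danger.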

♿ Accessibility Features

Auto TTS — every description is read aloud automatically
aria-live regions in the web UI for screen reader support
High-contrast dark theme with clear visual indicators
Keyboard-navigable Gradio interface
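The project structure lists pyttsx3 and gTTS as speech backends. One plausible way to choose between them is to prefer the offline engine and fall back when it is not installed, as sketched below. The function names, the fallback order, and the output filename are assumptions; `tts_engine.py` may work differently.

```python
import importlib.util
from typing import Callable, Optional


def _installed(name: str) -> bool:
    return importlib.util.find_spec(name) is not None


def pick_tts_backend(is_available: Callable[[str], bool] = _installed) -> Optional[str]:
    """Prefer offline pyttsx3, then gTTS (needs network access), else None."""
    for name in ("pyttsx3", "gtts"):
        if is_available(name):
            return name
    return None


def speak(text: str, backend: Optional[str] = None) -> None:
    backend = backend or pick_tts_backend()
    if backend == "pyttsx3":
        import pyttsx3
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()      # blocks until speech finishes
    elif backend == "gtts":
        from gtts import gTTS
        # gTTS only produces an audio file; playing it is left to the caller.
        gTTS(text=text, lang="en").save("description.mp3")
    else:
        print(text)              # no TTS library installed: fall back to text
```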

📄 License

MIT License — free for personal and commercial use.