Spaces:

kkkai123456
/

HW_3

Running

File size: 5,737 Bytes

---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
---
title: Vision Language AI Demo
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: mit
---

# 🤖 Vision Language AI Demo

A comprehensive web application showcasing state-of-the-art Vision-Language AI models with an intuitive Gradio interface.

## ✨ Features

### 🖼️ Image Captioning
Automatically generate natural language descriptions of images using BLIP model.
- Auto-generates captions when image is uploaded
- Powered by Salesforce BLIP model

### 🔍 Visual Question Answering (VQA)
Ask questions about images and get intelligent answers based on visual content.
- Supports various question types
- Real-time visual understanding

### 🏷️ Zero-Shot Image Classification
Classify images into custom categories without training using CLIP model.
- Define any categories you want
- Visual similarity scoring
- No training data required

### 💬 Multimodal Chat
Interactive conversations about image content with context retention.
- Multi-turn dialogue support
- Natural language interaction

## 📸 Demo Screenshots

### Image Captioning
![Image Captioning](source/image%20(4).png)

### Visual Question Answering
![Visual Question Answering](source/image%20(3).png)

### Zero-Shot Classification
![Zero-Shot Classification](source/image%20(2).png)

### Multimodal Chat
![Multimodal Chat](source/image%20(1).png)

## 🚀 Quick Start

### Local Run
```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

Access at `http://localhost:7860`

### Deploy to Hugging Face Spaces

1. Go to https://huggingface.co/spaces
2. Click **"Create new Space"**
3. Fill in:
   - Space name: `vision-language-ai-demo`
   - License: MIT
   - SDK: **Gradio**
   - Hardware: CPU (free) or GPU (for faster processing)
4. Upload files:
   - `app.py`
   - `requirements.txt`
   - `README.md`
   - `source/` folder (with screenshots)
5. Space will auto-deploy in 5-10 minutes



## 🛠️ Models Used

| Model | Purpose | Size | Performance |
|-------|---------|------|-------------|
| [BLIP-Captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) | Image Description | 447MB | Fast |
| [BLIP-VQA](https://huggingface.co/Salesforce/blip-vqa-base) | Visual Q&A | 447MB | Fast |
| [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) | Classification | 605MB | Very Fast |

All models are open source and commercially usable.

## 📖 Usage Guide

### 🖼️ Image Captioning
1. Navigate to **"Image Captioning"** tab
2. Upload an image (drag & drop or click to browse)
3. Caption generates automatically
4. Or click **"🎨 Generate Caption"** button

**Example Output:**
```
📝 Image Caption:
a cat sitting on a wooden table looking at the camera
```

**Use Cases:**
- Generate alt text for accessibility
- Auto-tag images for organization
- Content moderation
- Creative writing inspiration

---

### 🔍 Visual Question Answering
1. Go to **"Visual Question Answering"** tab
2. Upload an image
3. Type your question in the text box
4. Click **"🤔 Get Answer"**

**Example Questions:**
- "What color is the car?"
- "How many people are there?"
- "Is there a dog in the image?"
- "What is the person wearing?"

**Example Output:**
```
❓ Question: What color is the car?
✅ Answer: red
```

**Tips:**
- Ask specific, clear questions
- One question at a time works best
- Simple language gets better results

---

### 🏷️ Zero-Shot Classification
1. Open **"Zero-Shot Classification"** tab
2. Upload an image
3. Enter categories (comma-separated)
   - Default: `cat, dog, bird, car, building`
   - Custom: `sunny, cloudy, rainy, snowy`
4. Click **"🎯 Classify"**

**Example Output:**
```
🎯 Classification Results:

cat:     92.50% ██████████████████
dog:      5.20% █
bird:     2.30% ▌
car:      0.00%
building: 0.00%
```

**Use Cases:**
- Content categorization
- Image filtering
- Quality control
- Custom tagging systems

---

### 💬 Multimodal Chat
1. Select **"Multimodal Chat"** tab
2. Upload an image (left panel)
3. Type your message and press Enter or click **"📤 Send"**
4. Continue the conversation naturally
5. Click **"🗑️ Clear Chat"** to start over

**Example Conversation:**
```
👤 You: Describe this image
🤖 AI: a modern living room with a grey sofa

👤 You: What color are the walls?
🤖 AI: white

👤 You: Is there a window?
🤖 AI: yes
```

**Tips:**
- Start with broad questions
- Build on previous responses
- Keep questions related to the image

### Getting Help
- 📖 [Gradio Documentation](https://gradio.app/docs/)
- 🤗 [Hugging Face Forums](https://discuss.huggingface.co/)
- 💬 [Gradio Discord](https://discord.gg/gradio)

## 📋 Requirements

**System Requirements:**
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 5GB free storage for models

**Dependencies:**
- gradio >= 4.0.0
- torch >= 2.0.0
- transformers >= 4.35.0
- Pillow >= 10.0.0

See `requirements.txt` for complete list.

## 📄 License

MIT License - See [LICENSE](LICENSE) file for details.

### Model Licenses
- **BLIP**: BSD-3-Clause License
- **CLIP**: MIT License


## 🙏 Acknowledgments

Built with amazing open-source projects:
- [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning and VQA
- [OpenAI CLIP](https://github.com/openai/CLIP) - Zero-shot classification
- [Hugging Face Transformers](https://huggingface.co/docs/transformers) - Model hub and inference
- [Gradio](https://gradio.app/) - Beautiful web interfaces


---