---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
---
title: Vision Language AI Demo
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: mit
---

# 🤖 Vision Language AI Demo

A Gradio web application that demonstrates vision-language models on four tasks: image captioning, visual question answering, zero-shot classification, and multimodal chat.

## ✨ Features

### 🖼️ Image Captioning
Automatically generate natural language descriptions of images using the BLIP model.
- Auto-generates a caption when an image is uploaded
- Powered by Salesforce BLIP model

### 🔍 Visual Question Answering (VQA)
Ask questions about images and get intelligent answers based on visual content.
- Supports various question types
- Real-time visual understanding

### 🏷️ Zero-Shot Image Classification
Classify images into custom categories without training using CLIP model.
- Define any categories you want
- Visual similarity scoring
- No training data required

### 💬 Multimodal Chat
Interactive conversations about image content with context retention.
- Multi-turn dialogue support
- Natural language interaction

## 📸 Demo Screenshots

### Image Captioning
![Image Captioning](source/image%20(4).png)

### Visual Question Answering
![Visual Question Answering](source/image%20(3).png)

### Zero-Shot Classification
![Zero-Shot Classification](source/image%20(2).png)

### Multimodal Chat
![Multimodal Chat](source/image%20(1).png)

## 🚀 Quick Start

### Local Run
```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

Access at `http://localhost:7860`
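
For orientation, the sketch below shows how a Gradio `app.py` with one tab per feature could be laid out. It is not this project's actual `app.py`: `caption_stub` and `build_demo` are hypothetical names, the handler is a placeholder, and only the captioning tab is wired up.

```python
def caption_stub(image):
    """Placeholder handler; a real app would call the captioning model here."""
    if image is None:
        return "Upload an image first."
    return "caption goes here"


def build_demo():
    """Build a Blocks UI with a single captioning tab (hypothetical layout)."""
    import gradio as gr  # imported lazily so this module loads without Gradio

    with gr.Blocks(title="Vision Language AI Demo") as demo:
        with gr.Tab("Image Captioning"):
            image = gr.Image(type="pil", label="Image")
            caption = gr.Textbox(label="Caption")
            button = gr.Button("Generate Caption")
            # Caption on upload, and again whenever the button is clicked.
            image.change(caption_stub, inputs=image, outputs=caption)
            button.click(caption_stub, inputs=image, outputs=caption)
    return demo
```

Calling `build_demo().launch(server_port=7860)` would serve this layout on the port mentioned above.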

### Deploy to Hugging Face Spaces

1. Go to https://huggingface.co/spaces
2. Click **"Create new Space"**
3. Fill in:
   - Space name: `vision-language-ai-demo`
   - License: MIT
   - SDK: **Gradio**
   - Hardware: CPU (free) or GPU (for faster processing)
4. Upload files:
   - `app.py`
   - `requirements.txt`
   - `README.md`
   - `source/` folder (with screenshots)
5. The Space builds and deploys automatically, usually within 5-10 minutes



## 🛠️ Models Used

| Model | Purpose | Size | Performance |
|-------|---------|------|-------------|
| [BLIP-Captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) | Image Description | 447MB | Fast |
| [BLIP-VQA](https://huggingface.co/Salesforce/blip-vqa-base) | Visual Q&A | 447MB | Fast |
| [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) | Classification | 605MB | Very Fast |

All models are open source and commercially usable.

## 📖 Usage Guide

### 🖼️ Image Captioning
1. Navigate to the **"Image Captioning"** tab
2. Upload an image (drag & drop or click to browse)
3. The caption generates automatically
4. Alternatively, click the **"🎨 Generate Caption"** button

**Example Output:**
```
📝 Image Caption:
a cat sitting on a wooden table looking at the camera
```

**Use Cases:**
- Generate alt text for accessibility
- Auto-tag images for organization
- Content moderation
- Creative writing inspiration
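
As a rough illustration of what happens behind the captioning tab, here is a minimal sketch using the Transformers BLIP classes and the checkpoint listed under Models Used. The function names `load_captioner` and `generate_caption`, and the `max_new_tokens` default, are assumptions for this example and may differ from what `app.py` does.

```python
CAPTION_MODEL_ID = "Salesforce/blip-image-captioning-base"


def load_captioner(model_id=CAPTION_MODEL_ID):
    """Load the BLIP processor and model (downloads weights on first use)."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return processor, model


def generate_caption(image, processor, model, max_new_tokens=30):
    """Return a caption for a PIL image using the given processor and model."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

Typical usage would be `processor, model = load_captioner()` once at startup, then `generate_caption(Image.open("photo.jpg").convert("RGB"), processor, model)` per request, where `photo.jpg` is a placeholder path.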

---

### 🔍 Visual Question Answering
1. Go to the **"Visual Question Answering"** tab
2. Upload an image
3. Type your question in the text box
4. Click **"🤔 Get Answer"**

**Example Questions:**
- "What color is the car?"
- "How many people are there?"
- "Is there a dog in the image?"
- "What is the person wearing?"

**Example Output:**
```
❓ Question: What color is the car?
βœ… Answer: red
```

**Tips:**
- Ask specific, clear questions
- One question at a time works best
- Simple language gets better results
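
The question-answering step can be sketched in the same way with the BLIP VQA checkpoint from the Models Used table. This is an illustrative outline, not the app's code: `load_vqa`, `answer_question`, and the empty-question message are hypothetical.

```python
VQA_MODEL_ID = "Salesforce/blip-vqa-base"


def load_vqa(model_id=VQA_MODEL_ID):
    """Load the BLIP VQA processor and model (downloads weights on first use)."""
    from transformers import BlipForQuestionAnswering, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForQuestionAnswering.from_pretrained(model_id)
    return processor, model


def answer_question(image, question, processor, model):
    """Answer one question about a PIL image; blank questions are rejected."""
    question = question.strip()
    if not question:
        return "Please enter a question."
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

BLIP VQA tends to return short answers such as `red` or `yes`, which matches the example output above.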

---

### 🏷️ Zero-Shot Classification
1. Open **"Zero-Shot Classification"** tab
2. Upload an image
3. Enter categories (comma-separated)
   - Default: `cat, dog, bird, car, building`
   - Custom: `sunny, cloudy, rainy, snowy`
4. Click **"🎯 Classify"**

**Example Output:**
```
🎯 Classification Results:

cat:     92.50% β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
dog:      5.20% β–ˆ
bird:     2.30% β–Œ
car:      0.00%
building: 0.00%
```

**Use Cases:**
- Content categorization
- Image filtering
- Quality control
- Custom tagging systems
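
The sketch below illustrates one way the classification flow could work: split the comma-separated input into labels, score them with CLIP, convert the logits to probabilities with a softmax, and draw a text bar per label. All function names are hypothetical, and the one-block-per-5% bar scale is only inferred from the example output above; this sketch draws full blocks only, with no half-block.

```python
import math


def parse_categories(text):
    """Split a comma-separated string into clean, de-duplicated labels."""
    labels = []
    for part in text.split(","):
        label = part.strip()
        if label and label not in labels:
            labels.append(label)
    return labels


def clip_logits(image, labels, model_id="openai/clip-vit-base-patch32"):
    """Return one image-text similarity logit per label using CLIP."""
    from transformers import CLIPModel, CLIPProcessor

    processor = CLIPProcessor.from_pretrained(model_id)
    model = CLIPModel.from_pretrained(model_id)
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    return model(**inputs).logits_per_image[0].detach().tolist()


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def render_scores(labels, probs, percent_per_block=5):
    """Format labels, highest probability first, with a bar of full blocks."""
    width = max(len(label) for label in labels) + 1
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    lines = []
    for label, prob in ranked:
        percent = prob * 100
        name = f"{label}:"
        bar = "█" * int(percent // percent_per_block)
        lines.append(f"{name:<{width}} {percent:6.2f}% {bar}".rstrip())
    return "\n".join(lines)
```

Given a PIL image, `labels = parse_categories("cat, dog, bird")` followed by `render_scores(labels, softmax(clip_logits(image, labels)))` produces a ranked text report.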

---

### 💬 Multimodal Chat
1. Select the **"Multimodal Chat"** tab
2. Upload an image (left panel)
3. Type your message and press Enter or click **"📤 Send"**
4. Continue the conversation naturally
5. Click **"🗑️ Clear Chat"** to start over

**Example Conversation:**
```
👤 You: Describe this image
🤖 AI: a modern living room with a grey sofa

👤 You: What color are the walls?
🤖 AI: white

👤 You: Is there a window?
🤖 AI: yes
```

**Tips:**
- Start with broad questions
- Build on previous responses
- Keep questions related to the image
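
How the app carries context between turns is not documented here. As one plausible shape, the sketch below keeps the transcript as a list of `(user, assistant)` pairs and routes each message either to a captioning callable or to a VQA callable. The `chat_turn` function, the keyword-based routing on "describe", and the fallback messages are all assumptions made for illustration.

```python
def chat_turn(message, history, image, describe, answer):
    """Return a new history with one more (user, assistant) exchange.

    `describe(image)` and `answer(image, question)` are callables supplied by
    the caller, for example thin wrappers around a captioning and a VQA model.
    """
    message = message.strip()
    if image is None:
        reply = "Please upload an image first."
    elif not message:
        reply = "Please type a message."
    elif "describe" in message.lower():
        # Open-ended requests go to the captioning model.
        reply = describe(image)
    else:
        # Everything else is treated as a question about the image.
        reply = answer(image, message)
    return history + [(message, reply)]


def clear_chat():
    """Start over with an empty transcript."""
    return []
```

Returning a new list instead of mutating `history` keeps each turn easy to test and avoids sharing state between sessions.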

### Getting Help
- 📖 [Gradio Documentation](https://gradio.app/docs/)
- 🤗 [Hugging Face Forums](https://discuss.huggingface.co/)
- 💬 [Gradio Discord](https://discord.gg/gradio)

## 📋 Requirements

**System Requirements:**
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 5GB free storage for models

**Dependencies:**
- gradio >= 4.0.0
- torch >= 2.0.0
- transformers >= 4.35.0
- Pillow >= 10.0.0

See `requirements.txt` for the complete list.
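
To confirm that an environment meets the minimum versions listed above, a small standard-library check along these lines may help. It is a sketch: `MINIMUMS`, `version_tuple`, and `check_requirements` are hypothetical names, and the comparison looks only at the first three numeric components of each version.

```python
from importlib import metadata
from itertools import takewhile

# Minimum versions copied from the Dependencies list above.
MINIMUMS = {
    "gradio": "4.0.0",
    "torch": "2.0.0",
    "transformers": "4.35.0",
    "Pillow": "10.0.0",
}


def version_tuple(version):
    """Turn '2.1.0+cu118' into (2, 1, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need >= {minimum})")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name}: found {installed}, need >= {minimum}")
    return problems
```

Running `print("\n".join(check_requirements()) or "All dependencies satisfied")` before `python app.py` gives a quick report.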

## 📄 License

MIT License - See [LICENSE](LICENSE) file for details.

### Model Licenses
- **BLIP**: BSD-3-Clause License
- **CLIP**: MIT License


## 🙏 Acknowledgments

Built with amazing open-source projects:
- [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning and VQA
- [OpenAI CLIP](https://github.com/openai/CLIP) - Zero-shot classification
- [Hugging Face Transformers](https://huggingface.co/docs/transformers) - Model hub and inference
- [Gradio](https://gradio.app/) - Beautiful web interfaces


---