kkkai123456 committed on
Commit 434a1b5 Β· verified Β· 1 Parent(s): 9c03dd2

Update README.md

Files changed (1):
  README.md +239 -81

README.md CHANGED
@@ -8,49 +8,67 @@ sdk_version: 5.49.1
  app_file: app.py
  pinned: false
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
  # πŸ€– Vision Language AI Demo
 
- A comprehensive web application showcasing state-of-the-art Vision-Language AI models.
 
  ## ✨ Features
 
  ### πŸ–ΌοΈ Image Captioning
  Automatically generate natural language descriptions of images using the BLIP model.
 
  ### πŸ” Visual Question Answering (VQA)
  Ask questions about images and get intelligent answers based on visual content.
 
  ### 🏷️ Zero-Shot Image Classification
  Classify images into custom categories without training, using the CLIP model.
 
  ### πŸ’¬ Multimodal Chat
  Interactive conversations about image content with context retention.
 
  ## πŸ“Έ Demo Screenshots
 
- ### Main Interface
- ![Main Interface](https://huggingface.co/spaces/kkkai123456/HW_3/resolve/main/source/image%20(1).png)
-
  ### Image Captioning
- ![Image Captioning](https://via.placeholder.com/800x400/667eea/ffffff?text=Image+Captioning)
 
  ### Visual Question Answering
- ![VQA](https://via.placeholder.com/800x400/667eea/ffffff?text=Visual+QA)
 
  ### Zero-Shot Classification
- ![Classification](https://via.placeholder.com/800x400/667eea/ffffff?text=Classification)
 
  ### Multimodal Chat
- ![Chat](https://via.placeholder.com/800x400/667eea/ffffff?text=Multimodal+Chat)
 
  ## πŸš€ Quick Start
 
  ### Local Run
  ```bash
  pip install -r requirements.txt
  python app.py
  ```
@@ -58,136 +76,276 @@ Access at `http://localhost:7860`
 
  ### Deploy to Hugging Face Spaces
 
- 1. **Create a Space**
-    - Go to https://huggingface.co/spaces
-    - Click "Create new Space"
-    - Choose name and select **Gradio** SDK
-
- 2. **Upload Files**
-    - Upload `app.py`, `requirements.txt`, and `README.md`
-    - Or use Git:
-    ```bash
-    git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
-    cd YOUR_SPACE_NAME
-    # Copy your files here
-    git add .
-    git commit -m "Initial commit"
-    git push
-    ```
-
- 3. **Wait for Build**
-    - Space will auto-deploy in 5-10 minutes
-    - Access at: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
-
- ### Enable GPU (Optional)
- - Go to Space Settings β†’ Hardware
- - Select GPU option for faster processing
- - Restart the Space
 
  ## πŸ› οΈ Models Used
 
- | Model | Purpose | Size |
- |-------|---------|------|
- | [BLIP-Captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) | Image Description | 447MB |
- | [BLIP-VQA](https://huggingface.co/Salesforce/blip-vqa-base) | Visual Q&A | 447MB |
- | [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) | Classification | 605MB |
 
- ## πŸ“– Usage Examples
 
- ### Image Captioning
- Upload an image β†’ Click "Generate Caption" β†’ Get description
 
  **Example Output:**
  ```
  πŸ“ Image Caption:
- A golden retriever sitting in a park with green grass
  ```
 
- ### Visual Question Answering
- Upload image β†’ Ask question β†’ Get answer
 
- **Example:**
  ```
- Q: What color is the car?
- A: red
  ```
 
- ### Zero-Shot Classification
- Upload image β†’ Define categories (comma-separated) β†’ Get probabilities
 
- **Example:**
  ```
- Categories: cat, dog, bird
- Results:
- cat: 92.5% β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
- dog: 5.2% β–ˆ
- bird: 2.3% β–Œ
  ```
 
- ### Multimodal Chat
- Upload image β†’ Chat naturally about it
 
- **Example:**
  ```
- You: Describe this image
- AI: A modern kitchen with white cabinets
- You: What color are the walls?
- AI: white
  ```
 
- ## βš™οΈ Configuration
 
  ### Change Models
  Edit `app.py` to use different models:
  ```python
- # Use larger BLIP model
  caption_model = BlipForConditionalGeneration.from_pretrained(
-     "Salesforce/blip-image-captioning-large"
  )
  ```
 
- ### Customize Interface
  Modify `custom_css` in `app.py`:
  ```python
  custom_css = """
  #title {
-     background: linear-gradient(90deg, #YOUR_COLOR 0%, #YOUR_COLOR 100%);
  }
  """
  ```
 
  ## πŸ› Troubleshooting
 
- **Issue: Models downloading slowly**
  ```bash
- # Set cache directory
- export HF_HOME=/path/to/storage
  ```
 
- **Issue: Out of memory**
  ```python
- # Use CPU only
  device = "cpu"
  ```
 
- **Issue: Port already in use**
  ```bash
  python app.py --server-port 8080
  ```
 
  ## πŸ“„ License
 
- MIT License - See [LICENSE](LICENSE) file
 
  ## πŸ™ Acknowledgments
 
- - [Salesforce BLIP](https://github.com/salesforce/BLIP)
- - [OpenAI CLIP](https://github.com/openai/CLIP)
- - [Hugging Face](https://huggingface.co/)
- - [Gradio](https://gradio.app/)
 
  ---
 
- **⭐ Star this project if you find it helpful!**
  app_file: app.py
  pinned: false
  ---
+ ---
+ title: Vision Language AI Demo
+ emoji: πŸ€–
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: "4.44.0"
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
 
  # πŸ€– Vision Language AI Demo
 
+ A comprehensive web application showcasing state-of-the-art Vision-Language AI models with an intuitive Gradio interface.
 
  ## ✨ Features
 
  ### πŸ–ΌοΈ Image Captioning
  Automatically generate natural language descriptions of images using the BLIP model.
+ - Auto-generates a caption when an image is uploaded
+ - Powered by the Salesforce BLIP model
 
  ### πŸ” Visual Question Answering (VQA)
  Ask questions about images and get intelligent answers based on visual content.
+ - Supports various question types
+ - Real-time visual understanding
 
  ### 🏷️ Zero-Shot Image Classification
  Classify images into custom categories without training, using the CLIP model.
+ - Define any categories you want
+ - Visual similarity scoring
+ - No training data required
 
  ### πŸ’¬ Multimodal Chat
  Interactive conversations about image content with context retention.
+ - Multi-turn dialogue support
+ - Natural language interaction
 
  ## πŸ“Έ Demo Screenshots
 
  ### Image Captioning
+ ![Image Captioning](source/image%20(1).png)
 
  ### Visual Question Answering
+ ![Visual Question Answering](source/image%20(1).png)
 
  ### Zero-Shot Classification
+ ![Zero-Shot Classification](source/image%20(1).png)
 
  ### Multimodal Chat
+ ![Multimodal Chat](source/image%20(1).png)
 
  ## πŸš€ Quick Start
 
  ### Local Run
  ```bash
+ # Install dependencies
  pip install -r requirements.txt
+
+ # Run the application
  python app.py
  ```
 
  Access at `http://localhost:7860`
 
  ### Deploy to Hugging Face Spaces
 
+ #### Method 1: Web Interface
+ 1. Go to https://huggingface.co/spaces
+ 2. Click **"Create new Space"**
+ 3. Fill in:
+    - Space name: `vision-language-ai-demo`
+    - License: MIT
+    - SDK: **Gradio**
+    - Hardware: CPU (free) or GPU (for faster processing)
+ 4. Upload files:
+    - `app.py`
+    - `requirements.txt`
+    - `README.md`
+    - `source/` folder (with screenshots)
+ 5. The Space will auto-deploy in 5-10 minutes
+
+ #### Method 2: Git
+ ```bash
+ # Clone your Space repository
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
+ cd YOUR_SPACE_NAME
+
+ # Copy your files
+ cp app.py requirements.txt README.md ./
+ cp -r source ./
+
+ # Push to Hugging Face
+ git add .
+ git commit -m "Initial commit"
+ git push
+ ```
+
+ #### Enable GPU (Optional)
+ 1. Go to **Settings** β†’ **Hardware**
+ 2. Select a **GPU** option
+ 3. Restart the Space
+
+ A GPU provides roughly 10-50x faster processing and a better user experience.
 
  ## πŸ› οΈ Models Used
118
 
119
+ | Model | Purpose | Size | Performance |
120
+ |-------|---------|------|-------------|
121
+ | [BLIP-Captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) | Image Description | 447MB | Fast |
122
+ | [BLIP-VQA](https://huggingface.co/Salesforce/blip-vqa-base) | Visual Q&A | 447MB | Fast |
123
+ | [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) | Classification | 605MB | Very Fast |
124
 
125
+ All models are open source and commercially usable.
126
 
127
+ ## πŸ“– Usage Guide
+
+ ### πŸ–ΌοΈ Image Captioning
+ 1. Navigate to the **"Image Captioning"** tab
+ 2. Upload an image (drag & drop or click to browse)
+ 3. The caption generates automatically
+ 4. Or click the **"🎨 Generate Caption"** button
 
  **Example Output:**
  ```
  πŸ“ Image Caption:
+ a cat sitting on a wooden table looking at the camera
  ```
 
+ **Use Cases:**
+ - Generate alt text for accessibility
+ - Auto-tag images for organization
+ - Content moderation
+ - Creative writing inspiration
+
+ ---
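The alt-text use case above needs only a little glue around the raw caption. A minimal sketch in plain Python (the function names and the `<img>` wrapping are illustrative, not part of the app):

```python
def to_alt_text(caption):
    """Clean a raw model caption for use as an HTML alt attribute."""
    text = caption.strip().rstrip(".")
    # Models like BLIP emit lowercase captions; capitalize the first letter
    return (text[:1].upper() + text[1:]) if text else "Image"

def img_tag(src, caption):
    """Wrap an image path and its generated caption in an <img> tag."""
    return f'<img src="{src}" alt="{to_alt_text(caption)}">'

print(img_tag("photo.jpg", "a cat sitting on a wooden table looking at the camera"))
# β†’ <img src="photo.jpg" alt="A cat sitting on a wooden table looking at the camera">
```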
 
+ ### πŸ” Visual Question Answering
+ 1. Go to the **"Visual Question Answering"** tab
+ 2. Upload an image
+ 3. Type your question in the text box
+ 4. Click **"πŸ€” Get Answer"**
 
+ **Example Questions:**
+ - "What color is the car?"
+ - "How many people are there?"
+ - "Is there a dog in the image?"
+ - "What is the person wearing?"
+
+ **Example Output:**
  ```
+ ❓ Question: What color is the car?
+ βœ… Answer: red
  ```
 
+ **Tips:**
+ - Ask specific, clear questions
+ - One question at a time works best
+ - Simple language gets better results
 
+ ---
+
+ ### 🏷️ Zero-Shot Classification
+ 1. Open the **"Zero-Shot Classification"** tab
+ 2. Upload an image
+ 3. Enter categories (comma-separated)
+    - Default: `cat, dog, bird, car, building`
+    - Custom: `sunny, cloudy, rainy, snowy`
+ 4. Click **"🎯 Classify"**
+
+ **Example Output:**
  ```
+ 🎯 Classification Results:
+
+ cat: 92.50% β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
+ dog: 5.20% β–ˆ
+ bird: 2.30% β–Œ
+ car: 0.00%
+ building: 0.00%
  ```
 
+ **Use Cases:**
+ - Content categorization
+ - Image filtering
+ - Quality control
+ - Custom tagging systems
+
+ ---
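Under the hood, CLIP embeds the image and each category string and softmax-normalizes the similarity scores; the bars above are just those probabilities rendered as text. A pure-Python sketch of that final step (the similarity values and helper names here are made up for illustration; the app's actual rendering may differ):

```python
import math

def classify_scores(similarities):
    """Softmax-normalize per-category similarity scores into probabilities."""
    exps = {label: math.exp(s) for label, s in similarities.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def render_bar(prob, width=20):
    """One full block per 5% probability; a half block for small nonzero scores."""
    full = int(prob * width)
    if full == 0 and prob >= 0.01:
        return "β–Œ"
    return "β–ˆ" * full

# Hypothetical similarity logits for an uploaded cat photo
probs = classify_scores({"cat": 5.0, "dog": 2.1, "bird": 1.3})
for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.2%} {render_bar(p)}")
```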
 
+ ### πŸ’¬ Multimodal Chat
+ 1. Select the **"Multimodal Chat"** tab
+ 2. Upload an image (left panel)
+ 3. Type your message and press Enter or click **"πŸ“€ Send"**
+ 4. Continue the conversation naturally
+ 5. Click **"πŸ—‘οΈ Clear Chat"** to start over
+
+ **Example Conversation:**
  ```
+ πŸ‘€ You: Describe this image
+ πŸ€– AI: a modern living room with a grey sofa
+
+ πŸ‘€ You: What color are the walls?
+ πŸ€– AI: white
+
+ πŸ‘€ You: Is there a window?
+ πŸ€– AI: yes
  ```
 
+ **Tips:**
+ - Start with broad questions
+ - Build on previous responses
+ - Keep questions related to the image
+
+ ---
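Context retention in the chat tab comes down to carrying the (question, answer) history into each new model call. A minimal sketch, with `answer_fn` as a hypothetical stand-in for the real VQA call (the app's actual implementation may differ):

```python
class ChatSession:
    """Minimal multi-turn chat state: keeps (question, answer) history."""

    def __init__(self):
        self.history = []

    def ask(self, question, answer_fn):
        # answer_fn stands in for the real model call; passing history
        # lets follow-up questions see earlier turns
        answer = answer_fn(question, self.history)
        self.history.append((question, answer))
        return answer

    def clear(self):
        """Equivalent of the "πŸ—‘οΈ Clear Chat" button."""
        self.history = []

# Demo with a stub model that just counts turns
stub = lambda q, h: f"(turn {len(h) + 1}) answer to: {q}"
chat = ChatSession()
chat.ask("Describe this image", stub)
chat.ask("What color are the walls?", stub)
```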
+
+ ## βš™οΈ Advanced Configuration
 
  ### Change Models
  Edit `app.py` to use different models:
+
  ```python
+ # Use the larger BLIP model for better quality
  caption_model = BlipForConditionalGeneration.from_pretrained(
+     "Salesforce/blip-image-captioning-large"  # 990MB, better quality
+ )
+
+ # Use the larger CLIP model
+ clip_model = CLIPModel.from_pretrained(
+     "openai/clip-vit-large-patch14"  # 1.7GB, more accurate
  )
  ```
 
+ ### Customize Interface Style
  Modify `custom_css` in `app.py`:
+
  ```python
  custom_css = """
  #title {
+     background: linear-gradient(90deg, #FF6B6B 0%, #4ECDC4 100%);
+     font-size: 3.5em;
  }
  """
  ```
 
+ ### Adjust Generation Parameters
+ Control model behavior:
+
+ ```python
+ # Generate longer captions
+ out = caption_model.generate(**inputs, max_length=100)
+
+ # More accurate but slower VQA
+ out = vqa_model.generate(**inputs, max_length=50, num_beams=5)
+ ```
+
  ## πŸ› Troubleshooting
268
 
269
+ ### Common Issues
270
+
271
+ **Models downloading slowly**
272
  ```bash
273
+ # Set cache directory to a location with more space
274
+ export HF_HOME=/path/to/large/storage
275
+ python app.py
276
  ```
277
 
278
+ **Out of memory error**
279
  ```python
280
+ # Add at the start of app.py
281
+ import torch
282
+ torch.cuda.empty_cache()
283
+
284
+ # Or force CPU usage
285
  device = "cpu"
286
  ```
287
 
288
+ **Port already in use**
289
  ```bash
290
+ # Use different port
291
  python app.py --server-port 8080
292
  ```
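One caveat: Gradio does not parse command-line flags on its own, so `--server-port` only works if `app.py` wires it through to `demo.launch(server_port=...)`. A sketch of that wiring (an assumption, not taken from this repo); alternatively, setting the `GRADIO_SERVER_PORT` environment variable needs no code changes, since Gradio reads it directly:

```python
import argparse

def parse_port(argv=None):
    """Parse the --server-port flag shown in the command above."""
    parser = argparse.ArgumentParser(description="Vision Language AI Demo")
    parser.add_argument("--server-port", type=int, default=7860,
                        help="port for the Gradio server")
    return parser.parse_args(argv).server_port

# app.py would then pass this to Gradio:
# demo.launch(server_port=parse_port())
```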
 
+ **Space build failing**
+ - Check `requirements.txt` for correct package versions
+ - Verify all files are uploaded correctly
+ - Check the build logs in the Space settings
+
+ ### Getting Help
+ - πŸ“– [Gradio Documentation](https://gradio.app/docs/)
+ - πŸ€— [Hugging Face Forums](https://discuss.huggingface.co/)
+ - πŸ’¬ [Gradio Discord](https://discord.gg/gradio)
+
+ ## πŸ“‹ Requirements
+
+ **System Requirements:**
+ - Python 3.8+
+ - 8GB RAM minimum (16GB recommended)
+ - 5GB free storage for models
+
+ **Dependencies:**
+ - gradio >= 4.0.0
+ - torch >= 2.0.0
+ - transformers >= 4.35.0
+ - Pillow >= 10.0.0
+
+ See `requirements.txt` for the complete list.
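The dependency floors above map directly onto `requirements.txt`; a minimal sketch if you are assembling the file yourself (these version pins are illustrative assumptions, not tested against this app):

```text
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
Pillow>=10.0.0
```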
 
  ## πŸ“„ License
 
+ MIT License - See [LICENSE](LICENSE) file for details.
+
+ ### Model Licenses
+ - **BLIP**: BSD-3-Clause License
+ - **CLIP**: MIT License
+
+ All models are free for commercial use.
 
  ## πŸ™ Acknowledgments
 
+ Built with amazing open-source projects:
+ - [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning and VQA
+ - [OpenAI CLIP](https://github.com/openai/CLIP) - Zero-shot classification
+ - [Hugging Face Transformers](https://huggingface.co/docs/transformers) - Model hub and inference
+ - [Gradio](https://gradio.app/) - Beautiful web interfaces
+
+ ## πŸ”— Links
+
+ - **Live Demo**: [Your Space URL]
+ - **GitHub Repository**: [Your Repo URL]
+ - **Report Issues**: [GitHub Issues]
 
  ---
 
+ <div align="center">
+
+ **⭐ If you find this project helpful, please star it! ⭐**
+
+ Made with ❀️ by the open-source community
+
+ </div>