---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# πŸ€– Vision Language AI Demo
A comprehensive web application showcasing state-of-the-art Vision-Language AI models with an intuitive Gradio interface.
## ✨ Features
### πŸ–ΌοΈ Image Captioning
Automatically generate natural-language descriptions of images using the BLIP model.
- A caption is generated automatically as soon as an image is uploaded
- Powered by the Salesforce BLIP model
### πŸ” Visual Question Answering (VQA)
Ask questions about images and get intelligent answers based on visual content.
- Supports various question types
- Real-time visual understanding
### 🏷️ Zero-Shot Image Classification
Classify images into custom categories with the CLIP model, with no task-specific training.
- Define any categories you want
- Visual similarity scoring
- No training data required
### πŸ’¬ Multimodal Chat
Interactive conversations about image content with context retention.
- Multi-turn dialogue support
- Natural language interaction
## πŸ“Έ Demo Screenshots
### Image Captioning
![Image Captioning](source/image%20(4).png)
### Visual Question Answering
![Visual Question Answering](source/image%20(3).png)
### Zero-Shot Classification
![Zero-Shot Classification](source/image%20(2).png)
### Multimodal Chat
![Multimodal Chat](source/image%20(1).png)
## πŸš€ Quick Start
### Local Run
```bash
# Install dependencies
pip install -r requirements.txt
# Run the application
python app.py
```
Access at `http://localhost:7860`
### Deploy to Hugging Face Spaces
1. Go to https://huggingface.co/spaces
2. Click **"Create new Space"**
3. Fill in:
- Space name: `vision-language-ai-demo`
- License: MIT
- SDK: **Gradio**
- Hardware: CPU (free) or GPU (for faster processing)
4. Upload files:
- `app.py`
- `requirements.txt`
- `README.md`
- `source/` folder (with screenshots)
5. The Space will build and deploy automatically in 5-10 minutes
## πŸ› οΈ Models Used
| Model | Purpose | Size | Performance |
|-------|---------|------|-------------|
| [BLIP-Captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) | Image Description | 447MB | Fast |
| [BLIP-VQA](https://huggingface.co/Salesforce/blip-vqa-base) | Visual Q&A | 447MB | Fast |
| [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) | Classification | 605MB | Very Fast |
All models are open source and commercially usable.
## πŸ“– Usage Guide
### πŸ–ΌοΈ Image Captioning
1. Navigate to **"Image Captioning"** tab
2. Upload an image (drag & drop or click to browse)
3. A caption is generated automatically
4. Alternatively, click the **"🎨 Generate Caption"** button to regenerate
**Example Output:**
```
πŸ“ Image Caption:
a cat sitting on a wooden table looking at the camera
```
**Use Cases:**
- Generate alt text for accessibility
- Auto-tag images for organization
- Content moderation
- Creative writing inspiration
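
Under the hood, captioning with BLIP takes only a few lines via `transformers`. A minimal sketch (the exact code in `app.py` may differ; a blank placeholder image stands in for the upload):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Placeholder image; in the app this is the uploaded picture.
image = Image.new("RGB", (224, 224), "white")

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
```

The same processor handles both image preprocessing and token decoding, which keeps the pipeline compact.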
---
### πŸ” Visual Question Answering
1. Go to **"Visual Question Answering"** tab
2. Upload an image
3. Type your question in the text box
4. Click **"πŸ€” Get Answer"**
**Example Questions:**
- "What color is the car?"
- "How many people are there?"
- "Is there a dog in the image?"
- "What is the person wearing?"
**Example Output:**
```
❓ Question: What color is the car?
βœ… Answer: red
```
**Tips:**
- Ask specific, clear questions
- One question at a time works best
- Simple language gets better results
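
VQA uses the same BLIP family, but the processor takes the question alongside the image. A hedged sketch of the call the app presumably makes (placeholder image, hypothetical question):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Placeholder image; in the app this is the uploaded picture.
image = Image.new("RGB", (224, 224), "red")
question = "What color is the image?"

# The processor encodes both the image and the question text together.
inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10)
answer = processor.decode(out[0], skip_special_tokens=True)
print(answer)
```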
---
### 🏷️ Zero-Shot Classification
1. Open **"Zero-Shot Classification"** tab
2. Upload an image
3. Enter categories (comma-separated)
- Default: `cat, dog, bird, car, building`
- Custom: `sunny, cloudy, rainy, snowy`
4. Click **"🎯 Classify"**
**Example Output:**
```
🎯 Classification Results:
cat: 92.50% β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
dog: 5.20% β–ˆ
bird: 2.30% β–Œ
car: 0.00%
building: 0.00%
```
**Use Cases:**
- Content categorization
- Image filtering
- Quality control
- Custom tagging systems
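
Zero-shot classification works by scoring the image against a text prompt for each category and softmaxing the similarities. A minimal sketch with CLIP, assuming the standard `transformers` API (the prompt template `"a photo of a ..."` is a common convention, not necessarily what `app.py` uses):

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "bird", "car", "building"]
image = Image.new("RGB", (224, 224), "white")  # placeholder for the upload

inputs = processor(text=[f"a photo of a {label}" for label in labels],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# Softmax turns similarities into a probability over the candidate labels.
probs = logits.softmax(dim=1)[0]
for label, p in sorted(zip(labels, probs.tolist()), key=lambda x: -x[1]):
    print(f"{label}: {p:.2%}")
```

Because the labels are just text, swapping in a new category set requires no retraining.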
---
### πŸ’¬ Multimodal Chat
1. Select **"Multimodal Chat"** tab
2. Upload an image (left panel)
3. Type your message and press Enter or click **"πŸ“€ Send"**
4. Continue the conversation naturally
5. Click **"πŸ—‘οΈ Clear Chat"** to start over
**Example Conversation:**
```
πŸ‘€ You: Describe this image
πŸ€– AI: a modern living room with a grey sofa
πŸ‘€ You: What color are the walls?
πŸ€– AI: white
πŸ‘€ You: Is there a window?
πŸ€– AI: yes
```
**Tips:**
- Start with broad questions
- Build on previous responses
- Keep questions related to the image
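
One simple way to get multi-turn behavior from single-turn models is to route each message through the VQA model while keeping a running history, which is roughly the Gradio chatbot pattern. A sketch with a stubbed model call (`fake_vqa` is a hypothetical stand-in; the real app would call BLIP-VQA there):

```python
def chat_turn(history, image, message, answer_fn):
    """Run one chat turn: ask the model about the image, record the exchange."""
    reply = answer_fn(image, message)
    history.append((message, reply))
    return history

# Stand-in for the real model call, for illustration only.
def fake_vqa(image, question):
    return "white" if "wall" in question.lower() else "a modern living room"

history = []
history = chat_turn(history, None, "Describe this image", fake_vqa)
history = chat_turn(history, None, "What color are the walls?", fake_vqa)
print(history)
```

The history list maps directly onto the `(user, assistant)` tuple format that Gradio's `Chatbot` component accepts.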
## 🆘 Getting Help
- πŸ“– [Gradio Documentation](https://gradio.app/docs/)
- πŸ€— [Hugging Face Forums](https://discuss.huggingface.co/)
- πŸ’¬ [Gradio Discord](https://discord.gg/gradio)
## πŸ“‹ Requirements
**System Requirements:**
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 5GB free storage for models
**Dependencies:**
- gradio >= 4.0.0
- torch >= 2.0.0
- transformers >= 4.35.0
- Pillow >= 10.0.0
See `requirements.txt` for complete list.
## πŸ“„ License
MIT License - See [LICENSE](LICENSE) file for details.
### Model Licenses
- **BLIP**: BSD-3-Clause License
- **CLIP**: MIT License
## πŸ™ Acknowledgments
Built with amazing open-source projects:
- [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning and VQA
- [OpenAI CLIP](https://github.com/openai/CLIP) - Zero-shot classification
- [Hugging Face Transformers](https://huggingface.co/docs/transformers) - Model hub and inference
- [Gradio](https://gradio.app/) - Beautiful web interfaces
---