---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# 🤖 Vision Language AI Demo

A comprehensive web application showcasing state-of-the-art vision-language AI models with an intuitive Gradio interface.

## ✨ Features
### 🖼️ Image Captioning

Automatically generate natural-language descriptions of images using the BLIP model (see the sketch below).

- Captions are generated automatically when an image is uploaded
- Powered by the Salesforce BLIP model
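
For reference, a minimal sketch of what the captioning step can look like with the 🤗 Transformers API (the `caption_image` helper is illustrative, not necessarily the exact code in `app.py`):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the captioning checkpoint once at startup (~447 MB download on first use)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image: Image.Image) -> str:
    """Generate a short natural-language caption for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```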
### ❓ Visual Question Answering (VQA)

Ask questions about images and get answers grounded in the visual content (see the sketch below).

- Supports various question types
- Real-time visual understanding
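
A comparable sketch for VQA, using the BLIP VQA checkpoint listed in the models table (again illustrative rather than the app's exact code):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def answer_question(image: Image.Image, question: str) -> str:
    """Answer a free-form question about an image; answers are short phrases."""
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```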
### 🏷️ Zero-Shot Image Classification

Classify images into custom categories without any training, using the CLIP model (see the sketch below).

- Define any categories you want
- Visual similarity scoring
- No training data required
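
Zero-shot classification works by embedding the image and each candidate label with CLIP and comparing the results; a minimal sketch (the function name and signature are illustrative):

```python
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def classify_image(image: Image.Image, labels):
    """Score an image against arbitrary text labels; no fine-tuning required."""
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)[0]  # one probability per label
    return {label: float(p) for label, p in zip(labels, probs)}
```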
### 💬 Multimodal Chat

Hold interactive conversations about image content with context retention (see the sketch below).

- Multi-turn dialogue support
- Natural language interaction
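
One simple way to get multi-turn chat out of the same models is to route each user message through the VQA helper and keep the running history in a list; this is only a hypothetical sketch, and the actual app may wire the chat differently:

```python
def chat_turn(image, user_message, history):
    """Append one (user, assistant) exchange; `history` is a list of message pairs."""
    reply = answer_question(image, user_message)  # VQA helper sketched above
    return history + [(user_message, reply)]
```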
## 📸 Demo Screenshots

Screenshots of each feature (image captioning, visual question answering, zero-shot classification, and multimodal chat) are included in the `source/` folder.
## 🚀 Quick Start

### Local Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

Access at `http://localhost:7860`
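
The interface itself is a tabbed Gradio app; below is a stripped-down, hypothetical skeleton of how such an `app.py` can be laid out (it assumes the `caption_image` and `answer_question` helpers sketched in the Features section are defined in the same file):

```python
import gradio as gr

with gr.Blocks(title="Vision Language AI Demo") as demo:
    with gr.Tab("Image Captioning"):
        img = gr.Image(type="pil")
        caption = gr.Textbox(label="Caption")
        img.upload(caption_image, inputs=img, outputs=caption)  # auto-caption on upload

    with gr.Tab("Visual Question Answering"):
        vqa_img = gr.Image(type="pil")
        question = gr.Textbox(label="Question")
        answer = gr.Textbox(label="Answer")
        gr.Button("🤔 Get Answer").click(answer_question, inputs=[vqa_img, question], outputs=answer)

demo.launch()  # serves on http://localhost:7860 by default
```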
### Deploy to Hugging Face Spaces

1. Go to https://huggingface.co/spaces
2. Click **"Create new Space"**
3. Fill in:
   - Space name: `vision-language-ai-demo`
   - License: MIT
   - SDK: **Gradio**
   - Hardware: CPU (free) or GPU (for faster processing)
4. Upload files (or push them from the command line, as sketched after these steps):
   - `app.py`
   - `requirements.txt`
   - `README.md`
   - `source/` folder (with screenshots)
5. The Space will auto-deploy in 5-10 minutes
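
As an alternative to uploading through the web UI, the same files can be pushed with the `huggingface_hub` client; the repo ID below is a placeholder for your own username and Space name:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login`
api.create_repo("your-username/vision-language-ai-demo", repo_type="space",
                space_sdk="gradio", exist_ok=True)
api.upload_folder(
    folder_path=".",                                   # local project directory
    repo_id="your-username/vision-language-ai-demo",   # placeholder repo ID
    repo_type="space",
)
```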
## 🛠️ Models Used

| Model | Purpose | Size | Performance |
|-------|---------|------|-------------|
| [BLIP-Captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) | Image Description | 447 MB | Fast |
| [BLIP-VQA](https://huggingface.co/Salesforce/blip-vqa-base) | Visual Q&A | 447 MB | Fast |
| [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) | Classification | 605 MB | Very Fast |

All models are open source and commercially usable.
## 📖 Usage Guide

### 🖼️ Image Captioning

1. Navigate to the **"Image Captioning"** tab
2. Upload an image (drag & drop or click to browse)
3. A caption is generated automatically
4. Or click the **"🎨 Generate Caption"** button

**Example Output:**

```
📝 Image Caption:
a cat sitting on a wooden table looking at the camera
```

**Use Cases:**

- Generate alt text for accessibility
- Auto-tag images for organization
- Content moderation
- Creative writing inspiration

---
### ❓ Visual Question Answering

1. Go to the **"Visual Question Answering"** tab
2. Upload an image
3. Type your question in the text box
4. Click **"🤔 Get Answer"**

**Example Questions:**

- "What color is the car?"
- "How many people are there?"
- "Is there a dog in the image?"
- "What is the person wearing?"

**Example Output:**

```
❓ Question: What color is the car?
✅ Answer: red
```

**Tips:**

- Ask specific, clear questions
- One question at a time works best
- Simple language gets better results

---
### 🏷️ Zero-Shot Classification

1. Open the **"Zero-Shot Classification"** tab
2. Upload an image
3. Enter categories (comma-separated)
   - Default: `cat, dog, bird, car, building`
   - Custom: `sunny, cloudy, rainy, snowy`
4. Click **"🎯 Classify"**

**Example Output:**

```
🎯 Classification Results:
cat: 92.50% ██████████████████
dog: 5.20% █
bird: 2.30% █
car: 0.00%
building: 0.00%
```
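
The percentage bars in the output above can be produced by scaling each probability to a fixed-width run of block characters; a hypothetical formatting helper:

```python
def format_results(scores, width=20):
    """Render a {label: probability} dict as percentages with text bars, best match first."""
    lines = ["🎯 Classification Results:"]
    for label, p in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{label}: {p:.2%} " + "█" * round(p * width))
    return "\n".join(lines)
```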
**Use Cases:**

- Content categorization
- Image filtering
- Quality control
- Custom tagging systems

---
### 💬 Multimodal Chat

1. Select the **"Multimodal Chat"** tab
2. Upload an image (left panel)
3. Type your message and press Enter or click **"📤 Send"**
4. Continue the conversation naturally
5. Click **"🗑️ Clear Chat"** to start over

**Example Conversation:**

```
👤 You: Describe this image
🤖 AI: a modern living room with a grey sofa
👤 You: What color are the walls?
🤖 AI: white
👤 You: Is there a window?
🤖 AI: yes
```

**Tips:**

- Start with broad questions
- Build on previous responses
- Keep questions related to the image
### Getting Help

- 📚 [Gradio Documentation](https://gradio.app/docs/)
- 🤗 [Hugging Face Forums](https://discuss.huggingface.co/)
- 💬 [Gradio Discord](https://discord.gg/gradio)
## 📋 Requirements

**System Requirements:**

- Python 3.8+
- 8 GB RAM minimum (16 GB recommended)
- 5 GB of free storage for models

**Dependencies:**

- gradio >= 4.0.0
- torch >= 2.0.0
- transformers >= 4.35.0
- Pillow >= 10.0.0

See `requirements.txt` for the complete list.
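
For reference, a minimal `requirements.txt` consistent with the versions above (the actual file may pin additional packages):

```text
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
Pillow>=10.0.0
```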
## 📄 License

MIT License. See the [LICENSE](LICENSE) file for details.

### Model Licenses

- **BLIP**: BSD-3-Clause License
- **CLIP**: MIT License
## 🙏 Acknowledgments

Built with amazing open-source projects:

- [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning and VQA
- [OpenAI CLIP](https://github.com/openai/CLIP) - Zero-shot classification
- [Hugging Face Transformers](https://huggingface.co/docs/transformers) - Model hub and inference
- [Gradio](https://gradio.app/) - Beautiful web interfaces

---