---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# 🤖 Vision Language AI Demo
A comprehensive web application showcasing state-of-the-art Vision-Language AI models with an intuitive Gradio interface.
## ✨ Features
### 🖼️ Image Captioning

Automatically generate natural-language descriptions of images using the BLIP model.
- Captions are generated automatically when an image is uploaded
- Powered by the Salesforce BLIP model
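
For reference, this is roughly what BLIP captioning looks like through 🤗 Transformers; a minimal sketch, not necessarily the exact code in `app.py`:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP base captioning checkpoint (downloaded on first use)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image):
    """Generate a short natural-language caption for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

print(caption_image(Image.open("cat.jpg")))  # e.g. "a cat sitting on a wooden table"
```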
### ❓ Visual Question Answering (VQA)

Ask questions about an image and get answers grounded in its visual content.
- Supports various question types
- Real-time visual understanding
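
VQA uses a separate BLIP checkpoint that conditions generation on both the image and the question. A minimal sketch, assuming the `Salesforce/blip-vqa-base` checkpoint:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def answer_question(image, question):
    """Answer a free-form question about the image content."""
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(output_ids[0], skip_special_tokens=True)

print(answer_question(Image.open("car.jpg"), "What color is the car?"))  # e.g. "red"
```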
### 🏷️ Zero-Shot Image Classification

Classify images into custom categories without training, using the CLIP model.
- Define any categories you want
- Visual similarity scoring
- No training data required
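
Under the hood, CLIP scores each label by image-text similarity and normalizes the scores with a softmax, which is where the percentages in the classification results come from. A minimal sketch with the CLIP ViT-B/32 checkpoint:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify(image, labels):
    """Score an image against arbitrary text labels by image-text similarity."""
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_labels)
    probs = logits.softmax(dim=-1)[0]
    return {label: round(float(p) * 100, 2) for label, p in zip(labels, probs)}

print(classify(Image.open("pet.jpg"), ["cat", "dog", "bird", "car", "building"]))
```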
### 💬 Multimodal Chat

Have interactive conversations about image content, with context retained across turns.
- Multi-turn dialogue support
- Natural language interaction
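
This README does not spell out how the chat tab is implemented, but one plausible wiring routes each turn through the captioning and VQA models above (`caption_image` and `answer_question` are the hypothetical helpers from the earlier sketches):

```python
# Hypothetical chat routing; the real app.py may differ.
history = []  # list of (user message, AI reply) pairs

def chat_turn(image, message):
    """Answer one chat turn about the current image and record it."""
    if message.strip().lower().startswith(("describe", "tell me about")):
        reply = caption_image(image)             # broad prompts -> captioning
    else:
        reply = answer_question(image, message)  # specific questions -> VQA
    history.append((message, reply))
    return reply
```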
## 📸 Demo Screenshots

Screenshots of each tab live in the `source/` folder:
- Image Captioning
- Visual Question Answering
- Zero-Shot Classification
- Multimodal Chat
## 🚀 Quick Start

### Local Run
```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```
Access at http://localhost:7860
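
For orientation, `app.py` presumably builds a tabbed Gradio Blocks interface along these lines; a minimal, illustrative skeleton (component names are assumptions, not the actual code):

```python
import gradio as gr

# caption_image is the captioning helper sketched in the Features section
with gr.Blocks(title="Vision Language AI Demo") as demo:
    with gr.Tab("Image Captioning"):
        image = gr.Image(type="pil", label="Image")
        caption = gr.Textbox(label="Caption")
        gr.Button("🎨 Generate Caption").click(caption_image, inputs=image, outputs=caption)
    # ...remaining tabs (VQA, Zero-Shot Classification, Multimodal Chat)...

demo.launch(server_name="0.0.0.0", server_port=7860)
```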
### Deploy to Hugging Face Spaces

- Go to https://huggingface.co/spaces
- Click "Create new Space"
- Fill in:
  - Space name: `vision-language-ai-demo`
  - License: MIT
  - SDK: Gradio
  - Hardware: CPU (free) or GPU (for faster processing)
- Upload the files: `app.py`, `requirements.txt`, `README.md`, and the `source/` folder (with screenshots)
- The Space will auto-deploy in 5-10 minutes
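
If you prefer the command line to the web uploader, every Space is also a git repository, so the files can be pushed directly (replace `YOUR_USERNAME` with your account name):

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/vision-language-ai-demo
cd vision-language-ai-demo
# copy app.py, requirements.txt, README.md and source/ into the repo, then:
git add .
git commit -m "Add Vision Language AI demo"
git push
```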
## 🛠️ Models Used
| Model | Purpose | Size | Performance |
|---|---|---|---|
| BLIP-Captioning | Image Description | 447MB | Fast |
| BLIP-VQA | Visual Q&A | 447MB | Fast |
| CLIP-ViT-B/32 | Classification | 605MB | Very Fast |
All models are open source and commercially usable.
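
Since the three checkpoints total roughly 1.5GB, it can help to pre-download them once (e.g. in a build step) so the first request doesn't stall; a small sketch using `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Pre-fetch all three checkpoints into the local Hugging Face cache
for repo_id in (
    "Salesforce/blip-image-captioning-base",  # captioning, ~447MB
    "Salesforce/blip-vqa-base",               # VQA, ~447MB
    "openai/clip-vit-base-patch32",           # zero-shot classification, ~605MB
):
    snapshot_download(repo_id)
```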
## 📖 Usage Guide

### 🖼️ Image Captioning

- Navigate to the "Image Captioning" tab
- Upload an image (drag & drop or click to browse)
- The caption is generated automatically
- Or click the "🎨 Generate Caption" button
Example Output:

```
📝 Image Caption:
a cat sitting on a wooden table looking at the camera
```
Use Cases:
- Generate alt text for accessibility
- Auto-tag images for organization
- Content moderation
- Creative writing inspiration
### ❓ Visual Question Answering

- Go to the "Visual Question Answering" tab
- Upload an image
- Type your question in the text box
- Click "🤔 Get Answer"
Example Questions:
- "What color is the car?"
- "How many people are there?"
- "Is there a dog in the image?"
- "What is the person wearing?"
Example Output:

```
❓ Question: What color is the car?
✅ Answer: red
```
Tips:
- Ask specific, clear questions
- One question at a time works best
- Simple language gets better results
### 🏷️ Zero-Shot Classification

- Open the "Zero-Shot Classification" tab
- Upload an image
- Enter categories (comma-separated)
  - Default: `cat, dog, bird, car, building`
  - Custom example: `sunny, cloudy, rainy, snowy`
- Click "🎯 Classify"
Example Output:

```
🎯 Classification Results:
cat: 92.50% ██████████████████
dog: 5.20% █
bird: 2.30% █
car: 0.00%
building: 0.00%
```
Use Cases:
- Content categorization
- Image filtering
- Quality control
- Custom tagging systems
### 💬 Multimodal Chat

- Select the "Multimodal Chat" tab
- Upload an image (left panel)
- Type your message and press Enter or click "📤 Send"
- Continue the conversation naturally
- Click "🗑️ Clear Chat" to start over
Example Conversation:

```
👤 You: Describe this image
🤖 AI: a modern living room with a grey sofa
👤 You: What color are the walls?
🤖 AI: white
👤 You: Is there a window?
🤖 AI: yes
```
Tips:
- Start with broad questions
- Build on previous responses
- Keep questions related to the image
## Getting Help

- 📚 Gradio Documentation
- 🤗 Hugging Face Forums
- 💬 Gradio Discord
## 📋 Requirements
System Requirements:
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 5GB free storage for models
Dependencies:
- gradio >= 4.0.0
- torch >= 2.0.0
- transformers >= 4.35.0
- Pillow >= 10.0.0
See `requirements.txt` for the complete list.
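
A minimal `requirements.txt` consistent with the pins above might look like this (the repository's file may list more):

```text
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
Pillow>=10.0.0
```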
## 📄 License
MIT License - See LICENSE file for details.
### Model Licenses
- BLIP: BSD-3-Clause License
- CLIP: MIT License
## 🙏 Acknowledgments
Built with amazing open-source projects:
- Salesforce BLIP - Image captioning and VQA
- OpenAI CLIP - Zero-shot classification
- Hugging Face Transformers - Model hub and inference
- Gradio - Beautiful web interfaces