---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# 🤖 Vision Language AI Demo
A comprehensive web application showcasing state-of-the-art Vision-Language AI models with an intuitive Gradio interface.
## ✨ Features
### 🖼️ Image Captioning

Automatically generate natural-language descriptions of images using the BLIP model.
- Captions are generated automatically when an image is uploaded
- Powered by the Salesforce BLIP model
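
For reference, this is roughly what BLIP captioning looks like through 🤗 Transformers; a minimal sketch, not necessarily the exact code in `app.py`:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP base captioning checkpoint (downloaded on first use)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image):
    """Generate a short natural-language caption for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

print(caption_image(Image.open("cat.jpg")))  # e.g. "a cat sitting on a wooden table"
```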
### ❓ Visual Question Answering (VQA)

Ask questions about an image and get answers grounded in its visual content.
- Supports various question types
- Real-time visual understanding
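
VQA uses a separate BLIP checkpoint that conditions generation on both the image and the question. A minimal sketch, assuming the `Salesforce/blip-vqa-base` checkpoint:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def answer_question(image, question):
    """Answer a free-form question about the image content."""
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(output_ids[0], skip_special_tokens=True)

print(answer_question(Image.open("car.jpg"), "What color is the car?"))  # e.g. "red"
```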
### 🏷️ Zero-Shot Image Classification

Classify images into custom categories without training, using the CLIP model.
- Define any categories you want
- Visual similarity scoring
- No training data required
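
Under the hood, CLIP scores each label by image-text similarity and normalizes the scores with a softmax, which is where the percentages in the classification results come from. A minimal sketch with the CLIP ViT-B/32 checkpoint:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify(image, labels):
    """Score an image against arbitrary text labels by image-text similarity."""
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_labels)
    probs = logits.softmax(dim=-1)[0]
    return {label: round(float(p) * 100, 2) for label, p in zip(labels, probs)}

print(classify(Image.open("pet.jpg"), ["cat", "dog", "bird", "car", "building"]))
```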
### 💬 Multimodal Chat

Have interactive conversations about image content, with context retained across turns.
- Multi-turn dialogue support
- Natural language interaction
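
This README does not spell out how the chat tab is implemented, but one plausible wiring routes each turn through the captioning and VQA models above (`caption_image` and `answer_question` are the hypothetical helpers from the earlier sketches):

```python
# Hypothetical chat routing; the real app.py may differ.
history = []  # list of (user message, AI reply) pairs

def chat_turn(image, message):
    """Answer one chat turn about the current image and record it."""
    if message.strip().lower().startswith(("describe", "tell me about")):
        reply = caption_image(image)             # broad prompts -> captioning
    else:
        reply = answer_question(image, message)  # specific questions -> VQA
    history.append((message, reply))
    return reply
```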
## 📸 Demo Screenshots

Screenshots of each tab live in the `source/` folder:
- Image Captioning
- Visual Question Answering
- Zero-Shot Classification
- Multimodal Chat
## 🚀 Quick Start

### Local Run
```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```
Access at http://localhost:7860
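
For orientation, `app.py` presumably builds a tabbed Gradio Blocks interface along these lines; a minimal, illustrative skeleton (component names are assumptions, not the actual code):

```python
import gradio as gr

# caption_image is the captioning helper sketched in the Features section
with gr.Blocks(title="Vision Language AI Demo") as demo:
    with gr.Tab("Image Captioning"):
        image = gr.Image(type="pil", label="Image")
        caption = gr.Textbox(label="Caption")
        gr.Button("🎨 Generate Caption").click(caption_image, inputs=image, outputs=caption)
    # ...remaining tabs (VQA, Zero-Shot Classification, Multimodal Chat)...

demo.launch(server_name="0.0.0.0", server_port=7860)
```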
### Deploy to Hugging Face Spaces

- Go to https://huggingface.co/spaces
- Click "Create new Space"
- Fill in:
  - Space name: `vision-language-ai-demo`
  - License: MIT
  - SDK: Gradio
  - Hardware: CPU (free) or GPU (for faster processing)
- Upload the files: `app.py`, `requirements.txt`, `README.md`, and the `source/` folder (with screenshots)
- The Space will auto-deploy in 5-10 minutes
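
If you prefer the command line to the web uploader, every Space is also a git repository, so the files can be pushed directly (replace `YOUR_USERNAME` with your account name):

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/vision-language-ai-demo
cd vision-language-ai-demo
# copy app.py, requirements.txt, README.md and source/ into the repo, then:
git add .
git commit -m "Add Vision Language AI demo"
git push
```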
## 🛠️ Models Used
| Model | Purpose | Size | Performance |
|---|---|---|---|
| BLIP-Captioning | Image Description | 447MB | Fast |
| BLIP-VQA | Visual Q&A | 447MB | Fast |
| CLIP-ViT-B/32 | Classification | 605MB | Very Fast |
All models are open source and commercially usable.
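
Since the three checkpoints total roughly 1.5GB, it can help to pre-download them once (e.g. in a build step) so the first request doesn't stall; a small sketch using `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Pre-fetch all three checkpoints into the local Hugging Face cache
for repo_id in (
    "Salesforce/blip-image-captioning-base",  # captioning, ~447MB
    "Salesforce/blip-vqa-base",               # VQA, ~447MB
    "openai/clip-vit-base-patch32",           # zero-shot classification, ~605MB
):
    snapshot_download(repo_id)
```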
## 📖 Usage Guide

### 🖼️ Image Captioning

- Navigate to the "Image Captioning" tab
- Upload an image (drag & drop or click to browse)
- The caption is generated automatically
- Or click the "🎨 Generate Caption" button
Example Output:

```
📝 Image Caption:
a cat sitting on a wooden table looking at the camera
```
Use Cases:
- Generate alt text for accessibility
- Auto-tag images for organization
- Content moderation
- Creative writing inspiration
### ❓ Visual Question Answering

- Go to the "Visual Question Answering" tab
- Upload an image
- Type your question in the text box
- Click "🤔 Get Answer"
Example Questions:
- "What color is the car?"
- "How many people are there?"
- "Is there a dog in the image?"
- "What is the person wearing?"
Example Output:

```
❓ Question: What color is the car?
✅ Answer: red
```
Tips:
- Ask specific, clear questions
- One question at a time works best
- Simple language gets better results
### 🏷️ Zero-Shot Classification

- Open the "Zero-Shot Classification" tab
- Upload an image
- Enter categories (comma-separated)
  - Default: `cat, dog, bird, car, building`
  - Custom example: `sunny, cloudy, rainy, snowy`
- Click "🎯 Classify"
Example Output:

```
🎯 Classification Results:
cat: 92.50% ██████████████████
dog: 5.20% █
bird: 2.30% █
car: 0.00%
building: 0.00%
```
Use Cases:
- Content categorization
- Image filtering
- Quality control
- Custom tagging systems
### 💬 Multimodal Chat

- Select the "Multimodal Chat" tab
- Upload an image (left panel)
- Type your message and press Enter or click "📤 Send"
- Continue the conversation naturally
- Click "🗑️ Clear Chat" to start over
Example Conversation:

```
👤 You: Describe this image
🤖 AI: a modern living room with a grey sofa
👤 You: What color are the walls?
🤖 AI: white
👤 You: Is there a window?
🤖 AI: yes
```
Tips:
- Start with broad questions
- Build on previous responses
- Keep questions related to the image
## Getting Help

- 📚 Gradio Documentation
- 🤗 Hugging Face Forums
- 💬 Gradio Discord
## 📋 Requirements
System Requirements:
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 5GB free storage for models
Dependencies:
- gradio >= 4.0.0
- torch >= 2.0.0
- transformers >= 4.35.0
- Pillow >= 10.0.0
See `requirements.txt` for the complete list.
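
A minimal `requirements.txt` consistent with the pins above might look like this (the repository's file may list more):

```text
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
Pillow>=10.0.0
```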
## 📄 License
MIT License - See LICENSE file for details.
### Model Licenses
- BLIP: BSD-3-Clause License
- CLIP: MIT License
## 🙏 Acknowledgments
Built with amazing open-source projects:
- Salesforce BLIP - Image captioning and VQA
- OpenAI CLIP - Zero-shot classification
- Hugging Face Transformers - Model hub and inference
- Gradio - Beautiful web interfaces