metadata
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false


🤖 Vision Language AI Demo

A Gradio web application that demonstrates four vision-language tasks in one interface: image captioning, visual question answering, zero-shot image classification, and multimodal chat.

✨ Features

🖼️ Image Captioning

Automatically generate natural-language descriptions of images using the BLIP model.

  • Auto-generates a caption when an image is uploaded
  • Powered by the Salesforce BLIP model

πŸ” Visual Question Answering (VQA)

Ask questions about images and get intelligent answers based on visual content.

  • Supports various question types
  • Real-time visual understanding

🏷️ Zero-Shot Image Classification

Classify images into custom categories, without any training, using the CLIP model.

  • Define any categories you want
  • Categories are scored by their similarity to the image
  • No training data required

💬 Multimodal Chat

Interactive conversations about image content with context retention.

  • Multi-turn dialogue support
  • Natural language interaction

📸 Demo Screenshots

Image Captioning

[Screenshot: Image Captioning]

Visual Question Answering

[Screenshot: Visual Question Answering]

Zero-Shot Classification

[Screenshot: Zero-Shot Classification]

Multimodal Chat

[Screenshot: Multimodal Chat]

🚀 Quick Start

Local Run

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

Then open http://localhost:7860 in your browser.
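For orientation, here is a minimal sketch of how an `app.py` entry point could wire an image upload to a callback and serve it on port 7860. The project's actual `app.py` is not shown in this README, so the names `describe_upload` and `build_demo` are hypothetical, and the callback is a placeholder rather than a real captioner.

```python
def describe_upload(image):
    """Placeholder callback (hypothetical): stands in for a real captioning call."""
    if image is None:
        return "Please upload an image first."
    width, height = image.size  # PIL images expose size as (width, height)
    return f"Received a {width}x{height} image."


def build_demo():
    """Build a one-tab Gradio Blocks UI around the placeholder callback."""
    # Imported lazily so the callback above can be used without Gradio installed.
    import gradio as gr

    with gr.Blocks(title="Vision Language AI Demo") as demo:
        with gr.Tab("Image Captioning"):
            image = gr.Image(type="pil", label="Image")
            caption = gr.Textbox(label="Caption")
            # Run the callback whenever the image value changes (e.g. on upload).
            image.change(describe_upload, inputs=image, outputs=caption)
    return demo


# To serve locally on the port mentioned above:
# build_demo().launch(server_port=7860)
```

Keeping the callback separate from the UI construction makes it possible to test the logic without starting a server.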

Deploy to Hugging Face Spaces

  1. Go to https://huggingface.co/spaces
  2. Click "Create new Space"
  3. Fill in:
    • Space name: vision-language-ai-demo
    • License: MIT
    • SDK: Gradio
    • Hardware: CPU (free) or GPU (for faster processing)
  4. Upload files:
    • app.py
    • requirements.txt
    • README.md
    • source/ folder (with screenshots)
  5. The Space builds and deploys automatically, typically within 5-10 minutes
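The upload step can also be scripted with the `huggingface_hub` client instead of the web UI. This is an alternative to the flow above, not something the project itself documents. The sketch assumes you are already logged in (for example via `huggingface-cli login`); `missing_files` and `deploy_space` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable, List

# Files named in the upload list above.
REQUIRED_FILES = ("app.py", "requirements.txt", "README.md")


def missing_files(folder: str, required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required file names that are absent from the folder."""
    root = Path(folder)
    return [name for name in required if not (root / name).is_file()]


def deploy_space(folder: str, repo_id: str) -> None:
    """Create the Space if it does not exist, then upload the folder contents."""
    # Imported lazily so missing_files() works without huggingface_hub installed.
    from huggingface_hub import HfApi

    missing = missing_files(folder)
    if missing:
        raise FileNotFoundError(f"Missing required files: {', '.join(missing)}")

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="space")
```

A call would look like `deploy_space(".", "your-username/vision-language-ai-demo")`, where `your-username` is a placeholder for your own account.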

🛠️ Models Used

| Model | Purpose | Size | Performance |
|---|---|---|---|
| BLIP-Captioning | Image Description | 447 MB | Fast |
| BLIP-VQA | Visual Q&A | 447 MB | Fast |
| CLIP-ViT-B/32 | Classification | 605 MB | Very Fast |

All models are open source and commercially usable.

📖 Usage Guide

🖼️ Image Captioning

  1. Navigate to the "Image Captioning" tab
  2. Upload an image (drag & drop or click to browse)
  3. The caption is generated automatically
  4. Alternatively, click the "🎨 Generate Caption" button

Example Output:

πŸ“ Image Caption:
a cat sitting on a wooden table looking at the camera
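Under the hood, a caption like the one above can be produced with the BLIP classes in `transformers`. The following is a minimal sketch, not the app's actual code: the checkpoint ID `Salesforce/blip-image-captioning-base` is an assumption (the README only says "Salesforce BLIP"), and `load_captioner` and `generate_caption` are hypothetical names.

```python
def load_captioner(model_id: str = "Salesforce/blip-image-captioning-base"):
    """Load a BLIP captioning processor and model (the checkpoint ID is an assumption)."""
    # Imported lazily so generate_caption() can be exercised with stand-in objects.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return processor, model


def generate_caption(image, processor, model, max_new_tokens: int = 30) -> str:
    """Return a caption for a PIL image using the given processor and model."""
    # Convert the image into model-ready tensors.
    inputs = processor(images=image, return_tensors="pt")
    # Generate caption token IDs, then decode them back to text.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

Typical use would be `processor, model = load_captioner()` once at startup, followed by `generate_caption(image, processor, model)` per upload, so the weights are not reloaded on every request.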

Use Cases:

  • Generate alt text for accessibility
  • Auto-tag images for organization
  • Content moderation
  • Creative writing inspiration

πŸ” Visual Question Answering

  1. Go to "Visual Question Answering" tab
  2. Upload an image
  3. Type your question in the text box
  4. Click "πŸ€” Get Answer"

Example Questions:

  • "What color is the car?"
  • "How many people are there?"
  • "Is there a dog in the image?"
  • "What is the person wearing?"

Example Output:

❓ Question: What color is the car?
✅ Answer: red
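A question/answer pair like this can be produced with BLIP's question-answering head in `transformers`. This is a sketch under assumptions: the checkpoint ID `Salesforce/blip-vqa-base` is not stated in this README, and `load_vqa` and `answer_question` are hypothetical names.

```python
def load_vqa(model_id: str = "Salesforce/blip-vqa-base"):
    """Load a BLIP VQA processor and model (the checkpoint ID is an assumption)."""
    # Imported lazily so answer_question() can be exercised with stand-in objects.
    from transformers import BlipForQuestionAnswering, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForQuestionAnswering.from_pretrained(model_id)
    return processor, model


def answer_question(image, question: str, processor, model) -> str:
    """Answer a single natural-language question about a PIL image."""
    question = question.strip()
    if not question:
        raise ValueError("Question must not be empty.")
    # Encode the image and the question together.
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

The model answers one question per call, which is consistent with the tip that one question at a time works best.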

Tips:

  • Ask specific, clear questions
  • One question at a time works best
  • Simple language gets better results

🏷️ Zero-Shot Classification

  1. Open the "Zero-Shot Classification" tab
  2. Upload an image
  3. Enter categories (comma-separated)
    • Default: cat, dog, bird, car, building
    • Custom: sunny, cloudy, rainy, snowy
  4. Click "🎯 Classify"
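The steps above map onto CLIP roughly as follows: split the comma-separated input into labels, turn each label into a text prompt, and take a softmax over the image-text similarity logits. This is a sketch under assumptions: the checkpoint ID `openai/clip-vit-base-patch32` and the prompt template `"a photo of a {label}"` are common choices, not confirmed by this README, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_categories(raw: str) -> List[str]:
    """Split a comma-separated string into clean labels, dropping blanks and duplicates."""
    seen, labels = set(), []
    for part in raw.split(","):
        label = part.strip()
        if label and label.lower() not in seen:
            seen.add(label.lower())
            labels.append(label)
    return labels


def classify(image, raw_categories: str,
             model_id: str = "openai/clip-vit-base-patch32") -> Dict[str, float]:
    """Return a probability per category for a PIL image (checkpoint ID is an assumption)."""
    # Imported lazily so parse_categories() works without torch/transformers installed.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    labels = parse_categories(raw_categories)
    if not labels:
        raise ValueError("Enter at least one category.")

    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)
    prompts = [f"a photo of a {label}" for label in labels]  # assumed prompt template
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, number of labels)
    probs = logits.softmax(dim=-1)[0].tolist()
    return dict(zip(labels, probs))
```

Because the softmax is taken only over the categories supplied, the scores always sum to 100% even when none of the categories fits the image well. A real app would load the model once at startup rather than inside `classify`.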

Example Output:

🎯 Classification Results:

cat:     92.50% ██████████████████
dog:      5.20% █
bird:     2.30% ▌
car:      0.00%
building: 0.00%
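A text bar chart in this style can be rendered from a label-to-probability mapping with a few lines of Python. The rule used here (one full block per 5%, and a half block when the value is nonzero but below 5%) is inferred from the sample output above and is an assumption, not the app's documented behavior. The names `render_bar` and `format_results` are hypothetical.

```python
from typing import Dict

FULL_BLOCK = "█"
HALF_BLOCK = "▌"


def render_bar(percent: float, step: float = 5.0) -> str:
    """Draw one full block per `step` percent; a half block for small nonzero values."""
    full = int(percent // step)
    if full == 0 and round(percent, 2) > 0:
        return HALF_BLOCK
    return FULL_BLOCK * full


def format_results(scores: Dict[str, float]) -> str:
    """Format probabilities (0.0 to 1.0) as aligned lines, highest score first."""
    width = max(len(label) for label in scores) + 1  # +1 leaves room for the colon
    lines = []
    for label, prob in sorted(scores.items(), key=lambda item: item[1], reverse=True):
        percent = prob * 100
        name = label + ":"
        line = f"{name:<{width}}{percent:5.2f}% {render_bar(percent)}"
        lines.append(line.rstrip())  # no trailing space when the bar is empty
    return "\n".join(lines)
```

Sorting is stable, so categories with equal scores keep the order in which they were entered.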

Use Cases:

  • Content categorization
  • Image filtering
  • Quality control
  • Custom tagging systems

💬 Multimodal Chat

  1. Select the "Multimodal Chat" tab
  2. Upload an image (left panel)
  3. Type your message and press Enter or click "📤 Send"
  4. Continue the conversation naturally
  5. Click "🗑️ Clear Chat" to start over

Example Conversation:

👤 You: Describe this image
🤖 AI: a modern living room with a grey sofa

👤 You: What color are the walls?
🤖 AI: white

👤 You: Is there a window?
🤖 AI: yes
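One way to get a conversation like this is to keep the transcript as a list of (user, AI) pairs and route each message either to a captioner or to a VQA model. The routing rule below (keyword hints such as "describe") is an assumption suggested by the example, where the first reply reads like a caption and the later ones like short VQA answers. The README does not say how the app actually does it, and `chat_turn` and `DESCRIBE_HINTS` are hypothetical names.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (user message, AI reply) pairs

# Assumed keywords that indicate a request for a full description.
DESCRIBE_HINTS = ("describe", "caption", "what is in")


def chat_turn(message: str, history: History,
              caption_fn: Callable[[], str],
              vqa_fn: Callable[[str], str]) -> History:
    """Return a new history with one more turn; blank messages are ignored."""
    text = message.strip()
    if not text:
        return history
    if any(hint in text.lower() for hint in DESCRIBE_HINTS):
        reply = caption_fn()  # whole-image description
    else:
        reply = vqa_fn(text)  # targeted question about the image
    return history + [(text, reply)]
```

Note that this sketch only retains the transcript for display. Each question is still answered independently, so a follow-up such as "What color is it?" would not resolve "it" from earlier turns unless the question were rewritten to include that context.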

Tips:

  • Start with broad questions
  • Build on previous responses
  • Keep questions related to the image

Getting Help

📋 Requirements

System Requirements:

  • Python 3.8+
  • 8GB RAM minimum (16GB recommended)
  • 5GB free storage for models

Dependencies:

  • gradio >= 4.0.0
  • torch >= 2.0.0
  • transformers >= 4.35.0
  • Pillow >= 10.0.0

See requirements.txt for the complete list.
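To check an existing environment against the minimum versions listed above, a small standard-library script is enough. This is a sketch: the helper names are hypothetical, and the version parsing is deliberately simplified (it compares only the leading numeric components, so a pre-release such as `2.0.0rc1` is treated as `2.0.0`).

```python
import re
from importlib import metadata
from typing import Dict, Tuple

# Minimum versions from the dependency list above.
MINIMUMS = {"gradio": "4.0.0", "torch": "2.0.0", "transformers": "4.35.0", "Pillow": "10.0.0"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    size = max(len(a), len(b))
    a += (0,) * (size - len(a))
    b += (0,) * (size - len(b))
    return a >= b


def check_environment(minimums: Dict[str, str] = MINIMUMS) -> Dict[str, str]:
    """Report, per package, whether the installed version satisfies the minimum."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"too old (need >= {minimum})"
        report[name] = f"{installed}: {status}"
    return report
```

Running `print(check_environment())` after `pip install -r requirements.txt` gives a quick confirmation that the install picked up compatible versions.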

📄 License

MIT License. See the LICENSE file for details.

Model Licenses

  • BLIP: BSD-3-Clause License
  • CLIP: MIT License

πŸ™ Acknowledgments

Built with amazing open-source projects: