---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# 🤖 Vision Language AI Demo

A Gradio web application showcasing state-of-the-art vision-language AI models: image captioning, visual question answering, zero-shot classification, and multimodal chat.

## ✨ Features

### 🖼️ Image Captioning

Automatically generate natural-language descriptions of images using the BLIP model.
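
A minimal sketch of how BLIP captioning can be called through Hugging Face Transformers. The checkpoint name and generation settings are assumptions (`Salesforce/blip-image-captioning-base`); app.py may differ.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; app.py may load a different BLIP variant.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")          # any RGB image
inputs = processor(images=image, return_tensors="pt")     # pixel_values tensor
output_ids = model.generate(**inputs, max_new_tokens=30)  # greedy caption
print(processor.decode(output_ids[0], skip_special_tokens=True))
```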

### 🔍 Visual Question Answering (VQA)

Ask questions about an image and get answers grounded in its visual content.
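
A minimal VQA sketch, assuming the `Salesforce/blip-vqa-base` checkpoint (the actual checkpoint in app.py may differ):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")  # assumed checkpoint
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, text="What color is the car?", return_tensors="pt")
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))  # e.g. "red"
```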

### 🏷️ Zero-Shot Image Classification

Classify images into custom categories, with no training required, using the CLIP model.
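
A minimal sketch using the Transformers zero-shot image-classification pipeline. The CLIP checkpoint shown (`openai/clip-vit-base-patch32`) is an assumption; app.py may load CLIP differently.

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",  # assumed CLIP checkpoint
)

# Scores the image against the user-provided labels.
results = classifier("example.jpg", candidate_labels=["cat", "dog", "bird"])
for r in results:
    print(f"{r['label']}: {r['score']:.1%}")
```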

### 💬 Multimodal Chat

Hold interactive conversations about an image, with the chat history retained across turns.
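
A minimal sketch of one possible implementation, assuming each chat turn is answered by the BLIP VQA model and the transcript is kept as a list of (user, assistant) pairs, the format Gradio's Chatbot component expects. The actual logic in app.py may differ.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")  # assumed checkpoint
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def chat_turn(image: Image.Image, message: str, history: list) -> list:
    """Answer one user message about the image and append it to the history."""
    inputs = vqa_processor(images=image, text=message, return_tensors="pt")
    output_ids = vqa_model.generate(**inputs)
    answer = vqa_processor.decode(output_ids[0], skip_special_tokens=True)
    return history + [(message, answer)]
```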

## 📸 Demo Screenshots

- Main Interface
- Image Captioning
- Visual Question Answering
- Zero-Shot Classification
- Multimodal Chat

## 🚀 Quick Start

### Local Run

```bash
pip install -r requirements.txt
python app.py
```

Then open http://localhost:7860.
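
The contents of requirements.txt are not shown here; a plausible minimal dependency set for this app (an assumption, check the actual file for exact pins) would be:

```text
# Assumed minimal dependencies for this Space; see the actual requirements.txt
gradio
torch
transformers
Pillow
```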

### Deploy to Hugging Face Spaces

1. Create a Space
2. Upload files
   - Upload `app.py`, `requirements.txt`, and `README.md`
   - Or use Git:

     ```bash
     git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
     cd YOUR_SPACE_NAME
     # Copy your files here
     git add .
     git commit -m "Initial commit"
     git push
     ```

3. Wait for the build
   - The Space auto-deploys in about 5-10 minutes
   - Access it at: https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME

### Enable GPU (Optional)

- Go to Space Settings → Hardware
- Select a GPU option for faster inference
- Restart the Space (see the device-selection sketch below)

## 🛠️ Models Used

| Model           | Purpose           | Size   |
|-----------------|-------------------|--------|
| BLIP-Captioning | Image description | 447 MB |
| BLIP-VQA        | Visual Q&A        | 447 MB |
| CLIP            | Classification    | 605 MB |

## 📖 Usage Examples

### Image Captioning

Upload an image → Click "Generate Caption" → Get a description

Example output:

```
📝 Image Caption:
A golden retriever sitting in a park with green grass
```

### Visual Question Answering

Upload an image → Ask a question → Get an answer

Example:

```
Q: What color is the car?
A: red
```

### Zero-Shot Classification

Upload an image → Define categories (comma-separated) → Get probabilities

Example:

```
Categories: cat, dog, bird
Results:
cat:  92.5% ██████████████████
dog:   5.2% █
bird:  2.3% ▌
```

### Multimodal Chat

Upload an image → Chat naturally about it

Example:

```
You: Describe this image
AI: A modern kitchen with white cabinets
You: What color are the walls?
AI: white
```

## ⚙️ Configuration

### Change Models

Edit app.py to use different models:

```python
# Use the larger BLIP captioning model
caption_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)
```
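
When swapping checkpoints, the processor should be changed to the matching one as well, otherwise preprocessing and the model can disagree. A one-line sketch (the variable name `caption_processor` is an assumption):

```python
from transformers import BlipProcessor

# Keep the processor in sync with the model checkpoint.
caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
```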

### Customize Interface

Modify custom_css in app.py:

```python
custom_css = """
#title {
    background: linear-gradient(90deg, #YOUR_COLOR 0%, #YOUR_COLOR 100%);
}
"""
```

πŸ› Troubleshooting

Issue: Models downloading slowly

# Set cache directory
export HF_HOME=/path/to/storage

Issue: Out of memory

# Use CPU only
device = "cpu"

Issue: Port already in use

python app.py --server-port 8080
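
The --server-port flag only works if app.py parses command-line arguments. With a stock Gradio app you can instead set the GRADIO_SERVER_PORT environment variable, or pass the port in the launch call; the object name `demo` below is an assumption about app.py, and 8080 is just an example value.

```python
# In app.py, assuming the Gradio Blocks/Interface object is named `demo`:
demo.launch(server_port=8080)
```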

## 📄 License

MIT License - see the LICENSE file.

## 🙏 Acknowledgments

- BLIP (Salesforce) - image captioning and VQA
- CLIP (OpenAI) - zero-shot classification
- Hugging Face Transformers and Gradio

⭐ Star this project if you find it helpful!