---
title: HW 3 Vision Language AI Demo
emoji: 🤖
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# 🤖 Vision Language AI Demo

A comprehensive web application showcasing state-of-the-art vision-language AI models with an intuitive Gradio interface.

## ✨ Features
### 🖼️ Image Captioning

Automatically generate natural-language descriptions of images using the BLIP model (see the sketch below).

- Captions are generated automatically when an image is uploaded
- Powered by the Salesforce BLIP model
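
For reference, a minimal sketch of what the captioning step can look like with the 🤗 Transformers API (the `caption_image` helper is illustrative, not necessarily the exact code in `app.py`):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the captioning checkpoint once at startup (~447 MB download on first use)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image: Image.Image) -> str:
    """Generate a short natural-language caption for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```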
### ❓ Visual Question Answering (VQA)

Ask questions about images and get answers grounded in the visual content (see the sketch below).

- Supports various question types
- Real-time visual understanding
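
A comparable sketch for VQA, using the BLIP VQA checkpoint listed in the models table (again illustrative rather than the app's exact code):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def answer_question(image: Image.Image, question: str) -> str:
    """Answer a free-form question about an image; answers are short phrases."""
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```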
### 🏷️ Zero-Shot Image Classification

Classify images into custom categories without any training, using the CLIP model (see the sketch below).

- Define any categories you want
- Visual similarity scoring
- No training data required
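
Zero-shot classification works by embedding the image and each candidate label with CLIP and comparing the results; a minimal sketch (the function name and signature are illustrative):

```python
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def classify_image(image: Image.Image, labels):
    """Score an image against arbitrary text labels; no fine-tuning required."""
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)[0]  # one probability per label
    return {label: float(p) for label, p in zip(labels, probs)}
```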
### 💬 Multimodal Chat

Hold interactive conversations about image content with context retention (see the sketch below).

- Multi-turn dialogue support
- Natural language interaction
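
One simple way to get multi-turn chat out of the same models is to route each user message through the VQA helper and keep the running history in a list; this is only a hypothetical sketch, and the actual app may wire the chat differently:

```python
def chat_turn(image, user_message, history):
    """Append one (user, assistant) exchange; `history` is a list of message pairs."""
    reply = answer_question(image, user_message)  # VQA helper sketched above
    return history + [(user_message, reply)]
```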
## 📸 Demo Screenshots

Screenshots of each feature (image captioning, visual question answering, zero-shot classification, and multimodal chat) are included in the `source/` folder.
## 🚀 Quick Start

### Local Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

Access at `http://localhost:7860`
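
The interface itself is a tabbed Gradio app; below is a stripped-down, hypothetical skeleton of how such an `app.py` can be laid out (it assumes the `caption_image` and `answer_question` helpers sketched in the Features section are defined in the same file):

```python
import gradio as gr

with gr.Blocks(title="Vision Language AI Demo") as demo:
    with gr.Tab("Image Captioning"):
        img = gr.Image(type="pil")
        caption = gr.Textbox(label="Caption")
        img.upload(caption_image, inputs=img, outputs=caption)  # auto-caption on upload

    with gr.Tab("Visual Question Answering"):
        vqa_img = gr.Image(type="pil")
        question = gr.Textbox(label="Question")
        answer = gr.Textbox(label="Answer")
        gr.Button("🤔 Get Answer").click(answer_question, inputs=[vqa_img, question], outputs=answer)

demo.launch()  # serves on http://localhost:7860 by default
```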
### Deploy to Hugging Face Spaces

1. Go to https://huggingface.co/spaces
2. Click **"Create new Space"**
3. Fill in:
   - Space name: `vision-language-ai-demo`
   - License: MIT
   - SDK: **Gradio**
   - Hardware: CPU (free) or GPU (for faster processing)
4. Upload files (or push them from the command line, as sketched after these steps):
   - `app.py`
   - `requirements.txt`
   - `README.md`
   - `source/` folder (with screenshots)
5. The Space will auto-deploy in 5-10 minutes
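
As an alternative to uploading through the web UI, the same files can be pushed with the `huggingface_hub` client; the repo ID below is a placeholder for your own username and Space name:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login`
api.create_repo("your-username/vision-language-ai-demo", repo_type="space",
                space_sdk="gradio", exist_ok=True)
api.upload_folder(
    folder_path=".",                                   # local project directory
    repo_id="your-username/vision-language-ai-demo",   # placeholder repo ID
    repo_type="space",
)
```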
## 🛠️ Models Used

| Model | Purpose | Size | Performance |
|-------|---------|------|-------------|
| [BLIP-Captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) | Image Description | 447 MB | Fast |
| [BLIP-VQA](https://huggingface.co/Salesforce/blip-vqa-base) | Visual Q&A | 447 MB | Fast |
| [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) | Classification | 605 MB | Very Fast |

All models are open source and commercially usable.
## 📖 Usage Guide

### 🖼️ Image Captioning

1. Navigate to the **"Image Captioning"** tab
2. Upload an image (drag & drop or click to browse)
3. A caption is generated automatically
4. Or click the **"🎨 Generate Caption"** button

**Example Output:**

```
📝 Image Caption:
a cat sitting on a wooden table looking at the camera
```

**Use Cases:**

- Generate alt text for accessibility
- Auto-tag images for organization
- Content moderation
- Creative writing inspiration

---
### ❓ Visual Question Answering

1. Go to the **"Visual Question Answering"** tab
2. Upload an image
3. Type your question in the text box
4. Click **"🤔 Get Answer"**

**Example Questions:**

- "What color is the car?"
- "How many people are there?"
- "Is there a dog in the image?"
- "What is the person wearing?"

**Example Output:**

```
❓ Question: What color is the car?
✅ Answer: red
```

**Tips:**

- Ask specific, clear questions
- One question at a time works best
- Simple language gets better results

---
### 🏷️ Zero-Shot Classification

1. Open the **"Zero-Shot Classification"** tab
2. Upload an image
3. Enter categories (comma-separated)
   - Default: `cat, dog, bird, car, building`
   - Custom: `sunny, cloudy, rainy, snowy`
4. Click **"🎯 Classify"**

**Example Output:**

```
🎯 Classification Results:
cat: 92.50% ██████████████████
dog: 5.20% █
bird: 2.30% █
car: 0.00%
building: 0.00%
```
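
The percentage bars in the output above can be produced by scaling each probability to a fixed-width run of block characters; a hypothetical formatting helper:

```python
def format_results(scores, width=20):
    """Render a {label: probability} dict as percentages with text bars, best match first."""
    lines = ["🎯 Classification Results:"]
    for label, p in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{label}: {p:.2%} " + "█" * round(p * width))
    return "\n".join(lines)
```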
**Use Cases:**

- Content categorization
- Image filtering
- Quality control
- Custom tagging systems

---
### 💬 Multimodal Chat

1. Select the **"Multimodal Chat"** tab
2. Upload an image (left panel)
3. Type your message and press Enter or click **"📤 Send"**
4. Continue the conversation naturally
5. Click **"🗑️ Clear Chat"** to start over

**Example Conversation:**

```
👤 You: Describe this image
🤖 AI: a modern living room with a grey sofa
👤 You: What color are the walls?
🤖 AI: white
👤 You: Is there a window?
🤖 AI: yes
```

**Tips:**

- Start with broad questions
- Build on previous responses
- Keep questions related to the image
### Getting Help

- 📚 [Gradio Documentation](https://gradio.app/docs/)
- 🤗 [Hugging Face Forums](https://discuss.huggingface.co/)
- 💬 [Gradio Discord](https://discord.gg/gradio)
## 📋 Requirements

**System Requirements:**

- Python 3.8+
- 8 GB RAM minimum (16 GB recommended)
- 5 GB of free storage for models

**Dependencies:**

- gradio >= 4.0.0
- torch >= 2.0.0
- transformers >= 4.35.0
- Pillow >= 10.0.0

See `requirements.txt` for the complete list.
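
For reference, a minimal `requirements.txt` consistent with the versions above (the actual file may pin additional packages):

```text
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
Pillow>=10.0.0
```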
## 📄 License

MIT License. See the [LICENSE](LICENSE) file for details.

### Model Licenses

- **BLIP**: BSD-3-Clause License
- **CLIP**: MIT License
## 🙏 Acknowledgments

Built with amazing open-source projects:

- [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning and VQA
- [OpenAI CLIP](https://github.com/openai/CLIP) - Zero-shot classification
- [Hugging Face Transformers](https://huggingface.co/docs/transformers) - Model hub and inference
- [Gradio](https://gradio.app/) - Beautiful web interfaces

---