---
title: AI Image Caption Generator
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.8.0
app_file: app.py
pinned: false
license: mit
---
# 🖼️ AI Image Caption Generator
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.1.0-EE4C2C.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ChinmayM06/ai-image-caption-generator)
> Generate AI-powered image captions with multiple style options—completely free, no API costs.
A lightweight, GPU-accelerated image captioning tool that pairs two vision-language models (BLIP and GIT) with style customization through Groq's free-tier LLM API.
---
## ✨ Features
- 🎯 **Dual Model Support**: Both BLIP-base (fast) and GIT-large (high quality) run simultaneously
- 🎨 **5 Caption Styles**: None, Creative, Social Media, Professional, Technical
- **GPU Accelerated**: Optimized for NVIDIA GPUs (works on CPU too)
- 📊 **Analytics Tracking**: Built-in usage statistics and performance metrics
- 🖼️ **Image Processing**: Automatic validation, resizing, and format conversion
- 🔄 **Fallback Mechanisms**: Graceful degradation when API is unavailable
- 💰 **100% Free**: No OpenAI credits, no hidden costs
- 🔒 **Privacy First**: Local inference option available
---
## 🚀 Live Demo
Try it out without any installation:
**[🎮 Launch Live Demo →](https://huggingface.co/spaces/CXM06/ai-image-caption-generation)**
*The demo currently runs on a CPU rather than a GPU, so captions take a little longer to generate.*
---
## 🛠️ Tech Stack
| Component | Technology |
|-----------|-----------|
| **Vision Models** | BLIP-base, GIT-large (Hugging Face) |
| **Style LLM** | Groq API (free tier) |
| **Framework** | PyTorch 2.1.0 + CUDA 11.8 |
| **Interface** | Gradio 4.8.0 |
| **Deployment** | Hugging Face Spaces (T4 GPU) |
---
## 📦 Quick Start
### Prerequisites
- Python 3.10+
- NVIDIA GPU with 4GB+ VRAM (recommended) or CPU
- CUDA 11.8 (for GPU acceleration)
### Installation
```bash
# Clone repository
git clone https://github.com/ChinmayM06/ai-image-caption-generator.git
cd ai-image-caption-generator
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install PyTorch with CUDA support
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118
# Install dependencies
pip install -r requirements.txt
# Set up environment variables (optional)
# Create a .env file in the project root with:
# GROQ_API_KEY=your_groq_api_key_here
# Get your free API key at https://console.groq.com
# Note: The app works without an API key, but styling then falls back to built-in templates
# Run the application
python app.py
```
Once the app starts, open `http://localhost:7860` in your browser.
---
## 🎯 Usage
### Basic Usage
```python
from src.models import get_model_manager, get_style_model
from src.utils import get_image_processor
from PIL import Image
# Initialize components (singleton pattern)
model_manager = get_model_manager()
style_model = get_style_model()
image_processor = get_image_processor()
# Load models (BLIP and GIT)
blip_success, git_success = model_manager.load_all_models()
# Load and preprocess image
image = Image.open("your_image.jpg")
processed_img, metadata = image_processor.preprocess_image(image)
# Generate captions from both models
captions = model_manager.generate_captions(processed_img)
blip_caption = captions["blip"]
git_caption = captions["git"]
# Apply style (optional)
styled_blip = style_model.style_caption(blip_caption, style="Professional")
styled_git = style_model.style_caption(git_caption, style="Creative")
```
### Available Models
Both models run on every image so their captions can be compared side by side:
- **BLIP-base**: Fast inference (~1-2s), good quality, efficient
- **GIT-large**: Slower (~3-4s), superior caption quality, more detailed
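The comparison described above can be sketched as a small helper that runs each captioner on the same image and records how long it took. This is only an illustration: `timed_caption`, `caption_with_all`, and the stub captioners are hypothetical names, and using worker threads is an assumption about how "both models run" could be implemented, not a description of this project's internals.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_caption(captioner, image):
    """Run one captioner and return (caption, elapsed_seconds)."""
    start = time.perf_counter()
    caption = captioner(image)
    return caption, time.perf_counter() - start


def caption_with_all(captioners, image):
    """Run every captioner on the same image in worker threads (assumed design)."""
    with ThreadPoolExecutor(max_workers=max(1, len(captioners))) as pool:
        futures = {
            name: pool.submit(timed_caption, fn, image)
            for name, fn in captioners.items()
        }
        results = {}
        for name, future in futures.items():
            caption, seconds = future.result()
            results[name] = {"caption": caption, "seconds": seconds}
    return results


# Stand-in captioners; in the real app these would wrap the BLIP and GIT models.
stubs = {
    "blip": lambda image: "a dog sitting on grass",
    "git": lambda image: "a golden retriever sitting on a grassy lawn",
}
for name, result in caption_with_all(stubs, image=None).items():
    print(f"{name}: {result['caption']} ({result['seconds']:.4f}s)")
```

Collecting the latency next to each caption makes it easy to show the speed and quality trade-off between the two models in the interface.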
### Caption Styles
| Style | Use Case | Example |
|-------|----------|---------|
| **None** | Raw model output | "A dog sitting on grass" |
| **Creative** | Artistic, imaginative | "A joyful golden retriever basking in nature's embrace" |
| **Social Media** | Engaging, hashtag-ready | "Meet this good boy enjoying sunny vibes! 🐕☀️ #DogLife" |
| **Professional** | Business, formal | "Canine subject positioned in outdoor environment" |
| **Technical** | Detailed, analytical | "Golden retriever breed, seated posture, natural lighting, outdoor setting" |
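When the Groq API is unavailable, styling falls back to templates. A minimal sketch of what such a fallback could look like is shown here; the template wording, the `FALLBACK_TEMPLATES` table, and the `fallback_style` function are all hypothetical and are not taken from this project's source.

```python
# Hypothetical templates used only when no LLM is reachable.
# "None" is deliberately absent so the raw model output passes through unchanged.
FALLBACK_TEMPLATES = {
    "Creative": "Picture this: {caption}.",
    "Social Media": "{caption}! #PhotoOfTheDay",
    "Professional": "Image description: {caption}.",
    "Technical": "Subject and setting: {caption}.",
}


def fallback_style(caption: str, style: str = "None") -> str:
    """Apply a fixed template; unknown styles and "None" return the raw caption."""
    template = FALLBACK_TEMPLATES.get(style)
    if template is None:
        return caption
    # Drop surrounding whitespace and a trailing period so templates
    # do not end up with doubled punctuation.
    return template.format(caption=caption.strip().rstrip("."))


print(fallback_style("A dog sitting on grass", "Professional"))
print(fallback_style("A dog sitting on grass", "None"))
```

Templates cannot match the variety of LLM output, but they keep every style selectable when the API key is missing or the service is down.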
---
## 🐳 Docker Deployment
```bash
# Build image
docker build -t caption-generator .
# Run container (with GPU)
docker run --gpus all -p 7860:7860 caption-generator
# Run container (CPU only)
docker run -p 7860:7860 -e DEVICE=cpu caption-generator
```
---
## ⚙️ Configuration
### Environment Variables
Create a `.env` file in the project root (optional):
```bash
# Groq API key (optional; without it, styling uses fallback templates)
GROQ_API_KEY=your_groq_api_key_here
# Hardware Configuration (optional, defaults to 'cuda' if available)
DEVICE=cuda # or 'cpu'
# Logging Level (optional)
LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR
```
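The variables above could be read with defaults along these lines. This is a sketch under stated assumptions: `read_settings` is a hypothetical helper, the `INFO` default for `LOG_LEVEL` and the silent fallback to CPU when CUDA is requested but missing are assumptions, and in the real app the `cuda_available` flag would typically come from `torch.cuda.is_available()`.

```python
import os

VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def read_settings(env=None, cuda_available=False):
    """Resolve DEVICE, LOG_LEVEL and GROQ_API_KEY from an environment mapping."""
    env = os.environ if env is None else env

    device = env.get("DEVICE", "").strip().lower()
    if device not in {"cuda", "cpu"}:
        # Unset or unrecognized: default to CUDA when it is available.
        device = "cuda" if cuda_available else "cpu"
    elif device == "cuda" and not cuda_available:
        # Assumption: fall back to CPU rather than fail at model load time.
        device = "cpu"

    log_level = env.get("LOG_LEVEL", "INFO").strip().upper()
    if log_level not in VALID_LOG_LEVELS:
        log_level = "INFO"

    # An empty or missing key means styling relies on fallback templates.
    api_key = env.get("GROQ_API_KEY", "").strip() or None

    return {"device": device, "log_level": log_level, "groq_api_key": api_key}


print(read_settings({"DEVICE": "cpu", "LOG_LEVEL": "debug"}, cuda_available=True))
```

Passing the environment as a plain mapping keeps the helper easy to test without touching the real process environment.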
---
## 🎓 Why This Project?
Built as a learning project to explore:
- **GenAI Fundamentals**: Vision-language models, prompt engineering
- **Practical ML Skills**: GPU optimization, model deployment, API integration
- **Cost Optimization**: Demonstrating production-quality AI without expensive APIs
- **Software Architecture**: Caching, analytics, error handling, thread safety
It is a good way to understand how modern image captioning works under the hood while keeping infrastructure costs at zero.
---
## 🤝 Contributing
Contributions welcome! Feel free to:
- Report bugs
- Suggest features
- Submit pull requests
- Improve documentation
- Add new caption styles
- Optimize performance
---
## 📝 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## 🙏 Acknowledgments
- [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning model
- [Microsoft GIT](https://github.com/microsoft/GenerativeImage2Text) - High-quality captions
- [Groq](https://groq.com) - Free LLM inference API
- [Hugging Face](https://huggingface.co) - Model hosting & deployment
---
## 📬 Contact
**Chinmay M** - [@ChinmayM06](https://github.com/ChinmayM06)
Project Link: [https://github.com/ChinmayM06/ai-image-caption-generator](https://github.com/ChinmayM06/ai-image-caption-generator)
---
<div align="center">
**[⭐ Star this repo](https://github.com/ChinmayM06/ai-image-caption-generator)** if you find it helpful!
Made with ❤️ and lots of ☕
</div>