Spaces:

CXM06
/

ai-image-caption-generation

Sleeping

App Files Files Community

ai-image-caption-generation / README.md

CXM06

Update README.md

a1ecce5 verified 7 months ago

preview code

raw

history blame contribute delete

6.88 kB

	---
	title: AI Image Caption Generator
	emoji: 🤖
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 4.8.0
	app_file: app.py
	pinned: false
	license: mit
	---

	# 🖼️ AI Image Caption Generator

	[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
	[![PyTorch](https://img.shields.io/badge/PyTorch-2.1.0-EE4C2C.svg)](https://pytorch.org/)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ChinmayM06/ai-image-caption-generator)

	> Generate AI-powered image captions with multiple style options—completely free, no API costs.

	A lightweight, GPU-accelerated image captioning tool using state-of-the-art vision-language models (BLIP & GIT) with style customization powered by Groq's free LLM API.

	---

	## ✨ Features

	- 🎯 Dual Model Support: Both BLIP-base (fast) and GIT-large (high quality) run simultaneously
	- 🎨 5 Caption Styles: None, Creative, Social Media, Professional, Technical
	- ⚡ GPU Accelerated: Optimized for NVIDIA GPUs (works on CPU too)
	- 📊 Analytics Tracking: Built-in usage statistics and performance metrics
	- 🖼️ Image Processing: Automatic validation, resizing, and format conversion
	- 🔄 Fallback Mechanisms: Graceful degradation when API is unavailable
	- 💰 100% Free: No OpenAI credits, no hidden costs
	- 🔒 Privacy First: Local inference option available

	---

	## 🚀 Live Demo

	Try it out without any installation:

	[🎮 Launch Live Demo →](https://huggingface.co/spaces/CXM06/ai-image-caption-generation)

	Will be a little slow as it is running on a CPU instead of GPU

	---

	## 🛠️ Tech Stack

	\| Component \| Technology \|
	\|-----------\|-----------\|
	\| Vision Models \| BLIP-base, GIT-large (Hugging Face) \|
	\| Style LLM \| Groq API (free tier) \|
	\| Framework \| PyTorch 2.1.0 + CUDA 11.8 \|
	\| Interface \| Gradio 4.8.0 \|
	\| Deployment \| Hugging Face Spaces (T4 GPU) \|

	---

	## 📦 Quick Start

	### Prerequisites

	- Python 3.10+
	- NVIDIA GPU with 4GB+ VRAM (recommended) or CPU
	- CUDA 11.8 (for GPU acceleration)

	### Installation

	```bash
	# Clone repository
	git clone https://github.com/ChinmayM06/ai-image-caption-generator.git
	cd ai-image-caption-generator

	# Create virtual environment
	python -m venv venv
	source venv/bin/activate # Windows: venv\Scripts\activate

	# Install PyTorch with CUDA support
	pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118

	# Install dependencies
	pip install -r requirements.txt

	# Set up environment variables (optional)
	# Create a .env file in the project root with:
	# GROQ_API_KEY=your_groq_api_key_here
	# Get your free API key at https://console.groq.com
	# Note: The app works without API key but styling features will use fallback templates

	# Run the application
	python app.py
	```

	Access at `http://localhost:7860`

	---

	## 🎯 Usage

	### Basic Usage

	```python
	from src.models import get_model_manager, get_style_model
	from src.utils import get_image_processor
	from PIL import Image

	# Initialize components (singleton pattern)
	model_manager = get_model_manager()
	style_model = get_style_model()
	image_processor = get_image_processor()

	# Load models (BLIP and GIT)
	blip_success, git_success = model_manager.load_all_models()

	# Load and preprocess image
	image = Image.open("your_image.jpg")
	processed_img, metadata = image_processor.preprocess_image(image)

	# Generate captions from both models
	captions = model_manager.generate_captions(processed_img)
	blip_caption = captions["blip"]
	git_caption = captions["git"]

	# Apply style (optional)
	styled_blip = style_model.style_caption(blip_caption, style="Professional")
	styled_git = style_model.style_caption(git_caption, style="Creative")
	```

	### Available Models

	Both models run simultaneously to provide comparison:
	- BLIP-base: Fast inference (~1-2s), good quality, efficient
	- GIT-large: Slower (~3-4s), superior caption quality, more detailed

	### Caption Styles

	\| Style \| Use Case \| Example \|
	\|-------\|----------\|---------\|
	\| None \| Raw model output \| "A dog sitting on grass" \|
	\| Creative \| Artistic, imaginative \| "A joyful golden retriever basking in nature's embrace" \|
	\| Social Media \| Engaging, hashtag-ready \| "Meet this good boy enjoying sunny vibes! 🐕☀️ #DogLife" \|
	\| Professional \| Business, formal \| "Canine subject positioned in outdoor environment" \|
	\| Technical \| Detailed, analytical \| "Golden retriever breed, seated posture, natural lighting, outdoor setting" \|

	---

	## 🐳 Docker Deployment

	```bash
	# Build image
	docker build -t caption-generator .

	# Run container (with GPU)
	docker run --gpus all -p 7860:7860 caption-generator

	# Run container (CPU only)
	docker run -p 7860:7860 -e DEVICE=cpu caption-generator
	```

	---

	## ⚙️ Configuration

	### Environment Variables

	Create a `.env` file in the project root (optional):

	```bash
	# Groq API Key (required for advanced styling, fallback available)
	GROQ_API_KEY=your_groq_api_key_here

	# Hardware Configuration (optional, defaults to 'cuda' if available)
	DEVICE=cuda # or 'cpu'

	# Logging Level (optional)
	LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR
	```

	---

	## 🎓 Why This Project?

	Built as a learning project to explore:
	- GenAI Fundamentals: Vision-language models, prompt engineering
	- Practical ML Skills: GPU optimization, model deployment, API integration
	- Cost Optimization: Demonstrating production-quality AI without expensive APIs
	- Software Architecture: Caching, analytics, error handling, thread safety

	Perfect for understanding how modern image captioning works under the hood while keeping infrastructure costs at zero.

	---

	## 🤝 Contributing

	Contributions welcome! Feel free to:
	- Report bugs
	- Suggest features
	- Submit pull requests
	- Improve documentation
	- Add new caption styles
	- Optimize performance


	---

	## 📝 License

	This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

	---

	## 🙏 Acknowledgments

	- [Salesforce BLIP](https://github.com/salesforce/BLIP) - Image captioning model
	- [Microsoft GIT](https://github.com/microsoft/GenerativeImage2Text) - High-quality captions
	- [Groq](https://groq.com) - Free LLM inference API
	- [Hugging Face](https://huggingface.co) - Model hosting & deployment

	---

	## 📬 Contact

	Chinmay M - [@ChinmayM06](https://github.com/ChinmayM06)

	Project Link: [https://github.com/ChinmayM06/ai-image-caption-generator](https://github.com/ChinmayM06/ai-image-caption-generator)

	---

	<div align="center">

	[⭐ Star this repo](https://github.com/ChinmayM06/ai-image-caption-generator) if you find it helpful!

	Made with ❤️ and lots of ☕

	</div>