metadata

title: AI Image Caption Generator
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.8.0
app_file: app.py
pinned: false
license: mit

🖼️ AI Image Caption Generator

Generate AI-powered image captions in multiple styles, completely free and with no API costs.

A lightweight, GPU-accelerated image captioning tool using state-of-the-art vision-language models (BLIP & GIT) with style customization powered by Groq's free LLM API.

✨ Features

🎯 Dual Model Support: Both BLIP-base (fast) and GIT-large (high quality) run simultaneously
🎨 5 Caption Styles: None, Creative, Social Media, Professional, Technical
⚡ GPU Accelerated: Optimized for NVIDIA GPUs (works on CPU too)
📊 Analytics Tracking: Built-in usage statistics and performance metrics
🖼️ Image Processing: Automatic validation, resizing, and format conversion
🔄 Fallback Mechanisms: Graceful degradation to template-based styling when the Groq API is unavailable
💰 100% Free: No OpenAI credits, no hidden costs
🔒 Privacy First: Local inference option available
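The validation and resizing step listed above can be pictured with a small sketch. The helper names (`validate_format`, `fit_within`), the accepted-format set, and the 1024-pixel limit are assumptions made for illustration; they are not taken from this project's code.

```python
# Hypothetical helpers: not part of this repository's API.
ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP"}  # assumed set of accepted formats
MAX_SIDE = 1024  # assumed upper bound on the longer image side, in pixels


def validate_format(fmt: str) -> bool:
    """Return True if the (case-insensitive) format name is accepted."""
    return fmt.upper() in ALLOWED_FORMATS


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    The aspect ratio is preserved; images that already fit are unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to the nearest pixel, but never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the computed size could be passed to `Image.resize`, and `Image.convert("RGB")` would cover the format-conversion part.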

🚀 Live Demo

Try it out without any installation:

🎮 Launch Live Demo →

Note: the live demo runs on a CPU rather than a GPU, so caption generation is somewhat slow.

🛠️ Tech Stack

Component	Technology
Vision Models	BLIP-base, GIT-large (Hugging Face)
Style LLM	Groq API (free tier)
Framework	PyTorch 2.1.0 + CUDA 11.8
Interface	Gradio 4.8.0
Deployment	Hugging Face Spaces (T4 GPU)

📦 Quick Start

Prerequisites

Python 3.10+
NVIDIA GPU with 4GB+ VRAM (recommended) or CPU
CUDA 11.8 (for GPU acceleration)
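A quick way to check these prerequisites is a short script like the one below. It is a sketch, not part of the project: the function names are made up for this example, and the PyTorch import is wrapped so the script still runs before PyTorch is installed.

```python
import sys


def check_python(min_version: tuple[int, int] = (3, 10)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version


def detect_device() -> str:
    """Return 'cuda' if PyTorch is installed and sees a GPU, else 'cpu'."""
    try:
        import torch  # may not be installed yet at this point
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Python version OK: {check_python()}")
    print(f"Device: {detect_device()}")
```

If the script reports `cpu` on a machine with an NVIDIA GPU, the usual suspects are a CPU-only PyTorch build or a driver and CUDA mismatch.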

Installation

# Clone repository
git clone https://github.com/ChinmayM06/ai-image-caption-generator.git
cd ai-image-caption-generator

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install PyTorch with CUDA support
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118

# Install dependencies
pip install -r requirements.txt

# Set up environment variables (optional)
# Create a .env file in the project root with:
# GROQ_API_KEY=your_groq_api_key_here
# Get your free API key at https://console.groq.com
# Note: The app works without an API key, but styling then falls back to built-in templates

# Run the application
python app.py

Access at http://localhost:7860

🎯 Usage

Basic Usage

from src.models import get_model_manager, get_style_model
from src.utils import get_image_processor
from PIL import Image

# Initialize components (singleton pattern)
model_manager = get_model_manager()
style_model = get_style_model()
image_processor = get_image_processor()

# Load models (BLIP and GIT)
blip_success, git_success = model_manager.load_all_models()

# Load and preprocess image
image = Image.open("your_image.jpg")
processed_img, metadata = image_processor.preprocess_image(image)

# Generate captions from both models
captions = model_manager.generate_captions(processed_img)
blip_caption = captions["blip"]
git_caption = captions["git"]

# Apply style (optional)
styled_blip = style_model.style_caption(blip_caption, style="Professional")
styled_git = style_model.style_caption(git_caption, style="Creative")
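The "singleton pattern" comment in the example above suggests that each `get_*` function hands back one shared instance. A minimal sketch of how such an accessor could be written follows; the `ModelManager` class here is a stand-in, and the real implementation in `src.models` may differ.

```python
import threading
from typing import Optional


class ModelManager:
    """Stand-in class; the real one would hold the BLIP and GIT models."""

    def __init__(self) -> None:
        self.loaded = False


_instance: Optional[ModelManager] = None
_lock = threading.Lock()


def get_model_manager() -> ModelManager:
    """Return the one shared ModelManager, creating it on first use."""
    global _instance
    if _instance is None:  # fast path once the instance exists
        with _lock:
            if _instance is None:  # re-check while holding the lock
                _instance = ModelManager()
    return _instance
```

Sharing one instance matters here because loading vision models twice would double memory use; the lock keeps concurrent first calls from creating two managers.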

Available Models

Both models run on every image so their captions can be compared side by side:

BLIP-base: Fast inference (~1-2s), good quality, efficient
GIT-large: Slower (~3-4s), superior caption quality, more detailed
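One way to run two captioners on the same image and record how long each takes is a small thread pool. This is a sketch under assumptions: `run_side_by_side` and the two fake captioners are invented for this example, and the project may schedule its models differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_side_by_side(image, captioners: dict[str, Callable]) -> dict[str, dict]:
    """Run every captioner on the same image in parallel threads.

    Returns {name: {"caption": str, "seconds": float}}.
    """

    def timed(fn: Callable) -> dict:
        start = time.perf_counter()
        caption = fn(image)
        return {"caption": caption, "seconds": time.perf_counter() - start}

    with ThreadPoolExecutor(max_workers=len(captioners)) as pool:
        futures = {name: pool.submit(timed, fn) for name, fn in captioners.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Stand-in captioners; real ones would call the BLIP and GIT models.
def fake_blip(image):
    return "a dog sitting on grass"


def fake_git(image):
    return "a golden retriever sitting on a green lawn"


results = run_side_by_side(None, {"blip": fake_blip, "git": fake_git})
```

On a single GPU the two models may still execute largely one after the other, so the per-model timings are more informative than the total wall-clock time.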

Caption Styles

Style	Use Case	Example
None	Raw model output	"A dog sitting on grass"
Creative	Artistic, imaginative	"A joyful golden retriever basking in nature's embrace"
Social Media	Engaging, hashtag-ready	"Meet this good boy enjoying sunny vibes! 🐕☀️ #DogLife"
Professional	Business, formal	"Canine subject positioned in outdoor environment"
Technical	Detailed, analytical	"Golden retriever breed, seated posture, natural lighting, outdoor setting"
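When no Groq API key is set, or the API call fails, styling falls back to templates. The sketch below shows one possible shape for that fallback. The template strings, the `style_with_fallback` name, and the `llm` callable signature are all hypothetical; the project's actual templates and `style_caption` internals are not shown in this README.

```python
# Hypothetical fallback templates; the project's real templates may differ.
FALLBACK_TEMPLATES = {
    "None": "{caption}",
    "Creative": "Picture this: {caption}.",
    "Social Media": "{caption}! #PhotoOfTheDay",
    "Professional": "Image description: {caption}.",
    "Technical": "Subject and setting: {caption}.",
}


def style_with_fallback(caption: str, style: str, llm=None) -> str:
    """Style a caption with an LLM callable, or with a template if that fails."""
    if style not in FALLBACK_TEMPLATES:
        raise ValueError(f"unknown style: {style!r}")
    if style == "None":
        return caption  # raw model output, left untouched
    text = caption.strip().rstrip(".")
    if llm is not None:
        try:
            return llm(text, style)
        except Exception:
            pass  # API unavailable: fall through to the template
    # Capitalize the first letter so the template output reads as a sentence.
    sentence = text[0].upper() + text[1:] if text else text
    return FALLBACK_TEMPLATES[style].format(caption=sentence)
```

Template output is much plainer than the LLM examples in the table, which is the expected trade-off for working without an API key.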

🐳 Docker Deployment

# Build image
docker build -t caption-generator .

# Run container (with GPU)
docker run --gpus all -p 7860:7860 caption-generator

# Run container (CPU only)
docker run -p 7860:7860 -e DEVICE=cpu caption-generator

⚙️ Configuration

Environment Variables

Create a .env file in the project root (optional):

# Groq API key (optional: enables LLM-based styling; without it, fallback templates are used)
GROQ_API_KEY=your_groq_api_key_here

# Hardware configuration (optional: 'cuda' or 'cpu'; defaults to 'cuda' if available)
DEVICE=cuda

# Logging level (optional: DEBUG, INFO, WARNING, or ERROR)
LOG_LEVEL=INFO
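The three variables above could be read and validated at startup roughly as follows. This is a sketch, not the project's code: the `Settings` class and `load_settings` function are assumed names, and the function only reads the process environment. Loading the `.env` file itself would need a separate step, for example `load_dotenv()` from the python-dotenv package.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_DEVICES = {"cuda", "cpu"}
VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


@dataclass(frozen=True)
class Settings:
    groq_api_key: Optional[str]
    device: str
    log_level: str

    @property
    def styling_uses_llm(self) -> bool:
        """True when a Groq key is present; otherwise templates are used."""
        return self.groq_api_key is not None


def load_settings(env: Mapping[str, str] = os.environ,
                  cuda_available: bool = False) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    key = env.get("GROQ_API_KEY", "").strip() or None
    default_device = "cuda" if cuda_available else "cpu"
    device = env.get("DEVICE", default_device).strip().lower()
    if device not in VALID_DEVICES:
        raise ValueError(f"DEVICE must be one of {sorted(VALID_DEVICES)}")
    level = env.get("LOG_LEVEL", "INFO").strip().upper()
    if level not in VALID_LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {sorted(VALID_LOG_LEVELS)}")
    return Settings(groq_api_key=key, device=device, log_level=level)
```

Passing the environment as a mapping keeps the function easy to test, and rejecting unknown values early avoids a confusing failure later during model loading.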

🎓 Why This Project?

Built as a learning project to explore:

GenAI Fundamentals: Vision-language models, prompt engineering
Practical ML Skills: GPU optimization, model deployment, API integration
Cost Optimization: Demonstrating production-quality AI without expensive APIs
Software Architecture: Caching, analytics, error handling, thread safety

Perfect for understanding how modern image captioning works under the hood while keeping infrastructure costs at zero.
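As an illustration of the caching, analytics, and thread-safety topics listed above, here is a minimal sketch of a caption cache that counts hits and misses. The `CaptionCache` class is hypothetical and is not claimed to match the project's own implementation.

```python
import hashlib
import threading


class CaptionCache:
    """Thread-safe cache keyed by a hash of the image bytes, with hit/miss counts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._captions: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, image_bytes: bytes, compute) -> str:
        """Return the cached caption, or call compute(image_bytes) and store it."""
        key = hashlib.sha256(image_bytes).hexdigest()
        with self._lock:
            if key in self._captions:
                self.hits += 1
                return self._captions[key]
            self.misses += 1
        # Run the slow model call outside the lock so other threads are not blocked.
        caption = compute(image_bytes)
        with self._lock:
            # Keep the first stored result if another thread finished earlier.
            self._captions.setdefault(key, caption)
            return self._captions[key]
```

The design accepts that two threads may occasionally caption the same new image at once; in exchange, a slow inference never blocks cache lookups for other images.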

🤝 Contributing

Contributions welcome! Feel free to:

Report bugs
Suggest features
Submit pull requests
Improve documentation
Add new caption styles
Optimize performance

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Salesforce BLIP - Image captioning model
Microsoft GIT - High-quality captions
Groq - Free LLM inference API
Hugging Face - Model hosting & deployment

📬 Contact

Chinmay M - @ChinmayM06

Project Link: https://github.com/ChinmayM06/ai-image-caption-generator

⭐ Star this repo if you find it helpful!

Made with ❤️ and lots of ☕