title: AI Image Caption Generator
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.8.0
app_file: app.py
pinned: false
license: mit

🖼️ AI Image Caption Generator


Generate AI-powered image captions in multiple styles, completely free, with no API costs.

A lightweight, GPU-accelerated image captioning tool using state-of-the-art vision-language models (BLIP & GIT) with style customization powered by Groq's free LLM API.


✨ Features

  • 🎯 Dual Model Support: Both BLIP-base (fast) and GIT-large (high quality) run simultaneously
  • 🎨 5 Caption Styles: None, Creative, Social Media, Professional, Technical
  • ⚡ GPU Accelerated: Optimized for NVIDIA GPUs (works on CPU too)
  • 📊 Analytics Tracking: Built-in usage statistics and performance metrics
  • 🖼️ Image Processing: Automatic validation, resizing, and format conversion
  • 🔄 Fallback Mechanisms: Graceful degradation when API is unavailable
  • 💰 100% Free: No OpenAI credits, no hidden costs
  • 🔒 Privacy First: Local inference option available
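
To illustrate the resizing step of the image-processing feature above, here is a minimal sketch of how target dimensions could be computed before an image is handed to a model. The function name `fit_within` and the 1024-pixel cap are assumptions made for this example; the limits the project actually uses are not stated in this README.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_side.

    Hypothetical helper: the aspect ratio is preserved and images that
    already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1 pixel wide.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))  # large photo is scaled down
print(fit_within(800, 600))    # small image is left as-is
```

The resulting size could then be passed to an image library's resize call; that step is left out here to keep the sketch dependency-free.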

🚀 Live Demo

Try it out without any installation:

🎮 Launch Live Demo →

The hosted demo is a little slow because it runs on a CPU instead of a GPU.


🛠️ Tech Stack

| Component | Technology |
|---|---|
| Vision Models | BLIP-base, GIT-large (Hugging Face) |
| Style LLM | Groq API (free tier) |
| Framework | PyTorch 2.1.0 + CUDA 11.8 |
| Interface | Gradio 4.8.0 |
| Deployment | Hugging Face Spaces (T4 GPU) |

📦 Quick Start

Prerequisites

  • Python 3.10+
  • NVIDIA GPU with 4GB+ VRAM (recommended) or CPU
  • CUDA 11.8 (for GPU acceleration)

Installation

# Clone repository
git clone https://github.com/ChinmayM06/ai-image-caption-generator.git
cd ai-image-caption-generator

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install PyTorch with CUDA support
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118

# Install dependencies
pip install -r requirements.txt

# Set up environment variables (optional)
# Create a .env file in the project root with:
# GROQ_API_KEY=your_groq_api_key_here
# Get your free API key at https://console.groq.com
# Note: The app works without an API key, but styling then falls back to built-in templates

# Run the application
python app.py

Access at http://localhost:7860


🎯 Usage

Basic Usage

from src.models import get_model_manager, get_style_model
from src.utils import get_image_processor
from PIL import Image

# Initialize components (singleton pattern)
model_manager = get_model_manager()
style_model = get_style_model()
image_processor = get_image_processor()

# Load models (BLIP and GIT)
blip_success, git_success = model_manager.load_all_models()

# Load and preprocess image
image = Image.open("your_image.jpg")
processed_img, metadata = image_processor.preprocess_image(image)

# Generate captions from both models
captions = model_manager.generate_captions(processed_img)
blip_caption = captions["blip"]
git_caption = captions["git"]

# Apply style (optional)
styled_blip = style_model.style_caption(blip_caption, style="Professional")
styled_git = style_model.style_caption(git_caption, style="Creative")

Available Models

Both models run on every image so that their captions can be compared side by side:

  • BLIP-base: Fast inference (~1-2s), good quality, efficient
  • GIT-large: Slower (~3-4s), superior caption quality, more detailed
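
One way to run two captioners on the same image at once is a small thread pool. The sketch below is an illustration only: `caption_with_all`, `fake_blip`, and `fake_git` are hypothetical stand-ins, and how `model_manager.generate_captions` schedules the two models internally is not documented in this README.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def caption_with_all(
    image: object, captioners: Dict[str, Callable[[object], str]]
) -> Dict[str, str]:
    """Run every captioner on the same image in parallel threads."""
    if not captioners:
        return {}
    with ThreadPoolExecutor(max_workers=len(captioners)) as pool:
        # Submit all jobs first so they can overlap, then collect the results.
        futures = {name: pool.submit(fn, image) for name, fn in captioners.items()}
        return {name: future.result() for name, future in futures.items()}


# Stand-in captioners (assumptions): each sleeps briefly to mimic inference time.
def fake_blip(image: object) -> str:
    time.sleep(0.1)
    return "a dog sitting on grass"


def fake_git(image: object) -> str:
    time.sleep(0.2)
    return "a golden retriever sitting on a lawn"


captions = caption_with_all("placeholder image", {"blip": fake_blip, "git": fake_git})
print(captions["blip"])
print(captions["git"])
```

With real models, how much the two runs actually overlap depends on the hardware; on a single GPU the gain may be small.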

Caption Styles

| Style | Use Case | Example |
|---|---|---|
| None | Raw model output | "A dog sitting on grass" |
| Creative | Artistic, imaginative | "A joyful golden retriever basking in nature's embrace" |
| Social Media | Engaging, hashtag-ready | "Meet this good boy enjoying sunny vibes! 🐕☀️ #DogLife" |
| Professional | Business, formal | "Canine subject positioned in outdoor environment" |
| Technical | Detailed, analytical | "Golden retriever breed, seated posture, natural lighting, outdoor setting" |
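
When no Groq API key is configured, styling falls back to templates. The sketch below shows what a template-based fallback could look like. The template strings and the names `FALLBACK_TEMPLATES` and `fallback_style` are invented for illustration and are not taken from the project's source.

```python
# Hypothetical templates, one per style; "None" is handled separately below.
FALLBACK_TEMPLATES = {
    "Creative": "Picture this: {caption}.",
    "Social Media": "{caption}! #PhotoOfTheDay",
    "Professional": "Image description: {caption}.",
    "Technical": "Subject: {caption}. Source: automated caption.",
}


def fallback_style(caption: str, style: str = "None") -> str:
    """Wrap a raw caption in a fixed template, without calling any LLM."""
    template = FALLBACK_TEMPLATES.get(style)
    if template is None:
        # "None" or an unrecognized style: return the raw model output.
        return caption
    # Drop surrounding whitespace and a trailing period so the
    # template's own punctuation is not doubled.
    return template.format(caption=caption.strip().rstrip("."))


print(fallback_style("A dog sitting on grass", "Professional"))
```

A fixed template cannot rephrase the caption the way an LLM can, which is why the fallback output is plainer than the examples in the table.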

🐳 Docker Deployment

# Build image
docker build -t caption-generator .

# Run container (with GPU)
docker run --gpus all -p 7860:7860 caption-generator

# Run container (CPU only)
docker run -p 7860:7860 -e DEVICE=cpu caption-generator

⚙️ Configuration

Environment Variables

Create a .env file in the project root (optional):

# Groq API key (needed for LLM-based styling; without it, fallback templates are used)
GROQ_API_KEY=your_groq_api_key_here

# Hardware Configuration (optional, defaults to 'cuda' if available)
DEVICE=cuda  # or 'cpu'

# Logging Level (optional)
LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR
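
The variables above could be read and validated along the following lines. This is a sketch under assumptions: the `Settings` class and `load_settings` function are hypothetical, and it presumes the `.env` file has already been loaded into the process environment. The project's real configuration code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


@dataclass(frozen=True)
class Settings:
    groq_api_key: Optional[str]  # None means: use fallback templates
    device: Optional[str]        # None means: auto-detect (cuda if available)
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, validating each value."""
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY", "").strip() or None
    device = env.get("DEVICE", "").strip().lower() or None
    if device not in (None, "cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda' or 'cpu', got {device!r}")
    log_level = env.get("LOG_LEVEL", "INFO").strip().upper()
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(f"unsupported LOG_LEVEL: {log_level!r}")
    return Settings(groq_api_key=api_key, device=device, log_level=log_level)


settings = load_settings({"DEVICE": "cpu", "LOG_LEVEL": "debug"})
print(settings)
```

Passing a plain dictionary instead of `os.environ`, as in the last lines, makes the parsing easy to test without touching the real environment.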

🎓 Why This Project?

Built as a learning project to explore:

  • GenAI Fundamentals: Vision-language models, prompt engineering
  • Practical ML Skills: GPU optimization, model deployment, API integration
  • Cost Optimization: Demonstrating production-quality AI without expensive APIs
  • Software Architecture: Caching, analytics, error handling, thread safety

Perfect for understanding how modern image captioning works under the hood while keeping infrastructure costs at zero.


🤝 Contributing

Contributions welcome! Feel free to:

  • Report bugs
  • Suggest features
  • Submit pull requests
  • Improve documentation
  • Add new caption styles
  • Optimize performance

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments


📬 Contact

Chinmay M - @ChinmayM06

Project Link: https://github.com/ChinmayM06/ai-image-caption-generator


โญ Star this repo if you find it helpful!

Made with โค๏ธ and lots of โ˜•