Spaces:

HarshitX
/

Multi_LLM_Image_Captioning

Sleeping

App Files Files Community

Multi_LLM_Image_Captioning / README.md

HarshitX

Update README.md

8925aaa verified 7 months ago

preview code

raw

history blame contribute delete

7.93 kB

A newer version of the Streamlit SDK is available: 1.53.1

Upgrade

metadata

license: mit
title: Multi_LLM_Image_Captioning
sdk: streamlit
emoji: 💻
colorFrom: purple
colorTo: indigo
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/662234af4dd89a733b09e612/gnrlvy8935CNe0fcx0HZs.png
short_description: A powerful Streamlit application that generates captions for
sdk_version: 1.46.1

🖼️ Multi-Model Image Caption Generator

A powerful Streamlit application that generates captions for images using multiple AI models (OpenAI GPT-4o, Google Gemini, and GROQ Vision) with advanced image processing capabilities using OpenCV and LangChain for history management.

✨ Features

Multi-Model Support: Choose from OpenAI GPT-4o, Google Gemini, or GROQ Vision models
Smart Caption Generation: Clean, professional captions (10-50 words, no emojis/symbols)
Advanced Image Processing: Two caption overlay methods using OpenCV
LangChain Integration: Comprehensive history management and conversation memory
Custom Typography: Uses Poppins font with intelligent fallbacks
Interactive UI: Modern Streamlit interface with real-time preview
Export Functionality: Download processed images with captions

🚀 Quick Start

Prerequisites

Python 3.8+
API keys for at least one of the supported models

Installation

Clone the repository

git clone <your-repo-url>
cd multi-model-caption-generator

Install dependencies

pip install streamlit opencv-python pillow openai google-generativeai groq langchain python-dotenv

Set up environment variables Create a .env file in the project root:

OPENAI_API_KEY_IC=your_openai_api_key_here
GEMINI_API_KEY_IC=your_gemini_api_key_here
GROQ_API_KEY_IC=your_groq_api_key_here

Set up fonts (optional) Place your font file at:

fonts/Poppins-Regular.ttf

Run the application

streamlit run main.py

📁 Project Structure

multi-model-caption-generator/
├── main.py                    # Main Streamlit application
├── caption_generation.py     # Multi-model caption generation
├── caption_history.py        # LangChain history management
├── caption_overlay.py        # OpenCV image processing
├── fonts/                    # Font directory
│   └── Poppins-Regular.ttf   # Custom font (optional)
├── .env                      # Environment variables
├── caption_history.json     # Auto-generated history file
└── README.md                # This file

🤖 Supported AI Models

OpenAI GPT-4o

Model: gpt-4o
Strengths: Detailed image analysis, high accuracy
API: OpenAI Vision API

Google Gemini

Model: gemini-1.5-flash
Strengths: Fast processing, multimodal understanding
API: Google Generative AI

GROQ Vision

Model: llama-3.2-11b-vision-preview
Strengths: High-speed inference, efficient processing
API: GROQ API

🎨 Caption Overlay Options

1. Overlay on Image

Position: Top, Center, or Bottom
Customizable font size and thickness
Auto text wrapping for long captions
Semi-transparent background for readability

2. Background Behind Image

Caption appears above the image
Customizable background and text colors
Adjustable margins
Uses Poppins font with fallbacks

📝 Caption History Management

The application uses LangChain for sophisticated history management:

Persistent Storage: All captions saved to caption_history.json
Memory Integration: LangChain ConversationBufferMemory
Search & Filter: Find previous captions by image name or content
Export History: View and manage generation history

🔧 Configuration

API Keys Setup

Get your API keys from:

OpenAI: https://platform.openai.com/api-keys
Google Gemini: https://makersuite.google.com/app/apikey
GROQ: https://console.groq.com/keys

Font Configuration

The app automatically uses fonts in this priority:

Custom font path (if specified in UI)
fonts/Poppins-Regular.ttf (if available)
System default font

Caption Settings

Word Limit: 10-50 words maximum
Format: Plain text only (no emojis or special characters)
Style: Descriptive but concise

🖥️ Usage

Configure APIs: Add your API keys to .env file and click "Configure APIs"
Upload Image: Choose PNG, JPG, JPEG, BMP, or TIFF files
Select Model: Choose from OpenAI, Gemini, or GROQ
Generate Caption: Click to generate and see real-time preview
Customize Overlay: Adjust position, colors, and styling
Download: Save the final image with caption

🎯 Key Features Explained

Smart Caption Generation

All models generate clean, professional captions
Consistent 10-50 word length
No emojis or special characters
Perfect for image overlays

Advanced Image Processing

OpenCV-powered text rendering
Automatic text wrapping
High-quality font rendering with PIL
Multiple overlay styles

History Management

LangChain integration for conversation memory
Searchable history with timestamps
Model tracking for each generation
Easy history clearing and management

🛠️ Technical Details

Dependencies

streamlit>=1.28.0
opencv-python>=4.8.0
pillow>=10.0.0
openai>=1.0.0
google-generativeai>=0.3.0
groq>=0.4.0
langchain>=0.1.0
python-dotenv>=1.0.0
numpy>=1.24.0

Performance Optimizations

Efficient base64 encoding for API calls
Optimized image processing with OpenCV
Smart memory management with LangChain
Reduced token limits for faster generation

🔍 Troubleshooting

Common Issues

API Key Errors

Ensure all API keys are correctly set in .env file
Check API key validity and quotas
Restart the application after adding keys

Font Loading Issues

Verify font file exists at fonts/Poppins-Regular.ttf
Check file permissions
App will fallback to default font if custom font fails

Image Processing Errors

Ensure uploaded images are valid formats
Check image file size (very large images may cause issues)
Try different image formats if problems persist

Model-Specific Issues

OpenAI: Verify you have access to GPT-4o vision model
Gemini: Ensure Gemini API is enabled in your Google Cloud project
GROQ: Check that vision models are available in your region

Error Messages

Error	Solution
"API key not configured"	Add the required API key to `.env` file
"Model not available"	Check model name and API access
"Image processing failed"	Try a different image format or size
"Font loading error"	Check font file path or use default font

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and commit: git commit -m 'Add feature'
Push to the branch: git push origin feature-name
Submit a pull request

📄 License

This project is licensed under the MIT License - see the MIT LICENSE file for details.

🙏 Acknowledgments

Streamlit for the amazing web app framework
OpenCV for powerful image processing capabilities
LangChain for conversation memory management
OpenAI, Google, and GROQ for providing excellent vision APIs
Poppins Font for beautiful typography

📞 Support

If you encounter any issues or have questions:

Check the troubleshooting section above
Review the Issues page
Create a new issue with detailed information
Provide error messages and steps to reproduce

Built with ❤️ using Streamlit, LangChain, OpenCV, and multi-model AI APIs