Spaces:
Sleeping
Sleeping
A newer version of the Streamlit SDK is available:
1.53.1
metadata
license: mit
title: Multi_LLM_Image_Captioning
sdk: streamlit
emoji: π»
colorFrom: purple
colorTo: indigo
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/662234af4dd89a733b09e612/gnrlvy8935CNe0fcx0HZs.png
short_description: A powerful Streamlit application that generates captions for
sdk_version: 1.46.1
πΌοΈ Multi-Model Image Caption Generator
A powerful Streamlit application that generates captions for images using multiple AI models (OpenAI GPT-4o, Google Gemini, and GROQ Vision) with advanced image processing capabilities using OpenCV and LangChain for history management.
β¨ Features
- Multi-Model Support: Choose from OpenAI GPT-4o, Google Gemini, or GROQ Vision models
- Smart Caption Generation: Clean, professional captions (10-50 words, no emojis/symbols)
- Advanced Image Processing: Two caption overlay methods using OpenCV
- LangChain Integration: Comprehensive history management and conversation memory
- Custom Typography: Uses Poppins font with intelligent fallbacks
- Interactive UI: Modern Streamlit interface with real-time preview
- Export Functionality: Download processed images with captions
π Quick Start
Prerequisites
- Python 3.8+
- API keys for at least one of the supported models
Installation
- Clone the repository
git clone <your-repo-url>
cd multi-model-caption-generator
- Install dependencies
pip install streamlit opencv-python pillow openai google-generativeai groq langchain python-dotenv
- Set up environment variables
Create a
.envfile in the project root:
OPENAI_API_KEY_IC=your_openai_api_key_here
GEMINI_API_KEY_IC=your_gemini_api_key_here
GROQ_API_KEY_IC=your_groq_api_key_here
- Set up fonts (optional) Place your font file at:
fonts/Poppins-Regular.ttf
- Run the application
streamlit run main.py
π Project Structure
multi-model-caption-generator/
βββ main.py # Main Streamlit application
βββ caption_generation.py # Multi-model caption generation
βββ caption_history.py # LangChain history management
βββ caption_overlay.py # OpenCV image processing
βββ fonts/ # Font directory
β βββ Poppins-Regular.ttf # Custom font (optional)
βββ .env # Environment variables
βββ caption_history.json # Auto-generated history file
βββ README.md # This file
π€ Supported AI Models
OpenAI GPT-4o
- Model:
gpt-4o - Strengths: Detailed image analysis, high accuracy
- API: OpenAI Vision API
Google Gemini
- Model:
gemini-1.5-flash - Strengths: Fast processing, multimodal understanding
- API: Google Generative AI
GROQ Vision
- Model:
llama-3.2-11b-vision-preview - Strengths: High-speed inference, efficient processing
- API: GROQ API
π¨ Caption Overlay Options
1. Overlay on Image
- Position: Top, Center, or Bottom
- Customizable font size and thickness
- Auto text wrapping for long captions
- Semi-transparent background for readability
2. Background Behind Image
- Caption appears above the image
- Customizable background and text colors
- Adjustable margins
- Uses Poppins font with fallbacks
π Caption History Management
The application uses LangChain for sophisticated history management:
- Persistent Storage: All captions saved to
caption_history.json - Memory Integration: LangChain ConversationBufferMemory
- Search & Filter: Find previous captions by image name or content
- Export History: View and manage generation history
π§ Configuration
API Keys Setup
Get your API keys from:
- OpenAI: https://platform.openai.com/api-keys
- Google Gemini: https://makersuite.google.com/app/apikey
- GROQ: https://console.groq.com/keys
Font Configuration
The app automatically uses fonts in this priority:
- Custom font path (if specified in UI)
fonts/Poppins-Regular.ttf(if available)- System default font
Caption Settings
- Word Limit: 10-50 words maximum
- Format: Plain text only (no emojis or special characters)
- Style: Descriptive but concise
π₯οΈ Usage
- Configure APIs: Add your API keys to
.envfile and click "Configure APIs" - Upload Image: Choose PNG, JPG, JPEG, BMP, or TIFF files
- Select Model: Choose from OpenAI, Gemini, or GROQ
- Generate Caption: Click to generate and see real-time preview
- Customize Overlay: Adjust position, colors, and styling
- Download: Save the final image with caption
π― Key Features Explained
Smart Caption Generation
- All models generate clean, professional captions
- Consistent 10-50 word length
- No emojis or special characters
- Perfect for image overlays
Advanced Image Processing
- OpenCV-powered text rendering
- Automatic text wrapping
- High-quality font rendering with PIL
- Multiple overlay styles
History Management
- LangChain integration for conversation memory
- Searchable history with timestamps
- Model tracking for each generation
- Easy history clearing and management
π οΈ Technical Details
Dependencies
streamlit>=1.28.0
opencv-python>=4.8.0
pillow>=10.0.0
openai>=1.0.0
google-generativeai>=0.3.0
groq>=0.4.0
langchain>=0.1.0
python-dotenv>=1.0.0
numpy>=1.24.0
Performance Optimizations
- Efficient base64 encoding for API calls
- Optimized image processing with OpenCV
- Smart memory management with LangChain
- Reduced token limits for faster generation
π Troubleshooting
Common Issues
API Key Errors
- Ensure all API keys are correctly set in
.envfile - Check API key validity and quotas
- Restart the application after adding keys
Font Loading Issues
- Verify font file exists at
fonts/Poppins-Regular.ttf - Check file permissions
- App will fallback to default font if custom font fails
Image Processing Errors
- Ensure uploaded images are valid formats
- Check image file size (very large images may cause issues)
- Try different image formats if problems persist
Model-Specific Issues
- OpenAI: Verify you have access to GPT-4o vision model
- Gemini: Ensure Gemini API is enabled in your Google Cloud project
- GROQ: Check that vision models are available in your region
Error Messages
| Error | Solution |
|---|---|
| "API key not configured" | Add the required API key to .env file |
| "Model not available" | Check model name and API access |
| "Image processing failed" | Try a different image format or size |
| "Font loading error" | Check font file path or use default font |
π€ Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature-name - Make your changes and commit:
git commit -m 'Add feature' - Push to the branch:
git push origin feature-name - Submit a pull request
π License
This project is licensed under the MIT License - see the MIT LICENSE file for details.
π Acknowledgments
- Streamlit for the amazing web app framework
- OpenCV for powerful image processing capabilities
- LangChain for conversation memory management
- OpenAI, Google, and GROQ for providing excellent vision APIs
- Poppins Font for beautiful typography
π Support
If you encounter any issues or have questions:
- Check the troubleshooting section above
- Review the Issues page
- Create a new issue with detailed information
- Provide error messages and steps to reproduce
Built with β€οΈ using Streamlit, LangChain, OpenCV, and multi-model AI APIs