---
license: mit
title: Multi_LLM_Image_Captioning
sdk: streamlit
emoji: 💻
colorFrom: purple
colorTo: indigo
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/662234af4dd89a733b09e612/gnrlvy8935CNe0fcx0HZs.png
short_description: A powerful Streamlit application that generates captions for
sdk_version: 1.46.1
---

# 🖼️ Multi-Model Image Caption Generator

A powerful Streamlit application that generates captions for images using multiple AI models (OpenAI GPT-4o, Google Gemini, and GROQ Vision), with advanced image processing via OpenCV and LangChain-based history management.

## ✨ Features

- **Multi-Model Support**: Choose from OpenAI GPT-4o, Google Gemini, or GROQ Vision models
- **Smart Caption Generation**: Clean, professional captions (10-50 words, no emojis/symbols)
- **Advanced Image Processing**: Two caption overlay methods using OpenCV
- **LangChain Integration**: Comprehensive history management and conversation memory
- **Custom Typography**: Uses the Poppins font with intelligent fallbacks
- **Interactive UI**: Modern Streamlit interface with real-time preview
- **Export Functionality**: Download processed images with captions

## 🚀 Quick Start

### Prerequisites

- Python 3.8+
- API keys for at least one of the supported models

### Installation

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd multi-model-caption-generator
   ```

2. **Install dependencies**

   ```bash
   pip install streamlit opencv-python pillow openai google-generativeai groq langchain python-dotenv
   ```

3. **Set up environment variables**

   Create a `.env` file in the project root:

   ```env
   OPENAI_API_KEY_IC=your_openai_api_key_here
   GEMINI_API_KEY_IC=your_gemini_api_key_here
   GROQ_API_KEY_IC=your_groq_api_key_here
   ```

4. **Set up fonts (optional)**

   Place your font file at:

   ```
   fonts/Poppins-Regular.ttf
   ```

5. **Run the application**

   ```bash
   streamlit run main.py
   ```

## 📁 Project Structure

```
multi-model-caption-generator/
├── main.py                 # Main Streamlit application
├── caption_generation.py   # Multi-model caption generation
├── caption_history.py      # LangChain history management
├── caption_overlay.py      # OpenCV image processing
├── fonts/                  # Font directory
│   └── Poppins-Regular.ttf # Custom font (optional)
├── .env                    # Environment variables
├── caption_history.json    # Auto-generated history file
└── README.md               # This file
```

## 🤖 Supported AI Models

### OpenAI GPT-4o

- **Model**: `gpt-4o`
- **Strengths**: Detailed image analysis, high accuracy
- **API**: OpenAI Vision API

### Google Gemini

- **Model**: `gemini-1.5-flash`
- **Strengths**: Fast processing, multimodal understanding
- **API**: Google Generative AI

### GROQ Vision

- **Model**: `llama-3.2-11b-vision-preview`
- **Strengths**: High-speed inference, efficient processing
- **API**: GROQ API

## 🎨 Caption Overlay Options

### 1. Overlay on Image

- Position: Top, Center, or Bottom
- Customizable font size and thickness
- Automatic text wrapping for long captions
- Semi-transparent background for readability

### 2. Background Behind Image

- Caption appears above the image
- Customizable background and text colors
- Adjustable margins
- Uses the Poppins font with fallbacks

## 📝 Caption History Management

The application uses LangChain for sophisticated history management:

- **Persistent Storage**: All captions are saved to `caption_history.json`
- **Memory Integration**: LangChain ConversationBufferMemory
- **Search & Filter**: Find previous captions by image name or content
- **Export History**: View and manage generation history

## 🔧 Configuration

### API Keys Setup

Get your API keys from:

- **OpenAI**: [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
- **Google Gemini**: [https://makersuite.google.com/app/apikey](https://makersuite.google.com/app/apikey)
- **GROQ**: [https://console.groq.com/keys](https://console.groq.com/keys)

### Font Configuration

The app selects fonts in this priority order:

1. Custom font path (if specified in the UI)
2. `fonts/Poppins-Regular.ttf` (if available)
3. System default font

### Caption Settings

- **Word Limit**: 10-50 words
- **Format**: Plain text only (no emojis or special characters)
- **Style**: Descriptive but concise

## 🖥️ Usage

1. **Configure APIs**: Add your API keys to the `.env` file and click "Configure APIs"
2. **Upload Image**: Choose PNG, JPG, JPEG, BMP, or TIFF files
3. **Select Model**: Choose from OpenAI, Gemini, or GROQ
4. **Generate Caption**: Click to generate and see a real-time preview
5. **Customize Overlay**: Adjust position, colors, and styling
6. **Download**: Save the final image with its caption

## 🎯 Key Features Explained

### Smart Caption Generation

- All models generate clean, professional captions
- Consistent 10-50 word length
- No emojis or special characters
- Well suited for image overlays

### Advanced Image Processing

- OpenCV-powered text rendering
- Automatic text wrapping
- High-quality font rendering with PIL
- Multiple overlay styles

### History Management

- LangChain integration for conversation memory
- Searchable history with timestamps
- Model tracking for each generation
- Easy history clearing and management

## 🛠️ Technical Details

### Dependencies

```
streamlit>=1.28.0
opencv-python>=4.8.0
pillow>=10.0.0
openai>=1.0.0
google-generativeai>=0.3.0
groq>=0.4.0
langchain>=0.1.0
python-dotenv>=1.0.0
numpy>=1.24.0
```

### Performance Optimizations

- Efficient base64 encoding for API calls
- Optimized image processing with OpenCV
- Smart memory management with LangChain
- Reduced token limits for faster generation

## 🔍 Troubleshooting

### Common Issues

**API Key Errors**

- Ensure all API keys are correctly set in the `.env` file
- Check API key validity and quotas
- Restart the application after adding keys

**Font Loading Issues**

- Verify the font file exists at `fonts/Poppins-Regular.ttf`
- Check file permissions
- The app falls back to the default font if the custom font fails to load

**Image Processing Errors**

- Ensure uploaded images are in a valid format
- Check the image file size (very large images may cause issues)
- Try different image formats if problems persist

**Model-Specific Issues**

- **OpenAI**: Verify you have access to the GPT-4o vision model
- **Gemini**: Ensure the Gemini API is enabled in your Google Cloud project
- **GROQ**: Check that vision models are available in your region

### Error Messages

| Error | Solution |
|-------|----------|
| "API key not configured" | Add the required API key to the `.env` file |
| "Model not available" | Check the model name and API access |
| "Image processing failed" | Try a different image format or size |
| "Font loading error" | Check the font file path or use the default font |

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and commit: `git commit -m 'Add feature'`
4. Push to the branch: `git push origin feature-name`
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [MIT License](https://mit-license.org/) for details.

## 🙏 Acknowledgments

- **Streamlit** for the web app framework
- **OpenCV** for powerful image processing capabilities
- **LangChain** for conversation memory management
- **OpenAI, Google, and GROQ** for providing excellent vision APIs
- **Poppins Font** for beautiful typography

## 📞 Support

If you encounter any issues or have questions:

1. Check the troubleshooting section above
2. Review the [Issues](https://github.com/your-repo/issues) page
3. Create a new issue with detailed information
4. Provide error messages and steps to reproduce

---

**Built with ❤️ using Streamlit, LangChain, OpenCV, and multi-model AI APIs**
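
## 🧪 Appendix: Base64 Encoding Sketch

The "efficient base64 encoding for API calls" mentioned under Performance Optimizations can be illustrated with a minimal sketch. This is a hypothetical example, not the actual code in `caption_generation.py`: the helper names `encode_image` and `build_vision_message` are assumptions, though the payload shape follows the OpenAI Chat Completions vision format (Gemini and GROQ accept similarly base64-encoded images).

```python
import base64
from pathlib import Path


def encode_image(image_path: str) -> str:
    """Read an image file and return its base64-encoded contents as text."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")


def build_vision_message(image_path: str, prompt: str) -> dict:
    """Build a chat message that embeds the image as a data URL.

    The structure mirrors the OpenAI Chat Completions vision request shape:
    a user message whose content mixes a text part and an image_url part.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}"
                },
            },
        ],
    }
```

Encoding once and reusing the resulting message across models is what keeps per-call overhead low when the same image is captioned by several backends.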