Spaces:
Sleeping
Sleeping
| license: mit | |
| title: Multi_LLM_Image_Captioning | |
| sdk: streamlit | |
| emoji: π» | |
| colorFrom: purple | |
| colorTo: indigo | |
| pinned: true | |
| thumbnail: >- | |
| https://cdn-uploads.huggingface.co/production/uploads/662234af4dd89a733b09e612/gnrlvy8935CNe0fcx0HZs.png | |
| short_description: A powerful Streamlit application that generates captions for | |
| sdk_version: 1.46.1 | |
| # πΌοΈ Multi-Model Image Caption Generator | |
| A powerful Streamlit application that generates captions for images using multiple AI models (OpenAI GPT-4o, Google Gemini, and GROQ Vision) with advanced image processing capabilities using OpenCV and LangChain for history management. | |
| ## β¨ Features | |
| - **Multi-Model Support**: Choose from OpenAI GPT-4o, Google Gemini, or GROQ Vision models | |
| - **Smart Caption Generation**: Clean, professional captions (10-50 words, no emojis/symbols) | |
| - **Advanced Image Processing**: Two caption overlay methods using OpenCV | |
| - **LangChain Integration**: Comprehensive history management and conversation memory | |
| - **Custom Typography**: Uses Poppins font with intelligent fallbacks | |
| - **Interactive UI**: Modern Streamlit interface with real-time preview | |
| - **Export Functionality**: Download processed images with captions | |
| ## π Quick Start | |
| ### Prerequisites | |
| - Python 3.8+ | |
| - API keys for at least one of the supported models | |
| ### Installation | |
| 1. **Clone the repository** | |
| ```bash | |
| git clone <your-repo-url> | |
| cd multi-model-caption-generator | |
| ``` | |
| 2. **Install dependencies** | |
| ```bash | |
| pip install streamlit opencv-python pillow openai google-generativeai groq langchain python-dotenv | |
| ``` | |
| 3. **Set up environment variables** | |
| Create a `.env` file in the project root: | |
| ```env | |
| OPENAI_API_KEY_IC=your_openai_api_key_here | |
| GEMINI_API_KEY_IC=your_gemini_api_key_here | |
| GROQ_API_KEY_IC=your_groq_api_key_here | |
| ``` | |
| 4. **Set up fonts (optional)** | |
| Place your font file at: | |
| ``` | |
| fonts/Poppins-Regular.ttf | |
| ``` | |
| 5. **Run the application** | |
| ```bash | |
| streamlit run main.py | |
| ``` | |
| ## π Project Structure | |
| ``` | |
| multi-model-caption-generator/ | |
| βββ main.py # Main Streamlit application | |
| βββ caption_generation.py # Multi-model caption generation | |
| βββ caption_history.py # LangChain history management | |
| βββ caption_overlay.py # OpenCV image processing | |
| βββ fonts/ # Font directory | |
| β βββ Poppins-Regular.ttf # Custom font (optional) | |
| βββ .env # Environment variables | |
| βββ caption_history.json # Auto-generated history file | |
| βββ README.md # This file | |
| ``` | |
| ## π€ Supported AI Models | |
| ### OpenAI GPT-4o | |
| - **Model**: `gpt-4o` | |
| - **Strengths**: Detailed image analysis, high accuracy | |
| - **API**: OpenAI Vision API | |
| ### Google Gemini | |
| - **Model**: `gemini-1.5-flash` | |
| - **Strengths**: Fast processing, multimodal understanding | |
| - **API**: Google Generative AI | |
| ### GROQ Vision | |
| - **Model**: `llama-3.2-11b-vision-preview` | |
| - **Strengths**: High-speed inference, efficient processing | |
| - **API**: GROQ API | |
| ## π¨ Caption Overlay Options | |
| ### 1. Overlay on Image | |
| - Position: Top, Center, or Bottom | |
| - Customizable font size and thickness | |
| - Auto text wrapping for long captions | |
| - Semi-transparent background for readability | |
| ### 2. Background Behind Image | |
| - Caption appears above the image | |
| - Customizable background and text colors | |
| - Adjustable margins | |
| - Uses Poppins font with fallbacks | |
| ## π Caption History Management | |
| The application uses LangChain for sophisticated history management: | |
| - **Persistent Storage**: All captions saved to `caption_history.json` | |
| - **Memory Integration**: LangChain ConversationBufferMemory | |
| - **Search & Filter**: Find previous captions by image name or content | |
| - **Export History**: View and manage generation history | |
| ## π§ Configuration | |
| ### API Keys Setup | |
| Get your API keys from: | |
| - **OpenAI**: [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys) | |
| - **Google Gemini**: [https://makersuite.google.com/app/apikey](https://makersuite.google.com/app/apikey) | |
| - **GROQ**: [https://console.groq.com/keys](https://console.groq.com/keys) | |
| ### Font Configuration | |
| The app automatically uses fonts in this priority: | |
| 1. Custom font path (if specified in UI) | |
| 2. `fonts/Poppins-Regular.ttf` (if available) | |
| 3. System default font | |
| ### Caption Settings | |
| - **Word Limit**: 10-50 words maximum | |
| - **Format**: Plain text only (no emojis or special characters) | |
| - **Style**: Descriptive but concise | |
| ## π₯οΈ Usage | |
| 1. **Configure APIs**: Add your API keys to `.env` file and click "Configure APIs" | |
| 2. **Upload Image**: Choose PNG, JPG, JPEG, BMP, or TIFF files | |
| 3. **Select Model**: Choose from OpenAI, Gemini, or GROQ | |
| 4. **Generate Caption**: Click to generate and see real-time preview | |
| 5. **Customize Overlay**: Adjust position, colors, and styling | |
| 6. **Download**: Save the final image with caption | |
| ## π― Key Features Explained | |
| ### Smart Caption Generation | |
| - All models generate clean, professional captions | |
| - Consistent 10-50 word length | |
| - No emojis or special characters | |
| - Perfect for image overlays | |
| ### Advanced Image Processing | |
| - OpenCV-powered text rendering | |
| - Automatic text wrapping | |
| - High-quality font rendering with PIL | |
| - Multiple overlay styles | |
| ### History Management | |
| - LangChain integration for conversation memory | |
| - Searchable history with timestamps | |
| - Model tracking for each generation | |
| - Easy history clearing and management | |
| ## π οΈ Technical Details | |
| ### Dependencies | |
| ``` | |
| streamlit>=1.28.0 | |
| opencv-python>=4.8.0 | |
| pillow>=10.0.0 | |
| openai>=1.0.0 | |
| google-generativeai>=0.3.0 | |
| groq>=0.4.0 | |
| langchain>=0.1.0 | |
| python-dotenv>=1.0.0 | |
| numpy>=1.24.0 | |
| ``` | |
| ### Performance Optimizations | |
| - Efficient base64 encoding for API calls | |
| - Optimized image processing with OpenCV | |
| - Smart memory management with LangChain | |
| - Reduced token limits for faster generation | |
| ## π Troubleshooting | |
| ### Common Issues | |
| **API Key Errors** | |
| - Ensure all API keys are correctly set in `.env` file | |
| - Check API key validity and quotas | |
| - Restart the application after adding keys | |
| **Font Loading Issues** | |
| - Verify font file exists at `fonts/Poppins-Regular.ttf` | |
| - Check file permissions | |
| - App will fallback to default font if custom font fails | |
| **Image Processing Errors** | |
| - Ensure uploaded images are valid formats | |
| - Check image file size (very large images may cause issues) | |
| - Try different image formats if problems persist | |
| **Model-Specific Issues** | |
| - **OpenAI**: Verify you have access to GPT-4o vision model | |
| - **Gemini**: Ensure Gemini API is enabled in your Google Cloud project | |
| - **GROQ**: Check that vision models are available in your region | |
| ### Error Messages | |
| | Error | Solution | | |
| |-------|----------| | |
| | "API key not configured" | Add the required API key to `.env` file | | |
| | "Model not available" | Check model name and API access | | |
| | "Image processing failed" | Try a different image format or size | | |
| | "Font loading error" | Check font file path or use default font | | |
| ## π€ Contributing | |
| 1. Fork the repository | |
| 2. Create a feature branch: `git checkout -b feature-name` | |
| 3. Make your changes and commit: `git commit -m 'Add feature'` | |
| 4. Push to the branch: `git push origin feature-name` | |
| 5. Submit a pull request | |
| ## π License | |
| This project is licensed under the MIT License - see the [MIT LICENSE](https://mit-license.org/) file for details. | |
| ## π Acknowledgments | |
| - **Streamlit** for the amazing web app framework | |
| - **OpenCV** for powerful image processing capabilities | |
| - **LangChain** for conversation memory management | |
| - **OpenAI, Google, and GROQ** for providing excellent vision APIs | |
| - **Poppins Font** for beautiful typography | |
| ## π Support | |
| If you encounter any issues or have questions: | |
| 1. Check the troubleshooting section above | |
| 2. Review the [Issues](https://github.com/your-repo/issues) page | |
| 3. Create a new issue with detailed information | |
| 4. Provide error messages and steps to reproduce | |
| --- | |
| **Built with β€οΈ using Streamlit, LangChain, OpenCV, and multi-model AI APIs** |