Spaces:

HarshitX
/

Multi_LLM_Image_Captioning

Sleeping

App Files Files Community

Multi_LLM_Image_Captioning / README.md

HarshitX

Update README.md

8925aaa verified 7 months ago

preview code

raw

history blame contribute delete

7.93 kB

	---
	license: mit
	title: Multi_LLM_Image_Captioning
	sdk: streamlit
	emoji: 💻
	colorFrom: purple
	colorTo: indigo
	pinned: true
	thumbnail: >-
	https://cdn-uploads.huggingface.co/production/uploads/662234af4dd89a733b09e612/gnrlvy8935CNe0fcx0HZs.png
	short_description: A powerful Streamlit application that generates captions for
	sdk_version: 1.46.1
	---
	# 🖼️ Multi-Model Image Caption Generator

	A powerful Streamlit application that generates captions for images using multiple AI models (OpenAI GPT-4o, Google Gemini, and GROQ Vision) with advanced image processing capabilities using OpenCV and LangChain for history management.

	## ✨ Features

	- Multi-Model Support: Choose from OpenAI GPT-4o, Google Gemini, or GROQ Vision models
	- Smart Caption Generation: Clean, professional captions (10-50 words, no emojis/symbols)
	- Advanced Image Processing: Two caption overlay methods using OpenCV
	- LangChain Integration: Comprehensive history management and conversation memory
	- Custom Typography: Uses Poppins font with intelligent fallbacks
	- Interactive UI: Modern Streamlit interface with real-time preview
	- Export Functionality: Download processed images with captions

	## 🚀 Quick Start

	### Prerequisites

	- Python 3.8+
	- API keys for at least one of the supported models

	### Installation

	1. Clone the repository
	```bash
	git clone <your-repo-url>
	cd multi-model-caption-generator
	```

	2. Install dependencies
	```bash
	pip install streamlit opencv-python pillow openai google-generativeai groq langchain python-dotenv
	```

	3. Set up environment variables
	Create a `.env` file in the project root:
	```env
	OPENAI_API_KEY_IC=your_openai_api_key_here
	GEMINI_API_KEY_IC=your_gemini_api_key_here
	GROQ_API_KEY_IC=your_groq_api_key_here
	```

	4. Set up fonts (optional)
	Place your font file at:
	```
	fonts/Poppins-Regular.ttf
	```

	5. Run the application
	```bash
	streamlit run main.py
	```

	## 📁 Project Structure

	```
	multi-model-caption-generator/
	├── main.py # Main Streamlit application
	├── caption_generation.py # Multi-model caption generation
	├── caption_history.py # LangChain history management
	├── caption_overlay.py # OpenCV image processing
	├── fonts/ # Font directory
	│ └── Poppins-Regular.ttf # Custom font (optional)
	├── .env # Environment variables
	├── caption_history.json # Auto-generated history file
	└── README.md # This file
	```

	## 🤖 Supported AI Models

	### OpenAI GPT-4o
	- Model: `gpt-4o`
	- Strengths: Detailed image analysis, high accuracy
	- API: OpenAI Vision API

	### Google Gemini
	- Model: `gemini-1.5-flash`
	- Strengths: Fast processing, multimodal understanding
	- API: Google Generative AI

	### GROQ Vision
	- Model: `llama-3.2-11b-vision-preview`
	- Strengths: High-speed inference, efficient processing
	- API: GROQ API

	## 🎨 Caption Overlay Options

	### 1. Overlay on Image
	- Position: Top, Center, or Bottom
	- Customizable font size and thickness
	- Auto text wrapping for long captions
	- Semi-transparent background for readability

	### 2. Background Behind Image
	- Caption appears above the image
	- Customizable background and text colors
	- Adjustable margins
	- Uses Poppins font with fallbacks

	## 📝 Caption History Management

	The application uses LangChain for sophisticated history management:

	- Persistent Storage: All captions saved to `caption_history.json`
	- Memory Integration: LangChain ConversationBufferMemory
	- Search & Filter: Find previous captions by image name or content
	- Export History: View and manage generation history

	## 🔧 Configuration

	### API Keys Setup

	Get your API keys from:
	- OpenAI: [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
	- Google Gemini: [https://makersuite.google.com/app/apikey](https://makersuite.google.com/app/apikey)
	- GROQ: [https://console.groq.com/keys](https://console.groq.com/keys)

	### Font Configuration

	The app automatically uses fonts in this priority:
	1. Custom font path (if specified in UI)
	2. `fonts/Poppins-Regular.ttf` (if available)
	3. System default font

	### Caption Settings

	- Word Limit: 10-50 words maximum
	- Format: Plain text only (no emojis or special characters)
	- Style: Descriptive but concise

	## 🖥️ Usage

	1. Configure APIs: Add your API keys to `.env` file and click "Configure APIs"
	2. Upload Image: Choose PNG, JPG, JPEG, BMP, or TIFF files
	3. Select Model: Choose from OpenAI, Gemini, or GROQ
	4. Generate Caption: Click to generate and see real-time preview
	5. Customize Overlay: Adjust position, colors, and styling
	6. Download: Save the final image with caption

	## 🎯 Key Features Explained

	### Smart Caption Generation
	- All models generate clean, professional captions
	- Consistent 10-50 word length
	- No emojis or special characters
	- Perfect for image overlays

	### Advanced Image Processing
	- OpenCV-powered text rendering
	- Automatic text wrapping
	- High-quality font rendering with PIL
	- Multiple overlay styles

	### History Management
	- LangChain integration for conversation memory
	- Searchable history with timestamps
	- Model tracking for each generation
	- Easy history clearing and management

	## 🛠️ Technical Details

	### Dependencies
	```
	streamlit>=1.28.0
	opencv-python>=4.8.0
	pillow>=10.0.0
	openai>=1.0.0
	google-generativeai>=0.3.0
	groq>=0.4.0
	langchain>=0.1.0
	python-dotenv>=1.0.0
	numpy>=1.24.0
	```

	### Performance Optimizations
	- Efficient base64 encoding for API calls
	- Optimized image processing with OpenCV
	- Smart memory management with LangChain
	- Reduced token limits for faster generation

	## 🔍 Troubleshooting

	### Common Issues

	API Key Errors
	- Ensure all API keys are correctly set in `.env` file
	- Check API key validity and quotas
	- Restart the application after adding keys

	Font Loading Issues
	- Verify font file exists at `fonts/Poppins-Regular.ttf`
	- Check file permissions
	- App will fallback to default font if custom font fails

	Image Processing Errors
	- Ensure uploaded images are valid formats
	- Check image file size (very large images may cause issues)
	- Try different image formats if problems persist

	Model-Specific Issues
	- OpenAI: Verify you have access to GPT-4o vision model
	- Gemini: Ensure Gemini API is enabled in your Google Cloud project
	- GROQ: Check that vision models are available in your region

	### Error Messages

	\| Error \| Solution \|
	\|-------\|----------\|
	\| "API key not configured" \| Add the required API key to `.env` file \|
	\| "Model not available" \| Check model name and API access \|
	\| "Image processing failed" \| Try a different image format or size \|
	\| "Font loading error" \| Check font file path or use default font \|

	## 🤝 Contributing

	1. Fork the repository
	2. Create a feature branch: `git checkout -b feature-name`
	3. Make your changes and commit: `git commit -m 'Add feature'`
	4. Push to the branch: `git push origin feature-name`
	5. Submit a pull request

	## 📄 License

	This project is licensed under the MIT License - see the [MIT LICENSE](https://mit-license.org/) file for details.

	## 🙏 Acknowledgments

	- Streamlit for the amazing web app framework
	- OpenCV for powerful image processing capabilities
	- LangChain for conversation memory management
	- OpenAI, Google, and GROQ for providing excellent vision APIs
	- Poppins Font for beautiful typography

	## 📞 Support

	If you encounter any issues or have questions:

	1. Check the troubleshooting section above
	2. Review the [Issues](https://github.com/your-repo/issues) page
	3. Create a new issue with detailed information
	4. Provide error messages and steps to reproduce

	---

	Built with ❤️ using Streamlit, LangChain, OpenCV, and multi-model AI APIs