Spaces:

HarshitX
/

Multi_LLM_Image_Captioning

Sleeping

File size: 7,932 Bytes

---
license: mit
title: Multi_LLM_Image_Captioning
sdk: streamlit
emoji: 💻
colorFrom: purple
colorTo: indigo
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/662234af4dd89a733b09e612/gnrlvy8935CNe0fcx0HZs.png
short_description: A powerful Streamlit application that generates captions for
sdk_version: 1.46.1
---
# 🖼️ Multi-Model Image Caption Generator

A powerful Streamlit application that generates captions for images using multiple AI models (OpenAI GPT-4o, Google Gemini, and GROQ Vision) with advanced image processing capabilities using OpenCV and LangChain for history management.

## ✨ Features

- **Multi-Model Support**: Choose from OpenAI GPT-4o, Google Gemini, or GROQ Vision models
- **Smart Caption Generation**: Clean, professional captions (10-50 words, no emojis/symbols)
- **Advanced Image Processing**: Two caption overlay methods using OpenCV
- **LangChain Integration**: Comprehensive history management and conversation memory
- **Custom Typography**: Uses Poppins font with intelligent fallbacks
- **Interactive UI**: Modern Streamlit interface with real-time preview
- **Export Functionality**: Download processed images with captions

## 🚀 Quick Start

### Prerequisites

- Python 3.8+
- API keys for at least one of the supported models

### Installation

1. **Clone the repository**
```bash
git clone <your-repo-url>
cd multi-model-caption-generator
```

2. **Install dependencies**
```bash
pip install streamlit opencv-python pillow openai google-generativeai groq langchain python-dotenv
```

3. **Set up environment variables**
Create a `.env` file in the project root:
```env
OPENAI_API_KEY_IC=your_openai_api_key_here
GEMINI_API_KEY_IC=your_gemini_api_key_here
GROQ_API_KEY_IC=your_groq_api_key_here
```

4. **Set up fonts (optional)**
Place your font file at:
```
fonts/Poppins-Regular.ttf
```

5. **Run the application**
```bash
streamlit run main.py
```

## 📁 Project Structure

```
multi-model-caption-generator/
├── main.py                    # Main Streamlit application
├── caption_generation.py     # Multi-model caption generation
├── caption_history.py        # LangChain history management
├── caption_overlay.py        # OpenCV image processing
├── fonts/                    # Font directory
│   └── Poppins-Regular.ttf   # Custom font (optional)
├── .env                      # Environment variables
├── caption_history.json     # Auto-generated history file
└── README.md                # This file
```

## 🤖 Supported AI Models

### OpenAI GPT-4o
- **Model**: `gpt-4o`
- **Strengths**: Detailed image analysis, high accuracy
- **API**: OpenAI Vision API

### Google Gemini
- **Model**: `gemini-1.5-flash`
- **Strengths**: Fast processing, multimodal understanding
- **API**: Google Generative AI

### GROQ Vision
- **Model**: `llama-3.2-11b-vision-preview`
- **Strengths**: High-speed inference, efficient processing
- **API**: GROQ API

## 🎨 Caption Overlay Options

### 1. Overlay on Image
- Position: Top, Center, or Bottom
- Customizable font size and thickness
- Auto text wrapping for long captions
- Semi-transparent background for readability

### 2. Background Behind Image
- Caption appears above the image
- Customizable background and text colors
- Adjustable margins
- Uses Poppins font with fallbacks

## 📝 Caption History Management

The application uses LangChain for sophisticated history management:

- **Persistent Storage**: All captions saved to `caption_history.json`
- **Memory Integration**: LangChain ConversationBufferMemory
- **Search & Filter**: Find previous captions by image name or content
- **Export History**: View and manage generation history

## 🔧 Configuration

### API Keys Setup

Get your API keys from:
- **OpenAI**: [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
- **Google Gemini**: [https://makersuite.google.com/app/apikey](https://makersuite.google.com/app/apikey)
- **GROQ**: [https://console.groq.com/keys](https://console.groq.com/keys)

### Font Configuration

The app automatically uses fonts in this priority:
1. Custom font path (if specified in UI)
2. `fonts/Poppins-Regular.ttf` (if available)
3. System default font

### Caption Settings

- **Word Limit**: 10-50 words maximum
- **Format**: Plain text only (no emojis or special characters)
- **Style**: Descriptive but concise

## 🖥️ Usage

1. **Configure APIs**: Add your API keys to `.env` file and click "Configure APIs"
2. **Upload Image**: Choose PNG, JPG, JPEG, BMP, or TIFF files
3. **Select Model**: Choose from OpenAI, Gemini, or GROQ
4. **Generate Caption**: Click to generate and see real-time preview
5. **Customize Overlay**: Adjust position, colors, and styling
6. **Download**: Save the final image with caption

## 🎯 Key Features Explained

### Smart Caption Generation
- All models generate clean, professional captions
- Consistent 10-50 word length
- No emojis or special characters
- Perfect for image overlays

### Advanced Image Processing
- OpenCV-powered text rendering
- Automatic text wrapping
- High-quality font rendering with PIL
- Multiple overlay styles

### History Management
- LangChain integration for conversation memory
- Searchable history with timestamps
- Model tracking for each generation
- Easy history clearing and management

## 🛠️ Technical Details

### Dependencies
```
streamlit>=1.28.0
opencv-python>=4.8.0
pillow>=10.0.0
openai>=1.0.0
google-generativeai>=0.3.0
groq>=0.4.0
langchain>=0.1.0
python-dotenv>=1.0.0
numpy>=1.24.0
```

### Performance Optimizations
- Efficient base64 encoding for API calls
- Optimized image processing with OpenCV
- Smart memory management with LangChain
- Reduced token limits for faster generation

## 🔍 Troubleshooting

### Common Issues

**API Key Errors**
- Ensure all API keys are correctly set in `.env` file
- Check API key validity and quotas
- Restart the application after adding keys

**Font Loading Issues**
- Verify font file exists at `fonts/Poppins-Regular.ttf`
- Check file permissions
- App will fallback to default font if custom font fails

**Image Processing Errors**
- Ensure uploaded images are valid formats
- Check image file size (very large images may cause issues)
- Try different image formats if problems persist

**Model-Specific Issues**
- **OpenAI**: Verify you have access to GPT-4o vision model
- **Gemini**: Ensure Gemini API is enabled in your Google Cloud project
- **GROQ**: Check that vision models are available in your region

### Error Messages

| Error | Solution |
|-------|----------|
| "API key not configured" | Add the required API key to `.env` file |
| "Model not available" | Check model name and API access |
| "Image processing failed" | Try a different image format or size |
| "Font loading error" | Check font file path or use default font |

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and commit: `git commit -m 'Add feature'`
4. Push to the branch: `git push origin feature-name`
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [MIT LICENSE](https://mit-license.org/) file for details.

## 🙏 Acknowledgments

- **Streamlit** for the amazing web app framework
- **OpenCV** for powerful image processing capabilities
- **LangChain** for conversation memory management
- **OpenAI, Google, and GROQ** for providing excellent vision APIs
- **Poppins Font** for beautiful typography

## 📞 Support

If you encounter any issues or have questions:

1. Check the troubleshooting section above
2. Review the [Issues](https://github.com/your-repo/issues) page
3. Create a new issue with detailed information
4. Provide error messages and steps to reproduce

---

**Built with ❤️ using Streamlit, LangChain, OpenCV, and multi-model AI APIs**