---
title: Amazon Multimodal RAG Assistant
emoji: πŸ›’
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 7860
---
# Amazon Multimodal RAG Assistant
An AI-powered e-commerce search assistant that combines multimodal embeddings (CLIP), vector search (ChromaDB), and large language models to provide intelligent product recommendations and natural language responses.
![Project Status](https://img.shields.io/badge/status-active-success.svg)
![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)
## Features
- **Multimodal Search**: Search products using text, images, or both simultaneously
- **Intelligent Retrieval**: CLIP-based embeddings for semantic product matching
- **Dual LLM Support**: Choose between OpenAI GPT-4 or local open-source models
- **Natural Language Responses**: Context-aware answers powered by advanced LLMs
- **Modern Web Interface**: Clean, responsive UI with real-time search
- **Vector Database**: Persistent ChromaDB storage for fast retrieval
- **Prompt Engineering**: Supports zero-shot, few-shot, and multi-shot prompting
- **Chat History**: Multi-turn conversations with context awareness
- **Flexible Configuration**: Environment-based setup for easy customization
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Frontend   β”‚ (HTML/JS/TailwindCSS)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚ HTTP/JSON
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   FastAPI   β”‚ (REST API Server)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β–Ό                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     LLM     β”‚        β”‚     RAG     β”‚
β”‚  (GPT-4 or  β”‚        β”‚   (CLIP +   β”‚
β”‚  Local HF)  β”‚        β”‚  ChromaDB)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Components
1. **rag.py**: Retrieval system with CLIP embeddings and ChromaDB
2. **llm.py**: LLM interface with prompt engineering
3. **api_server.py**: FastAPI backend with singleton LLM pattern
4. **frontend/**: Modern web UI with drag-and-drop support
5. **config.py**: Centralized configuration management
## Requirements
- Python 3.8+
- CUDA-compatible GPU (optional, but recommended for faster inference)
- 8GB+ RAM (16GB+ recommended)
- 10GB+ disk space for models and data
## Installation
### 1. Clone the Repository
```bash
git clone <repository-url>   # replace with the actual repository URL
cd Multimodel
```
### 2. Create Virtual Environment
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
### 3. Install Dependencies
```bash
pip install -r requirements.txt
```
**Note**: CLIP installation requires git. If you encounter issues:
```bash
pip install git+https://github.com/openai/CLIP.git
```
### 4. Configure Environment
Create a `.env` file in the project root (copy from `.env.example`):
```bash
cp .env.example .env
```
**For OpenAI GPT-4 (Recommended):**
```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
```
**For Local Models (Free, but requires more compute):**
```bash
# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```
See [.env.example](.env.example) for all configuration options.
### 5. Prepare Data
Place your Amazon product CSV file in the project root:
```
amazon_multimodal_clean.csv
```
Expected CSV columns (a quick sanity check follows the list):
- `uniq_id`: Unique product identifier
- `product_name`: Product name
- `product_text`: Product description
- `main_category`: Product category
- `image`: Image URLs (pipe-separated)
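Before building the index, you can verify the file has the right shape. This is a minimal sketch, assuming `pandas` is installed; the path matches the default `CSV_PATH`:
```python
import pandas as pd

REQUIRED = ["uniq_id", "product_name", "product_text", "main_category", "image"]

df = pd.read_csv("amazon_multimodal_clean.csv")
missing = [col for col in REQUIRED if col not in df.columns]
if missing:
    raise ValueError(f"CSV is missing required columns: {missing}")
print(f"OK: {len(df)} products loaded")
```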
## Usage
### Step 1: Build Vector Index
```bash
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
```
Options:
- `--csv`: Path to your CSV file
- `--max`: Maximum number of products to index (optional; all products are indexed if omitted)
- `--db`: Database directory (default: `chromadb_store`)
This will (a sketch of the core loop follows the list):
- Download product images
- Generate CLIP embeddings
- Build ChromaDB vector index
- Save to `chromadb_store/`
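For reference, here is a minimal sketch of what that loop boils down to. It is illustrative, not the actual `rag.py` code; it assumes the OpenAI `clip` and `chromadb` packages, and the collection name is an assumption:
```python
import clip      # pip install git+https://github.com/openai/CLIP.git
import torch
import chromadb

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)

client = chromadb.PersistentClient(path="chromadb_store")
collection = client.get_or_create_collection("products")  # name is an assumption

def embed_text(text):
    """Encode text with CLIP and L2-normalize, so cosine similarity
    reduces to a dot product at query time."""
    tokens = clip.tokenize([text], truncate=True).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb[0].cpu().tolist()

# One product per add() call here for clarity; rag.py processes the full CSV.
collection.add(
    ids=["B000EXAMPLE"],
    embeddings=[embed_text("Wireless earbuds with noise cancellation")],
    metadatas=[{"product_name": "Example Earbuds", "main_category": "Electronics"}],
)
```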
### Step 2: Start API Server
```bash
python api_server.py
```
The server starts on `http://localhost:8000`.
**Startup Notes:**
- **GPT-4 Mode**: The server starts instantly; the first request takes 2-5 seconds (one API call)
- **Local Model Mode**: The first request takes 10-60 seconds while the model loads into memory; subsequent requests are fast (the model stays cached)
### Step 3: Open Web Interface
Navigate to: `http://localhost:8000`
#### Search Modes:
- **Text Only**: Search using natural language queries
- **Image Only**: Upload a product image to find similar items
- **Multimodal**: Combine text and image for refined search
#### Example Queries:
- "Wireless earbuds with noise cancellation under $150"
- "What is this product and how is it used?" (with image)
- "Compare the top two smartwatches you found"
## Configuration
### LLM Backend Selection
The system supports two LLM backends that can be switched via environment variables:
#### Option 1: OpenAI GPT-4 (Recommended)
**Advantages:**
- Superior response quality
- Faster response times (2-5 seconds)
- No GPU required
- Lower memory footprint
**Requirements:**
- OpenAI API key
- Internet connection
- Cost: ~$0.01-0.03 per query
**Configuration:**
```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
OPENAI_MAX_TOKENS=512
OPENAI_TEMPERATURE=0.2
```
#### Option 2: Local Open-Source Models
**Advantages:**
- Free (no API costs)
- Complete data privacy
- Works offline
- Customizable (fine-tuning possible)
**Requirements:**
- 16GB+ RAM (32GB+ for Mixtral)
- GPU recommended (CUDA-compatible)
**Supported Models:**
- `mistralai/Mistral-7B-Instruct-v0.3` (7B params, recommended)
- `meta-llama/Meta-Llama-3-8B-Instruct` (8B params)
- `mistralai/Mixtral-8x7B-Instruct-v0.1` (47B params, requires 32GB+ RAM)
**Configuration:**
```bash
# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
LLM_MAX_TOKENS=512
LLM_TEMPERATURE=0.2
```
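Under the hood, switching backends is just an environment flag. A minimal sketch of how `config.py` might read it (variable names match `.env.example`; the parsing details are an assumption):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the project root

USE_OPENAI = os.getenv("USE_OPENAI", "true").lower() == "true"
if USE_OPENAI:
    MODEL_NAME = os.getenv("OPENAI_MODEL", "gpt-4o")
else:
    MODEL_NAME = os.getenv("LLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.3")
```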
### Other Configuration Options
```bash
# Data paths
CSV_PATH=amazon_multimodal_clean.csv
CHROMA_DIR=chromadb_store
IMAGE_DIR=images
# CLIP model
CLIP_MODEL=ViT-B/32 # Options: ViT-B/32, ViT-B/16, ViT-L/14
# API server
API_HOST=0.0.0.0
API_PORT=8000
ALLOWED_ORIGINS=*
# Retrieval settings
TOP_K_PRODUCTS=5
MAX_TEXT_LENGTH=400
# Logging
LOG_LEVEL=INFO # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
```
See [.env.example](.env.example) for the complete configuration template.
## Evaluation
Evaluate retrieval quality:
```bash
python rag.py --eval --csv amazon_multimodal_clean.csv
```
Metrics computed (sketched after this list):
- Accuracy@1: Top result category match
- Recall@1, @5, @10: Category match in top K results
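Both are category-match rates over the test queries. A minimal sketch of Recall@K as described above (illustrative; see `rag.py --eval` for the actual computation):
```python
def recall_at_k(true_category, retrieved_categories, k):
    """1.0 if the true category appears among the top-k results' categories."""
    return float(true_category in retrieved_categories[:k])

# Accuracy@1 is the k=1 case; averaging over all queries yields the metric.
```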
## Testing
### Test Retrieval Only
```bash
# Text query
python rag.py --text "wireless headphones" --db chromadb_store
# Image query
python rag.py --image path/to/product.jpg --db chromadb_store
```
### Test LLM Generation
```bash
python llm.py
```
## Project Structure
```
Multimodel/
β”œβ”€β”€ rag.py                       # CLIP + ChromaDB retrieval system
β”œβ”€β”€ llm.py                       # LLM interface with prompt engineering
β”œβ”€β”€ api_server.py                # FastAPI REST API
β”œβ”€β”€ config.py                    # Configuration management
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ .gitignore                   # Git ignore rules
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ index.html               # Web UI
β”‚   β”œβ”€β”€ main.js                  # Frontend JavaScript
β”‚   └── amazon-logo.png          # Logo asset
β”œβ”€β”€ chromadb_store/              # Vector database (generated)
β”œβ”€β”€ images/                      # Downloaded product images (generated)
└── amazon_multimodal_clean.csv  # Your dataset
```
## Troubleshooting
### Issue: "OpenAI API key is required"
**Solution**: Ensure you've created a `.env` file and that the `python-dotenv` package is installed:
```bash
# Install dotenv if missing
pip install python-dotenv
# Create .env file
cp .env.example .env
# Edit .env and add your API key
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-actual-api-key-here
```
### Issue: "TypeError: failed to extract enum MetadataValue"
**Solution**: This occurs during index building with ChromaDB. Update to the latest version:
```bash
pip install --upgrade chromadb
```
The code now handles None values properly by converting them to empty strings.
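For reference, that None-handling amounts to sanitizing metadata before insertion (a minimal sketch; the helper name is illustrative):
```python
def sanitize_metadata(meta):
    """ChromaDB metadata values must be str, int, float, or bool;
    replace None with an empty string before collection.add()."""
    return {key: ("" if value is None else value) for key, value in meta.items()}
```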
### Issue: "CUDA out of memory" (Local Models)
**Solution**: Run in CPU mode or reduce the batch size:
```bash
# Force CPU mode
export CUDA_VISIBLE_DEVICES=-1
python api_server.py
```
### Issue: "Model loading takes too long" (Local Models)
**Solution**: This is normal for the first request (10-60s). The model is cached in memory for subsequent requests. Consider using GPT-4 for faster response times.
### Issue: "Image download failures"
**Solution**: Some product URLs may be invalid or expired. This is normal and logged. The system will use text-only embeddings for those products.
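In practice, the fallback looks like the sketch below (function names are hypothetical; see `rag.py` for the real logic):
```python
import logging

def embed_product(text, image_url):
    """Prefer a combined text+image embedding; fall back to text-only
    when the image cannot be downloaded (invalid or expired URL)."""
    try:
        image = download_image(image_url)  # hypothetical helper
        return combine_embeddings(embed_text(text), embed_image(image))  # hypothetical helpers
    except Exception as err:
        logging.warning("Image fetch failed (%s); using text-only embedding", err)
        return embed_text(text)
```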
### Issue: Port 8000 already in use
**Solution**: Change the port via an environment variable:
```bash
export API_PORT=8080
python api_server.py
```
### Issue: Duplicate products after multiple index builds
**Solution**: The index is built with ChromaDB's `add()`, which does not deduplicate entries. To rebuild the index, delete the database directory first:
```bash
rm -rf chromadb_store
python rag.py --build --csv amazon_multimodal_clean.csv
```
## Security Notes
- **CORS**: Currently set to `allow_origins=["*"]` for development
- For production, configure `ALLOWED_ORIGINS` to specific domains
- **Error Messages**: Generic errors are returned to clients; detailed logs are server-side only
- **File Uploads**: Images are validated and temporarily stored, then cleaned up
## Performance Optimization
### Implemented Optimizations:
1. **LLM Singleton Pattern**: The model loads once at server startup and is reused across requests (5-20x speedup); see the sketch after this list
2. **CLIP Embedding Caching**: CLIP model stays in memory after first load
3. **ChromaDB HNSW Indexing**: Approximate nearest neighbor search with O(log N) complexity
4. **L2 Normalized Embeddings**: Cosine similarity computed via efficient dot products
5. **Graceful Error Handling**: Image download failures don't block indexing process
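A minimal sketch of the singleton idea (illustrative; `load_model` is a hypothetical stand-in for the loader in `api_server.py`):
```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_llm():
    """First call loads the model; later calls return the cached instance."""
    return load_model()  # hypothetical loader; see api_server.py

# Every request handler calls get_llm() and shares one loaded model.
```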
### Additional Optimizations for Production:
1. **Use GPU**: CUDA-enabled GPU for 10-50x faster CLIP inference (local models)
2. **Use GPT-4**: Cloud-based LLM eliminates model loading overhead
3. **Batch Processing**: Build index in batches for large datasets
4. **CDN for Images**: Serve product images via CDN
5. **Load Balancer**: Use multiple API instances behind a load balancer
6. **Redis Caching**: Cache frequent queries and embeddings
## Future Enhancements
- [ ] Add user authentication
- [ ] Implement product filtering (price, brand, etc.)
- [ ] Add bookmark/favorites functionality
- [ ] Support multilingual queries
- [ ] Integrate with real Amazon API
- [ ] Add A/B testing for different prompts
- [ ] Implement caching layer (Redis)
- [ ] Add monitoring and analytics
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/YourFeature`)
3. Commit changes (`git commit -m 'Add YourFeature'`)
4. Push to branch (`git push origin feature/YourFeature`)
5. Open a Pull Request
## License
Licensed under the MIT License (see the Space metadata above). This project is intended for educational and research purposes.
## Acknowledgments
- **OpenAI**: CLIP multimodal embeddings and GPT-4 API
- **ChromaDB**: Vector database with HNSW indexing
- **HuggingFace**: Transformers library and model hosting
- **FastAPI**: Modern web framework
- **Mistral AI / Meta**: Open-source LLMs
- **Tailwind CSS**: Frontend styling framework
---
## Additional Documentation
- **[Research Report](research_report.tex)**: Comprehensive technical report in LaTeX format covering implementation details, challenges, solutions, and future improvements
- **[Quick Start Guide for GPT-4](QUICKSTART_GPT4.md)**: Step-by-step guide for setting up with OpenAI GPT-4
---
**Built with ❀️ using CLIP, ChromaDB, GPT-4, and Open-Source LLMs**