---
title: Amazon Multimodal RAG Assistant
emoji: 🛒
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# Amazon Multimodal RAG Assistant

An AI-powered e-commerce search assistant that combines multimodal embeddings (CLIP), vector search (ChromaDB), and large language models to provide intelligent product recommendations and natural language responses.

![Project Status](https://img.shields.io/badge/status-active-success.svg)
![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)

## Features

- **Multimodal Search**: Search products using text, images, or both simultaneously
- **Intelligent Retrieval**: CLIP-based embeddings for semantic product matching
- **Dual LLM Support**: Choose between OpenAI GPT-4 and local open-source models
- **Natural Language Responses**: Context-aware answers powered by advanced LLMs
- **Modern Web Interface**: Clean, responsive UI with real-time search
- **Vector Database**: Persistent ChromaDB storage for fast retrieval
- **Prompt Engineering**: Supports zero-shot, few-shot, and multi-shot prompting
- **Chat History**: Multi-turn conversations with context awareness
- **Flexible Configuration**: Environment-based setup for easy customization

## Architecture

```
┌─────────────┐
│   Frontend  │  (HTML/JS/TailwindCSS)
└──────┬──────┘
       │ HTTP/JSON
       ▼
┌─────────────┐
│  FastAPI    │  (REST API Server)
└──────┬──────┘
       │
       ├─────────────────┐
       ▼                 ▼
┌─────────────┐   ┌─────────────┐
│   LLM       │   │    RAG      │
│ (GPT-4 or   │   │   (CLIP +   │
│  Local HF)  │   │  ChromaDB)  │
└─────────────┘   └─────────────┘
```

### Components

1. **rag.py**: Retrieval system with CLIP embeddings and ChromaDB
2. **llm.py**: LLM interface with prompt engineering
3. **api_server.py**: FastAPI backend with singleton LLM pattern
4. **frontend/**: Modern web UI with drag-and-drop support
5. **config.py**: Centralized configuration management

## Requirements

- Python 3.8+
- CUDA-compatible GPU (optional, but recommended for faster inference)
- 8GB+ RAM (16GB+ recommended)
- 10GB+ disk space for models and data

## Installation

### 1. Clone the Repository

```bash
git clone <repository-url>
cd Multimodel
```

### 2. Create Virtual Environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### 3. Install Dependencies

```bash
pip install -r requirements.txt
```

**Note**: CLIP is installed from GitHub, so `git` must be available. If the CLIP install fails, install it directly:

```bash
pip install git+https://github.com/openai/CLIP.git
```

### 4. Configure Environment

Create a `.env` file in the project root (copy from `.env.example`):

```bash
cp .env.example .env
```

**For OpenAI GPT-4 (Recommended):**
```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
```

**For Local Models (Free, but requires more compute):**
```bash
# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```

See [.env.example](.env.example) for all configuration options.

### 5. Prepare Data

Place your Amazon product CSV file in the project root:

```
amazon_multimodal_clean.csv
```

Expected CSV columns:
- `uniq_id`: Unique product identifier
- `product_name`: Product name
- `product_text`: Product description
- `main_category`: Product category
- `image`: Image URLs (pipe-separated)
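A minimal sketch of loading this file with Python's standard `csv` module and splitting the pipe-separated `image` column into a list. `load_products`, the `max_rows` cutoff (mirroring `--max` below), and skipping rows with an empty `uniq_id` are assumptions for illustration, not the logic in `rag.py`.

```python
import csv
from typing import Dict, Iterator, List, Optional, TextIO

# Columns listed above; the loader refuses files that lack any of them
REQUIRED_COLUMNS = {"uniq_id", "product_name", "product_text", "main_category", "image"}


def load_products(f: TextIO, max_rows: Optional[int] = None) -> Iterator[Dict[str, object]]:
    """Hypothetical loader: yields one dict per product, with image URLs split into a list."""
    reader = csv.DictReader(f)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")
    count = 0
    for row in reader:
        if max_rows is not None and count >= max_rows:
            break
        if not (row["uniq_id"] or "").strip():
            continue  # assumption: rows without an ID are skipped
        # `image` holds several URLs separated by '|'; trim whitespace and drop empty pieces
        urls: List[str] = [u.strip() for u in (row["image"] or "").split("|") if u.strip()]
        yield {**row, "image": urls}
        count += 1
```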

## Usage

### Step 1: Build Vector Index

```bash
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
```

Options:
- `--csv`: Path to your CSV file
- `--max`: Maximum number of products to index (optional; if omitted, all products are indexed)
- `--db`: Database directory (default: `chromadb_store`)

This will:
- Download product images
- Generate CLIP embeddings
- Build ChromaDB vector index
- Save to `chromadb_store/`

### Step 2: Start API Server

```bash
python api_server.py
```

The server will start on `http://localhost:8000`.

**Startup Notes:**
- **GPT-4 Mode**: The server starts quickly because no LLM weights are loaded; each request takes about 2-5 seconds (API call)
- **Local Model Mode**: The first request takes 10-60 seconds while the model loads into memory; subsequent requests are fast because the model stays cached

### Step 3: Open Web Interface

Navigate to: `http://localhost:8000`

#### Search Modes:
- **Text Only**: Search using natural language queries
- **Image Only**: Upload a product image to find similar items
- **Multimodal**: Combine text and image for refined search
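For the Multimodal mode, one common approach with CLIP (whose text and image embeddings share one vector space) is to L2-normalize each query embedding, take a weighted average, and normalize again. The README does not say how `rag.py` combines the two, so this is an assumed technique; `fuse_query`, the `alpha` weight, and plain lists standing in for CLIP outputs are all hypothetical.

```python
import math
from typing import List, Optional, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse_query(text_emb: Optional[Sequence[float]],
               image_emb: Optional[Sequence[float]],
               alpha: float = 0.5) -> List[float]:
    """Build one query vector for text-only, image-only, or multimodal search.

    alpha is the (hypothetical) weight of the text embedding when both are present.
    """
    if text_emb is None and image_emb is None:
        raise ValueError("need a text query, an image, or both")
    if image_emb is None:
        return l2_normalize(text_emb)    # Text Only mode
    if text_emb is None:
        return l2_normalize(image_emb)   # Image Only mode
    t, i = l2_normalize(text_emb), l2_normalize(image_emb)
    mixed = [alpha * a + (1 - alpha) * b for a, b in zip(t, i)]
    return l2_normalize(mixed)           # Multimodal mode
```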

#### Example Queries:
- "Wireless earbuds with noise cancellation under $150"
- "What is this product and how is it used?" (with image)
- "Compare the top two smartwatches you found"

## 🔧 Configuration

### LLM Backend Selection

The system supports two LLM backends that can be switched via environment variables:
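A minimal sketch of how the `USE_OPENAI` switch might be interpreted, assuming the `.env` values have already been loaded into the environment (for example by `python-dotenv`). `select_backend`, the accepted spellings of "true", and the fallback model names (taken from the examples in this README) are assumptions; checking the API key up front produces the same kind of "OpenAI API key is required" failure described under Troubleshooting.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumption: these spellings count as "true" for USE_OPENAI
TRUTHY = {"1", "true", "yes", "on"}


def select_backend(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (backend, model_name) from USE_OPENAI and the related variables."""
    env = os.environ if env is None else env
    use_openai = env.get("USE_OPENAI", "false").strip().lower() in TRUTHY
    if use_openai:
        if not env.get("OPENAI_API_KEY", "").strip():
            # Fail fast at startup rather than on the first request
            raise RuntimeError("OpenAI API key is required when USE_OPENAI=true")
        return "openai", env.get("OPENAI_MODEL", "gpt-4o")
    # Fallback names are the examples from this README, not necessarily config.py's defaults
    return "local", env.get("LLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.3")
```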

#### Option 1: OpenAI GPT-4 (Recommended)

**Advantages:**
- Superior response quality
- Faster response times (2-5 seconds)
- No GPU required
- Lower memory footprint

**Requirements:**
- OpenAI API key
- Internet connection
- Cost: ~$0.01-0.03 per query

**Configuration:**
```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
OPENAI_MAX_TOKENS=512
OPENAI_TEMPERATURE=0.2
```

#### Option 2: Local Open-Source Models

**Advantages:**
- Free (no API costs)
- Complete data privacy
- Works offline
- Customizable (fine-tuning possible)

**Requirements:**
- 16GB+ RAM (32GB+ for Mixtral)
- GPU recommended (CUDA-compatible)

**Supported Models:**
- `mistralai/Mistral-7B-Instruct-v0.3` (7B params, recommended)
- `meta-llama/Meta-Llama-3-8B-Instruct` (8B params)
- `mistralai/Mixtral-8x7B-Instruct-v0.1` (~47B total params; roughly 94GB of weights in 16-bit precision, so 32GB+ RAM assumes a quantized model)

**Configuration:**
```bash
# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
LLM_MAX_TOKENS=512
LLM_TEMPERATURE=0.2
```

### Other Configuration Options

```bash
# Data paths
CSV_PATH=amazon_multimodal_clean.csv
CHROMA_DIR=chromadb_store
IMAGE_DIR=images

# CLIP model
CLIP_MODEL=ViT-B/32  # Options: ViT-B/32, ViT-B/16, ViT-L/14

# API server
API_HOST=0.0.0.0
API_PORT=8000
ALLOWED_ORIGINS=*

# Retrieval settings
TOP_K_PRODUCTS=5
MAX_TEXT_LENGTH=400

# Logging
LOG_LEVEL=INFO  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
```

See [.env.example](.env.example) for the complete configuration template.
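A sketch of how a centralized settings loader could turn the string variables above into typed, validated values. The `Settings` dataclass, `load_settings`, and the comma-separated format for multiple `ALLOWED_ORIGINS` are assumptions; the defaults are the example values above, not necessarily those in `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

CLIP_MODELS = {"ViT-B/32", "ViT-B/16", "ViT-L/14"}               # options listed above
LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the example values above (assumed, not read from config.py)
    csv_path: str = "amazon_multimodal_clean.csv"
    chroma_dir: str = "chromadb_store"
    image_dir: str = "images"
    clip_model: str = "ViT-B/32"
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    allowed_origins: Tuple[str, ...] = ("*",)
    top_k_products: int = 5
    max_text_length: int = 400
    log_level: str = "INFO"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build typed settings from environment strings; raises ValueError on bad input."""
    env = os.environ if env is None else env
    d = Settings()
    clip_model = env.get("CLIP_MODEL", d.clip_model).strip()
    if clip_model not in CLIP_MODELS:
        raise ValueError(f"unsupported CLIP_MODEL: {clip_model!r}")
    log_level = env.get("LOG_LEVEL", d.log_level).strip().upper()
    if log_level not in LOG_LEVELS:
        raise ValueError(f"unknown LOG_LEVEL: {log_level!r}")
    # Assumption: several origins may be given as a comma-separated list
    origins = tuple(o.strip() for o in env.get("ALLOWED_ORIGINS", "*").split(",") if o.strip())
    return Settings(
        csv_path=env.get("CSV_PATH", d.csv_path),
        chroma_dir=env.get("CHROMA_DIR", d.chroma_dir),
        image_dir=env.get("IMAGE_DIR", d.image_dir),
        clip_model=clip_model,
        api_host=env.get("API_HOST", d.api_host),
        api_port=int(env.get("API_PORT", d.api_port)),       # int() raises ValueError on junk
        allowed_origins=origins or d.allowed_origins,
        top_k_products=int(env.get("TOP_K_PRODUCTS", d.top_k_products)),
        max_text_length=int(env.get("MAX_TEXT_LENGTH", d.max_text_length)),
        log_level=log_level,
    )
```

With this shape, the origins could be passed to FastAPI's `CORSMiddleware` as `allow_origins=list(settings.allowed_origins)`, which is how a production deployment would restrict CORS to specific domains.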

## Evaluation

Evaluate retrieval quality:

```bash
python rag.py --eval --csv amazon_multimodal_clean.csv
```

Metrics computed:
- Accuracy@1: fraction of queries whose top-ranked result has the expected category
- Recall@1, @5, @10: fraction of queries with at least one expected-category match in the top K results
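Both metrics can be sketched as a hit rate over the retrieved categories. Under the definitions above, Accuracy@1 and Recall@1 are the same number, which the sketch makes explicit. How `rag.py` chooses queries and their expected categories is not described here, so `evaluate` simply takes both as inputs; the function names are hypothetical.

```python
from typing import Dict, Sequence


def hit_rate_at_k(expected: Sequence[str], retrieved: Sequence[Sequence[str]], k: int) -> float:
    """Fraction of queries whose expected category appears among the first k retrieved categories."""
    if len(expected) != len(retrieved):
        raise ValueError("expected and retrieved must have the same length")
    if not expected:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for exp, cats in zip(expected, retrieved) if exp in list(cats)[:k])
    return hits / len(expected)


def evaluate(expected: Sequence[str], retrieved: Sequence[Sequence[str]],
             ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Compute Recall@K for each K, plus Accuracy@1."""
    metrics = {f"Recall@{k}": hit_rate_at_k(expected, retrieved, k) for k in ks}
    # Top-1 category match: identical to Recall@1 under these definitions
    metrics["Accuracy@1"] = hit_rate_at_k(expected, retrieved, 1)
    return metrics
```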

## Testing

### Test Retrieval Only

```bash
# Text query
python rag.py --text "wireless headphones" --db chromadb_store

# Image query
python rag.py --image path/to/product.jpg --db chromadb_store
```

### Test LLM Generation

```bash
python llm.py
```

## Project Structure

```
Multimodel/
├── rag.py                  # CLIP + ChromaDB retrieval system
├── llm.py                  # LLM interface with prompt engineering
├── api_server.py           # FastAPI REST API
├── config.py               # Configuration management
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── .gitignore              # Git ignore rules
├── frontend/
│   ├── index.html          # Web UI
│   ├── main.js             # Frontend JavaScript
│   └── amazon-logo.png     # Logo asset
├── chromadb_store/         # Vector database (generated)
├── images/                 # Downloaded product images (generated)
└── amazon_multimodal_clean.csv  # Your dataset
```

## Troubleshooting

### Issue: "OpenAI API key is required"

**Solution**: Make sure you have created a `.env` file and installed the `python-dotenv` package:
```bash
# Install python-dotenv if missing
pip install python-dotenv

# Create the .env file from the template
cp .env.example .env

# Then edit .env so that it contains:
#   USE_OPENAI=true
#   OPENAI_API_KEY=sk-proj-your-actual-api-key-here
```

### Issue: "TypeError: failed to extract enum MetadataValue"

**Solution**: This occurs during index building with ChromaDB. Update to the latest version:
```bash
pip install --upgrade chromadb
```

`rag.py` also converts `None` metadata values to empty strings before indexing, which avoids this error.
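A sketch of that kind of cleanup, assuming each product's metadata is a plain dict, possibly built from a pandas row where empty cells appear as `float('nan')`. ChromaDB metadata values must be scalars (`str`, `int`, `float`, `bool`); `sanitize_metadata`, its NaN handling, and its list-joining rule are hypothetical.

```python
import math
from typing import Any, Dict, Mapping, Union

MetadataValue = Union[str, int, float, bool]


def sanitize_metadata(meta: Mapping[str, Any]) -> Dict[str, MetadataValue]:
    """Coerce every value into a scalar type ChromaDB accepts as metadata."""
    clean: Dict[str, MetadataValue] = {}
    for key, value in meta.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            clean[key] = ""                                  # None / NaN -> empty string
        elif isinstance(value, (str, int, float, bool)):
            clean[key] = value                               # already a supported scalar
        elif isinstance(value, (list, tuple)):
            clean[key] = "|".join(str(v) for v in value)     # e.g. image URL lists -> pipe-joined
        else:
            clean[key] = str(value)                          # anything else (e.g. dates) -> text
    return clean
```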

### Issue: "CUDA out of memory" (Local Models)

**Solution**: Use CPU mode or reduce the batch size:
```bash
# Force CPU mode
export CUDA_VISIBLE_DEVICES=-1
python api_server.py
```

### Issue: "Model loading takes too long" (Local Models)

**Solution**: This is normal for the first request (10-60s); the model is then cached in memory for subsequent requests. Consider GPT-4 mode for faster response times.

### Issue: "Image download failures"

**Solution**: Some product URLs may be invalid or expired. This is normal and logged. The system will use text-only embeddings for those products.
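A minimal sketch of failure-tolerant downloading with the standard library, assuming the pipe-separated URLs from the `image` column are tried in order and the first successful download is kept. `download_first_image`, the 10-second timeout, and the `<uniq_id>.jpg` naming are hypothetical; a `None` result marks a product that would get a text-only embedding.

```python
import logging
import os
import urllib.error
import urllib.request
from typing import Optional, Sequence

log = logging.getLogger("image_download")


def download_first_image(uniq_id: str, urls: Sequence[str], image_dir: str = "images",
                         timeout: float = 10.0) -> Optional[str]:
    """Try each URL in turn; return the saved file path, or None if every URL fails."""
    os.makedirs(image_dir, exist_ok=True)
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                data = resp.read()
        except (urllib.error.URLError, OSError, ValueError) as exc:
            # Invalid or expired URL: log it and move on instead of aborting the build
            log.warning("image download failed for %s (%s): %s", uniq_id, url, exc)
            continue
        if not data:
            continue
        path = os.path.join(image_dir, f"{uniq_id}.jpg")
        with open(path, "wb") as fh:
            fh.write(data)
        return path
    log.warning("no usable image for %s; using a text-only embedding", uniq_id)
    return None
```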

### Issue: Port 8000 already in use

**Solution**: Change port via environment variable
```bash
export API_PORT=8080
python api_server.py
```

### Issue: Duplicate products after multiple index builds

**Solution**: The index is built with ChromaDB's `add()`, which doesn't prevent duplicates. To rebuild the index, delete the database directory first:
```bash
rm -rf chromadb_store
python rag.py --build --csv amazon_multimodal_clean.csv
```

## Security Notes

- **CORS**: Currently set to `allow_origins=["*"]` for development
  - For production, configure `ALLOWED_ORIGINS` to specific domains
- **Error Messages**: Generic errors are returned to clients; detailed logs are server-side only
- **File Uploads**: Images are validated and temporarily stored, then cleaned up
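A sketch of the validate, store, clean up sequence using only the standard library: the upload's leading "magic" bytes are checked against a few common image formats, the bytes are written to a temporary file, and the file is removed in a `finally` block even if processing fails. The accepted formats, the 10 MB limit, and the helper names are assumptions, not taken from `api_server.py`.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, Optional

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB limit

# Leading "magic" bytes of common image formats -> file suffix
_SIGNATURES = {
    b"\xff\xd8\xff": ".jpg",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"GIF87a": ".gif",
    b"GIF89a": ".gif",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a file suffix if the bytes look like a supported image, else None."""
    for sig, suffix in _SIGNATURES.items():
        if data.startswith(sig):
            return suffix
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    return None


@contextmanager
def temporary_upload(data: bytes) -> Iterator[str]:
    """Validate an upload, expose it as a temp file path, and always delete it afterwards."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("upload too large")
    suffix = sniff_image_type(data)
    if suffix is None:
        raise ValueError("unsupported or invalid image")
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)  # cleanup runs even if the request handler raised
```

A magic-byte check only confirms the header; fully decoding the image (for example with Pillow) is a stricter check.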

## Performance Optimization

### Implemented Optimizations:

1. **LLM Singleton Pattern**: The model is loaded once and reused across requests instead of being reloaded per request (5-20x speedup)
2. **CLIP Model Caching**: The CLIP model stays in memory after the first load
3. **ChromaDB HNSW Indexing**: Approximate nearest neighbor search with roughly O(log N) query time
4. **L2 Normalized Embeddings**: Cosine similarity computed via efficient dot products
5. **Graceful Error Handling**: Image download failures don't block indexing process
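The normalized-embedding optimization rests on two identities. Writing $\mathbf{q}$ for the query embedding and $\mathbf{p}$ for a product embedding, cosine similarity reduces to a dot product of unit vectors, and squared L2 distance between unit vectors is a decreasing function of that dot product, so nearest-neighbor rankings by either measure agree:

$$
\cos(\mathbf{q}, \mathbf{p}) = \frac{\mathbf{q} \cdot \mathbf{p}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{p} \rVert} = \hat{\mathbf{q}} \cdot \hat{\mathbf{p}}, \qquad \hat{\mathbf{x}} = \frac{\mathbf{x}}{\lVert \mathbf{x} \rVert}
$$

$$
\lVert \hat{\mathbf{q}} - \hat{\mathbf{p}} \rVert^2 = \lVert \hat{\mathbf{q}} \rVert^2 + \lVert \hat{\mathbf{p}} \rVert^2 - 2\, \hat{\mathbf{q}} \cdot \hat{\mathbf{p}} = 2 - 2 \cos(\mathbf{q}, \mathbf{p})
$$

In ChromaDB's `cosine` distance space the reported distance is $1 - \cos(\mathbf{q}, \mathbf{p})$, so smaller distances still mean more similar products.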

### Additional Optimizations for Production:

1. **Use GPU**: A CUDA-enabled GPU gives 10-50x faster CLIP inference and also speeds up local LLMs
2. **Use GPT-4**: Cloud-based LLM eliminates model loading overhead
3. **Batch Processing**: Build index in batches for large datasets
4. **CDN for Images**: Serve product images via CDN
5. **Load Balancer**: Use multiple API instances behind a load balancer
6. **Redis Caching**: Cache frequent queries and embeddings
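"Build index in batches" can be sketched as chunking the product stream so that only a fixed number of products is embedded and written at a time, which bounds memory use on large datasets. `batched`, `build_in_batches`, the batch size of 64, and the `add_batch` callback (which in a real build would embed the chunk with CLIP and write it to the collection) are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items (Python 3.12+ has itertools.batched, which yields tuples)."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_in_batches(products: Iterable[T], add_batch: Callable[[List[T]], None],
                     size: int = 64) -> int:
    """Feed products to `add_batch` chunk by chunk; return the number of products processed."""
    total = 0
    for chunk in batched(products, size):
        add_batch(chunk)  # e.g. embed this chunk, then write it to the vector store
        total += len(chunk)
    return total
```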

## Future Enhancements

- [ ] Add user authentication
- [ ] Implement product filtering (price, brand, etc.)
- [ ] Add bookmark/favorites functionality
- [ ] Support multilingual queries
- [ ] Integrate with real Amazon API
- [ ] Add A/B testing for different prompts
- [ ] Implement caching layer (Redis)
- [ ] Add monitoring and analytics

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/YourFeature`)
3. Commit changes (`git commit -m 'Add YourFeature'`)
4. Push to branch (`git push origin feature/YourFeature`)
5. Open a Pull Request

## License

This project is released under the MIT License (as declared in this Space's metadata) and is intended for educational and research purposes.

## Acknowledgments

- **OpenAI**: CLIP multimodal embeddings and GPT-4 API
- **ChromaDB**: Vector database with HNSW indexing
- **HuggingFace**: Transformers library and model hosting
- **FastAPI**: Modern web framework
- **Mistral AI / Meta**: Open-source LLMs
- **Tailwind CSS**: Frontend styling framework


---

## Additional Documentation

- **[Research Report](research_report.tex)**: Comprehensive technical report in LaTeX format covering implementation details, challenges, solutions, and future improvements
- **[Quick Start Guide for GPT-4](QUICKSTART_GPT4.md)**: Step-by-step guide for setting up with OpenAI GPT-4

---

**Built with ❤️ using CLIP, ChromaDB, GPT-4, and Open-Source LLMs**