---
title: Amazon Multimodal RAG Assistant
emoji: πŸ›’
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 7860
---
# Amazon Multimodal RAG Assistant
An AI-powered e-commerce search assistant that combines multimodal embeddings (CLIP), vector search (ChromaDB), and large language models to provide intelligent product recommendations and natural language responses.
![Project Status](https://img.shields.io/badge/status-active-success.svg)
![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)
## Features
- **Multimodal Search**: Search products using text, images, or both simultaneously
- **Intelligent Retrieval**: CLIP-based embeddings for semantic product matching
- **Dual LLM Support**: Choose between OpenAI GPT-4 or local open-source models
- **Natural Language Responses**: Context-aware answers powered by advanced LLMs
- **Modern Web Interface**: Clean, responsive UI with real-time search
- **Vector Database**: Persistent ChromaDB storage for fast retrieval
- **Prompt Engineering**: Supports zero-shot, few-shot, and multi-shot prompting
- **Chat History**: Multi-turn conversations with context awareness
- **Flexible Configuration**: Environment-based setup for easy customization
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Frontend   β”‚ (HTML/JS/TailwindCSS)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚ HTTP/JSON
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   FastAPI   β”‚ (REST API Server)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β–Ό                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     LLM     β”‚        β”‚     RAG     β”‚
β”‚  (GPT-4 or  β”‚        β”‚   (CLIP +   β”‚
β”‚  Local HF)  β”‚        β”‚  ChromaDB)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Components
1. **rag.py**: Retrieval system with CLIP embeddings and ChromaDB
2. **llm.py**: LLM interface with prompt engineering
3. **api_server.py**: FastAPI backend with singleton LLM pattern
4. **frontend/**: Modern web UI with drag-and-drop support
5. **config.py**: Centralized configuration management
## Requirements
- Python 3.8+
- CUDA-compatible GPU (optional, but recommended for faster inference)
- 8GB+ RAM (16GB+ recommended)
- 10GB+ disk space for models and data
## Installation
### 1. Clone the Repository
```bash
git clone <repository-url>   # replace with the actual repository URL
cd Multimodel
```
### 2. Create Virtual Environment
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
### 3. Install Dependencies
```bash
pip install -r requirements.txt
```
**Note**: CLIP installation requires git. If you encounter issues:
```bash
pip install git+https://github.com/openai/CLIP.git
```
### 4. Configure Environment
Create a `.env` file in the project root (copy from `.env.example`):
```bash
cp .env.example .env
```
**For OpenAI GPT-4 (Recommended):**
```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
```
**For Local Models (Free, but requires more compute):**
```bash
# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```
See [.env.example](.env.example) for all configuration options.
### 5. Prepare Data
Place your Amazon product CSV file in the project root:
```
amazon_multimodal_clean.csv
```
Expected CSV columns (a quick sanity check follows the list):
- `uniq_id`: Unique product identifier
- `product_name`: Product name
- `product_text`: Product description
- `main_category`: Product category
- `image`: Image URLs (pipe-separated)
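Before building the index, you can verify the file has the right shape. This is a minimal sketch, assuming `pandas` is installed; the path matches the default `CSV_PATH`:
```python
import pandas as pd

REQUIRED = ["uniq_id", "product_name", "product_text", "main_category", "image"]

df = pd.read_csv("amazon_multimodal_clean.csv")
missing = [col for col in REQUIRED if col not in df.columns]
if missing:
    raise ValueError(f"CSV is missing required columns: {missing}")
print(f"OK: {len(df)} products loaded")
```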
## Usage
### Step 1: Build Vector Index
```bash
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
```
Options:
- `--csv`: Path to your CSV file
- `--max`: Maximum number of products to index (optional; all products are indexed if omitted)
- `--db`: Database directory (default: `chromadb_store`)
This will (a sketch of the core loop follows the list):
- Download product images
- Generate CLIP embeddings
- Build ChromaDB vector index
- Save to `chromadb_store/`
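For reference, here is a minimal sketch of what that loop boils down to. It is illustrative, not the actual `rag.py` code; it assumes the OpenAI `clip` and `chromadb` packages, and the collection name is an assumption:
```python
import clip      # pip install git+https://github.com/openai/CLIP.git
import torch
import chromadb

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)

client = chromadb.PersistentClient(path="chromadb_store")
collection = client.get_or_create_collection("products")  # name is an assumption

def embed_text(text):
    """Encode text with CLIP and L2-normalize, so cosine similarity
    reduces to a dot product at query time."""
    tokens = clip.tokenize([text], truncate=True).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb[0].cpu().tolist()

# One product per add() call here for clarity; rag.py processes the full CSV.
collection.add(
    ids=["B000EXAMPLE"],
    embeddings=[embed_text("Wireless earbuds with noise cancellation")],
    metadatas=[{"product_name": "Example Earbuds", "main_category": "Electronics"}],
)
```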
### Step 2: Start API Server
```bash
python api_server.py
```
The server starts on `http://localhost:8000`.
**Startup Notes:**
- **GPT-4 Mode**: The server starts instantly; the first request takes 2-5 seconds (one API call)
- **Local Model Mode**: The first request takes 10-60 seconds while the model loads into memory; subsequent requests are fast (the model stays cached)
### Step 3: Open Web Interface
Navigate to: `http://localhost:8000`
#### Search Modes:
- **Text Only**: Search using natural language queries
- **Image Only**: Upload a product image to find similar items
- **Multimodal**: Combine text and image for refined search
#### Example Queries:
- "Wireless earbuds with noise cancellation under $150"
- "What is this product and how is it used?" (with image)
- "Compare the top two smartwatches you found"
## Configuration
### LLM Backend Selection
The system supports two LLM backends that can be switched via environment variables:
#### Option 1: OpenAI GPT-4 (Recommended)
**Advantages:**
- Superior response quality
- Faster response times (2-5 seconds)
- No GPU required
- Lower memory footprint
**Requirements:**
- OpenAI API key
- Internet connection
- Cost: ~$0.01-0.03 per query
**Configuration:**
```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
OPENAI_MAX_TOKENS=512
OPENAI_TEMPERATURE=0.2
```
#### Option 2: Local Open-Source Models
**Advantages:**
- Free (no API costs)
- Complete data privacy
- Works offline
- Customizable (fine-tuning possible)
**Requirements:**
- 16GB+ RAM (32GB+ for Mixtral)
- GPU recommended (CUDA-compatible)
**Supported Models:**
- `mistralai/Mistral-7B-Instruct-v0.3` (7B params, recommended)
- `meta-llama/Meta-Llama-3-8B-Instruct` (8B params)
- `mistralai/Mixtral-8x7B-Instruct-v0.1` (47B params, requires 32GB+ RAM)
**Configuration:**
```bash
# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
LLM_MAX_TOKENS=512
LLM_TEMPERATURE=0.2
```
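Under the hood, switching backends is just an environment flag. A minimal sketch of how `config.py` might read it (variable names match `.env.example`; the parsing details are an assumption):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the project root

USE_OPENAI = os.getenv("USE_OPENAI", "true").lower() == "true"
if USE_OPENAI:
    MODEL_NAME = os.getenv("OPENAI_MODEL", "gpt-4o")
else:
    MODEL_NAME = os.getenv("LLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.3")
```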
### Other Configuration Options
```bash
# Data paths
CSV_PATH=amazon_multimodal_clean.csv
CHROMA_DIR=chromadb_store
IMAGE_DIR=images
# CLIP model
CLIP_MODEL=ViT-B/32 # Options: ViT-B/32, ViT-B/16, ViT-L/14
# API server
API_HOST=0.0.0.0
API_PORT=8000
ALLOWED_ORIGINS=*
# Retrieval settings
TOP_K_PRODUCTS=5
MAX_TEXT_LENGTH=400
# Logging
LOG_LEVEL=INFO # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
```
See [.env.example](.env.example) for the complete configuration template.
## Evaluation
Evaluate retrieval quality:
```bash
python rag.py --eval --csv amazon_multimodal_clean.csv
```
Metrics computed (sketched after this list):
- Accuracy@1: Top result category match
- Recall@1, @5, @10: Category match in top K results
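Both are category-match rates over the test queries. A minimal sketch of Recall@K as described above (illustrative; see `rag.py --eval` for the actual computation):
```python
def recall_at_k(true_category, retrieved_categories, k):
    """1.0 if the true category appears among the top-k results' categories."""
    return float(true_category in retrieved_categories[:k])

# Accuracy@1 is the k=1 case; averaging over all queries yields the metric.
```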
## Testing
### Test Retrieval Only
```bash
# Text query
python rag.py --text "wireless headphones" --db chromadb_store
# Image query
python rag.py --image path/to/product.jpg --db chromadb_store
```
### Test LLM Generation
```bash
python llm.py
```
## Project Structure
```
Multimodel/
β”œβ”€β”€ rag.py                       # CLIP + ChromaDB retrieval system
β”œβ”€β”€ llm.py                       # LLM interface with prompt engineering
β”œβ”€β”€ api_server.py                # FastAPI REST API
β”œβ”€β”€ config.py                    # Configuration management
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ .gitignore                   # Git ignore rules
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ index.html               # Web UI
β”‚   β”œβ”€β”€ main.js                  # Frontend JavaScript
β”‚   └── amazon-logo.png          # Logo asset
β”œβ”€β”€ chromadb_store/              # Vector database (generated)
β”œβ”€β”€ images/                      # Downloaded product images (generated)
└── amazon_multimodal_clean.csv  # Your dataset
```
## Troubleshooting
### Issue: "OpenAI API key is required"
**Solution**: Ensure you've created a `.env` file and that the `python-dotenv` package is installed:
```bash
# Install dotenv if missing
pip install python-dotenv
# Create .env file
cp .env.example .env
# Edit .env and add your API key
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-actual-api-key-here
```
### Issue: "TypeError: failed to extract enum MetadataValue"
**Solution**: This occurs during index building with ChromaDB. Update to the latest version:
```bash
pip install --upgrade chromadb
```
The code now handles None values properly by converting them to empty strings.
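For reference, that None-handling amounts to sanitizing metadata before insertion (a minimal sketch; the helper name is illustrative):
```python
def sanitize_metadata(meta):
    """ChromaDB metadata values must be str, int, float, or bool;
    replace None with an empty string before collection.add()."""
    return {key: ("" if value is None else value) for key, value in meta.items()}
```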
### Issue: "CUDA out of memory" (Local Models)
**Solution**: Run in CPU mode or reduce the batch size:
```bash
# Force CPU mode
export CUDA_VISIBLE_DEVICES=-1
python api_server.py
```
### Issue: "Model loading takes too long" (Local Models)
**Solution**: This is normal for the first request (10-60s). The model is cached in memory for subsequent requests. Consider using GPT-4 for faster response times.
### Issue: "Image download failures"
**Solution**: Some product URLs may be invalid or expired. This is normal and logged. The system will use text-only embeddings for those products.
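In practice, the fallback looks like the sketch below (function names are hypothetical; see `rag.py` for the real logic):
```python
import logging

def embed_product(text, image_url):
    """Prefer a combined text+image embedding; fall back to text-only
    when the image cannot be downloaded (invalid or expired URL)."""
    try:
        image = download_image(image_url)  # hypothetical helper
        return combine_embeddings(embed_text(text), embed_image(image))  # hypothetical helpers
    except Exception as err:
        logging.warning("Image fetch failed (%s); using text-only embedding", err)
        return embed_text(text)
```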
### Issue: Port 8000 already in use
**Solution**: Change the port via an environment variable:
```bash
export API_PORT=8080
python api_server.py
```
### Issue: Duplicate products after multiple index builds
**Solution**: The index is built with ChromaDB's `add()`, which does not deduplicate entries. To rebuild the index, delete the database directory first:
```bash
rm -rf chromadb_store
python rag.py --build --csv amazon_multimodal_clean.csv
```
## Security Notes
- **CORS**: Currently set to `allow_origins=["*"]` for development
- For production, configure `ALLOWED_ORIGINS` to specific domains
- **Error Messages**: Generic errors are returned to clients; detailed logs are server-side only
- **File Uploads**: Images are validated and temporarily stored, then cleaned up
## Performance Optimization
### Implemented Optimizations:
1. **LLM Singleton Pattern**: The model loads once at server startup and is reused across requests (5-20x speedup); see the sketch after this list
2. **CLIP Embedding Caching**: CLIP model stays in memory after first load
3. **ChromaDB HNSW Indexing**: Approximate nearest neighbor search with O(log N) complexity
4. **L2 Normalized Embeddings**: Cosine similarity computed via efficient dot products
5. **Graceful Error Handling**: Image download failures don't block indexing process
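A minimal sketch of the singleton idea (illustrative; `load_model` is a hypothetical stand-in for the loader in `api_server.py`):
```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_llm():
    """First call loads the model; later calls return the cached instance."""
    return load_model()  # hypothetical loader; see api_server.py

# Every request handler calls get_llm() and shares one loaded model.
```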
### Additional Optimizations for Production:
1. **Use GPU**: CUDA-enabled GPU for 10-50x faster CLIP inference (local models)
2. **Use GPT-4**: Cloud-based LLM eliminates model loading overhead
3. **Batch Processing**: Build index in batches for large datasets
4. **CDN for Images**: Serve product images via CDN
5. **Load Balancer**: Use multiple API instances behind a load balancer
6. **Redis Caching**: Cache frequent queries and embeddings
## Future Enhancements
- [ ] Add user authentication
- [ ] Implement product filtering (price, brand, etc.)
- [ ] Add bookmark/favorites functionality
- [ ] Support multilingual queries
- [ ] Integrate with real Amazon API
- [ ] Add A/B testing for different prompts
- [ ] Implement caching layer (Redis)
- [ ] Add monitoring and analytics
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/YourFeature`)
3. Commit changes (`git commit -m 'Add YourFeature'`)
4. Push to branch (`git push origin feature/YourFeature`)
5. Open a Pull Request
## License
Licensed under the MIT License (see the Space metadata above). This project is intended for educational and research purposes.
## Acknowledgments
- **OpenAI**: CLIP multimodal embeddings and GPT-4 API
- **ChromaDB**: Vector database with HNSW indexing
- **HuggingFace**: Transformers library and model hosting
- **FastAPI**: Modern web framework
- **Mistral AI / Meta**: Open-source LLMs
- **Tailwind CSS**: Frontend styling framework
---
## Additional Documentation
- **[Research Report](research_report.tex)**: Comprehensive technical report in LaTeX format covering implementation details, challenges, solutions, and future improvements
- **[Quick Start Guide for GPT-4](QUICKSTART_GPT4.md)**: Step-by-step guide for setting up with OpenAI GPT-4
---
**Built with ❀️ using CLIP, ChromaDB, GPT-4, and Open-Source LLMs**