---
title: Amazon Multimodal RAG Assistant
emoji: 🛒
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 7860
---

Amazon Multimodal RAG Assistant

An AI-powered e-commerce search assistant that combines multimodal embeddings (CLIP), vector search (ChromaDB), and large language models to provide intelligent product recommendations and natural language responses.

Features

  • Multimodal Search: Search products using text, images, or both simultaneously
  • Intelligent Retrieval: CLIP-based embeddings for semantic product matching
  • Dual LLM Support: Choose between OpenAI GPT-4 and local open-source models
  • Natural Language Responses: Context-aware answers powered by advanced LLMs
  • Modern Web Interface: Clean, responsive UI with real-time search
  • Vector Database: Persistent ChromaDB storage for fast retrieval
  • Prompt Engineering: Supports zero-shot, few-shot, and multi-shot prompting
  • Chat History: Multi-turn conversations with context awareness
  • Flexible Configuration: Environment-based setup for easy customization

Architecture

┌─────────────┐
│   Frontend  │  (HTML/JS/TailwindCSS)
└──────┬──────┘
       │ HTTP/JSON
       ▼
┌─────────────┐
│  FastAPI    │  (REST API Server)
└──────┬──────┘
       │
       ├─────────────────┐
       ▼                 ▼
┌─────────────┐   ┌─────────────┐
│   LLM       │   │    RAG      │
│ (GPT-4 or   │   │   (CLIP +   │
│  Local HF)  │   │  ChromaDB)  │
└─────────────┘   └─────────────┘

Components

  1. rag.py: Retrieval system with CLIP embeddings and ChromaDB
  2. llm.py: LLM interface with prompt engineering
  3. api_server.py: FastAPI backend with singleton LLM pattern
  4. frontend/: Modern web UI with drag-and-drop support
  5. config.py: Centralized configuration management
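
The end-to-end request flow ties these pieces together: the frontend posts a query, api_server.py retrieves candidate products via rag.py, and llm.py turns the retrieved context into a natural language answer. The sketch below illustrates that wiring; the endpoint path and helper functions are illustrative stand-ins, not the actual names used in the project.

# flow sketch -- illustrative only; the real wiring lives in api_server.py, rag.py, and llm.py
from typing import Dict, List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

def retrieve_products(query: str, top_k: int) -> List[Dict]:
    # Stand-in for rag.py: embed the query with CLIP and search ChromaDB.
    return [{"product_name": "example product", "score": 0.9}][:top_k]

def generate_answer(query: str, products: List[Dict]) -> str:
    # Stand-in for llm.py: build a prompt from the retrieved context and call the LLM.
    return f"Answer to '{query}' based on {len(products)} retrieved products."

@app.post("/search")
def search(req: SearchRequest):
    products = retrieve_products(req.query, req.top_k)
    return {"answer": generate_answer(req.query, products), "products": products}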

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (optional, but recommended for faster inference)
  • 8GB+ RAM (16GB+ recommended)
  • 10GB+ disk space for models and data

Installation

1. Clone the Repository

cd Multimodel

2. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

Note: CLIP installation requires git. If you encounter issues:

pip install git+https://github.com/openai/CLIP.git

4. Configure Environment

Create a .env file in the project root (copy from .env.example):

cp .env.example .env

For OpenAI GPT-4 (Recommended):

# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o

For Local Models (Free, but requires more compute):

# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3

See .env.example for all configuration options.
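
For reference, these variables are typically consumed with python-dotenv roughly as follows; this is a minimal sketch of the pattern, and the actual logic in config.py may differ.

# config sketch -- assumes python-dotenv; config.py may organize this differently
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")
LLM_MODEL = os.getenv("LLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.3")

if USE_OPENAI and not OPENAI_API_KEY:
    raise RuntimeError("USE_OPENAI=true but OPENAI_API_KEY is not set")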

5. Prepare Data

Place your Amazon product CSV file in the project root:

amazon_multimodal_clean.csv

Expected CSV columns:

  • uniq_id: Unique product identifier
  • product_name: Product name
  • product_text: Product description
  • main_category: Product category
  • image: Image URLs (pipe-separated)
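
Before building the index, you can sanity-check that the file matches this schema. The snippet below is a standalone sketch using pandas and is not part of the project code.

# csv sanity check -- standalone sketch, assumes pandas is installed
import pandas as pd

REQUIRED_COLUMNS = {"uniq_id", "product_name", "product_text", "main_category", "image"}

df = pd.read_csv("amazon_multimodal_clean.csv")
missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise SystemExit(f"CSV is missing required columns: {sorted(missing)}")

# Image URLs are pipe-separated; show the first couple for the first product.
print(f"{len(df)} products loaded; sample image URLs:",
      str(df["image"].iloc[0]).split("|")[:2])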

Usage

Step 1: Build Vector Index

python rag.py --build --csv amazon_multimodal_clean.csv --max 1000

Options:

  • --csv: Path to your CSV file
  • --max: Maximum number of products to index (optional; all products are indexed if omitted)
  • --db: Database directory (default: chromadb_store)

This will:

  • Download product images
  • Generate CLIP embeddings
  • Build ChromaDB vector index
  • Save to chromadb_store/
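
Conceptually, the build step boils down to the following: encode each product with CLIP, L2-normalize the embedding, and store it in a persistent ChromaDB collection. This is a simplified sketch of the idea behind rag.py (the collection name is illustrative), not its actual code; the real script also embeds images, batches work, and handles download failures.

# indexing sketch -- simplified; rag.py adds image embeddings, batching, and error handling
import clip
import torch
import chromadb

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

client = chromadb.PersistentClient(path="chromadb_store")
collection = client.get_or_create_collection("products", metadata={"hnsw:space": "cosine"})

ids = ["example-uniq-id-1"]                                    # uniq_id values
texts = ["Wireless earbuds with active noise cancellation"]    # product_text values

with torch.no_grad():
    tokens = clip.tokenize(texts, truncate=True).to(device)
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)                 # L2-normalize for cosine search

collection.add(
    ids=ids,
    embeddings=emb.cpu().tolist(),
    metadatas=[{"product_name": "Example Earbuds", "main_category": "electronics"}],
)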

Step 2: Start API Server

python api_server.py

The server will start on http://localhost:8000

Startup Notes:

  • GPT-4 Mode: The server starts instantly; the first request takes 2-5 seconds (API call)
  • Local Model Mode: The first request takes 10-60 seconds while the model loads into memory; subsequent requests are fast (model cached)

Step 3: Open Web Interface

Navigate to: http://localhost:8000

Search Modes:

  • Text Only: Search using natural language queries
  • Image Only: Upload a product image to find similar items
  • Multimodal: Combine text and image for refined search

Example Queries:

  • "Wireless earbuds with noise cancellation under $150"
  • "What is this product and how is it used?" (with image)
  • "Compare the top two smartwatches you found"

🔧 Configuration

LLM Backend Selection

The system supports two LLM backends that can be switched via environment variables:

Option 1: OpenAI GPT-4 (Recommended)

Advantages:

  • Superior response quality
  • Faster response times (2-5 seconds)
  • No GPU required
  • Lower memory footprint

Requirements:

  • OpenAI API key
  • Internet connection
  • Cost: ~$0.01-0.03 per query

Configuration:

# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
OPENAI_MAX_TOKENS=512
OPENAI_TEMPERATURE=0.2
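
Under the hood, a GPT-4 backed answer is a single chat-completion call; llm.py layers prompt engineering and retrieved context on top. A minimal sketch with the official openai Python client:

# openai sketch -- bare call; llm.py adds retrieval context and prompt templates
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=512,
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a helpful e-commerce shopping assistant."},
        {"role": "user", "content": "Recommend wireless earbuds with noise cancellation under $150."},
    ],
)
print(response.choices[0].message.content)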

Option 2: Local Open-Source Models

Advantages:

  • Free (no API costs)
  • Complete data privacy
  • Works offline
  • Customizable (fine-tuning possible)

Requirements:

  • 16GB+ RAM (32GB+ for Mixtral)
  • GPU recommended (CUDA-compatible)

Supported Models:

  • mistralai/Mistral-7B-Instruct-v0.3 (7B params, recommended)
  • meta-llama/Meta-Llama-3-8B-Instruct (8B params)
  • mistralai/Mixtral-8x7B-Instruct-v0.1 (47B params, requires 32GB+ RAM)

Configuration:

# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
LLM_MAX_TOKENS=512
LLM_TEMPERATURE=0.2
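
Locally, the equivalent generation path uses Hugging Face transformers; the sketch below shows the bare idea, while llm.py adds prompt templates and keeps the model cached between requests.

# local model sketch -- needs a GPU (or patience on CPU) plus the transformers and accelerate packages
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",            # places weights on the GPU when one is available
)

messages = [{"role": "user", "content": "Recommend wireless earbuds under $150."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.2, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))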

Other Configuration Options

# Data paths
CSV_PATH=amazon_multimodal_clean.csv
CHROMA_DIR=chromadb_store
IMAGE_DIR=images

# CLIP model
CLIP_MODEL=ViT-B/32  # Options: ViT-B/32, ViT-B/16, ViT-L/14

# API server
API_HOST=0.0.0.0
API_PORT=8000
ALLOWED_ORIGINS=*

# Retrieval settings
TOP_K_PRODUCTS=5
MAX_TEXT_LENGTH=400

# Logging
LOG_LEVEL=INFO  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL

See .env.example for the complete configuration template.

Evaluation

Evaluate retrieval quality:

python rag.py --eval --csv amazon_multimodal_clean.csv

Metrics computed:

  • Accuracy@1: Whether the top result matches the query product's category
  • Recall@1, @5, @10: Whether a same-category product appears in the top 1, 5, or 10 results (a small computation sketch follows)
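
Both metrics reduce to a simple check over the ranked results: does any of the top K retrieved products share the query product's category? A minimal sketch of the computation (the real evaluation loop lives behind rag.py --eval):

# recall@k sketch -- illustrates the metric only
from typing import List

def recall_at_k(query_category: str, retrieved_categories: List[str], k: int) -> float:
    """1.0 if any of the top-k retrieved products shares the query's category, else 0.0."""
    return 1.0 if query_category in retrieved_categories[:k] else 0.0

retrieved = ["electronics", "electronics", "home", "electronics", "toys"]
for k in (1, 5, 10):
    print(f"Recall@{k}: {recall_at_k('home', retrieved, k)}")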

Testing

Test Retrieval Only

# Text query
python rag.py --text "wireless headphones" --db chromadb_store

# Image query
python rag.py --image path/to/product.jpg --db chromadb_store
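
The query path mirrors the indexing step: embed the query with CLIP, normalize it, and ask ChromaDB for the nearest neighbors. A simplified sketch (the collection name is illustrative and must match the one built by rag.py):

# query sketch -- text query against an existing chromadb_store index
import clip
import torch
import chromadb

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    emb = model.encode_text(clip.tokenize(["wireless headphones"]).to(device))
    emb = emb / emb.norm(dim=-1, keepdim=True)

collection = chromadb.PersistentClient(path="chromadb_store").get_or_create_collection("products")
results = collection.query(query_embeddings=emb.cpu().tolist(), n_results=5)
print(results["metadatas"])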

Test LLM Generation

python llm.py

Project Structure

Multimodel/
├── rag.py                  # CLIP + ChromaDB retrieval system
├── llm.py                  # LLM interface with prompt engineering
├── api_server.py           # FastAPI REST API
├── config.py               # Configuration management
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── .gitignore              # Git ignore rules
├── frontend/
│   ├── index.html          # Web UI
│   ├── main.js             # Frontend JavaScript
│   └── amazon-logo.png     # Logo asset
├── chromadb_store/         # Vector database (generated)
├── images/                 # Downloaded product images (generated)
└── amazon_multimodal_clean.csv  # Your dataset

Troubleshooting

Issue: "OpenAI API key is required"

Solution: Ensure you've created a .env file and installed the python-dotenv dependency:

# Install dotenv if missing
pip install python-dotenv

# Create .env file
cp .env.example .env

# Edit .env and add your API key
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-actual-api-key-here

Issue: "TypeError: failed to extract enum MetadataValue"

Solution: This occurs during index building with ChromaDB. Update to the latest version:

pip install --upgrade chromadb

The code now handles None values properly by converting them to empty strings.

Issue: "CUDA out of memory" (Local Models)

Solution: Use CPU mode or reduce the batch size:

# Force CPU mode
export CUDA_VISIBLE_DEVICES=-1
python api_server.py

Issue: "Model loading takes too long" (Local Models)

Solution: This is normal for first request (10-60s). The model is cached in memory for subsequent requests. Consider using GPT-4 for faster response times.

Issue: "Image download failures"

Solution: Some product URLs may be invalid or expired. This is normal and logged. The system will use text-only embeddings for those products.

Issue: Port 8000 already in use

Solution: Change the port via an environment variable:

export API_PORT=8080
python api_server.py

Issue: Duplicate products after multiple index builds

Solution: The indexer calls ChromaDB's add(), which does not deduplicate across repeated builds. To rebuild the index cleanly, delete the database directory first (an upsert() alternative is sketched after the commands):

rm -rf chromadb_store
python rag.py --build --csv amazon_multimodal_clean.csv
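
Alternatively, if you modify the indexing code, ChromaDB collections support upsert(), which writes by ID and overwrites existing entries instead of appending new ones. A hedged sketch (the project currently uses add(); the ID, collection name, and placeholder embedding below are illustrative):

# upsert sketch -- entries sharing the same uniq_id are overwritten, not duplicated
import chromadb

collection = chromadb.PersistentClient(path="chromadb_store").get_or_create_collection("products")
collection.upsert(
    ids=["example-uniq-id-1"],
    embeddings=[[0.0] * 512],                      # placeholder; ViT-B/32 embeddings are 512-dimensional
    metadatas=[{"product_name": "Example Earbuds"}],
)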

Security Notes

  • CORS: Currently set to allow_origins=["*"] for development
    • For production, configure ALLOWED_ORIGINS to specific domains (a configuration sketch follows this list)
  • Error Messages: Generic errors are returned to clients; detailed logs are server-side only
  • File Uploads: Images are validated and temporarily stored, then cleaned up
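
Restricting CORS for production amounts to feeding ALLOWED_ORIGINS into FastAPI's CORSMiddleware. The sketch below shows the pattern; the actual middleware setup in api_server.py may differ.

# cors sketch -- restrict allowed origins via the ALLOWED_ORIGINS environment variable
import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
origins = os.getenv("ALLOWED_ORIGINS", "*").split(",")

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,             # e.g. ALLOWED_ORIGINS=https://shop.example.com
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)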

Performance Optimization

Implemented Optimizations:

  1. LLM Singleton Pattern: Model loads once at server startup and is reused across requests (5-20x speedup; see the sketch after this list)
  2. CLIP Embedding Caching: CLIP model stays in memory after first load
  3. ChromaDB HNSW Indexing: Approximate nearest neighbor search with O(log N) complexity
  4. L2 Normalized Embeddings: Cosine similarity computed via efficient dot products
  5. Graceful Error Handling: Image download failures don't block indexing process
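
The singleton pattern from item 1 amounts to constructing the expensive LLM wrapper once and reusing it for every request. A minimal sketch of the idea (the class here is a stand-in, not the actual wrapper in llm.py):

# singleton sketch -- the costly constructor runs once; later calls reuse the cached instance
from functools import lru_cache

class LLMClient:                          # stand-in for the real wrapper in llm.py
    def __init__(self):
        print("loading model weights ...")   # slow step, executed a single time
    def generate(self, prompt: str) -> str:
        return f"response to: {prompt}"

@lru_cache(maxsize=1)
def get_llm() -> LLMClient:
    return LLMClient()

get_llm().generate("first request")       # triggers the load
get_llm().generate("second request")      # reuses the cached instance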

Additional Optimizations for Production:

  1. Use GPU: CUDA-enabled GPU for 10-50x faster CLIP inference (local models)
  2. Use GPT-4: Cloud-based LLM eliminates model loading overhead
  3. Batch Processing: Build index in batches for large datasets
  4. CDN for Images: Serve product images via CDN
  5. Load Balancer: Use multiple API instances behind a load balancer
  6. Redis Caching: Cache frequent queries and embeddings

Future Enhancements

  • Add user authentication
  • Implement product filtering (price, brand, etc.)
  • Add bookmark/favorites functionality
  • Support multilingual queries
  • Integrate with real Amazon API
  • Add A/B testing for different prompts
  • Implement caching layer (Redis)
  • Add monitoring and analytics

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/YourFeature)
  3. Commit changes (git commit -m 'Add YourFeature')
  4. Push to branch (git push origin feature/YourFeature)
  5. Open a Pull Request

License

This project is for educational and research purposes.

Acknowledgments

  • OpenAI: CLIP multimodal embeddings and GPT-4 API
  • ChromaDB: Vector database with HNSW indexing
  • HuggingFace: Transformers library and model hosting
  • FastAPI: Modern web framework
  • Mistral AI / Meta: Open-source LLM models
  • Tailwind CSS: Frontend styling framework

Additional Documentation

  • Research Report: Comprehensive technical report in LaTeX format covering implementation details, challenges, solutions, and future improvements
  • Quick Start Guide for GPT-4: Step-by-step guide for setting up with OpenAI GPT-4

Built with ❤️ using CLIP, ChromaDB, GPT-4, and Open-Source LLMs