---
title: Amazon Multimodal RAG Assistant
emoji: 🛒
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 7860
---

Amazon Multimodal RAG Assistant

An AI-powered e-commerce search assistant that combines multimodal embeddings (CLIP), vector search (ChromaDB), and large language models to provide intelligent product recommendations and natural language responses.

Features

  • Multimodal Search: Search products using text, images, or both simultaneously
  • Intelligent Retrieval: CLIP-based embeddings for semantic product matching
  • Dual LLM Support: Choose between OpenAI GPT-4 and local open-source models
  • Natural Language Responses: Context-aware answers powered by advanced LLMs
  • Modern Web Interface: Clean, responsive UI with real-time search
  • Vector Database: Persistent ChromaDB storage for fast retrieval
  • Prompt Engineering: Supports zero-shot, few-shot, and multi-shot prompting
  • Chat History: Multi-turn conversations with context awareness
  • Flexible Configuration: Environment-based setup for easy customization

Architecture

┌─────────────┐
│   Frontend  │  (HTML/JS/TailwindCSS)
└──────┬──────┘
       │ HTTP/JSON
       ▼
┌─────────────┐
│  FastAPI    │  (REST API Server)
└──────┬──────┘
       │
       ├─────────────────┐
       ▼                 ▼
┌─────────────┐   ┌─────────────┐
│   LLM       │   │    RAG      │
│ (GPT-4 or   │   │   (CLIP +   │
│  Local HF)  │   │  ChromaDB)  │
└─────────────┘   └─────────────┘

Components

  1. rag.py: Retrieval system with CLIP embeddings and ChromaDB
  2. llm.py: LLM interface with prompt engineering
  3. api_server.py: FastAPI backend with singleton LLM pattern
  4. frontend/: Modern web UI with drag-and-drop support
  5. config.py: Centralized configuration management
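
The end-to-end request flow ties these pieces together: the frontend posts a query, api_server.py retrieves candidate products via rag.py, and llm.py turns the retrieved context into a natural language answer. The sketch below illustrates that wiring; the endpoint path and helper functions are illustrative stand-ins, not the actual names used in the project.

# flow sketch -- illustrative only; the real wiring lives in api_server.py, rag.py, and llm.py
from typing import Dict, List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

def retrieve_products(query: str, top_k: int) -> List[Dict]:
    # Stand-in for rag.py: embed the query with CLIP and search ChromaDB.
    return [{"product_name": "example product", "score": 0.9}][:top_k]

def generate_answer(query: str, products: List[Dict]) -> str:
    # Stand-in for llm.py: build a prompt from the retrieved context and call the LLM.
    return f"Answer to '{query}' based on {len(products)} retrieved products."

@app.post("/search")
def search(req: SearchRequest):
    products = retrieve_products(req.query, req.top_k)
    return {"answer": generate_answer(req.query, products), "products": products}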

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (optional, but recommended for faster inference)
  • 8GB+ RAM (16GB+ recommended)
  • 10GB+ disk space for models and data

Installation

1. Clone the Repository

cd Multimodel

2. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

Note: CLIP installation requires git. If you encounter issues:

pip install git+https://github.com/openai/CLIP.git

4. Configure Environment

Create a .env file in the project root (copy from .env.example):

cp .env.example .env

For OpenAI GPT-4 (Recommended):

# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o

For Local Models (Free, but requires more compute):

# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3

See .env.example for all configuration options.
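
For reference, these variables are typically consumed with python-dotenv roughly as follows; this is a minimal sketch of the pattern, and the actual logic in config.py may differ.

# config sketch -- assumes python-dotenv; config.py may organize this differently
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")
LLM_MODEL = os.getenv("LLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.3")

if USE_OPENAI and not OPENAI_API_KEY:
    raise RuntimeError("USE_OPENAI=true but OPENAI_API_KEY is not set")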

5. Prepare Data

Place your Amazon product CSV file in the project root:

amazon_multimodal_clean.csv

Expected CSV columns:

  • uniq_id: Unique product identifier
  • product_name: Product name
  • product_text: Product description
  • main_category: Product category
  • image: Image URLs (pipe-separated)
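
Before building the index, you can sanity-check that the file matches this schema. The snippet below is a standalone sketch using pandas and is not part of the project code.

# csv sanity check -- standalone sketch, assumes pandas is installed
import pandas as pd

REQUIRED_COLUMNS = {"uniq_id", "product_name", "product_text", "main_category", "image"}

df = pd.read_csv("amazon_multimodal_clean.csv")
missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise SystemExit(f"CSV is missing required columns: {sorted(missing)}")

# Image URLs are pipe-separated; show the first couple for the first product.
print(f"{len(df)} products loaded; sample image URLs:",
      str(df["image"].iloc[0]).split("|")[:2])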

Usage

Step 1: Build Vector Index

python rag.py --build --csv amazon_multimodal_clean.csv --max 1000

Options:

  • --csv: Path to your CSV file
  • --max: Maximum number of products to index (optional; all products are indexed if omitted)
  • --db: Database directory (default: chromadb_store)

This will:

  • Download product images
  • Generate CLIP embeddings
  • Build ChromaDB vector index
  • Save to chromadb_store/
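
Conceptually, the build step boils down to the following: encode each product with CLIP, L2-normalize the embedding, and store it in a persistent ChromaDB collection. This is a simplified sketch of the idea behind rag.py (the collection name is illustrative), not its actual code; the real script also embeds images, batches work, and handles download failures.

# indexing sketch -- simplified; rag.py adds image embeddings, batching, and error handling
import clip
import torch
import chromadb

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

client = chromadb.PersistentClient(path="chromadb_store")
collection = client.get_or_create_collection("products", metadata={"hnsw:space": "cosine"})

ids = ["example-uniq-id-1"]                                    # uniq_id values
texts = ["Wireless earbuds with active noise cancellation"]    # product_text values

with torch.no_grad():
    tokens = clip.tokenize(texts, truncate=True).to(device)
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)                 # L2-normalize for cosine search

collection.add(
    ids=ids,
    embeddings=emb.cpu().tolist(),
    metadatas=[{"product_name": "Example Earbuds", "main_category": "electronics"}],
)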

Step 2: Start API Server

python api_server.py

The server will start on http://localhost:8000

Startup Notes:

  • GPT-4 Mode: The server starts instantly; the first request takes 2-5 seconds (API call)
  • Local Model Mode: The first request takes 10-60 seconds while the model loads into memory; subsequent requests are fast (model cached)

Step 3: Open Web Interface

Navigate to: http://localhost:8000

Search Modes:

  • Text Only: Search using natural language queries
  • Image Only: Upload a product image to find similar items
  • Multimodal: Combine text and image for refined search

Example Queries:

  • "Wireless earbuds with noise cancellation under $150"
  • "What is this product and how is it used?" (with image)
  • "Compare the top two smartwatches you found"

🔧 Configuration

LLM Backend Selection

The system supports two LLM backends that can be switched via environment variables:

Option 1: OpenAI GPT-4 (Recommended)

Advantages:

  • Superior response quality
  • Faster response times (2-5 seconds)
  • No GPU required
  • Lower memory footprint

Requirements:

  • OpenAI API key
  • Internet connection
  • Cost: ~$0.01-0.03 per query

Configuration:

# .env file
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-api-key-here
OPENAI_MODEL=gpt-4o
OPENAI_MAX_TOKENS=512
OPENAI_TEMPERATURE=0.2
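
Under the hood, a GPT-4 backed answer is a single chat-completion call; llm.py layers prompt engineering and retrieved context on top. A minimal sketch with the official openai Python client:

# openai sketch -- bare call; llm.py adds retrieval context and prompt templates
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=512,
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a helpful e-commerce shopping assistant."},
        {"role": "user", "content": "Recommend wireless earbuds with noise cancellation under $150."},
    ],
)
print(response.choices[0].message.content)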

Option 2: Local Open-Source Models

Advantages:

  • Free (no API costs)
  • Complete data privacy
  • Works offline
  • Customizable (fine-tuning possible)

Requirements:

  • 16GB+ RAM (32GB+ for Mixtral)
  • GPU recommended (CUDA-compatible)

Supported Models:

  • mistralai/Mistral-7B-Instruct-v0.3 (7B params, recommended)
  • meta-llama/Meta-Llama-3-8B-Instruct (8B params)
  • mistralai/Mixtral-8x7B-Instruct-v0.1 (47B params, requires 32GB+ RAM)

Configuration:

# .env file
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
LLM_MAX_TOKENS=512
LLM_TEMPERATURE=0.2
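
Locally, the equivalent generation path uses Hugging Face transformers; the sketch below shows the bare idea, while llm.py adds prompt templates and keeps the model cached between requests.

# local model sketch -- needs a GPU (or patience on CPU) plus the transformers and accelerate packages
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",            # places weights on the GPU when one is available
)

messages = [{"role": "user", "content": "Recommend wireless earbuds under $150."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.2, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))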

Other Configuration Options

# Data paths
CSV_PATH=amazon_multimodal_clean.csv
CHROMA_DIR=chromadb_store
IMAGE_DIR=images

# CLIP model
CLIP_MODEL=ViT-B/32  # Options: ViT-B/32, ViT-B/16, ViT-L/14

# API server
API_HOST=0.0.0.0
API_PORT=8000
ALLOWED_ORIGINS=*

# Retrieval settings
TOP_K_PRODUCTS=5
MAX_TEXT_LENGTH=400

# Logging
LOG_LEVEL=INFO  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL

See .env.example for the complete configuration template.

Evaluation

Evaluate retrieval quality:

python rag.py --eval --csv amazon_multimodal_clean.csv

Metrics computed:

  • Accuracy@1: Whether the top result matches the query product's category
  • Recall@1, @5, @10: Whether a same-category product appears in the top 1, 5, or 10 results (a small computation sketch follows)
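
Both metrics reduce to a simple check over the ranked results: does any of the top K retrieved products share the query product's category? A minimal sketch of the computation (the real evaluation loop lives behind rag.py --eval):

# recall@k sketch -- illustrates the metric only
from typing import List

def recall_at_k(query_category: str, retrieved_categories: List[str], k: int) -> float:
    """1.0 if any of the top-k retrieved products shares the query's category, else 0.0."""
    return 1.0 if query_category in retrieved_categories[:k] else 0.0

retrieved = ["electronics", "electronics", "home", "electronics", "toys"]
for k in (1, 5, 10):
    print(f"Recall@{k}: {recall_at_k('home', retrieved, k)}")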

Testing

Test Retrieval Only

# Text query
python rag.py --text "wireless headphones" --db chromadb_store

# Image query
python rag.py --image path/to/product.jpg --db chromadb_store
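
The query path mirrors the indexing step: embed the query with CLIP, normalize it, and ask ChromaDB for the nearest neighbors. A simplified sketch (the collection name is illustrative and must match the one built by rag.py):

# query sketch -- text query against an existing chromadb_store index
import clip
import torch
import chromadb

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    emb = model.encode_text(clip.tokenize(["wireless headphones"]).to(device))
    emb = emb / emb.norm(dim=-1, keepdim=True)

collection = chromadb.PersistentClient(path="chromadb_store").get_or_create_collection("products")
results = collection.query(query_embeddings=emb.cpu().tolist(), n_results=5)
print(results["metadatas"])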

Test LLM Generation

python llm.py

Project Structure

Multimodel/
├── rag.py                  # CLIP + ChromaDB retrieval system
├── llm.py                  # LLM interface with prompt engineering
├── api_server.py           # FastAPI REST API
├── config.py               # Configuration management
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── .gitignore              # Git ignore rules
├── frontend/
│   ├── index.html          # Web UI
│   ├── main.js             # Frontend JavaScript
│   └── amazon-logo.png     # Logo asset
├── chromadb_store/         # Vector database (generated)
├── images/                 # Downloaded product images (generated)
└── amazon_multimodal_clean.csv  # Your dataset

Troubleshooting

Issue: "OpenAI API key is required"

Solution: Ensure you've created a .env file and installed the python-dotenv dependency:

# Install dotenv if missing
pip install python-dotenv

# Create .env file
cp .env.example .env

# Edit .env and add your API key
USE_OPENAI=true
OPENAI_API_KEY=sk-proj-your-actual-api-key-here

Issue: "TypeError: failed to extract enum MetadataValue"

Solution: This occurs during index building with ChromaDB. Update to the latest version:

pip install --upgrade chromadb

The code now handles None values properly by converting them to empty strings.

Issue: "CUDA out of memory" (Local Models)

Solution: Use CPU mode or reduce the batch size:

# Force CPU mode
export CUDA_VISIBLE_DEVICES=-1
python api_server.py

Issue: "Model loading takes too long" (Local Models)

Solution: This is normal for first request (10-60s). The model is cached in memory for subsequent requests. Consider using GPT-4 for faster response times.

Issue: "Image download failures"

Solution: Some product URLs may be invalid or expired. This is normal and logged. The system will use text-only embeddings for those products.

Issue: Port 8000 already in use

Solution: Change the port via an environment variable:

export API_PORT=8080
python api_server.py

Issue: Duplicate products after multiple index builds

Solution: The indexer calls ChromaDB's add(), which does not deduplicate across repeated builds. To rebuild the index cleanly, delete the database directory first (an upsert() alternative is sketched after the commands):

rm -rf chromadb_store
python rag.py --build --csv amazon_multimodal_clean.csv
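
Alternatively, if you modify the indexing code, ChromaDB collections support upsert(), which writes by ID and overwrites existing entries instead of appending new ones. A hedged sketch (the project currently uses add(); the ID, collection name, and placeholder embedding below are illustrative):

# upsert sketch -- entries sharing the same uniq_id are overwritten, not duplicated
import chromadb

collection = chromadb.PersistentClient(path="chromadb_store").get_or_create_collection("products")
collection.upsert(
    ids=["example-uniq-id-1"],
    embeddings=[[0.0] * 512],                      # placeholder; ViT-B/32 embeddings are 512-dimensional
    metadatas=[{"product_name": "Example Earbuds"}],
)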

Security Notes

  • CORS: Currently set to allow_origins=["*"] for development
    • For production, configure ALLOWED_ORIGINS to specific domains (a configuration sketch follows this list)
  • Error Messages: Generic errors are returned to clients; detailed logs are server-side only
  • File Uploads: Images are validated and temporarily stored, then cleaned up
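
Restricting CORS for production amounts to feeding ALLOWED_ORIGINS into FastAPI's CORSMiddleware. The sketch below shows the pattern; the actual middleware setup in api_server.py may differ.

# cors sketch -- restrict allowed origins via the ALLOWED_ORIGINS environment variable
import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
origins = os.getenv("ALLOWED_ORIGINS", "*").split(",")

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,             # e.g. ALLOWED_ORIGINS=https://shop.example.com
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)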

Performance Optimization

Implemented Optimizations:

  1. LLM Singleton Pattern: Model loads once at server startup and is reused across requests (5-20x speedup; see the sketch after this list)
  2. CLIP Embedding Caching: CLIP model stays in memory after first load
  3. ChromaDB HNSW Indexing: Approximate nearest neighbor search with O(log N) complexity
  4. L2 Normalized Embeddings: Cosine similarity computed via efficient dot products
  5. Graceful Error Handling: Image download failures don't block indexing process
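
The singleton pattern from item 1 amounts to constructing the expensive LLM wrapper once and reusing it for every request. A minimal sketch of the idea (the class here is a stand-in, not the actual wrapper in llm.py):

# singleton sketch -- the costly constructor runs once; later calls reuse the cached instance
from functools import lru_cache

class LLMClient:                          # stand-in for the real wrapper in llm.py
    def __init__(self):
        print("loading model weights ...")   # slow step, executed a single time
    def generate(self, prompt: str) -> str:
        return f"response to: {prompt}"

@lru_cache(maxsize=1)
def get_llm() -> LLMClient:
    return LLMClient()

get_llm().generate("first request")       # triggers the load
get_llm().generate("second request")      # reuses the cached instance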

Additional Optimizations for Production:

  1. Use GPU: CUDA-enabled GPU for 10-50x faster CLIP inference (local models)
  2. Use GPT-4: Cloud-based LLM eliminates model loading overhead
  3. Batch Processing: Build index in batches for large datasets
  4. CDN for Images: Serve product images via CDN
  5. Load Balancer: Use multiple API instances behind a load balancer
  6. Redis Caching: Cache frequent queries and embeddings

Future Enhancements

  • Add user authentication
  • Implement product filtering (price, brand, etc.)
  • Add bookmark/favorites functionality
  • Support multilingual queries
  • Integrate with real Amazon API
  • Add A/B testing for different prompts
  • Implement caching layer (Redis)
  • Add monitoring and analytics

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/YourFeature)
  3. Commit changes (git commit -m 'Add YourFeature')
  4. Push to branch (git push origin feature/YourFeature)
  5. Open a Pull Request

License

This project is for educational and research purposes.

Acknowledgments

  • OpenAI: CLIP multimodal embeddings and GPT-4 API
  • ChromaDB: Vector database with HNSW indexing
  • HuggingFace: Transformers library and model hosting
  • FastAPI: Modern web framework
  • Mistral AI / Meta: Open-source LLM models
  • Tailwind CSS: Frontend styling framework

Additional Documentation

  • Research Report: Comprehensive technical report in LaTeX format covering implementation details, challenges, solutions, and future improvements
  • Quick Start Guide for GPT-4: Step-by-step guide for setting up with OpenAI GPT-4

Built with ❤️ using CLIP, ChromaDB, GPT-4, and Open-Source LLMs