
Evaluation System Guide

This guide explains how to use evaluation.py to evaluate the Amazon Multimodal RAG system.

Evaluation Metrics

Retrieval Metrics

Accuracy@1

  • Percentage of queries where the top-1 result has the correct category
  • Range: 0.0 - 1.0 (higher is better)

Recall@K

  • Percentage of queries where the correct category appears in the top-K results
  • Measured at K = 1, 5, 10
  • Range: 0.0 - 1.0 (higher is better)

MRR (Mean Reciprocal Rank)

  • Mean of 1/rank of the first correct result across all queries
  • Range: 0.0 - 1.0 (higher is better)
  • MRR = 1.0 means the first correct result is always ranked first (every top-1 result is correct)

MAP (Mean Average Precision)

  • Mean over queries of the average precision across all relevant results (rewards placing correct results early in the ranking)
  • Range: 0.0 - 1.0 (higher is better)
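
These ranking metrics can be computed directly from the list of retrieved categories. The sketch below is illustrative only; ranking_metrics is a hypothetical helper that scores a single query, and the actual logic in evaluation.py may differ.

def ranking_metrics(ground_truth, ranked_categories, ks=(1, 5, 10)):
    # Illustrative sketch, not the exact implementation in evaluation.py.
    hits = [cat == ground_truth for cat in ranked_categories]

    # Accuracy@1: is the top-1 result in the correct category?
    accuracy_at_1 = 1.0 if hits and hits[0] else 0.0

    # Recall@K: does a correct result appear anywhere in the top K?
    recall_at_k = {k: 1.0 if any(hits[:k]) else 0.0 for k in ks}

    # Reciprocal rank: 1/rank of the first correct result (0 if none found)
    rr = next((1.0 / (i + 1) for i, hit in enumerate(hits) if hit), 0.0)

    # Average precision: mean precision at each rank holding a correct result
    precisions = [sum(hits[:i + 1]) / (i + 1) for i, hit in enumerate(hits) if hit]
    ap = sum(precisions) / len(precisions) if precisions else 0.0

    # MRR and MAP are the means of rr and ap across all evaluated queries.
    return accuracy_at_1, recall_at_k, rr, ap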

Distance Metrics

  • Top-1 Distance: Distance to first result (lower is better)
  • Average Distance: Mean distance of top-5 results (lower is better)

Response Metrics

Response Time

  • Time to generate response in seconds
  • Evaluates system performance and user experience

Product Mention Rate

  • Percentage of the top-3 retrieved products that are mentioned in the response
  • Range: 0.0 - 1.0 (higher means response uses retrieval better)

Category Mention Rate

  • Percentage of responses that mention the correct product category
  • Range: 0.0 - 1.0

Semantic Similarity

  • Cosine similarity between query and response embeddings
  • Range: -1.0 to 1.0 (higher means more relevant response)
  • Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)
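
A minimal sketch of how this metric can be computed, assuming a sentence-transformers embedding model. The model name below is an assumption for illustration; evaluation.py may use a different embedding backend.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, for illustration only

def semantic_similarity(query: str, response: str) -> float:
    # Embed both texts and return their cosine similarity.
    q_vec, r_vec = model.encode([query, response])
    return float(np.dot(q_vec, r_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(r_vec)))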

Response Quality Indicators

  • Hedging Rate: Percentage of responses that use uncertain language ("not sure", "don't know")
  • Comparison Rate: Percentage containing product comparisons
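
These quality indicators are simple text heuristics. The sketch below shows one way to compute the product mention rate and the hedging check for a single response; the keyword lists are examples only, and the exact matching rules in evaluation.py may differ.

def product_mention_rate(response, top3_product_titles):
    # Fraction of the top-3 retrieved products whose title appears in the response.
    mentioned = [t for t in top3_product_titles if t.lower() in response.lower()]
    return len(mentioned) / len(top3_product_titles) if top3_product_titles else 0.0

HEDGING_PHRASES = ["not sure", "don't know"]  # example phrases only

def is_hedging(response):
    # True if the response uses any of the uncertain-language phrases.
    return any(phrase in response.lower() for phrase in HEDGING_PHRASES)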

Category Match Rate

  • Percentage of queries where the top-1 retrieved product's category matches the ground truth
  • Range: 0.0 - 1.0

Quick Start

Prerequisites

  1. Build vector database index
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
  2. Configure API keys (if using OpenAI)
# .env file
USE_OPENAI=true
OPENAI_API_KEY=your-api-key-here
  3. Install dependencies
pip install pandas openpyxl

Basic Usage

Retrieval evaluation only (fast, recommended first)

python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output retrieval_eval.xlsx \
  --retrieval-only \
  --max-retrieval 100

Expected time: 2-5 minutes (100 queries)

Full evaluation (retrieval + response quality)

python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output full_eval.xlsx \
  --max-retrieval 100 \
  --max-response 50 \
  --mode zero-shot

Expected time:

  • OpenAI GPT-4: 5-10 minutes (50 queries)
  • Local models: 20-60 minutes (50 queries)

Evaluation Modes

Retrieval-Only Mode

Evaluates retrieval system without LLM:

python evaluation.py --csv data.csv --retrieval-only

Advantages:

  • Fast (no LLM wait time)
  • Tests core retrieval capability
  • No API token consumption

Use cases:

  • Debugging retrieval system
  • Optimizing embedding models
  • Quick performance benchmarks

End-to-End Mode

Evaluates full RAG pipeline (retrieval + LLM + response quality):

python evaluation.py --csv data.csv --max-response 50

Advantages:

  • Comprehensive performance assessment
  • Tests LLM response quality
  • Identifies end-to-end issues

Disadvantages:

  • Slower
  • Consumes API tokens (if using OpenAI)

Prompt Modes

# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot

# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot

# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot

Comparison:

  • Zero-shot: Fastest; no examples in the prompt
  • Few-shot: Medium speed; includes 2 examples in the prompt
  • Multi-shot: Slowest; includes more examples (usually better response quality)
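
The modes differ only in how many worked examples are prepended to the prompt. The snippet below is a generic illustration of the idea, with made-up example pairs; it is not the project's actual prompt template.

# Generic illustration of zero-shot vs. few-shot prompting (not the actual template).
EXAMPLES = [
    ("I need wireless earbuds for running", "Example answer recommending a sport earbud..."),
    ("Looking for a budget laptop sleeve", "Example answer recommending a padded sleeve..."),
]

def build_prompt(question, context, mode="zero-shot"):
    prompt = ""
    if mode != "zero-shot":
        # Few-shot / multi-shot: prepend example Q&A pairs before the real question.
        for q, a in EXAMPLES:
            prompt += f"Question: {q}\nAnswer: {a}\n\n"
    prompt += f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return prompt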

Understanding Results

Excel Output Structure

The generated Excel file contains multiple sheets:

Sheet 1: Summary

  • Overview of all metrics
  • Average values for retrieval and response metrics
  • Use: Quick system performance overview

Sheet 2: Retrieval_Details

  • Detailed metrics for each query
  • Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
  • Use: Analyze which queries perform well/poorly, identify system weaknesses

Sheet 3: Response_Details

  • LLM response details for each query
  • Columns: query_id, query, response, response_time, quality metrics
  • Use: Analyze LLM response quality, compare prompt modes, identify hallucinations

Sheet 4: Chart_Data

  • Pre-formatted data for creating charts
  • Use: Quick visualization creation
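
The workbook can also be inspected programmatically. The example below uses the sheet names listed above; the exact column names may vary slightly in your output.

import pandas as pd

# Load every sheet of the evaluation workbook into a dict of DataFrames.
sheets = pd.read_excel("full_eval.xlsx", sheet_name=None)

summary = sheets["Summary"]
retrieval = sheets["Retrieval_Details"]
print(summary)

# Example error analysis: queries whose top-1 result had the wrong category.
misses = retrieval[retrieval["accuracy_at_1"] == 0]
print(misses[["query_id", "query_text", "ground_truth_category"]])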

Performance Benchmarks

Retrieval Metrics Benchmarks:

Metric         | Excellent | Good      | Needs Work
---------------|-----------|-----------|------------
Accuracy@1     | >0.80     | 0.65-0.80 | <0.65
Recall@5       | >0.90     | 0.75-0.90 | <0.75
Recall@10      | >0.95     | 0.85-0.95 | <0.85
MRR            | >0.85     | 0.70-0.85 | <0.70
MAP            | >0.80     | 0.65-0.80 | <0.65

Response Metrics Benchmarks:

Metric                  | Excellent | Good      | Needs Work
------------------------|-----------|-----------|------------
Response Time (GPT-4)   | <3s       | 3-5s      | >5s
Response Time (Local)   | <10s      | 10-30s    | >30s
Semantic Similarity     | >0.70     | 0.55-0.70 | <0.55
Product Mention Rate    | >0.70     | 0.50-0.70 | <0.50
Hedging Rate            | <0.10     | 0.10-0.25 | >0.25

Advanced Usage

Custom Evaluation Size

# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5

# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50

# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200

Using Evaluation in Code

from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel

# Evaluate retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
    csv_path="amazon_multimodal_clean.csv",
    max_queries=100
)

print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")

# Export to Excel
export_to_excel(
    retrieval_results=results_df,
    retrieval_metrics=metrics,
    output_path="my_eval.xlsx"
)

Batch Evaluation of Different Configurations

# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
  python evaluation.py \
    --csv data.csv \
    --mode $mode \
    --output "eval_${mode}.xlsx" \
    --max-response 50
done

Troubleshooting

Problem: ModuleNotFoundError: No module named 'openpyxl'

Solution:

pip install openpyxl pandas

Problem: Evaluation too slow

Solutions:

  1. Use --retrieval-only mode (skip LLM)
  2. Reduce evaluation count: --max-response 10
  3. Use OpenAI GPT-4 instead of local models
  4. Use faster local models (Mistral-7B instead of Mixtral-8x7B)

Problem: OpenAI API timeout or errors

Solutions:

# Check API key
echo $OPENAI_API_KEY

# Check .env file
cat .env | grep OPENAI

# Use local model instead
# In .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3

Problem: CUDA out of memory (local models)

Solutions:

# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1

# Or use smaller model
# In .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3

Best Practices

Iterative Evaluation Workflow

Step 1: Quick retrieval evaluation (10-20 queries)
  |
Step 2: Analyze results, adjust parameters
  |
Step 3: Medium-scale retrieval evaluation (100 queries)
  |
Step 4: Small end-to-end evaluation (20-30 queries)
  |
Step 5: Full evaluation (100+ retrieval + 50+ response)

A/B Testing Different Configurations

# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx

# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx

Compare Summary sheets in Excel to see differences.

Continuous Monitoring

Integrate evaluation into development workflow:

# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30

Compare evaluations from different dates to track performance changes.
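
A small sketch for comparing two dated runs, assuming the first column of each Summary sheet holds the metric name; the file names are examples, and the exact Summary layout may differ.

import pandas as pd

# Example file names produced by the dated command above.
old = pd.read_excel("eval_20250101.xlsx", sheet_name="Summary")
new = pd.read_excel("eval_20250201.xlsx", sheet_name="Summary")

# Assumes the first column holds the metric name; adjust if your layout differs.
key = old.columns[0]
comparison = old.merge(new, on=key, suffixes=("_old", "_new"))
print(comparison)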


Example Commands

# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50

# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100

# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot

# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot

# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot

Help

  • View evaluation.py source code for detailed comments
  • Run python evaluation.py --help for all parameters
  • Check README.md for overall project architecture

Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0