
Evaluation System Guide

This guide explains how to use evaluation.py to evaluate the Amazon Multimodal RAG system.

Evaluation Metrics

Retrieval Metrics

Accuracy@1

  • Percentage of queries where the top-1 result has the correct category
  • Range: 0.0 - 1.0 (higher is better)

Recall@K

  • Percentage of queries where the correct category appears in the top-K results
  • Measured at K = 1, 5, 10
  • Range: 0.0 - 1.0 (higher is better)

MRR (Mean Reciprocal Rank)

  • Mean of 1/rank of the first correct result across all queries
  • Range: 0.0 - 1.0 (higher is better)
  • MRR = 1.0 means the first correct result is always ranked first (every top-1 result is correct)

MAP (Mean Average Precision)

  • Mean over queries of the average precision across all relevant results (rewards placing correct results early in the ranking)
  • Range: 0.0 - 1.0 (higher is better)
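
These ranking metrics can be computed directly from the list of retrieved categories. The sketch below is illustrative only; ranking_metrics is a hypothetical helper that scores a single query, and the actual logic in evaluation.py may differ.

def ranking_metrics(ground_truth, ranked_categories, ks=(1, 5, 10)):
    # Illustrative sketch, not the exact implementation in evaluation.py.
    hits = [cat == ground_truth for cat in ranked_categories]

    # Accuracy@1: is the top-1 result in the correct category?
    accuracy_at_1 = 1.0 if hits and hits[0] else 0.0

    # Recall@K: does a correct result appear anywhere in the top K?
    recall_at_k = {k: 1.0 if any(hits[:k]) else 0.0 for k in ks}

    # Reciprocal rank: 1/rank of the first correct result (0 if none found)
    rr = next((1.0 / (i + 1) for i, hit in enumerate(hits) if hit), 0.0)

    # Average precision: mean precision at each rank holding a correct result
    precisions = [sum(hits[:i + 1]) / (i + 1) for i, hit in enumerate(hits) if hit]
    ap = sum(precisions) / len(precisions) if precisions else 0.0

    # MRR and MAP are the means of rr and ap across all evaluated queries.
    return accuracy_at_1, recall_at_k, rr, ap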

Distance Metrics

  • Top-1 Distance: Distance to first result (lower is better)
  • Average Distance: Mean distance of top-5 results (lower is better)

Response Metrics

Response Time

  • Time to generate response in seconds
  • Evaluates system performance and user experience

Product Mention Rate

  • Percentage of the top-3 retrieved products that are mentioned in the response
  • Range: 0.0 - 1.0 (higher means response uses retrieval better)

Category Mention Rate

  • Percentage of responses that mention the correct product category
  • Range: 0.0 - 1.0

Semantic Similarity

  • Cosine similarity between query and response embeddings
  • Range: -1.0 to 1.0 (higher means more relevant response)
  • Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)
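
A minimal sketch of how this metric can be computed, assuming a sentence-transformers embedding model. The model name below is an assumption for illustration; evaluation.py may use a different embedding backend.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, for illustration only

def semantic_similarity(query: str, response: str) -> float:
    # Embed both texts and return their cosine similarity.
    q_vec, r_vec = model.encode([query, response])
    return float(np.dot(q_vec, r_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(r_vec)))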

Response Quality Indicators

  • Hedging Rate: Percentage of responses that use uncertain language ("not sure", "don't know")
  • Comparison Rate: Percentage containing product comparisons
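
These quality indicators are simple text heuristics. The sketch below shows one way to compute the product mention rate and the hedging check for a single response; the keyword lists are examples only, and the exact matching rules in evaluation.py may differ.

def product_mention_rate(response, top3_product_titles):
    # Fraction of the top-3 retrieved products whose title appears in the response.
    mentioned = [t for t in top3_product_titles if t.lower() in response.lower()]
    return len(mentioned) / len(top3_product_titles) if top3_product_titles else 0.0

HEDGING_PHRASES = ["not sure", "don't know"]  # example phrases only

def is_hedging(response):
    # True if the response uses any of the uncertain-language phrases.
    return any(phrase in response.lower() for phrase in HEDGING_PHRASES)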

Category Match Rate

  • Percentage of queries where the top-1 retrieved product's category matches the ground truth
  • Range: 0.0 - 1.0

Quick Start

Prerequisites

  1. Build vector database index
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
  2. Configure API keys (if using OpenAI)
# .env file
USE_OPENAI=true
OPENAI_API_KEY=your-api-key-here
  3. Install dependencies
pip install pandas openpyxl

Basic Usage

Retrieval evaluation only (fast, recommended first)

python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output retrieval_eval.xlsx \
  --retrieval-only \
  --max-retrieval 100

Expected time: 2-5 minutes (100 queries)

Full evaluation (retrieval + response quality)

python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output full_eval.xlsx \
  --max-retrieval 100 \
  --max-response 50 \
  --mode zero-shot

Expected time:

  • OpenAI GPT-4: 5-10 minutes (50 queries)
  • Local models: 20-60 minutes (50 queries)

Evaluation Modes

Retrieval-Only Mode

Evaluates retrieval system without LLM:

python evaluation.py --csv data.csv --retrieval-only

Advantages:

  • Fast (no LLM wait time)
  • Tests core retrieval capability
  • No API token consumption

Use cases:

  • Debugging retrieval system
  • Optimizing embedding models
  • Quick performance benchmarks

End-to-End Mode

Evaluates full RAG pipeline (retrieval + LLM + response quality):

python evaluation.py --csv data.csv --max-response 50

Advantages:

  • Comprehensive performance assessment
  • Tests LLM response quality
  • Identifies end-to-end issues

Disadvantages:

  • Slower
  • Consumes API tokens (if using OpenAI)

Prompt Modes

# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot

# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot

# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot

Comparison:

  • Zero-shot: Fastest; no examples in the prompt
  • Few-shot: Medium speed; includes 2 examples in the prompt
  • Multi-shot: Slowest; includes more examples (usually better response quality)
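
The modes differ only in how many worked examples are prepended to the prompt. The snippet below is a generic illustration of the idea, with made-up example pairs; it is not the project's actual prompt template.

# Generic illustration of zero-shot vs. few-shot prompting (not the actual template).
EXAMPLES = [
    ("I need wireless earbuds for running", "Example answer recommending a sport earbud..."),
    ("Looking for a budget laptop sleeve", "Example answer recommending a padded sleeve..."),
]

def build_prompt(question, context, mode="zero-shot"):
    prompt = ""
    if mode != "zero-shot":
        # Few-shot / multi-shot: prepend example Q&A pairs before the real question.
        for q, a in EXAMPLES:
            prompt += f"Question: {q}\nAnswer: {a}\n\n"
    prompt += f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return prompt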

Understanding Results

Excel Output Structure

The generated Excel file contains multiple sheets:

Sheet 1: Summary

  • Overview of all metrics
  • Average values for retrieval and response metrics
  • Use: Quick system performance overview

Sheet 2: Retrieval_Details

  • Detailed metrics for each query
  • Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
  • Use: Analyze which queries perform well/poorly, identify system weaknesses

Sheet 3: Response_Details

  • LLM response details for each query
  • Columns: query_id, query, response, response_time, quality metrics
  • Use: Analyze LLM response quality, compare prompt modes, identify hallucinations

Sheet 4: Chart_Data

  • Pre-formatted data for creating charts
  • Use: Quick visualization creation
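
The workbook can also be inspected programmatically. The example below uses the sheet names listed above; the exact column names may vary slightly in your output.

import pandas as pd

# Load every sheet of the evaluation workbook into a dict of DataFrames.
sheets = pd.read_excel("full_eval.xlsx", sheet_name=None)

summary = sheets["Summary"]
retrieval = sheets["Retrieval_Details"]
print(summary)

# Example error analysis: queries whose top-1 result had the wrong category.
misses = retrieval[retrieval["accuracy_at_1"] == 0]
print(misses[["query_id", "query_text", "ground_truth_category"]])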

Performance Benchmarks

Retrieval Metrics Benchmarks:

Metric         | Excellent | Good      | Needs Work
---------------|-----------|-----------|------------
Accuracy@1     | >0.80     | 0.65-0.80 | <0.65
Recall@5       | >0.90     | 0.75-0.90 | <0.75
Recall@10      | >0.95     | 0.85-0.95 | <0.85
MRR            | >0.85     | 0.70-0.85 | <0.70
MAP            | >0.80     | 0.65-0.80 | <0.65

Response Metrics Benchmarks:

Metric                  | Excellent | Good      | Needs Work
------------------------|-----------|-----------|------------
Response Time (GPT-4)   | <3s       | 3-5s      | >5s
Response Time (Local)   | <10s      | 10-30s    | >30s
Semantic Similarity     | >0.70     | 0.55-0.70 | <0.55
Product Mention Rate    | >0.70     | 0.50-0.70 | <0.50
Hedging Rate            | <0.10     | 0.10-0.25 | >0.25

Advanced Usage

Custom Evaluation Size

# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5

# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50

# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200

Using Evaluation in Code

from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel

# Evaluate retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
    csv_path="amazon_multimodal_clean.csv",
    max_queries=100
)

print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")

# Export to Excel
export_to_excel(
    retrieval_results=results_df,
    retrieval_metrics=metrics,
    output_path="my_eval.xlsx"
)

Batch Evaluation of Different Configurations

# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
  python evaluation.py \
    --csv data.csv \
    --mode $mode \
    --output "eval_${mode}.xlsx" \
    --max-response 50
done

Troubleshooting

Problem: ModuleNotFoundError: No module named 'openpyxl'

Solution:

pip install openpyxl pandas

Problem: Evaluation too slow

Solutions:

  1. Use --retrieval-only mode (skip LLM)
  2. Reduce evaluation count: --max-response 10
  3. Use OpenAI GPT-4 instead of local models
  4. Use faster local models (Mistral-7B instead of Mixtral-8x7B)

Problem: OpenAI API timeout or errors

Solutions:

# Check API key
echo $OPENAI_API_KEY

# Check .env file
cat .env | grep OPENAI

# Use local model instead
# In .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3

Problem: CUDA out of memory (local models)

Solutions:

# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1

# Or use smaller model
# In .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3

Best Practices

Iterative Evaluation Workflow

Step 1: Quick retrieval evaluation (10-20 queries)
  |
Step 2: Analyze results, adjust parameters
  |
Step 3: Medium-scale retrieval evaluation (100 queries)
  |
Step 4: Small end-to-end evaluation (20-30 queries)
  |
Step 5: Full evaluation (100+ retrieval + 50+ response)

A/B Testing Different Configurations

# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx

# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx

Compare Summary sheets in Excel to see differences.

Continuous Monitoring

Integrate evaluation into development workflow:

# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30

Compare evaluations from different dates to track performance changes.
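
A small sketch for comparing two dated runs, assuming the first column of each Summary sheet holds the metric name; the file names are examples, and the exact Summary layout may differ.

import pandas as pd

# Example file names produced by the dated command above.
old = pd.read_excel("eval_20250101.xlsx", sheet_name="Summary")
new = pd.read_excel("eval_20250201.xlsx", sheet_name="Summary")

# Assumes the first column holds the metric name; adjust if your layout differs.
key = old.columns[0]
comparison = old.merge(new, on=key, suffixes=("_old", "_new"))
print(comparison)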


Example Commands

# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50

# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100

# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot

# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot

# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot

Help

  • View evaluation.py source code for detailed comments
  • Run python evaluation.py --help for all parameters
  • Check README.md for overall project architecture

Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0