Evaluation System Guide
This guide explains how to use evaluation.py to evaluate the Amazon Multimodal RAG system.
Evaluation Metrics
Retrieval Metrics
Accuracy@1
- Percentage of queries where the top-1 result has the correct category
- Range: 0.0 - 1.0 (higher is better)
Recall@K
- Percentage of queries where correct category appears in top-K results
- Measured at K = 1, 5, 10
- Range: 0.0 - 1.0 (higher is better)
MRR (Mean Reciprocal Rank)
- Average of 1/rank of the first correct result, taken over all queries
- Range: 0.0 - 1.0 (higher is better)
- MRR = 1.0 means all top-1 results are correct
MAP (Mean Average Precision)
- Mean of per-query average precision, where precision is computed at the rank of each relevant result
- Range: 0.0 - 1.0 (higher is better)
Distance Metrics
- Top-1 Distance: Distance to first result (lower is better)
- Average Distance: Mean distance of top-5 results (lower is better)
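The retrieval metrics above can be sketched in a few lines of Python. This is an illustrative implementation, not the code in evaluation.py; it assumes each query yields a ranked list of retrieved categories plus one ground-truth category:

```python
def retrieval_metrics(ranked_categories, truth, ks=(1, 5, 10)):
    """Compute Recall@K, reciprocal rank, and average precision for one
    query, given the categories of its ranked results."""
    hits = [c == truth for c in ranked_categories]
    # Recall@K: 1.0 if the correct category appears anywhere in the top K.
    recall = {k: float(any(hits[:k])) for k in ks}
    # Reciprocal rank: 1/position of the first correct result, 0 if absent.
    rr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)
    # Average precision: mean of precision@i at each correct position.
    precisions = [sum(hits[: i + 1]) / (i + 1) for i, h in enumerate(hits) if h]
    ap = sum(precisions) / len(precisions) if precisions else 0.0
    return recall, rr, ap

# Accuracy@1 for a whole run is simply the mean of recall[1] over queries;
# MRR and MAP are the means of rr and ap over queries.
recall, rr, ap = retrieval_metrics(["Shoes", "Electronics", "Shoes"], "Shoes", ks=(1, 2, 3))
```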
Response Metrics
Response Time
- Time to generate response in seconds
- Evaluates system performance and user experience
Product Mention Rate
- Percentage of top-3 retrieved products mentioned in response
- Range: 0.0 - 1.0 (higher means the response makes better use of the retrieved products)
Category Mention Rate
- Percentage of responses that mention the correct product category
- Range: 0.0 - 1.0
Semantic Similarity
- Cosine similarity between query and response embeddings
- Range: -1.0 to 1.0 (higher means a more relevant response)
- Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)
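Cosine similarity between two embedding vectors can be computed directly. A minimal sketch (the actual embeddings come from whatever embedding model the system is configured with):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical direction -> 1.0
```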
Response Quality Indicators
- Hedging Rate: Percentage using uncertain language ("not sure", "don't know")
- Comparison Rate: Percentage containing product comparisons
Category Match Rate
- Percentage where top-1 retrieved product category matches ground truth
- Range: 0.0 - 1.0
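Product Mention Rate and Hedging Rate reduce to simple string-level checks. A sketch under the assumption that both are plain case-insensitive substring matches (the exact matching logic in evaluation.py may differ):

```python
# Assumed hedge phrases for illustration; the real list may be longer.
HEDGE_PHRASES = ("not sure", "don't know", "cannot determine")

def product_mention_rate(response, top_products):
    """Fraction of the top retrieved product titles mentioned in the response."""
    text = response.lower()
    mentioned = sum(p.lower() in text for p in top_products)
    return mentioned / len(top_products) if top_products else 0.0

def is_hedged(response):
    """True if the response contains any uncertain-language phrase."""
    text = response.lower()
    return any(phrase in text for phrase in HEDGE_PHRASES)
```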
Quick Start
Prerequisites
- Build vector database index
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
- Configure API keys (if using OpenAI)
# .env file
USE_OPENAI=true
OPENAI_API_KEY=your-api-key-here
- Install dependencies
pip install pandas openpyxl
Basic Usage
Retrieval evaluation only (fast, recommended first)
python evaluation.py \
--csv amazon_multimodal_clean.csv \
--db chromadb_store \
--output retrieval_eval.xlsx \
--retrieval-only \
--max-retrieval 100
Expected time: 2-5 minutes (100 queries)
Full evaluation (retrieval + response quality)
python evaluation.py \
--csv amazon_multimodal_clean.csv \
--db chromadb_store \
--output full_eval.xlsx \
--max-retrieval 100 \
--max-response 50 \
--mode zero-shot
Expected time:
- OpenAI GPT-4: 5-10 minutes (50 queries)
- Local models: 20-60 minutes (50 queries)
Evaluation Modes
Retrieval-Only Mode
Evaluates retrieval system without LLM:
python evaluation.py --csv data.csv --retrieval-only
Advantages:
- Fast (no LLM wait time)
- Tests core retrieval capability
- No API token consumption
Use cases:
- Debugging retrieval system
- Optimizing embedding models
- Quick performance benchmarks
End-to-End Mode
Evaluates full RAG pipeline (retrieval + LLM + response quality):
python evaluation.py --csv data.csv --max-response 50
Advantages:
- Comprehensive performance assessment
- Tests LLM response quality
- Identifies end-to-end issues
Disadvantages:
- Slower
- Consumes API tokens (if using OpenAI)
Prompt Modes
# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot
# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot
# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot
Comparison:
- Zero-shot: Fastest, no examples
- Few-shot: Medium, provides 2 examples
- Multi-shot: Slower, multiple examples (usually better quality)
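The three modes differ only in how the prompt is assembled before the retrieved context. A hedged sketch (the example Q/A pairs and template below are hypothetical, not the ones shipped in evaluation.py):

```python
# Hypothetical example pairs for illustration only.
FEW_SHOT_EXAMPLES = [
    ("Waterproof hiking boots?", "The Trailmaster boots in the results are waterproof."),
    ("Budget wireless earbuds?", "The SoundLite earbuds retrieved above are the cheapest."),
    ("Gift for a coffee lover?", "The Acme pour-over kit from the results fits well."),
]

def build_prompt(query, context, mode="zero-shot"):
    """Prepend 0, 2, or all example Q/A pairs depending on the prompt mode."""
    n = {"zero-shot": 0, "few-shot": 2, "multi-shot": len(FEW_SHOT_EXAMPLES)}[mode]
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in FEW_SHOT_EXAMPLES[:n])
    return f"{shots}Context:\n{context}\n\nQ: {query}\nA:"
```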
Understanding Results
Excel Output Structure
The generated Excel file contains multiple sheets:
Sheet 1: Summary
- Overview of all metrics
- Average values for retrieval and response metrics
- Use: Quick system performance overview
Sheet 2: Retrieval_Details
- Detailed metrics for each query
- Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
- Use: Analyze which queries perform well/poorly, identify system weaknesses
Sheet 3: Response_Details
- LLM response details for each query
- Columns: query_id, query, response, response_time, quality metrics
- Use: Analyze LLM response quality, compare prompt modes, identify hallucinations
Sheet 4: Chart_Data
- Pre-formatted data for creating charts
- Use: Quick visualization creation
Performance Benchmarks
Retrieval Metrics Benchmarks:
Metric | Excellent | Good | Needs Work
---------------|-----------|-----------|------------
Accuracy@1 | >0.80 | 0.65-0.80 | <0.65
Recall@5 | >0.90 | 0.75-0.90 | <0.75
Recall@10 | >0.95 | 0.85-0.95 | <0.85
MRR | >0.85 | 0.70-0.85 | <0.70
MAP | >0.80 | 0.65-0.80 | <0.65
Response Metrics Benchmarks:
Metric | Excellent | Good | Needs Work
------------------------|-----------|-----------|------------
Response Time (GPT-4) | <3s | 3-5s | >5s
Response Time (Local) | <10s | 10-30s | >30s
Semantic Similarity | >0.70 | 0.55-0.70 | <0.55
Product Mention Rate | >0.70 | 0.50-0.70 | <0.50
Hedging Rate | <0.10 | 0.10-0.25 | >0.25
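The benchmark tables can be applied mechanically when scripting report generation. A small sketch that maps a higher-is-better retrieval metric to its tier, with thresholds copied from the retrieval table above:

```python
# (excellent_min, good_min) thresholds from the retrieval benchmarks table.
RETRIEVAL_THRESHOLDS = {
    "accuracy_at_1": (0.80, 0.65),
    "recall_at_5":   (0.90, 0.75),
    "recall_at_10":  (0.95, 0.85),
    "mrr":           (0.85, 0.70),
    "map":           (0.80, 0.65),
}

def tier(metric, value):
    """Classify a higher-is-better metric as Excellent / Good / Needs Work."""
    excellent, good = RETRIEVAL_THRESHOLDS[metric]
    if value > excellent:
        return "Excellent"
    return "Good" if value >= good else "Needs Work"

tier("mrr", 0.82)  # -> "Good"
```

Lower-is-better metrics (response time, hedging rate) would need the comparisons flipped.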
Advanced Usage
Custom Evaluation Size
# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5
# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50
# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200
Using Evaluation in Code
from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel
# Evaluate retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
csv_path="amazon_multimodal_clean.csv",
max_queries=100
)
print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")
# Export to Excel
export_to_excel(
retrieval_results=results_df,
retrieval_metrics=metrics,
output_path="my_eval.xlsx"
)
Batch Evaluation of Different Configurations
# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
python evaluation.py \
--csv data.csv \
--mode $mode \
--output "eval_${mode}.xlsx" \
--max-response 50
done
Troubleshooting
Problem: ModuleNotFoundError: No module named 'openpyxl'
Solution:
pip install openpyxl pandas
Problem: Evaluation too slow
Solutions:
- Use --retrieval-only mode (skip the LLM)
- Reduce evaluation count: --max-response 10
- Use OpenAI GPT-4 instead of local models
- Use faster local models (Mistral-7B instead of Mixtral-8x7B)
Problem: OpenAI API timeout or errors
Solutions:
# Check API key
echo $OPENAI_API_KEY
# Check .env file
grep OPENAI .env
# Use local model instead
# In .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
Problem: CUDA out of memory (local models)
Solutions:
# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1
# Or use smaller model
# In .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
Best Practices
Iterative Evaluation Workflow
Step 1: Quick retrieval evaluation (10-20 queries)
|
Step 2: Analyze results, adjust parameters
|
Step 3: Medium-scale retrieval evaluation (100 queries)
|
Step 4: Small end-to-end evaluation (20-30 queries)
|
Step 5: Full evaluation (100+ retrieval + 50+ response)
A/B Testing Different Configurations
# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx
# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx
Compare Summary sheets in Excel to see differences.
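The comparison can also be scripted. A sketch that diffs two runs once each Summary sheet has been reduced to a metric-to-value mapping (e.g. via pandas read_excel; the exact Summary layout may differ):

```python
def compare_runs(a, b, name_a="gpt4", name_b="mistral"):
    """Print side-by-side values and the delta for every metric both runs share."""
    rows = []
    for key in sorted(set(a) & set(b)):
        delta = b[key] - a[key]
        rows.append((key, a[key], b[key], delta))
        print(f"{key:20s} {name_a}={a[key]:.3f} {name_b}={b[key]:.3f} delta={delta:+.3f}")
    return rows

compare_runs({"accuracy_at_1": 0.78, "mrr": 0.84},
             {"accuracy_at_1": 0.71, "mrr": 0.80})
```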
Continuous Monitoring
Integrate evaluation into development workflow:
# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30
Compare evaluations from different dates to track performance changes.
Example Commands
# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50
# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100
# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot
# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot
# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot
Help
- View evaluation.py source code for detailed comments
- Run python evaluation.py --help for all parameters
- Check README.md for overall project architecture
Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0