# Evaluation System Guide
This guide explains how to use `evaluation.py` to evaluate the Amazon Multimodal RAG system.
## Evaluation Metrics
### Retrieval Metrics
**Accuracy@1**
- Percentage of queries where the top-1 result has the correct category
- Range: 0.0 - 1.0 (higher is better)
**Recall@K**
- Percentage of queries where the correct category appears in the top-K results
- Measured at K = 1, 5, 10
- Range: 0.0 - 1.0 (higher is better)
**MRR (Mean Reciprocal Rank)**
- Average of 1/rank, where rank is the position of the first correct result
- Range: 0.0 - 1.0 (higher is better)
- MRR = 1.0 means all top-1 results are correct
**MAP (Mean Average Precision)**
- Mean over all queries of the average precision at the ranks of the relevant results (see the sketch below)
- Range: 0.0 - 1.0 (higher is better)
**Distance Metrics**
- Top-1 Distance: Embedding distance between the query and the top result (lower is better)
- Average Distance: Mean embedding distance over the top-5 results (lower is better)
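The definitions above boil down to a few lines of code. The sketch below is purely illustrative, assuming each query yields a ranked list of retrieved categories (best first) and has a single ground-truth category; it is not the implementation inside `evaluation.py`.

```python
# Illustrative only -- not the code used by evaluation.py.
# `ranked` is the list of categories retrieved for one query (best first),
# `truth` is that query's ground-truth category.

def accuracy_at_1(ranked, truth):
    return 1.0 if ranked and ranked[0] == truth else 0.0

def recall_at_k(ranked, truth, k):
    return 1.0 if truth in ranked[:k] else 0.0

def reciprocal_rank(ranked, truth):
    for i, cat in enumerate(ranked, start=1):
        if cat == truth:
            return 1.0 / i
    return 0.0

def average_precision(ranked, truth):
    hits, score = 0, 0.0
    for i, cat in enumerate(ranked, start=1):
        if cat == truth:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

# Dataset-level Accuracy@1, Recall@K, MRR, and MAP are the means of these
# per-query values over all evaluated queries.
```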
### Response Metrics
**Response Time**
- Time to generate response in seconds
- Measures end-to-end latency, which directly affects user experience
**Product Mention Rate**
- Percentage of top-3 retrieved products mentioned in response
- Range: 0.0 - 1.0 (higher means the response makes better use of the retrieved products)
**Category Mention Rate**
- Percentage of responses that mention correct product category
- Range: 0.0 - 1.0
**Semantic Similarity**
- Cosine similarity between query and response embeddings
- Range: -1.0 to 1.0 (higher means a more relevant response; see the sketch at the end of this section)
- Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)
**Response Quality Indicators**
- Hedging Rate: Percentage of responses using uncertain language ("not sure", "don't know")
- Comparison Rate: Percentage of responses containing product comparisons
**Category Match Rate**
- Percentage where top-1 retrieved product category matches ground truth
- Range: 0.0 - 1.0
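For intuition, the text-based response metrics can be approximated with cosine similarity between embedding vectors plus simple substring checks, as in the minimal sketch below. It assumes precomputed embeddings and plain strings, and is not necessarily how `evaluation.py` computes them.

```python
import numpy as np

def semantic_similarity(query_emb: np.ndarray, response_emb: np.ndarray) -> float:
    """Cosine similarity between query and response embeddings (range -1.0 to 1.0)."""
    return float(np.dot(query_emb, response_emb)
                 / (np.linalg.norm(query_emb) * np.linalg.norm(response_emb)))

def product_mention_rate(response: str, top_products: list[str]) -> float:
    """Fraction of the top retrieved product titles that appear in the response."""
    if not top_products:
        return 0.0
    text = response.lower()
    return sum(title.lower() in text for title in top_products) / len(top_products)

def is_hedged(response: str) -> bool:
    """True if the response uses uncertain language ("not sure", "don't know")."""
    text = response.lower()
    return any(phrase in text for phrase in ("not sure", "don't know"))
```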
---
## Quick Start
### Prerequisites
1. Build vector database index
```bash
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
```
2. Configure API keys (if using OpenAI)
```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=your-api-key-here
```
3. Install dependencies
```bash
pip install pandas openpyxl
```
### Basic Usage
**Retrieval evaluation only (fast, recommended first)**
```bash
python evaluation.py \
    --csv amazon_multimodal_clean.csv \
    --db chromadb_store \
    --output retrieval_eval.xlsx \
    --retrieval-only \
    --max-retrieval 100
```
Expected time: 2-5 minutes (100 queries)
**Full evaluation (retrieval + response quality)**
```bash
python evaluation.py \
    --csv amazon_multimodal_clean.csv \
    --db chromadb_store \
    --output full_eval.xlsx \
    --max-retrieval 100 \
    --max-response 50 \
    --mode zero-shot
```
Expected time:
- OpenAI GPT-4: 5-10 minutes (50 queries)
- Local models: 20-60 minutes (50 queries)
---
## Evaluation Modes
### Retrieval-Only Mode
Evaluates retrieval system without LLM:
```bash
python evaluation.py --csv data.csv --retrieval-only
```
Advantages:
- Fast (no LLM wait time)
- Tests core retrieval capability
- No API token consumption
Use cases:
- Debugging retrieval system
- Optimizing embedding models
- Quick performance benchmarks
### End-to-End Mode
Evaluates full RAG pipeline (retrieval + LLM + response quality):
```bash
python evaluation.py --csv data.csv --max-response 50
```
Advantages:
- Comprehensive performance assessment
- Tests LLM response quality
- Identifies end-to-end issues
Disadvantages:
- Slower
- Consumes API tokens (if using OpenAI)
### Prompt Modes
```bash
# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot
# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot
# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot
```
Comparison (a generic prompt sketch follows):
- Zero-shot: Fastest, no examples
- Few-shot: Medium, provides 2 examples
- Multi-shot: Slower, multiple examples (usually better quality)
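To make the difference concrete, here is a generic illustration of zero-shot vs. few-shot prompt structure. The actual templates are defined in `rag.py`, so treat the wording and placeholders below as illustrative only.

```python
# Generic illustration only; the real prompt templates live in rag.py.
zero_shot_prompt = (
    "You are a shopping assistant. Answer the question using the retrieved "
    "products below.\n\n{retrieved_products}\n\nQuestion: {query}"
)

few_shot_prompt = (
    "You are a shopping assistant.\n\n"
    "Example 1:\nQuestion: ...\nAnswer: ...\n\n"  # first worked example
    "Example 2:\nQuestion: ...\nAnswer: ...\n\n"  # second worked example
    "Now answer the question using the retrieved products below.\n\n"
    "{retrieved_products}\n\nQuestion: {query}"
)
```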
---
## Understanding Results
### Excel Output Structure
The generated Excel file contains multiple sheets:
**Sheet 1: Summary**
- Overview of all metrics
- Average values for retrieval and response metrics
- Use: Quick system performance overview
**Sheet 2: Retrieval_Details**
- Detailed metrics for each query
- Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
- Use: Analyze which queries perform well/poorly, identify system weaknesses
**Sheet 3: Response_Details**
- LLM response details for each query
- Columns: query_id, query, response, response_time, quality metrics
- Use: Analyze LLM response quality, compare prompt modes, identify hallucinations
**Sheet 4: Chart_Data**
- Pre-formatted data for creating charts
- Use: Quick visualization creation
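To inspect results programmatically instead of opening them in Excel, the sheets can be loaded with pandas, for example (assuming the output file is named `full_eval.xlsx`):

```python
import pandas as pd

# Load the sheets written by evaluation.py
summary = pd.read_excel("full_eval.xlsx", sheet_name="Summary")
retrieval = pd.read_excel("full_eval.xlsx", sheet_name="Retrieval_Details")
print(summary)

# Example: queries whose top-1 category was wrong
# (column names as listed for the Retrieval_Details sheet above)
misses = retrieval[retrieval["accuracy_at_1"] == 0]
print(misses[["query_id", "query_text", "ground_truth_category"]])
```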
### Performance Benchmarks
Retrieval Metrics Benchmarks:
```
Metric | Excellent | Good | Needs Work
---------------|-----------|-----------|------------
Accuracy@1 | >0.80 | 0.65-0.80 | <0.65
Recall@5 | >0.90 | 0.75-0.90 | <0.75
Recall@10 | >0.95 | 0.85-0.95 | <0.85
MRR | >0.85 | 0.70-0.85 | <0.70
MAP | >0.80 | 0.65-0.80 | <0.65
```
Response Metrics Benchmarks:
```
Metric | Excellent | Good | Needs Work
------------------------|-----------|-----------|------------
Response Time (GPT-4) | <3s | 3-5s | >5s
Response Time (Local) | <10s | 10-30s | >30s
Semantic Similarity | >0.70 | 0.55-0.70 | <0.55
Product Mention Rate | >0.70 | 0.50-0.70 | <0.50
Hedging Rate | <0.10 | 0.10-0.25 | >0.25
```
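If you want to label a run against these bands automatically, a small helper like the one below works; the threshold values are copied from the retrieval table above, and the dictionary keys are just local labels.

```python
# Lower bounds for "Excellent" and "Good", taken from the retrieval benchmark table above.
RETRIEVAL_BENCHMARKS = {
    "accuracy_at_1": (0.80, 0.65),
    "recall_at_5":   (0.90, 0.75),
    "recall_at_10":  (0.95, 0.85),
    "mrr":           (0.85, 0.70),
    "map":           (0.80, 0.65),
}

def grade(metric: str, value: float) -> str:
    excellent, good = RETRIEVAL_BENCHMARKS[metric]
    if value > excellent:
        return "Excellent"
    if value >= good:
        return "Good"
    return "Needs Work"

# e.g. grade("accuracy_at_1", 0.72) -> "Good"
```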
---
## Advanced Usage
### Custom Evaluation Size
```bash
# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5
# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50
# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200
```
### Using Evaluation in Code
```python
from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel
# Evaluate retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
    csv_path="amazon_multimodal_clean.csv",
    max_queries=100
)
print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")

# Export to Excel
export_to_excel(
    retrieval_results=results_df,
    retrieval_metrics=metrics,
    output_path="my_eval.xlsx"
)
```
### Batch Evaluation of Different Configurations
```bash
# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
    python evaluation.py \
        --csv data.csv \
        --mode $mode \
        --output "eval_${mode}.xlsx" \
        --max-response 50
done
```
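Once the loop finishes, the Summary sheets of the three workbooks can be compared side by side, for example:

```python
import pandas as pd

# Collect the Summary sheet from each prompt-mode run produced by the loop above
frames = []
for mode in ["zero-shot", "few-shot", "multi-shot"]:
    summary = pd.read_excel(f"eval_{mode}.xlsx", sheet_name="Summary")
    summary["mode"] = mode
    frames.append(summary)

print(pd.concat(frames, ignore_index=True))
```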
---
## Troubleshooting
**Problem: ModuleNotFoundError: No module named 'openpyxl'**
Solution:
```bash
pip install openpyxl pandas
```
**Problem: Evaluation too slow**
Solutions:
1. Use `--retrieval-only` mode (skip LLM)
2. Reduce evaluation count: `--max-response 10`
3. Use OpenAI GPT-4 instead of local models
4. Use faster local models (Mistral-7B instead of Mixtral-8x7B)
**Problem: OpenAI API timeout or errors**
Solutions:
```bash
# Check API key
echo $OPENAI_API_KEY
# Check .env file
cat .env | grep OPENAI
# Use local model instead
# In .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```
**Problem: CUDA out of memory (local models)**
Solutions:
```bash
# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1
# Or use smaller model
# In .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```
---
## Best Practices
### Iterative Evaluation Workflow
```
Step 1: Quick retrieval evaluation (10-20 queries)
|
Step 2: Analyze results, adjust parameters
|
Step 3: Medium-scale retrieval evaluation (100 queries)
|
Step 4: Small end-to-end evaluation (20-30 queries)
|
Step 5: Full evaluation (100+ retrieval + 50+ response)
```
### A/B Testing Different Configurations
```bash
# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx
# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx
```
Compare Summary sheets in Excel to see differences.
### Continuous Monitoring
Integrate evaluation into development workflow:
```bash
# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30
```
Compare evaluations from different dates to track performance changes.
---
## Example Commands
```bash
# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50
# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100
# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot
# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot
# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot
```
---
## Help
- View `evaluation.py` source code for detailed comments
- Run `python evaluation.py --help` for all parameters
- Check `README.md` for overall project architecture
---
Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0