# Evaluation System Guide

This guide explains how to use `evaluation.py` to evaluate the Amazon Multimodal RAG system.

## Evaluation Metrics

### Retrieval Metrics

**Accuracy@1**
- Percentage of queries where the top-1 result has the correct category
- Range: 0.0 - 1.0 (higher is better)

**Recall@K**
- Percentage of queries where the correct category appears in the top-K results
- Measured at K = 1, 5, 10
- Range: 0.0 - 1.0 (higher is better)

**MRR (Mean Reciprocal Rank)**
- Average of 1/rank of the first correct result across queries
- Range: 0.0 - 1.0 (higher is better)
- MRR = 1.0 means the first correct result is always ranked first (all top-1 results are correct)

**MAP (Mean Average Precision)**
- Mean of per-query average precision; rewards ranking relevant results higher
- Range: 0.0 - 1.0 (higher is better)

**Distance Metrics**
- Top-1 Distance: embedding distance of the first result (lower is better)
- Average Distance: mean distance of the top-5 results (lower is better)
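
To make these definitions concrete, the sketch below computes Accuracy@1, Recall@K, and the reciprocal rank for a single query from a ranked list of retrieved categories. It is an illustration only; the function and variable names are hypothetical and not taken from `evaluation.py`:

```python
# Minimal illustration of the retrieval metrics above for one query.
# Names are hypothetical, not the actual evaluation.py internals.

def retrieval_metrics(retrieved_categories, ground_truth, ks=(1, 5, 10)):
    """Compute Accuracy@1, Recall@K, and reciprocal rank for one query."""
    accuracy_at_1 = 1.0 if retrieved_categories[:1] == [ground_truth] else 0.0
    recall_at_k = {
        k: 1.0 if ground_truth in retrieved_categories[:k] else 0.0 for k in ks
    }
    # Reciprocal rank: 1/position of the first correct result, 0 if it never appears.
    reciprocal_rank = 0.0
    for rank, category in enumerate(retrieved_categories, start=1):
        if category == ground_truth:
            reciprocal_rank = 1.0 / rank
            break
    return accuracy_at_1, recall_at_k, reciprocal_rank

acc1, recall, rr = retrieval_metrics(
    ["Electronics", "Home & Kitchen", "Electronics"], ground_truth="Home & Kitchen"
)
print(acc1, recall, rr)  # 0.0 {1: 0.0, 5: 1.0, 10: 1.0} 0.5
```

The dataset-level numbers (Accuracy@1, Recall@K, MRR) are these per-query values averaged over all evaluated queries.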
### Response Metrics

**Response Time**
- Time to generate a response, in seconds
- Reflects system performance and user experience

**Product Mention Rate**
- Percentage of the top-3 retrieved products mentioned in the response
- Range: 0.0 - 1.0 (higher means the response makes better use of the retrieved context)

**Category Mention Rate**
- Percentage of responses that mention the correct product category
- Range: 0.0 - 1.0

**Semantic Similarity**
- Cosine similarity between the query and response embeddings
- Range: -1.0 - 1.0 (higher means a more relevant response)
- Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)

**Response Quality Indicators**
- Hedging Rate: percentage of responses using uncertain language ("not sure", "don't know")
- Comparison Rate: percentage of responses containing product comparisons

**Category Match Rate**
- Percentage of queries where the top-1 retrieved product category matches the ground truth
- Range: 0.0 - 1.0
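
As a rough illustration of how the two text-based response metrics above can be computed; this is a sketch only, and the embedding model and helper names are assumptions, not the exact implementation in `evaluation.py`:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

def semantic_similarity(query: str, response: str) -> float:
    """Cosine similarity between query and response embeddings (-1.0 to 1.0)."""
    q, r = model.encode([query, response])
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

def product_mention_rate(response: str, top_products: list) -> float:
    """Fraction of the top retrieved product titles mentioned in the response."""
    if not top_products:
        return 0.0
    mentioned = sum(1 for title in top_products if title.lower() in response.lower())
    return mentioned / len(top_products)
```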
---

## Quick Start

### Prerequisites

1. Build the vector database index

```bash
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
```

2. Configure API keys (if using OpenAI)

```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=your-api-key-here
```

3. Install dependencies

```bash
pip install pandas openpyxl
```
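
Before launching a long evaluation run, it can be worth confirming that the index from step 1 actually exists. A minimal check, assuming the index was persisted with ChromaDB to `chromadb_store` (as the `--db` flag below suggests):

```python
import chromadb

# Open the persisted store created by `python rag.py --build` and list its collections.
client = chromadb.PersistentClient(path="chromadb_store")
print(client.list_collections())  # should show at least one collection
```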
### Basic Usage

**Retrieval evaluation only (fast, recommended first)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output retrieval_eval.xlsx \
  --retrieval-only \
  --max-retrieval 100
```

Expected time: 2-5 minutes (100 queries)

**Full evaluation (retrieval + response quality)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output full_eval.xlsx \
  --max-retrieval 100 \
  --max-response 50 \
  --mode zero-shot
```

Expected time:
- OpenAI GPT-4: 5-10 minutes (50 queries)
- Local models: 20-60 minutes (50 queries)

---
## Evaluation Modes

### Retrieval-Only Mode

Evaluates the retrieval system without invoking the LLM:

```bash
python evaluation.py --csv data.csv --retrieval-only
```

Advantages:
- Fast (no LLM wait time)
- Tests core retrieval capability
- No API token consumption

Use cases:
- Debugging the retrieval system
- Optimizing embedding models
- Quick performance benchmarks

### End-to-End Mode

Evaluates the full RAG pipeline (retrieval + LLM + response quality):

```bash
python evaluation.py --csv data.csv --max-response 50
```

Advantages:
- Comprehensive performance assessment
- Tests LLM response quality
- Identifies end-to-end issues

Disadvantages:
- Slower
- Consumes API tokens (if using OpenAI)

### Prompt Modes

```bash
# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot

# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot

# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot
```

Comparison (see the sketch below for what the modes change in practice):
- Zero-shot: fastest, no examples
- Few-shot: medium, provides 2 examples
- Multi-shot: slower, multiple examples (usually better quality)
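
The modes differ only in how the prompt sent to the LLM is assembled. A simplified sketch of that difference; the template and example pairs below are illustrative, not the actual prompts used by `evaluation.py`:

```python
# Illustrative prompt assembly; not evaluation.py's actual templates.
EXAMPLES = [
    ("Recommend a budget wireless mouse.", "A compact 2.4 GHz mouse under $20 would fit..."),
    ("Which blender is best for smoothies?", "A high-wattage blender handles frozen fruit well..."),
    ("Is this backpack suitable for laptops?", "Yes, it has a padded 15-inch sleeve..."),
]

def build_prompt(query: str, context: str, mode: str = "zero-shot") -> str:
    prompt = f"Answer using the retrieved product context.\n\nContext:\n{context}\n\n"
    if mode != "zero-shot":
        # few-shot prepends 2 worked examples; multi-shot includes all of them
        n_examples = 2 if mode == "few-shot" else len(EXAMPLES)
        for q, a in EXAMPLES[:n_examples]:
            prompt += f"Example question: {q}\nExample answer: {a}\n\n"
    prompt += f"Question: {query}\nAnswer:"
    return prompt
```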
---

## Understanding Results

### Excel Output Structure

The generated Excel file contains multiple sheets:

**Sheet 1: Summary**
- Overview of all metrics
- Average values for retrieval and response metrics
- Use: quick system performance overview

**Sheet 2: Retrieval_Details**
- Detailed metrics for each query
- Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
- Use: analyze which queries perform well or poorly, identify system weaknesses

**Sheet 3: Response_Details**
- LLM response details for each query
- Columns: query_id, query, response, response_time, quality metrics
- Use: analyze LLM response quality, compare prompt modes, identify hallucinations

**Sheet 4: Chart_Data**
- Pre-formatted data for creating charts
- Use: quick visualization creation
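
Because the output is a standard Excel workbook, any sheet can be pulled back into pandas for further analysis. For example, using the sheet and column names listed above:

```python
import pandas as pd

# Read the sheets written by evaluation.py back into DataFrames.
summary = pd.read_excel("full_eval.xlsx", sheet_name="Summary")
retrieval = pd.read_excel("full_eval.xlsx", sheet_name="Retrieval_Details")

print(summary)

# Inspect the queries whose top-1 category was wrong.
misses = retrieval[retrieval["accuracy_at_1"] == 0]
print(misses[["query_id", "query_text", "ground_truth_category"]])
```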
### Performance Benchmarks

Retrieval Metrics Benchmarks:

```
Metric     | Excellent | Good      | Needs Work
-----------|-----------|-----------|-----------
Accuracy@1 | >0.80     | 0.65-0.80 | <0.65
Recall@5   | >0.90     | 0.75-0.90 | <0.75
Recall@10  | >0.95     | 0.85-0.95 | <0.85
MRR        | >0.85     | 0.70-0.85 | <0.70
MAP        | >0.80     | 0.65-0.80 | <0.65
```

Response Metrics Benchmarks:

```
Metric                 | Excellent | Good      | Needs Work
-----------------------|-----------|-----------|-----------
Response Time (GPT-4)  | <3s       | 3-5s      | >5s
Response Time (Local)  | <10s      | 10-30s    | >30s
Semantic Similarity    | >0.70     | 0.55-0.70 | <0.55
Product Mention Rate   | >0.70     | 0.50-0.70 | <0.50
Hedging Rate           | <0.10     | 0.10-0.25 | >0.25
```
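
If you want to flag weak retrieval results automatically rather than reading the table by eye, a few lines of Python suffice. The `accuracy_at_1` and `recall_at_5` keys match the usage example in Advanced Usage below; the remaining keys are assumptions about the metric names `evaluation.py` returns:

```python
# "Needs Work" thresholds from the retrieval benchmarks table above.
NEEDS_WORK = {
    "accuracy_at_1": 0.65,
    "recall_at_5": 0.75,
    "recall_at_10": 0.85,  # key names beyond accuracy_at_1/recall_at_5 are assumed
    "mrr": 0.70,
    "map": 0.80,
}

def flag_weak_metrics(metrics: dict) -> list:
    """Return the metric names that fall below the 'Needs Work' threshold."""
    return [name for name, floor in NEEDS_WORK.items()
            if name in metrics and metrics[name] < floor]
```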
---

## Advanced Usage

### Custom Evaluation Size

```bash
# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5

# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50

# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200
```

### Using Evaluation in Code

```python
from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel

# Evaluate the retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
    csv_path="amazon_multimodal_clean.csv",
    max_queries=100
)

print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")

# Export to Excel
export_to_excel(
    retrieval_results=results_df,
    retrieval_metrics=metrics,
    output_path="my_eval.xlsx"
)
```

### Batch Evaluation of Different Configurations

```bash
# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
  python evaluation.py \
    --csv data.csv \
    --mode $mode \
    --output "eval_${mode}.xlsx" \
    --max-response 50
done
```
---

## Troubleshooting

**Problem: ModuleNotFoundError: No module named 'openpyxl'**

Solution:

```bash
pip install openpyxl pandas
```

**Problem: Evaluation too slow**

Solutions:
1. Use `--retrieval-only` mode (skip the LLM)
2. Reduce the evaluation count: `--max-response 10`
3. Use OpenAI GPT-4 instead of local models
4. Use a faster local model (Mistral-7B instead of Mixtral-8x7B)

**Problem: OpenAI API timeout or errors**

Solutions:

```bash
# Check the API key
echo $OPENAI_API_KEY

# Check the .env file
cat .env | grep OPENAI

# Or use a local model instead
# In .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```
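
If the shell checks look fine but evaluation still fails, confirm the key is actually visible from Python. A small check, assuming the project reads its configuration from `.env` with python-dotenv (which the `.env` file above suggests):

```python
import os
from dotenv import load_dotenv  # python-dotenv

# Load .env the same way a Python process would and confirm the settings are visible.
load_dotenv()
print("USE_OPENAI =", os.getenv("USE_OPENAI"))
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
```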
**Problem: CUDA out of memory (local models)**

Solutions:

```bash
# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1

# Or use a smaller model
# In .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```

---

## Best Practices

### Iterative Evaluation Workflow

```
Step 1: Quick retrieval evaluation (10-20 queries)
        |
Step 2: Analyze results, adjust parameters
        |
Step 3: Medium-scale retrieval evaluation (100 queries)
        |
Step 4: Small end-to-end evaluation (20-30 queries)
        |
Step 5: Full evaluation (100+ retrieval + 50+ response)
```

### A/B Testing Different Configurations

```bash
# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx

# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx
```

Compare the Summary sheets in Excel to see the differences.
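
If you prefer to compare the runs programmatically rather than switching between two Excel windows, a small pandas sketch that stacks the two Summary sheets with a run label (the exact Summary layout depends on what `evaluation.py` exports, so adjust as needed):

```python
import pandas as pd

# Stack the Summary sheets of both runs with a label column for side-by-side review.
runs = {"gpt4": "eval_gpt4.xlsx", "mistral": "eval_mistral.xlsx"}
frames = [pd.read_excel(path, sheet_name="Summary").assign(run=name)
          for name, path in runs.items()]
print(pd.concat(frames, ignore_index=True))
```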
### Continuous Monitoring

Integrate evaluation into the development workflow:

```bash
# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30
```

Compare evaluations from different dates to track performance changes.

---

## Example Commands

```bash
# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50

# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100

# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot

# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot

# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot
```

---

## Help

- View the `evaluation.py` source code for detailed comments
- Run `python evaluation.py --help` for all parameters
- Check `README.md` for the overall project architecture

---

Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0