# Evaluation System Guide

This guide explains how to use `evaluation.py` to evaluate the Amazon Multimodal RAG system.

## Evaluation Metrics

### Retrieval Metrics

**Accuracy@1**
- Percentage of queries where the top-1 result has the correct category
- Range: 0.0 - 1.0 (higher is better)

**Recall@K**
- Percentage of queries where the correct category appears in the top-K results
- Measured at K = 1, 5, 10
- Range: 0.0 - 1.0 (higher is better)

**MRR (Mean Reciprocal Rank)**
- Average of 1/rank of the first correct result
- Range: 0.0 - 1.0 (higher is better)
- MRR = 1.0 means every top-1 result is correct

**MAP (Mean Average Precision)**
- Average precision across all relevant results
- Range: 0.0 - 1.0 (higher is better)

**Distance Metrics**
- Top-1 Distance: distance to the first result (lower is better)
- Average Distance: mean distance of the top-5 results (lower is better)
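For reference, the sketch below shows how these per-query retrieval metrics can be computed. It is a minimal illustration, not the exact implementation in `evaluation.py`; the `ranked` and `truth` names are hypothetical.

```python
# Minimal sketch of the per-query retrieval metrics defined above.
# `ranked` lists the categories of retrieved products, best first;
# `truth` is the query's ground-truth category. Dataset-level scores
# (Accuracy@1, Recall@K, MRR, MAP) are the averages over all queries.

def recall_at_k(ranked: list[str], truth: str, k: int) -> float:
    """1.0 if the correct category appears in the top-k results, else 0.0."""
    return 1.0 if truth in ranked[:k] else 0.0

def reciprocal_rank(ranked: list[str], truth: str) -> float:
    """1/rank of the first correct result, or 0.0 if none is correct."""
    for rank, category in enumerate(ranked, start=1):
        if category == truth:
            return 1.0 / rank
    return 0.0

def average_precision(ranked: list[str], truth: str) -> float:
    """Mean of precision@i over every position i that holds a correct result."""
    hits, precisions = 0, []
    for rank, category in enumerate(ranked, start=1):
        if category == truth:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Accuracy@1 is simply `recall_at_k(ranked, truth, 1)` averaged over all queries.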
### Response Metrics

**Response Time**
- Time to generate a response, in seconds
- Evaluates system performance and user experience

**Product Mention Rate**
- Percentage of the top-3 retrieved products mentioned in the response
- Range: 0.0 - 1.0 (higher means the response makes better use of retrieval)

**Category Mention Rate**
- Percentage of responses that mention the correct product category
- Range: 0.0 - 1.0

**Semantic Similarity**
- Cosine similarity between query and response embeddings
- Range: -1.0 - 1.0 (higher means a more relevant response)
- Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)

**Response Quality Indicators**
- Hedging Rate: percentage of responses using uncertain language ("not sure", "don't know")
- Comparison Rate: percentage of responses containing product comparisons

**Category Match Rate**
- Percentage of queries where the top-1 retrieved product's category matches the ground truth
- Range: 0.0 - 1.0
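As a rough illustration of the Semantic Similarity metric, the sketch below encodes the query and the response and takes the cosine similarity of the two vectors. The embedding model named here (`all-MiniLM-L6-v2` via `sentence-transformers`) is an assumption for the example and may differ from the model `evaluation.py` actually uses.

```python
# Sketch: cosine similarity between query and response embeddings.
# The embedding model below is an assumption for illustration only;
# evaluation.py may use a different model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(query: str, response: str) -> float:
    q, r = model.encode([query, response])
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

score = semantic_similarity(
    "wireless noise-cancelling headphones under $200",
    "Based on the retrieved products, these over-ear headphones offer active noise cancellation...",
)
print(f"Semantic similarity: {score:.2f}")  # compare against the >0.7 / 0.5-0.7 / <0.5 bands above
```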
---

## Quick Start

### Prerequisites

1. Build the vector database index

   ```bash
   python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
   ```

2. Configure API keys (if using OpenAI)

   ```bash
   # .env file
   USE_OPENAI=true
   OPENAI_API_KEY=your-api-key-here
   ```

3. Install dependencies

   ```bash
   pip install pandas openpyxl
   ```

### Basic Usage

**Retrieval evaluation only (fast, recommended first)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output retrieval_eval.xlsx \
  --retrieval-only \
  --max-retrieval 100
```

Expected time: 2-5 minutes (100 queries)

**Full evaluation (retrieval + response quality)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output full_eval.xlsx \
  --max-retrieval 100 \
  --max-response 50 \
  --mode zero-shot
```

Expected time:
- OpenAI GPT-4: 5-10 minutes (50 queries)
- Local models: 20-60 minutes (50 queries)

---

## Evaluation Modes

### Retrieval-Only Mode

Evaluates the retrieval system without the LLM:

```bash
python evaluation.py --csv data.csv --retrieval-only
```

Advantages:
- Fast (no LLM wait time)
- Tests core retrieval capability
- No API token consumption

Use cases:
- Debugging the retrieval system
- Optimizing embedding models
- Quick performance benchmarks

### End-to-End Mode

Evaluates the full RAG pipeline (retrieval + LLM + response quality):

```bash
python evaluation.py --csv data.csv --max-response 50
```

Advantages:
- Comprehensive performance assessment
- Tests LLM response quality
- Identifies end-to-end issues

Disadvantages:
- Slower
- Consumes API tokens (if using OpenAI)

### Prompt Modes

```bash
# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot

# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot

# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot
```

Comparison:
- Zero-shot: fastest, no examples
- Few-shot: medium speed, provides 2 examples
- Multi-shot: slower, multiple examples (usually better quality)

---

## Understanding Results

### Excel Output Structure

The generated Excel file contains multiple sheets:

**Sheet 1: Summary**
- Overview of all metrics
- Average values for retrieval and response metrics
- Use: quick system performance overview

**Sheet 2: Retrieval_Details**
- Detailed metrics for each query
- Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
- Use: analyze which queries perform well or poorly and identify system weaknesses

**Sheet 3: Response_Details**
- LLM response details for each query
- Columns: query_id, query, response, response_time, quality metrics
- Use: analyze LLM response quality, compare prompt modes, identify hallucinations

**Sheet 4: Chart_Data**
- Pre-formatted data for creating charts
- Use: quick visualization creation

### Performance Benchmarks

Retrieval metrics benchmarks:

```
Metric         | Excellent | Good      | Needs Work
---------------|-----------|-----------|------------
Accuracy@1     | >0.80     | 0.65-0.80 | <0.65
Recall@5       | >0.90     | 0.75-0.90 | <0.75
Recall@10      | >0.95     | 0.85-0.95 | <0.85
MRR            | >0.85     | 0.70-0.85 | <0.70
MAP            | >0.80     | 0.65-0.80 | <0.65
```

Response metrics benchmarks:

```
Metric                   | Excellent | Good      | Needs Work
-------------------------|-----------|-----------|------------
Response Time (GPT-4)    | <3s       | 3-5s      | >5s
Response Time (Local)    | <10s      | 10-30s    | >30s
Semantic Similarity      | >0.70     | 0.55-0.70 | <0.55
Product Mention Rate     | >0.70     | 0.50-0.70 | <0.50
Hedging Rate             | <0.10     | 0.10-0.25 | >0.25
```

---

## Advanced Usage

### Custom Evaluation Size

```bash
# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5

# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50

# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200
```

### Using Evaluation in Code

```python
from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel

# Evaluate the retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
    csv_path="amazon_multimodal_clean.csv",
    max_queries=100
)

print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")

# Export to Excel
export_to_excel(
    retrieval_results=results_df,
    retrieval_metrics=metrics,
    output_path="my_eval.xlsx"
)
```

### Batch Evaluation of Different Configurations

```bash
# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
  python evaluation.py \
    --csv data.csv \
    --mode $mode \
    --output "eval_${mode}.xlsx" \
    --max-response 50
done
```

---

## Troubleshooting

**Problem: ModuleNotFoundError: No module named 'openpyxl'**

Solution:

```bash
pip install openpyxl pandas
```

**Problem: Evaluation is too slow**

Solutions:
1. Use `--retrieval-only` mode (skips the LLM)
2. Reduce the evaluation count: `--max-response 10`
3. Use OpenAI GPT-4 instead of local models
4. Use a faster local model (Mistral-7B instead of Mixtral-8x7B)

**Problem: OpenAI API timeouts or errors**

Solutions:

```bash
# Check the API key
echo $OPENAI_API_KEY

# Check the .env file
cat .env | grep OPENAI

# Or use a local model instead; in .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```

**Problem: CUDA out of memory (local models)**

Solutions:

```bash
# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1

# Or use a smaller model; in .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```

---

## Best Practices

### Iterative Evaluation Workflow

```
Step 1: Quick retrieval evaluation (10-20 queries)
   |
Step 2: Analyze results, adjust parameters
   |
Step 3: Medium-scale retrieval evaluation (100 queries)
   |
Step 4: Small end-to-end evaluation (20-30 queries)
   |
Step 5: Full evaluation (100+ retrieval + 50+ response)
```

### A/B Testing Different Configurations

```bash
# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx

# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx
```

Compare the Summary sheets of the two workbooks to see the differences, for example with the sketch below.
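The following sketch loads the Summary sheet from each output file with pandas and prints the metrics side by side. It assumes the Summary sheet stores one metric per row, with the name in the first column and the value in the second; adjust the sheet and column handling to match the actual layout produced by `evaluation.py`.

```python
# Sketch: diff the Summary sheets of two evaluation runs side by side.
# Assumes one metric per row (name in column 1, value in column 2);
# adjust to match the actual Summary layout written by evaluation.py.
import pandas as pd

def load_summary(path: str) -> pd.Series:
    df = pd.read_excel(path, sheet_name="Summary")
    # Assumption: first column = metric name, second column = metric value
    values = df.set_index(df.columns[0])[df.columns[1]]
    return pd.to_numeric(values, errors="coerce")

a = load_summary("eval_gpt4.xlsx")
b = load_summary("eval_mistral.xlsx")

comparison = pd.DataFrame({"gpt4": a, "mistral": b})
comparison["delta"] = comparison["gpt4"] - comparison["mistral"]
print(comparison.sort_values("delta", ascending=False))
```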
### Continuous Monitoring

Integrate evaluation into the development workflow:

```bash
# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30
```

Compare evaluations from different dates to track performance changes over time.

---

## Example Commands

```bash
# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50

# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100

# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot

# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot

# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot
```

---

## Help

- View the `evaluation.py` source code for detailed comments
- Run `python evaluation.py --help` to see all parameters
- Check `README.md` for the overall project architecture

---

Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0