# Evaluation System Guide

This guide explains how to use `evaluation.py` to evaluate the Amazon Multimodal RAG system.

## Evaluation Metrics

### Retrieval Metrics

**Accuracy@1**
- Percentage of queries where the top-1 result has the correct category
- Range: 0.0 - 1.0 (higher is better)

**Recall@K**
- Percentage of queries where the correct category appears in the top-K results
- Measured at K = 1, 5, 10
- Range: 0.0 - 1.0 (higher is better)

**MRR (Mean Reciprocal Rank)**
- Average of 1/rank of the first correct result across queries
- Range: 0.0 - 1.0 (higher is better)
- MRR = 1.0 means the first correct result is always ranked first (all top-1 results are correct)

**MAP (Mean Average Precision)**
- Mean of per-query average precision; rewards ranking relevant results higher
- Range: 0.0 - 1.0 (higher is better)

**Distance Metrics**
- Top-1 Distance: embedding distance of the first result (lower is better)
- Average Distance: mean distance of the top-5 results (lower is better)
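
To make these definitions concrete, the sketch below computes Accuracy@1, Recall@K, and the reciprocal rank for a single query from a ranked list of retrieved categories. It is an illustration only; the function and variable names are hypothetical and not taken from `evaluation.py`:

```python
# Minimal illustration of the retrieval metrics above for one query.
# Names are hypothetical, not the actual evaluation.py internals.

def retrieval_metrics(retrieved_categories, ground_truth, ks=(1, 5, 10)):
    """Compute Accuracy@1, Recall@K, and reciprocal rank for one query."""
    accuracy_at_1 = 1.0 if retrieved_categories[:1] == [ground_truth] else 0.0
    recall_at_k = {
        k: 1.0 if ground_truth in retrieved_categories[:k] else 0.0 for k in ks
    }
    # Reciprocal rank: 1/position of the first correct result, 0 if it never appears.
    reciprocal_rank = 0.0
    for rank, category in enumerate(retrieved_categories, start=1):
        if category == ground_truth:
            reciprocal_rank = 1.0 / rank
            break
    return accuracy_at_1, recall_at_k, reciprocal_rank

acc1, recall, rr = retrieval_metrics(
    ["Electronics", "Home & Kitchen", "Electronics"], ground_truth="Home & Kitchen"
)
print(acc1, recall, rr)  # 0.0 {1: 0.0, 5: 1.0, 10: 1.0} 0.5
```

The dataset-level numbers (Accuracy@1, Recall@K, MRR) are these per-query values averaged over all evaluated queries.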
### Response Metrics

**Response Time**
- Time to generate a response, in seconds
- Reflects system performance and user experience

**Product Mention Rate**
- Percentage of the top-3 retrieved products mentioned in the response
- Range: 0.0 - 1.0 (higher means the response makes better use of the retrieved context)

**Category Mention Rate**
- Percentage of responses that mention the correct product category
- Range: 0.0 - 1.0

**Semantic Similarity**
- Cosine similarity between the query and response embeddings
- Range: -1.0 - 1.0 (higher means a more relevant response)
- Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)

**Response Quality Indicators**
- Hedging Rate: percentage of responses using uncertain language ("not sure", "don't know")
- Comparison Rate: percentage of responses containing product comparisons

**Category Match Rate**
- Percentage of queries where the top-1 retrieved product category matches the ground truth
- Range: 0.0 - 1.0
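
As a rough illustration of how the two text-based response metrics above can be computed; this is a sketch only, and the embedding model and helper names are assumptions, not the exact implementation in `evaluation.py`:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

def semantic_similarity(query: str, response: str) -> float:
    """Cosine similarity between query and response embeddings (-1.0 to 1.0)."""
    q, r = model.encode([query, response])
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

def product_mention_rate(response: str, top_products: list) -> float:
    """Fraction of the top retrieved product titles mentioned in the response."""
    if not top_products:
        return 0.0
    mentioned = sum(1 for title in top_products if title.lower() in response.lower())
    return mentioned / len(top_products)
```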
---

## Quick Start

### Prerequisites

1. Build the vector database index

```bash
python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
```

2. Configure API keys (if using OpenAI)

```bash
# .env file
USE_OPENAI=true
OPENAI_API_KEY=your-api-key-here
```

3. Install dependencies

```bash
pip install pandas openpyxl
```
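
Before launching a long evaluation run, it can be worth confirming that the index from step 1 actually exists. A minimal check, assuming the index was persisted with ChromaDB to `chromadb_store` (as the `--db` flag below suggests):

```python
import chromadb

# Open the persisted store created by `python rag.py --build` and list its collections.
client = chromadb.PersistentClient(path="chromadb_store")
print(client.list_collections())  # should show at least one collection
```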
### Basic Usage

**Retrieval evaluation only (fast, recommended first)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output retrieval_eval.xlsx \
  --retrieval-only \
  --max-retrieval 100
```

Expected time: 2-5 minutes (100 queries)

**Full evaluation (retrieval + response quality)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output full_eval.xlsx \
  --max-retrieval 100 \
  --max-response 50 \
  --mode zero-shot
```

Expected time:
- OpenAI GPT-4: 5-10 minutes (50 queries)
- Local models: 20-60 minutes (50 queries)

---
## Evaluation Modes

### Retrieval-Only Mode

Evaluates the retrieval system without invoking the LLM:

```bash
python evaluation.py --csv data.csv --retrieval-only
```

Advantages:
- Fast (no LLM wait time)
- Tests core retrieval capability
- No API token consumption

Use cases:
- Debugging the retrieval system
- Optimizing embedding models
- Quick performance benchmarks

### End-to-End Mode

Evaluates the full RAG pipeline (retrieval + LLM + response quality):

```bash
python evaluation.py --csv data.csv --max-response 50
```

Advantages:
- Comprehensive performance assessment
- Tests LLM response quality
- Identifies end-to-end issues

Disadvantages:
- Slower
- Consumes API tokens (if using OpenAI)

### Prompt Modes

```bash
# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot

# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot

# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot
```

Comparison (see the sketch below for what the modes change in practice):
- Zero-shot: fastest, no examples
- Few-shot: medium, provides 2 examples
- Multi-shot: slower, multiple examples (usually better quality)
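
The modes differ only in how the prompt sent to the LLM is assembled. A simplified sketch of that difference; the template and example pairs below are illustrative, not the actual prompts used by `evaluation.py`:

```python
# Illustrative prompt assembly; not evaluation.py's actual templates.
EXAMPLES = [
    ("Recommend a budget wireless mouse.", "A compact 2.4 GHz mouse under $20 would fit..."),
    ("Which blender is best for smoothies?", "A high-wattage blender handles frozen fruit well..."),
    ("Is this backpack suitable for laptops?", "Yes, it has a padded 15-inch sleeve..."),
]

def build_prompt(query: str, context: str, mode: str = "zero-shot") -> str:
    prompt = f"Answer using the retrieved product context.\n\nContext:\n{context}\n\n"
    if mode != "zero-shot":
        # few-shot prepends 2 worked examples; multi-shot includes all of them
        n_examples = 2 if mode == "few-shot" else len(EXAMPLES)
        for q, a in EXAMPLES[:n_examples]:
            prompt += f"Example question: {q}\nExample answer: {a}\n\n"
    prompt += f"Question: {query}\nAnswer:"
    return prompt
```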
---

## Understanding Results

### Excel Output Structure

The generated Excel file contains multiple sheets:

**Sheet 1: Summary**
- Overview of all metrics
- Average values for retrieval and response metrics
- Use: quick system performance overview

**Sheet 2: Retrieval_Details**
- Detailed metrics for each query
- Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
- Use: analyze which queries perform well or poorly, identify system weaknesses

**Sheet 3: Response_Details**
- LLM response details for each query
- Columns: query_id, query, response, response_time, quality metrics
- Use: analyze LLM response quality, compare prompt modes, identify hallucinations

**Sheet 4: Chart_Data**
- Pre-formatted data for creating charts
- Use: quick visualization creation
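
Because the output is a standard Excel workbook, any sheet can be pulled back into pandas for further analysis. For example, using the sheet and column names listed above:

```python
import pandas as pd

# Read the sheets written by evaluation.py back into DataFrames.
summary = pd.read_excel("full_eval.xlsx", sheet_name="Summary")
retrieval = pd.read_excel("full_eval.xlsx", sheet_name="Retrieval_Details")

print(summary)

# Inspect the queries whose top-1 category was wrong.
misses = retrieval[retrieval["accuracy_at_1"] == 0]
print(misses[["query_id", "query_text", "ground_truth_category"]])
```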
### Performance Benchmarks

Retrieval Metrics Benchmarks:

```
Metric     | Excellent | Good      | Needs Work
-----------|-----------|-----------|-----------
Accuracy@1 | >0.80     | 0.65-0.80 | <0.65
Recall@5   | >0.90     | 0.75-0.90 | <0.75
Recall@10  | >0.95     | 0.85-0.95 | <0.85
MRR        | >0.85     | 0.70-0.85 | <0.70
MAP        | >0.80     | 0.65-0.80 | <0.65
```

Response Metrics Benchmarks:

```
Metric                 | Excellent | Good      | Needs Work
-----------------------|-----------|-----------|-----------
Response Time (GPT-4)  | <3s       | 3-5s      | >5s
Response Time (Local)  | <10s      | 10-30s    | >30s
Semantic Similarity    | >0.70     | 0.55-0.70 | <0.55
Product Mention Rate   | >0.70     | 0.50-0.70 | <0.50
Hedging Rate           | <0.10     | 0.10-0.25 | >0.25
```
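
If you want to flag weak retrieval results automatically rather than reading the table by eye, a few lines of Python suffice. The `accuracy_at_1` and `recall_at_5` keys match the usage example in Advanced Usage below; the remaining keys are assumptions about the metric names `evaluation.py` returns:

```python
# "Needs Work" thresholds from the retrieval benchmarks table above.
NEEDS_WORK = {
    "accuracy_at_1": 0.65,
    "recall_at_5": 0.75,
    "recall_at_10": 0.85,  # key names beyond accuracy_at_1/recall_at_5 are assumed
    "mrr": 0.70,
    "map": 0.80,
}

def flag_weak_metrics(metrics: dict) -> list:
    """Return the metric names that fall below the 'Needs Work' threshold."""
    return [name for name, floor in NEEDS_WORK.items()
            if name in metrics and metrics[name] < floor]
```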
---

## Advanced Usage

### Custom Evaluation Size

```bash
# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5

# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50

# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200
```

### Using Evaluation in Code

```python
from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel

# Evaluate the retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
    csv_path="amazon_multimodal_clean.csv",
    max_queries=100
)

print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")

# Export to Excel
export_to_excel(
    retrieval_results=results_df,
    retrieval_metrics=metrics,
    output_path="my_eval.xlsx"
)
```

### Batch Evaluation of Different Configurations

```bash
# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
  python evaluation.py \
    --csv data.csv \
    --mode $mode \
    --output "eval_${mode}.xlsx" \
    --max-response 50
done
```
---

## Troubleshooting

**Problem: ModuleNotFoundError: No module named 'openpyxl'**

Solution:

```bash
pip install openpyxl pandas
```

**Problem: Evaluation too slow**

Solutions:
1. Use `--retrieval-only` mode (skip the LLM)
2. Reduce the evaluation count: `--max-response 10`
3. Use OpenAI GPT-4 instead of local models
4. Use a faster local model (Mistral-7B instead of Mixtral-8x7B)

**Problem: OpenAI API timeout or errors**

Solutions:

```bash
# Check the API key
echo $OPENAI_API_KEY

# Check the .env file
cat .env | grep OPENAI

# Or use a local model instead
# In .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```
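
If the shell checks look fine but evaluation still fails, confirm the key is actually visible from Python. A small check, assuming the project reads its configuration from `.env` with python-dotenv (which the `.env` file above suggests):

```python
import os
from dotenv import load_dotenv  # python-dotenv

# Load .env the same way a Python process would and confirm the settings are visible.
load_dotenv()
print("USE_OPENAI =", os.getenv("USE_OPENAI"))
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
```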
**Problem: CUDA out of memory (local models)**

Solutions:

```bash
# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1

# Or use a smaller model
# In .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```

---

## Best Practices

### Iterative Evaluation Workflow

```
Step 1: Quick retrieval evaluation (10-20 queries)
        |
Step 2: Analyze results, adjust parameters
        |
Step 3: Medium-scale retrieval evaluation (100 queries)
        |
Step 4: Small end-to-end evaluation (20-30 queries)
        |
Step 5: Full evaluation (100+ retrieval + 50+ response)
```

### A/B Testing Different Configurations

```bash
# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx

# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx
```

Compare the Summary sheets in Excel to see the differences.
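
If you prefer to compare the runs programmatically rather than switching between two Excel windows, a small pandas sketch that stacks the two Summary sheets with a run label (the exact Summary layout depends on what `evaluation.py` exports, so adjust as needed):

```python
import pandas as pd

# Stack the Summary sheets of both runs with a label column for side-by-side review.
runs = {"gpt4": "eval_gpt4.xlsx", "mistral": "eval_mistral.xlsx"}
frames = [pd.read_excel(path, sheet_name="Summary").assign(run=name)
          for name, path in runs.items()]
print(pd.concat(frames, ignore_index=True))
```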
### Continuous Monitoring

Integrate evaluation into the development workflow:

```bash
# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30
```

Compare evaluations from different dates to track performance changes.

---

## Example Commands

```bash
# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50

# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100

# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot

# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot

# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot
```

---

## Help

- View the `evaluation.py` source code for detailed comments
- Run `python evaluation.py --help` for all parameters
- Check `README.md` for the overall project architecture

---

Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0