# Evaluation System Guide

This guide explains how to use `evaluation.py` to evaluate the Amazon Multimodal RAG system.

## Evaluation Metrics

### Retrieval Metrics

**Accuracy@1**
- Percentage of queries where the top-1 result has the correct category
- Range: 0.0 - 1.0 (higher is better)

**Recall@K**
- Percentage of queries where the correct category appears in the top-K results
- Measured at K = 1, 5, 10
- Range: 0.0 - 1.0 (higher is better)

**MRR (Mean Reciprocal Rank)**
- Average of 1/rank of the first correct result
- Range: 0.0 - 1.0 (higher is better)
- MRR = 1.0 means every top-1 result is correct

**MAP (Mean Average Precision)**
- Average precision across all relevant results
- Range: 0.0 - 1.0 (higher is better)

**Distance Metrics**
- Top-1 Distance: distance to the first result (lower is better)
- Average Distance: mean distance of the top-5 results (lower is better)
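For reference, the sketch below shows how these per-query retrieval metrics can be computed. It is a minimal illustration, not the exact implementation in `evaluation.py`; the `ranked` and `truth` names are hypothetical.

```python
# Minimal sketch of the per-query retrieval metrics defined above.
# `ranked` lists the categories of retrieved products, best first;
# `truth` is the query's ground-truth category. Dataset-level scores
# (Accuracy@1, Recall@K, MRR, MAP) are the averages over all queries.

def recall_at_k(ranked: list[str], truth: str, k: int) -> float:
    """1.0 if the correct category appears in the top-k results, else 0.0."""
    return 1.0 if truth in ranked[:k] else 0.0

def reciprocal_rank(ranked: list[str], truth: str) -> float:
    """1/rank of the first correct result, or 0.0 if none is correct."""
    for rank, category in enumerate(ranked, start=1):
        if category == truth:
            return 1.0 / rank
    return 0.0

def average_precision(ranked: list[str], truth: str) -> float:
    """Mean of precision@i over every position i that holds a correct result."""
    hits, precisions = 0, []
    for rank, category in enumerate(ranked, start=1):
        if category == truth:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Accuracy@1 is simply `recall_at_k(ranked, truth, 1)` averaged over all queries.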
### Response Metrics

**Response Time**
- Time to generate a response, in seconds
- Evaluates system performance and user experience

**Product Mention Rate**
- Percentage of the top-3 retrieved products mentioned in the response
- Range: 0.0 - 1.0 (higher means the response makes better use of retrieval)

**Category Mention Rate**
- Percentage of responses that mention the correct product category
- Range: 0.0 - 1.0

**Semantic Similarity**
- Cosine similarity between query and response embeddings
- Range: -1.0 - 1.0 (higher means a more relevant response)
- Interpretation: >0.7 (highly relevant), 0.5-0.7 (relevant), <0.5 (low relevance)

**Response Quality Indicators**
- Hedging Rate: percentage of responses using uncertain language ("not sure", "don't know")
- Comparison Rate: percentage of responses containing product comparisons

**Category Match Rate**
- Percentage of queries where the top-1 retrieved product's category matches the ground truth
- Range: 0.0 - 1.0
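As a rough illustration of the Semantic Similarity metric, the sketch below encodes the query and the response and takes the cosine similarity of the two vectors. The embedding model named here (`all-MiniLM-L6-v2` via `sentence-transformers`) is an assumption for the example and may differ from the model `evaluation.py` actually uses.

```python
# Sketch: cosine similarity between query and response embeddings.
# The embedding model below is an assumption for illustration only;
# evaluation.py may use a different model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(query: str, response: str) -> float:
    q, r = model.encode([query, response])
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

score = semantic_similarity(
    "wireless noise-cancelling headphones under $200",
    "Based on the retrieved products, these over-ear headphones offer active noise cancellation...",
)
print(f"Semantic similarity: {score:.2f}")  # compare against the >0.7 / 0.5-0.7 / <0.5 bands above
```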
---

## Quick Start

### Prerequisites

1. Build the vector database index

   ```bash
   python rag.py --build --csv amazon_multimodal_clean.csv --max 1000
   ```

2. Configure API keys (if using OpenAI)

   ```bash
   # .env file
   USE_OPENAI=true
   OPENAI_API_KEY=your-api-key-here
   ```

3. Install dependencies

   ```bash
   pip install pandas openpyxl
   ```

### Basic Usage

**Retrieval evaluation only (fast, recommended first)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output retrieval_eval.xlsx \
  --retrieval-only \
  --max-retrieval 100
```

Expected time: 2-5 minutes (100 queries)

**Full evaluation (retrieval + response quality)**

```bash
python evaluation.py \
  --csv amazon_multimodal_clean.csv \
  --db chromadb_store \
  --output full_eval.xlsx \
  --max-retrieval 100 \
  --max-response 50 \
  --mode zero-shot
```

Expected time:
- OpenAI GPT-4: 5-10 minutes (50 queries)
- Local models: 20-60 minutes (50 queries)

---

## Evaluation Modes

### Retrieval-Only Mode

Evaluates the retrieval system without the LLM:

```bash
python evaluation.py --csv data.csv --retrieval-only
```

Advantages:
- Fast (no LLM wait time)
- Tests core retrieval capability
- No API token consumption

Use cases:
- Debugging the retrieval system
- Optimizing embedding models
- Quick performance benchmarks

### End-to-End Mode

Evaluates the full RAG pipeline (retrieval + LLM + response quality):

```bash
python evaluation.py --csv data.csv --max-response 50
```

Advantages:
- Comprehensive performance assessment
- Tests LLM response quality
- Identifies end-to-end issues

Disadvantages:
- Slower
- Consumes API tokens (if using OpenAI)

### Prompt Modes

```bash
# Zero-shot (default)
python evaluation.py --csv data.csv --mode zero-shot

# Few-shot (with examples)
python evaluation.py --csv data.csv --mode few-shot

# Multi-shot (more examples)
python evaluation.py --csv data.csv --mode multi-shot
```

Comparison:
- Zero-shot: fastest, no examples
- Few-shot: medium speed, provides 2 examples
- Multi-shot: slower, multiple examples (usually better quality)

---

## Understanding Results

### Excel Output Structure

The generated Excel file contains multiple sheets:

**Sheet 1: Summary**
- Overview of all metrics
- Average values for retrieval and response metrics
- Use: quick system performance overview

**Sheet 2: Retrieval_Details**
- Detailed metrics for each query
- Columns: query_id, query_text, ground_truth_category, accuracy_at_1, recall metrics, distances
- Use: analyze which queries perform well or poorly and identify system weaknesses

**Sheet 3: Response_Details**
- LLM response details for each query
- Columns: query_id, query, response, response_time, quality metrics
- Use: analyze LLM response quality, compare prompt modes, identify hallucinations

**Sheet 4: Chart_Data**
- Pre-formatted data for creating charts
- Use: quick visualization creation

### Performance Benchmarks

Retrieval metrics benchmarks:

```
Metric         | Excellent | Good      | Needs Work
---------------|-----------|-----------|------------
Accuracy@1     | >0.80     | 0.65-0.80 | <0.65
Recall@5       | >0.90     | 0.75-0.90 | <0.75
Recall@10      | >0.95     | 0.85-0.95 | <0.85
MRR            | >0.85     | 0.70-0.85 | <0.70
MAP            | >0.80     | 0.65-0.80 | <0.65
```

Response metrics benchmarks:

```
Metric                   | Excellent | Good      | Needs Work
-------------------------|-----------|-----------|------------
Response Time (GPT-4)    | <3s       | 3-5s      | >5s
Response Time (Local)    | <10s      | 10-30s    | >30s
Semantic Similarity      | >0.70     | 0.55-0.70 | <0.55
Product Mention Rate     | >0.70     | 0.50-0.70 | <0.50
Hedging Rate             | <0.10     | 0.10-0.25 | >0.25
```

---

## Advanced Usage

### Custom Evaluation Size

```bash
# Quick test (10 queries)
python evaluation.py --csv data.csv --max-retrieval 10 --max-response 5

# Standard evaluation (100 queries)
python evaluation.py --csv data.csv --max-retrieval 100 --max-response 50

# Large-scale evaluation (500+ queries)
python evaluation.py --csv data.csv --max-retrieval 500 --max-response 200
```

### Using Evaluation in Code

```python
from evaluation import RetrievalEvaluator, ResponseEvaluator, export_to_excel

# Evaluate the retrieval system
retrieval_evaluator = RetrievalEvaluator(persist_dir="chromadb_store")
results_df, metrics = retrieval_evaluator.evaluate_dataset(
    csv_path="amazon_multimodal_clean.csv",
    max_queries=100
)

print(f"Accuracy@1: {metrics['accuracy_at_1']:.3f}")
print(f"Recall@5: {metrics['recall_at_5']:.3f}")

# Export to Excel
export_to_excel(
    retrieval_results=results_df,
    retrieval_metrics=metrics,
    output_path="my_eval.xlsx"
)
```

### Batch Evaluation of Different Configurations

```bash
# Test different prompt modes
for mode in zero-shot few-shot multi-shot; do
  python evaluation.py \
    --csv data.csv \
    --mode $mode \
    --output "eval_${mode}.xlsx" \
    --max-response 50
done
```

---

## Troubleshooting

**Problem: ModuleNotFoundError: No module named 'openpyxl'**

Solution:

```bash
pip install openpyxl pandas
```

**Problem: Evaluation is too slow**

Solutions:
1. Use `--retrieval-only` mode (skips the LLM)
2. Reduce the evaluation count: `--max-response 10`
3. Use OpenAI GPT-4 instead of local models
4. Use a faster local model (Mistral-7B instead of Mixtral-8x7B)

**Problem: OpenAI API timeouts or errors**

Solutions:

```bash
# Check the API key
echo $OPENAI_API_KEY

# Check the .env file
cat .env | grep OPENAI

# Or use a local model instead; in .env:
USE_OPENAI=false
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```

**Problem: CUDA out of memory (local models)**

Solutions:

```bash
# Use CPU mode
export CUDA_VISIBLE_DEVICES=-1

# Or use a smaller model; in .env:
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
```

---

## Best Practices

### Iterative Evaluation Workflow

```
Step 1: Quick retrieval evaluation (10-20 queries)
   |
Step 2: Analyze results, adjust parameters
   |
Step 3: Medium-scale retrieval evaluation (100 queries)
   |
Step 4: Small end-to-end evaluation (20-30 queries)
   |
Step 5: Full evaluation (100+ retrieval + 50+ response)
```

### A/B Testing Different Configurations

```bash
# Test configuration A (using GPT-4)
USE_OPENAI=true python evaluation.py --csv data.csv --output eval_gpt4.xlsx

# Test configuration B (using Mistral)
USE_OPENAI=false python evaluation.py --csv data.csv --output eval_mistral.xlsx
```

Compare the Summary sheets of the two workbooks to see the differences, for example with the sketch below.
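The following sketch loads the Summary sheet from each output file with pandas and prints the metrics side by side. It assumes the Summary sheet stores one metric per row, with the name in the first column and the value in the second; adjust the sheet and column handling to match the actual layout produced by `evaluation.py`.

```python
# Sketch: diff the Summary sheets of two evaluation runs side by side.
# Assumes one metric per row (name in column 1, value in column 2);
# adjust to match the actual Summary layout written by evaluation.py.
import pandas as pd

def load_summary(path: str) -> pd.Series:
    df = pd.read_excel(path, sheet_name="Summary")
    # Assumption: first column = metric name, second column = metric value
    values = df.set_index(df.columns[0])[df.columns[1]]
    return pd.to_numeric(values, errors="coerce")

a = load_summary("eval_gpt4.xlsx")
b = load_summary("eval_mistral.xlsx")

comparison = pd.DataFrame({"gpt4": a, "mistral": b})
comparison["delta"] = comparison["gpt4"] - comparison["mistral"]
print(comparison.sort_values("delta", ascending=False))
```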
### Continuous Monitoring

Integrate evaluation into the development workflow:

```bash
# Run after code changes
python evaluation.py --csv data.csv --output eval_$(date +%Y%m%d).xlsx --max-response 30
```

Compare evaluations from different dates to track performance changes over time.

---

## Example Commands

```bash
# 1. Quick retrieval test (2-3 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 50

# 2. Standard retrieval evaluation (5-10 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --retrieval-only --max-retrieval 100

# 3. Full evaluation - OpenAI GPT-4 (10-15 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode zero-shot

# 4. Full evaluation - Few-shot (15-20 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 100 --max-response 50 --mode few-shot

# 5. Large-scale evaluation (30-60 minutes)
python evaluation.py --csv amazon_multimodal_clean.csv --max-retrieval 500 --max-response 200 --mode zero-shot
```

---

## Help

- View the `evaluation.py` source code for detailed comments
- Run `python evaluation.py --help` to see all parameters
- Check `README.md` for the overall project architecture

---

Created: 2025-12-09
Project: Amazon Multimodal RAG Assistant
Version: 1.0