# Evaluation Guide - How to Run Evaluation with Samples

## Overview

The **Analytics & Evaluation** tab allows you to run a comprehensive quantitative evaluation of all 4 retrieval methods using test queries with ground-truth documents.

## Input Format

### 1. Evaluation Queries (JSON)

**Required Format:**

```json
[
  {
    "query": "Your question here",
    "ground_truth": ["chunk_content_1", "chunk_content_2"],
    "k_values": [1, 3, 5],
    "tags": ["tag1", "tag2"],
    "tag_operator": "OR",
    "vector_weight": 0.7,
    "tag_weight": 0.3
  }
]
```

**Fields:**

- **`query`** (required): The search question/query string
- **`ground_truth`** (required): List of document contents that should be retrieved. These should match the **actual text content** of chunks in your indexed documents.
- **`k_values`** (optional): List of k values to test (default: `[1, 3, 5]`)
- **`tags`** (optional): Tags for tag-based pipelines
- **`tag_operator`** (optional): `"OR"`, `"AND"`, or `"NOT"` (default: `"OR"`)
- **`vector_weight`** (optional): For hybrid pipelines (default: `0.7`)
- **`tag_weight`** (optional): For hybrid pipelines (default: `0.3`)

### 2. User Satisfaction Scores (JSON, Optional)

**Format:**

```json
{
  "query_0": 4.5,
  "query_1": 3.8,
  "query_2": 5.0
}
```

- Keys are `"query_0"`, `"query_1"`, etc. (index-based)
- Values are satisfaction scores (typically 1-5)

## Sample Evaluation Input

### Example 1: Basic Evaluation

```json
[
  {
    "query": "What are the emergency procedures for fire incidents?",
    "ground_truth": [
      "In case of fire, immediately activate the nearest fire alarm and evacuate the building following the posted exit routes.",
      "Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits.",
      "During fire emergencies, do not use elevators and stay low to avoid smoke inhalation."
], "k_values": [1, 3, 5] }, { "query": "What equipment is needed for patient safety monitoring?", "ground_truth": [ "Standard patient monitoring equipment includes blood pressure cuffs, pulse oximeters, and ECG monitors.", "Safety monitoring requires regular calibration of medical devices and documented maintenance logs." ], "k_values": [1, 3, 5] } ] ``` ### Example 2: With Tags ```json [ { "query": "What are surgical safety protocols?", "ground_truth": [ "All surgical procedures require pre-operative checklists and sterile environment protocols.", "Surgical safety includes patient identification verification and site marking procedures.", "Post-operative care involves monitoring vital signs and wound care instructions." ], "k_values": [1, 3, 5], "tags": ["surgery", "safety", "protocol"], "tag_operator": "AND" }, { "query": "How to handle medical emergencies?", "ground_truth": [ "Medical emergency response begins with assessing patient ABC (Airway, Breathing, Circulation).", "Emergency protocols require immediate notification of medical team and preparation of emergency equipment." ], "k_values": [1, 3, 5], "tags": ["emergency", "medical", "response"], "tag_operator": "OR" } ] ``` ### Example 3: With User Satisfaction **Evaluation Queries:** ```json [ { "query": "What are infection control measures?", "ground_truth": [ "Infection control requires hand hygiene, use of personal protective equipment, and proper sterilization of instruments.", "Standard precautions must be followed for all patients to prevent transmission of infectious diseases." ], "k_values": [1, 3, 5] }, { "query": "What are patient care guidelines?", "ground_truth": [ "Patient care guidelines emphasize respect for patient autonomy, informed consent, and maintaining confidentiality.", "Care protocols require documentation of all interventions and regular assessment of patient condition." 
], "k_values": [1, 3, 5] } ] ``` **User Satisfaction Scores:** ```json { "query_0": 4.5, "query_1": 4.2 } ``` ## Step-by-Step Instructions ### Step 1: Upload Documents 1. Go to **Upload & Tagging** tab 2. Upload your PDF/TXT documents 3. Click **"Build RAG Index"** 4. Wait for indexing to complete ### Step 2: Prepare Ground Truth **Important:** Ground truth must match the **actual text content** of chunks in your indexed documents. **How to find ground truth:** 1. Use **Search & Compare** tab to search for similar queries 2. Check the retrieved document content 3. Copy the exact text from relevant chunks 4. Use these as your `ground_truth` array **Example:** If a chunk contains: ``` "Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits." ``` Then use: ```json "ground_truth": ["Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."] ``` ### Step 3: Enter Evaluation Queries 1. Go to **Analytics & Evaluation** tab 2. In **"Evaluation Queries (JSON)"** field, paste your JSON array 3. Use the sample format above as a template ### Step 4: (Optional) Add User Satisfaction 1. In **"User Satisfaction Scores (JSON, optional)"** field 2. Enter satisfaction scores as JSON object 3. Use `query_0`, `query_1`, etc. as keys ### Step 5: Set Output Filename 1. In **"Output Filename"** field 2. Enter filename (e.g., `evaluation_results.csv`) 3. Results will be saved to `reports/` directory ### Step 6: Run Evaluation 1. Click **"Run Evaluation"** button 2. Wait for evaluation to complete (may take several minutes) 3. 
Results will appear in:

- **Evaluation Status**: Summary message
- **Evaluation Results**: DataFrame with all metrics
- **Summary Statistics**: Aggregated metrics by pipeline
- **Visualization Tabs**: Charts and graphs

## Understanding Results

### Metrics Explained

- **Precision@k**: Fraction of retrieved documents that are relevant
  - Range: 0.0-1.0 (higher is better)
  - Example: 0.8 means 80% of retrieved docs are relevant
- **nDCG@k**: Normalized Discounted Cumulative Gain
  - Range: 0.0-1.0 (higher is better)
  - Measures ranking quality with position weighting
- **Hit@k**: Whether at least one relevant document is in the top k
  - Value: 0.0 or 1.0 (1.0 = found at least one relevant doc)
- **MRR**: Mean Reciprocal Rank
  - Range: 0.0-1.0 (higher is better)
  - Average of 1/rank at which the first relevant doc appears
- **Semantic Similarity**: Average cosine similarity between query and retrieved docs
  - Range: 0.0-1.0 (higher is better)
- **Latency**: Response time in seconds (lower is better)
- **User Satisfaction**: Average satisfaction score (if provided)
  - Range: depends on your scale (typically 1-5)

### Results DataFrame

Columns include:

- `query_id`: Query identifier
- `query`: Query text
- `pipeline`: Pipeline name (`base_rag`, `tag_filter_rag`, `hybrid_rag`, `hybrid_rerank_rag`)
- `k`: Number of results requested
- `precision_at_k`: Precision metric
- `ndcg_at_k`: nDCG metric
- `hit_at_k`: Hit metric
- `mrr`: MRR metric
- `semantic_similarity`: Similarity score
- `latency`: Response time
- `retrieved_count`: Number of documents retrieved
- `user_satisfaction`: Satisfaction score (if provided)

## Common Issues and Solutions

### Issue 1: "No results found" or Low Precision

**Problem:** Ground truth doesn't match indexed documents

**Solution:**

1. Check that ground-truth text **exactly matches** chunk content
2. Use **Search & Compare** to verify what's actually indexed
3. Copy the exact text from retrieved chunks

### Issue 2: "Invalid JSON format"

**Problem:** JSON syntax error

**Solution:**

1. Validate the JSON using a JSON validator
2. Ensure all strings use double quotes `"`, not single quotes `'`
3. Ensure there are no trailing commas
4. Check that brackets and braces are balanced

### Issue 3: Evaluation Takes Too Long

**Problem:** Too many queries or high k values

**Solution:**

1. Start with 2-3 queries
2. Use lower k values (e.g., `[1, 3]` instead of `[1, 3, 5, 10]`)
3. Evaluation runs sequentially, so be patient

### Issue 4: All Metrics Are Zero

**Problem:** Ground truth doesn't match any retrieved documents

**Solution:**

1. Verify that documents are actually indexed (check the document count)
2. Check that ground-truth text matches indexed chunk content exactly
3. Keep in mind that matching is semantic: a retrieved chunk only counts as relevant when its similarity to a ground-truth string exceeds roughly 0.8, so near-exact text is required

## Tips for Better Evaluation

1. **Start Small**: Begin with 2-3 queries to test the format
2. **Verify Ground Truth**: Always check what's actually indexed before creating ground truth
3. **Use Representative Queries**: Include queries that reflect real user needs
4. **Test Different k Values**: Try `[1, 3, 5]` to see how results improve with more documents
5. **Compare Methods**: Use evaluation to see which pipeline performs best for your data
6. **Include Edge Cases**: Test with queries that might not have perfect matches

## Output Files

Evaluation generates several files in the `reports/` directory:

1. **CSV File**: `evaluation_results.csv` - detailed metrics per query/pipeline/k
2. **JSON File**: `evaluation_results.json` - complete results with summary
3. **PNG Charts**: Various visualization charts in `reports/visualizations/`
4. **HTML Report**: Comprehensive report with embedded charts

## Sample Workflow

1. **Upload documents** → Index with tags
2. **Search manually** → Find relevant chunks
3. **Create queries** → Based on document topics
4. **Extract ground truth** → Copy exact chunk text
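Between steps 4 and 5 it can pay off to confirm programmatically that every ground-truth string really occurs verbatim in your source text, since exact-text mismatches are the most common cause of Issues 1 and 4. A rough sketch, assuming you keep plain-text copies of the documents in a local folder; the paths are hypothetical, and indexed *chunks* may still differ slightly from the raw files:

```python
import json
from pathlib import Path

def check_ground_truth(queries_path, docs_dir):
    """Print ground-truth strings that never appear verbatim in the documents."""
    queries = json.loads(Path(queries_path).read_text(encoding="utf-8"))
    corpus = " ".join(p.read_text(encoding="utf-8")
                      for p in Path(docs_dir).glob("*.txt"))
    # Collapse whitespace so line wrapping does not cause false mismatches.
    corpus = " ".join(corpus.split())
    for i, q in enumerate(queries):
        for gt in q["ground_truth"]:
            if " ".join(gt.split()) not in corpus:
                print(f"query_{i}: not found verbatim: {gt[:60]}")

# Hypothetical paths; point these at your own files:
# check_ground_truth("eval_queries.json", "docs/")
```

Silence means every ground-truth string was found; any line it prints should be replaced with exact text copied from the **Search & Compare** tab.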
5. **Run evaluation** → Get quantitative metrics
6. **Analyze results** → Compare pipeline performance
7. **Iterate** → Refine queries and ground truth

## Quick Reference

**Minimal Valid Input:**

```json
[
  {
    "query": "Your question",
    "ground_truth": ["Exact chunk text 1", "Exact chunk text 2"]
  }
]
```

**Full Input Example:**

```json
[
  {
    "query": "What are safety protocols?",
    "ground_truth": ["Safety protocol text from indexed document"],
    "k_values": [1, 3, 5],
    "tags": ["safety", "protocol"],
    "tag_operator": "OR"
  }
]
```

Remember: Ground truth must **exactly match** the content of your indexed document chunks!
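For reference, the ranking metrics described under **Understanding Results** follow standard definitions and can be reproduced in a few lines of Python. This is an illustrative sketch with binary relevance; the app's own implementation (in particular, how a retrieved chunk is judged relevant) may differ:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def hit_at_k(retrieved, relevant, k):
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in retrieved[:k]) else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0.0 if none is found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: position-discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["doc_b", "doc_a", "doc_x"]  # hypothetical ranked results
relevant = {"doc_a", "doc_c"}            # hypothetical ground truth
print(precision_at_k(retrieved, relevant, 3))  # one of the top 3 is relevant: 1/3
print(mrr(retrieved, relevant))                # first relevant doc at rank 2: 0.5
```

Running these by hand on a small example is a quick way to convince yourself why, say, Hit@1 can be 0.0 while Hit@5 is 1.0 for the same query.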
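The per-pipeline summary statistics can also be recomputed offline from the CSV exported to `reports/`. A standard-library sketch over the column names listed earlier (`pipeline`, `precision_at_k`, etc.), demonstrated on synthetic rows rather than a real export:

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def summarize(csv_text, metric="precision_at_k"):
    """Mean of one metric per pipeline, from evaluation_results.csv contents."""
    by_pipeline = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_pipeline[row["pipeline"]].append(float(row[metric]))
    return {name: mean(vals) for name, vals in by_pipeline.items()}

# Synthetic example rows (not real output):
sample_csv = """pipeline,k,precision_at_k
base_rag,3,0.33
base_rag,5,0.40
hybrid_rerank_rag,3,0.67
hybrid_rerank_rag,5,0.60
"""
print(summarize(sample_csv))
```

In practice you would read the real file with `open("reports/evaluation_results.csv")` and swap in any metric column (`ndcg_at_k`, `mrr`, `latency`) to compare pipelines along a different axis.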