auto-tagging-rag / EVALUATION_GUIDE.md


Evaluation Guide - How to Run Evaluation with Samples

Overview

The Analytics & Evaluation tab runs a quantitative evaluation of all four retrieval pipelines (base_rag, tag_filter_rag, hybrid_rag, hybrid_rerank_rag) using test queries paired with ground-truth documents.

Input Format

1. Evaluation Queries (JSON)

Required Format:

[
  {
    "query": "Your question here",
    "ground_truth": ["chunk_content_1", "chunk_content_2"],
    "k_values": [1, 3, 5],
    "tags": ["tag1", "tag2"],
    "tag_operator": "OR",
    "vector_weight": 0.7,
    "tag_weight": 0.3
  }
]

Fields:

  • query (required): The search question/query string
  • ground_truth (required): List of actual document contents that should be retrieved. These should match the actual text content of chunks in your indexed documents.
  • k_values (optional): List of k values to test (default: [1, 3, 5])
  • tags (optional): Tags for tag-based pipelines
  • tag_operator (optional): "OR", "AND", or "NOT" (default: "OR")
  • vector_weight (optional): For hybrid pipelines (default: 0.7)
  • tag_weight (optional): For hybrid pipelines (default: 0.3)
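
The fields above can also be assembled programmatically, which guarantees valid JSON (double quotes, no trailing commas). A minimal sketch using the field names from the format above:

```python
import json

# One evaluation query, built with the fields described above.
queries = [
    {
        "query": "What are the emergency procedures for fire incidents?",
        "ground_truth": [
            "In case of fire, immediately activate the nearest fire alarm and evacuate the building following the posted exit routes."
        ],
        "k_values": [1, 3, 5],            # optional, defaults to [1, 3, 5]
        "tags": ["safety", "emergency"],  # optional, for tag-based pipelines
        "tag_operator": "OR",             # optional: "OR", "AND", or "NOT"
        "vector_weight": 0.7,             # optional, hybrid pipelines only
        "tag_weight": 0.3,                # optional, hybrid pipelines only
    }
]

# json.dumps always emits syntactically valid JSON.
payload = json.dumps(queries, indent=2)
print(payload)
```

Paste the printed output directly into the "Evaluation Queries (JSON)" field.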

2. User Satisfaction Scores (JSON, Optional)

Format:

{
  "query_0": 4.5,
  "query_1": 3.8,
  "query_2": 5.0
}
  • Keys are "query_0", "query_1", etc. (index-based)
  • Values are satisfaction scores (typically 1-5)
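
Because the keys are index-based, they must follow the same order as your evaluation queries. A small sketch that generates them from an ordered list of scores:

```python
import json

# Scores listed in the same order as the evaluation queries;
# keys become "query_0", "query_1", ... (index-based).
scores = [4.5, 3.8, 5.0]
satisfaction = {f"query_{i}": s for i, s in enumerate(scores)}
print(json.dumps(satisfaction, indent=2))
```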

Sample Evaluation Input

Example 1: Basic Evaluation

[
  {
    "query": "What are the emergency procedures for fire incidents?",
    "ground_truth": [
      "In case of fire, immediately activate the nearest fire alarm and evacuate the building following the posted exit routes.",
      "Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits.",
      "During fire emergencies, do not use elevators and stay low to avoid smoke inhalation."
    ],
    "k_values": [1, 3, 5]
  },
  {
    "query": "What equipment is needed for patient safety monitoring?",
    "ground_truth": [
      "Standard patient monitoring equipment includes blood pressure cuffs, pulse oximeters, and ECG monitors.",
      "Safety monitoring requires regular calibration of medical devices and documented maintenance logs."
    ],
    "k_values": [1, 3, 5]
  }
]

Example 2: With Tags

[
  {
    "query": "What are surgical safety protocols?",
    "ground_truth": [
      "All surgical procedures require pre-operative checklists and sterile environment protocols.",
      "Surgical safety includes patient identification verification and site marking procedures.",
      "Post-operative care involves monitoring vital signs and wound care instructions."
    ],
    "k_values": [1, 3, 5],
    "tags": ["surgery", "safety", "protocol"],
    "tag_operator": "AND"
  },
  {
    "query": "How to handle medical emergencies?",
    "ground_truth": [
      "Medical emergency response begins with assessing patient ABC (Airway, Breathing, Circulation).",
      "Emergency protocols require immediate notification of medical team and preparation of emergency equipment."
    ],
    "k_values": [1, 3, 5],
    "tags": ["emergency", "medical", "response"],
    "tag_operator": "OR"
  }
]

Example 3: With User Satisfaction

Evaluation Queries:

[
  {
    "query": "What are infection control measures?",
    "ground_truth": [
      "Infection control requires hand hygiene, use of personal protective equipment, and proper sterilization of instruments.",
      "Standard precautions must be followed for all patients to prevent transmission of infectious diseases."
    ],
    "k_values": [1, 3, 5]
  },
  {
    "query": "What are patient care guidelines?",
    "ground_truth": [
      "Patient care guidelines emphasize respect for patient autonomy, informed consent, and maintaining confidentiality.",
      "Care protocols require documentation of all interventions and regular assessment of patient condition."
    ],
    "k_values": [1, 3, 5]
  }
]

User Satisfaction Scores:

{
  "query_0": 4.5,
  "query_1": 4.2
}

Step-by-Step Instructions

Step 1: Upload Documents

  1. Go to Upload & Tagging tab
  2. Upload your PDF/TXT documents
  3. Click "Build RAG Index"
  4. Wait for indexing to complete

Step 2: Prepare Ground Truth

Important: Ground truth must match the actual text content of chunks in your indexed documents.

How to find ground truth:

  1. Use Search & Compare tab to search for similar queries
  2. Check the retrieved document content
  3. Copy the exact text from relevant chunks
  4. Use these as your ground_truth array

Example: If a chunk contains:

"Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."

Then use:

"ground_truth": ["Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."]

Step 3: Enter Evaluation Queries

  1. Go to Analytics & Evaluation tab
  2. In "Evaluation Queries (JSON)" field, paste your JSON array
  3. Use the sample format above as a template

Step 4: (Optional) Add User Satisfaction

  1. In "User Satisfaction Scores (JSON, optional)" field
  2. Enter satisfaction scores as JSON object
  3. Use query_0, query_1, etc. as keys

Step 5: Set Output Filename

  1. In "Output Filename" field
  2. Enter filename (e.g., evaluation_results.csv)
  3. Results will be saved to reports/ directory

Step 6: Run Evaluation

  1. Click "Run Evaluation" button
  2. Wait for evaluation to complete (may take several minutes)
  3. Results will appear in:
    • Evaluation Status: Summary message
    • Evaluation Results: DataFrame with all metrics
    • Summary Statistics: Aggregated metrics by pipeline
    • Visualization Tabs: Charts and graphs

Understanding Results

Metrics Explained

  • Precision@k: Fraction of retrieved documents that are relevant

    • Range: 0.0 - 1.0 (higher is better)
    • Example: 0.8 means 80% of retrieved docs are relevant
  • nDCG@k: Normalized Discounted Cumulative Gain

    • Range: 0.0 - 1.0 (higher is better)
    • Measures ranking quality with position weighting
  • Hit@k: Whether at least one relevant document is in top-k

    • Value: 0.0 or 1.0 (1.0 = found at least one relevant doc)
  • MRR: Mean Reciprocal Rank

    • Range: 0.0 - 1.0 (higher is better)
    • Average, over queries, of 1/rank of the first relevant document
  • Semantic Similarity: Average cosine similarity between query and retrieved docs

    • Range: 0.0 - 1.0 (higher is better)
  • Latency: Response time in seconds (lower is better)

  • User Satisfaction: Average satisfaction score (if provided)

    • Range: depends on your scale (typically 1-5)
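
The ranking metrics above can be illustrated with a short sketch. Here `relevant` marks which retrieved positions matched a ground-truth chunk; this is illustrative, not the app's internal implementation:

```python
import math

def precision_at_k(relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevant[:k]) / k

def hit_at_k(relevant, k):
    """1.0 if at least one relevant doc appears in the top-k, else 0.0."""
    return 1.0 if any(relevant[:k]) else 0.0

def reciprocal_rank(relevant):
    """1/rank of the first relevant doc (0.0 if none was retrieved)."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant, k):
    """Binary-relevance nDCG: position-discounted gain over the ideal ordering."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevant[:k], start=1))
    ideal = sorted(relevant, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Top-5 results: positions 1 and 3 were relevant.
relevant = [1, 0, 1, 0, 0]
print(precision_at_k(relevant, 5))  # 0.4
print(hit_at_k(relevant, 5))        # 1.0
print(reciprocal_rank(relevant))    # 1.0
print(ndcg_at_k(relevant, 5))
```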

Results DataFrame

Columns include:

  • query_id: Query identifier
  • query: Query text
  • pipeline: Pipeline name (base_rag, tag_filter_rag, hybrid_rag, hybrid_rerank_rag)
  • k: Number of results requested
  • precision_at_k: Precision metric
  • ndcg_at_k: nDCG metric
  • hit_at_k: Hit metric
  • mrr: MRR metric
  • semantic_similarity: Similarity score
  • latency: Response time
  • retrieved_count: Number of documents retrieved
  • user_satisfaction: Satisfaction score (if provided)
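
If you want to post-process evaluation_results.csv yourself, the per-pipeline summary can be reproduced with the standard library. A sketch using the column names listed above (the CSV content here is a made-up stand-in):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Tiny stand-in for reports/evaluation_results.csv.
sample_csv = """pipeline,k,precision_at_k,latency
base_rag,5,0.4,0.12
base_rag,5,0.6,0.10
hybrid_rag,5,0.8,0.25
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Group one metric by pipeline and average it.
by_pipeline = defaultdict(list)
for row in rows:
    by_pipeline[row["pipeline"]].append(float(row["precision_at_k"]))

summary = {p: round(mean(v), 3) for p, v in by_pipeline.items()}
print(summary)  # {'base_rag': 0.5, 'hybrid_rag': 0.8}
```

To run it on real output, replace `io.StringIO(sample_csv)` with `open("reports/evaluation_results.csv")`.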

Common Issues and Solutions

Issue 1: "No results found" or Low Precision

Problem: Ground truth doesn't match indexed documents

Solution:

  1. Check that ground truth text exactly matches chunk content
  2. Use Search & Compare to verify what's actually indexed
  3. Copy exact text from retrieved chunks

Issue 2: "Invalid JSON format"

Problem: JSON syntax error

Solution:

  1. Validate JSON using an online JSON validator
  2. Ensure all strings are in double quotes ", not single quotes '
  3. Ensure no trailing commas
  4. Check brackets and braces are balanced
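
A quick way to locate a syntax error before pasting: `json.loads` reports the line and column of the first problem.

```python
import json

# Deliberately broken input: trailing comma after the ground_truth entry.
bad_input = """[
  {
    "query": "What are safety protocols?",
    "ground_truth": ["Safety protocol text"],
  }
]"""

try:
    json.loads(bad_input)
except json.JSONDecodeError as e:
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```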

Issue 3: Evaluation Takes Too Long

Problem: Too many queries or high k values

Solution:

  1. Start with 2-3 queries
  2. Use lower k values (e.g., [1, 3] instead of [1, 3, 5, 10])
  3. Evaluation runs sequentially - be patient

Issue 4: All Metrics Are Zero

Problem: Ground truth doesn't match any retrieved documents

Solution:

  1. Verify documents are actually indexed (check document count)
  2. Check that ground truth text matches indexed chunk content exactly
  3. Remember that matching uses a semantic similarity threshold (~0.8), so near-identical text should still match, but heavy paraphrasing will not
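
The ~0.8 threshold applies to embedding cosine similarity inside the app. As a rough stand-alone check, you can approximate "close enough" with a string-similarity ratio; `difflib` here is only an illustrative stand-in, not the app's embedding model:

```python
from difflib import SequenceMatcher

chunk = "Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."
ground_truth = "Fire safety protocols require personnel to know the location of fire extinguishers and emergency exits."

# A low ratio usually explains all-zero metrics: the ground truth
# has drifted too far from the indexed chunk text.
ratio = SequenceMatcher(None, ground_truth, chunk).ratio()
print(f"similarity ratio: {ratio:.2f}")
```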

Tips for Better Evaluation

  1. Start Small: Begin with 2-3 queries to test the format
  2. Verify Ground Truth: Always check what's actually indexed before creating ground truth
  3. Use Representative Queries: Include queries that reflect real user needs
  4. Test Different k Values: Try [1, 3, 5] to see how results improve with more documents
  5. Compare Methods: Use evaluation to see which pipeline performs best for your data
  6. Include Edge Cases: Test with queries that might not have perfect matches

Output Files

Evaluation generates several files in reports/ directory:

  1. CSV File: evaluation_results.csv - Detailed metrics per query/pipeline/k
  2. JSON File: evaluation_results.json - Complete results with summary
  3. PNG Charts: Various visualization charts in reports/visualizations/
  4. HTML Report: Comprehensive report with embedded charts

Sample Workflow

  1. Upload documents → Index with tags
  2. Search manually → Find relevant chunks
  3. Create queries → Based on document topics
  4. Extract ground truth → Copy exact chunk text
  5. Run evaluation → Get quantitative metrics
  6. Analyze results → Compare pipeline performance
  7. Iterate → Refine queries and ground truth

Quick Reference

Minimal Valid Input:

[
  {
    "query": "Your question",
    "ground_truth": ["Exact chunk text 1", "Exact chunk text 2"]
  }
]

Full Input Example:

[
  {
    "query": "What are safety protocols?",
    "ground_truth": ["Safety protocol text from indexed document"],
    "k_values": [1, 3, 5],
    "tags": ["safety", "protocol"],
    "tag_operator": "OR"
  }
]

Remember: Ground truth must exactly match the content of your indexed document chunks!