auto-tagging-rag / EVALUATION_GUIDE.md


Evaluation Guide - How to Run Evaluation with Samples

Overview

The Analytics & Evaluation tab runs a quantitative evaluation of all four retrieval pipelines (base_rag, tag_filter_rag, hybrid_rag, hybrid_rerank_rag) using test queries paired with ground-truth documents.

Input Format

1. Evaluation Queries (JSON)

Required Format:

[
  {
    "query": "Your question here",
    "ground_truth": ["chunk_content_1", "chunk_content_2"],
    "k_values": [1, 3, 5],
    "tags": ["tag1", "tag2"],
    "tag_operator": "OR",
    "vector_weight": 0.7,
    "tag_weight": 0.3
  }
]

Fields:

  • query (required): The search question/query string
  • ground_truth (required): List of actual document contents that should be retrieved. These should match the actual text content of chunks in your indexed documents.
  • k_values (optional): List of k values to test (default: [1, 3, 5])
  • tags (optional): Tags for tag-based pipelines
  • tag_operator (optional): "OR", "AND", or "NOT" (default: "OR")
  • vector_weight (optional): For hybrid pipelines (default: 0.7)
  • tag_weight (optional): For hybrid pipelines (default: 0.3)
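
The fields above can also be assembled programmatically, which guarantees valid JSON (double quotes, no trailing commas). A minimal sketch using the field names from the format above:

```python
import json

# One evaluation query, built with the fields described above.
queries = [
    {
        "query": "What are the emergency procedures for fire incidents?",
        "ground_truth": [
            "In case of fire, immediately activate the nearest fire alarm and evacuate the building following the posted exit routes."
        ],
        "k_values": [1, 3, 5],            # optional, defaults to [1, 3, 5]
        "tags": ["safety", "emergency"],  # optional, for tag-based pipelines
        "tag_operator": "OR",             # optional: "OR", "AND", or "NOT"
        "vector_weight": 0.7,             # optional, hybrid pipelines only
        "tag_weight": 0.3,                # optional, hybrid pipelines only
    }
]

# json.dumps always emits syntactically valid JSON.
payload = json.dumps(queries, indent=2)
print(payload)
```

Paste the printed output directly into the "Evaluation Queries (JSON)" field.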

2. User Satisfaction Scores (JSON, Optional)

Format:

{
  "query_0": 4.5,
  "query_1": 3.8,
  "query_2": 5.0
}
  • Keys are "query_0", "query_1", etc. (index-based)
  • Values are satisfaction scores (typically 1-5)
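
Because the keys are index-based, they must follow the same order as your evaluation queries. A small sketch that generates them from an ordered list of scores:

```python
import json

# Scores listed in the same order as the evaluation queries;
# keys become "query_0", "query_1", ... (index-based).
scores = [4.5, 3.8, 5.0]
satisfaction = {f"query_{i}": s for i, s in enumerate(scores)}
print(json.dumps(satisfaction, indent=2))
```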

Sample Evaluation Input

Example 1: Basic Evaluation

[
  {
    "query": "What are the emergency procedures for fire incidents?",
    "ground_truth": [
      "In case of fire, immediately activate the nearest fire alarm and evacuate the building following the posted exit routes.",
      "Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits.",
      "During fire emergencies, do not use elevators and stay low to avoid smoke inhalation."
    ],
    "k_values": [1, 3, 5]
  },
  {
    "query": "What equipment is needed for patient safety monitoring?",
    "ground_truth": [
      "Standard patient monitoring equipment includes blood pressure cuffs, pulse oximeters, and ECG monitors.",
      "Safety monitoring requires regular calibration of medical devices and documented maintenance logs."
    ],
    "k_values": [1, 3, 5]
  }
]

Example 2: With Tags

[
  {
    "query": "What are surgical safety protocols?",
    "ground_truth": [
      "All surgical procedures require pre-operative checklists and sterile environment protocols.",
      "Surgical safety includes patient identification verification and site marking procedures.",
      "Post-operative care involves monitoring vital signs and wound care instructions."
    ],
    "k_values": [1, 3, 5],
    "tags": ["surgery", "safety", "protocol"],
    "tag_operator": "AND"
  },
  {
    "query": "How to handle medical emergencies?",
    "ground_truth": [
      "Medical emergency response begins with assessing patient ABC (Airway, Breathing, Circulation).",
      "Emergency protocols require immediate notification of medical team and preparation of emergency equipment."
    ],
    "k_values": [1, 3, 5],
    "tags": ["emergency", "medical", "response"],
    "tag_operator": "OR"
  }
]

Example 3: With User Satisfaction

Evaluation Queries:

[
  {
    "query": "What are infection control measures?",
    "ground_truth": [
      "Infection control requires hand hygiene, use of personal protective equipment, and proper sterilization of instruments.",
      "Standard precautions must be followed for all patients to prevent transmission of infectious diseases."
    ],
    "k_values": [1, 3, 5]
  },
  {
    "query": "What are patient care guidelines?",
    "ground_truth": [
      "Patient care guidelines emphasize respect for patient autonomy, informed consent, and maintaining confidentiality.",
      "Care protocols require documentation of all interventions and regular assessment of patient condition."
    ],
    "k_values": [1, 3, 5]
  }
]

User Satisfaction Scores:

{
  "query_0": 4.5,
  "query_1": 4.2
}

Step-by-Step Instructions

Step 1: Upload Documents

  1. Go to Upload & Tagging tab
  2. Upload your PDF/TXT documents
  3. Click "Build RAG Index"
  4. Wait for indexing to complete

Step 2: Prepare Ground Truth

Important: Ground truth must match the actual text content of chunks in your indexed documents.

How to find ground truth:

  1. Use Search & Compare tab to search for similar queries
  2. Check the retrieved document content
  3. Copy the exact text from relevant chunks
  4. Use these as your ground_truth array

Example: If a chunk contains:

"Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."

Then use:

"ground_truth": ["Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."]

Step 3: Enter Evaluation Queries

  1. Go to Analytics & Evaluation tab
  2. In "Evaluation Queries (JSON)" field, paste your JSON array
  3. Use the sample format above as a template

Step 4: (Optional) Add User Satisfaction

  1. In "User Satisfaction Scores (JSON, optional)" field
  2. Enter satisfaction scores as JSON object
  3. Use query_0, query_1, etc. as keys

Step 5: Set Output Filename

  1. In "Output Filename" field
  2. Enter filename (e.g., evaluation_results.csv)
  3. Results will be saved to reports/ directory

Step 6: Run Evaluation

  1. Click "Run Evaluation" button
  2. Wait for evaluation to complete (may take several minutes)
  3. Results will appear in:
    • Evaluation Status: Summary message
    • Evaluation Results: DataFrame with all metrics
    • Summary Statistics: Aggregated metrics by pipeline
    • Visualization Tabs: Charts and graphs

Understanding Results

Metrics Explained

  • Precision@k: Fraction of retrieved documents that are relevant

    • Range: 0.0 - 1.0 (higher is better)
    • Example: 0.8 means 80% of retrieved docs are relevant
  • nDCG@k: Normalized Discounted Cumulative Gain

    • Range: 0.0 - 1.0 (higher is better)
    • Measures ranking quality with position weighting
  • Hit@k: Whether at least one relevant document is in top-k

    • Value: 0.0 or 1.0 (1.0 = found at least one relevant doc)
  • MRR: Mean Reciprocal Rank

    • Range: 0.0 - 1.0 (higher is better)
    • Average, over queries, of 1/rank of the first relevant document
  • Semantic Similarity: Average cosine similarity between query and retrieved docs

    • Range: 0.0 - 1.0 (higher is better)
  • Latency: Response time in seconds (lower is better)

  • User Satisfaction: Average satisfaction score (if provided)

    • Range: depends on your scale (typically 1-5)
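
The ranking metrics above can be illustrated with a short sketch. Here `relevant` marks which retrieved positions matched a ground-truth chunk; this is illustrative, not the app's internal implementation:

```python
import math

def precision_at_k(relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevant[:k]) / k

def hit_at_k(relevant, k):
    """1.0 if at least one relevant doc appears in the top-k, else 0.0."""
    return 1.0 if any(relevant[:k]) else 0.0

def reciprocal_rank(relevant):
    """1/rank of the first relevant doc (0.0 if none was retrieved)."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant, k):
    """Binary-relevance nDCG: position-discounted gain over the ideal ordering."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevant[:k], start=1))
    ideal = sorted(relevant, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Top-5 results: positions 1 and 3 were relevant.
relevant = [1, 0, 1, 0, 0]
print(precision_at_k(relevant, 5))  # 0.4
print(hit_at_k(relevant, 5))        # 1.0
print(reciprocal_rank(relevant))    # 1.0
print(ndcg_at_k(relevant, 5))
```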

Results DataFrame

Columns include:

  • query_id: Query identifier
  • query: Query text
  • pipeline: Pipeline name (base_rag, tag_filter_rag, hybrid_rag, hybrid_rerank_rag)
  • k: Number of results requested
  • precision_at_k: Precision metric
  • ndcg_at_k: nDCG metric
  • hit_at_k: Hit metric
  • mrr: MRR metric
  • semantic_similarity: Similarity score
  • latency: Response time
  • retrieved_count: Number of documents retrieved
  • user_satisfaction: Satisfaction score (if provided)
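
If you want to post-process evaluation_results.csv yourself, the per-pipeline summary can be reproduced with the standard library. A sketch using the column names listed above (the CSV content here is a made-up stand-in):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Tiny stand-in for reports/evaluation_results.csv.
sample_csv = """pipeline,k,precision_at_k,latency
base_rag,5,0.4,0.12
base_rag,5,0.6,0.10
hybrid_rag,5,0.8,0.25
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Group one metric by pipeline and average it.
by_pipeline = defaultdict(list)
for row in rows:
    by_pipeline[row["pipeline"]].append(float(row["precision_at_k"]))

summary = {p: round(mean(v), 3) for p, v in by_pipeline.items()}
print(summary)  # {'base_rag': 0.5, 'hybrid_rag': 0.8}
```

To run it on real output, replace `io.StringIO(sample_csv)` with `open("reports/evaluation_results.csv")`.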

Common Issues and Solutions

Issue 1: "No results found" or Low Precision

Problem: Ground truth doesn't match indexed documents

Solution:

  1. Check that ground truth text exactly matches chunk content
  2. Use Search & Compare to verify what's actually indexed
  3. Copy exact text from retrieved chunks

Issue 2: "Invalid JSON format"

Problem: JSON syntax error

Solution:

  1. Validate JSON using an online JSON validator
  2. Ensure all strings are in double quotes ", not single quotes '
  3. Ensure no trailing commas
  4. Check brackets and braces are balanced
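
A quick way to locate a syntax error before pasting: `json.loads` reports the line and column of the first problem.

```python
import json

# Deliberately broken input: trailing comma after the ground_truth entry.
bad_input = """[
  {
    "query": "What are safety protocols?",
    "ground_truth": ["Safety protocol text"],
  }
]"""

try:
    json.loads(bad_input)
except json.JSONDecodeError as e:
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```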

Issue 3: Evaluation Takes Too Long

Problem: Too many queries or high k values

Solution:

  1. Start with 2-3 queries
  2. Use lower k values (e.g., [1, 3] instead of [1, 3, 5, 10])
  3. Evaluation runs sequentially - be patient

Issue 4: All Metrics Are Zero

Problem: Ground truth doesn't match any retrieved documents

Solution:

  1. Verify documents are actually indexed (check document count)
  2. Check that ground truth text matches indexed chunk content exactly
  3. Remember that matching uses a semantic similarity threshold (~0.8), so near-identical text should still match, but heavy paraphrasing will not
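
The ~0.8 threshold applies to embedding cosine similarity inside the app. As a rough stand-alone check, you can approximate "close enough" with a string-similarity ratio; `difflib` here is only an illustrative stand-in, not the app's embedding model:

```python
from difflib import SequenceMatcher

chunk = "Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."
ground_truth = "Fire safety protocols require personnel to know the location of fire extinguishers and emergency exits."

# A low ratio usually explains all-zero metrics: the ground truth
# has drifted too far from the indexed chunk text.
ratio = SequenceMatcher(None, ground_truth, chunk).ratio()
print(f"similarity ratio: {ratio:.2f}")
```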

Tips for Better Evaluation

  1. Start Small: Begin with 2-3 queries to test the format
  2. Verify Ground Truth: Always check what's actually indexed before creating ground truth
  3. Use Representative Queries: Include queries that reflect real user needs
  4. Test Different k Values: Try [1, 3, 5] to see how results improve with more documents
  5. Compare Methods: Use evaluation to see which pipeline performs best for your data
  6. Include Edge Cases: Test with queries that might not have perfect matches

Output Files

Evaluation generates several files in reports/ directory:

  1. CSV File: evaluation_results.csv - Detailed metrics per query/pipeline/k
  2. JSON File: evaluation_results.json - Complete results with summary
  3. PNG Charts: Various visualization charts in reports/visualizations/
  4. HTML Report: Comprehensive report with embedded charts

Sample Workflow

  1. Upload documents → Index with tags
  2. Search manually → Find relevant chunks
  3. Create queries → Based on document topics
  4. Extract ground truth → Copy exact chunk text
  5. Run evaluation → Get quantitative metrics
  6. Analyze results → Compare pipeline performance
  7. Iterate → Refine queries and ground truth

Quick Reference

Minimal Valid Input:

[
  {
    "query": "Your question",
    "ground_truth": ["Exact chunk text 1", "Exact chunk text 2"]
  }
]

Full Input Example:

[
  {
    "query": "What are safety protocols?",
    "ground_truth": ["Safety protocol text from indexed document"],
    "k_values": [1, 3, 5],
    "tags": ["safety", "protocol"],
    "tag_operator": "OR"
  }
]

Remember: Ground truth must exactly match the content of your indexed document chunks!