Evaluation Guide - How to Run Evaluation with Samples
Overview
The Analytics & Evaluation tab allows you to run comprehensive quantitative evaluation of all 4 retrieval methods using test queries with ground truth documents.
Input Format
1. Evaluation Queries (JSON)
Required Format:
[
{
"query": "Your question here",
"ground_truth": ["chunk_content_1", "chunk_content_2"],
"k_values": [1, 3, 5],
"tags": ["tag1", "tag2"],
"tag_operator": "OR",
"vector_weight": 0.7,
"tag_weight": 0.3
}
]
Fields:
- query (required): The search question/query string
- ground_truth (required): List of document contents that should be retrieved. These should match the actual text content of chunks in your indexed documents.
- k_values (optional): List of k values to test (default: [1, 3, 5])
- tags (optional): Tags for tag-based pipelines
- tag_operator (optional): "OR", "AND", or "NOT" (default: "OR")
- vector_weight (optional): For hybrid pipelines (default: 0.7)
- tag_weight (optional): For hybrid pipelines (default: 0.3)
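The defaults above can be applied programmatically before pasting queries into the app. This is a minimal sketch; normalize_query is a hypothetical helper written for this guide, not part of the application itself:

```python
import json

# Defaults as documented in the field list above.
DEFAULTS = {
    "k_values": [1, 3, 5],
    "tag_operator": "OR",
    "vector_weight": 0.7,
    "tag_weight": 0.3,
}

def normalize_query(entry: dict) -> dict:
    """Validate required fields and fill in the documented defaults."""
    for field in ("query", "ground_truth"):
        if field not in entry:
            raise ValueError(f"missing required field: {field}")
    return {**DEFAULTS, **entry}

queries = [{"query": "What are safety protocols?",
            "ground_truth": ["Safety protocol text from indexed document"]}]
normalized = [normalize_query(q) for q in queries]
print(json.dumps(normalized, indent=2))  # paste this output into the app
```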
2. User Satisfaction Scores (JSON, Optional)
Format:
{
"query_0": 4.5,
"query_1": 3.8,
"query_2": 5.0
}
- Keys are "query_0", "query_1", etc. (index-based)
- Values are satisfaction scores (typically 1-5)
Sample Evaluation Input
Example 1: Basic Evaluation
[
{
"query": "What are the emergency procedures for fire incidents?",
"ground_truth": [
"In case of fire, immediately activate the nearest fire alarm and evacuate the building following the posted exit routes.",
"Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits.",
"During fire emergencies, do not use elevators and stay low to avoid smoke inhalation."
],
"k_values": [1, 3, 5]
},
{
"query": "What equipment is needed for patient safety monitoring?",
"ground_truth": [
"Standard patient monitoring equipment includes blood pressure cuffs, pulse oximeters, and ECG monitors.",
"Safety monitoring requires regular calibration of medical devices and documented maintenance logs."
],
"k_values": [1, 3, 5]
}
]
Example 2: With Tags
[
{
"query": "What are surgical safety protocols?",
"ground_truth": [
"All surgical procedures require pre-operative checklists and sterile environment protocols.",
"Surgical safety includes patient identification verification and site marking procedures.",
"Post-operative care involves monitoring vital signs and wound care instructions."
],
"k_values": [1, 3, 5],
"tags": ["surgery", "safety", "protocol"],
"tag_operator": "AND"
},
{
"query": "How to handle medical emergencies?",
"ground_truth": [
"Medical emergency response begins with assessing patient ABC (Airway, Breathing, Circulation).",
"Emergency protocols require immediate notification of medical team and preparation of emergency equipment."
],
"k_values": [1, 3, 5],
"tags": ["emergency", "medical", "response"],
"tag_operator": "OR"
}
]
Example 3: With User Satisfaction
Evaluation Queries:
[
{
"query": "What are infection control measures?",
"ground_truth": [
"Infection control requires hand hygiene, use of personal protective equipment, and proper sterilization of instruments.",
"Standard precautions must be followed for all patients to prevent transmission of infectious diseases."
],
"k_values": [1, 3, 5]
},
{
"query": "What are patient care guidelines?",
"ground_truth": [
"Patient care guidelines emphasize respect for patient autonomy, informed consent, and maintaining confidentiality.",
"Care protocols require documentation of all interventions and regular assessment of patient condition."
],
"k_values": [1, 3, 5]
}
]
User Satisfaction Scores:
{
"query_0": 4.5,
"query_1": 4.2
}
Step-by-Step Instructions
Step 1: Upload Documents
- Go to Upload & Tagging tab
- Upload your PDF/TXT documents
- Click "Build RAG Index"
- Wait for indexing to complete
Step 2: Prepare Ground Truth
Important: Ground truth must match the actual text content of chunks in your indexed documents.
How to find ground truth:
- Use Search & Compare tab to search for similar queries
- Check the retrieved document content
- Copy the exact text from relevant chunks
- Use these as your ground_truth array
Example: If a chunk contains:
"Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."
Then use:
"ground_truth": ["Fire safety protocols require all personnel to know the location of fire extinguishers and emergency exits."]
Step 3: Enter Evaluation Queries
- Go to Analytics & Evaluation tab
- In "Evaluation Queries (JSON)" field, paste your JSON array
- Use the sample format above as a template
Step 4: (Optional) Add User Satisfaction
- In "User Satisfaction Scores (JSON, optional)" field
- Enter satisfaction scores as JSON object
- Use query_0, query_1, etc. as keys
Step 5: Set Output Filename
- In "Output Filename" field
- Enter a filename (e.g., evaluation_results.csv)
- Results will be saved to the reports/ directory
Step 6: Run Evaluation
- Click "Run Evaluation" button
- Wait for evaluation to complete (may take several minutes)
- Results will appear in:
- Evaluation Status: Summary message
- Evaluation Results: DataFrame with all metrics
- Summary Statistics: Aggregated metrics by pipeline
- Visualization Tabs: Charts and graphs
Understanding Results
Metrics Explained
Precision@k: Fraction of retrieved documents that are relevant
- Range: 0.0 - 1.0 (higher is better)
- Example: 0.8 means 80% of retrieved docs are relevant
nDCG@k: Normalized Discounted Cumulative Gain
- Range: 0.0 - 1.0 (higher is better)
- Measures ranking quality with position weighting
Hit@k: Whether at least one relevant document is in top-k
- Value: 0.0 or 1.0 (1.0 = found at least one relevant doc)
MRR: Mean Reciprocal Rank
- Range: 0.0 - 1.0 (higher is better)
- Average of 1/rank where first relevant doc appears
Semantic Similarity: Average cosine similarity between query and retrieved docs
- Range: 0.0 - 1.0 (higher is better)
Latency: Response time in seconds (lower is better)
User Satisfaction: Average satisfaction score (if provided)
- Range: depends on your scale (typically 1-5)
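The ranking metrics above follow standard binary-relevance formulas. A minimal sketch of how they can be computed (the app itself may differ in details, e.g. it matches documents semantically rather than by exact ID):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def hit_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant document appears in the top k."""
    return 1.0 if any(d in relevant for d in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant document, or 0.0 if none is found."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["doc_b", "doc_a", "doc_d"]   # ranked results (hypothetical IDs)
relevant = {"doc_a", "doc_c"}             # ground-truth set
print(precision_at_k(retrieved, relevant, 3))  # 1 relevant doc in the top 3
```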
Results DataFrame
Columns include:
- query_id: Query identifier
- query: Query text
- pipeline: Pipeline name (base_rag, tag_filter_rag, hybrid_rag, hybrid_rerank_rag)
- k: Number of results requested
- precision_at_k: Precision metric
- ndcg_at_k: nDCG metric
- hit_at_k: Hit metric
- mrr: MRR metric
- semantic_similarity: Similarity score
- latency: Response time
- retrieved_count: Number of documents retrieved
- user_satisfaction: Satisfaction score (if provided)
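Once exported, the per-row results can be aggregated by pipeline outside the app as well. A sketch using only the standard library, with toy rows shaped like a subset of the columns above:

```python
from collections import defaultdict
from statistics import mean

# Toy rows mimicking the results table (subset of columns, made-up values).
rows = [
    {"pipeline": "base_rag",   "k": 3, "precision_at_k": 0.67, "latency": 0.42},
    {"pipeline": "base_rag",   "k": 3, "precision_at_k": 0.33, "latency": 0.38},
    {"pipeline": "hybrid_rag", "k": 3, "precision_at_k": 1.00, "latency": 0.55},
]

# Group rows by pipeline, then average each metric within the group.
by_pipeline = defaultdict(list)
for r in rows:
    by_pipeline[r["pipeline"]].append(r)

for name, group in sorted(by_pipeline.items()):
    avg_p = mean(r["precision_at_k"] for r in group)
    avg_l = mean(r["latency"] for r in group)
    print(f"{name}: precision={avg_p:.2f} latency={avg_l:.2f}s")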
Common Issues and Solutions
Issue 1: "No results found" or Low Precision
Problem: Ground truth doesn't match indexed documents
Solution:
- Check that ground truth text exactly matches chunk content
- Use Search & Compare to verify what's actually indexed
- Copy exact text from retrieved chunks
Issue 2: "Invalid JSON format"
Problem: JSON syntax error
Solution:
- Validate JSON using an online JSON validator
- Ensure all strings use double quotes ("), not single quotes (')
- Ensure there are no trailing commas
- Check brackets and braces are balanced
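You can also validate the JSON locally before pasting it into the app; Python's json module reports the exact line and column of a syntax error:

```python
import json

raw = '[{"query": "Q", "ground_truth": ["text"],}]'  # trailing comma: invalid
try:
    json.loads(raw)
    print("Valid JSON")
except json.JSONDecodeError as e:
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```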
Issue 3: Evaluation Takes Too Long
Problem: Too many queries or high k values
Solution:
- Start with 2-3 queries
- Use lower k values (e.g., [1, 3] instead of [1, 3, 5, 10])
- Evaluation runs sequentially - be patient
Issue 4: All Metrics Are Zero
Problem: Ground truth doesn't match any retrieved documents
Solution:
- Verify documents are actually indexed (check document count)
- Check that ground truth text matches indexed chunk content exactly
- Remember that matching is semantic, not exact: a retrieved chunk counts as relevant when its similarity to a ground-truth entry is roughly 0.8 or higher
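To see what the ~0.8 threshold means in practice, here is a cosine-similarity sketch on toy vectors standing in for real sentence embeddings (the actual embedding model and threshold are internal to the app):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy embeddings for a ground-truth entry and a retrieved chunk.
gt_vec = [0.9, 0.1, 0.2]
chunk_vec = [0.85, 0.15, 0.25]
THRESHOLD = 0.8  # approximate threshold mentioned above

print(cosine(gt_vec, chunk_vec) >= THRESHOLD)  # True: counts as a match
```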
Tips for Better Evaluation
- Start Small: Begin with 2-3 queries to test the format
- Verify Ground Truth: Always check what's actually indexed before creating ground truth
- Use Representative Queries: Include queries that reflect real user needs
- Test Different k Values: Try [1, 3, 5] to see how results improve with more documents
- Compare Methods: Use evaluation to see which pipeline performs best for your data
- Include Edge Cases: Test with queries that might not have perfect matches
Output Files
Evaluation generates several files in the reports/ directory:
- CSV File: evaluation_results.csv - Detailed metrics per query/pipeline/k
- JSON File: evaluation_results.json - Complete results with summary
- PNG Charts: Various visualization charts in reports/visualizations/
- HTML Report: Comprehensive report with embedded charts
Sample Workflow
- Upload documents → Index with tags
- Search manually → Find relevant chunks
- Create queries → Based on document topics
- Extract ground truth → Copy exact chunk text
- Run evaluation → Get quantitative metrics
- Analyze results → Compare pipeline performance
- Iterate → Refine queries and ground truth
Quick Reference
Minimal Valid Input:
[
{
"query": "Your question",
"ground_truth": ["Exact chunk text 1", "Exact chunk text 2"]
}
]
Full Input Example:
[
{
"query": "What are safety protocols?",
"ground_truth": ["Safety protocol text from indexed document"],
"k_values": [1, 3, 5],
"tags": ["safety", "protocol"],
"tag_operator": "OR"
}
]
Remember: Ground truth must exactly match the content of your indexed document chunks!