RMSE Metrics Implementation Guide
Overview
RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are automatically computed during batch evaluation and included in both the UI display and JSON downloads.
What Was Implemented
1. RMSE Aggregation for Batch Evaluation
Method: RMSECalculator.compute_rmse_aggregation_for_batch(results)
Computes consistency statistics for each TRACE metric across all evaluations, showing how much each metric varies across the batch.
Output Structure:
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
Interpretation:
- Mean: Average score for that metric across all evaluations
- Std Dev: Spread of scores around the mean - lower means more consistent
- Min/Max: Range of values observed
- Variance: Square of the standard deviation
- Count: Number of evaluations
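The actual RMSECalculator implementation is not reproduced here, but the aggregation it describes can be sketched in a few lines of NumPy. The standalone function name and the assumption that results is a list of dicts keyed by metric name are illustrative, not the exact API; the output shape mirrors the structure above.

import numpy as np

TRACE_METRICS = ["context_relevance", "context_utilization", "completeness", "adherence"]

def rmse_aggregation_sketch(results):
    """Minimal sketch of the batch aggregation described above.
    Assumes each item in `results` is a dict with one float score per TRACE metric."""
    aggregation = {}
    for metric in TRACE_METRICS:
        values = np.array([r[metric] for r in results if metric in r])
        if values.size == 0:
            continue
        aggregation[metric] = {
            "mean": round(float(np.mean(values)), 4),
            "std_dev": round(float(np.std(values)), 4),  # population std dev, as in the 0.1225 example
            "min": round(float(np.min(values)), 4),
            "max": round(float(np.max(values)), 4),
            "variance": round(float(np.var(values)), 4),
            "count": int(values.size),
        }
    return {"rmse_metrics": aggregation}

# Reproducing the three-sample example batch used throughout this guide:
batch = [
    {"context_relevance": 0.20, "context_utilization": 0.75, "completeness": 0.75, "adherence": 0.0},
    {"context_relevance": 0.50, "context_utilization": 0.90, "completeness": 0.85, "adherence": 0.8},
    {"context_relevance": 0.35, "context_utilization": 0.60, "completeness": 0.55, "adherence": 1.0},
]
print(rmse_aggregation_sketch(batch)["rmse_metrics"]["context_relevance"])
# {'mean': 0.35, 'std_dev': 0.1225, 'min': 0.2, 'max': 0.5, 'variance': 0.015, 'count': 3}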
2. Per-Metric Statistics
Method: AUCROCCalculator.compute_per_metric_statistics(results)
Provides a detailed statistical breakdown of each TRACE metric without requiring ground truth.
Output Structure:
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
Interpretation:
- Mean/Median: Central tendency of metric values
- Percentile 25/75: Distribution quartiles
- Perfect Count: How many evaluations scored >= 0.95
- Poor Count: How many evaluations scored < 0.3
- Sample Count: Total number of evaluations
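As with the aggregation above, the per-metric breakdown can be approximated with NumPy. This is a sketch rather than the AUCROCCalculator source; np.percentile with its default linear interpolation reproduces the 0.275 / 0.425 quartiles shown above for three samples.

import numpy as np

def per_metric_statistics_sketch(results,
                                 metrics=("context_relevance", "context_utilization",
                                          "completeness", "adherence")):
    """Minimal sketch of the distribution statistics described above."""
    stats = {}
    for metric in metrics:
        values = np.array([r[metric] for r in results if metric in r])
        if values.size == 0:
            continue
        stats[metric] = {
            "mean": round(float(np.mean(values)), 4),
            "median": round(float(np.median(values)), 4),
            "std_dev": round(float(np.std(values)), 4),
            "min": round(float(np.min(values)), 4),
            "max": round(float(np.max(values)), 4),
            "percentile_25": round(float(np.percentile(values, 25)), 4),
            "percentile_75": round(float(np.percentile(values, 75)), 4),
            "perfect_count": int(np.sum(values >= 0.95)),  # scores counted as "perfect"
            "poor_count": int(np.sum(values < 0.3)),       # scores counted as "poor"
            "sample_count": int(values.size),
        }
    return {"per_metric_statistics": stats}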
UI Display
RMSE Aggregation Metrics (Metric Consistency)
Shows mean and standard deviation for each metric:
Relevance 0.350 ±0.123
Utilization 0.750 ±0.123
Completeness 0.717 ±0.125
Adherence 0.600 ±0.432
What it means:
- Lower Std Dev = More consistent metric
- High Std Dev (like Adherence 0.432) = Metric varies significantly across evaluations
Per-Metric Statistics (Distribution)
Shows distribution characteristics:
Relevance Mean 0.350 (Median: 0.350)
Utilization Mean 0.750 (Median: 0.750)
Completeness Mean 0.717 (Median: 0.750)
Adherence Mean 0.600 (Median: 0.800)
Expandable Details Include:
- All percentiles
- Perfect score count (>=0.95)
- Poor score count (<0.3)
- Min/max values
JSON Download Structure
Complete Results JSON
All metrics are now included in the downloaded JSON:
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
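Because everything lands in one file, the download can also be analyzed offline. A minimal sketch, assuming the file was saved as complete_results.json (the real download filename may differ):

import json

with open("complete_results.json") as f:  # filename is an assumption
    results = json.load(f)

# Print the consistency summary for each TRACE metric, mirroring the UI display
for metric, stats in results["rmse_metrics"].items():
    print(f"{metric}: {stats['mean']:.3f} ±{stats['std_dev']:.3f}")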
How to Use These Metrics
1. Identify Inconsistent Metrics
Look at RMSE Aggregation Std Dev:
- Std Dev > 0.3 = High variance (unstable metric)
- Std Dev < 0.1 = Low variance (stable metric)
Example:
Adherence Std Dev: 0.432 <- Highly variable; investigate this metric's consistency
2. Find Problem Areas
Look at Per-Metric Statistics:
- Poor Count > 0 = Metric has low scores (< 0.3)
- Perfect Count = 0 = No perfect scores
Example:
Context Relevance Poor Count: 1 <- Some queries have low relevance
Adherence Poor Count: 1 <- Some responses have hallucinations
3. Distribution Analysis
Compare Mean vs Median:
- If Mean ≈ Median: Symmetric distribution
- If Mean > Median: Right-skewed (some high values)
- If Mean < Median: Left-skewed (some low values)
Example:
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
4. Evaluate Percentile Range
Use the 25th and 75th percentiles to understand the typical range:
Example:
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
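The four checks above are easy to automate against the rmse_metrics and per_metric_statistics blocks. The thresholds come straight from this guide; the function name and report format are illustrative:

def flag_metrics(rmse_metrics, per_metric_statistics):
    """Apply the checks above: consistency, problem areas, skew, and typical range."""
    for metric, agg in rmse_metrics.items():
        stats = per_metric_statistics[metric]
        if agg["std_dev"] > 0.3:
            print(f"{metric}: high variance (std dev {agg['std_dev']:.3f}) - unstable metric")
        if stats["poor_count"] > 0:
            print(f"{metric}: {stats['poor_count']} evaluation(s) scored below 0.3")
        skew = ("right-skewed" if stats["mean"] > stats["median"]
                else "left-skewed" if stats["mean"] < stats["median"]
                else "symmetric")
        print(f"{metric}: {skew}, typical range "
              f"{stats['percentile_25']:.3f}-{stats['percentile_75']:.3f}")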
Integration with Evaluation Process
Automatic Computation
RMSE and per-metric statistics are computed automatically during evaluate_batch():
def evaluate_batch(self, test_cases):
    # ... evaluation code ...

    # Automatically compute metrics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)

    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats

    return results
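Callers therefore get the new fields without any extra calls. A minimal usage sketch, assuming the evaluator class in advanced_rag_evaluator.py is named AdvancedRAGEvaluator and takes no constructor arguments (both are assumptions):

from advanced_rag_evaluator import AdvancedRAGEvaluator  # class name is an assumption

evaluator = AdvancedRAGEvaluator()
results = evaluator.evaluate_batch(test_cases)  # test_cases prepared as for any batch evaluation

# Both blocks are present on the returned dict
print(results["rmse_metrics"]["adherence"]["std_dev"])
print(results["per_metric_statistics"]["adherence"]["median"])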
No Ground Truth Required
Unlike RMSE against ground truth or AUC-ROC calculations:
- No ground truth needed
- Works with actual evaluation results
- Provides consistency/distribution insights
- Suitable for real-world evaluation
Example Analysis Workflow
Scenario: Evaluation Results
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
Step 1: Check RMSE Aggregation
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
Step 2: Check Per-Metric Statistics
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
Step 3: Investigate Issues
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check retrieved documents and response
Step 4: Recommendation
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
Comparison with Previous Approach
Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC in JSON
After
- Overall averages + statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in JSON
- Perfect/poor count indicators
Technical Details
RMSE Aggregation Formula
For each metric, the reported std_dev is the population standard deviation of that metric's scores across the batch (equivalently, the root mean squared deviation of the scores from their mean):

$$\text{std\_dev} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$

Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations
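As a quick check against the adherence scores from the example batch (0.0, 0.8, 1.0):

$$\mu = \frac{0.0 + 0.8 + 1.0}{3} = 0.6, \qquad \sqrt{\frac{(0.0-0.6)^2 + (0.8-0.6)^2 + (1.0-0.6)^2}{3}} = \sqrt{0.1867} \approx 0.432$$

which matches the Adherence std dev (±0.432) shown in the UI example.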
Per-Metric Statistics
- Percentile k: Value below which k% of data falls
- Perfect Count: Number of evaluations where metric >= 0.95
- Poor Count: Number of evaluations where metric < 0.3
Files Modified
advanced_rag_evaluator.py
- Added compute_rmse_aggregation_for_batch() method
- Added compute_per_metric_statistics() method
- Updated evaluate_batch() to compute metrics
streamlit_app.py
- Added RMSE Aggregation section to UI
- Added Per-Metric Statistics section to UI
- Updated JSON download to include both metrics
Next Steps
Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations
Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root cause analysis for poor scores
Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select embedding model based on metric performance