TRACE Metrics RMSE Aggregation
Overview
RMSE (Root Mean Squared Error) aggregation for TRACE metrics provides a quantitative measure of consistency and quality across all four evaluation dimensions when using GPT-based labeling.
What is RMSE Aggregation?
RMSE aggregation is a statistical method that:
- Measures Consistency: Penalizes inconsistency across the four TRACE metrics
- Identifies Imbalances: Detects when some metrics are much higher/lower than others
- Quantifies Quality: Provides a single score representing overall evaluation coherence
TRACE Metrics Overview
The four core TRACE metrics evaluated with GPT labeling are:
| Metric | Description | Formula | Range |
|---|---|---|---|
| Context Relevance (R) | Fraction of retrieved context relevant to the query | Relevant sentences / Total retrieved sentences | 0-1 |
| Context Utilization (U) | Fraction of retrieved context actually used in response | Used sentences / Total retrieved sentences | 0-1 |
| Completeness (C) | Fraction of relevant information covered by response | (Relevant AND Used) / Relevant | 0-1 |
| Adherence (A) | Whether response is grounded in context (no hallucinations) | Fully supported sentences / Total sentences | 0-1 |
RMSE Aggregation Calculation
Single Evaluation RMSE
For a single TRACE evaluation with 4 metric scores, RMSE aggregation is calculated as:
```
μ    = (R + A + C + U) / 4                          [Mean of all metrics]
RMSE = √(((R-μ)² + (A-μ)² + (C-μ)² + (U-μ)²) / 4)
```
Interpretation:
- RMSE = 0: All metrics are perfectly equal (perfect consistency)
- RMSE < 0.15: Metrics are well-balanced (good quality)
- RMSE 0.15-0.30: Metrics show some imbalance (acceptable)
- RMSE > 0.30: Significant inconsistency between metrics (quality issue)
Multiple Evaluations RMSE
For comparing predicted vs ground truth across multiple evaluations:
```
For each metric M in [R, A, C, U]:
    RMSE_M = √(Σ_i (predicted_M_i - truth_M_i)² / n)

Aggregated RMSE   = √(Σ_M RMSE_M² / 4)
Consistency Score = 1.0 - min(Aggregated RMSE, 1.0)
```
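The batch formulas above can be sketched directly in NumPy. This is a minimal illustration; the function and argument names are assumptions for this example, not the project's API:

```python
import numpy as np

def aggregate_rmse(predicted: dict, truth: dict) -> dict:
    """Per-metric RMSE, aggregated RMSE, and consistency score.

    `predicted` and `truth` map each metric name to equal-length score lists
    (one entry per evaluation). Illustrative sketch, not library code.
    """
    per_metric = {}
    for name in predicted:
        p = np.asarray(predicted[name], dtype=float)
        t = np.asarray(truth[name], dtype=float)
        # RMSE_M = sqrt(mean squared error) for this metric
        per_metric[name] = float(np.sqrt(np.mean((p - t) ** 2)))

    # Aggregated RMSE = root of the mean of the squared per-metric RMSEs
    aggregated = float(np.sqrt(np.mean([r ** 2 for r in per_metric.values()])))
    return {
        "per_metric_rmse": per_metric,
        "aggregated_rmse": aggregated,
        "consistency_score": 1.0 - min(aggregated, 1.0),
    }
```

With a single metric and errors [0.2, 0.0], this yields a per-metric RMSE of √0.02 ≈ 0.141 and a consistency score of ≈ 0.859.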
Implementation in Code
1. Single Evaluation Aggregation
Added to AdvancedTRACEScores class:
```python
def rmse_aggregation(self) -> float:
    """Calculate RMSE aggregation across all four TRACE metrics.

    Returns:
        RMSE value (0-1), where 0 = perfect consistency.
    """
    metrics = [
        self.context_relevance,
        self.context_utilization,
        self.completeness,
        self.adherence,
    ]
    mean = self.average()
    squared_errors = [(m - mean) ** 2 for m in metrics]
    mse = np.mean(squared_errors)
    rmse = np.sqrt(mse)
    return float(rmse)
```
Usage:
```python
scores = evaluator.evaluate(question, response, documents)
rmse = scores.rmse_aggregation()  # Returns a 0-1 value

# Included automatically in to_dict()
score_dict = scores.to_dict()
# Contains: "rmse_aggregation": 0.15
```
2. Single Evaluation Ground Truth Comparison
Added to RMSECalculator class:
```python
@staticmethod
def compute_rmse_single_trace_evaluation(
    predicted_scores: AdvancedTRACEScores,
    ground_truth_scores: AdvancedTRACEScores
) -> Dict[str, float]:
    """Compute RMSE for a single TRACE evaluation against ground truth."""
    metrics = {
        "context_relevance": (predicted_scores.context_relevance,
                              ground_truth_scores.context_relevance),
        "context_utilization": (predicted_scores.context_utilization,
                                ground_truth_scores.context_utilization),
        "completeness": (predicted_scores.completeness,
                         ground_truth_scores.completeness),
        "adherence": (predicted_scores.adherence,
                      ground_truth_scores.adherence),
    }

    # With a single evaluation, each per-metric RMSE reduces to the
    # absolute error between prediction and ground truth
    rmse_per_metric = {
        name: float(abs(pred - truth))
        for name, (pred, truth) in metrics.items()
    }

    # Aggregate: root of the mean of the *squared* per-metric RMSEs
    aggregated_rmse = np.sqrt(
        np.mean([v ** 2 for v in rmse_per_metric.values()])
    )

    return {
        "per_metric": rmse_per_metric,
        "aggregated_rmse": float(aggregated_rmse)
    }
```
Usage:
```python
predicted = evaluator.evaluate(question, response, documents)
ground_truth = AdvancedTRACEScores(...)  # From labeled data

rmse_results = RMSECalculator.compute_rmse_single_trace_evaluation(
    predicted, ground_truth
)
# Returns:
# {
#     "per_metric": {
#         "context_relevance": 0.05,
#         "context_utilization": 0.08,
#         "completeness": 0.03,
#         "adherence": 0.12
#     },
#     "aggregated_rmse": 0.078
# }
```
3. Batch Evaluation RMSE Aggregation
Added to RMSECalculator class:
```python
@staticmethod
def compute_trace_rmse_aggregation(results: List[Dict]) -> Dict[str, float]:
    """Compute RMSE aggregation across TRACE metrics for multiple evaluations.

    Args:
        results: List of evaluation results with metrics and ground truth.

    Returns:
        Dictionary with per-metric RMSE, aggregated RMSE, and consistency score.
    """
```
Usage:
```python
# After evaluating multiple test cases
results = [
    {
        "metrics": {"context_relevance": 0.8, "context_utilization": 0.75, ...},
        "ground_truth_scores": {"context_relevance": 0.82, ...}
    },
    # ... more results
]

aggregation = RMSECalculator.compute_trace_rmse_aggregation(results)
# Returns:
# {
#     "per_metric_rmse": {
#         "context_relevance": 0.045,
#         "context_utilization": 0.062,
#         "completeness": 0.038,
#         "adherence": 0.091
#     },
#     "aggregated_rmse": 0.062,
#     "consistency_score": 0.938,
#     "num_evaluations": 50,
#     "evaluated_metrics": ["context_relevance", ...]
# }
```
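Since only the signature of the batch method is shown above, here is a minimal sketch of how such an aggregation could be implemented, consistent with the documented formulas and return format. The field names come from the example output; the implementation details are an assumption, not the project's actual code:

```python
import numpy as np
from typing import Dict, List

# Metric names taken from the documented output format
METRICS = ["context_relevance", "context_utilization", "completeness", "adherence"]

def compute_trace_rmse_aggregation_sketch(results: List[Dict]) -> Dict:
    """Hypothetical batch aggregation matching the documented return format."""
    per_metric_rmse = {}
    for metric in METRICS:
        # Collect (predicted, truth) pairs where both sides report this metric
        pairs = [
            (r["metrics"][metric], r["ground_truth_scores"][metric])
            for r in results
            if metric in r.get("metrics", {})
            and metric in r.get("ground_truth_scores", {})
        ]
        if pairs:
            errors = [(p - t) ** 2 for p, t in pairs]
            per_metric_rmse[metric] = float(np.sqrt(np.mean(errors)))

    # Aggregated RMSE = root of the mean of the squared per-metric RMSEs
    aggregated = float(np.sqrt(np.mean([v ** 2 for v in per_metric_rmse.values()])))
    return {
        "per_metric_rmse": per_metric_rmse,
        "aggregated_rmse": aggregated,
        "consistency_score": 1.0 - min(aggregated, 1.0),
        "num_evaluations": len(results),
        "evaluated_metrics": list(per_metric_rmse),
    }
```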
Practical Examples
Example 1: Balanced Metrics
Evaluation:
- Context Relevance: 0.85
- Context Utilization: 0.82
- Completeness: 0.88
- Adherence: 0.84
Calculation:
```
μ = (0.85 + 0.82 + 0.88 + 0.84) / 4 = 0.8475
Deviations: [0.0025, -0.0275, 0.0325, -0.0075]
MSE  = (0.0025² + 0.0275² + 0.0325² + 0.0075²) / 4 = 0.000469
RMSE = √0.000469 ≈ 0.022
```
Interpretation: Excellent consistency - all metrics are very similar.
Example 2: Imbalanced Metrics
Evaluation:
- Context Relevance: 0.95
- Context Utilization: 0.50
- Completeness: 0.85
- Adherence: 0.70
Calculation:
```
μ = (0.95 + 0.50 + 0.85 + 0.70) / 4 = 0.75
Deviations: [0.20, -0.25, 0.10, -0.05]
MSE  = (0.04 + 0.0625 + 0.01 + 0.0025) / 4 = 0.0288
RMSE = √0.0288 ≈ 0.170
```
Interpretation: Noticeable inconsistency. High relevance but low utilization suggests one of:
- Retrieved context not being useful despite its relevance
- A retrieval issue in the RAG pipeline
- Response generation not leveraging the available context
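Both worked examples can be verified with a few lines of NumPy (a standalone check, not part of the evaluator API):

```python
import numpy as np

def trace_rmse(scores):
    """RMSE of the four TRACE metrics around their mean (population form, n = 4)."""
    scores = np.asarray(scores, dtype=float)
    return float(np.sqrt(np.mean((scores - scores.mean()) ** 2)))

balanced   = trace_rmse([0.85, 0.82, 0.88, 0.84])   # Example 1
imbalanced = trace_rmse([0.95, 0.50, 0.85, 0.70])   # Example 2
print(round(balanced, 3), round(imbalanced, 3))     # → 0.022 0.17
```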
Interpretation Guide
RMSE Aggregation Levels
| Range | Quality | Meaning | Action |
|---|---|---|---|
| 0.00-0.10 | Excellent | Metrics perfectly balanced | No action needed |
| 0.10-0.20 | Good | Slight metric variation | Monitor and optimize |
| 0.20-0.30 | Acceptable | Moderate inconsistency | Investigate specific metrics |
| 0.30+ | Poor | High inconsistency | Review RAG pipeline |
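The bands in the table above can be encoded in a small helper for automated reporting (the function name and return strings are illustrative, not part of the library):

```python
def interpret_rmse(rmse: float) -> str:
    """Map an RMSE aggregation value to the quality bands in the table above."""
    if rmse < 0.10:
        return "Excellent"   # metrics well balanced
    elif rmse < 0.20:
        return "Good"        # slight metric variation
    elif rmse < 0.30:
        return "Acceptable"  # moderate inconsistency
    return "Poor"            # high inconsistency: review RAG pipeline
```

For instance, `interpret_rmse(0.022)` classifies Example 1 as "Excellent", while `interpret_rmse(0.170)` places Example 2 in the "Good" band despite the visible imbalance, which is why inspecting per-metric patterns remains important.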
Common Patterns
Low Context Utilization, High Relevance
- RMSE will be high
- Indicates good retrieval but poor generation
- Fix: Improve prompt, LLM instructions
Low Completeness, High Adherence
- RMSE will be moderate-high
- Indicates grounded but incomplete responses
- Fix: Improve retrieval coverage
Balanced but Low All Metrics
- RMSE will be low but overall quality low
- Indicates systematic issue across pipeline
- Fix: Review entire RAG pipeline
Consistency Score
The consistency score (0-1) is the inverse of aggregated RMSE:
Consistency Score = 1.0 - min(Aggregated RMSE, 1.0)
- Score = 1.0: Perfect consistency (RMSE = 0)
- Score = 0.94: Excellent consistency (RMSE = 0.06)
- Score = 0.80: Good consistency (RMSE = 0.20)
- Score < 0.70: Poor consistency (RMSE > 0.30)
Use Cases
1. Evaluation Quality Monitoring
Track RMSE aggregation over time to detect RAG pipeline degradation:
```python
# Weekly evaluation report
rmse_trend = [0.08, 0.09, 0.12, 0.15, 0.20]  # Degrading
# Alert: pipeline quality declining
```
2. A/B Testing
Compare RAG configurations using RMSE:
```python
config_a_rmse = 0.15  # Some imbalance
config_b_rmse = 0.08  # Better balance
# Choose config_b
```
3. Metric Target Setting
Use RMSE to set balanced improvement goals:
```
Current: R=0.95, U=0.50, C=0.85, A=0.70   (RMSE = 0.170)
Target:  R=0.87, U=0.85, C=0.85, A=0.82   (RMSE = 0.018)

# Focus on improving utilization from 0.50 → 0.85
```
4. Problem Diagnosis
High RMSE with specific pattern identifies problems:
```python
if high_relevance and low_utilization:
    # Problem: retrieval good, generation poor
    focus_on = "LLM prompting and context usage"
elif low_completeness and high_adherence:
    # Problem: too conservative, missing info
    focus_on = "Retrieval coverage and context richness"
```
JSON Output Format
When scores are exported to JSON (e.g., in JSON downloads):
```json
{
  "context_relevance": 0.85,
  "context_utilization": 0.82,
  "completeness": 0.88,
  "adherence": 0.84,
  "average": 0.8475,
  "rmse_aggregation": 0.022,
  "overall_supported": true,
  "fully_supported_sentences": 8,
  "partially_supported_sentences": 1,
  "unsupported_sentences": 0
}
```
Advanced Analysis
Variance-Covariance Structure
RMSE aggregation reveals which metrics co-vary:
```python
# High utilization always paired with high completeness
# (low RMSE between these two, high RMSE overall)
# → indicates a utilization → completeness dependency

# Low relevance but high utilization
# (high RMSE)
# → indicates potential hallucination risk
```
Statistical Bounds
As a rough calibration (assuming the four metric scores are independent):
- Fully random (uniform) metrics: expected RMSE of roughly 0.24 (high variance)
- Well-tuned system: < 0.15 (low variance)
- Perfect system: 0.00 (no variance)
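The random-metrics baseline can be sanity-checked with a quick Monte Carlo simulation (assuming four independent uniform scores per evaluation; this is a standalone check, not library code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 100,000 evaluations of four independent uniform-random metric scores
scores = rng.uniform(0.0, 1.0, size=(100_000, 4))

# RMSE aggregation per evaluation: deviation of each metric from the row mean
rmse = np.sqrt(np.mean((scores - scores.mean(axis=1, keepdims=True)) ** 2, axis=1))

print(round(float(rmse.mean()), 3))  # expected RMSE of fully random metrics
```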
References
- TRACE Framework: RAGBench Paper (arXiv:2407.11005)
- RMSE Metric: Statistical standard measure of error
- Consistency Analysis: Quality assurance in ML/AI systems