
TRACE Metrics RMSE Aggregation

Overview

RMSE (Root Mean Squared Error) aggregation for TRACE metrics provides a quantitative measure of consistency and quality across all four evaluation dimensions when using GPT-based labeling.

What is RMSE Aggregation?

RMSE aggregation is a statistical method that:

  1. Measures Consistency: Penalizes inconsistency across the four TRACE metrics
  2. Identifies Imbalances: Detects when some metrics are much higher/lower than others
  3. Quantifies Quality: Provides a single score representing overall evaluation coherence

TRACE Metrics Overview

The four core TRACE metrics evaluated with GPT labeling are:

| Metric | Description | Formula | Range |
|---|---|---|---|
| Context Relevance (R) | Fraction of retrieved context relevant to the query | Relevant sentences / Total retrieved sentences | 0-1 |
| Context Utilization (U) | Fraction of retrieved context actually used in the response | Used sentences / Total retrieved sentences | 0-1 |
| Completeness (C) | Fraction of relevant information covered by the response | (Relevant AND Used) / Relevant | 0-1 |
| Adherence (A) | Whether the response is grounded in the context (no hallucinations) | Fully supported sentences / Total sentences | 0-1 |

RMSE Aggregation Calculation

Single Evaluation RMSE

For a single TRACE evaluation with 4 metric scores, RMSE aggregation is calculated as:

μ = (R + A + C + U) / 4                    [Mean of all metrics]
RMSE = √(((R-μ)² + (A-μ)² + (C-μ)² + (U-μ)²) / 4)

Interpretation:

  • RMSE = 0: All metrics are perfectly equal (perfect consistency)
  • RMSE < 0.15: Metrics are well-balanced (good quality)
  • RMSE 0.15-0.30: Metrics show some imbalance (acceptable)
  • RMSE > 0.30: Significant inconsistency between metrics (quality issue)
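The single-evaluation formula can be sketched as a standalone helper (plain NumPy, independent of the project's classes; the function name here is illustrative):

```python
import numpy as np

def rmse_aggregation(metrics: list) -> float:
    """Population RMSE of the four TRACE scores around their own mean.

    Returns 0.0 when all metrics agree exactly; larger values mean imbalance.
    """
    mean = float(np.mean(metrics))
    return float(np.sqrt(np.mean([(m - mean) ** 2 for m in metrics])))
```

Note that this is exactly the population standard deviation of the four scores, so `np.std(metrics)` (with the default `ddof=0`) gives the same value.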

Multiple Evaluations RMSE

For comparing predicted vs ground truth across multiple evaluations:

For each metric M in [R, A, C, U]:
    RMSE_M = √(Σ(predicted_M_i - truth_M_i)² / n)

Aggregated RMSE = √(Σ(RMSE_M)² / 4)
Consistency Score = 1.0 - min(Aggregated RMSE, 1.0)

Implementation in Code

1. Single Evaluation Aggregation

Added to AdvancedTRACEScores class:

def rmse_aggregation(self) -> float:
    """Calculate RMSE aggregation across all four TRACE metrics.
    
    Returns:
        RMSE value (0-1), where 0 = perfect consistency
    """
    metrics = [
        self.context_relevance,
        self.context_utilization,
        self.completeness,
        self.adherence
    ]
    mean = self.average()
    squared_errors = [(m - mean) ** 2 for m in metrics]
    mse = np.mean(squared_errors)
    rmse = np.sqrt(mse)
    return float(rmse)

Usage:

scores = evaluator.evaluate(question, response, documents)
rmse = scores.rmse_aggregation()  # Returns 0-1 value

# Included automatically in to_dict()
score_dict = scores.to_dict()
# Contains: "rmse_aggregation": 0.15

2. Single Evaluation Ground Truth Comparison

Added to RMSECalculator class:

@staticmethod
def compute_rmse_single_trace_evaluation(
    predicted_scores: AdvancedTRACEScores,
    ground_truth_scores: AdvancedTRACEScores
) -> Dict[str, float]:
    """Compute RMSE for a single TRACE evaluation against ground truth."""
    
    metrics = {
        "context_relevance": (predicted_scores.context_relevance, 
                              ground_truth_scores.context_relevance),
        "context_utilization": (predicted_scores.context_utilization, 
                                ground_truth_scores.context_utilization),
        "completeness": (predicted_scores.completeness, 
                         ground_truth_scores.completeness),
        "adherence": (predicted_scores.adherence, 
                      ground_truth_scores.adherence)
    }
    
    # Per-metric RMSE: for a single prediction/truth pair this
    # reduces to the absolute error
    rmse_per_metric = {}
    for metric_name, (pred, truth) in metrics.items():
        rmse_per_metric[metric_name] = float(abs(pred - truth))
    
    # Aggregate: root mean of the squared per-metric RMSE values
    aggregated_rmse = np.sqrt(np.mean([v ** 2 for v in rmse_per_metric.values()]))
    
    return {
        "per_metric": rmse_per_metric,
        "aggregated_rmse": float(aggregated_rmse)
    }

Usage:

predicted = evaluator.evaluate(question, response, documents)
ground_truth = AdvancedTRACEScores(...)  # From labeled data

rmse_results = RMSECalculator.compute_rmse_single_trace_evaluation(
    predicted, ground_truth
)
# Returns:
# {
#     "per_metric": {
#         "context_relevance": 0.05,
#         "context_utilization": 0.08,
#         "completeness": 0.03,
#         "adherence": 0.12
#     },
#     "aggregated_rmse": 0.078
# }

3. Batch Evaluation RMSE Aggregation

Added to RMSECalculator class:

@staticmethod
def compute_trace_rmse_aggregation(results: List[Dict]) -> Dict[str, float]:
    """Compute RMSE aggregation across TRACE metrics for multiple evaluations.
    
    Args:
        results: List of evaluation results with metrics and ground truth
        
    Returns:
        Dictionary with per-metric RMSE, aggregated RMSE, and consistency score
    """

Usage:

# After evaluating multiple test cases
results = [
    {
        "metrics": {"context_relevance": 0.8, "context_utilization": 0.75, ...},
        "ground_truth_scores": {"context_relevance": 0.82, ...}
    },
    # ... more results
]

aggregation = RMSECalculator.compute_trace_rmse_aggregation(results)
# Returns:
# {
#     "per_metric_rmse": {
#         "context_relevance": 0.045,
#         "context_utilization": 0.062,
#         "completeness": 0.038,
#         "adherence": 0.091
#     },
#     "aggregated_rmse": 0.062,
#     "consistency_score": 0.938,
#     "num_evaluations": 50,
#     "evaluated_metrics": ["context_relevance", ...]
# }

Practical Examples

Example 1: Balanced Metrics

Evaluation:

  • Context Relevance: 0.85
  • Context Utilization: 0.82
  • Completeness: 0.88
  • Adherence: 0.84

Calculation:

μ = (0.85 + 0.82 + 0.88 + 0.84) / 4 = 0.8475
Deviations: [0.0025, -0.0275, 0.0325, -0.0075]
MSE = (0.0025² + 0.0275² + 0.0325² + 0.0075²) / 4 = 0.000469
RMSE = √0.000469 = 0.022

Interpretation: Excellent consistency - all metrics are very similar.
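The arithmetic above can be double-checked in two lines, since this RMSE is just the population standard deviation of the four scores:

```python
import numpy as np

scores = [0.85, 0.82, 0.88, 0.84]
rmse = float(np.std(scores))  # population std (ddof=0) == RMSE around the mean
print(round(rmse, 3))  # → 0.022
```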

Example 2: Imbalanced Metrics

Evaluation:

  • Context Relevance: 0.95
  • Context Utilization: 0.50
  • Completeness: 0.85
  • Adherence: 0.70

Calculation:

μ = (0.95 + 0.50 + 0.85 + 0.70) / 4 = 0.75
Deviations: [0.20, -0.25, 0.10, -0.05]
MSE = (0.04 + 0.0625 + 0.01 + 0.0025) / 4 = 0.0288
RMSE = √0.0288 = 0.170

Interpretation: Concerning inconsistency. High relevance but low utilization suggests either:

  • Retrieved context not useful despite relevance
  • RAG pipeline retrieval issue
  • Response generation not leveraging available context

Interpretation Guide

RMSE Aggregation Levels

| Range | Quality | Meaning | Action |
|---|---|---|---|
| 0.00-0.10 | Excellent | Metrics well balanced | No action needed |
| 0.10-0.20 | Good | Slight metric variation | Monitor and optimize |
| 0.20-0.30 | Acceptable | Moderate inconsistency | Investigate specific metrics |
| 0.30+ | Poor | High inconsistency | Review RAG pipeline |

Common Patterns

Low Context Utilization, High Relevance

  • RMSE will be high
  • Indicates good retrieval but poor generation
  • Fix: Improve prompt, LLM instructions

Low Completeness, High Adherence

  • RMSE will be moderate-high
  • Indicates grounded but incomplete responses
  • Fix: Improve retrieval coverage

All Metrics Balanced but Low

  • RMSE will be low even though overall quality is poor
  • Indicates a systematic issue across the pipeline
  • Fix: Review entire RAG pipeline

Consistency Score

The consistency score (0-1) is the inverse of aggregated RMSE:

Consistency Score = 1.0 - min(Aggregated RMSE, 1.0)
  • Score = 1.0: Perfect consistency (RMSE = 0)
  • Score = 0.94: Excellent consistency (RMSE = 0.06)
  • Score = 0.80: Good consistency (RMSE = 0.20)
  • Score < 0.70: Poor consistency (RMSE > 0.30)

Use Cases

1. Evaluation Quality Monitoring

Track RMSE aggregation over time to detect RAG pipeline degradation:

# Weekly evaluation report
rmse_trend = [0.08, 0.09, 0.12, 0.15, 0.20]  # Degrading
# Alert: Pipeline quality declining

2. A/B Testing

Compare RAG configurations using RMSE:

config_a_rmse = 0.15  # Some imbalance
config_b_rmse = 0.08  # Better balance
# Choose config_b

3. Metric Target Setting

Use RMSE to set balanced improvement goals:

Current: R=0.95, U=0.50, C=0.85, A=0.70 (RMSE=0.170)
Target:  R=0.87, U=0.85, C=0.85, A=0.82 (RMSE=0.018)
# Focus on improving utilization from 0.50→0.85

4. Problem Diagnosis

High RMSE with specific pattern identifies problems:

if high_relevance and low_utilization:
    # Problem: Retrieval good, generation poor
    focus_on = "LLM prompting and context usage"
elif low_completeness and high_adherence:
    # Problem: Too conservative, missing info
    focus_on = "Retrieval coverage and context richness"
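A runnable version of this diagnosis pattern might look like the following; the 0.8 ("high") and 0.6 ("low") thresholds are illustrative assumptions, not project constants:

```python
def diagnose(relevance: float, utilization: float,
             completeness: float, adherence: float) -> str:
    """Map a metric imbalance pattern to a suggested focus area.

    Thresholds (0.8 = high, 0.6 = low) are illustrative only.
    """
    if relevance > 0.8 and utilization < 0.6:
        # Retrieval good, generation poor
        return "LLM prompting and context usage"
    if completeness < 0.6 and adherence > 0.8:
        # Grounded but too conservative, missing info
        return "Retrieval coverage and context richness"
    return "No dominant imbalance detected"

# The imbalanced scores from Example 2
print(diagnose(0.95, 0.50, 0.85, 0.70))  # → LLM prompting and context usage
```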

JSON Output Format

When scores are exported to JSON (e.g., as a downloadable report), the output includes the RMSE aggregation:

{
  "context_relevance": 0.85,
  "context_utilization": 0.82,
  "completeness": 0.88,
  "adherence": 0.84,
  "average": 0.8475,
  "rmse_aggregation": 0.022,
  "overall_supported": true,
  "fully_supported_sentences": 8,
  "partially_supported_sentences": 1,
  "unsupported_sentences": 0
}
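As a sanity check on exported files, the reported average and rmse_aggregation can be recomputed from the four raw scores (field names as in the sample above):

```python
import numpy as np

exported = {
    "context_relevance": 0.85,
    "context_utilization": 0.82,
    "completeness": 0.88,
    "adherence": 0.84,
    "average": 0.8475,
    "rmse_aggregation": 0.022,
}

scores = [exported[k] for k in
          ("context_relevance", "context_utilization", "completeness", "adherence")]
# Mean should match the exported average; population std matches the RMSE
assert round(float(np.mean(scores)), 4) == exported["average"]
assert round(float(np.std(scores)), 3) == exported["rmse_aggregation"]
```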

Advanced Analysis

Variance-Covariance Structure

RMSE aggregation reveals which metrics co-vary:

# High utilization always with high completeness
# (low RMSE with these two, high RMSE overall)
# → Indicates utilization → completeness dependency

# Low relevance but high utilization
# (high RMSE)
# → Indicates potential hallucination risk

Statistical Bounds

For metrics scattered independently across the 0-1 range, rough expected RMSE values are:

  • Random metrics: ~0.27 (high variance)
  • Well-tuned system: < 0.15 (low variance)
  • Perfect system: 0.00 (no variance)

References

  • TRACE Framework: RAGBench Paper (arXiv:2407.11005)
  • RMSE Metric: Statistical standard measure of error
  • Consistency Analysis: Quality assurance in ML/AI systems