
RMSE Metrics Implementation Guide

Overview

RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. In this context, the "RMSE aggregation" is the root-mean-square deviation of each TRACE metric from its batch mean (i.e., the population standard deviation), so no ground truth is required. These metrics are computed automatically during batch evaluation and included in both the UI display and the JSON downloads.

What Was Implemented

1. RMSE Aggregation for Batch Evaluation

Method: RMSECalculator.compute_rmse_aggregation_for_batch(results)

Computes consistency metrics for each TRACE metric across all evaluations. Shows how much each metric varies across the batch.

Output Structure:

{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}

Interpretation:

  • Mean: Average score for that metric across all evaluations
  • Std Dev: Spread of scores across the batch; lower means more consistent
  • Min/Max: Range of values observed
  • Variance: Square of the standard deviation
  • Count: Number of evaluations
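The aggregation above can be sketched with the standard library alone. This is an illustrative reimplementation, not the actual `RMSECalculator` code; it assumes each result dict exposes the four TRACE scores under the keys shown in the output structure:

```python
import math

# TRACE metric keys as they appear in the output structure above.
TRACE_METRICS = ["context_relevance", "context_utilization", "completeness", "adherence"]

def rmse_aggregation(results):
    """Compute mean/std_dev/min/max/variance/count per metric (population std dev)."""
    out = {}
    for metric in TRACE_METRICS:
        values = [r[metric] for r in results if metric in r]
        if not values:
            continue
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n  # population variance
        out[metric] = {
            "mean": round(mean, 4),
            "std_dev": round(math.sqrt(variance), 4),
            "min": round(min(values), 4),
            "max": round(max(values), 4),
            "variance": round(variance, 4),
            "count": n,
        }
    return {"rmse_metrics": out}
```

Feeding it the three-sample scores used throughout this guide reproduces the figures above (e.g., context_relevance mean 0.35, std dev 0.1225).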

2. Per-Metric Statistics

Method: AUCROCCalculator.compute_per_metric_statistics(results)

Provides detailed statistical breakdown of each TRACE metric without requiring ground truth.

Output Structure:

{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}

Interpretation:

  • Mean/Median: Central tendency of metric values
  • Percentile 25/75: Distribution quartiles
  • Perfect Count: How many evaluations scored >= 0.95
  • Poor Count: How many evaluations scored < 0.3
  • Sample Count: Total number of evaluations
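These statistics can likewise be sketched without ground truth. The helper below is illustrative (not the actual `AUCROCCalculator` code); the percentile uses linear interpolation, which is an assumption about how the implementation computes quartiles:

```python
import statistics

def percentile(sorted_vals, p):
    """Linear-interpolated percentile over a pre-sorted list, p in [0, 100]."""
    k = (len(sorted_vals) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def per_metric_statistics(results, metrics=("context_relevance", "context_utilization",
                                            "completeness", "adherence")):
    out = {}
    for metric in metrics:
        values = sorted(r[metric] for r in results if metric in r)
        if not values:
            continue
        out[metric] = {
            "mean": round(sum(values) / len(values), 4),
            "median": round(statistics.median(values), 4),
            "std_dev": round(statistics.pstdev(values), 4),  # population std dev
            "min": round(values[0], 4),
            "max": round(values[-1], 4),
            "percentile_25": round(percentile(values, 25), 4),
            "percentile_75": round(percentile(values, 75), 4),
            "perfect_count": sum(v >= 0.95 for v in values),  # scores >= 0.95
            "poor_count": sum(v < 0.3 for v in values),       # scores < 0.3
            "sample_count": len(values),
        }
    return {"per_metric_statistics": out}
```

On the three-sample data from this guide, context_relevance yields quartiles 0.275 and 0.425, matching the output structure above.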

UI Display

RMSE Aggregation Metrics (Metric Consistency)

Shows mean and standard deviation for each metric:

Relevance      0.350 ±0.123
Utilization    0.750 ±0.123
Completeness   0.717 ±0.125
Adherence      0.600 ±0.432

What it means:

  • Lower Std Dev = More consistent metric
  • High Std Dev (like Adherence 0.432) = Metric varies significantly across evaluations

Per-Metric Statistics (Distribution)

Shows distribution characteristics:

Relevance Mean       0.350 (Median: 0.350)
Utilization Mean     0.750 (Median: 0.750)
Completeness Mean    0.717 (Median: 0.750)
Adherence Mean       0.600 (Median: 0.800)

Expandable Details Include:

  • All percentiles
  • Perfect score count (>=0.95)
  • Poor score count (<0.3)
  • Min/max values

JSON Download Structure

Complete Results JSON

All metrics are now included in the downloaded JSON:

{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.604
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}

How to Use These Metrics

1. Identify Inconsistent Metrics

Look at RMSE Aggregation Std Dev:

  • Std Dev > 0.3 = High variance (unstable metric)
  • Std Dev < 0.1 = Low variance (stable metric)

Example:

Adherence Std Dev: 0.432  <- Highly variable, evaluate consistency
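Applying these thresholds can be automated with a small helper (hypothetical, not part of the codebase) that labels each metric from the `rmse_metrics` dict:

```python
def flag_metric_stability(rmse_metrics, high=0.3, low=0.1):
    """Label each metric's consistency using the std-dev thresholds above."""
    labels = {}
    for metric, stats in rmse_metrics.items():
        sd = stats["std_dev"]
        if sd > high:
            labels[metric] = "unstable"   # high variance across evaluations
        elif sd < low:
            labels[metric] = "stable"     # low variance
        else:
            labels[metric] = "moderate"
    return labels
```

With the example figures above, Adherence (0.432) is flagged "unstable" while Relevance (0.1225) falls in the "moderate" band.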

2. Find Problem Areas

Look at Per-Metric Statistics:

  • Poor Count > 0 = Metric has low scores (< 0.3)
  • Perfect Count = 0 = No perfect scores

Example:

Context Relevance Poor Count: 1   <- Some queries have low relevance
Adherence Poor Count: 1           <- Some responses have hallucinations

3. Distribution Analysis

Compare Mean vs Median:

  • If Mean ≈ Median: Symmetric distribution
  • If Mean > Median: Right-skewed (some high values)
  • If Mean < Median: Left-skewed (some low values)

Example:

Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
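The mean-vs-median comparison can be sketched as a tiny helper (the function name and the 0.05 tolerance are assumptions, not part of the implementation):

```python
def classify_skew(mean, median, tol=0.05):
    """Rough skew label from mean vs median; tol defines "approximately equal"."""
    if abs(mean - median) <= tol:
        return "roughly symmetric"
    return "right-skewed" if mean > median else "left-skewed"
```

For the Adherence example (mean 0.600, median 0.800) this returns "left-skewed", matching the analysis above.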

4. Evaluate Percentile Range

Use 25th and 75th percentiles to understand typical range:

Example:

Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)

Integration with Evaluation Process

Automatic Computation

RMSE and per-metric statistics are computed automatically during evaluate_batch():

def evaluate_batch(self, test_cases):
    # ... evaluation code (builds detailed_results, one score dict per test case) ...

    # Automatically compute aggregation and distribution metrics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)

    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats

    return results

No Ground Truth Required

Unlike RMSE against ground truth or AUC-ROC calculations, these statistics:

  • No ground truth needed
  • Works with actual evaluation results
  • Provides consistency/distribution insights
  • Suitable for real-world evaluation

Example Analysis Workflow

Scenario: Evaluation Results

Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0

Step 1: Check RMSE Aggregation

Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)

Step 2: Check Per-Metric Statistics

Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations

Step 3: Investigate Issues

Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check retrieved documents and response

Step 4: Recommendation

Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0

Comparison with Previous Approach

Before

  • Only overall averages shown
  • No distribution information
  • No consistency metrics
  • Empty RMSE/AUCROC in JSON

After

  • Overall averages + statistical breakdown
  • Full distribution analysis (percentiles, quartiles)
  • Consistency measurement (standard deviation)
  • Populated RMSE and per-metric stats in JSON
  • Perfect/poor count indicators

Technical Details

RMSE Aggregation Formula

For each metric:

$$\text{Std Dev} = \sqrt{\frac{\sum_i (x_i - \mu)^2}{n}}$$

Where:

  • $x_i$ = metric value for evaluation $i$
  • $\mu$ = mean metric value
  • $n$ = number of evaluations
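As a sanity check, this small standalone snippet applies the formula to the Adherence scores from the worked example (0.0, 0.8, 1.0) and reproduces the 0.432 shown in the UI:

```python
import math

# Adherence scores from the three-sample example above.
scores = [0.0, 0.8, 1.0]
mu = sum(scores) / len(scores)                          # mean = 0.6
var = sum((x - mu) ** 2 for x in scores) / len(scores)  # (0.36 + 0.04 + 0.16) / 3
std_dev = math.sqrt(var)
print(round(std_dev, 3))  # 0.432
```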

Per-Metric Statistics

  • Percentile k: Value below which k% of data falls
  • Perfect Count: Number of evaluations where metric >= 0.95
  • Poor Count: Number of evaluations where metric < 0.3

Files Modified

  1. advanced_rag_evaluator.py

    • Added compute_rmse_aggregation_for_batch() method
    • Added compute_per_metric_statistics() method
    • Updated evaluate_batch() to compute metrics
  2. streamlit_app.py

    • Added RMSE Aggregation section to UI
    • Added Per-Metric Statistics section to UI
    • Updated JSON download to include both metrics

Next Steps

Visualization

  • Add charts showing metric distributions
  • Comparison plots across evaluations
  • Heatmaps for metric correlations

Advanced Analysis

  • Metric trend analysis over time
  • Correlation between metrics
  • Root cause analysis for poor scores

Optimization

  • Use insights to improve retrieval
  • Adjust chunk size/overlap based on metrics
  • Select embedding model based on metric performance