RMSE Metrics Implementation Guide
Overview
RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are automatically computed during batch evaluation and included in both the UI display and JSON downloads.
What Was Implemented
1. RMSE Aggregation for Batch Evaluation
Method: RMSECalculator.compute_rmse_aggregation_for_batch(results)
Computes consistency statistics for each TRACE metric across all evaluations, showing how much each metric varies across the batch.
Output Structure:
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
Interpretation:
- Mean: Average score for that metric across all evaluations
- Std Dev: Spread of scores around the mean - lower means more consistent
- Min/Max: Range of values observed
- Variance: Square of the standard deviation
- Count: Number of evaluations
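The actual RMSECalculator implementation is not reproduced here, but the aggregation it describes can be sketched in a few lines of NumPy. The standalone function name and the assumption that results is a list of dicts keyed by metric name are illustrative, not the exact API; the output shape mirrors the structure above.

import numpy as np

TRACE_METRICS = ["context_relevance", "context_utilization", "completeness", "adherence"]

def rmse_aggregation_sketch(results):
    """Minimal sketch of the batch aggregation described above.
    Assumes each item in `results` is a dict with one float score per TRACE metric."""
    aggregation = {}
    for metric in TRACE_METRICS:
        values = np.array([r[metric] for r in results if metric in r])
        if values.size == 0:
            continue
        aggregation[metric] = {
            "mean": round(float(np.mean(values)), 4),
            "std_dev": round(float(np.std(values)), 4),  # population std dev, as in the 0.1225 example
            "min": round(float(np.min(values)), 4),
            "max": round(float(np.max(values)), 4),
            "variance": round(float(np.var(values)), 4),
            "count": int(values.size),
        }
    return {"rmse_metrics": aggregation}

# Reproducing the three-sample example batch used throughout this guide:
batch = [
    {"context_relevance": 0.20, "context_utilization": 0.75, "completeness": 0.75, "adherence": 0.0},
    {"context_relevance": 0.50, "context_utilization": 0.90, "completeness": 0.85, "adherence": 0.8},
    {"context_relevance": 0.35, "context_utilization": 0.60, "completeness": 0.55, "adherence": 1.0},
]
print(rmse_aggregation_sketch(batch)["rmse_metrics"]["context_relevance"])
# {'mean': 0.35, 'std_dev': 0.1225, 'min': 0.2, 'max': 0.5, 'variance': 0.015, 'count': 3}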
2. Per-Metric Statistics
Method: AUCROCCalculator.compute_per_metric_statistics(results)
Provides a detailed statistical breakdown of each TRACE metric without requiring ground truth.
Output Structure:
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
Interpretation:
- Mean/Median: Central tendency of metric values
- Percentile 25/75: Distribution quartiles
- Perfect Count: How many evaluations scored >= 0.95
- Poor Count: How many evaluations scored < 0.3
- Sample Count: Total number of evaluations
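As with the aggregation above, the per-metric breakdown can be approximated with NumPy. This is a sketch rather than the AUCROCCalculator source; np.percentile with its default linear interpolation reproduces the 0.275 / 0.425 quartiles shown above for three samples.

import numpy as np

def per_metric_statistics_sketch(results,
                                 metrics=("context_relevance", "context_utilization",
                                          "completeness", "adherence")):
    """Minimal sketch of the distribution statistics described above."""
    stats = {}
    for metric in metrics:
        values = np.array([r[metric] for r in results if metric in r])
        if values.size == 0:
            continue
        stats[metric] = {
            "mean": round(float(np.mean(values)), 4),
            "median": round(float(np.median(values)), 4),
            "std_dev": round(float(np.std(values)), 4),
            "min": round(float(np.min(values)), 4),
            "max": round(float(np.max(values)), 4),
            "percentile_25": round(float(np.percentile(values, 25)), 4),
            "percentile_75": round(float(np.percentile(values, 75)), 4),
            "perfect_count": int(np.sum(values >= 0.95)),  # scores counted as "perfect"
            "poor_count": int(np.sum(values < 0.3)),       # scores counted as "poor"
            "sample_count": int(values.size),
        }
    return {"per_metric_statistics": stats}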
UI Display
RMSE Aggregation Metrics (Metric Consistency)
Shows mean and standard deviation for each metric:
Relevance 0.350 ±0.123
Utilization 0.750 ±0.123
Completeness 0.717 ±0.125
Adherence 0.600 ±0.432
What it means:
- Lower Std Dev = More consistent metric
- High Std Dev (like Adherence 0.432) = Metric varies significantly across evaluations
Per-Metric Statistics (Distribution)
Shows distribution characteristics:
Relevance Mean 0.350 (Median: 0.350)
Utilization Mean 0.750 (Median: 0.750)
Completeness Mean 0.717 (Median: 0.750)
Adherence Mean 0.600 (Median: 0.800)
Expandable Details Include:
- All percentiles
- Perfect score count (>=0.95)
- Poor score count (<0.3)
- Min/max values
JSON Download Structure
Complete Results JSON
All metrics are now included in the downloaded JSON:
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
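Because everything lands in one file, the download can also be analyzed offline. A minimal sketch, assuming the file was saved as complete_results.json (the real download filename may differ):

import json

with open("complete_results.json") as f:  # filename is an assumption
    results = json.load(f)

# Print the consistency summary for each TRACE metric, mirroring the UI display
for metric, stats in results["rmse_metrics"].items():
    print(f"{metric}: {stats['mean']:.3f} ±{stats['std_dev']:.3f}")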
How to Use These Metrics
1. Identify Inconsistent Metrics
Look at RMSE Aggregation Std Dev:
- Std Dev > 0.3 = High variance (unstable metric)
- Std Dev < 0.1 = Low variance (stable metric)
Example:
Adherence Std Dev: 0.432 <- Highly variable; investigate this metric's consistency
2. Find Problem Areas
Look at Per-Metric Statistics:
- Poor Count > 0 = Metric has low scores (< 0.3)
- Perfect Count = 0 = No perfect scores
Example:
Context Relevance Poor Count: 1 <- Some queries have low relevance
Adherence Poor Count: 1 <- Some responses have hallucinations
3. Distribution Analysis
Compare Mean vs Median:
- If Mean ≈ Median: Symmetric distribution
- If Mean > Median: Right-skewed (some high values)
- If Mean < Median: Left-skewed (some low values)
Example:
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
4. Evaluate Percentile Range
Use the 25th and 75th percentiles to understand the typical range:
Example:
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
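The four checks above are easy to automate against the rmse_metrics and per_metric_statistics blocks. The thresholds come straight from this guide; the function name and report format are illustrative:

def flag_metrics(rmse_metrics, per_metric_statistics):
    """Apply the checks above: consistency, problem areas, skew, and typical range."""
    for metric, agg in rmse_metrics.items():
        stats = per_metric_statistics[metric]
        if agg["std_dev"] > 0.3:
            print(f"{metric}: high variance (std dev {agg['std_dev']:.3f}) - unstable metric")
        if stats["poor_count"] > 0:
            print(f"{metric}: {stats['poor_count']} evaluation(s) scored below 0.3")
        skew = ("right-skewed" if stats["mean"] > stats["median"]
                else "left-skewed" if stats["mean"] < stats["median"]
                else "symmetric")
        print(f"{metric}: {skew}, typical range "
              f"{stats['percentile_25']:.3f}-{stats['percentile_75']:.3f}")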
Integration with Evaluation Process
Automatic Computation
RMSE and per-metric statistics are computed automatically during evaluate_batch():
def evaluate_batch(self, test_cases):
    # ... evaluation code ...

    # Automatically compute metrics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)

    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats

    return results
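Callers therefore get the new fields without any extra calls. A minimal usage sketch, assuming the evaluator class in advanced_rag_evaluator.py is named AdvancedRAGEvaluator and takes no constructor arguments (both are assumptions):

from advanced_rag_evaluator import AdvancedRAGEvaluator  # class name is an assumption

evaluator = AdvancedRAGEvaluator()
results = evaluator.evaluate_batch(test_cases)  # test_cases prepared as for any batch evaluation

# Both blocks are present on the returned dict
print(results["rmse_metrics"]["adherence"]["std_dev"])
print(results["per_metric_statistics"]["adherence"]["median"])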
No Ground Truth Required
Unlike RMSE against ground truth or AUC-ROC calculations:
- No ground truth needed
- Works with actual evaluation results
- Provides consistency/distribution insights
- Suitable for real-world evaluation
Example Analysis Workflow
Scenario: Evaluation Results
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
Step 1: Check RMSE Aggregation
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
Step 2: Check Per-Metric Statistics
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
Step 3: Investigate Issues
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check retrieved documents and response
Step 4: Recommendation
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
Comparison with Previous Approach
Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC in JSON
After
- Overall averages + statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in JSON
- Perfect/poor count indicators
Technical Details
RMSE Aggregation Formula
For each metric, the reported std_dev is the population standard deviation of that metric's scores across the batch (equivalently, the root mean squared deviation of the scores from their mean):

$$\text{std\_dev} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$

Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations
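As a quick check against the adherence scores from the example batch (0.0, 0.8, 1.0):

$$\mu = \frac{0.0 + 0.8 + 1.0}{3} = 0.6, \qquad \sqrt{\frac{(0.0-0.6)^2 + (0.8-0.6)^2 + (1.0-0.6)^2}{3}} = \sqrt{0.1867} \approx 0.432$$

which matches the Adherence std dev (±0.432) shown in the UI example.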
Per-Metric Statistics
- Percentile k: Value below which k% of data falls
- Perfect Count: Number of evaluations where metric >= 0.95
- Poor Count: Number of evaluations where metric < 0.3
Files Modified
advanced_rag_evaluator.py
- Added compute_rmse_aggregation_for_batch() method
- Added compute_per_metric_statistics() method
- Updated evaluate_batch() to compute metrics
streamlit_app.py
- Added RMSE Aggregation section to UI
- Added Per-Metric Statistics section to UI
- Updated JSON download to include both metrics
Next Steps
Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations
Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root cause analysis for poor scores
Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select embedding model based on metric performance