# RMSE Metrics Implementation Guide

## Overview

RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are automatically computed during batch evaluation and included in both the UI display and JSON downloads.

## What Was Implemented

### 1. RMSE Aggregation for Batch Evaluation

**Method**: `RMSECalculator.compute_rmse_aggregation_for_batch(results)`

Computes consistency metrics for each TRACE metric across all evaluations. Shows how much each metric varies across the batch.

**Output Structure**:
```json
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```

**Interpretation**:
- **Mean**: Average score for that metric across all evaluations
- **Std Dev**: Spread of the scores; lower means more consistent
- **Min/Max**: Lowest and highest values observed
- **Variance**: The square of the standard deviation
- **Count**: Number of evaluations
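
The guide names `RMSECalculator.compute_rmse_aggregation_for_batch()`; the real implementation lives in `advanced_rag_evaluator.py`, but a minimal sketch of the computation (an assumption, presuming each result dict carries the four TRACE metric keys directly) could look like:

```python
from statistics import mean, pstdev, pvariance

TRACE_METRICS = ("context_relevance", "context_utilization", "completeness", "adherence")

def compute_rmse_aggregation_for_batch(results):
    """Per-metric consistency stats across a batch of evaluation results."""
    aggregation = {}
    for metric in TRACE_METRICS:
        values = [r[metric] for r in results if metric in r]
        if not values:
            continue  # metric absent from this batch
        aggregation[metric] = {
            "mean": round(mean(values), 4),
            "std_dev": round(pstdev(values), 4),      # population std dev
            "min": round(min(values), 4),
            "max": round(max(values), 4),
            "variance": round(pvariance(values), 4),  # population variance
            "count": len(values),
        }
    return aggregation
```

Using population statistics (`pstdev`/`pvariance`) reproduces the example figures above, e.g. a `std_dev` of 0.1225 for relevance scores 0.20, 0.50, 0.35.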

### 2. Per-Metric Statistics

**Method**: `AUCROCCalculator.compute_per_metric_statistics(results)`

Provides detailed statistical breakdown of each TRACE metric without requiring ground truth.

**Output Structure**:
```json
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```

**Interpretation**:
- **Mean/Median**: Central tendency of metric values
- **Percentile 25/75**: Distribution quartiles
- **Perfect Count**: How many evaluations scored >= 0.95
- **Poor Count**: How many evaluations scored < 0.3
- **Sample Count**: Total number of evaluations
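
A comparable sketch for `AUCROCCalculator.compute_per_metric_statistics()` (again an assumed implementation, not the shipped one; it uses the stdlib `statistics` module with the inclusive quantile method, which reproduces the quartile figures above):

```python
from statistics import mean, median, pstdev, quantiles

TRACE_METRICS = ("context_relevance", "context_utilization", "completeness", "adherence")

def compute_per_metric_statistics(results):
    """Distribution stats for each TRACE metric; no ground truth required."""
    stats = {}
    for metric in TRACE_METRICS:
        values = sorted(r[metric] for r in results if metric in r)
        if len(values) < 2:
            continue  # quartiles need at least two points
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        stats[metric] = {
            "mean": round(mean(values), 4),
            "median": round(median(values), 4),
            "std_dev": round(pstdev(values), 4),
            "min": values[0],
            "max": values[-1],
            "percentile_25": round(q1, 4),
            "percentile_75": round(q3, 4),
            "perfect_count": sum(v >= 0.95 for v in values),
            "poor_count": sum(v < 0.3 for v in values),
            "sample_count": len(values),
        }
    return stats
```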

## UI Display

### RMSE Aggregation Metrics (Metric Consistency)

Shows mean and standard deviation for each metric:

```
Relevance      0.350 ±0.123
Utilization    0.750 ±0.123
Completeness   0.717 ±0.125
Adherence      0.600 ±0.432
```

**What it means**:
- Lower Std Dev = More consistent metric
- High Std Dev (like Adherence 0.432) = Metric varies significantly across evaluations

### Per-Metric Statistics (Distribution)

Shows distribution characteristics:

```
Relevance Mean       0.350 (Median: 0.350)
Utilization Mean     0.750 (Median: 0.750)
Completeness Mean    0.717 (Median: 0.750)
Adherence Mean       0.600 (Median: 0.800)
```

**Expandable Details Include**:
- All percentiles
- Perfect score count (>=0.95)
- Poor score count (<0.3)
- Min/max values

## JSON Download Structure

### Complete Results JSON

All metrics are now included in the downloaded JSON:

```json
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
```

## How to Use These Metrics

### 1. Identify Inconsistent Metrics

Look at RMSE Aggregation Std Dev:
- Std Dev > 0.3 = High variance (unstable metric)
- Std Dev < 0.1 = Low variance (stable metric)

Example:
```
Adherence Std Dev: 0.432  <- Highly variable, investigate why scores differ
```
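
This check is easy to script against the downloaded JSON. A small helper sketch (the 0.3 threshold matches the rule of thumb above; the field names follow the JSON structure shown earlier):

```python
import json

def flag_unstable_metrics(results_json: str, threshold: float = 0.3) -> dict:
    """Return {metric: std_dev} for metrics whose batch std dev exceeds the threshold."""
    results = json.loads(results_json)
    return {
        metric: stats["std_dev"]
        for metric, stats in results.get("rmse_metrics", {}).items()
        if stats["std_dev"] > threshold
    }
```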

### 2. Find Problem Areas

Look at Per-Metric Statistics:
- Poor Count > 0 = Metric has low scores (< 0.3)
- Perfect Count = 0 = No perfect scores

Example:
```
Context Relevance Poor Count: 1   <- Some queries have low relevance
Adherence Poor Count: 1           <- Some responses have hallucinations
```

### 3. Distribution Analysis

Compare Mean vs Median:
- If Mean ≈ Median: Symmetric distribution
- If Mean > Median: Right-skewed (some high values)
- If Mean < Median: Left-skewed (some low values)

Example:
```
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
```
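
The mean-vs-median comparison can be automated with a toy helper (not part of the evaluator; the tolerance is an arbitrary choice):

```python
from statistics import mean, median

def skew_direction(values, tol=0.01):
    """Classify a distribution by comparing its mean and median."""
    m, med = mean(values), median(values)
    if abs(m - med) <= tol:
        return "roughly symmetric"
    return "right-skewed" if m > med else "left-skewed"
```

For the adherence scores in the workflow example (0.0, 0.8, 1.0), the mean of 0.60 sits below the median of 0.80, so the helper reports "left-skewed".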

### 4. Evaluate Percentile Range

Use 25th and 75th percentiles to understand typical range:

Example:
```
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
```

## Integration with Evaluation Process

### Automatic Computation

RMSE and per-metric statistics are computed automatically during `evaluate_batch()`:

```python
def evaluate_batch(self, test_cases):
    # ... evaluation code ...
    
    # Automatically compute metrics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)
    
    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats
    
    return results
```

### No Ground Truth Required

Unlike RMSE against ground-truth labels or AUC-ROC calculations, these statistics:
- **Require no ground truth**
- Work directly on actual evaluation results
- Provide consistency and distribution insights
- Are suitable for real-world evaluation runs

## Example Analysis Workflow

### Scenario: Evaluation Results
```
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
```

### Step 1: Check RMSE Aggregation
```
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
```

### Step 2: Check Per-Metric Statistics
```
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
```

### Step 3: Investigate Issues
```
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check retrieved documents and response
```

### Step 4: Recommendation
```
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
```

## Comparison with Previous Approach

### Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC in JSON

### After
- Overall averages + statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in JSON
- Perfect/poor count indicators

## Technical Details

### RMSE Aggregation Formula

For each metric:
$$\text{Std Dev} = \sqrt{\frac{\sum(x_i - \mu)^2}{n}}$$

Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations
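
Worked through for the adherence scores in the example scenario (0.0, 0.8, 1.0):

```python
import math

scores = [0.0, 0.8, 1.0]        # adherence values from the example scenario
mu = sum(scores) / len(scores)  # mean = 0.6
variance = sum((x - mu) ** 2 for x in scores) / len(scores)  # 0.56 / 3
std_dev = math.sqrt(variance)
print(round(std_dev, 3))  # 0.432, matching the UI table above
```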

### Per-Metric Statistics

- **Percentile k**: Value below which k% of data falls
- **Perfect Count**: Number of evaluations where metric >= 0.95
- **Poor Count**: Number of evaluations where metric < 0.3

## Files Modified

1. **advanced_rag_evaluator.py**
   - Added `compute_rmse_aggregation_for_batch()` method
   - Added `compute_per_metric_statistics()` method
   - Updated `evaluate_batch()` to compute metrics

2. **streamlit_app.py**
   - Added RMSE Aggregation section to UI
   - Added Per-Metric Statistics section to UI
   - Updated JSON download to include both metrics

## Next Steps

### Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations

### Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root cause analysis for poor scores

### Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select embedding model based on metric performance