# RMSE Metrics Implementation Guide
## Overview
RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are automatically computed during batch evaluation and included in both the UI display and JSON downloads.
## What Was Implemented
### 1. RMSE Aggregation for Batch Evaluation
**Method**: `RMSECalculator.compute_rmse_aggregation_for_batch(results)`
Computes consistency metrics for each TRACE metric across all evaluations. Shows how much each metric varies across the batch.
**Output Structure**:
```json
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```
**Interpretation**:
- **Mean**: Average score for the metric across all evaluations
- **Std Dev**: Standard deviation of the scores; lower means more consistent
- **Min/Max**: Smallest and largest values observed
- **Variance**: Square of the standard deviation
- **Count**: Number of evaluations included
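A minimal sketch of what this aggregation might look like, using only the standard library. The `metrics` key on each result entry is an assumed structure for illustration; the actual `RMSECalculator` implementation may differ:

```python
import statistics

TRACE_METRICS = ["context_relevance", "context_utilization", "completeness", "adherence"]

def compute_rmse_aggregation_for_batch(results):
    """Aggregate each TRACE metric across a batch of evaluation results."""
    aggregated = {}
    for metric in TRACE_METRICS:
        values = [r["metrics"][metric] for r in results if metric in r.get("metrics", {})]
        if not values:
            continue
        # Population standard deviation (divide by n, not n-1)
        std_dev = statistics.pstdev(values)
        aggregated[metric] = {
            "mean": round(statistics.fmean(values), 4),
            "std_dev": round(std_dev, 4),
            "min": round(min(values), 4),
            "max": round(max(values), 4),
            "variance": round(std_dev ** 2, 4),
            "count": len(values),
        }
    return aggregated
```

Running this over the three context-relevance scores from the example output (0.20, 0.50, 0.35) reproduces the mean of 0.35 and std dev of 0.1225 shown above.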
### 2. Per-Metric Statistics
**Method**: `AUCROCCalculator.compute_per_metric_statistics(results)`
Provides detailed statistical breakdown of each TRACE metric without requiring ground truth.
**Output Structure**:
```json
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```
**Interpretation**:
- **Mean/Median**: Central tendency of the metric's values
- **Percentile 25/75**: 25th and 75th percentiles (first and third quartiles)
- **Perfect Count**: Number of evaluations scoring >= 0.95
- **Poor Count**: Number of evaluations scoring < 0.3
- **Sample Count**: Total number of evaluations
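A sketch of this computation for a single metric's scores, using only the standard library. The 0.95 and 0.3 thresholds mirror the interpretation above, and the `inclusive` quantile method matches the linearly interpolated quartiles in the example output; the real `AUCROCCalculator` implementation may differ:

```python
import statistics

def per_metric_stats(values, perfect_threshold=0.95, poor_threshold=0.3):
    """Distribution summary for one metric's scores; needs at least two values."""
    ordered = sorted(values)
    # "inclusive" uses linear interpolation between data points
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": round(statistics.fmean(ordered), 4),
        "median": round(statistics.median(ordered), 4),
        "std_dev": round(statistics.pstdev(ordered), 4),
        "min": ordered[0],
        "max": ordered[-1],
        "percentile_25": round(q1, 4),
        "percentile_75": round(q3, 4),
        "perfect_count": sum(v >= perfect_threshold for v in ordered),
        "poor_count": sum(v < poor_threshold for v in ordered),
        "sample_count": len(ordered),
    }
```

For the context-relevance scores 0.20, 0.50, 0.35 this yields the quartiles 0.275 and 0.425 and the poor count of 1 shown in the example.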
## UI Display
### RMSE Aggregation Metrics (Metric Consistency)
Shows mean and standard deviation for each metric:
```
Relevance 0.350 ±0.123
Utilization 0.750 ±0.123
Completeness 0.717 ±0.125
Adherence 0.600 ±0.432
```
**What it means**:
- Lower Std Dev = More consistent metric
- High Std Dev (like Adherence 0.432) = Metric varies significantly across evaluations
### Per-Metric Statistics (Distribution)
Shows distribution characteristics:
```
Relevance Mean 0.350 (Median: 0.350)
Utilization Mean 0.750 (Median: 0.750)
Completeness Mean 0.717 (Median: 0.750)
Adherence Mean 0.600 (Median: 0.800)
```
**Expandable Details Include**:
- All percentiles
- Perfect score count (>=0.95)
- Poor score count (<0.3)
- Min/max values
## JSON Download Structure
### Complete Results JSON
All metrics are now included in the downloaded JSON:
```json
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
```
## How to Use These Metrics
### 1. Identify Inconsistent Metrics
Look at RMSE Aggregation Std Dev:
- Std Dev > 0.3 = High variance (unstable metric)
- Std Dev < 0.1 = Low variance (stable metric)
Example:
```
Adherence Std Dev: 0.432 <- Highly variable, evaluate consistency
```
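These thresholds can be applied programmatically to the `rmse_metrics` structure shown earlier. This is a hypothetical helper, not part of the evaluator; the 0.3/0.1 cutoffs follow the rules above:

```python
def classify_stability(rmse_metrics, high=0.3, low=0.1):
    """Label each metric's consistency from its batch standard deviation."""
    labels = {}
    for name, stats in rmse_metrics.items():
        sd = stats["std_dev"]
        if sd > high:
            labels[name] = "unstable"    # high variance across evaluations
        elif sd < low:
            labels[name] = "stable"      # low variance across evaluations
        else:
            labels[name] = "moderate"
    return labels
```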
### 2. Find Problem Areas
Look at Per-Metric Statistics:
- Poor Count > 0 = Metric has low scores (< 0.3)
- Perfect Count = 0 = No perfect scores
Example:
```
Context Relevance Poor Count: 1 <- Some queries have low relevance
Adherence Poor Count: 1 <- Some responses have hallucinations
```
### 3. Distribution Analysis
Compare Mean vs Median:
- If Mean ≈ Median: Symmetric distribution
- If Mean > Median: Right-skewed (some high values)
- If Mean < Median: Left-skewed (some low values)
Example:
```
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
```
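The mean-versus-median heuristic can be expressed as a small helper. This is illustrative only; the `tol` tolerance for "approximately equal" is an assumption, not part of the evaluator:

```python
import statistics

def skew_direction(values, tol=0.05):
    """Classify distribution shape by comparing mean and median."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    if abs(mean - median) <= tol:
        return "approximately symmetric"
    # Mean pulled above the median -> right-skewed; below -> left-skewed
    return "right-skewed" if mean > median else "left-skewed"
```

For the adherence scores in the example (mean 0.600, median 0.800) this returns "left-skewed".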
### 4. Evaluate Percentile Range
Use 25th and 75th percentiles to understand typical range:
Example:
```
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
```
## Integration with Evaluation Process
### Automatic Computation
RMSE and per-metric statistics are computed automatically during `evaluate_batch()`:
```python
def evaluate_batch(self, test_cases):
# ... evaluation code ...
# Automatically compute metrics
rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)
results["rmse_metrics"] = rmse_metrics
results["per_metric_statistics"] = per_metric_stats
return results
```
### No Ground Truth Required
Unlike RMSE against ground truth or AUC-ROC calculations, these statistics:
- Require **no ground truth labels**
- Operate directly on the actual evaluation results
- Provide consistency and distribution insights
- Are suitable for real-world evaluation runs
## Example Analysis Workflow
### Scenario: Evaluation Results
```
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
```
### Step 1: Check RMSE Aggregation
```
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
```
### Step 2: Check Per-Metric Statistics
```
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
```
### Step 3: Investigate Issues
```
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check retrieved documents and response
```
### Step 4: Recommendation
```
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
```
## Comparison with Previous Approach
### Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC in JSON
### After
- Overall averages + statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in JSON
- Perfect/poor count indicators
## Technical Details
### RMSE Aggregation Formula
For each metric, the aggregation uses the population standard deviation (dividing by $n$ rather than $n-1$):
$$\text{Std Dev} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}}$$
Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations
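The formula can be checked directly against the context-relevance scores from the example batch:

```python
import math

values = [0.20, 0.50, 0.35]  # context_relevance across the batch
mu = sum(values) / len(values)  # mean = 0.35
# Population standard deviation: divide the squared deviations by n
std_dev = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
print(round(std_dev, 4))  # prints 0.1225, matching the example output
```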
### Per-Metric Statistics
- **Percentile k**: Value below which k% of data falls
- **Perfect Count**: Number of evaluations where metric >= 0.95
- **Poor Count**: Number of evaluations where metric < 0.3
## Files Modified
1. **advanced_rag_evaluator.py**
- Added `compute_rmse_aggregation_for_batch()` method
- Added `compute_per_metric_statistics()` method
- Updated `evaluate_batch()` to compute metrics
2. **streamlit_app.py**
- Added RMSE Aggregation section to UI
- Added Per-Metric Statistics section to UI
- Updated JSON download to include both metrics
## Next Steps
### Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations
### Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root cause analysis for poor scores
### Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select embedding model based on metric performance