# RMSE Metrics Implementation Guide
## Overview
RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are computed automatically during batch evaluation and included in both the UI display and JSON downloads.
## What Was Implemented
### 1. RMSE Aggregation for Batch Evaluation
**Method**: `RMSECalculator.compute_rmse_aggregation_for_batch(results)`
Computes consistency metrics for each TRACE metric across all evaluations, showing how much each metric varies across the batch.
**Output Structure**:
```json
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```
**Interpretation**:
- **Mean**: Average score for that metric across all evaluations
- **Std Dev**: Variation across evaluations; lower means more consistent
- **Min/Max**: Range of values observed
- **Variance**: Square of the standard deviation
- **Count**: Number of evaluations
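The aggregation above can be sketched with the standard library alone. This is a minimal illustration, not the actual `RMSECalculator` internals; it assumes each batch result is a dict mapping metric name to a score in [0, 1]:

```python
import statistics

TRACE_METRICS = ["context_relevance", "context_utilization", "completeness", "adherence"]

def compute_rmse_aggregation_for_batch(results):
    """Aggregate each TRACE metric across a batch of evaluation results.

    Assumes each result is a dict mapping metric name -> score in [0, 1].
    """
    aggregation = {}
    for metric in TRACE_METRICS:
        values = [r[metric] for r in results if metric in r]
        if not values:
            continue
        mean = statistics.fmean(values)
        variance = statistics.pvariance(values, mu=mean)  # population variance
        aggregation[metric] = {
            "mean": round(mean, 4),
            "std_dev": round(variance ** 0.5, 4),
            "min": round(min(values), 4),
            "max": round(max(values), 4),
            "variance": round(variance, 4),
            "count": len(values),
        }
    return {"rmse_metrics": aggregation}
```

With the relevance scores from the example workflow (0.20, 0.50, 0.35), this reproduces the numbers shown above: mean 0.35, std dev 0.1225, variance 0.015.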
### 2. Per-Metric Statistics
**Method**: `AUCROCCalculator.compute_per_metric_statistics(results)`
Provides a detailed statistical breakdown of each TRACE metric without requiring ground truth.
**Output Structure**:
```json
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```
**Interpretation**:
- **Mean/Median**: Central tendency of metric values
- **Percentile 25/75**: Distribution quartiles
- **Perfect Count**: Number of evaluations scoring >= 0.95
- **Poor Count**: Number of evaluations scoring < 0.3
- **Sample Count**: Total number of evaluations
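A sketch of how such statistics could be computed with the standard library (the real `AUCROCCalculator.compute_per_metric_statistics` may differ; the per-result layout is again an assumption):

```python
import statistics

def compute_per_metric_statistics(
    results,
    metrics=("context_relevance", "context_utilization", "completeness", "adherence"),
):
    """Distributional summary per TRACE metric; no ground truth needed.

    Assumes each result is a dict mapping metric name -> score in [0, 1].
    """
    stats = {}
    for metric in metrics:
        values = sorted(r[metric] for r in results if metric in r)
        if not values:
            continue
        # quartile cut points: [25th, 50th, 75th percentile]
        q = statistics.quantiles(values, n=4, method="inclusive")
        stats[metric] = {
            "mean": round(statistics.fmean(values), 4),
            "median": round(statistics.median(values), 4),
            "std_dev": round(statistics.pstdev(values), 4),
            "min": values[0],
            "max": values[-1],
            "percentile_25": round(q[0], 4),
            "percentile_75": round(q[2], 4),
            "perfect_count": sum(v >= 0.95 for v in values),
            "poor_count": sum(v < 0.3 for v in values),
            "sample_count": len(values),
        }
    return {"per_metric_statistics": stats}
```

For relevance scores of 0.20, 0.50, and 0.35 this yields the quartiles shown above (25th=0.275, 75th=0.425) and a poor count of 1.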
## UI Display
### RMSE Aggregation Metrics (Metric Consistency)
Shows the mean and standard deviation for each metric:
```
Relevance      0.350 ±0.123
Utilization    0.750 ±0.123
Completeness   0.717 ±0.125
Adherence      0.600 ±0.432
```
**What it means**:
- Lower Std Dev = more consistent metric
- High Std Dev (e.g., Adherence at 0.432) = the metric varies significantly across evaluations
### Per-Metric Statistics (Distribution)
Shows distribution characteristics:
```
Relevance      Mean 0.350 (Median: 0.350)
Utilization    Mean 0.750 (Median: 0.750)
Completeness   Mean 0.717 (Median: 0.750)
Adherence      Mean 0.600 (Median: 0.800)
```
**Expandable Details Include**:
- All percentiles
- Perfect score count (>=0.95)
- Poor score count (<0.3)
- Min/max values
## JSON Download Structure
### Complete Results JSON
All metrics are now included in the downloaded JSON:
```json
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
```
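Once downloaded, the JSON can be post-processed programmatically. A small sketch (the filename and the 0.3 threshold are illustrative choices, not part of the system):

```python
import json

def flag_unstable_metrics(results, threshold=0.3):
    """Return metric names whose batch std dev exceeds the threshold,
    given a parsed results JSON with the structure shown above."""
    return sorted(
        metric
        for metric, stats in results.get("rmse_metrics", {}).items()
        if stats["std_dev"] > threshold
    )

# Typical use with a downloaded file (path is illustrative):
# with open("evaluation_results.json") as f:
#     print(flag_unstable_metrics(json.load(f)))
```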
## How to Use These Metrics
### 1. Identify Inconsistent Metrics
Look at the RMSE Aggregation Std Dev:
- Std Dev > 0.3 = high variance (unstable metric)
- Std Dev < 0.1 = low variance (stable metric)
Example:
```
Adherence Std Dev: 0.432 <- highly variable; investigate consistency
```
### 2. Find Problem Areas
Look at the Per-Metric Statistics:
- Poor Count > 0 = the metric has low scores (< 0.3)
- Perfect Count = 0 = no perfect scores
Example:
```
Context Relevance Poor Count: 1 <- some queries have low relevance
Adherence Poor Count: 1 <- some responses contain hallucinations
```
### 3. Distribution Analysis
Compare Mean vs. Median:
- Mean ≈ Median: symmetric distribution
- Mean > Median: right-skewed (some high values)
- Mean < Median: left-skewed (some low values)
Example:
```
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
```
### 4. Evaluate Percentile Range
Use the 25th and 75th percentiles to understand the typical range:
Example:
```
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
```
## Integration with Evaluation Process
### Automatic Computation
RMSE and per-metric statistics are computed automatically during `evaluate_batch()`:
```python
def evaluate_batch(self, test_cases):
    # ... evaluation code ...
    # Automatically compute batch statistics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)
    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats
    return results
```
### No Ground Truth Required
Unlike RMSE against ground-truth labels or AUC-ROC calculations, these statistics:
- **Require no ground truth**
- Work directly on actual evaluation results
- Provide consistency and distribution insights
- Are suitable for real-world evaluation
## Example Analysis Workflow
### Scenario: Evaluation Results
```
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
```
### Step 1: Check RMSE Aggregation
```
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
```
### Step 2: Check Per-Metric Statistics
```
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
```
### Step 3: Investigate Issues
```
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check the retrieved documents and the response
```
### Step 4: Recommendation
```
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
```
## Comparison with Previous Approach
### Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC fields in the JSON
### After
- Overall averages plus a statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in the JSON
- Perfect/poor count indicators
## Technical Details
### RMSE Aggregation Formula
For each metric, the reported Std Dev is the population standard deviation:
$$\text{Std Dev} = \sqrt{\frac{\sum_i (x_i - \mu)^2}{n}}$$
Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations
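The formula can be checked directly against the adherence scores from the example workflow (0.0, 0.8, 1.0):

```python
import math

def population_std_dev(values):
    """Population standard deviation, matching the formula above."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

adherence = [0.0, 0.8, 1.0]
print(round(population_std_dev(adherence), 3))  # 0.432
```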
### Per-Metric Statistics
- **Percentile k**: Value below which k% of the data falls
- **Perfect Count**: Number of evaluations where the metric >= 0.95
- **Poor Count**: Number of evaluations where the metric < 0.3
## Files Modified
1. **advanced_rag_evaluator.py**
   - Added the `compute_rmse_aggregation_for_batch()` method
   - Added the `compute_per_metric_statistics()` method
   - Updated `evaluate_batch()` to compute both metrics
2. **streamlit_app.py**
   - Added the RMSE Aggregation section to the UI
   - Added the Per-Metric Statistics section to the UI
   - Updated the JSON download to include both metrics
## Next Steps
### Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations
### Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root-cause analysis for poor scores
### Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select the embedding model based on metric performance