# RMSE Metrics Implementation Guide

## Overview

RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are computed automatically during batch evaluation and included in both the UI display and JSON downloads.

## What Was Implemented

### 1. RMSE Aggregation for Batch Evaluation

**Method**: `RMSECalculator.compute_rmse_aggregation_for_batch(results)`

Computes consistency metrics for each TRACE metric across all evaluations, showing how much each metric varies across the batch.

**Output Structure**:

```json
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```

**Interpretation**:
- **Mean**: Average score for that metric across all evaluations
- **Std Dev**: Variation (consistency); lower is more consistent
- **Min/Max**: Range of values observed
- **Variance**: Square of the standard deviation
- **Count**: Number of evaluations

### 2. Per-Metric Statistics

**Method**: `AUCROCCalculator.compute_per_metric_statistics(results)`

Provides a detailed statistical breakdown of each TRACE metric without requiring ground truth.

**Output Structure**:

```json
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```
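The output structures above can be reproduced with the standard library alone. The sketch below is a hypothetical implementation of the per-metric statistics computation: the method name mirrors the evaluator's, but the body is an assumption inferred from the output structure, using population standard deviation and numpy-style ("inclusive") percentiles, which reproduce the example numbers shown.

```python
import statistics

# TRACE metric keys, as they appear in the output structures above.
TRACE_METRICS = ["context_relevance", "context_utilization",
                 "completeness", "adherence"]

def per_metric_statistics(results, perfect=0.95, poor=0.3):
    """Hypothetical sketch: results is a list of per-evaluation score dicts."""
    out = {}
    for metric in TRACE_METRICS:
        values = sorted(r[metric] for r in results if metric in r)
        if len(values) < 2:
            continue  # statistics.quantiles needs at least two samples
        # "inclusive" matches numpy's default linear interpolation
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        out[metric] = {
            "mean": round(statistics.mean(values), 4),
            "median": round(statistics.median(values), 4),
            "std_dev": round(statistics.pstdev(values), 4),  # population std dev
            "min": values[0],
            "max": values[-1],
            "percentile_25": round(q1, 4),
            "percentile_75": round(q3, 4),
            "perfect_count": sum(v >= perfect for v in values),
            "poor_count": sum(v < poor for v in values),
            "sample_count": len(values),
        }
    return out
```

Fed the three-sample scores used throughout this guide, this sketch yields the same context-relevance values as the JSON example (mean 0.35, std dev 0.1225, 25th percentile 0.275).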
**Interpretation**:
- **Mean/Median**: Central tendency of metric values
- **Percentile 25/75**: Distribution quartiles
- **Perfect Count**: How many evaluations scored >= 0.95
- **Poor Count**: How many evaluations scored < 0.3
- **Sample Count**: Total number of evaluations

## UI Display

### RMSE Aggregation Metrics (Metric Consistency)

Shows mean and standard deviation for each metric:

```
Relevance      0.350 ±0.123
Utilization    0.750 ±0.123
Completeness   0.717 ±0.125
Adherence      0.600 ±0.432
```

**What it means**:
- Lower Std Dev = more consistent metric
- High Std Dev (like Adherence at 0.432) = metric varies significantly across evaluations

### Per-Metric Statistics (Distribution)

Shows distribution characteristics:

```
Relevance      Mean 0.350 (Median: 0.350)
Utilization    Mean 0.750 (Median: 0.750)
Completeness   Mean 0.717 (Median: 0.750)
Adherence      Mean 0.600 (Median: 0.800)
```

**Expandable Details Include**:
- All percentiles
- Perfect score count (>= 0.95)
- Poor score count (< 0.3)
- Min/max values

## JSON Download Structure

### Complete Results JSON

All metrics are now included in the downloaded JSON:

```json
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
```

## How to Use These Metrics
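Once a results file has been downloaded and parsed with `json.load()`, the least consistent metrics can be surfaced programmatically. A minimal sketch, using the example numbers from this guide as a stand-in for a real file:

```python
# Miniature stand-in for a downloaded results file; in practice you would
# parse the real file with json.load(). Values are this guide's examples.
results = {
    "rmse_metrics": {
        "context_relevance": {"mean": 0.350, "std_dev": 0.1225},
        "context_utilization": {"mean": 0.750, "std_dev": 0.1225},
        "completeness": {"mean": 0.717, "std_dev": 0.125},
        "adherence": {"mean": 0.600, "std_dev": 0.432},
    }
}

# Rank metrics from least to most consistent; flag anything whose std dev
# exceeds the 0.3 "unstable" threshold suggested in this guide.
ranked = sorted(results["rmse_metrics"].items(),
                key=lambda kv: kv[1]["std_dev"], reverse=True)
unstable = [name for name, s in ranked if s["std_dev"] > 0.3]

for name, s in ranked:
    print(f"{name:20s} mean={s['mean']:.3f} ±{s['std_dev']:.3f}")
```

With these numbers, `adherence` is ranked first and is the only metric flagged as unstable.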
### 1. Identify Inconsistent Metrics

Look at the RMSE Aggregation Std Dev:
- Std Dev > 0.3 = high variance (unstable metric)
- Std Dev < 0.1 = low variance (stable metric)

Example:

```
Adherence Std Dev: 0.432  <- Highly variable, evaluate consistency
```

### 2. Find Problem Areas

Look at the Per-Metric Statistics:
- Poor Count > 0 = metric has low scores (< 0.3)
- Perfect Count = 0 = no perfect scores

Example:

```
Context Relevance Poor Count: 1  <- Some queries have low relevance
Adherence Poor Count: 1          <- Some responses have hallucinations
```

### 3. Distribution Analysis

Compare Mean vs. Median:
- Mean ≈ Median: symmetric distribution
- Mean > Median: right-skewed (some high values)
- Mean < Median: left-skewed (some low values)

Example:

```
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
```

### 4. Evaluate Percentile Range

Use the 25th and 75th percentiles to understand the typical range:

```
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
```

## Integration with Evaluation Process

### Automatic Computation

RMSE and per-metric statistics are computed automatically during `evaluate_batch()`:

```python
def evaluate_batch(self, test_cases):
    # ... evaluation code ...

    # Automatically compute metrics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)

    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats
    return results
```

### No Ground Truth Required

Unlike RMSE-against-ground-truth or AUCROC calculations, these metrics:
- require **no ground truth**
- work with actual evaluation results
- provide consistency/distribution insights
- are suitable for real-world evaluation

## Example Analysis Workflow

### Scenario: Evaluation Results

```
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
```

### Step 1: Check RMSE Aggregation

```
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
```

### Step 2: Check Per-Metric Statistics

```
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
```

### Step 3: Investigate Issues

```
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check retrieved documents and response
```

### Step 4: Recommendation

```
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
```

## Comparison with Previous Approach

### Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC in JSON

### After
- Overall averages + statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in JSON
- Perfect/poor count indicators

## Technical Details

### RMSE Aggregation Formula

For each metric, the (population) standard deviation is:

$$\text{Std Dev} = \sqrt{\frac{\sum_i (x_i - \mu)^2}{n}}$$

Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations

### Per-Metric Statistics
- **Percentile k**: Value below which k% of the data falls
- **Perfect Count**: Number of evaluations where the metric >= 0.95
- **Poor Count**: Number of evaluations where the metric < 0.3

## Files Modified

1. **advanced_rag_evaluator.py**
   - Added `compute_rmse_aggregation_for_batch()` method
   - Added `compute_per_metric_statistics()` method
   - Updated `evaluate_batch()` to compute metrics
2. **streamlit_app.py**
   - Added RMSE Aggregation section to UI
   - Added Per-Metric Statistics section to UI
   - Updated JSON download to include both metrics

## Next Steps

### Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations

### Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root cause analysis for poor scores

### Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select embedding model based on metric performance
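As a closing sanity check, the adherence figures quoted throughout this guide (A = 0.0, 0.8, 1.0) can be reproduced directly from the population standard-deviation formula in the Technical Details section:

```python
import math
import statistics

# Adherence scores from the three-sample example workflow.
scores = [0.0, 0.8, 1.0]

mean = sum(scores) / len(scores)
variance = sum((x - mean) ** 2 for x in scores) / len(scores)
std_dev = math.sqrt(variance)
median = statistics.median(scores)

# Mean 0.600, Std Dev ~0.432, Median 0.800; mean < median => left-skewed,
# matching the distribution analysis in "How to Use These Metrics".
print(f"mean={mean:.3f} std_dev={std_dev:.3f} median={median:.3f}")
```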