# RMSE Metrics Implementation Guide
## Overview
RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are computed automatically during batch evaluation and included in both the UI display and JSON downloads.
## What Was Implemented
### 1. RMSE Aggregation for Batch Evaluation
**Method**: `RMSECalculator.compute_rmse_aggregation_for_batch(results)`
Computes consistency metrics for each TRACE metric across all evaluations, showing how much each metric varies across the batch.
**Output Structure**:
```json
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```
**Interpretation**:
- **Mean**: Average score for that metric across all evaluations
- **Std Dev**: Variation across evaluations; lower means more consistent
- **Min/Max**: Range of values observed
- **Variance**: Square of the standard deviation
- **Count**: Number of evaluations
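The aggregation above can be sketched with the standard library alone. This is a minimal illustration, not the actual `RMSECalculator` internals; it assumes each batch result is a dict mapping metric name to a score in [0, 1]:

```python
import statistics

TRACE_METRICS = ["context_relevance", "context_utilization", "completeness", "adherence"]

def compute_rmse_aggregation_for_batch(results):
    """Aggregate each TRACE metric across a batch of evaluation results.

    Assumes each result is a dict mapping metric name -> score in [0, 1].
    """
    aggregation = {}
    for metric in TRACE_METRICS:
        values = [r[metric] for r in results if metric in r]
        if not values:
            continue
        mean = statistics.fmean(values)
        variance = statistics.pvariance(values, mu=mean)  # population variance
        aggregation[metric] = {
            "mean": round(mean, 4),
            "std_dev": round(variance ** 0.5, 4),
            "min": round(min(values), 4),
            "max": round(max(values), 4),
            "variance": round(variance, 4),
            "count": len(values),
        }
    return {"rmse_metrics": aggregation}
```

With the relevance scores from the example workflow (0.20, 0.50, 0.35), this reproduces the numbers shown above: mean 0.35, std dev 0.1225, variance 0.015.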
### 2. Per-Metric Statistics
**Method**: `AUCROCCalculator.compute_per_metric_statistics(results)`
Provides a detailed statistical breakdown of each TRACE metric without requiring ground truth.
**Output Structure**:
```json
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```
**Interpretation**:
- **Mean/Median**: Central tendency of metric values
- **Percentile 25/75**: Distribution quartiles
- **Perfect Count**: Number of evaluations scoring >= 0.95
- **Poor Count**: Number of evaluations scoring < 0.3
- **Sample Count**: Total number of evaluations
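A sketch of how such statistics could be computed with the standard library (the real `AUCROCCalculator.compute_per_metric_statistics` may differ; the per-result layout is again an assumption):

```python
import statistics

def compute_per_metric_statistics(
    results,
    metrics=("context_relevance", "context_utilization", "completeness", "adherence"),
):
    """Distributional summary per TRACE metric; no ground truth needed.

    Assumes each result is a dict mapping metric name -> score in [0, 1].
    """
    stats = {}
    for metric in metrics:
        values = sorted(r[metric] for r in results if metric in r)
        if not values:
            continue
        # quartile cut points: [25th, 50th, 75th percentile]
        q = statistics.quantiles(values, n=4, method="inclusive")
        stats[metric] = {
            "mean": round(statistics.fmean(values), 4),
            "median": round(statistics.median(values), 4),
            "std_dev": round(statistics.pstdev(values), 4),
            "min": values[0],
            "max": values[-1],
            "percentile_25": round(q[0], 4),
            "percentile_75": round(q[2], 4),
            "perfect_count": sum(v >= 0.95 for v in values),
            "poor_count": sum(v < 0.3 for v in values),
            "sample_count": len(values),
        }
    return {"per_metric_statistics": stats}
```

For relevance scores of 0.20, 0.50, and 0.35 this yields the quartiles shown above (25th=0.275, 75th=0.425) and a poor count of 1.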
## UI Display
### RMSE Aggregation Metrics (Metric Consistency)
Shows the mean and standard deviation for each metric:
```
Relevance      0.350 ±0.123
Utilization    0.750 ±0.123
Completeness   0.717 ±0.125
Adherence      0.600 ±0.432
```
**What it means**:
- Lower Std Dev = more consistent metric
- High Std Dev (e.g., Adherence at 0.432) = the metric varies significantly across evaluations
### Per-Metric Statistics (Distribution)
Shows distribution characteristics:
```
Relevance      Mean 0.350 (Median: 0.350)
Utilization    Mean 0.750 (Median: 0.750)
Completeness   Mean 0.717 (Median: 0.750)
Adherence      Mean 0.600 (Median: 0.800)
```
**Expandable Details Include**:
- All percentiles
- Perfect score count (>=0.95)
- Poor score count (<0.3)
- Min/max values
## JSON Download Structure
### Complete Results JSON
All metrics are now included in the downloaded JSON:
```json
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
```
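Once downloaded, the JSON can be post-processed programmatically. A small sketch (the filename and the 0.3 threshold are illustrative choices, not part of the system):

```python
import json

def flag_unstable_metrics(results, threshold=0.3):
    """Return metric names whose batch std dev exceeds the threshold,
    given a parsed results JSON with the structure shown above."""
    return sorted(
        metric
        for metric, stats in results.get("rmse_metrics", {}).items()
        if stats["std_dev"] > threshold
    )

# Typical use with a downloaded file (path is illustrative):
# with open("evaluation_results.json") as f:
#     print(flag_unstable_metrics(json.load(f)))
```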
## How to Use These Metrics
### 1. Identify Inconsistent Metrics
Look at the RMSE Aggregation Std Dev:
- Std Dev > 0.3 = high variance (unstable metric)
- Std Dev < 0.1 = low variance (stable metric)
Example:
```
Adherence Std Dev: 0.432 <- highly variable; investigate consistency
```
### 2. Find Problem Areas
Look at the Per-Metric Statistics:
- Poor Count > 0 = the metric has low scores (< 0.3)
- Perfect Count = 0 = no perfect scores
Example:
```
Context Relevance Poor Count: 1 <- some queries have low relevance
Adherence Poor Count: 1 <- some responses contain hallucinations
```
### 3. Distribution Analysis
Compare Mean vs. Median:
- Mean ≈ Median: symmetric distribution
- Mean > Median: right-skewed (some high values)
- Mean < Median: left-skewed (some low values)
Example:
```
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
```
### 4. Evaluate Percentile Range
Use the 25th and 75th percentiles to understand the typical range:
Example:
```
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
```
## Integration with Evaluation Process
### Automatic Computation
RMSE and per-metric statistics are computed automatically during `evaluate_batch()`:
```python
def evaluate_batch(self, test_cases):
    # ... evaluation code ...
    # Automatically compute batch statistics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)
    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats
    return results
```
### No Ground Truth Required
Unlike RMSE against ground-truth labels or AUC-ROC calculations, these statistics:
- **Require no ground truth**
- Work directly on actual evaluation results
- Provide consistency and distribution insights
- Are suitable for real-world evaluation
## Example Analysis Workflow
### Scenario: Evaluation Results
```
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
```
### Step 1: Check RMSE Aggregation
```
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
```
### Step 2: Check Per-Metric Statistics
```
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
```
### Step 3: Investigate Issues
```
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check the retrieved documents and the response
```
### Step 4: Recommendation
```
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
```
## Comparison with Previous Approach
### Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC fields in the JSON
### After
- Overall averages plus a statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in the JSON
- Perfect/poor count indicators
## Technical Details
### RMSE Aggregation Formula
For each metric, the reported Std Dev is the population standard deviation:
$$\text{Std Dev} = \sqrt{\frac{\sum_i (x_i - \mu)^2}{n}}$$
Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations
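The formula can be checked directly against the adherence scores from the example workflow (0.0, 0.8, 1.0):

```python
import math

def population_std_dev(values):
    """Population standard deviation, matching the formula above."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

adherence = [0.0, 0.8, 1.0]
print(round(population_std_dev(adherence), 3))  # 0.432
```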
### Per-Metric Statistics
- **Percentile k**: Value below which k% of the data falls
- **Perfect Count**: Number of evaluations where the metric >= 0.95
- **Poor Count**: Number of evaluations where the metric < 0.3
## Files Modified
1. **advanced_rag_evaluator.py**
   - Added the `compute_rmse_aggregation_for_batch()` method
   - Added the `compute_per_metric_statistics()` method
   - Updated `evaluate_batch()` to compute both metrics
2. **streamlit_app.py**
   - Added the RMSE Aggregation section to the UI
   - Added the Per-Metric Statistics section to the UI
   - Updated the JSON download to include both metrics
## Next Steps
### Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations
### Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root-cause analysis for poor scores
### Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select the embedding model based on metric performance