# RMSE Metrics Implementation Guide

## Overview

RMSE (Root Mean Squared Error) aggregation and per-metric statistics are now fully integrated into the evaluation system. These metrics are automatically computed during batch evaluation and included in both the UI display and JSON downloads.

## What Was Implemented

### 1. RMSE Aggregation for Batch Evaluation

**Method**: `RMSECalculator.compute_rmse_aggregation_for_batch(results)`

Computes consistency metrics for each TRACE metric across all evaluations. Shows how much each metric varies across the batch.

**Output Structure**:
```json
{
  "rmse_metrics": {
    "context_relevance": {
      "mean": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "variance": 0.0150,
      "count": 3
    },
    "context_utilization": {
      "mean": 0.7500,
      "std_dev": 0.1225,
      "min": 0.6000,
      "max": 0.9000,
      "variance": 0.0150,
      "count": 3
    },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```

**Interpretation**:
- **Mean**: Average score for that metric across all evaluations
- **Std Dev**: Spread of the scores; lower means more consistent
- **Min/Max**: Lowest and highest values observed
- **Variance**: The square of the standard deviation
- **Count**: Number of evaluations
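
The guide names `RMSECalculator.compute_rmse_aggregation_for_batch()`; the real implementation lives in `advanced_rag_evaluator.py`, but a minimal sketch of the computation (an assumption, presuming each result dict carries the four TRACE metric keys directly) could look like:

```python
from statistics import mean, pstdev, pvariance

TRACE_METRICS = ("context_relevance", "context_utilization", "completeness", "adherence")

def compute_rmse_aggregation_for_batch(results):
    """Per-metric consistency stats across a batch of evaluation results."""
    aggregation = {}
    for metric in TRACE_METRICS:
        values = [r[metric] for r in results if metric in r]
        if not values:
            continue  # metric absent from this batch
        aggregation[metric] = {
            "mean": round(mean(values), 4),
            "std_dev": round(pstdev(values), 4),      # population std dev
            "min": round(min(values), 4),
            "max": round(max(values), 4),
            "variance": round(pvariance(values), 4),  # population variance
            "count": len(values),
        }
    return aggregation
```

Using population statistics (`pstdev`/`pvariance`) reproduces the example figures above, e.g. a `std_dev` of 0.1225 for relevance scores 0.20, 0.50, 0.35.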

### 2. Per-Metric Statistics

**Method**: `AUCROCCalculator.compute_per_metric_statistics(results)`

Provides detailed statistical breakdown of each TRACE metric without requiring ground truth.

**Output Structure**:
```json
{
  "per_metric_statistics": {
    "context_relevance": {
      "mean": 0.3500,
      "median": 0.3500,
      "std_dev": 0.1225,
      "min": 0.2000,
      "max": 0.5000,
      "percentile_25": 0.2750,
      "percentile_75": 0.4250,
      "perfect_count": 0,
      "poor_count": 1,
      "sample_count": 3
    },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  }
}
```

**Interpretation**:
- **Mean/Median**: Central tendency of metric values
- **Percentile 25/75**: Distribution quartiles
- **Perfect Count**: How many evaluations scored >= 0.95
- **Poor Count**: How many evaluations scored < 0.3
- **Sample Count**: Total number of evaluations
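
A comparable sketch for `AUCROCCalculator.compute_per_metric_statistics()` (again an assumed implementation, not the shipped one; it uses the stdlib `statistics` module with the inclusive quantile method, which reproduces the quartile figures above):

```python
from statistics import mean, median, pstdev, quantiles

TRACE_METRICS = ("context_relevance", "context_utilization", "completeness", "adherence")

def compute_per_metric_statistics(results):
    """Distribution stats for each TRACE metric; no ground truth required."""
    stats = {}
    for metric in TRACE_METRICS:
        values = sorted(r[metric] for r in results if metric in r)
        if len(values) < 2:
            continue  # quartiles need at least two points
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        stats[metric] = {
            "mean": round(mean(values), 4),
            "median": round(median(values), 4),
            "std_dev": round(pstdev(values), 4),
            "min": values[0],
            "max": values[-1],
            "percentile_25": round(q1, 4),
            "percentile_75": round(q3, 4),
            "perfect_count": sum(v >= 0.95 for v in values),
            "poor_count": sum(v < 0.3 for v in values),
            "sample_count": len(values),
        }
    return stats
```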

## UI Display

### RMSE Aggregation Metrics (Metric Consistency)

Shows mean and standard deviation for each metric:

```
Relevance      0.350 ±0.123
Utilization    0.750 ±0.123
Completeness   0.717 ±0.125
Adherence      0.600 ±0.432
```

**What it means**:
- Lower Std Dev = More consistent metric
- High Std Dev (like Adherence 0.432) = Metric varies significantly across evaluations

### Per-Metric Statistics (Distribution)

Shows distribution characteristics:

```
Relevance Mean       0.350 (Median: 0.350)
Utilization Mean     0.750 (Median: 0.750)
Completeness Mean    0.717 (Median: 0.750)
Adherence Mean       0.600 (Median: 0.800)
```

**Expandable Details Include**:
- All percentiles
- Perfect score count (>=0.95)
- Poor score count (<0.3)
- Min/max values

## JSON Download Structure

### Complete Results JSON

All metrics are now included in the downloaded JSON:

```json
{
  "evaluation_metadata": {
    "timestamp": "2025-12-27T...",
    "dataset": "...",
    "method": "gpt_labeling_prompts",
    "total_samples": 3,
    "embedding_model": "..."
  },
  "aggregate_metrics": {
    "context_relevance": 0.35,
    "context_utilization": 0.75,
    "completeness": 0.717,
    "adherence": 0.60,
    "average": 0.595
  },
  "rmse_metrics": {
    "context_relevance": { "mean": 0.35, "std_dev": 0.1225, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "per_metric_statistics": {
    "context_relevance": { "mean": 0.35, "median": 0.35, ... },
    "context_utilization": { ... },
    "completeness": { ... },
    "adherence": { ... }
  },
  "detailed_results": [ ... ]
}
```

## How to Use These Metrics

### 1. Identify Inconsistent Metrics

Look at RMSE Aggregation Std Dev:
- Std Dev > 0.3 = High variance (unstable metric)
- Std Dev < 0.1 = Low variance (stable metric)

Example:
```
Adherence Std Dev: 0.432  <- Highly variable, investigate why scores differ
```
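
This check is easy to script against the downloaded JSON. A small helper sketch (the 0.3 threshold matches the rule of thumb above; the field names follow the JSON structure shown earlier):

```python
import json

def flag_unstable_metrics(results_json: str, threshold: float = 0.3) -> dict:
    """Return {metric: std_dev} for metrics whose batch std dev exceeds the threshold."""
    results = json.loads(results_json)
    return {
        metric: stats["std_dev"]
        for metric, stats in results.get("rmse_metrics", {}).items()
        if stats["std_dev"] > threshold
    }
```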

### 2. Find Problem Areas

Look at Per-Metric Statistics:
- Poor Count > 0 = Metric has low scores (< 0.3)
- Perfect Count = 0 = No perfect scores

Example:
```
Context Relevance Poor Count: 1   <- Some queries have low relevance
Adherence Poor Count: 1           <- Some responses have hallucinations
```

### 3. Distribution Analysis

Compare Mean vs Median:
- If Mean ≈ Median: Symmetric distribution
- If Mean > Median: Right-skewed (some high values)
- If Mean < Median: Left-skewed (some low values)

Example:
```
Adherence Mean: 0.600, Median: 0.800
-> Left-skewed (pulled down by low values)
```
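
The mean-vs-median comparison can be automated with a toy helper (not part of the evaluator; the tolerance is an arbitrary choice):

```python
from statistics import mean, median

def skew_direction(values, tol=0.01):
    """Classify a distribution by comparing its mean and median."""
    m, med = mean(values), median(values)
    if abs(m - med) <= tol:
        return "roughly symmetric"
    return "right-skewed" if m > med else "left-skewed"
```

For the adherence scores in the workflow example (0.0, 0.8, 1.0), the mean of 0.60 sits below the median of 0.80, so the helper reports "left-skewed".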

### 4. Evaluate Percentile Range

Use 25th and 75th percentiles to understand typical range:

Example:
```
Context Relevance: 25th=0.275, 75th=0.425
-> Typical range is 0.275-0.425 (middle 50%)
```

## Integration with Evaluation Process

### Automatic Computation

RMSE and per-metric statistics are computed automatically during `evaluate_batch()`:

```python
def evaluate_batch(self, test_cases):
    # ... evaluation code ...
    
    # Automatically compute metrics
    rmse_metrics = RMSECalculator.compute_rmse_aggregation_for_batch(detailed_results)
    per_metric_stats = AUCROCCalculator.compute_per_metric_statistics(detailed_results)
    
    results["rmse_metrics"] = rmse_metrics
    results["per_metric_statistics"] = per_metric_stats
    
    return results
```

### No Ground Truth Required

Unlike RMSE against ground-truth labels or AUC-ROC calculations, these statistics:
- **Require no ground truth**
- Work directly on actual evaluation results
- Provide consistency and distribution insights
- Are suitable for real-world evaluation runs

## Example Analysis Workflow

### Scenario: Evaluation Results
```
Sample 1: R=0.20, U=0.75, C=0.75, A=0.0
Sample 2: R=0.50, U=0.90, C=0.85, A=0.8
Sample 3: R=0.35, U=0.60, C=0.55, A=1.0
```

### Step 1: Check RMSE Aggregation
```
Adherence Std Dev: 0.432 (highest variability)
-> Adherence scores vary widely (0.0 to 1.0)
```

### Step 2: Check Per-Metric Statistics
```
Adherence: Mean=0.60, Median=0.80, Poor=1, Perfect=1
-> One perfect response, one with hallucinations
```

### Step 3: Investigate Issues
```
Poor Adherence (0.0) appears in Sample 1
-> Investigate what caused the hallucination
-> Check retrieved documents and response
```

### Step 4: Recommendation
```
Adherence is inconsistent (Std Dev 0.432)
-> Improve retrieval quality to avoid hallucinations
-> Focus on samples with A=0.0
```

## Comparison with Previous Approach

### Before
- Only overall averages shown
- No distribution information
- No consistency metrics
- Empty RMSE/AUCROC in JSON

### After
- Overall averages + statistical breakdown
- Full distribution analysis (percentiles, quartiles)
- Consistency measurement (standard deviation)
- Populated RMSE and per-metric stats in JSON
- Perfect/poor count indicators

## Technical Details

### RMSE Aggregation Formula

For each metric:
$$\text{Std Dev} = \sqrt{\frac{\sum(x_i - \mu)^2}{n}}$$

Where:
- $x_i$ = metric value for evaluation $i$
- $\mu$ = mean metric value
- $n$ = number of evaluations
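
Worked through for the adherence scores in the example scenario (0.0, 0.8, 1.0):

```python
import math

scores = [0.0, 0.8, 1.0]        # adherence values from the example scenario
mu = sum(scores) / len(scores)  # mean = 0.6
variance = sum((x - mu) ** 2 for x in scores) / len(scores)  # 0.56 / 3
std_dev = math.sqrt(variance)
print(round(std_dev, 3))  # 0.432, matching the UI table above
```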

### Per-Metric Statistics

- **Percentile k**: Value below which k% of data falls
- **Perfect Count**: Number of evaluations where metric >= 0.95
- **Poor Count**: Number of evaluations where metric < 0.3

## Files Modified

1. **advanced_rag_evaluator.py**
   - Added `compute_rmse_aggregation_for_batch()` method
   - Added `compute_per_metric_statistics()` method
   - Updated `evaluate_batch()` to compute metrics

2. **streamlit_app.py**
   - Added RMSE Aggregation section to UI
   - Added Per-Metric Statistics section to UI
   - Updated JSON download to include both metrics

## Next Steps

### Visualization
- Add charts showing metric distributions
- Comparison plots across evaluations
- Heatmaps for metric correlations

### Advanced Analysis
- Metric trend analysis over time
- Correlation between metrics
- Root cause analysis for poor scores

### Optimization
- Use insights to improve retrieval
- Adjust chunk size/overlap based on metrics
- Select embedding model based on metric performance