# TRACE RMSE Aggregation - System Architecture

## Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                  TRACE RMSE AGGREGATION SYSTEM                  │
└─────────────────────────────────────────────────────────────────┘

┌──────────────────────────────┐
│   GPT Labeling Evaluation    │
│ (advanced_rag_evaluator.py)  │
└──────────────────────────────┘
        │
        ├─→ Compute 4 TRACE metrics:
        │     • Context Relevance (R)
        │     • Context Utilization (U)
        │     • Completeness (C)
        │     • Adherence (A)
        ↓
┌──────────────────────────────────────────┐
│        AdvancedTRACEScores Class         │
│                                          │
│  metrics:                                │
│  ├─ context_relevance:   0.85            │
│  ├─ context_utilization: 0.80            │
│  ├─ completeness:        0.88            │
│  └─ adherence:           0.84            │
│                                          │
│  New Methods:                            │
│  • average()          → 0.8425           │
│  • rmse_aggregation() → 0.0286           │
└──────────────────────────────────────────┘
        │
        ↓
[JSON Output]
{
  "context_relevance": 0.85,
  "context_utilization": 0.80,
  "completeness": 0.88,
  "adherence": 0.84,
  "average": 0.8425,
  "rmse_aggregation": 0.0286   ← NEW
}
```

## Three Operational Modes

```
MODE 1: Single Evaluation Consistency
═══════════════════════════════════════════════════════════
Input: One AdvancedTRACEScores object
  ├─ context_relevance:   0.95
  ├─ context_utilization: 0.50  ← Very low!
  ├─ completeness:        0.85
  └─ adherence:           0.70

Process: rmse_aggregation()
  μ    = (0.95 + 0.50 + 0.85 + 0.70) / 4 = 0.75
  MSE  = ((0.20)² + (-0.25)² + (0.10)² + (-0.05)²) / 4 = 0.02875
  RMSE = √0.02875 = 0.170

Output: 0.170
  ↓
Interpretation: ⚠️ IMBALANCED
  Reason: High relevance but low utilization
  Action: Check whether the retrieved context is actually being used

MODE 2: Ground Truth Comparison
═══════════════════════════════════════════════════════════
Input: Predicted vs Ground Truth

  Predicted:        Ground Truth:
  ├─ R: 0.85        ├─ R: 0.84   → error: 0.01
  ├─ U: 0.80        ├─ U: 0.82   → error: 0.02
  ├─ C: 0.88        ├─ C: 0.87   → error: 0.01
  └─ A: 0.82        └─ A: 0.80   → error: 0.02

Process: compute_rmse_single_trace_evaluation()
  per-metric RMSE = |error|
  aggregated RMSE = √(mean of squared per-metric errors)

Output:
{
  "per_metric": {
    "context_relevance": 0.010,
    "context_utilization": 0.020,
    "completeness": 0.010,
    "adherence": 0.020
  },
  "aggregated_rmse": 0.0158
}
  ↓
Interpretation: ✓ ACCURATE
  All errors ≤ 0.02

MODE 3: Batch Aggregation (50+ evaluations)
═══════════════════════════════════════════════════════════
Input: List of 50 evaluation results with ground truth
[
  { "metrics": {...}, "ground_truth_scores": {...} },
  ... × 50
]

Process: compute_trace_rmse_aggregation()
  • Calculate RMSE for each metric across all 50 tests
  • Aggregate into consistency score

Output:
{
  "per_metric_rmse": {
    "context_relevance": 0.045,
    "context_utilization": 0.062,
    "completeness": 0.038,
    "adherence": 0.091
  },
  "aggregated_rmse": 0.058,
  "consistency_score": 0.942,   ← 0–1 scale
  "num_evaluations": 50,
  "evaluated_metrics": [...]
}
  ↓
Interpretation: ✓ EXCELLENT CONSISTENCY
  94.2% consistency across 50 test cases
```

## Data Flow Diagram

```
User Evaluation
  │
  ↓
┌─────────────────────────────┐
│    evaluator.evaluate()     │
│       (GPT Labeling)        │
└─────────────────────────────┘
  │
  ├─→ Generates 4 metrics
  │     (R, U, C, A)
  ↓
┌──────────────────────────┐
│   AdvancedTRACEScores    │
│   Created with metrics   │
└──────────────────────────┘
  │
  ├─→ to_dict()
  │     ├─ context_relevance:   0.85
  │     ├─ context_utilization: 0.80
  │     ├─ completeness:        0.88
  │     ├─ adherence:           0.84
  │     ├─ average:             0.8425
  │     └─ rmse_aggregation:    0.0286  ← AUTO
  │
  ├─→ Single evaluation:
  │     rmse = scores.rmse_aggregation()
  │
  └─→ Ground truth comparison:
        rmse_result = RMSECalculator.compute_rmse_single_trace_evaluation(
            predicted, ground_truth
        )

Batch Analysis
  │
  ↓
┌─────────────────────────────┐
│      Multiple Results       │
│   [result1, result2, ...]   │
└─────────────────────────────┘
  │
  ↓
┌───────────────────────────────────────┐
│           RMSECalculator.             │
│   compute_trace_rmse_aggregation()    │
└───────────────────────────────────────┘
  │
  ├─→ Per-metric RMSE calculation
  ├─→ Aggregation & consistency score
  ├─→ Statistical summary
  │
  ↓
┌────────────────────────────────────┐
│          Quality Report            │
│  ├─ consistency_score: 0.942      │
│  ├─ aggregated_rmse:   0.058      │
│  ├─ per_metric_rmse:   {...}      │
│  └─ num_evaluations:   50         │
└────────────────────────────────────┘
```

## Metric Calculation Flow

```
┌─────────────────────────────────────────────────────────┐
│                4 TRACE Metrics Computed                 │
└─────────────────────────────────────────────────────────┘
  ↓
  ├─ Context Relevance   (R): 0.85
  ├─ Context Utilization (U): 0.80
  ├─ Completeness        (C): 0.88
  └─ Adherence           (A): 0.84
  ↓
┌─────────────────────────────────────────────────────────┐
│ Calculate Mean (μ)                                      │
│   μ = (0.85 + 0.80 + 0.88 + 0.84) / 4                   │
│   μ = 0.8425                                            │
└─────────────────────────────────────────────────────────┘
  ↓
┌─────────────────────────────────────────────────────────┐
│ Calculate Deviations from Mean                          │
│   R - μ = 0.85 - 0.8425 = +0.0075                       │
│   U - μ = 0.80 - 0.8425 = -0.0425                       │
│   C - μ = 0.88 - 0.8425 = +0.0375                       │
│   A - μ = 0.84 - 0.8425 = -0.0025                       │
└─────────────────────────────────────────────────────────┘
  ↓
┌─────────────────────────────────────────────────────────┐
│ Square the Deviations                                   │
│   (+0.0075)² = 0.00005625                               │
│   (-0.0425)² = 0.00180625                               │
│   (+0.0375)² = 0.00140625                               │
│   (-0.0025)² = 0.00000625                               │
└─────────────────────────────────────────────────────────┘
  ↓
┌─────────────────────────────────────────────────────────┐
│ Calculate Mean Squared Error (MSE)                      │
│   MSE = (0.00005625 +                                   │
│          0.00180625 +                                   │
│          0.00140625 +                                   │
│          0.00000625) / 4                                │
│   MSE = 0.000819                                        │
└─────────────────────────────────────────────────────────┘
  ↓
┌─────────────────────────────────────────────────────────┐
│ Calculate RMSE                                          │
│   RMSE = √MSE = √0.000819 = 0.0286                      │
└─────────────────────────────────────────────────────────┘
  ↓
Result: 0.0286
Status: ✓ Excellent consistency (< 0.10)
```

## Integration Architecture

```
┌──────────────────────────────────────────────────────────┐
│                  Streamlit Application                   │
│                   (streamlit_app.py)                     │
└──────────────────────────────────────────────────────────┘
        │             │             │
        ↓             ↓             ↓
   ┌─────────┐   ┌──────────┐   ┌────────────┐
   │  Chat   │   │  Upload  │   │  Evaluate  │
   │ Section │   │ Section  │   │  Section   │
   └────┬────┘   └──────────┘   └─────┬──────┘
        │                             │
        │                    ┌────────↓───────┐
        │                    │   Evaluator    │
        │                    │   (evaluate)   │
        │                    └────────┬───────┘
        │                             │
        │                 ┌───────────↓─────────┐
        │                 │ AdvancedTRACEScores │
        │                 └──────────┬──────────┘
        │                            │
        │               ┌────────────┴───────────┐
        │               │                        │
        │        ┌──────↓──────┐     ┌──────────↓───────┐
        │        │  to_dict()  │     │ rmse_aggregation │
        │        │             │     │      (NEW)       │
        │        └──────┬──────┘     └──────────┬───────┘
        │               │                       │
        └───────────────┴───────────┬───────────┘
                                    │
                             ┌──────↓──────┐
                             │  JSON Data  │
                             │ (BCD.JSON)  │
                             └──────┬──────┘
                                    │
                           ┌────────┴────────┐
                           ↓                 ↓
                       ┌────────┐      ┌──────────┐
                       │ Metrics│      │ rmse_agg │
                       │  Tab   │      │   Tab    │
                       └────────┘      └──────────┘
```

## Quality Score Distribution

```
Perfect Consistency                      Perfect Imbalance
    (RMSE = 0)                             (RMSE = 0.5)
        │                                       │
        ↓                                       ↓
┌────────────────────────────────────────────────────┐
│ ████████ Excellent  ████████ Good  ███ Fair  ██ Poor │
└────────────────────────────────────────────────────┘
0         0.1        0.2        0.3        0.4        0.5
│          │          │          │          │          │
│          │          │          │          │          └─ No consistency
│          │          │          │          └─────────── Problematic
│          │          │          └────────────────────── Acceptable
│          │          └───────────────────────────────── Good
│          └──────────────────────────────────────────── Excellent
```

## Use Case: Problem Diagnosis

```
Evaluation Result:
┌─────────────────────────────────┐
│ R: 0.95 (Retrieved well)        │
│ U: 0.50 (Not using it!)  ← LOW  │
│ C: 0.85 (Some coverage)         │
│ A: 0.70 (Grounded)              │
│                                 │
│ RMSE: 0.17  ⚠️                   │
└─────────────────────────────────┘
  │
  ↓
Problem Identified:
  High relevance but low utilization
  ↓
Root Cause Analysis:
  • Retrieval is working (R = 0.95)
  • But the response isn't using it (U = 0.50)
  • Suggests: LLM isn't leveraging context
  ↓
Actions:
  • Improve prompt engineering
  • Add "Use the retrieved context" instructions
  • Test with better prompts
  ↓
Expected Result:
  R: 0.95, U: 0.90, C: 0.92, A: 0.91
  RMSE: 0.02 ✓
```

## File Organization

```
RAG Capstone Project/
├── advanced_rag_evaluator.py
│   ├── RMSECalculator (enhanced)
│   │   ├─ compute_rmse_for_metric()
│   │   ├─ compute_rmse_single_trace_evaluation()  ← NEW
│   │   ├─ compute_trace_rmse_aggregation()        ← NEW
│   │   └─ compute_rmse_all_metrics()
│   │
│   └── AdvancedTRACEScores (enhanced)
│       ├─ to_dict()  [includes rmse_aggregation]
│       ├─ average()
│       └─ rmse_aggregation()  ← NEW
│
├── test_rmse_aggregation.py  ← NEW
│   ├─ Test 1: Perfect consistency
│   ├─ Test 2: Imbalanced metrics
│   ├─ Test 3: JSON output
│   ├─ Test 4: Ground truth comparison
│   └─ Test 5: Batch aggregation
│
└── docs/
    ├── TRACE_RMSE_AGGREGATION.md       ← NEW (500+ lines)
    ├── TRACE_RMSE_QUICK_REFERENCE.md   ← NEW
    └── TRACE_RMSE_IMPLEMENTATION.md    ← NEW
```

## Performance Characteristics

```
┌────────────────────────────────────────────────┐
│              Performance Metrics               │
├────────────────────────────────────────────────┤
│ Operation            │ Time     │ Memory       │
├────────────────────────────────────────────────┤
│ rmse_aggregation()   │ < 0.1ms  │ 4 floats     │
│ single evaluation    │ < 0.2ms  │ 8 floats     │
│ batch (50 evals)     │ < 10ms   │ 400 floats   │
├────────────────────────────────────────────────┤
│ Total impact on      │          │              │
│ evaluation pipeline  │ < 1%     │ Negligible   │
└────────────────────────────────────────────────┘
```

## Quality Tiers

```
Score Range    Status          Action
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0.00 - 0.10    ✓ Excellent     No action
0.10 - 0.20    ✓ Good          Monitor
0.20 - 0.30    ⚠️ Acceptable    Investigate specific metrics
0.30 - 0.40    ❌ Poor          Review RAG pipeline
0.40+          ❌ Critical      Immediate action required
```

## Summary

The RMSE Aggregation System provides:

- ✅ **Statistical Rigor**: Standard RMSE metric
- ✅ **Automatic Integration**: No code changes needed
- ✅ **Interpretability**: Clear quality tiers
- ✅ **Problem Diagnosis**: Identifies specific metric imbalances
- ✅ **Batch Analytics**: Consistency scoring across evaluations
- ✅ **Performance**: < 1ms overhead per evaluation
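
## Appendix: Reference Sketches

The single-evaluation consistency score described in Mode 1 and the Metric Calculation Flow (mean → deviations → MSE → square root) is a population RMSE of the four metrics around their own mean. The sketch below mirrors the `AdvancedTRACEScores` name used in this document, but the field list and method bodies are illustrative assumptions, not the project's actual source:

```python
import math
from dataclasses import dataclass


@dataclass
class AdvancedTRACEScores:
    """Sketch of the four-metric score holder named in this document."""
    context_relevance: float
    context_utilization: float
    completeness: float
    adherence: float

    def _values(self) -> list[float]:
        return [self.context_relevance, self.context_utilization,
                self.completeness, self.adherence]

    def average(self) -> float:
        # Plain mean of the four TRACE metrics
        return sum(self._values()) / 4

    def rmse_aggregation(self) -> float:
        # Root-mean-square deviation of the metrics from their mean:
        # 0 = perfectly balanced, larger = more imbalanced
        mu = self.average()
        mse = sum((v - mu) ** 2 for v in self._values()) / 4
        return math.sqrt(mse)


# Balanced metrics from the Overview → RMSE ≈ 0.0286
balanced = AdvancedTRACEScores(0.85, 0.80, 0.88, 0.84)
# Imbalanced metrics from Mode 1 → RMSE ≈ 0.170
imbalanced = AdvancedTRACEScores(0.95, 0.50, 0.85, 0.70)
```

Rounding `balanced.rmse_aggregation()` to four decimals reproduces the 0.0286 result worked through in the Metric Calculation Flow.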
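Batch aggregation (Mode 3) can be sketched in the same spirit: compute one RMSE per metric across all predicted-vs-ground-truth pairs, then collapse them into a single score. The function name follows the document's `compute_trace_rmse_aggregation()`, but the input shape, the aggregation rule (plain mean of per-metric RMSEs), and `consistency_score = 1 − aggregated RMSE` are assumptions for illustration:

```python
import math


def compute_trace_rmse_aggregation(results: list[dict]) -> dict:
    """Batch-mode sketch: per-metric RMSE across many evaluations.

    Each result is assumed to carry "metrics" (predicted scores) and
    "ground_truth_scores" (reference scores) keyed by metric name.
    """
    metric_names = sorted(results[0]["metrics"])
    per_metric = {}
    for name in metric_names:
        sq_errors = [(r["metrics"][name] - r["ground_truth_scores"][name]) ** 2
                     for r in results]
        per_metric[name] = math.sqrt(sum(sq_errors) / len(sq_errors))
    # Assumed aggregation rule: mean of per-metric RMSEs
    aggregated = sum(per_metric.values()) / len(per_metric)
    return {
        "per_metric_rmse": per_metric,
        "aggregated_rmse": aggregated,
        "consistency_score": 1.0 - aggregated,
        "num_evaluations": len(results),
        "evaluated_metrics": metric_names,
    }


# Two toy evaluations over two metrics (hypothetical data, not from the project)
batch = [
    {"metrics": {"context_relevance": 0.90, "adherence": 0.80},
     "ground_truth_scores": {"context_relevance": 0.80, "adherence": 0.80}},
    {"metrics": {"context_relevance": 0.70, "adherence": 0.60},
     "ground_truth_scores": {"context_relevance": 0.80, "adherence": 0.80}},
]
report = compute_trace_rmse_aggregation(batch)
```

With these assumptions, the Mode 3 numbers are self-consistent: an aggregated RMSE of 0.058 over 50 evaluations yields the quoted consistency score of 0.942.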