# ToGMAL Next Steps: Adaptive Scoring & Nested CV

## Updated: 2025-10-21

This document outlines the immediate next steps to improve ToGMAL's difficulty assessment accuracy and establish a rigorous evaluation framework.

---

## 🎯 Immediate Goals (This Week)

### 1. **Implement Adaptive Uncertainty-Aware Scoring**
   - **Problem**: Current naive weighted average fails on low-similarity matches
   - **Example Failure**: "Prove the universe is 10,000 years old" → matched to factual recall (similarity ~0.57) → incorrectly rated LOW risk
   - **Solution**: Add uncertainty penalties when:
     - Max similarity < 0.7 (weak best match)
     - High variance in k-NN similarities (diverse, unreliable matches)
     - Low average similarity (all matches are weak)
   - **File to modify**: `benchmark_vector_db.py::query_similar_questions()`
   - **Expected improvement**: 5-15% AUROC gain on low-similarity cases
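As a concrete illustration, here is a minimal sketch of what the adaptive score could look like. The function name, the additive-penalty form, and the default parameter values are assumptions for illustration, not the final design; the penalty weights correspond to the `low_sim_penalty`, `variance_penalty`, and `low_avg_penalty` entries in the hyperparameter grid.

```python
import numpy as np

def compute_adaptive_difficulty(similarities, difficulties,
                                sim_threshold=0.7,
                                low_sim_penalty=0.5,
                                variance_penalty=2.0,
                                low_avg_penalty=0.4):
    """Similarity-weighted difficulty with uncertainty penalties.

    similarities: k-NN cosine similarities (higher = closer match)
    difficulties: difficulty scores of the k neighbors, in [0, 1]
    Returns (score, uncertainty); uncertainty inflates the risk
    estimate when the retrieved neighborhood is unreliable.
    """
    sims = np.asarray(similarities, dtype=float)
    diffs = np.asarray(difficulties, dtype=float)

    # Baseline: the current naive similarity-weighted average
    weights = sims / sims.sum()
    base_score = float(weights @ diffs)

    uncertainty = 0.0
    # Penalty 1: weak best match
    if sims.max() < sim_threshold:
        uncertainty += low_sim_penalty * (sim_threshold - sims.max())
    # Penalty 2: high variance among neighbor similarities
    uncertainty += variance_penalty * float(sims.var())
    # Penalty 3: all matches weak on average
    if sims.mean() < sim_threshold:
        uncertainty += low_avg_penalty * (sim_threshold - sims.mean())

    uncertainty = min(uncertainty, 1.0)
    # Push the score toward "risky" as uncertainty grows, so weak
    # matches can no longer produce a confident LOW rating
    adjusted = base_score + (1.0 - base_score) * uncertainty
    return adjusted, uncertainty
```

On the failure case above, low similarities trigger penalties 1 and 3, so the adjusted score rises even though the naive weighted average would stay low.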

### 2. **Export Database for Evaluation**
   - Add `get_all_questions_as_dataframe()` method to export 32K questions
   - Prepare for train/val/test splitting and nested CV
   - **File to modify**: `benchmark_vector_db.py`

### 3. **Test Adaptive Scoring**
   - Create test script with edge cases
   - Compare baseline vs. adaptive on known failure modes
   - **New file**: `test_adaptive_scoring.py`

---

## 📊 Evaluation Framework (Next 2-3 Weeks)

### Why Nested Cross-Validation?

**Problem with simple train/val/test split:**
- Single validation set can be lucky/unlucky (unrepresentative)
- Repeated "peeking" at validation during hyperparameter search causes data leakage
- Test set gives only ONE performance estimate (high variance)

**Nested CV advantages:**
- **Outer loop**: 5-fold CV for unbiased generalization estimate
- **Inner loop**: 3-fold grid search for hyperparameter tuning
- **No leakage**: Test folds never seen during tuning
- **Robust**: Multiple performance estimates across 5 different test sets
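The two loops can be sketched with scikit-learn's `StratifiedKFold` and `ParameterGrid`. Here `fit_and_score` is a hypothetical callback that would build a temporary vector DB from the training rows and return AUROC on the held-out rows; the skeleton only shows the leakage-free loop structure:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, ParameterGrid

def nested_cv(X, y, strata, param_grid, fit_and_score,
              outer_splits=5, inner_splits=3, seed=0):
    """Outer loop estimates generalization; inner loop tunes hyperparameters.

    fit_and_score(params, X_tr, y_tr, X_te, y_te) -> metric (e.g. AUROC)
    """
    outer = StratifiedKFold(outer_splits, shuffle=True, random_state=seed)
    outer_scores, chosen = [], []
    for tr, te in outer.split(X, strata):
        # Inner grid search sees ONLY the outer training fold
        inner = StratifiedKFold(inner_splits, shuffle=True, random_state=seed)
        best_params, best = None, -np.inf
        for params in ParameterGrid(param_grid):
            scores = [fit_and_score(params, X[tr][itr], y[tr][itr],
                                    X[tr][iva], y[tr][iva])
                      for itr, iva in inner.split(X[tr], strata[tr])]
            if np.mean(scores) > best:
                best, best_params = np.mean(scores), params
        # Evaluate the tuned configuration once on the untouched outer fold
        outer_scores.append(fit_and_score(best_params,
                                          X[tr], y[tr], X[te], y[te]))
        chosen.append(best_params)
    return np.mean(outer_scores), np.std(outer_scores), chosen
```

Because the test indices `te` never enter the inner loop, the outer scores remain an unbiased estimate of generalization.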

### Hyperparameters to Tune

```python
param_grid = {
    'k_neighbors': [3, 5, 7, 10],
    'similarity_threshold': [0.6, 0.7, 0.8],
    'low_sim_penalty': [0.3, 0.5, 0.7],
    'variance_penalty': [1.0, 2.0, 3.0],
    'low_avg_penalty': [0.2, 0.4, 0.6]
}
```

### Evaluation Metrics

1. **AUROC** (primary): Discriminative ability (0.5=random, 1.0=perfect)
2. **FPR@TPR95**: False positive rate when catching 95% of risky prompts
3. **AUPR**: Area under precision-recall curve (good for imbalanced data)
4. **Expected Calibration Error (ECE)**: Are predicted probabilities accurate?
5. **Brier Score**: Overall probabilistic prediction accuracy
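These five metrics can be computed with scikit-learn plus two small helpers for FPR@TPR95 and ECE. The function name and the equal-width binning scheme for ECE are assumptions; other binning strategies are possible:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, roc_curve,
                             average_precision_score, brier_score_loss)

def evaluate(y_true, y_prob, n_bins=10):
    """y_true: 1 = risky/hard prompt; y_prob: predicted risk probability."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    # FPR at the first ROC operating point reaching >= 95% TPR
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    fpr_at_tpr95 = float(fpr[np.searchsorted(tpr, 0.95)])

    # Expected Calibration Error: weighted |accuracy - confidence|
    # over equal-width probability bins
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bins = np.clip(np.digitize(y_prob, edges), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(np.mean(y_true[mask])
                                     - np.mean(y_prob[mask]))

    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "fpr_at_tpr95": fpr_at_tpr95,
        "aupr": average_precision_score(y_true, y_prob),
        "ece": ece,
        "brier": brier_score_loss(y_true, y_prob),
    }
```

Returning all metrics in one dict makes it easy to log per-fold results during nested CV and aggregate them afterwards.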

---

## 🗂️ Implementation Phases

### Phase 1: Adaptive Scoring (This Week)
- [x] 32K vector database with 20 domains, 7 benchmark sources
- [ ] Add `_compute_adaptive_difficulty()` method
- [ ] Integrate uncertainty penalties into scoring
- [ ] Test on known failure cases
- [ ] Update `togmal_mcp.py` to use adaptive scoring

### Phase 2: Data Export & Baseline (Week 2)
- [ ] Add `get_all_questions_as_dataframe()` export method
- [ ] Create simple 70/15/15 train/val/test split
- [ ] Run current ToGMAL (baseline) on test set
- [ ] Compute baseline metrics:
  - AUROC
  - FPR@TPR95
  - Expected Calibration Error
  - Brier Score
- [ ] Document failure modes (low similarity, cross-domain, etc.)

### Phase 3: Nested CV Implementation (Week 3)
- [ ] Implement `NestedCVEvaluator` class
- [ ] Outer CV: 5-fold stratified by (domain × difficulty)
- [ ] Inner CV: 3-fold grid search over hyperparameters
- [ ] Temporary vector DB creation per fold
- [ ] Metrics computation on each outer fold

### Phase 4: Hyperparameter Tuning (Week 4)
- [ ] Run full nested CV (5 outer folds × 3 inner folds per hyperparameter setting)
- [ ] Collect best hyperparameters per fold
- [ ] Identify most common optimal parameters
- [ ] Compute mean ± std generalization performance
- [ ] Compare to baseline

### Phase 5: Final Model & Deployment (Week 5)
- [ ] Train final model on ALL 32K questions with best hyperparameters
- [ ] Re-index full vector database
- [ ] Deploy to MCP server and HTTP facade
- [ ] Test with Claude Desktop

### Phase 6: OOD Testing (Week 6)
- [ ] Create OOD test sets:
  - **Adversarial**: "Prove false premises", jailbreaks
  - **Domain Shift**: Creative writing, coding, real user queries
  - **Temporal**: New benchmarks (2024+)
- [ ] Evaluate on each OOD set
- [ ] Analyze performance degradation vs. in-distribution

### Phase 7: Iteration & Documentation (Week 7)
- [ ] Analyze failures on OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with updated features
- [ ] Generate calibration plots (reliability diagrams)
- [ ] Write technical report

---

## 📈 Expected Improvements

Based on OOD detection literature and nested CV best practices:

1. **Adaptive scoring**: +5-15% AUROC on low-similarity cases
   - Baseline: ~0.75 AUROC (naive weighted average)
   - Target: ~0.85+ AUROC (adaptive with uncertainty)

2. **Nested CV**: Honest, robust performance estimates
   - Simple split: Single point estimate (could be lucky/unlucky)
   - Nested CV: Mean ± std across 5 folds

3. **Domain calibration**: 10-20% fewer false positives
   - Expected: FPR@TPR95 drops from ~0.25 to ~0.15

4. **Multi-signal fusion**: Better edge case detection
   - Combine vector similarity + rule-based heuristics
   - Improved recall on adversarial examples

5. **Calibration**: ECE < 0.05
   - Better alignment between predicted risk and actual difficulty

---

## ✅ Validation Checklist (Before Production Deploy)

- [ ] Nested CV completed with no data leakage
- [ ] Hyperparameters tuned on inner CV folds only
- [ ] Generalization performance estimated on outer CV folds
- [ ] OOD sets tested (adversarial, domain-shift, temporal)
- [ ] Calibration error within acceptable range (ECE < 0.1)
- [ ] Failure modes documented with specific examples
- [ ] Ablation studies show each component contributes
- [ ] Performance: adaptive > baseline on all metrics
- [ ] Real-world testing with user queries

---

## 🚀 Quick Start

See `togmal_improvement_plan.md` for full implementation details including:
- Complete code for `NestedCVEvaluator` class
- Adaptive scoring implementation
- All evaluation metrics with examples
- Detailed roadmap with weekly milestones

**Next Action**: Implement adaptive scoring in `benchmark_vector_db.py` and test with edge cases.