# ToGMAL Next Steps: Adaptive Scoring & Nested CV
## Updated: 2025-10-21
This document outlines the immediate next steps to improve ToGMAL's difficulty assessment accuracy and establish a rigorous evaluation framework.
---
## 🎯 Immediate Goals (This Week)
### 1. **Implement Adaptive Uncertainty-Aware Scoring**
- **Problem**: Current naive weighted average fails on low-similarity matches
- **Example Failure**: "Prove universe is 10,000 years old" → matched to factual recall (similarity ~0.57) → incorrectly rated LOW risk
- **Solution**: Add uncertainty penalties when:
- Max similarity < 0.7 (weak best match)
- High variance in k-NN similarities (diverse, unreliable matches)
- Low average similarity (all matches are weak)
- **File to modify**: `benchmark_vector_db.py::query_similar_questions()`
- **Expected improvement**: 5-15% AUROC gain on low-similarity cases
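The penalty scheme above might be sketched as follows. This is a minimal illustration, not the final implementation: the function name and default penalty weights are assumptions, and the real hook would live in `benchmark_vector_db.py::query_similar_questions()`.

```python
import numpy as np

def compute_adaptive_difficulty(base_score, similarities,
                                similarity_threshold=0.7,
                                low_sim_penalty=0.5,
                                variance_penalty=2.0,
                                low_avg_penalty=0.4):
    """Raise a naive weighted-average difficulty score when k-NN evidence is weak.

    base_score    -- naive difficulty estimate in [0, 1]
    similarities  -- cosine similarities of the k nearest neighbors
    """
    sims = np.asarray(similarities, dtype=float)
    uncertainty = 0.0
    if sims.max() < similarity_threshold:          # weak best match
        uncertainty += low_sim_penalty * (similarity_threshold - sims.max())
    uncertainty += variance_penalty * sims.var()   # diverse, unreliable matches
    if sims.mean() < similarity_threshold:         # all matches are weak
        uncertainty += low_avg_penalty * (similarity_threshold - sims.mean())
    # More uncertainty pushes the estimate toward "risky / difficult"
    return min(1.0, base_score + uncertainty)
```

With the failure example above (best similarity ~0.57), all three penalties fire and the score rises instead of defaulting to LOW risk.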
### 2. **Export Database for Evaluation**
- Add `get_all_questions_as_dataframe()` method to export 32K questions
- Prepare for train/val/test splitting and nested CV
- **File to modify**: `benchmark_vector_db.py`
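A minimal sketch of what that export could look like. The `Question` record and its field names are assumptions for illustration; the real method would read records and metadata out of the vector database.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class Question:
    text: str
    domain: str
    source: str        # which of the 7 benchmark sources it came from
    difficulty: float  # normalized difficulty label in [0, 1]

def questions_to_dataframe(questions):
    """Flatten stored questions into a DataFrame ready for splitting and nested CV."""
    return pd.DataFrame([vars(q) for q in questions])
```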
### 3. **Test Adaptive Scoring**
- Create test script with edge cases
- Compare baseline vs. adaptive on known failure modes
- **New file**: `test_adaptive_scoring.py`
---
## π Evaluation Framework (Next 2-3 Weeks)
### Why Nested Cross-Validation?
**Problem with simple train/val/test split:**
- Single validation set can be lucky/unlucky (unrepresentative)
- Repeated "peeking" at validation during hyperparameter search causes data leakage
- Test set gives only ONE performance estimate (high variance)
**Nested CV advantages:**
- **Outer loop**: 5-fold CV for unbiased generalization estimate
- **Inner loop**: 3-fold grid search for hyperparameter tuning
- **No leakage**: Test folds never seen during tuning
- **Robust**: Multiple performance estimates across 5 different test sets
### Hyperparameters to Tune
```python
param_grid = {
    'k_neighbors': [3, 5, 7, 10],
    'similarity_threshold': [0.6, 0.7, 0.8],
    'low_sim_penalty': [0.3, 0.5, 0.7],
    'variance_penalty': [1.0, 2.0, 3.0],
    'low_avg_penalty': [0.2, 0.4, 0.6]
}
```
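Wiring that grid into the two loops could look like the sketch below. It assumes a caller-supplied `score_fn(X_train, y_train, X_eval, y_eval, params)` that builds the temporary per-fold vector DB and returns a metric such as AUROC; that callback is an assumption, not an existing API.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, StratifiedKFold

def nested_cv(X, y, strata, param_grid, score_fn, outer_k=5, inner_k=3, seed=0):
    """Outer folds estimate generalization; inner folds pick hyperparameters."""
    outer = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=seed)
    inner = StratifiedKFold(n_splits=inner_k, shuffle=True, random_state=seed)
    outer_scores, chosen = [], []
    for tr, te in outer.split(X, strata):
        X_tr, y_tr, s_tr = X[tr], y[tr], strata[tr]
        best_params, best_score = None, -np.inf
        for params in ParameterGrid(param_grid):
            scores = [score_fn(X_tr[i], y_tr[i], X_tr[j], y_tr[j], params)
                      for i, j in inner.split(X_tr, s_tr)]
            if np.mean(scores) > best_score:
                best_params, best_score = params, np.mean(scores)
        chosen.append(best_params)  # test fold never seen during tuning
        outer_scores.append(score_fn(X_tr, y_tr, X[te], y[te], best_params))
    return np.array(outer_scores), chosen
```

Each outer test fold is scored with hyperparameters chosen purely on that fold's training data, which is what makes the resulting mean ± std an honest generalization estimate.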
### Evaluation Metrics
1. **AUROC** (primary): Discriminative ability (0.5=random, 1.0=perfect)
2. **FPR@TPR95**: False positive rate when catching 95% of risky prompts
3. **AUPR**: Area under precision-recall curve (good for imbalanced data)
4. **Expected Calibration Error (ECE)**: Are predicted probabilities accurate?
5. **Brier Score**: Overall probabilistic prediction accuracy
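AUROC, AUPR, and Brier score are one-liners in scikit-learn (`roc_auc_score`, `average_precision_score`, `brier_score_loss`); the two less standard metrics can be sketched as follows (equal-width binning for ECE is one common convention among several):

```python
import numpy as np
from sklearn.metrics import roc_curve

def fpr_at_tpr(y_true, y_score, target_tpr=0.95):
    """False positive rate at the threshold that catches `target_tpr` of positives."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr[np.searchsorted(tpr, target_tpr)]

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed rate - mean confidence| over probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece
```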
---
## 🏗️ Implementation Phases
### Phase 1: Adaptive Scoring (This Week)
- [x] ✅ 32K vector database with 20 domains, 7 benchmark sources
- [ ] Add `_compute_adaptive_difficulty()` method
- [ ] Integrate uncertainty penalties into scoring
- [ ] Test on known failure cases
- [ ] Update `togmal_mcp.py` to use adaptive scoring
### Phase 2: Data Export & Baseline (Week 2)
- [ ] Add `get_all_questions_as_dataframe()` export method
- [ ] Create simple 70/15/15 train/val/test split
- [ ] Run current ToGMAL (baseline) on test set
- [ ] Compute baseline metrics:
- AUROC
- FPR@TPR95
- Expected Calibration Error
- Brier Score
- [ ] Document failure modes (low similarity, cross-domain, etc.)
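The 70/15/15 split from Phase 2 can be done in two stages with scikit-learn. A sketch, assuming the exported DataFrame carries a `domain` column to stratify on:

```python
from sklearn.model_selection import train_test_split

def split_70_15_15(df, stratify_col="domain", seed=42):
    """70% train, then split the remaining 30% evenly into val and test."""
    train, rest = train_test_split(df, test_size=0.30, random_state=seed,
                                   stratify=df[stratify_col])
    val, test = train_test_split(rest, test_size=0.50, random_state=seed,
                                 stratify=rest[stratify_col])
    return train, val, test
```

Stratifying both stages keeps each domain's share roughly constant across the three partitions, so baseline metrics are not skewed by domain imbalance.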
### Phase 3: Nested CV Implementation (Week 3)
- [ ] Implement `NestedCVEvaluator` class
- [ ] Outer CV: 5-fold stratified by (domain × difficulty)
- [ ] Inner CV: 3-fold grid search over hyperparameters
- [ ] Temporary vector DB creation per fold
- [ ] Metrics computation on each outer fold
### Phase 4: Hyperparameter Tuning (Week 4)
- [ ] Run full nested CV (5 outer × 3 inner = 15 train/val runs per hyperparameter setting)
- [ ] Collect best hyperparameters per fold
- [ ] Identify most common optimal parameters
- [ ] Compute mean ± std generalization performance
- [ ] Compare to baseline
### Phase 5: Final Model & Deployment (Week 5)
- [ ] Train final model on ALL 32K questions with best hyperparameters
- [ ] Re-index full vector database
- [ ] Deploy to MCP server and HTTP facade
- [ ] Test with Claude Desktop
### Phase 6: OOD Testing (Week 6)
- [ ] Create OOD test sets:
- **Adversarial**: "Prove false premises", jailbreaks
- **Domain Shift**: Creative writing, coding, real user queries
- **Temporal**: New benchmarks (2024+)
- [ ] Evaluate on each OOD set
- [ ] Analyze performance degradation vs. in-distribution
### Phase 7: Iteration & Documentation (Week 7)
- [ ] Analyze failures on OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with updated features
- [ ] Generate calibration plots (reliability diagrams)
- [ ] Write technical report
---
## 📈 Expected Improvements
Based on OOD detection literature and nested CV best practices:
1. **Adaptive scoring**: +5-15% AUROC on low-similarity cases
- Baseline: ~0.75 AUROC (naive weighted average)
- Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV**: Honest, robust performance estimates
- Simple split: Single point estimate (could be lucky/unlucky)
- Nested CV: Mean ± std across 5 folds
3. **Domain calibration**: 10-20% fewer false positives
- Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion**: Better edge case detection
- Combine vector similarity + rule-based heuristics
- Improved recall on adversarial examples
5. **Calibration**: ECE < 0.05
- Better alignment between predicted risk and actual difficulty
---
## ✅ Validation Checklist (Before Production Deploy)
- [ ] Nested CV completed with no data leakage
- [ ] Hyperparameters tuned on inner CV folds only
- [ ] Generalization performance estimated on outer CV folds
- [ ] OOD sets tested (adversarial, domain-shift, temporal)
- [ ] Calibration error within acceptable range (ECE < 0.1)
- [ ] Failure modes documented with specific examples
- [ ] Ablation studies show each component contributes
- [ ] Performance: adaptive > baseline on all metrics
- [ ] Real-world testing with user queries
---
## 🚀 Quick Start
See `togmal_improvement_plan.md` for full implementation details including:
- Complete code for `NestedCVEvaluator` class
- Adaptive scoring implementation
- All evaluation metrics with examples
- Detailed roadmap with weekly milestones
**Next Action**: Implement adaptive scoring in `benchmark_vector_db.py` and test with edge cases.