# EVALUATION STRATEGY: Phase 6 Validation Framework
**Status**: Evaluation Sprint Framework Complete
**Created**: 2026-03-19
**Purpose**: Answer whether Phase 6 is actually better, not just more complex
---
## The Core Question
We have built something elegant. But:
**Q: Is Codette + Phase 6 measurably better than baseline?**
Not:
- Does it produce longer responses?
- Does it maintain higher coherence?
- Does it satisfy the mathematical framework?
But rather:
- **Does it get more questions right?**
- **Do debates actually improve reasoning?**
- **Does the system trust the wrong answers?** (false consensus)
- **Does each Phase 6 component add value?**
---
## Test Design: 4 Conditions × 25 Questions
### Conditions (What We're Comparing)
```
Condition 1: BASELINE LLAMA
- Plain Llama-3.1-8B, no routing, no debate
- Baseline: what does the model do on its own?
- Cost: ~5 seconds per question
Condition 2: PHASE 1-5 (Debate System)
- Multi-round debate with conflict detection
- Memory weighting for adapter selection
- NO semantic tension (use heuristic opposition)
- NO specialization tracking
- NO preflight prediction
- Cost: ~30 seconds per question
Condition 3: PHASE 6 FULL (Semantic + All)
- Everything Phase 1-5 has PLUS:
* Semantic tension engine (Llama embeddings)
* Specialization tracking
* Pre-flight conflict prediction
- Cost: ~40 seconds per question
Condition 4: PHASE 6 -PREFLIGHT (Isolate Pre-Flight Value)
- Phase 6 full EXCEPT: disable preflight prediction
- Measures: Does pre-flight actually help?
- Cost: ~35 seconds per question
```
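The four conditions can be captured in a small config structure so the runner toggles components rather than duplicating code. A minimal sketch; the field names are illustrative assumptions, not the actual runner's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One evaluation condition: which components are enabled."""
    name: str
    debate: bool            # Phase 1-5 multi-round debate
    semantic_tension: bool  # Phase 6 semantic tension engine
    specialization: bool    # Phase 6 specialization tracking
    preflight: bool         # Phase 6 pre-flight conflict prediction

CONDITIONS = [
    Condition("baseline", debate=False, semantic_tension=False,
              specialization=False, preflight=False),
    Condition("phase_1_5", debate=True, semantic_tension=False,
              specialization=False, preflight=False),
    Condition("phase_6_full", debate=True, semantic_tension=True,
              specialization=True, preflight=True),
    Condition("phase_6_no_preflight", debate=True, semantic_tension=True,
              specialization=True, preflight=False),
]
```

Condition 4 differs from Condition 3 by exactly one flag, which is what makes the pre-flight ablation clean.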
### Questions (What We're Testing)
**25 questions spanning 6 domains:**
| Domain | Easy | Medium | Hard | Topics |
|--------|------|--------|------|--------|
| Physics | 2 | 1 | 1 | Light, scattering, entropy |
| Ethics | 0 | 2 | 2 | Honesty, AI transparency, morality |
| Consciousness | 0 | 1 | 2 | Machine consciousness, mind-body |
| Creativity | 0 | 2 | 1 | Definition, AI creativity |
| Systems | 0 | 2 | 2 | Emergence, balance, feedback |
| Interdisciplinary | 0 | 0 | 3 | Free will, knowledge, time |
**Key Properties of Questions:**
- Ground truth varies (factual, rubric-based, multi-framework)
- Mix of objective (physics) and philosophical (consciousness)
- Different domains require different types of adaptation
- Difficulty scales: easy (1 perspective) → hard (5+ perspectives)
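Each question record needs to carry its ground-truth type so the scorer knows how to grade it. A sketch of what two entries might look like; the schema and sample questions are assumptions, not the actual `test_suite_evaluation.py` format:

```python
# Illustrative question records (hypothetical schema).
QUESTIONS = [
    {
        "id": "phys_01",
        "domain": "physics",
        "difficulty": "easy",                 # easy -> ~1 perspective needed
        "question": "Why is the sky blue?",
        "ground_truth_type": "factual",       # binary right/wrong
        "expected_perspectives": 1,
    },
    {
        "id": "cons_02",
        "domain": "consciousness",
        "difficulty": "hard",                 # hard -> 5+ perspectives
        "question": "Can a machine be conscious?",
        "ground_truth_type": "multi_framework",  # % of perspectives covered
        "expected_perspectives": 5,
    },
]
```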
---
## Measurement: 5 Metrics Per Question
### 1. **Correctness Score** (0-1)
**What**: Does the final synthesis give the right answer?
**How to measure**:
- Factual questions (physics): Binary or near-binary (right/wrong)
- Rubric questions (ethics): 0 = missed key framework, 0.5 = partial, 1 = complete
- Multi-perspective (consciousness): % of expected perspectives identified
- Human evaluation needed for final calibration
**Expected Pattern**:
```
Baseline: 0.55 ± 0.20 (gets some questions right by luck)
Phase 1-5: 0.65 ± 0.18 (debate helps with reasoning)
Phase 6 Full: 0.72 ± 0.16 (semantic tension picks winners better)
```
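The three grading modes above can be sketched as one dispatching scorer. The function and its keyword arguments are hypothetical; in the real harness the counts would be extracted from the synthesis text, with human calibration for edge cases:

```python
def correctness_score(ground_truth_type, *, exact_match=False,
                      rubric_hits=0, rubric_total=0,
                      perspectives_found=0, perspectives_expected=0):
    """Score one answer on a 0-1 scale per the rubric above (sketch)."""
    if ground_truth_type == "factual":
        # Physics-style questions: binary right/wrong
        return 1.0 if exact_match else 0.0
    if ground_truth_type == "rubric":
        # Ethics-style questions: fraction of key frameworks covered
        return rubric_hits / rubric_total if rubric_total else 0.0
    if ground_truth_type == "multi_framework":
        # Consciousness-style questions: % of expected perspectives identified
        return (perspectives_found / perspectives_expected
                if perspectives_expected else 0.0)
    raise ValueError(f"unknown ground truth type: {ground_truth_type}")
```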
### 2. **Reasoning Depth** (1-5 scale)
**What**: How many distinct perspectives did the system identify?
**How to measure**:
- Count unique agent positions in debate
- 1 = single perspective, 5 = 5+ integrated views
- Correlation with correctness (not all disagreement is useful)
**Expected Pattern**:
```
Baseline: 1.0 (single output)
Phase 1-5: 2.8 ± 1.2 (debate creates disagreement)
Phase 6 Full: 3.2 ± 1.1 (semantic tension balances high-value conflicts)
```
### 3. **Calibration Error** (0-1, lower=better)
**What**: |reported_confidence - actual_correctness|
Does Codette say "I'm confident" when it should?
**How to measure**:
- Extract coherence_score from metadata
- Compare to actual correctness_score
- 0 = perfectly calibrated, 1 = maximally miscalibrated
**Red Flag Pattern** (False Consensus):
```
High calibration error + High coherence = System is confident in wrong answer
Example:
Gamma = 0.85 (system thinks it's done well)
Actual correctness = 0.3 (it got it very wrong)
Calibration error = 0.55 (WARNING: MISCALIBRATION)
```
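The calibration metric and its red-flag pattern reduce to two one-liners. A minimal sketch; the threshold values are the ones stated above:

```python
def calibration_error(reported_confidence, actual_correctness):
    """|reported_confidence - actual_correctness|, both on [0, 1].

    0 = perfectly calibrated, 1 = maximally miscalibrated.
    """
    return abs(reported_confidence - actual_correctness)

def is_false_consensus(gamma, correctness,
                       gamma_floor=0.8, correctness_ceiling=0.6):
    """Flag the red-flag pattern: high reported coherence, low correctness."""
    return gamma > gamma_floor and correctness < correctness_ceiling

# Worked example from above (within floating-point tolerance):
# calibration_error(0.85, 0.3) ~ 0.55 -> miscalibration warning
```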
### 4. **Adapter Convergence** (0-1, lower=better)
**What**: Are all adapters giving similar outputs? (Monoculture risk)
**How to measure**:
- Semantic similarity between adapter outputs
- 0 = all completely different, 1 = all identical
- Danger zone: >0.85 indicates semantic collapse
**Expected Pattern**:
```
Baseline: 1.0 (single output, converged by definition)
Phase 1-5: 0.65 ± 0.18 (diverse outputs through debate)
Phase 6 Full: 0.58 ± 0.16 (specialization prevents convergence)
Phase 6 -PF: 0.62 ± 0.17 (similar, preflight has small impact on diversity)
```
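One way to compute this metric is mean pairwise cosine similarity over the adapter output embeddings. A stdlib-only sketch (assumption: embeddings arrive as plain float vectors; the real system would use the Llama embedding space):

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def adapter_convergence(embeddings):
    """Mean pairwise cosine similarity of adapter output embeddings.

    1.0 = all adapters saying the same thing; above ~0.85 is the
    monoculture danger zone. A single output is fully converged (1.0).
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 1.0
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```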
### 5. **Debate Efficiency** (1-3 round count)
**What**: How many rounds until the system converges?
**How to measure**:
- Count rounds until resolution_rate > 80%
- Lower = more efficient (waste less compute resolving noise)
- Phase 1-5 baseline for comparison
**Expected Pattern**:
```
Phase 1-5: 2.1 ± 0.8 rounds (typically needs 2 rounds)
Phase 6 Full: 1.8 ± 0.7 rounds (pre-flight reduces setup conflicts)
Phase 6 -PF: 2.0 ± 0.8 rounds (without preflight, more setup conflicts)
```
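Counting rounds to convergence is a small loop over the per-round resolution rates. A sketch, assuming the debate log exposes one resolution rate per round (a hypothetical interface):

```python
def rounds_to_convergence(resolution_rates, threshold=0.80, max_rounds=3):
    """Rounds until resolution_rate exceeds the threshold.

    `resolution_rates` lists the per-round rates in order; if the
    threshold is never reached, the debate ran to max_rounds.
    """
    for round_number, rate in enumerate(resolution_rates, start=1):
        if rate > threshold:
            return round_number
    return max_rounds
```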
---
## Analysis: What We're Looking For
### Primary Success Metric
**Phase 6 Correctness > Phase 1-5 Correctness** (with statistical significance)
```
Phase 1-5: 70% mean correctness
Phase 6 Full: 78% mean correctness
Improvement: +8 percentage points
Significance: paired t-test over per-question scores (p < 0.05)
              If the per-question std deviation exceeds ~10%,
              the improvement might be noise
```
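Because every condition answers the same 25 questions, the right comparison is a paired test on per-question differences. A stdlib-only sketch of the t statistic (in practice `scipy.stats.ttest_rel` gives exact p-values; with n=25, i.e. 24 degrees of freedom, |t| > ~2.06 corresponds to two-sided p < 0.05):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-question correctness scores.

    Positive t means scores_b (e.g. Phase 6) beats scores_a (e.g. Phase 1-5).
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        # All per-question differences identical: infinitely strong signal,
        # or no difference at all.
        return float("inf") if mean(diffs) != 0 else 0.0
    return mean(diffs) / (sd / math.sqrt(n))
```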
### Secondary Success Metrics
1. **Debate Actually Helps**
```
Phase 1-5 Correctness > Baseline Correctness
(If not, debate is wasted compute)
```
2. **Semantic Tension > Heuristics**
```
Phase 6 Full Correctness > Phase 1-5 Correctness
(The main Phase 6 innovation)
```
3. **Pre-Flight Has Value**
```
Phase 6 Full round count < Phase 6 -PreFlight round count
(Does pre-flight reduce wasted debate cycles?)
```
### Red Flags (What Could Go Wrong)
**RED FLAG 1: High Gamma, Low Correctness**
```
if mean(gamma_score) > 0.8 and mean(correctness) < 0.6:
ALERT: "System is overconfident in wrong answers"
Risk: False consensus masking errors
Action: Reduce gamma weight or add correctness feedback
```
**RED FLAG 2: Adapter Convergence > 0.85**
```
if mean(adapter_convergence) > 0.85:
ALERT: "Semantic monoculture detected"
Risk: Loss of perspective diversity
Action: Check whether the specialization tracker is working, or whether the adapters are optimizing the same objective
```
**RED FLAG 3: Calibration Divergence**
```
if corr(confidence, correctness) < 0.3:
ALERT: "System can't tell when it's right or wrong"
Risk: Inability to know when to ask for help
Action: Need external ground truth signal feeding back
```
**RED FLAG 4: No Improvement Over Baseline**
```
if Phase_6_Full_Correctness <= Baseline_Correctness:
ALERT: "Phase 6 made things worse or did nothing"
Risk: Added complexity with no benefit
Action: Revert to simpler system OR debug where complexity fails
```
---
## Evaluation Sprint Timeline
### Week 1: Setup
- [ ] Finalize 25 questions with ground truth answers/rubrics
- [ ] Implement baseline (plain Llama) runner
- [ ] Implement Phase 1-5 runner (disable Phase 6 components)
- [ ] Test harness on 5 questions (smoke test)
### Week 2: Execution
- [ ] Run 25 questions × 4 conditions = 100 runs
- [ ] Log all metadata (conflicts, coherence, specialization, etc.)
- [ ] Monitor for runtime errors or hangs
- [ ] Save intermediate results
### Week 3: Analysis
- [ ] Compute summary statistics (mean, std deviation)
- [ ] Check for Red Flag patterns
- [ ] Compute statistical significance (t-tests)
- [ ] Ablation analysis (value of each Phase 6 component)
### Week 4: Decisions
- **If results strong**: Launch Phase 6 to production
- **If results mixed**: Refine Phase 6 (tune weights, debug), retest
- **If results weak**: Either go back to Phase 1-5 OR pivot to Phase 7 (adaptive objective function)
---
## Expected Outcomes & Decisions
### Scenario A: Phase 6 Wins Decisively
```
Phase_1_5_Correctness: 68% ± 4%
Phase_6_Full_Correctness: 76% ± 3%
Improvement: +8% (p < 0.05, statistically significant)
Conclusion: Ship Phase 6
Next Step: Phase 7 research
```
### Scenario B: Phase 6 Wins But Weakly
```
Phase_1_5_Correctness: 68% ± 6%
Phase_6_Full_Correctness: 71% ± 5%
Improvement: +3% (p > 0.1, not significant)
Conclusion: Keep Phase 6, investigate bottlenecks
Next Step: Profile where Phase 6 fails, tune weights
```
### Scenario C: Phase 6 Breaks System
```
Phase_1_5_Correctness: 68% ± 4%
Phase_6_Full_Correctness: 61% ± 8%
Improvement: -7% (p < 0.05, significantly WORSE)
Conclusion: Phase 6 breaks something
Next Step: Debug (most likely: semantic tension too aggressive, killing useful conflicts)
```
### Scenario D: Evaluation Reveals False Consensus
```
Phase_6_Full correctness: 72%
Phase_6_Full gamma: 0.85 (high coherence reported)
Correlation(gamma, correctness): 0.15 (very weak)
Conclusion: System is gaming the coherence metric
Next Step: Need external ground truth feedback to Γ formula
```
---
## Code Structure
**Files Created**:
- `evaluation/test_suite_evaluation.py` — Test set + evaluation harness
- `evaluation/run_evaluation_sprint.py` — Runner script
- `evaluation/evaluation_results.json` — Output (raw results)
- `evaluation/evaluation_report.txt` — Output (human-readable)
**Usage**:
```bash
# Quick test (5 questions)
python evaluation/run_evaluation_sprint.py --questions 5
# Full evaluation (25 questions) - takes ~2-3 hours
python evaluation/run_evaluation_sprint.py --questions 25
# Custom output
python evaluation/run_evaluation_sprint.py --questions 15 \
--output-json my_results.json \
--output-report my_report.txt
```
---
## Key Insight
**This evaluation is not about proving elegance.**
It's about answering:
- "Does semantic tension actually improve reasoning?"
- "Does pre-flight prediction reduce wasted debate?"
- "Is the system gaming the coherence metric?"
- "When Phase 6 fails, why?"
These answers will inform **Phase 7 research** on adaptive objective functions.
If Phase 6 passes cleanly, we ship it.
If Phase 6 shows emergent pathologies, we learn what to fix.
If Phase 6 doesn't help, we avoid the sunk cost of shipping something that doesn't work.
This is how research systems mature: **measure ruthlessly**.
---
## Next Action
Ready to run the evaluation sprint?
```bash
cd J:\codette-training-lab
python evaluation/run_evaluation_sprint.py --questions 5 # Quick smoke test
```
This will take ~15 minutes and give us the first signal:
- Does the evaluator work?
- Do we see expected patterns?
- Are there implementation bugs?
Then scale to 25 questions for full decision-making power.