EVALUATION STRATEGY: Phase 6 Validation Framework
Status: Evaluation Sprint Framework Complete
Created: 2026-03-19
Purpose: Answer whether Phase 6 is actually better, not just more complex
The Core Question
We have built something elegant. But:
Q: Is Codette + Phase 6 measurably better than baseline?
Not:
- Does it produce longer responses?
- Does it maintain higher coherence?
- Does it satisfy the mathematical framework?
Yes:
- Does it get more questions right?
- Do debates actually improve reasoning?
- Does the system trust the wrong answers? (false consensus)
- Does each Phase 6 component add value?
Test Design: 4 Conditions × 25 Questions
Conditions (What We're Comparing)
Condition 1: BASELINE LLAMA
- Plain Llama-3.1-8B, no routing, no debate
- Baseline: What does the model do unaided?
- Cost: ~5 seconds per question
Condition 2: PHASE 1-5 (Debate System)
- Multi-round debate with conflict detection
- Memory weighting for adapter selection
- NO semantic tension (use heuristic opposition)
- NO specialization tracking
- NO preflight prediction
- Cost: ~30 seconds per question
Condition 3: PHASE 6 FULL (Semantic + All)
- Everything Phase 1-5 has PLUS:
* Semantic tension engine (Llama embeddings)
* Specialization tracking
* Pre-flight conflict prediction
- Cost: ~40 seconds per question
Condition 4: PHASE 6 -PREFLIGHT (Isolate Pre-Flight Value)
- Phase 6 full EXCEPT: disable preflight prediction
- Measures: Does pre-flight actually help?
- Cost: ~35 seconds per question
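The four conditions can be expressed as feature toggles on the same harness; the flag names below are illustrative, not the actual runner's API:

```python
# Hypothetical condition configs: each flag toggles one system component.
# Condition 4 is Condition 3 with only preflight disabled, isolating its value.
CONDITIONS = {
    "baseline": {"debate": False, "semantic_tension": False,
                 "specialization": False, "preflight": False},
    "phase_1_5": {"debate": True, "semantic_tension": False,
                  "specialization": False, "preflight": False},
    "phase_6_full": {"debate": True, "semantic_tension": True,
                     "specialization": True, "preflight": True},
    "phase_6_no_preflight": {"debate": True, "semantic_tension": True,
                             "specialization": True, "preflight": False},
}
```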
Questions (What We're Testing)
25 questions spanning 6 domains:
| Domain | Easy | Medium | Hard | Topics |
|---|---|---|---|---|
| Physics | 2 | 1 | 1 | Light, scattering, entropy |
| Ethics | 0 | 2 | 2 | Honesty, AI transparency, morality |
| Consciousness | 0 | 1 | 2 | Machine consciousness, mind-body |
| Creativity | 0 | 2 | 1 | Definition, AI creativity |
| Systems | 0 | 2 | 2 | Emergence, balance, feedback |
| Interdisciplinary | 0 | 0 | 3 | Free will, knowledge, time |
Key Properties of Questions:
- Ground truth varies (factual, rubric-based, multi-framework)
- Mix of objective (physics) and philosophical (consciousness)
- Different domains require different types of adaptation
- Difficulty scales: easy (1 perspective) → hard (5+ perspectives)
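A question record with these properties might look like the following sketch (field names are assumptions, not the real test-suite schema):

```python
# Illustrative question record combining the properties above:
# domain, difficulty tier, and a ground-truth type that selects the grading mode.
question = {
    "id": "phys-01",
    "domain": "physics",
    "difficulty": "easy",            # easy | medium | hard
    "text": "Why is the sky blue?",
    "truth_type": "factual",         # factual | rubric | multi_perspective
    "ground_truth": "Rayleigh scattering preferentially scatters shorter wavelengths.",
}
```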
Measurement: 5 Metrics Per Question
1. Correctness Score (0-1)
What: Does the final synthesis give the right answer?
How to measure:
- Factual questions (physics): Binary or near-binary (right/wrong)
- Rubric questions (ethics): 0 = missed key framework, 0.5 = partial, 1 = complete
- Multi-perspective (consciousness): % of expected perspectives identified
- Human evaluation needed for final calibration
Expected Pattern:
Baseline: 0.55 ± 0.20 (gets some questions right by luck)
Phase 1-5: 0.65 ± 0.18 (debate helps with reasoning)
Phase 6 Full: 0.72 ± 0.16 (semantic tension picks winners better)
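The three grading modes above could be dispatched like this (a sketch; the `result` keys are hypothetical and would come from human or rubric grading):

```python
def correctness_score(truth_type, result):
    """Score a final synthesis 0-1 according to its question's grading mode.

    `result` is a dict whose keys vary by truth_type (illustrative only).
    """
    if truth_type == "factual":
        # Physics-style: binary right/wrong
        return 1.0 if result["correct"] else 0.0
    if truth_type == "rubric":
        # Ethics-style: 0 = missed key framework, 0.5 = partial, 1 = complete
        return result["rubric_level"]
    if truth_type == "multi_perspective":
        # Consciousness-style: fraction of expected perspectives identified
        return len(result["found"]) / len(result["expected"])
    raise ValueError(f"unknown truth_type: {truth_type}")
```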
2. Reasoning Depth (1-5 scale)
What: How many distinct perspectives did the system identify?
How to measure:
- Count unique agent positions in debate
- 1 = single perspective, 5 = 5+ integrated views
- Correlation with correctness (not all disagreement is useful)
Expected Pattern:
Baseline: 1.0 (single output)
Phase 1-5: 2.8 ± 1.2 (debate creates disagreement)
Phase 6 Full: 3.2 ± 1.1 (semantic tension balances high-value conflicts)
3. Calibration Error (0-1, lower=better)
What: |reported_confidence - actual_correctness|
Does Codette report high confidence only when it is actually right?
How to measure:
- Extract coherence_score from metadata
- Compare to actual correctness_score
- 0 = perfectly calibrated, 1 = maximally miscalibrated
Red Flag Pattern (False Consensus):
High calibration error + High coherence = System is confident in wrong answer
Example:
Gamma = 0.85 (system thinks it's done well)
Actual correctness = 0.3 (it got it very wrong)
Calibration error = 0.55 (WARNING: MISCALIBRATION)
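The metric and the worked example above reduce to a one-liner:

```python
def calibration_error(reported_confidence, actual_correctness):
    """|reported_confidence - actual_correctness|, both in [0, 1].

    0 = perfectly calibrated, 1 = maximally miscalibrated.
    """
    return abs(reported_confidence - actual_correctness)

# The miscalibration example above: gamma 0.85 vs. correctness 0.3
# yields an error of about 0.55.
```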
4. Adapter Convergence (0-1, lower=better)
What: Are all adapters giving similar outputs? (Monoculture risk)
How to measure:
- Semantic similarity between adapter outputs
- 0 = all completely different, 1 = all identical
- Danger zone: >0.85 indicates semantic collapse
Expected Pattern:
Baseline: 1.0 (only one adapter, by definition)
Phase 1-5: 0.65 ± 0.18 (diverse outputs through debate)
Phase 6 Full: 0.58 ± 0.16 (specialization prevents convergence)
Phase 6 -PF: 0.62 ± 0.17 (similar, preflight has small impact on diversity)
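A minimal convergence estimate, assuming each adapter output has already been embedded as a vector (e.g. via the Llama embeddings the tension engine uses):

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def adapter_convergence(embeddings):
    """Mean pairwise cosine similarity across adapter output embeddings.

    1.0 = all outputs identical (monoculture risk); values near 0 = diverse.
    Note cosine can go negative, so the floor is -1 rather than 0.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Flag the >0.85 danger zone on the output of `adapter_convergence`, per Red Flag 2.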
5. Debate Efficiency (1-3 round count)
What: How many rounds until the system converges?
How to measure:
- Count rounds until resolution_rate > 80%
- Lower = more efficient (less compute wasted resolving noise)
- Phase 1-5 baseline for comparison
Expected Pattern:
Phase 1-5: 2.1 ± 0.8 rounds (typically needs 2 rounds)
Phase 6 Full: 1.8 ± 0.7 rounds (pre-flight reduces setup conflicts)
Phase 6 -PF: 2.0 ± 0.8 rounds (without preflight, more setup conflicts)
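Counting rounds until the 80% resolution threshold could look like this sketch (the `resolution_rates` input is an assumption about what the debate loop logs per round):

```python
def rounds_to_convergence(resolution_rates, threshold=0.8, max_rounds=3):
    """Count debate rounds until resolution_rate exceeds the threshold.

    `resolution_rates` is the per-round resolution rate in round order;
    returns max_rounds if the threshold is never reached.
    """
    for round_number, rate in enumerate(resolution_rates, start=1):
        if rate > threshold:
            return round_number
    return max_rounds
```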
Analysis: What We're Looking For
Primary Success Metric
Phase 6 Correctness > Phase 1-5 Correctness (with statistical significance)
Phase 1-5: 70% mean correctness
Phase 6 Full: 78% mean correctness
Improvement: +8 percentage points
Significance: If the std deviation is under ~3%, the improvement is almost certainly real
If it exceeds ~10%, the improvement may be noise — confirm with a t-test before deciding
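One way to firm up that judgment, assuming per-question correctness scores are available for each condition, is Welch's t statistic (a sketch; for an exact p-value use `scipy.stats.ttest_ind(a, b, equal_var=False)`):

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances.

    Rough rule of thumb: |t| > ~2 corresponds to p < 0.05 at sample sizes
    around n=25; compute an exact p-value with scipy for the final call.
    """
    va, vb = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error
```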
Secondary Success Metrics
Debate Actually Helps
Phase 1-5 Correctness > Baseline Correctness (if not, debate is wasted compute)
Semantic Tension > Heuristics
Phase 6 Full Correctness > Phase 1-5 Correctness (the main Phase 6 innovation)
Pre-Flight Has Value
Phase 6 Full debate rounds < Phase 6 -PreFlight debate rounds (does pre-flight reduce wasted debate cycles?)
Red Flags (What Could Go Wrong)
RED FLAG 1: High Gamma, Low Correctness
if mean(gamma_score) > 0.8 and mean(correctness) < 0.6:
ALERT: "System is overconfident in wrong answers"
Risk: False consensus masking errors
Action: Reduce gamma weight or add correctness feedback
RED FLAG 2: Adapter Convergence > 0.85
if mean(adapter_convergence) > 0.85:
ALERT: "Semantic monoculture detected"
Risk: Loss of perspective diversity
Action: Specialization tracker not working OR adapters optimizing same objective
RED FLAG 3: Calibration Divergence
if corr(confidence, correctness) < 0.3:
ALERT: "System can't tell when it's right or wrong"
Risk: Inability to know when to ask for help
Action: Need external ground truth signal feeding back
RED FLAG 4: No Improvement Over Baseline
if Phase_6_Full_Correctness <= Baseline_Correctness:
ALERT: "Phase 6 made things worse or did nothing"
Risk: Added complexity with no benefit
Action: Revert to simpler system OR debug where complexity fails
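The four red-flag checks can be collected into one function; the thresholds come directly from the text, while everything else (argument names, alert strings) is illustrative:

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Sample Pearson correlation (assumes non-constant, aligned inputs)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def red_flags(gammas, correctness, convergence, confidences, baseline_correctness):
    """Check the four red-flag patterns; all arguments are per-question lists.

    `confidences` and `correctness` must be aligned question-by-question.
    Returns the list of triggered alerts.
    """
    alerts = []
    if mean(gammas) > 0.8 and mean(correctness) < 0.6:
        alerts.append("overconfident in wrong answers")   # RED FLAG 1
    if mean(convergence) > 0.85:
        alerts.append("semantic monoculture detected")    # RED FLAG 2
    if pearson(confidences, correctness) < 0.3:
        alerts.append("cannot tell right from wrong")     # RED FLAG 3
    if mean(correctness) <= mean(baseline_correctness):
        alerts.append("no improvement over baseline")     # RED FLAG 4
    return alerts
```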
Evaluation Sprint Timeline
Week 1: Setup
- Finalize 25 questions with ground truth answers/rubrics
- Implement baseline (plain Llama) runner
- Implement Phase 1-5 runner (disable Phase 6 components)
- Test harness on 5 questions (smoke test)
Week 2: Execution
- Run 25 × 4 conditions = 100 full debates
- Log all metadata (conflicts, coherence, specialization, etc.)
- Monitor for runtime errors or hangs
- Save intermediate results
Week 3: Analysis
- Compute summary statistics (mean, std deviation)
- Check for Red Flag patterns
- Compute statistical significance (t-tests)
- Ablation analysis (value of each Phase 6 component)
Week 4: Decisions
- If results strong: Launch Phase 6 to production
- If results mixed: Refine Phase 6 (tune weights, debug), retest
- If results weak: Either go back to Phase 1-5 OR pivot to Phase 7 (adaptive objective function)
Expected Outcomes & Decisions
Scenario A: Phase 6 Wins Decisively
Phase_1_5_Correctness: 68% ± 4%
Phase_6_Full_Correctness: 76% ± 3%
Improvement: +8% (p < 0.05, statistically significant)
Conclusion: Ship Phase 6
Next Step: Phase 7 research
Scenario B: Phase 6 Wins But Weakly
Phase_1_5_Correctness: 68% ± 6%
Phase_6_Full_Correctness: 71% ± 5%
Improvement: +3% (p > 0.1, not significant)
Conclusion: Keep Phase 6, investigate bottlenecks
Next Step: Profile where Phase 6 fails, tune weights
Scenario C: Phase 6 Breaks System
Phase_1_5_Correctness: 68% ± 4%
Phase_6_Full_Correctness: 61% ± 8%
Improvement: -7% (p < 0.05, significantly WORSE)
Conclusion: Phase 6 breaks something
Next Step: Debug (most likely: semantic tension too aggressive, killing useful conflicts)
Scenario D: Evaluation Reveals False Consensus
Phase_6_Full correctness: 72%
Phase_6_Full gamma: 0.85 (high coherence reported)
Correlation(gamma, correctness): 0.15 (very weak)
Conclusion: System gamified coherence metric
Next Step: Need external ground truth feedback to Γ formula
Code Structure
Files Created:
- evaluation/test_suite_evaluation.py — Test set + evaluation harness
- evaluation/run_evaluation_sprint.py — Runner script
- evaluation/evaluation_results.json — Output (raw results)
- evaluation/evaluation_report.txt — Output (human-readable)
Usage:
# Quick test (5 questions)
python evaluation/run_evaluation_sprint.py --questions 5
# Full evaluation (25 questions) - takes ~2-3 hours
python evaluation/run_evaluation_sprint.py --questions 25
# Custom output
python evaluation/run_evaluation_sprint.py --questions 15 \
--output-json my_results.json \
--output-report my_report.txt
Key Insight
This evaluation is not about proving elegance.
It's about answering:
- "Does semantic tension actually improve reasoning?"
- "Does pre-flight prediction reduce wasted debate?"
- "Is the system gaming the coherence metric?"
- "When Phase 6 fails, why?"
These answers will inform Phase 7 research on adaptive objective functions.
If Phase 6 passes cleanly, we ship it. If Phase 6 shows emergent pathologies, we learn what to fix. If Phase 6 doesn't help, we avoid the sunk cost of shipping something that doesn't work.
This is how research systems mature: measure ruthlessly.
Next Action
Ready to run the evaluation sprint?
cd J:\codette-training-lab
python evaluation/run_evaluation_sprint.py --questions 5 # Quick smoke test
This will take ~15 minutes and give us the first signal:
- Does the evaluator work?
- Do we see expected patterns?
- Are there implementation bugs?
Then scale to 25 questions for full decision-making power.