# EVALUATION STRATEGY: Phase 6 Validation Framework

**Status**: Evaluation Sprint Framework Complete
**Created**: 2026-03-19
**Purpose**: Answer whether Phase 6 is actually better, not just more complex

---
## The Core Question

We have built something elegant. But:

**Q: Is Codette + Phase 6 measurably better than baseline?**

Not:
- Does it produce longer responses?
- Does it maintain higher coherence?
- Does it satisfy the mathematical framework?

Yes:
- **Does it get more questions right?**
- **Do debates actually improve reasoning?**
- **Does the system trust the wrong answers?** (false consensus)
- **Does each Phase 6 component add value?**

---
## Test Design: 4 Conditions × 25 Questions

### Conditions (What We're Comparing)

```
Condition 1: BASELINE LLAMA
- Plain Llama-3.1-8B, no routing, no debate
- Baseline: What does the unmodified model do?
- Cost: ~5 seconds per question

Condition 2: PHASE 1-5 (Debate System)
- Multi-round debate with conflict detection
- Memory weighting for adapter selection
- NO semantic tension (uses heuristic opposition instead)
- NO specialization tracking
- NO pre-flight prediction
- Cost: ~30 seconds per question

Condition 3: PHASE 6 FULL (Semantic + All)
- Everything Phase 1-5 has PLUS:
  * Semantic tension engine (Llama embeddings)
  * Specialization tracking
  * Pre-flight conflict prediction
- Cost: ~40 seconds per question

Condition 4: PHASE 6 -PREFLIGHT (Isolate Pre-Flight Value)
- Phase 6 full EXCEPT: pre-flight prediction disabled
- Measures: Does pre-flight actually help?
- Cost: ~35 seconds per question
```
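Because the four arms differ only in which components are enabled, the condition matrix can be written down as data. A minimal sketch (the `Condition` class and its flag names are illustrative assumptions, not the harness's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One evaluation arm; the booleans toggle Phase 6 components."""
    name: str
    use_debate: bool
    use_semantic_tension: bool
    use_specialization: bool
    use_preflight: bool

# Flag table for the four arms described above.
CONDITIONS = [
    Condition("baseline_llama", False, False, False, False),
    Condition("phase_1_5", True, False, False, False),
    Condition("phase_6_full", True, True, True, True),
    Condition("phase_6_no_preflight", True, True, True, False),
]
```

Encoding the arms this way makes the ablation explicit: condition 4 is condition 3 with exactly one flag flipped.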

### Questions (What We're Testing)

**25 questions spanning 6 domains:**

| Domain | Easy | Medium | Hard | Topics |
|--------|------|--------|------|--------|
| Physics | 2 | 1 | 1 | Light, scattering, entropy |
| Ethics | 0 | 2 | 2 | Honesty, AI transparency, morality |
| Consciousness | 0 | 1 | 2 | Machine consciousness, mind-body |
| Creativity | 0 | 2 | 1 | Definition, AI creativity |
| Systems | 0 | 2 | 2 | Emergence, balance, feedback |
| Interdisciplinary | 0 | 0 | 3 | Free will, knowledge, time |
**Key Properties of Questions:**
- Ground truth varies (factual, rubric-based, multi-framework)
- Mix of objective (physics) and philosophical (consciousness)
- Different questions require different types of adaptation
- Difficulty scales: easy (1 perspective) → hard (5+ perspectives)
---

## Measurement: 5 Metrics Per Question

### 1. **Correctness Score** (0-1)
**What**: Does the final synthesis give the right answer?

**How to measure**:
- Factual questions (physics): Binary or near-binary (right/wrong)
- Rubric questions (ethics): 0 = missed key framework, 0.5 = partial, 1 = complete
- Multi-perspective (consciousness): % of expected perspectives identified
- Human evaluation needed for final calibration

**Expected Pattern**:
```
Baseline:     0.55 ± 0.20 (gets some questions right by luck)
Phase 1-5:    0.65 ± 0.18 (debate helps with reasoning)
Phase 6 Full: 0.72 ± 0.16 (semantic tension picks winners better)
```
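For the multi-perspective rubric, one crude but reproducible pre-calibration baseline is keyword coverage: what fraction of the expected perspectives appear in the synthesis. A sketch (the function name and plain substring matching are assumptions; real scoring would use embeddings or human rubrics):

```python
def perspective_coverage(synthesis: str, expected: list[str]) -> float:
    """Fraction of expected perspective keywords present in the synthesis."""
    text = synthesis.lower()
    hits = sum(1 for p in expected if p.lower() in text)
    return hits / len(expected) if expected else 0.0
```

Substring matching will miss paraphrases, which is exactly why the plan calls for human evaluation to calibrate these scores.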

### 2. **Reasoning Depth** (1-5 scale)
**What**: How many distinct perspectives did the system identify?

**How to measure**:
- Count unique agent positions in debate
- 1 = single perspective, 5 = 5+ integrated views
- Also track correlation with correctness (not all disagreement is useful)

**Expected Pattern**:
```
Baseline:     1.0 (single output)
Phase 1-5:    2.8 ± 1.2 (debate creates disagreement)
Phase 6 Full: 3.2 ± 1.1 (semantic tension balances high-value conflicts)
```
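The unique-position count can be sketched directly; exact-string deduplication here is a stand-in (the real pipeline would deduplicate semantically), and the function name is illustrative:

```python
def reasoning_depth(positions: list[str], cap: int = 5) -> int:
    """Count unique agent positions, clamped to the 1-5 scale."""
    unique = {p.strip().lower() for p in positions if p.strip()}
    return max(1, min(cap, len(unique)))
```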

### 3. **Calibration Error** (0-1, lower = better)
**What**: |reported_confidence - actual_correctness|

Does Codette say "I'm confident" when it should?

**How to measure**:
- Extract coherence_score from metadata
- Compare to actual correctness_score
- 0 = perfectly calibrated, 1 = maximally miscalibrated

**Red Flag Pattern** (False Consensus):
```
High calibration error + high coherence = system is confident in a wrong answer
Example:
  Gamma = 0.85 (system thinks it's done well)
  Actual correctness = 0.3 (it got it very wrong)
  Calibration error = 0.55 (WARNING: MISCALIBRATION)
```
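The metric itself is a one-liner, aggregated over questions; a sketch matching the worked example above (function names are illustrative):

```python
def calibration_error(confidence: float, correctness: float) -> float:
    """Absolute gap between reported confidence and measured correctness."""
    return abs(confidence - correctness)

def mean_calibration_error(pairs) -> float:
    """Average |confidence - correctness| over (confidence, correctness) pairs."""
    return sum(abs(c, ) if False else abs(c - k) for c, k in pairs) / len(pairs)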

### 4. **Adapter Convergence** (0-1, lower = better)
**What**: Are all adapters giving similar outputs? (Monoculture risk)

**How to measure**:
- Semantic similarity between adapter outputs
- 0 = all completely different, 1 = all identical
- Danger zone: >0.85 indicates semantic collapse

**Expected Pattern**:
```
Baseline:     1.0 (only one adapter, by definition)
Phase 1-5:    0.65 ± 0.18 (diverse outputs through debate)
Phase 6 Full: 0.58 ± 0.16 (specialization prevents convergence)
Phase 6 -PF:  0.62 ± 0.17 (similar; pre-flight has little impact on diversity)
```
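One concrete realization of "semantic similarity between adapter outputs" is mean pairwise cosine similarity of their output embeddings. A dependency-free sketch (the 0-1 range assumes roughly non-negative embedding similarities; names are illustrative):

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def adapter_convergence(embeddings):
    """Mean pairwise cosine similarity of adapter output embeddings.
    1.0 = identical directions (monoculture); lower = more diverse."""
    if len(embeddings) < 2:
        return 1.0  # a single adapter is trivially converged, matching the baseline row
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```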

### 5. **Debate Efficiency** (round count, 1-3; lower = better)
**What**: How many rounds until the system converges?

**How to measure**:
- Count rounds until resolution_rate > 80%
- Lower = more efficient (less compute wasted resolving noise)
- Use Phase 1-5 as the baseline for comparison

**Expected Pattern**:
```
Phase 1-5:    2.1 ± 0.8 rounds (typically needs 2 rounds)
Phase 6 Full: 1.8 ± 0.7 rounds (pre-flight reduces setup conflicts)
Phase 6 -PF:  2.0 ± 0.8 rounds (without pre-flight, more setup conflicts)
```
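The round count can be made explicit. A sketch, assuming the harness logs an in-order list of per-round resolution rates (the function name and signature are assumptions):

```python
def rounds_to_convergence(resolution_rates, threshold=0.80, max_rounds=3):
    """Rounds elapsed until resolution_rate first exceeds the threshold."""
    for i, rate in enumerate(resolution_rates, start=1):
        if rate > threshold:
            return i
    return max_rounds  # never converged: charge the full round budget
```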

---

## Analysis: What We're Looking For

### Primary Success Metric

**Phase 6 Correctness > Phase 1-5 Correctness** (with statistical significance)

```
Phase 1-5:    70% mean correctness
Phase 6 Full: 78% mean correctness
Improvement:  +8 percentage points

Significance: If std deviation < 3%, improvement is real
              If std deviation > 10%, improvement might be noise
```
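The Week 3 t-tests can use Welch's unequal-variance form, since the conditions need not have equal score spreads. A dependency-free sketch that returns the statistic and degrees of freedom (`scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test with a p-value; p-values are omitted here):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples,
    e.g. per-question correctness under two conditions."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```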

### Secondary Success Metrics

1. **Debate Actually Helps**
   ```
   Phase 1-5 Correctness > Baseline Correctness
   (If not, debate is wasted compute)
   ```

2. **Semantic Tension > Heuristics**
   ```
   Phase 6 Full Correctness > Phase 1-5 Correctness
   (The main Phase 6 innovation)
   ```

3. **Pre-Flight Has Value**
   ```
   Phase 6 Full rounds < Phase 6 -PreFlight rounds
   (Does pre-flight reduce wasted debate cycles? Fewer rounds = more efficient)
   ```

### Red Flags (What Could Go Wrong)

**RED FLAG 1: High Gamma, Low Correctness**
```
if mean(gamma_score) > 0.8 and mean(correctness) < 0.6:
    ALERT: "System is overconfident in wrong answers"
    Risk: False consensus masking errors
    Action: Reduce gamma weight or add correctness feedback
```

**RED FLAG 2: Adapter Convergence > 0.85**
```
if mean(adapter_convergence) > 0.85:
    ALERT: "Semantic monoculture detected"
    Risk: Loss of perspective diversity
    Action: Check whether the specialization tracker is working, or whether
            the adapters are all optimizing the same objective
```

**RED FLAG 3: Calibration Divergence**
```
if corr(confidence, correctness) < 0.3:
    ALERT: "System can't tell when it's right or wrong"
    Risk: Inability to know when to ask for help
    Action: Need external ground truth signal feeding back
```

**RED FLAG 4: No Improvement Over Baseline**
```
if Phase_6_Full_Correctness <= Baseline_Correctness:
    ALERT: "Phase 6 made things worse or did nothing"
    Risk: Added complexity with no benefit
    Action: Revert to simpler system OR debug where complexity fails
```

---

## Evaluation Sprint Timeline

### Week 1: Setup
- [ ] Finalize 25 questions with ground truth answers/rubrics
- [ ] Implement baseline (plain Llama) runner
- [ ] Implement Phase 1-5 runner (disable Phase 6 components)
- [ ] Test harness on 5 questions (smoke test)

### Week 2: Execution
- [ ] Run 25 questions × 4 conditions = 100 runs
- [ ] Log all metadata (conflicts, coherence, specialization, etc.)
- [ ] Monitor for runtime errors or hangs
- [ ] Save intermediate results

### Week 3: Analysis
- [ ] Compute summary statistics (mean, std deviation)
- [ ] Check for Red Flag patterns
- [ ] Compute statistical significance (t-tests)
- [ ] Ablation analysis (value of each Phase 6 component)

### Week 4: Decisions
- **If results strong**: Launch Phase 6 to production
- **If results mixed**: Refine Phase 6 (tune weights, debug), retest
- **If results weak**: Either go back to Phase 1-5 OR pivot to Phase 7 (adaptive objective function)

---
## Expected Outcomes & Decisions

### Scenario A: Phase 6 Wins Decisively
```
Phase_1_5_Correctness:    68% ± 4%
Phase_6_Full_Correctness: 76% ± 3%
Improvement: +8 points (p < 0.05, statistically significant)
Conclusion: Ship Phase 6
Next Step: Phase 7 research
```

### Scenario B: Phase 6 Wins But Weakly
```
Phase_1_5_Correctness:    68% ± 6%
Phase_6_Full_Correctness: 71% ± 5%
Improvement: +3 points (p > 0.1, not significant)
Conclusion: Keep Phase 6, investigate bottlenecks
Next Step: Profile where Phase 6 fails, tune weights
```

### Scenario C: Phase 6 Breaks System
```
Phase_1_5_Correctness:    68% ± 4%
Phase_6_Full_Correctness: 61% ± 8%
Improvement: -7 points (p < 0.05, significantly WORSE)
Conclusion: Phase 6 breaks something
Next Step: Debug (most likely: semantic tension too aggressive, killing useful conflicts)
```

### Scenario D: Evaluation Reveals False Consensus
```
Phase_6_Full correctness: 72%
Phase_6_Full gamma: 0.85 (high coherence reported)
Correlation(gamma, correctness): 0.15 (very weak)
Conclusion: System is gaming the coherence metric
Next Step: Feed external ground truth back into the Γ formula
```

---

## Code Structure

**Files Created**:
- `evaluation/test_suite_evaluation.py` — Test set + evaluation harness
- `evaluation/run_evaluation_sprint.py` — Runner script
- `evaluation/evaluation_results.json` — Output (raw results)
- `evaluation/evaluation_report.txt` — Output (human-readable)

**Usage**:
```bash
# Quick test (5 questions)
python evaluation/run_evaluation_sprint.py --questions 5

# Full evaluation (25 questions) - takes ~2-3 hours
python evaluation/run_evaluation_sprint.py --questions 25

# Custom output
python evaluation/run_evaluation_sprint.py --questions 15 \
    --output-json my_results.json \
    --output-report my_report.txt
```

---

## Key Insight

**This evaluation is not about proving elegance.**

It's about answering:

- "Does semantic tension actually improve reasoning?"
- "Does pre-flight prediction reduce wasted debate?"
- "Is the system gaming the coherence metric?"
- "When Phase 6 fails, why?"

These answers will inform **Phase 7 research** on adaptive objective functions.

If Phase 6 passes cleanly, we ship it.
If Phase 6 shows emergent pathologies, we learn what to fix.
If Phase 6 doesn't help, we avoid the sunk cost of shipping something that doesn't work.

This is how research systems mature: **measure ruthlessly**.

---

## Next Action

Ready to run the evaluation sprint?

```bash
cd J:\codette-training-lab
python evaluation/run_evaluation_sprint.py --questions 5  # Quick smoke test
```

This will take ~15 minutes and give us the first signal:
- Does the evaluator work?
- Do we see expected patterns?
- Are there implementation bugs?

Then scale to 25 questions for full decision-making power.
|