Codette-Reasoning / EVALUATION_STRATEGY.md

EVALUATION STRATEGY: Phase 6 Validation Framework

Status: Evaluation Sprint Framework Complete
Created: 2026-03-19
Purpose: Answer whether Phase 6 is actually better, not just more complex


The Core Question

We have built something elegant. But:

Q: Is Codette + Phase 6 measurably better than baseline?

Not:

  • Does it produce longer responses?
  • Does it maintain higher coherence?
  • Does it satisfy the mathematical framework?

Yes:

  • Does it get more questions right?
  • Do debates actually improve reasoning?
  • Does the system trust the wrong answers? (false consensus)
  • Does each Phase 6 component add value?

Test Design: 4 Conditions × 25 Questions

Conditions (What We're Comparing)

Condition 1: BASELINE LLAMA
  - Plain Llama-3.1-8B, no routing, no debate
  - Baseline: What does the raw model do with no scaffolding?
  - Cost: ~5 seconds per question

Condition 2: PHASE 1-5 (Debate System)
  - Multi-round debate with conflict detection
  - Memory weighting for adapter selection
  - NO semantic tension (use heuristic opposition)
  - NO specialization tracking
  - NO preflight prediction
  - Cost: ~30 seconds per question

Condition 3: PHASE 6 FULL (Semantic + All)
  - Everything Phase 1-5 has PLUS:
    * Semantic tension engine (Llama embeddings)
    * Specialization tracking
    * Pre-flight conflict prediction
  - Cost: ~40 seconds per question

Condition 4: PHASE 6 -PREFLIGHT (Isolate Pre-Flight Value)
  - Phase 6 full EXCEPT: disable preflight prediction
  - Measures: Does pre-flight actually help?
  - Cost: ~35 seconds per question
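The four conditions can be expressed as configuration objects. This is a hedged sketch: the field names (`use_debate`, `use_semantic_tension`, and so on) are illustrative, not taken from the actual runner.

```python
from dataclasses import dataclass

# Illustrative condition configs; field names are assumptions, not the
# schema used in evaluation/run_evaluation_sprint.py.
@dataclass(frozen=True)
class Condition:
    name: str
    use_debate: bool            # Phase 1-5 multi-round debate
    use_semantic_tension: bool  # Phase 6 semantic tension engine
    use_specialization: bool    # Phase 6 specialization tracking
    use_preflight: bool         # Phase 6 pre-flight conflict prediction

CONDITIONS = [
    Condition("baseline",      False, False, False, False),
    Condition("phase_1_5",     True,  False, False, False),
    Condition("phase_6_full",  True,  True,  True,  True),
    Condition("phase_6_no_pf", True,  True,  True,  False),
]
```

Condition 4 differs from Condition 3 in exactly one flag, which is what makes the pre-flight ablation interpretable.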

Questions (What We're Testing)

25 questions spanning 6 domains:

| Domain            | Easy | Medium | Hard | Topics                             |
|-------------------|------|--------|------|------------------------------------|
| Physics           | 2    | 1      | 1    | Light, scattering, entropy         |
| Ethics            | 0    | 2      | 2    | Honesty, AI transparency, morality |
| Consciousness     | 0    | 1      | 2    | Machine consciousness, mind-body   |
| Creativity        | 0    | 2      | 1    | Definition, AI creativity          |
| Systems           | 0    | 2      | 2    | Emergence, balance, feedback       |
| Interdisciplinary | 0    | 0      | 3    | Free will, knowledge, time         |

Key Properties of Questions:

  • Ground truth varies (factual, rubric-based, multi-framework)
  • Mix of objective (physics) and philosophical (consciousness)
  • Different questions require different types of adaptation
  • Difficulty scales: easy (1 perspective) → hard (5+ perspectives)
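A question record covering these properties might look like the following sketch. The schema is an assumption for illustration, not the one in `evaluation/test_suite_evaluation.py`.

```python
from dataclasses import dataclass

# Hypothetical question record; field names and values are illustrative.
@dataclass
class Question:
    qid: str
    domain: str              # "physics", "ethics", "consciousness", ...
    difficulty: str          # "easy" | "medium" | "hard"
    prompt: str
    ground_truth_type: str   # "factual" | "rubric" | "multi_perspective"
    expected_perspectives: int = 1  # easy = 1, hard = 5+

q = Question("phys-01", "physics", "easy",
             "Why is the sky blue?", "factual")
```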

Measurement: 5 Metrics Per Question

1. Correctness Score (0-1)

What: Does the final synthesis give the right answer?

How to measure:

  • Factual questions (physics): Binary or near-binary (right/wrong)
  • Rubric questions (ethics): 0 = missed key framework, 0.5 = partial, 1 = complete
  • Multi-perspective (consciousness): % of expected perspectives identified
  • Human evaluation needed for final calibration

Expected Pattern:

Baseline:     0.55 ± 0.20  (gets some questions right, partly by luck)
Phase 1-5:    0.65 ± 0.18  (debate helps with reasoning)
Phase 6 Full: 0.72 ± 0.16  (semantic tension picks winners better)
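The three scoring rules could be sketched as small helpers. Function names and signatures are illustrative; rubric counts would come from human evaluation.

```python
# Hypothetical per-question correctness scorers, one per ground-truth type.

def score_factual(answer_correct: bool) -> float:
    """Binary scoring for physics-style questions."""
    return 1.0 if answer_correct else 0.0

def score_rubric(frameworks_hit: int, frameworks_total: int) -> float:
    """Fraction of key frameworks covered (0 = missed, 0.5 = partial, 1 = complete)."""
    if frameworks_total == 0:
        return 0.0
    return frameworks_hit / frameworks_total

def score_multi_perspective(found: int, expected: int) -> float:
    """Share of expected perspectives identified, capped at 1.0."""
    return min(found, expected) / expected if expected else 0.0
```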

2. Reasoning Depth (1-5 scale)

What: How many distinct perspectives did the system identify?

How to measure:

  • Count unique agent positions in debate
  • 1 = single perspective, 5 = 5+ integrated views
  • Correlation with correctness (not all disagreement is useful)

Expected Pattern:

Baseline:     1.0 (single output)
Phase 1-5:    2.8 ± 1.2 (debate creates disagreement)
Phase 6 Full: 3.2 ± 1.1 (semantic tension balances high-value conflicts)
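One way to count distinct perspectives is greedy clustering over position embeddings: a position counts as new when it is dissimilar to every representative seen so far. The 0.8 threshold and the input shape are assumptions.

```python
import math

def reasoning_depth(embeddings, threshold=0.8):
    """Greedy count of distinct debate positions, capped at 5 (the scale max).
    embeddings: list of position vectors, one per agent statement."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    reps = []  # one representative vector per distinct position
    for e in embeddings:
        if all(cos(e, r) < threshold for r in reps):
            reps.append(e)
    return min(5, max(1, len(reps)))
```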

3. Calibration Error (0-1, lower=better)

What: |reported_confidence - actual_correctness|

Does Codette say "I'm confident" when it should?

How to measure:

  • Extract coherence_score from metadata
  • Compare to actual correctness_score
  • 0 = perfectly calibrated, 1 = maximally miscalibrated

Red Flag Pattern (False Consensus):

High calibration error + High coherence = System is confident in wrong answer
Example:
  Gamma = 0.85 (system thinks it's done well)
  Actual correctness = 0.3 (it got it very wrong)
  Calibration error = 0.55 (WARNING: MISCALIBRATION)
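The calibration error and the false-consensus pattern reduce to two small functions. The `conf_floor` and `corr_ceiling` defaults are illustrative thresholds.

```python
def calibration_error(confidence: float, correctness: float) -> float:
    """Absolute gap between reported confidence and measured correctness."""
    return abs(confidence - correctness)

def false_consensus(confidence: float, correctness: float,
                    conf_floor: float = 0.8, corr_ceiling: float = 0.6) -> bool:
    """High reported confidence paired with low actual correctness."""
    return confidence > conf_floor and correctness < corr_ceiling
```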

4. Adapter Convergence (0-1, lower=better)

What: Are all adapters giving similar outputs? (Monoculture risk)

How to measure:

  • Semantic similarity between adapter outputs
  • 0 = all completely different, 1 = all identical
  • Danger zone: >0.85 indicates semantic collapse

Expected Pattern:

Baseline:     1.0 (only one adapter, by definition)
Phase 1-5:    0.65 ± 0.18 (diverse outputs through debate)
Phase 6 Full: 0.58 ± 0.16 (specialization prevents convergence)
Phase 6 -PF:  0.62 ± 0.17 (similar, preflight has small impact on diversity)
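Adapter convergence as mean pairwise cosine similarity can be computed with the standard library alone. This sketch assumes each adapter output has already been embedded into a vector.

```python
import math

def adapter_convergence(embeddings) -> float:
    """Mean pairwise cosine similarity across adapter output embeddings.
    1.0 = all identical (monoculture risk), lower = more diverse."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    n = len(embeddings)
    if n < 2:
        return 1.0  # a single adapter is trivially "converged"
    sims = [cos(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)
```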

5. Debate Efficiency (1-3 round count)

What: How many rounds until the system converges?

How to measure:

  • Count rounds until resolution_rate > 80%
  • Lower = more efficient (waste less compute resolving noise)
  • Phase 1-5 baseline for comparison

Expected Pattern:

Phase 1-5:    2.1 ± 0.8 rounds (typically needs 2 rounds)
Phase 6 Full: 1.8 ± 0.7 rounds (pre-flight reduces setup conflicts)
Phase 6 -PF:  2.0 ± 0.8 rounds (without preflight, more setup conflicts)
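The convergence rule could be implemented as a simple scan over per-round resolution rates. The list-of-rates input shape is an assumption.

```python
def rounds_to_converge(resolution_rates, target=0.80, max_rounds=3) -> int:
    """Round count until the per-round resolution rate exceeds target.
    resolution_rates: one value per completed round, in order."""
    for i, rate in enumerate(resolution_rates, start=1):
        if rate > target:
            return i
    return max_rounds  # never converged within the round budget
```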

Analysis: What We're Looking For

Primary Success Metric

Phase 6 Correctness > Phase 1-5 Correctness (with statistical significance)

Phase 1-5:        70% mean correctness
Phase 6 Full:     78% mean correctness
Improvement:      +8 percentage points

Significance: If the std deviation across questions is small relative to the
              +8-point gap (e.g. < 3 points), the improvement is likely real.
              If it exceeds 10 points, the gap may be noise; confirm with a
              paired t-test across questions.
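A paired t-test makes this rule of thumb precise, since every question is scored under both conditions and pairing controls for question difficulty. This is a stdlib-only sketch with a small hard-coded table of critical values.

```python
import math
from statistics import mean, stdev

# Two-sided 5% critical values from standard t-tables; df -> critical t.
# Only the entries needed for the sketch are included.
T_CRIT_5PCT = {4: 2.776, 24: 2.064}

def paired_improvement(scores_a, scores_b):
    """Test whether condition B beats condition A on paired per-question scores.
    Returns (mean gain, t statistic, significant-at-5% flag)."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_mean = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)      # std error of the mean difference
    t = d_mean / se if se > 0 else float("inf")
    crit = T_CRIT_5PCT.get(n - 1, 2.0)    # rough fallback for other df
    return d_mean, t, (d_mean > 0 and t > crit)
```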

Secondary Success Metrics

  1. Debate Actually Helps

    Phase 1-5 Correctness > Baseline Correctness
    (If not, debate is waste)
    
  2. Semantic Tension > Heuristics

    Phase 6 Full Correctness > Phase 1-5 Correctness
    (The main Phase 6 innovation)
    
  3. Pre-Flight Has Value

    Phase 6 Full round count < Phase 6 -PreFlight round count
    (Does pre-flight reduce wasted debate cycles?)
    

Red Flags (What Could Go Wrong)

RED FLAG 1: High Gamma, Low Correctness

if mean(gamma_score) > 0.8 and mean(correctness) < 0.6:
    ALERT: "System is overconfident in wrong answers"
    Risk:  False consensus masking errors
    Action: Reduce gamma weight or add correctness feedback

RED FLAG 2: Adapter Convergence > 0.85

if mean(adapter_convergence) > 0.85:
    ALERT: "Semantic monoculture detected"
    Risk:  Loss of perspective diversity
    Action: Specialization tracker not working OR adapters optimizing same objective

RED FLAG 3: Calibration Divergence

if corr(confidence, correctness) < 0.3:
    ALERT: "System can't tell when it's right or wrong"
    Risk:  Inability to know when to ask for help
    Action: Need external ground truth signal feeding back

RED FLAG 4: No Improvement Over Baseline

if Phase_6_Full_Correctness <= Baseline_Correctness:
    ALERT: "Phase 6 made things worse or did nothing"
    Risk:  Added complexity with no benefit
    Action: Revert to simpler system OR debug where complexity fails
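The four red-flag rules can be consolidated into one check. This sketch assumes the summary statistics have already been computed; thresholds match the rules above.

```python
def red_flags(mean_gamma, mean_corr, mean_conv, conf_corr_r, baseline_corr):
    """Return the list of triggered red flags for one evaluation run."""
    flags = []
    if mean_gamma > 0.8 and mean_corr < 0.6:
        flags.append("false_consensus")        # confident in wrong answers
    if mean_conv > 0.85:
        flags.append("semantic_monoculture")   # adapters converging
    if conf_corr_r < 0.3:
        flags.append("calibration_divergence") # can't tell right from wrong
    if mean_corr <= baseline_corr:
        flags.append("no_improvement")         # complexity with no benefit
    return flags
```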

Evaluation Sprint Timeline

Week 1: Setup

  • Finalize 25 questions with ground truth answers/rubrics
  • Implement baseline (plain Llama) runner
  • Implement Phase 1-5 runner (disable Phase 6 components)
  • Test harness on 5 questions (smoke test)

Week 2: Execution

  • Run 25 questions × 4 conditions = 100 evaluation runs
  • Log all metadata (conflicts, coherence, specialization, etc.)
  • Monitor for runtime errors or hangs
  • Save intermediate results

Week 3: Analysis

  • Compute summary statistics (mean, std deviation)
  • Check for Red Flag patterns
  • Compute statistical significance (t-tests)
  • Ablation analysis (value of each Phase 6 component)

Week 4: Decisions

  • If results strong: Launch Phase 6 to production
  • If results mixed: Refine Phase 6 (tune weights, debug), retest
  • If results weak: Either go back to Phase 1-5 OR pivot to Phase 7 (adaptive objective function)

Expected Outcomes & Decisions

Scenario A: Phase 6 Wins Decisively

Phase_1_5_Correctness:    68% ± 4%
Phase_6_Full_Correctness: 76% ± 3%
Improvement:              +8% (p < 0.05, statistically significant)
Conclusion:               Ship Phase 6
Next Step:                Phase 7 research

Scenario B: Phase 6 Wins But Weakly

Phase_1_5_Correctness:    68% ± 6%
Phase_6_Full_Correctness: 71% ± 5%
Improvement:              +3% (p > 0.1, not significant)
Conclusion:               Keep Phase 6, investigate bottlenecks
Next Step:                Profile where Phase 6 fails, tune weights

Scenario C: Phase 6 Breaks System

Phase_1_5_Correctness:    68% ± 4%
Phase_6_Full_Correctness: 61% ± 8%
Improvement:              -7% (p < 0.05, significantly WORSE)
Conclusion:               Phase 6 breaks something
Next Step:                Debug (most likely: semantic tension too aggressive, killing useful conflicts)

Scenario D: Evaluation Reveals False Consensus

Phase_6_Full correctness: 72%
Phase_6_Full gamma:       0.85 (high coherence reported)
Correlation(gamma, correctness): 0.15 (very weak)
Conclusion:               System is gaming the coherence metric
Next Step:                Need external ground truth feedback to Γ formula

Code Structure

Files Created:

  • evaluation/test_suite_evaluation.py — Test set + evaluation harness
  • evaluation/run_evaluation_sprint.py — Runner script
  • evaluation/evaluation_results.json — Output (raw results)
  • evaluation/evaluation_report.txt — Output (human-readable)

Usage:

# Quick test (5 questions)
python evaluation/run_evaluation_sprint.py --questions 5

# Full evaluation (25 questions) - takes ~2-3 hours
python evaluation/run_evaluation_sprint.py --questions 25

# Custom output
python evaluation/run_evaluation_sprint.py --questions 15 \
  --output-json my_results.json \
  --output-report my_report.txt

Key Insight

This evaluation is not about proving elegance.

It's about answering:

  • "Does semantic tension actually improve reasoning?"
  • "Does pre-flight prediction reduce wasted debate?"
  • "Is the system gaming the coherence metric?"
  • "When Phase 6 fails, why?"

These answers will inform Phase 7 research on adaptive objective functions.

If Phase 6 passes cleanly, we ship it. If Phase 6 shows emergent pathologies, we learn what to fix. If Phase 6 doesn't help, we avoid the sunk cost of shipping something that doesn't work.

This is how research systems mature: measure ruthlessly.


Next Action

Ready to run the evaluation sprint?

cd J:\codette-training-lab
python evaluation/run_evaluation_sprint.py --questions 5  # Quick smoke test

This will take ~15 minutes and give us the first signal:

  • Does the evaluator work?
  • Do we see expected patterns?
  • Are there implementation bugs?

Then scale to 25 questions for full decision-making power.