# EVALUATION STRATEGY: Phase 6 Validation Framework

**Status**: Evaluation Sprint Framework Complete
**Created**: 2026-03-19
**Purpose**: Answer whether Phase 6 is actually better, not just more complex

---

## The Core Question

We have built something elegant. But:

**Q: Is Codette + Phase 6 measurably better than baseline?**

Not:
- Does it produce longer responses?
- Does it maintain higher coherence?
- Does it satisfy the mathematical framework?

Yes:
- **Does it get more questions right?**
- **Do debates actually improve reasoning?**
- **Does the system trust the wrong answers?** (false consensus)
- **Does each Phase 6 component add value?**

---

## Test Design: 4 Conditions × 25 Questions

### Conditions (What We're Comparing)

```
Condition 1: BASELINE LLAMA
  - Plain Llama-3.1-8B, no routing, no debate
  - Reference point: what does the model do with no scaffolding?
  - Cost: ~5 seconds per question

Condition 2: PHASE 1-5 (Debate System)
  - Multi-round debate with conflict detection
  - Memory weighting for adapter selection
  - NO semantic tension (use heuristic opposition)
  - NO specialization tracking
  - NO preflight prediction
  - Cost: ~30 seconds per question

Condition 3: PHASE 6 FULL (Semantic + All)
  - Everything Phase 1-5 has PLUS:
    * Semantic tension engine (Llama embeddings)
    * Specialization tracking
    * Pre-flight conflict prediction
  - Cost: ~40 seconds per question

Condition 4: PHASE 6 -PREFLIGHT (Isolate Pre-Flight Value)
  - Phase 6 full EXCEPT: disable preflight prediction
  - Measures: Does pre-flight actually help?
  - Cost: ~35 seconds per question
```
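
The four conditions differ only in which components are switched on, so they can be written down as feature flags. The sketch below is one possible encoding, not the harness's actual configuration: the `Condition` class, the flag names, and the condition identifiers are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical flag names; the real runner may expose these differently.
FLAGS = ("debate", "semantic_tension", "specialization_tracking", "preflight")


@dataclass(frozen=True)
class Condition:
    """Feature switches for one evaluation condition."""
    name: str
    debate: bool = False
    semantic_tension: bool = False
    specialization_tracking: bool = False
    preflight: bool = False


CONDITIONS = [
    Condition("baseline_llama"),
    Condition("phase_1_5", debate=True),
    Condition("phase_6_full", debate=True, semantic_tension=True,
              specialization_tracking=True, preflight=True),
    Condition("phase_6_no_preflight", debate=True, semantic_tension=True,
              specialization_tracking=True, preflight=False),
]


def differing_flags(a: Condition, b: Condition) -> set:
    """Return the names of the flags on which two conditions differ."""
    return {flag for flag in FLAGS if getattr(a, flag) != getattr(b, flag)}
```

Comparing Condition 3 with Condition 4 through `differing_flags` should return only `preflight`, which is what makes Condition 4 a clean ablation.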

### Questions (What We're Testing)

**25 questions spanning 6 domains:**

| Domain | Easy | Medium | Hard | Topics |
|--------|------|--------|------|--------|
| Physics | 2 | 1 | 1 | Light, scattering, entropy |
| Ethics | 0 | 2 | 2 | Honesty, AI transparency, morality |
| Consciousness | 0 | 1 | 2 | Machine consciousness, mind-body |
| Creativity | 0 | 2 | 1 | Definition, AI creativity |
| Systems | 0 | 2 | 2 | Emergence, balance, feedback |
| Interdisciplinary | 0 | 0 | 3 | Free will, knowledge, time |

**Key Properties of Questions:**
- Ground truth varies (factual, rubric-based, multi-framework)
- Mix of objective (physics) and philosophical (consciousness)
- Different questions require different types of adaptation
- Difficulty scales: easy (1 perspective) → hard (5+ perspectives)
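
A question record needs to carry at least its domain, difficulty, and ground-truth type so that results can later be sliced along the table above. The following is a minimal sketch under that assumption; the `Question` class, its field names, and the `coverage` helper are illustrative and not taken from `test_suite_evaluation.py`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One evaluation question (hypothetical schema)."""
    qid: str
    domain: str       # e.g. "physics", "ethics", "consciousness"
    difficulty: str   # "easy", "medium", or "hard"
    truth_type: str   # "factual", "rubric", or "multi_framework"
    prompt: str


def coverage(questions):
    """Count questions per (domain, difficulty) cell for comparison with the table."""
    return Counter((q.domain, q.difficulty) for q in questions)
```

Running `coverage` over the finalized question set gives a quick check that each domain and difficulty cell holds the intended number of questions.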

---

## Measurement: 5 Metrics Per Question

### 1. **Correctness Score** (0-1)
**What**: Does the final synthesis give the right answer?

**How to measure**:
- Factual questions (physics): Binary or near-binary (right/wrong)
- Rubric questions (ethics): 0 = missed key framework, 0.5 = partial, 1 = complete
- Multi-perspective (consciousness): % of expected perspectives identified
- Human evaluation needed for final calibration
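
The three scoring rules above can be sketched as small functions. This assumes a grader (human or automated) has already decided which frameworks or perspectives a response covers; the function names and the label-set representation are assumptions made for illustration.

```python
def score_factual(is_correct: bool) -> float:
    """Binary score for factual questions."""
    return 1.0 if is_correct else 0.0


def score_rubric(frameworks_hit: int, frameworks_expected: int) -> float:
    """Rubric score: 0 = missed the key frameworks, 0.5 = partial, 1 = complete."""
    if frameworks_expected <= 0:
        raise ValueError("rubric must expect at least one framework")
    if frameworks_hit <= 0:
        return 0.0
    if frameworks_hit >= frameworks_expected:
        return 1.0
    return 0.5


def score_multi_perspective(found: set, expected: set) -> float:
    """Fraction of the expected perspectives that the response identified."""
    if not expected:
        raise ValueError("expected perspectives must be non-empty")
    return len(found & expected) / len(expected)
```

Perspectives found outside the expected set are ignored here rather than penalized; whether that is the right choice is one of the things human calibration would need to settle.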

**Expected Pattern**:
```
Baseline:     0.55 ± 0.20  (gets some questions right by luck)
Phase 1-5:    0.65 ± 0.18  (debate helps with reasoning)
Phase 6 Full: 0.72 ± 0.16  (semantic tension picks winners better)
```

### 2. **Reasoning Depth** (1-5 scale)
**What**: How many distinct perspectives did the system identify?

**How to measure**:
- Count unique agent positions in debate
- 1 = single perspective, 5 = 5+ integrated views
- Correlation with correctness (not all disagreement is useful)
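
As a rough sketch, the depth score can be computed by counting distinct position labels and clamping to the 1-5 scale. It assumes each agent's position has already been reduced to a short label; deciding when two positions are "the same" is the hard part and is not addressed here. The function name is hypothetical.

```python
def reasoning_depth(position_labels) -> int:
    """Count distinct agent positions, clamped to the 1-5 scale."""
    unique = {label.strip().lower() for label in position_labels if label.strip()}
    return max(1, min(5, len(unique)))
```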

**Expected Pattern**:
```
Baseline:     1.0 (single output)
Phase 1-5:    2.8 ± 1.2 (debate creates disagreement)
Phase 6 Full: 3.2 ± 1.1 (semantic tension balances high-value conflicts)
```

### 3. **Calibration Error** (0-1, lower=better)
**What**: |reported_confidence - actual_correctness|

Does Codette say "I'm confident" only when it should?

**How to measure**:
- Extract coherence_score from metadata
- Compare to actual correctness_score
- 0 = perfectly calibrated, 1 = maximally miscalibrated

**Red Flag Pattern** (False Consensus):
```
High calibration error + High coherence = System is confident in wrong answer
Example:
  Gamma = 0.85 (system thinks it's done well)
  Actual correctness = 0.3 (it got it very wrong)
  Calibration error = 0.55 (WARNING: MISCALIBRATION)
```
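
The calibration formula and the false-consensus pattern can be expressed directly in code. This is a minimal sketch: the function names and the two thresholds in `is_false_consensus` are assumptions chosen to match the example above, not values defined by the framework.

```python
def calibration_error(confidence: float, correctness: float) -> float:
    """|reported_confidence - actual_correctness|, both on a 0-1 scale."""
    return abs(confidence - correctness)


def is_false_consensus(confidence: float, correctness: float,
                       confidence_floor: float = 0.8,
                       error_floor: float = 0.5) -> bool:
    """Flag answers that are reported as confident but are badly miscalibrated.

    Both thresholds are illustrative defaults and would need tuning.
    """
    return (confidence >= confidence_floor
            and calibration_error(confidence, correctness) >= error_floor)
```

With the figures from the example (Gamma 0.85, correctness 0.3) the error comes out at about 0.55 and the answer is flagged.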

### 4. **Adapter Convergence** (0-1, lower=better)
**What**: Are all adapters giving similar outputs? (Monoculture risk)

**How to measure**:
- Semantic similarity between adapter outputs
- 0 = all completely different, 1 = all identical
- Danger zone: >0.85 indicates semantic collapse
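
One way to turn "semantic similarity between adapter outputs" into a single number is the mean pairwise cosine similarity of the output embeddings. The sketch below assumes the embeddings are already available as plain lists of floats, clips negative similarities to 0 so the score stays on the 0-1 scale, and treats a single adapter as fully converged. All of these are assumptions, as are the function names.

```python
import math
from itertools import combinations


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def adapter_convergence(embeddings) -> float:
    """Mean pairwise similarity across adapter outputs (0 = all different, 1 = identical)."""
    if len(embeddings) < 2:
        return 1.0  # a single adapter is trivially "converged"
    sims = [max(0.0, cosine(u, v)) for u, v in combinations(embeddings, 2)]
    return sum(sims) / len(sims)


def in_danger_zone(convergence: float, threshold: float = 0.85) -> bool:
    """True when convergence exceeds the semantic-collapse threshold."""
    return convergence > threshold
```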

**Expected Pattern**:
```
Baseline:     1.0 (only one adapter, by definition)
Phase 1-5:    0.65 ± 0.18 (diverse outputs through debate)
Phase 6 Full: 0.58 ± 0.16 (specialization prevents convergence)
Phase 6 -PF:  0.62 ± 0.17 (similar, preflight has small impact on diversity)
```

### 5. **Debate Efficiency** (1-3 round count)
**What**: How many rounds until the system converges?

**How to measure**:
- Count rounds until resolution_rate > 80%
- Lower = more efficient (less compute wasted resolving noise)
- Phase 1-5 baseline for comparison
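
Counting rounds to convergence could look like the sketch below. It assumes the debate log exposes one `resolution_rate` value per round, in order, and that debates are capped at 3 rounds as the metric's range suggests. Returning `None` for debates that never converge is a design choice made here, not something the framework specifies.

```python
def rounds_to_converge(resolution_rates, threshold: float = 0.8, max_rounds: int = 3):
    """Return the first round (1-based) whose resolution rate exceeds the threshold.

    Returns None if no round within max_rounds clears the threshold.
    """
    for round_no, rate in enumerate(resolution_rates[:max_rounds], start=1):
        if rate > threshold:
            return round_no
    return None
```

Keeping non-converged debates as `None` instead of scoring them as 3 rounds avoids silently mixing "converged late" with "never converged" when the means are computed.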

**Expected Pattern**:
```
Phase 1-5:    2.1 ± 0.8 rounds (typically needs 2 rounds)
Phase 6 Full: 1.8 ± 0.7 rounds (pre-flight reduces setup conflicts)
Phase 6 -PF:  2.0 ± 0.8 rounds (without preflight, more setup conflicts)
```

---

## Analysis: What We're Looking For

### Primary Success Metric

**Phase 6 Correctness > Phase 1-5 Correctness** (with statistical significance)

```
Phase 1-5:        70% mean correctness
Phase 6 Full:     78% mean correctness
Improvement:      +8 percentage points

Significance (rule of thumb; the Week 3 t-tests decide):
              If std deviation < 3 points, improvement is likely real
              If std deviation > 10 points, improvement might be noise
```
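
Because every condition answers the same questions, one reasonable way to test the improvement is a paired t-test on per-question score differences. The plan only says "t-tests", so the paired form is an assumption. The sketch below uses only the standard library and compares the statistic against the two-sided 5% critical value for 24 degrees of freedom (25 questions); a statistics package would be needed for an exact p-value.

```python
import math
from statistics import mean, stdev

# Two-sided 5% critical value of Student's t with 24 degrees of freedom.
T_CRIT_DF24 = 2.064


def paired_t(scores_a, scores_b):
    """Paired t statistic for scores_b minus scores_a; returns (t, degrees_of_freedom)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 items")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


def significant_at_5pct(scores_a, scores_b) -> bool:
    """Two-sided test at the 5% level; only valid for 25 paired scores (df = 24)."""
    t, df = paired_t(scores_a, scores_b)
    if df != 24:
        raise ValueError("critical value is hard-coded for df = 24")
    return abs(t) > T_CRIT_DF24
```

The statistic divides the mean gain by the standard error, `sd / sqrt(n)`, which is why the spread of the per-question differences matters more than the spread of either condition on its own.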

### Secondary Success Metrics

1. **Debate Actually Helps**
   ```
   Phase 1-5 Correctness > Baseline Correctness
   (If not, debate is waste)
   ```

2. **Semantic Tension > Heuristics**
   ```
   Phase 6 Full Correctness > Phase 1-5 Correctness
   (The main Phase 6 innovation)
   ```

3. **Pre-Flight Has Value**
   ```
   Phase 6 Full rounds-to-converge < Phase 6 -PreFlight rounds-to-converge
   (Does pre-flight reduce wasted debate cycles?)
   ```

### Red Flags (What Could Go Wrong)

**RED FLAG 1: High Gamma, Low Correctness**
```
if mean(gamma_score) > 0.8 and mean(correctness) < 0.6:
    ALERT: "System is overconfident in wrong answers"
    Risk:  False consensus masking errors
    Action: Reduce gamma weight or add correctness feedback
```

**RED FLAG 2: Adapter Convergence > 0.85**
```
if mean(adapter_convergence) > 0.85:
    ALERT: "Semantic monoculture detected"
    Risk:  Loss of perspective diversity
    Action: Specialization tracker not working OR adapters optimizing same objective
```

**RED FLAG 3: Calibration Divergence**
```
if corr(confidence, correctness) < 0.3:
    ALERT: "System can't tell when it's right or wrong"
    Risk:  Inability to know when to ask for help
    Action: Need external ground truth signal feeding back
```

**RED FLAG 4: No Improvement Over Baseline**
```
if Phase_6_Full_Correctness <= Baseline_Correctness:
    ALERT: "Phase 6 made things worse or did nothing"
    Risk:  Added complexity with no benefit
    Action: Revert to simpler system OR debug where complexity fails
```
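
The four red-flag checks above can be collected into one function that runs over per-question results. This is a sketch under assumptions: it treats the gamma (coherence) score as the system's reported confidence for Red Flag 3, returns short hypothetical flag names instead of raising alerts, and treats an undefined correlation (zero variance) as 0.

```python
import math
from statistics import mean


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 when either series has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined; treated as "no relationship" in this sketch
    return cov / math.sqrt(var_x * var_y)


def red_flags(gamma, correctness, convergence, baseline_correctness):
    """Return the names of the red flags triggered by per-question results."""
    flags = []
    if mean(gamma) > 0.8 and mean(correctness) < 0.6:
        flags.append("overconfident")            # Red Flag 1
    if mean(convergence) > 0.85:
        flags.append("monoculture")              # Red Flag 2
    if pearson(gamma, correctness) < 0.3:
        flags.append("calibration_divergence")   # Red Flag 3
    if mean(correctness) <= mean(baseline_correctness):
        flags.append("no_improvement")           # Red Flag 4
    return flags
```

Each input is a list with one entry per question, all in the same question order, so the correlation in Red Flag 3 pairs each gamma value with the correctness of the same answer.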

---

## Evaluation Sprint Timeline

### Week 1: Setup
- [ ] Finalize 25 questions with ground truth answers/rubrics
- [ ] Implement baseline (plain Llama) runner
- [ ] Implement Phase 1-5 runner (disable Phase 6 components)
- [ ] Test harness on 5 questions (smoke test)

### Week 2: Execution
- [ ] Run 25 questions × 4 conditions = 100 runs (75 of them full debates)
- [ ] Log all metadata (conflicts, coherence, specialization, etc.)
- [ ] Monitor for runtime errors or hangs
- [ ] Save intermediate results

### Week 3: Analysis
- [ ] Compute summary statistics (mean, std deviation)
- [ ] Check for Red Flag patterns
- [ ] Compute statistical significance (t-tests)
- [ ] Ablation analysis (value of each Phase 6 component)
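
With the four conditions as designed, the ablation analysis reduces to three differences in mean correctness. The sketch below assumes the means are stored in a dictionary keyed by hypothetical condition names. Note that only pre-flight prediction is isolated on its own; semantic tension and specialization tracking are measured together as one bundle.

```python
def ablation_deltas(mean_correctness: dict) -> dict:
    """Attribute correctness gains to components, given mean score per condition."""
    m = mean_correctness
    return {
        # Value of debate over the plain model.
        "debate": m["phase_1_5"] - m["baseline_llama"],
        # Value of all Phase 6 additions together.
        "phase_6_bundle": m["phase_6_full"] - m["phase_1_5"],
        # Value of pre-flight prediction alone.
        "preflight": m["phase_6_full"] - m["phase_6_no_preflight"],
    }
```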

### Week 4: Decisions
- **If results strong**: Launch Phase 6 to production
- **If results mixed**: Refine Phase 6 (tune weights, debug), retest
- **If results weak**: Either go back to Phase 1-5 OR pivot to Phase 7 (adaptive objective function)

---

## Expected Outcomes & Decisions

### Scenario A: Phase 6 Wins Decisively
```
Phase_1_5_Correctness:    68% ± 4%
Phase_6_Full_Correctness: 76% ± 3%
Improvement:              +8 points (p < 0.05, statistically significant)
Conclusion:               Ship Phase 6
Next Step:                Phase 7 research
```

### Scenario B: Phase 6 Wins But Weakly
```
Phase_1_5_Correctness:    68% ± 6%
Phase_6_Full_Correctness: 71% ± 5%
Improvement:              +3 points (p > 0.1, not significant)
Conclusion:               Keep Phase 6, investigate bottlenecks
Next Step:                Profile where Phase 6 fails, tune weights
```

### Scenario C: Phase 6 Breaks System
```
Phase_1_5_Correctness:    68% ± 4%
Phase_6_Full_Correctness: 61% ± 8%
Improvement:              -7 points (p < 0.05, significantly WORSE)
Conclusion:               Phase 6 breaks something
Next Step:                Debug (most likely: semantic tension too aggressive, killing useful conflicts)
```

### Scenario D: Evaluation Reveals False Consensus
```
Phase_6_Full correctness: 72%
Phase_6_Full gamma:       0.85 (high coherence reported)
Correlation(gamma, correctness): 0.15 (very weak)
Conclusion:               System is gaming the coherence metric
Next Step:                Feed external ground truth back into the Γ formula
```

---

## Code Structure

**Files Created**:
- `evaluation/test_suite_evaluation.py` — Test set + evaluation harness
- `evaluation/run_evaluation_sprint.py` — Runner script
- `evaluation/evaluation_results.json` — Output (raw results)
- `evaluation/evaluation_report.txt` — Output (human-readable)

**Usage**:
```bash
# Quick test (5 questions)
python evaluation/run_evaluation_sprint.py --questions 5

# Full evaluation (25 questions) - takes ~2-3 hours
python evaluation/run_evaluation_sprint.py --questions 25

# Custom output
python evaluation/run_evaluation_sprint.py --questions 15 \
  --output-json my_results.json \
  --output-report my_report.txt
```
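
For reference, the three flags used above could be parsed with `argparse` roughly as follows. This is a sketch of how such a command line might be defined, not the actual contents of `run_evaluation_sprint.py`; the defaults are assumptions based on the file names listed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples (defaults assumed)."""
    parser = argparse.ArgumentParser(description="Run the Phase 6 evaluation sprint")
    parser.add_argument("--questions", type=int, default=25,
                        help="number of questions to evaluate")
    parser.add_argument("--output-json",
                        default="evaluation/evaluation_results.json",
                        help="path for raw results")
    parser.add_argument("--output-report",
                        default="evaluation/evaluation_report.txt",
                        help="path for the human-readable report")
    return parser
```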

---

## Key Insight

**This evaluation is not about proving elegance.**

It's about answering:

- "Does semantic tension actually improve reasoning?"
- "Does pre-flight prediction reduce wasted debate?"
- "Is the system gaming the coherence metric?"
- "When Phase 6 fails, why?"

These answers will inform **Phase 7 research** on adaptive objective functions.

If Phase 6 passes cleanly, we ship it.
If Phase 6 shows emergent pathologies, we learn what to fix.
If Phase 6 doesn't help, we avoid the sunk cost of shipping something that doesn't work.

This is how research systems mature: **measure ruthlessly**.

---

## Next Action

Ready to run the evaluation sprint?

```bash
cd "J:\codette-training-lab"
python evaluation/run_evaluation_sprint.py --questions 5  # Quick smoke test
```

This will take ~15 minutes and give us the first signal:
- Does the evaluator work?
- Do we see expected patterns?
- Are there implementation bugs?

Then scale to 25 questions for full decision-making power.