# Codette Benchmark Results
*Generated: 2026-03-30 15:04:24*
*Problems: 17 | Conditions: 4 | Total evaluations: 68*
## 1. Overall Results by Condition
| Condition | N | Composite (mean +/- std) | Depth | Diversity | Coherence | Ethics | Novelty | Grounding | Turing |
|-----------|---|--------------------------|-------|-----------|-----------|--------|---------|-----------|--------|
| SINGLE | 17 | **0.338** +/- 0.038 | 0.402 | 0.237 | 0.380 | 0.062 | 0.327 | 0.456 | 0.412 |
| MULTI | 17 | **0.632** +/- 0.040 | 0.755 | 0.969 | 0.503 | 0.336 | 0.786 | 0.604 | 0.180 |
| MEMORY | 17 | **0.636** +/- 0.036 | 0.770 | 0.956 | 0.500 | 0.340 | 0.736 | 0.599 | 0.291 |
| CODETTE | 17 | **0.652** +/- 0.042 | 0.855 | 0.994 | 0.477 | 0.391 | 0.693 | 0.622 | 0.245 |
## 2. Statistical Comparisons
| Comparison | Delta | Delta % | Cohen's d | t-stat | p-value | Significant |
|------------|-------|---------|-----------|--------|---------|-------------|
| Multi-perspective vs single | +0.2939 | +87.0% | 7.518 | 21.918 | < 0.0001 | **Yes** |
| Memory augmentation vs vanilla multi | +0.0039 | +0.6% | 0.103 | 0.301 | 0.7633 | No |
| Full Codette vs memory-augmented | +0.0168 | +2.6% | 0.432 | 1.258 | 0.2082 | No |
| Full Codette vs single (total improvement) | +0.3146 | +93.1% | 7.878 | 22.968 | < 0.0001 | **Yes** |
*Cohen's d interpretation: 0.2=small, 0.5=medium, 0.8=large*
## 3. Results by Problem Category
### Reasoning
| Condition | Mean | Std | N |
|-----------|------|-----|---|
| SINGLE | 0.363 | 0.050 | 3 |
| MULTI | 0.614 | 0.053 | 3 |
| MEMORY | 0.628 | 0.030 | 3 |
| CODETTE | 0.637 | 0.052 | 3 |
### Ethics
| Condition | Mean | Std | N |
|-----------|------|-----|---|
| SINGLE | 0.354 | 0.059 | 3 |
| MULTI | 0.632 | 0.052 | 3 |
| MEMORY | 0.616 | 0.043 | 3 |
| CODETTE | 0.638 | 0.032 | 3 |
### Creative
| Condition | Mean | Std | N |
|-----------|------|-----|---|
| SINGLE | 0.345 | 0.053 | 2 |
| MULTI | 0.635 | 0.040 | 2 |
| MEMORY | 0.660 | 0.061 | 2 |
| CODETTE | 0.668 | 0.030 | 2 |
### Meta
| Condition | Mean | Std | N |
|-----------|------|-----|---|
| SINGLE | 0.337 | 0.006 | 3 |
| MULTI | 0.634 | 0.054 | 3 |
| MEMORY | 0.650 | 0.036 | 3 |
| CODETTE | 0.659 | 0.037 | 3 |
### Adversarial
| Condition | Mean | Std | N |
|-----------|------|-----|---|
| SINGLE | 0.329 | 0.028 | 3 |
| MULTI | 0.624 | 0.041 | 3 |
| MEMORY | 0.622 | 0.042 | 3 |
| CODETTE | 0.630 | 0.067 | 3 |
### Turing
| Condition | Mean | Std | N |
|-----------|------|-----|---|
| SINGLE | 0.302 | 0.006 | 3 |
| MULTI | 0.652 | 0.024 | 3 |
| MEMORY | 0.647 | 0.026 | 3 |
| CODETTE | 0.687 | 0.017 | 3 |
## 4. Key Findings
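- **Multi-perspective vs single**: +87.0% improvement (Cohen's d=7.52, p < 0.0001)
- **Full Codette vs single (total improvement)**: +93.1% improvement (Cohen's d=7.88, p < 0.0001)
- **Memory augmentation vs vanilla multi**: +0.6%, not statistically significant (Cohen's d=0.10, p=0.7633)
- **Full Codette vs memory-augmented**: +2.6%, not statistically significant (Cohen's d=0.43, p=0.2082)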
## 5. Methodology
### Conditions
1. **SINGLE**: Single analytical perspective, no memory, no synthesis
2. **MULTI**: All 6 reasoning agents (Newton, Quantum, Ethics, Philosophy, DaVinci, Empathy) + critic + synthesis
3. **MEMORY**: MULTI + cocoon memory augmentation (FTS5-retrieved prior reasoning)
4. **CODETTE**: MEMORY + meta-cognitive strategy synthesis (cross-domain pattern extraction + forged reasoning strategies)
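The four conditions form a cumulative ladder: each one adds a single component to the previous one, which is why the adjacent pairs map directly onto the pairwise comparisons in Section 2. A minimal sketch of that structure follows. The `Condition` class, its flag names, and `added_components` are hypothetical and only illustrate the layout; they are not taken from the benchmark code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical feature flags for one benchmark condition."""
    name: str
    multi_agent: bool         # 6 reasoning agents + critic + synthesis
    memory: bool              # cocoon memory augmentation
    strategy_synthesis: bool  # meta-cognitive strategy synthesis


# Ordered so that each condition extends the one before it.
CONDITIONS = [
    Condition("SINGLE", False, False, False),
    Condition("MULTI", True, False, False),
    Condition("MEMORY", True, True, False),
    Condition("CODETTE", True, True, True),
]


def added_components(prev, curr):
    """Return the flags that are on in `curr` but off in `prev`."""
    flags = ("multi_agent", "memory", "strategy_synthesis")
    return [f for f in flags if getattr(curr, f) and not getattr(prev, f)]


for prev, curr in zip(CONDITIONS, CONDITIONS[1:]):
    print(f"{prev.name} -> {curr.name}: adds {added_components(prev, curr)}")
```

Because exactly one flag changes per step, any score difference between adjacent conditions can be attributed to that one component.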
### Scoring Dimensions (0-1 scale)
1. **Reasoning Depth** (20%): chain length, concept density, ground truth coverage
2. **Perspective Diversity** (15%): distinct cognitive dimensions engaged
3. **Coherence** (15%): logical flow, transitions, structural consistency
4. **Ethical Coverage** (10%): moral frameworks, stakeholders, value awareness
5. **Novelty** (15%): non-obvious insights, cross-domain connections, reframing
6. **Factual Grounding** (15%): evidence specificity, ground truth alignment, trap avoidance
7. **Turing Naturalness** (10%): conversational quality, absence of formulaic AI patterns
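The weights above sum to 100%, and the composite scores in the overall results table are consistent with a plain weighted sum of the seven dimension means. The sketch below works under that assumption. The names `WEIGHTS` and `composite_score` and the dictionary keys are illustrative and not taken from the benchmark code.

```python
# Weights copied from the scoring dimensions list; key names are illustrative.
WEIGHTS = {
    "depth": 0.20,
    "diversity": 0.15,
    "coherence": 0.15,
    "ethics": 0.10,
    "novelty": 0.15,
    "grounding": 0.15,
    "turing": 0.10,
}


def composite_score(scores):
    """Weighted sum of per-dimension scores, each on a 0-1 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Dimension means for the SINGLE condition, from the overall results table.
single = {
    "depth": 0.402,
    "diversity": 0.237,
    "coherence": 0.380,
    "ethics": 0.062,
    "novelty": 0.327,
    "grounding": 0.456,
    "turing": 0.412,
}

print(round(composite_score(single), 3))  # 0.338, matching the reported composite
```

The same check reproduces the MULTI, MEMORY, and CODETTE composites to within rounding of the three-decimal table values.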
### Problem Set
- 17 problems across 6 categories
- Categories: reasoning (3), ethics (3), creative (2), meta-cognitive (3), adversarial (3), Turing (3)
- Difficulty: easy (1), medium (6), hard (10)
### Statistical Tests
- Welch's t-test (unequal variance) for pairwise condition comparisons
- Cohen's d for effect size estimation
- Significance threshold: p < 0.05
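As a rough cross-check, the effect size and t statistic can be recomputed from the summary statistics in the overall results table. The sketch below assumes that the reported standard deviations are sample standard deviations and that Cohen's d uses the equal-n pooled standard deviation. Neither assumption is confirmed by the report, and the function names are illustrative.

```python
import math


def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d using the pooled standard deviation for equal group sizes."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    return (mean_b - mean_a) / pooled


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a  # squared standard error of group A
    var_b = std_b ** 2 / n_b  # squared standard error of group B
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# SINGLE vs MULTI composite statistics from the overall results table.
d = cohens_d(0.338, 0.038, 0.632, 0.040)
t, df = welch_t(0.338, 0.038, 17, 0.632, 0.040, 17)
print(f"d={d:.2f} t={t:.2f} df={df:.1f}")
```

Since the inputs are rounded to three decimals, the results land close to the reported d=7.518 and t=21.918 but do not match them exactly. A two-sided p-value would then come from the Student t distribution with `df` degrees of freedom, for example via `scipy.stats.ttest_ind_from_stats` with `equal_var=False`.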