	# Codette Benchmark Results

	Generated: 2026-03-30 15:04:24

	Problems: 17 | Conditions: 4 | Total evaluations: 68

	## 1. Overall Results by Condition

	| Condition | N | Composite (mean +/- std) | Depth | Diversity | Coherence | Ethics | Novelty | Grounding | Turing |
	|-----------|---|--------------------------|-------|-----------|-----------|--------|---------|-----------|--------|
	| SINGLE | 17 | 0.338 +/- 0.038 | 0.402 | 0.237 | 0.380 | 0.062 | 0.327 | 0.456 | 0.412 |
	| MULTI | 17 | 0.632 +/- 0.040 | 0.755 | 0.969 | 0.503 | 0.336 | 0.786 | 0.604 | 0.180 |
	| MEMORY | 17 | 0.636 +/- 0.036 | 0.770 | 0.956 | 0.500 | 0.340 | 0.736 | 0.599 | 0.291 |
	| CODETTE | 17 | 0.652 +/- 0.042 | 0.855 | 0.994 | 0.477 | 0.391 | 0.693 | 0.622 | 0.245 |

	## 2. Statistical Comparisons

	| Comparison | Delta | Delta % | Cohen's d | t-stat | p-value | Significant |
	|------------|-------|---------|-----------|--------|---------|-------------|
	| Multi-perspective vs single | +0.2939 | +87.0% | 7.518 | 21.918 | < 0.0001 | Yes |
	| Memory augmentation vs vanilla multi | +0.0039 | +0.6% | 0.103 | 0.301 | 0.7633 | No |
	| Full Codette vs memory-augmented | +0.0168 | +2.6% | 0.432 | 1.258 | 0.2082 | No |
	| Full Codette vs single (total improvement) | +0.3146 | +93.1% | 7.878 | 22.968 | < 0.0001 | Yes |

	Cohen's d interpretation: 0.2=small, 0.5=medium, 0.8=large
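As a rough check on the effect sizes in this section, Cohen's d can be recomputed from the rounded means and standard deviations in Section 1. The sketch below assumes the equal-group-size pooled standard deviation, sqrt((s1^2 + s2^2) / 2); the report does not state which pooling formula it used, and the function names `cohens_d` and `label` are illustrative. Because the inputs are rounded to three decimals, the results only approximate the table values.

```python
import math

def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d for two equal-sized groups, using the pooled standard deviation."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2)
    return (mean_b - mean_a) / pooled

def label(d):
    """Map |d| onto the conventional small / medium / large bands."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

# Composite mean and std per condition, copied from Section 1.
d_multi_vs_single = cohens_d(0.338, 0.038, 0.632, 0.040)
d_memory_vs_multi = cohens_d(0.632, 0.040, 0.636, 0.036)
d_codette_vs_memory = cohens_d(0.636, 0.036, 0.652, 0.042)

for name, d in [
    ("multi vs single", d_multi_vs_single),
    ("memory vs multi", d_memory_vs_multi),
    ("codette vs memory", d_codette_vs_memory),
]:
    print(f"{name}: d={d:.2f} ({label(d)})")
```

Under this assumption the multi-vs-single effect lands far above the 0.8 "large" threshold, while the two incremental comparisons stay below 0.5.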

	## 3. Results by Problem Category

	### Reasoning

	| Condition | Mean | Std | N |
	|-----------|------|-----|---|
	| SINGLE | 0.363 | 0.050 | 3 |
	| MULTI | 0.614 | 0.053 | 3 |
	| MEMORY | 0.628 | 0.030 | 3 |
	| CODETTE | 0.637 | 0.052 | 3 |

	### Ethics

	| Condition | Mean | Std | N |
	|-----------|------|-----|---|
	| SINGLE | 0.354 | 0.059 | 3 |
	| MULTI | 0.632 | 0.052 | 3 |
	| MEMORY | 0.616 | 0.043 | 3 |
	| CODETTE | 0.638 | 0.032 | 3 |

	### Creative

	| Condition | Mean | Std | N |
	|-----------|------|-----|---|
	| SINGLE | 0.345 | 0.053 | 2 |
	| MULTI | 0.635 | 0.040 | 2 |
	| MEMORY | 0.660 | 0.061 | 2 |
	| CODETTE | 0.668 | 0.030 | 2 |

	### Meta

	| Condition | Mean | Std | N |
	|-----------|------|-----|---|
	| SINGLE | 0.337 | 0.006 | 3 |
	| MULTI | 0.634 | 0.054 | 3 |
	| MEMORY | 0.650 | 0.036 | 3 |
	| CODETTE | 0.659 | 0.037 | 3 |

	### Adversarial

	| Condition | Mean | Std | N |
	|-----------|------|-----|---|
	| SINGLE | 0.329 | 0.028 | 3 |
	| MULTI | 0.624 | 0.041 | 3 |
	| MEMORY | 0.622 | 0.042 | 3 |
	| CODETTE | 0.630 | 0.067 | 3 |

	### Turing

	| Condition | Mean | Std | N |
	|-----------|------|-----|---|
	| SINGLE | 0.302 | 0.006 | 3 |
	| MULTI | 0.652 | 0.024 | 3 |
	| MEMORY | 0.647 | 0.026 | 3 |
	| CODETTE | 0.687 | 0.017 | 3 |
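The per-category means can be cross-checked against the overall composite in Section 1: weighting each category mean by its N should recover the overall mean, up to rounding. A minimal sketch with the values copied from the six category tables (in the order reasoning, ethics, creative, meta, adversarial, Turing); the dictionary layout and the `weighted_mean` helper are illustrative, not part of the benchmark code.

```python
# (mean, n) per category, copied from the category tables in this section.
CATEGORY_MEANS = {
    "SINGLE":  [(0.363, 3), (0.354, 3), (0.345, 2), (0.337, 3), (0.329, 3), (0.302, 3)],
    "MULTI":   [(0.614, 3), (0.632, 3), (0.635, 2), (0.634, 3), (0.624, 3), (0.652, 3)],
    "MEMORY":  [(0.628, 3), (0.616, 3), (0.660, 2), (0.650, 3), (0.622, 3), (0.647, 3)],
    "CODETTE": [(0.637, 3), (0.638, 3), (0.668, 2), (0.659, 3), (0.630, 3), (0.687, 3)],
}

# Overall composite means reported in Section 1.
REPORTED_OVERALL = {"SINGLE": 0.338, "MULTI": 0.632, "MEMORY": 0.636, "CODETTE": 0.652}

def weighted_mean(pairs):
    """Average of category means, weighted by the number of problems in each."""
    total_n = sum(n for _, n in pairs)
    return sum(mean * n for mean, n in pairs) / total_n

recomputed = {cond: weighted_mean(pairs) for cond, pairs in CATEGORY_MEANS.items()}

for cond, value in recomputed.items():
    print(f"{cond}: recomputed={value:.4f} reported={REPORTED_OVERALL[cond]:.3f}")
```

With only 2 or 3 problems per category, the per-category standard deviations are weak estimates, so differences between categories are best read as descriptive.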

	## 4. Key Findings

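	- Multi-perspective vs single: +87.0% improvement in composite score (Cohen's d=7.52, p < 0.0001)
	- Full Codette vs single (total improvement): +93.1% improvement in composite score (Cohen's d=7.88, p < 0.0001)
	- Memory augmentation vs vanilla multi (+0.6%, Cohen's d=0.10, p=0.7633) and full Codette vs memory-augmented (+2.6%, Cohen's d=0.43, p=0.2082): not significant at p < 0.05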

	## 5. Methodology

	### Conditions

	1. SINGLE — Single analytical perspective, no memory, no synthesis
	2. MULTI — All 6 reasoning agents (Newton, Quantum, Ethics, Philosophy, DaVinci, Empathy) + critic + synthesis
	3. MEMORY — MULTI + cocoon memory augmentation (FTS5-retrieved prior reasoning)
	4. CODETTE — MEMORY + meta-cognitive strategy synthesis (cross-domain pattern extraction + forged reasoning strategies)
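The four conditions form a cumulative ablation: each one switches on a component on top of the previous one. One way to make that explicit is a small configuration table. Everything in this sketch (the `Condition` dataclass, its field names, the `added_components` helper) is hypothetical and only mirrors the list above; it is not taken from the Codette codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    name: str
    n_agents: int               # number of reasoning perspectives consulted
    critic_and_synthesis: bool  # critic pass plus synthesis of agent outputs
    cocoon_memory: bool         # FTS5-retrieved prior reasoning
    strategy_synthesis: bool    # meta-cognitive strategy synthesis

CONDITIONS = [
    Condition("SINGLE", 1, False, False, False),
    Condition("MULTI", 6, True, False, False),
    Condition("MEMORY", 6, True, True, False),
    Condition("CODETTE", 6, True, True, True),
]

def added_components(prev, cur):
    """Names of the boolean components switched on between two conditions."""
    flags = ("critic_and_synthesis", "cocoon_memory", "strategy_synthesis")
    return [f for f in flags if getattr(cur, f) and not getattr(prev, f)]

for prev, cur in zip(CONDITIONS, CONDITIONS[1:]):
    print(f"{prev.name} -> {cur.name}: {added_components(prev, cur)}")
```

Laying the conditions out this way shows that MEMORY and CODETTE each add exactly one component, which is what lets the pairwise comparisons in Section 2 be read as per-component contributions.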

	### Scoring Dimensions (0-1 scale)

	1. Reasoning Depth (20%) — chain length, concept density, ground truth coverage
	2. Perspective Diversity (15%) — distinct cognitive dimensions engaged
	3. Coherence (15%) — logical flow, transitions, structural consistency
	4. Ethical Coverage (10%) — moral frameworks, stakeholders, value awareness
	5. Novelty (15%) — non-obvious insights, cross-domain connections, reframing
	6. Factual Grounding (15%) — evidence specificity, ground truth alignment, trap avoidance
	7. Turing Naturalness (10%) — conversational quality, absence of formulaic AI patterns
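The weights above sum to 100%. Assuming the composite is the plain weighted sum of the seven dimension scores (the report does not spell this out, but the Section 1 means are consistent with it), it can be sketched as follows. The dictionary keys and the `composite` function name are illustrative.

```python
# Dimension weights from the list above, as fractions of 1.
WEIGHTS = {
    "depth": 0.20,
    "diversity": 0.15,
    "coherence": 0.15,
    "ethics": 0.10,
    "novelty": 0.15,
    "grounding": 0.15,
    "turing": 0.10,
}

def composite(scores):
    """Weighted sum of per-dimension scores, each on a 0-1 scale."""
    if abs(sum(WEIGHTS.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Mean dimension scores for two conditions, copied from Section 1.
single = {"depth": 0.402, "diversity": 0.237, "coherence": 0.380, "ethics": 0.062,
          "novelty": 0.327, "grounding": 0.456, "turing": 0.412}
codette = {"depth": 0.855, "diversity": 0.994, "coherence": 0.477, "ethics": 0.391,
           "novelty": 0.693, "grounding": 0.622, "turing": 0.245}

composite_single = composite(single)
composite_codette = composite(codette)
print(f"SINGLE:  {composite_single:.4f}")
print(f"CODETTE: {composite_codette:.4f}")
```

Because the weighted sum is linear, applying it to the mean dimension scores gives the same result as averaging per-problem composites, which is why the Section 1 row means can be used directly here.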

	### Problem Set

	- 17 problems across 6 categories
	- Categories: reasoning (3), ethics (3), creative (2), meta-cognitive (3), adversarial (3), Turing (3)
	- Difficulty: easy (1), medium (6), hard (10)

	### Statistical Tests

	- Welch's t-test (unequal variance) for pairwise condition comparisons
	- Cohen's d for effect size estimation
	- Significance threshold: p < 0.05
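The tests listed above can be approximated from the summary statistics alone. The sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from means, standard deviations, and N. For the p-value it uses a two-sided standard normal approximation, which is an assumption made here to stay within the Python standard library; an exact Welch test would use Student's t distribution with the computed degrees of freedom (for example via `scipy.stats.ttest_ind` with `equal_var=False` on the raw scores). The report does not state how its p-values were computed, and the function names are illustrative.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    se_a, se_b = std_a ** 2 / n_a, std_b ** 2 / n_b  # squared standard errors
    t = (mean_b - mean_a) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

ALPHA = 0.05  # significance threshold from the list above

# Composite mean, std, N per condition, copied from Section 1.
t_big, df_big = welch_t(0.338, 0.038, 17, 0.632, 0.040, 17)      # MULTI vs SINGLE
t_small, df_small = welch_t(0.636, 0.036, 17, 0.652, 0.042, 17)  # CODETTE vs MEMORY

for name, t, df in [("multi vs single", t_big, df_big),
                    ("codette vs memory", t_small, df_small)]:
    p = two_sided_p_normal(t)
    print(f"{name}: t={t:.2f} df={df:.1f} p={p:.4f} significant={p < ALPHA}")
```

Since the inputs are rounded to three decimals, the recomputed statistics differ slightly from Section 2, but the significance decisions at p < 0.05 come out the same.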
