---
license: llama3.1
tags:
  - codette
  - reasoning
  - multi-perspective
  - training-data
  - synthetic
language:
  - en
pipeline_tag: text-generation
---

# Codette Reasoning - Training Datasets

Synthetic training datasets for the **Codette Multi-Perspective Reasoning System**.

Each dataset contains instruction-tuning examples designed to teach a specific cognitive reasoning perspective to Llama 3.1 8B Instruct via LoRA fine-tuning.

## Datasets

| Dataset | Adapter | Examples | Description |
|---|---|---|---|
| newton_reasoning.jsonl | Newton | 3000 | Analytical physics, systematic reasoning, empirical evidence |
| davinci_reasoning.jsonl | DaVinci | 2500 | Creative invention, cross-domain connections, visual thinking |
| empathy_reasoning.jsonl | Empathy | 2500 | Emotional intelligence, human experience, compassion |
| philosophy_reasoning.jsonl | Philosophy | 2000 | Conceptual analysis, ethical reasoning, fundamental questions |
| quantum_reasoning.jsonl | Quantum | 2000 | Probabilistic thinking, superposition, complementarity |
| consciousness_reasoning.jsonl | Consciousness | 3000 | Recursive cognition (RC+xi), meta-cognition, epistemic tension |
| multi_perspective_reasoning.jsonl | Multi-Perspective | 2500 | Cross-lens synthesis, integrative reasoning |
| systems_architecture_reasoning.jsonl | Systems Architecture | 2000 | Modularity, scalability, engineering principles |
| orchestrator_reasoning.jsonl | Orchestrator | 4000 | Query routing, debate coordination, coherence monitoring |

**Total: 23,500 training examples**

## Format

Each JSONL file contains records in chat-completion format:

```json
{
  "messages": [
    {"role": "system", "content": "You are Codette, reasoning with Newtonian analytical precision."},
    {"role": "user", "content": "Explain the relationship between force and acceleration."},
    {"role": "assistant", "content": "From an analytical physics perspective..."}
  ]
}
```
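
A record can be checked against this shape with a few lines of standard-library Python (a sketch; `validate_record` is an illustrative helper, not part of the dataset tooling):

```python
import json

def validate_record(line: str) -> bool:
    """Check that a JSONL line matches the expected chat-completion shape."""
    record = json.loads(line)
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    # Expect a system prompt, then a user turn, ending with an assistant turn.
    return (
        len(messages) >= 3
        and roles[0] == "system"
        and roles[1] == "user"
        and roles[-1] == "assistant"
        and all(isinstance(m.get("content"), str) for m in messages)
    )

sample = json.dumps({
    "messages": [
        {"role": "system", "content": "You are Codette, reasoning with Newtonian analytical precision."},
        {"role": "user", "content": "Explain the relationship between force and acceleration."},
        {"role": "assistant", "content": "From an analytical physics perspective..."},
    ]
})
```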

## Generation Method

Datasets are generated using a pure-Python template engine (no model inference required):

1. **Template Registry**: 30-60 question templates per adapter with variable slots
2. **Topic Engine**: 40-80 topics with subtopics for domain-specific coverage
3. **Answer Generator**: Structured educational answers (80-200 words) with perspective-specific framing
4. **Counterexamples**: 12% of examples include counterexample reasoning for robustness
5. **Phase 6+ Awareness**: All templates incorporate semantic tension, coherence field, and AEGIS concepts
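
The pipeline above can be sketched roughly as follows. The template strings, topics, and answer text here are tiny illustrative stand-ins, not the actual registries:

```python
import json
import random

# Illustrative slices of the registries described above.
TEMPLATES = [
    "Explain {topic} from first principles.",
    "What evidence supports our understanding of {topic}?",
]
TOPICS = ["Newton's second law", "conservation of momentum"]
COUNTEREXAMPLE_RATE = 0.12  # ~12% of examples include counterexample reasoning

def make_example(rng: random.Random) -> dict:
    """Fill a question template with a topic and attach a structured answer."""
    topic = rng.choice(TOPICS)
    question = rng.choice(TEMPLATES).format(topic=topic)
    answer = f"From an analytical physics perspective, {topic} ..."
    if rng.random() < COUNTEREXAMPLE_RATE:
        answer += " A counterexample to consider: ..."
    return {"messages": [
        {"role": "system", "content": "You are Codette, reasoning with Newtonian analytical precision."},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

rng = random.Random(0)
lines = [json.dumps(make_example(rng)) for _ in range(1000)]
```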

## Phase 6+ Framework Coverage

The datasets teach these framework concepts across all perspectives:

- **Semantic Tension (xi)**: Measuring and working with epistemic disagreement
- **Coherence Field (Gamma)**: Monitoring reasoning health and detecting collapse
- **Quantum Spiderweb**: Belief propagation and perspective interconnection
- **AEGIS Governance**: Ethical validation across 6 frameworks (utilitarian, deontological, virtue, care, justice, rights)
- **Specialization Tracking**: Domain expertise development and confidence calibration
- **Pre-flight Prediction**: Anticipating conflicts before multi-agent debate
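
The exact definitions of xi and Gamma live in the Codette codebase; as a rough illustration of the idea only, semantic tension can be modeled as mean pairwise disagreement between perspective answer embeddings (the vectors and function below are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_tension(vectors):
    """Mean pairwise cosine distance across perspective embeddings.
    Higher values indicate more epistemic disagreement."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(1 - cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Identical views -> zero tension; divergent views -> high tension.
agreeing = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
disagreeing = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```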

## Usage

### Load with Hugging Face Datasets
```python
from datasets import load_dataset

ds = load_dataset("Raiff1982/Codette-Reasoning", data_files="newton_reasoning.jsonl")
```

### Train a LoRA Adapter
```python
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Training hyperparameters belong in SFTConfig, not SFTTrainer kwargs.
# (In trl >= 0.13 the field is `max_length` rather than `max_seq_length`.)
training_args = SFTConfig(
    output_dir="codette-newton-lora",
    max_seq_length=2048,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model=base_model,  # a loaded Llama 3.1 8B Instruct model
    train_dataset=ds["train"],
    args=training_args,
    peft_config=lora_config,
)
trainer.train()
```

## Related Repos

- [Raiff1982/codette-llama-3.1-8b-gguf](https://huggingface.co/Raiff1982/codette-llama-3.1-8b-gguf) - Quantized GGUF model
- [Raiff1982/codette-lora-adapters](https://huggingface.co/Raiff1982/codette-lora-adapters) - Trained LoRA adapters
- [Raiff1982/codette-llama-3.1-8b-merged](https://huggingface.co/Raiff1982/codette-llama-3.1-8b-merged) - Merged orchestrator model

## License

Datasets are released under the same terms as the Llama 3.1 model they are designed to fine-tune, subject to the [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE).