# RAE Training Methodology

## Recursive Abstraction Engine as Training-Time Cognitive Installation

> **The handwriting principle**: Slow, multi-representational, generative reconstruction
> during training installs richer internal representations — producing fast, effortless
> retrieval at inference. The hand was slow so the mind could be fast later.

---

## Core Thesis

Standard fine-tuning trains models on flat `input → output` pairs. This is **typing** — discriminative lookup from heavy context. RAE Training forces models through **multi-phase generative reconstruction**, creating the neural equivalent of handwriting:

| Property | Handwriting | RAE Training |
|----------|-------------|--------------|
| Forced sequential reconstruction | Must regenerate each letter from memory | Must generate each cognitive phase from internal state |
| Multi-pathway co-firing | Motor + visual + spatial + linguistic | Saturation + abstraction + descent + integration |
| Temporal bottleneck | Slowness forces deeper encoding | Multi-phase chain forces richer weight geometry |
| Variability | No two handwritten letters identical | Stochastic phase generation prevents rote memorization |
| Closed-loop embodiment | Proprioceptive error correction | Phase-to-phase coherence loss creates self-correction |

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                      RAE TRAINING PIPELINE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐   ┌───────────┐   ┌──────────┐   ┌──────────┐     │
│  │SATURATION│──►│ABSTRACTION│──►│ DESCENT  │──►│INTEGRATE │     │
│  │  tokens  │   │  tokens   │   │  tokens  │   │  tokens  │     │
│  └──────────┘   └───────────┘   └──────────┘   └──────────┘     │
│       ▲                                              │          │
│       └──────────────────────────────────────────────┘          │
│                                                                 │
│  Loss = λ₁·L_sat + λ₂·L_abs + λ₃·L_desc + λ₄·L_int              │
│       + λ_coh·L_coherence + λ_comp·L_compression                │
│                                                                 │
│  Key: ALL phases contribute to loss, not just final answer      │
└─────────────────────────────────────────────────────────────────┘
```

## Training Objectives (Multi-Objective Co-Training)

1. **Phase Generation Loss** — Each RAE phase must be generated correctly
2. **Cross-Phase Coherence Loss** — Abstractions must logically follow from saturation
3. **Compression Loss** — Abstraction phase penalized for being longer than saturation
4. **Prediction Accuracy Loss** — Descent-phase predictions evaluated against ground truth
5. **Integration Quality Loss** — Final synthesis must incorporate phase outputs

## Quick Start

### Option A: AutoTrain (No-Code)

```bash
pip install autotrain-advanced
autotrain --config configs/autotrain_rae_sft.yaml
```

### Option B: Custom Trainer (Full Control)

```bash
pip install -r requirements.txt
python src/train_rae.py --config configs/rae_training_config.json
```

### Option C: HuggingFace Spaces

Upload to a Space with GPU — see `scripts/deploy_to_hf_space.sh`

## Dataset Format

RAE training data uses JSONL with structured multi-phase reasoning:

```json
{
  "messages": [
    {"role": "system", "content": "You are an RAE-trained reasoner..."},
    {"role": "user", "content": ""},
    {"role": "assistant", "content": "............"}
  ]
}
```

## Files

```
rae-training/
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── configs/
│   ├── autotrain_rae_sft.yaml    # AutoTrain config (no-code path)
│   ├── rae_training_config.json  # Custom trainer config
│   └── base_models.json          # Tested base model registry
├── src/
│   ├── dataset_generator.py      # Generates RAE-structured training data
│   ├── rae_data_formatter.py     # Formats raw data into RAE phases
│   ├── train_rae.py              # Custom RAE trainer with multi-phase loss
│   ├── rae_loss.py               # Multi-objective loss functions
│   └── rae_tokenizer_utils.py    # Phase-aware tokenization
├── evaluation/
│   ├── eval_rae_model.py         # Evaluation harness
│   └── benchmarks.json           # Test problems for before/after comparison
├── data/
│   └── seed_problems.jsonl       # Seed problems for dataset generation
└── scripts/
    ├── generate_dataset.sh       # End-to-end dataset generation
    ├── run_training.sh           # Training launcher
    └── deploy_to_hf_space.sh     # HF Spaces deployment
```

## Theory: Why This Works

See the companion document `THEORY.md` for the full neuroscience-to-ML mapping.

**TL;DR**: Handwriting activates widespread brain connectivity because it forces *generative reconstruction through multiple representational modalities simultaneously under a temporal bottleneck*. RAE training replicates this by forcing the model to traverse Saturation → Abstraction → Descent → Integration phases, with loss computed on ALL phases — meaning the model cannot shortcut to the answer. The multi-phase structure installs richer weight geometry that persists as faster, more capable inference after training.
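
The combined objective from the Architecture diagram (λ-weighted per-phase losses plus coherence and compression terms) can be sketched as plain Python. This is an illustrative sketch only: the function name `rae_loss`, the default weight values, and the argument layout are assumptions, not the actual API of `src/rae_loss.py`, and the per-phase losses would in practice be per-token cross-entropy tensors rather than scalars.

```python
# Sketch of the multi-objective RAE loss, assuming the per-phase losses
# (saturation, abstraction, descent, integration) have been computed
# upstream. Weight names mirror λ₁..λ₄, λ_coh, λ_comp in the diagram;
# the default values here are placeholders, not tuned settings.

def rae_loss(phase_losses, coherence_loss, compression_loss,
             phase_weights=(1.0, 1.0, 1.0, 1.0),
             w_coherence=0.5, w_compression=0.1):
    """Weighted sum over all four phases plus the two auxiliary terms.

    Every phase contributes to the total, so the model cannot shortcut
    to the final answer — the property the README calls out as key.
    """
    if len(phase_losses) != len(phase_weights):
        raise ValueError("need exactly one weight per phase")
    total = sum(w * l for w, l in zip(phase_weights, phase_losses))
    return total + w_coherence * coherence_loss + w_compression * compression_loss


# Example with scalar stand-ins for the per-phase losses:
loss = rae_loss([0.8, 0.6, 0.7, 0.5],
                coherence_loss=0.4, compression_loss=0.2)
# 2.6 (phases) + 0.2 (coherence) + 0.02 (compression) = 2.82
```

In a real trainer these terms would be differentiable tensors and the weighted sum would be the value passed to `backward()`; the scalar version above only shows the bookkeeping.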
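
The JSONL chat format under "Dataset Format" can likewise be sketched as a small record builder. The phase tags, the helper name `to_rae_record`, and the way phase texts are joined into the assistant turn are all hypothetical — the repo's actual formatting lives in `src/rae_data_formatter.py` and may differ.

```python
import json

# Hypothetical sketch: wrap a seed problem and four phase texts into one
# JSONL training record matching the chat shape shown above. Phase tag
# names and the joining convention are assumptions for illustration.

PHASES = ["SATURATION", "ABSTRACTION", "DESCENT", "INTEGRATION"]

def to_rae_record(problem: str, phase_texts: list[str]) -> dict:
    """Build one chat-format record with all four RAE phases in the
    assistant turn, so loss can be computed on every phase."""
    if len(phase_texts) != len(PHASES):
        raise ValueError("need exactly one text per RAE phase")
    assistant = "\n\n".join(
        f"[{name}]\n{text}" for name, text in zip(PHASES, phase_texts)
    )
    return {"messages": [
        {"role": "system", "content": "You are an RAE-trained reasoner..."},
        {"role": "user", "content": problem},
        {"role": "assistant", "content": assistant},
    ]}

record = to_rae_record("What drives tides?", ["...", "...", "...", "..."])
line = json.dumps(record)  # one line per example in the output JSONL file
```

Writing one `json.dumps` result per line yields a file directly consumable by the AutoTrain and custom-trainer paths described in Quick Start.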