# Training Guide

## Prerequisites

- Python 3.10+
- PyTorch 2.0+ with CUDA
- 6GB+ GPU memory (for batch_size=4)
- ~200MB disk space (enwik8 dataset)

## Quick Start

### 1. Clone & Install

```bash
git clone https://github.com/inkbytefo/agi-former.git
cd agi-former
pip install -r requirements.txt
```

### 2. Run Training

```bash
python train.py
```

**Expected Output:**

```
Step 10:   Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
Step 20:   Loss = 2.5123 | BPC = 3.6246 | LR = 6.00e-05
...
Step 5000: Loss = 1.3988 | BPC = 2.0181 | LR = 3.00e-04
-- VALIDATION: Loss = 1.5650 | BPC = 2.2578
-- Saved best_model.pth
```

**Training Time:** ~15 minutes (T4 GPU, 5000 steps)

---

## Configuration

Edit hyperparameters in `train.py`:

```python
# Model
D_MODEL = 512
N_LAYERS = 6
NUM_HEADS = 8
PATCH_SIZE = 4
WINDOW_SIZE = 128
THINKING_STEPS = 3  # System 2 iterations

# Training
BATCH_SIZE = 4
MAX_STEPS = 5000
BASE_LR = 3e-4
WARMUP_STEPS = 100
GRAD_CLIP = 0.5
```

### Hyperparameter Guide

#### Model Size

- **Small:** `d_model=256, n_layers=4` → fast, lower quality
- **Medium:** `d_model=512, n_layers=6` → **default** (balanced)
- **Large:** `d_model=768, n_layers=8` → better BPC, slower

#### System 2

- `thinking_steps=0` → disabled (baseline)
- `thinking_steps=3` → **default** (active reasoning)
- `thinking_steps=5+` → more refinement, higher compute

---

## Dataset

### Enwik8

**Source:** first 100MB of English Wikipedia XML
**Size:** 100,000,000 bytes
**Split:**

- Train: 90MB
- Validation: 5MB
- Test: 5MB

**Auto-download:** the dataset downloads automatically on first run to `./data/enwik8`.

### Custom Data

To train on your own data:

```python
# In train.py, replace:
from src.data.real_data import get_enwik8_dataloader

# With your custom loader:
def get_custom_dataloader(batch_size, seq_len):
    # Your implementation.
    # Must return (batch, seq_len) tensors of bytes (0-255).
    pass
```

---

## Training Process

### 1. Initialization

```
[*] Creating AGIFORMER model...
    - d_model=512, n_layers=6, thinking_steps=3
[*] Parameters: ~50M
[*] Downloading enwik8... (if first run)
```

### 2. Warmup Phase (Steps 0-100)

```
Step 10: Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
```

- Linear LR ramp: `0 → 3e-4`
- High loss is expected (weights are still near random)

### 3. Learning Phase (Steps 100-5000)

```
Step 1000: Loss = 1.9234 | BPC = 2.7745 | LR = 3.00e-04
Step 2000: Loss = 1.7123 | BPC = 2.4701 | LR = 3.00e-04
Step 3000: Loss = 1.6234 | BPC = 2.3418 | LR = 3.00e-04
```

- Loss decreases steadily
- Validation runs every 200 steps

### 4. Checkpointing

```
-- VALIDATION: Loss = 1.5650 | BPC = 2.2578
-- Saved best_model.pth
```

- `best_model.pth` → lowest validation loss
- `last_model.pth` → final checkpoint

---

## Monitoring

### Metrics

**Loss:** cross-entropy (lower is better)

```
Loss = -log P(next_byte | context)
```

**BPC (Bits Per Character):** the loss in nats converted to bits

```
BPC = Loss / ln(2)
```

- Random baseline: 8.0 BPC (uniform over 256 byte values)
- Strong character-level models: 1.2-1.5 BPC
- AGIFORMER (5k steps): 2.26 BPC

### Expected Progress

| Steps | BPC | Status |
|-------|-----|--------|
| 0-100 | 4.0-3.5 | Warmup |
| 500 | 3.0-2.8 | Learning syntax |
| 1000 | 2.8-2.6 | Basic patterns |
| 3000 | 2.5-2.3 | Word structure |
| 5000 | 2.3-2.2 | ✅ Proof of concept |
| 20k+ | <2.0 | Production quality |

---

## Troubleshooting

### NaN Loss

**Symptoms:**

```
Step 150: Loss = nan | BPC = nan
```

**Causes:**

1. Learning rate too high
2. Gradient explosion
3.
Numerical instability in attention

**Solutions:**

- ✅ Already fixed in code (stability patches)
- If it persists: lower `BASE_LR` to `1e-4`
- Tighten `GRAD_CLIP` (e.g. `0.5` → `0.25`) to limit exploding gradients

### Out of Memory

**Error:**

```
CUDA out of memory
```

**Solutions:**

- Reduce `BATCH_SIZE` (4 → 2 → 1)
- Reduce `D_MODEL` (512 → 256)
- Reduce `N_LAYERS` (6 → 4)

### Slow Training

**Symptom:** fewer than 100 steps/min.

**Solutions:**

- Use the GPU (not CPU): `DEVICE = 'cuda'`
- Enable mixed precision: `torch.cuda.amp.autocast()`
- Reduce `THINKING_STEPS` (3 → 1)

---

## Advanced: Multi-GPU

For distributed training, initialize the process group, wrap the model in DDP, and launch with `torchrun`:

```python
# In train.py
import os
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # torchrun sets the env vars
local_rank = int(os.environ["LOCAL_RANK"])
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(local_rank), device_ids=[local_rank]
)
```

```bash
# Launch on 4 GPUs
torchrun --nproc_per_node=4 train.py
```

**Expected Speedup:** ~3.5× on 4 GPUs

---

## Resuming Training

To continue from a checkpoint:

```python
# In train.py, after model creation:
if os.path.exists("last_model.pth"):
    model.load_state_dict(torch.load("last_model.pth"))
    print("Resumed from checkpoint")
```

---

## Hyperparameter Tuning

### Learning Rate

- **Too high (>5e-4):** loss spikes, NaN
- **Too low (<1e-5):** slow convergence
- **Sweet spot:** `3e-4` with warmup

### Gradient Clipping

- **Too aggressive (<0.1):** slow learning
- **Too loose (>2.0):** instability
- **Default:** `0.5`

### System 2 Steps

- `0`: baseline (no thinking)
- `1-3`: **recommended** (active reasoning)
- `5+`: diminishing returns (expensive)

---

## Export to Hugging Face

```bash
python upload_to_hf.py --repo YOUR_USERNAME/agiformer --token YOUR_HF_TOKEN
```

Uploads:

- `best_model.pth`
- Source code (`src/`)
- Documentation

---

## Next Steps

After training:

1. **Test Generation:** `python generate.py`
2. **Inspect System 2:** `python inspect_reasoning.py`
3. **Extend Training:** increase `MAX_STEPS` to 20k+
4. **Fine-tune:** change the dataset to your domain
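---

## Appendix: Checking BPC by Hand

The BPC values in the training logs are derived directly from the cross-entropy loss, which PyTorch reports in nats. A minimal sketch of the conversion (`loss_to_bpc` is an illustrative helper, not part of the repo):

```python
import math

def loss_to_bpc(loss_nats: float) -> float:
    """Convert cross-entropy loss (in nats) to bits per character."""
    return loss_nats / math.log(2)

# Validation loss from the example log:
print(round(loss_to_bpc(1.5650), 4))  # ≈ 2.2578, matching the logged BPC
```

This is useful as a sanity check when comparing against published character-level results, which are almost always reported in BPC rather than raw loss.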