# Training Guide
## Prerequisites
- Python 3.10+
- PyTorch 2.0+ with CUDA
- 6GB+ GPU memory (for batch_size=4)
- ~200MB disk space (enwik8 dataset)
## Quick Start
### 1. Clone & Install
```bash
git clone https://github.com/inkbytefo/agi-former.git
cd agi-former
pip install -r requirements.txt
```
### 2. Run Training
```bash
python train.py
```
**Expected Output:**
```
Step 10: Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
Step 20: Loss = 2.5123 | BPC = 3.6246 | LR = 6.00e-05
...
Step 5000: Loss = 1.3988 | BPC = 2.0181 | LR = 3.00e-04
-- VALIDATION: Loss = 1.5650 | BPC = 2.2578 --
Saved best_model.pth
```
**Training Time:** ~15 minutes (T4 GPU, 5000 steps)
---
## Configuration
Edit hyperparameters in `train.py`:
```python
# Model
D_MODEL = 512
N_LAYERS = 6
NUM_HEADS = 8
PATCH_SIZE = 4
WINDOW_SIZE = 128
THINKING_STEPS = 3 # System 2 iterations
# Training
BATCH_SIZE = 4
MAX_STEPS = 5000
BASE_LR = 3e-4
WARMUP_STEPS = 100
GRAD_CLIP = 0.5
```
### Hyperparameter Guide
#### Model Size
- **Small:** `d_model=256, n_layers=4` → Fast, lower quality
- **Medium:** `d_model=512, n_layers=6` → **Default** (balanced)
- **Large:** `d_model=768, n_layers=8` → Better BPC, slower
#### System 2
- `thinking_steps=0` → Disable (baseline)
- `thinking_steps=3` → **Default** (active reasoning)
- `thinking_steps=5+` → More refinement, higher compute
---
## Dataset
### Enwik8
**Source:** First 100MB of English Wikipedia XML
**Size:** 100,000,000 bytes
**Split:**
- Train: 90MB
- Validation: 5MB
- Test: 5MB
**Auto-download:** Dataset downloads automatically on first run to `./data/enwik8`.
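The 90/5/5 split above maps to fixed byte offsets in the 100,000,000-byte file. The sketch below assumes the splits are contiguous and ordered train, validation, test, which is the common enwik8 convention; the loader in this repository may slice differently, and `split_offsets` is a hypothetical helper, not part of the codebase.

```python
def split_offsets(total_bytes=100_000_000, val_bytes=5_000_000, test_bytes=5_000_000):
    """Return (start, end) byte ranges for contiguous train/val/test splits."""
    train_end = total_bytes - val_bytes - test_bytes
    val_end = train_end + val_bytes
    return {
        "train": (0, train_end),
        "val": (train_end, val_end),
        "test": (val_end, total_bytes),
    }

offsets = split_offsets()
for name, (start, end) in offsets.items():
    # End offsets are exclusive, matching Python slice semantics
    print(f"{name}: bytes [{start:,}, {end:,}) = {(end - start) / 1e6:.0f} MB")
```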
### Custom Data
To train on your own data:
```python
# In train.py, replace:
from src.data.real_data import get_enwik8_dataloader

# With your custom loader:
def get_custom_dataloader(batch_size, seq_len):
    # Your implementation.
    # Must return tensors of shape (batch, seq_len) holding byte values (0-255).
    raise NotImplementedError
```
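One possible way to fill in that stub is sketched below. It assumes the corpus fits in memory and that random fixed-length windows are an acceptable sampling scheme; the extra `data` and `seed` parameters are illustrative additions, not part of the signature `train.py` expects, so adapt them to whatever the training loop actually calls.

```python
import torch

def get_custom_dataloader(data: bytes, batch_size: int, seq_len: int, seed: int = 0):
    """Yield an endless stream of (batch_size, seq_len) uint8 batches from `data`.

    `data` is the raw corpus, e.g. open("my_corpus.txt", "rb").read().
    """
    if len(data) < seq_len:
        raise ValueError("corpus is shorter than seq_len")
    # bytearray gives frombuffer a writable buffer, which avoids a warning
    corpus = torch.frombuffer(bytearray(data), dtype=torch.uint8)
    gen = torch.Generator().manual_seed(seed)
    max_start = len(corpus) - seq_len
    while True:
        # Sample one random window start per batch row
        starts = torch.randint(0, max_start + 1, (batch_size,), generator=gen)
        yield torch.stack([corpus[s : s + seq_len] for s in starts.tolist()])

loader = get_custom_dataloader(b"hello world, " * 100, batch_size=4, seq_len=32)
batch = next(loader)
print(batch.shape, batch.dtype)
```

If the model's embedding layer expects integer indices of type `int64`, call `batch.long()` before the forward pass.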
---
## Training Process
### 1. Initialization
```
[*] Creating AGIFORMER model...
- d_model=512, n_layers=6, thinking_steps=3
[*] Parameters: ~50M
[*] Downloading enwik8... (if first run)
```
### 2. Warmup Phase (Steps 0-100)
```
Step 10: Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
```
- Linear LR ramp: `0 → 3e-4`
- High loss is expected (the model starts from random weights)
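The ramp can be written as a small function. This is a sketch that assumes the rate stays constant after warmup, which is inferred from the logged `LR = 3.00e-04` at later steps; the scheduler in `train.py` may be implemented differently, and `lr_at` is a hypothetical name.

```python
BASE_LR = 3e-4
WARMUP_STEPS = 100

def lr_at(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup from 0 to base_lr, then constant (assumed)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

for step in (10, 20, 100, 5000):
    print(f"Step {step}: LR = {lr_at(step):.2e}")
```

With the defaults, steps 10 and 20 give `3.00e-05` and `6.00e-05`, matching the sample log lines.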
### 3. Learning Phase (Steps 100-5000)
```
Step 1000: Loss = 1.9234 | BPC = 2.7745 | LR = 3.00e-04
Step 2000: Loss = 1.7123 | BPC = 2.4701 | LR = 3.00e-04
Step 3000: Loss = 1.6234 | BPC = 2.3418 | LR = 3.00e-04
```
- Loss decreases steadily
- Validation every 200 steps
### 4. Checkpointing
```
-- VALIDATION: Loss = 1.5650 | BPC = 2.2578 --
Saved best_model.pth
```
- `best_model.pth` → Lowest validation loss
- `last_model.pth` → Final checkpoint
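The two-file policy can be sketched independently of the model. The helper below is hypothetical and the loss values are made up for illustration; in a real script `save` would be something like `lambda name: torch.save(model.state_dict(), name)`.

```python
def run_checkpoint_policy(val_losses, save):
    """Call save(filename) following a best/last checkpoint policy.

    val_losses: validation losses in the order they were measured.
    save: callable that writes the current weights to the given filename.
    """
    best = float("inf")
    for loss in val_losses:
        if loss < best:  # strictly better than anything seen so far
            best = loss
            save("best_model.pth")
    save("last_model.pth")  # written once training ends
    return best

saved = []
best = run_checkpoint_policy([1.90, 1.71, 1.75, 1.565], saved.append)
print(best, saved)
```

Here the third validation (1.75) does not overwrite `best_model.pth` because it is worse than the earlier 1.71.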
---
## Monitoring
### Metrics
**Loss:** Cross-entropy (lower is better)
```
Loss = -log P(next_byte | context)
```
**BPC (Bits Per Character):**
```
BPC = Loss / ln(2)
```
- Random baseline: 8.0 BPC (uniform guess over 256 byte values, since log2(256) = 8)
- Character-level models: 1.2-1.5 BPC
- AGIFORMER (5k steps): 2.26 BPC
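The conversion is a one-liner, shown here as a sketch that assumes the logged loss is cross-entropy in nats (the natural-log convention used by PyTorch's `cross_entropy`). `loss_to_bpc` is a hypothetical helper name.

```python
import math

def loss_to_bpc(loss_nats):
    """Convert cross-entropy in nats to bits per character."""
    return loss_nats / math.log(2)

# Validation loss from the sample log
print(f"{loss_to_bpc(1.5650):.4f}")
# Uniform guess over 256 byte values
print(f"{loss_to_bpc(math.log(256)):.1f}")
```

The first line reproduces the 2.2578 BPC reported in the validation log, and the second reproduces the 8.0 BPC random baseline.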
### Expected Progress
| Steps | BPC | Status |
|-------|-----|--------|
| 0-100 | 4.0-3.5 | Warmup |
| 500 | 3.0-2.8 | Learning syntax |
| 1000 | 2.8-2.6 | Basic patterns |
| 3000 | 2.5-2.3 | Word structure |
| 5000 | 2.3-2.2 | ✅ Proof of concept |
| 20k+ | <2.0 | Production quality |
---
## Troubleshooting
### NaN Loss
**Symptoms:**
```
Step 150: Loss = nan | BPC = nan
```
**Causes:**
1. Learning rate too high
2. Gradient explosion
3. Numerical instability in attention
**Solutions:**
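- ✅ The current code already includes stability patches, so the defaults should not produce NaN
- If NaN persists, lower `BASE_LR` to `1e-4`
- Tighten gradient clipping by lowering `GRAD_CLIP` below the default `0.5` (for example `0.25`); raising it loosens clipping and makes gradient explosion more likely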
### Out of Memory
**Error:**
```
CUDA out of memory
```
**Solutions:**
- Reduce `BATCH_SIZE` (4 → 2 → 1)
- Reduce `d_model` (512 → 256)
- Reduce `n_layers` (6 → 4)
### Slow Training
**Symptom:** Throughput below 100 steps/min (the T4 reference run above averages roughly 330 steps/min)
**Solutions:**
- Use GPU (not CPU): `DEVICE = 'cuda'`
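- Enable mixed precision by wrapping the forward pass in `torch.autocast(device_type='cuda')` (the older `torch.cuda.amp.autocast()` spelling is deprecated in recent PyTorch releases)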
- Reduce `thinking_steps` (3 → 1)
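To check which side of the threshold a run is on, throughput can be computed from a step count and elapsed wall-clock time. The helper below is a hypothetical sketch; the reference figure comes from this guide's own "~15 minutes for 5000 steps on a T4" estimate.

```python
def steps_per_minute(steps, elapsed_seconds):
    """Average training throughput over a measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return steps * 60.0 / elapsed_seconds

# Reference run from this guide: 5000 steps in about 15 minutes
reference = steps_per_minute(5000, 15 * 60)
print(f"{reference:.0f} steps/min")
```

In a training loop, record `time.perf_counter()` at two logged steps and pass the differences to this function.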
---
## Advanced: Multi-GPU
For distributed training:
```python
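# In train.py (assumes `model` has already been created)
import os
import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK for each process it spawns
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = torch.nn.parallel.DistributedDataParallel(model.to(local_rank), device_ids=[local_rank])

# Launch from a shell, not from Python:
#   torchrun --nproc_per_node=4 train.py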
```
**Expected Speedup:** ~3.5× on 4 GPUs
---
## Resuming Training
To continue from checkpoint:
```python
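# In train.py, after model creation:
import os
import torch

if os.path.exists("last_model.pth"):
    # Restores model weights only; optimizer state and step count start fresh
    model.load_state_dict(torch.load("last_model.pth", map_location="cpu"))
    print("Resumed from checkpoint")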
```
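Loading only the model weights means the optimizer state and step counter restart from zero, so warmup runs again. If exact resumption matters, one option is to save all three together. This is a sketch under assumptions: the dict layout, the helper names, and the `last_checkpoint.pth` filename are hypothetical and differ from the plain state dict that `last_model.pth` appears to hold.

```python
import os
import tempfile
import torch

def save_checkpoint(path, model, optimizer, step):
    """Bundle weights, optimizer state, and step counter into one file."""
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )

def load_checkpoint(path, model, optimizer, device="cpu"):
    """Restore model and optimizer in place; return the saved step."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]

# Tiny stand-in model so the example runs without the AGIFORMER code
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
path = os.path.join(tempfile.mkdtemp(), "last_checkpoint.pth")

save_checkpoint(path, model, optimizer, step=1200)
resumed_step = load_checkpoint(path, model, optimizer)
print(f"Resumed from step {resumed_step}")
```

The training loop would then start counting from `resumed_step` so the learning-rate schedule continues where it stopped.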
---
## Hyperparameter Tuning
### Learning Rate
- **Too High (>5e-4):** Loss spikes, NaN
- **Too Low (<1e-5):** Slow convergence
- **Sweet Spot:** `3e-4` with warmup
### Gradient Clipping
- **Too Aggressive (<0.1):** Slow learning
- **Too Loose (>2.0):** Instability
- **Default:** `0.5`
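For reference, a sketch of what a clip value of `0.5` does, assuming `GRAD_CLIP` is used as the maximum global gradient norm via `torch.nn.utils.clip_grad_norm_`. This guide does not state whether `train.py` clips by norm or by value, so treat this as an illustration on a stand-in model.

```python
import torch

GRAD_CLIP = 0.5

model = torch.nn.Linear(16, 16)
# Scale the output to produce deliberately large gradients
loss = (model(torch.randn(4, 16)) * 100).pow(2).mean()
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=GRAD_CLIP)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

In a training loop the call sits between `loss.backward()` and `optimizer.step()`.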
### System 2 Steps
- `0`: Baseline (no thinking)
- `1-3`: **Recommended** (active reasoning)
- `5+`: Diminishing returns (expensive)
---
## Export to Hugging Face
```bash
python upload_to_hf.py --repo YOUR_USERNAME/agiformer --token YOUR_HF_TOKEN
```
Uploads:
- `best_model.pth`
- Source code (`src/`)
- Documentation
---
## Next Steps
After training:
1. **Test Generation:** `python generate.py`
2. **Inspect System 2:** `python inspect_reasoning.py`
3. **Extend Training:** Increase `MAX_STEPS` to 20k+
4. **Fine-tune:** Change dataset to your domain