	# Training Guide

	## Prerequisites

	- Python 3.10+
	- PyTorch 2.0+ with CUDA
	- 6GB+ GPU memory (for batch_size=4)
	- ~200MB disk space (enwik8 dataset)

	## Quick Start

	### 1. Clone & Install
	```bash
	git clone https://github.com/inkbytefo/agi-former.git
	cd agi-former
	pip install -r requirements.txt
	```

	### 2. Run Training
	```bash
	python train.py
	```

	Expected Output:
	```
	Step 10: Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
	Step 20: Loss = 2.5123 | BPC = 3.6246 | LR = 6.00e-05
	...
	Step 5000: Loss = 1.3988 | BPC = 2.0181 | LR = 3.00e-04
	-- VALIDATION: Loss = 1.5650 | BPC = 2.2578 --
	Saved best_model.pth
	```

	Training Time: ~15 minutes (T4 GPU, 5000 steps)

	---

	## Configuration

	Edit hyperparameters in `train.py`:

	```python
	# Model
	D_MODEL = 512
	N_LAYERS = 6
	NUM_HEADS = 8
	PATCH_SIZE = 4
	WINDOW_SIZE = 128
	THINKING_STEPS = 3  # System 2 iterations

	# Training
	BATCH_SIZE = 4
	MAX_STEPS = 5000
	BASE_LR = 3e-4
	WARMUP_STEPS = 100
	GRAD_CLIP = 0.5
	```

	### Hyperparameter Guide

	#### Model Size
	- Small: `d_model=256, n_layers=4` → Fast, lower quality
	- Medium: `d_model=512, n_layers=6` → Default (balanced)
	- Large: `d_model=768, n_layers=8` → Better BPC, slower

	#### System 2
	- `thinking_steps=0` → Disable (baseline)
	- `thinking_steps=3` → Default (active reasoning)
	- `thinking_steps=5+` → More refinement, higher compute

	---

	## Dataset

	### Enwik8
	- Source: first 100 MB of an English Wikipedia XML dump
	- Size: 100,000,000 bytes
	- Split:
	  - Train: 90 MB
	  - Validation: 5 MB
	  - Test: 5 MB

	Auto-download: the dataset is downloaded to `./data/enwik8` on the first run.
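
The 90/5/5 split above amounts to slicing the byte stream at fixed offsets. A minimal sketch, assuming the three parts are contiguous and taken in order (train, then validation, then test); the helper name `split_bytes` is hypothetical and is not taken from `src/data/real_data.py`:

```python
def split_bytes(data: bytes, train_pct: int = 90, val_pct: int = 5):
    """Split a byte string into contiguous train/validation/test parts.

    Integer percentages avoid floating-point rounding on the boundaries.
    """
    n = len(data)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]  # whatever remains after train and validation
    return train, val, test


# Small stand-in for the 100,000,000-byte file so the example runs instantly.
sample = bytes(range(256)) * 4
train, val, test = split_bytes(sample)
```

On the full file these offsets give 90,000,000 / 5,000,000 / 5,000,000 bytes.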

	### Custom Data

	To train on your own data:

	```python
	# In train.py, replace this import:
	from src.data.real_data import get_enwik8_dataloader

	# with your own loader:
	def get_custom_dataloader(batch_size, seq_len):
	    # Your implementation.
	    # Must return (batch, seq_len) tensors of bytes (0-255).
	    raise NotImplementedError
	```
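
One way to fill in that stub is to cut a raw file into fixed-length byte windows. This is only a sketch under stated assumptions: the class `ByteChunkDataset`, the `path` parameter, and the file name `my_corpus.txt` are hypothetical, and the source does not say whether `train.py` expects a `DataLoader` or derives targets separately, so adapt the return value to what the training loop consumes.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ByteChunkDataset(Dataset):
    """Non-overlapping seq_len-byte windows over a raw byte string (hypothetical helper)."""

    def __init__(self, data: bytes, seq_len: int):
        self.seq_len = seq_len
        # bytearray gives torch.frombuffer a writable buffer to wrap
        self.data = torch.frombuffer(bytearray(data), dtype=torch.uint8)
        self.n_chunks = len(self.data) // seq_len  # a trailing partial window is dropped

    def __len__(self):
        return self.n_chunks

    def __getitem__(self, idx):
        start = idx * self.seq_len
        # Cast to int64 so the byte values can index an embedding table
        return self.data[start:start + self.seq_len].long()


def get_custom_dataloader(batch_size, seq_len, path="my_corpus.txt"):
    """Yield (batch_size, seq_len) tensors of byte values in 0-255."""
    with open(path, "rb") as f:
        raw = f.read()
    return DataLoader(ByteChunkDataset(raw, seq_len), batch_size=batch_size,
                      shuffle=True, drop_last=True)
```

Reading the file in binary mode means any encoding (UTF-8 Turkish text, source code, logs) is handled the same way, since the model only ever sees bytes.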

	---

	## Training Process

	### 1. Initialization
	```
	[*] Creating AGIFORMER model...
	- d_model=512, n_layers=6, thinking_steps=3
	[*] Parameters: ~50M
	[*] Downloading enwik8... (if first run)
	```

	### 2. Warmup Phase (Steps 0-100)
	```
	Step 10: Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
	```
	- Linear LR ramp: `0 → 3e-4`
	- High loss is expected (the weights are still close to their random initialization)
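
The logged learning rates (3.00e-05 at step 10, 6.00e-05 at step 20, 3.00e-04 from step 100 on) are consistent with a linear warmup followed by a constant rate. A minimal sketch of that schedule; the function name `lr_at` is hypothetical, and the constant phase is inferred from the logs rather than from `train.py` itself:

```python
BASE_LR = 3e-4
WARMUP_STEPS = 100


def lr_at(step: int, base_lr: float = BASE_LR, warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear ramp from 0 to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


for step in (10, 20, 100, 5000):
    print(f"Step {step}: LR = {lr_at(step):.2e}")
```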

	### 3. Learning Phase (Steps 100-5000)
	```
	Step 1000: Loss = 1.9234 | BPC = 2.7745 | LR = 3.00e-04
	Step 2000: Loss = 1.7123 | BPC = 2.4701 | LR = 3.00e-04
	Step 3000: Loss = 1.6234 | BPC = 2.3418 | LR = 3.00e-04
	```
	- Loss decreases steadily
	- Validation every 200 steps

	### 4. Checkpointing
	```
	-- VALIDATION: Loss = 1.5650 | BPC = 2.2578 --
	Saved best_model.pth
	```
	- `best_model.pth` → Lowest validation loss
	- `last_model.pth` → Final checkpoint
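
The "lowest validation loss" rule can be sketched as a small tracker that is consulted after every validation pass. The class name `BestLossTracker` is hypothetical, and this is an assumption about the logic rather than the code in `train.py`:

```python
import math


class BestLossTracker:
    """Remembers the lowest validation loss seen so far."""

    def __init__(self):
        self.best = math.inf

    def is_new_best(self, val_loss: float) -> bool:
        # NaN compares False against everything, so a diverged run
        # never overwrites a good checkpoint.
        if val_loss < self.best:
            self.best = val_loss
            return True
        return False


tracker = BestLossTracker()
# In the training loop (model and torch come from train.py):
#     if tracker.is_new_best(val_loss):
#         torch.save(model.state_dict(), "best_model.pth")
```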

	---

	## Monitoring

	### Metrics

	Loss: Cross-entropy (lower is better)
	```
	Loss = -log P(next_byte | context)
	```

	BPC (Bits Per Character):
	```
	BPC = Loss / ln(2)
	```
	- Random baseline: 8.0 BPC
	- Character-level models: 1.2-1.5 BPC
	- AGIFORMER (5k steps): 2.26 BPC
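
The conversion is a change of logarithm base from nats to bits. A short sketch, with the hypothetical helper `loss_to_bpc`, that also shows where the 8.0 BPC random baseline comes from: a uniform guess over 256 byte values has cross-entropy ln(256) nats.

```python
import math


def loss_to_bpc(loss_nats: float) -> float:
    """Convert cross-entropy in nats (natural log) to bits per character."""
    return loss_nats / math.log(2)


uniform_loss = math.log(256)           # uniform distribution over 256 byte values
baseline_bpc = loss_to_bpc(uniform_loss)  # approximately 8.0
val_bpc = loss_to_bpc(1.5650)          # the validation loss from the log above
print(f"baseline = {baseline_bpc:.1f} BPC, validation = {val_bpc:.4f} BPC")
```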

	### Expected Progress

	| Steps | BPC | Status |
	|-------|-----|--------|
	| 0-100 | 4.0-3.5 | Warmup |
	| 500 | 3.0-2.8 | Learning syntax |
	| 1000 | 2.8-2.6 | Basic patterns |
	| 3000 | 2.5-2.3 | Word structure |
	| 5000 | 2.3-2.2 | ✅ Proof of concept |
	| 20k+ | <2.0 | Production quality |

	---

	## Troubleshooting

	### NaN Loss
	Symptoms:
	```
	Step 150: Loss = nan | BPC = nan
	```

	Causes:
	1. Learning rate too high
	2. Gradient explosion
	3. Numerical instability in attention

	Solutions:
	- ✅ Stability patches are already applied in the code
	- If NaNs persist, lower `BASE_LR` to `1e-4`
	- Tighten gradient clipping by lowering `GRAD_CLIP` (for example `0.5` → `0.25`); raising it loosens clipping
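
To catch the symptom at the first bad step instead of letting the run continue on NaNs, a guard like the following can be called right after the loss is computed. This is a sketch; `check_finite` is a hypothetical helper, not something `train.py` is known to contain.

```python
import math


def check_finite(step: int, loss: float) -> None:
    """Stop with a clear message instead of training on a NaN or inf loss."""
    if not math.isfinite(loss):
        raise FloatingPointError(
            f"Non-finite loss ({loss}) at step {step}; "
            "see the NaN Loss troubleshooting steps"
        )


# In the training loop, assuming `loss` is a scalar tensor:
#     check_finite(step, loss.item())
check_finite(10, 2.8451)  # a finite loss passes silently
```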

	### Out of Memory
	Error:
	```
	CUDA out of memory
	```

	Solutions:
	- Reduce `BATCH_SIZE` (4 → 2 → 1)
	- Reduce `d_model` (512 → 256)
	- Reduce `n_layers` (6 → 4)

	### Slow Training
	Symptom: fewer than 100 steps/min.

	Solutions:
	- Use GPU (not CPU): `DEVICE = 'cuda'`
	- Enable mixed precision: `torch.cuda.amp.autocast()`
	- Reduce `thinking_steps` (3 → 1)
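
To check a run against the 100 steps/min threshold before changing anything, the rate can be measured directly. A minimal sketch with a hypothetical `ThroughputMeter` class; the clock is injectable only so the meter can be tested without waiting:

```python
import time


class ThroughputMeter:
    """Counts training steps and reports the average rate in steps per minute."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self._steps = 0

    def tick(self) -> None:
        """Call once per completed training step."""
        self._steps += 1

    def steps_per_minute(self) -> float:
        elapsed = self._clock() - self._start
        return 60.0 * self._steps / elapsed if elapsed > 0 else 0.0


meter = ThroughputMeter()
# In the training loop: call meter.tick() after each optimizer step and
# log meter.steps_per_minute() alongside the loss.
```

For reference, the quoted 5000 steps in about 15 minutes corresponds to roughly 330 steps/min.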

	---

	## Advanced: Multi-GPU

	For distributed training:

	```python
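	# In train.py
	import os
	import torch
	import torch.distributed as dist

	# torchrun sets LOCAL_RANK for each worker process
	dist.init_process_group(backend="nccl")
	local_rank = int(os.environ["LOCAL_RANK"])
	torch.cuda.set_device(local_rank)

	# Wrap model (after moving it to this process's GPU)
	model = model.to(local_rank)
	model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

	# Launch from a shell: torchrun --nproc_per_node=4 train.py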
	```

	Expected Speedup: ~3.5× on 4 GPUs

	---

	## Resuming Training

	To continue from checkpoint:

	```python
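	# In train.py, after model creation:
	import os
	import torch

	if os.path.exists("last_model.pth"):
	    # map_location lets the checkpoint load even if it was saved on another device
	    model.load_state_dict(torch.load("last_model.pth", map_location="cpu"))
	    print("Resumed from checkpoint")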
	```
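
The snippet above restores only the model weights, so the optimizer state and the step counter start over. If an exact continuation matters, one option is to save all three together. This is a sketch under stated assumptions: the dictionary keys, the function names, and the file name `resume_state.pth` are hypothetical and do not describe the format `train.py` writes.

```python
import os
import torch


def save_training_state(path, model, optimizer, step):
    """Write weights, optimizer state, and the step counter to one file."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)


def load_training_state(path, model, optimizer):
    """Restore weights, optimizer state, and step; returns 0 if no file exists."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]


# In train.py (hypothetical usage):
#     start_step = load_training_state("resume_state.pth", model, optimizer)
```

If training resumes inside the warmup window, the learning-rate schedule should be driven by the restored step rather than restarted from zero.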

	---

	## Hyperparameter Tuning

	### Learning Rate
	- Too High (>5e-4): Loss spikes, NaN
	- Too Low (<1e-5): Slow convergence
	- Sweet Spot: `3e-4` with warmup

	### Gradient Clipping
	- Too Aggressive (<0.1): Slow learning
	- Too Loose (>2.0): Instability
	- Default: `0.5`
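
To make the effect of the threshold concrete, here is the arithmetic of global-norm clipping in plain Python. It assumes `GRAD_CLIP` is a maximum global gradient norm, as in `torch.nn.utils.clip_grad_norm_`; the source does not say whether `train.py` clips by norm or by value, and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The small epsilon guards against division by zero;
    # gradients that are already small enough are left unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.5)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

Here a gradient of norm 5.0 is scaled down to a norm of about 0.5 with its direction preserved, which is why a very small threshold slows learning: every large update is shrunk to the same length.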

	### System 2 Steps
	- `0`: Baseline (no thinking)
	- `1-3`: Recommended (active reasoning)
	- `5+`: Diminishing returns (expensive)

	---

	## Export to Hugging Face

	```bash
	python upload_to_hf.py --repo YOUR_USERNAME/agiformer --token YOUR_HF_TOKEN
	```

	Uploads:
	- `best_model.pth`
	- Source code (`src/`)
	- Documentation

	---

	## Next Steps

	After training:
	1. Test Generation: `python generate.py`
	2. Inspect System 2: `python inspect_reasoning.py`
	3. Extend Training: Increase `MAX_STEPS` to 20k+
	4. Fine-tune: Change dataset to your domain