# Training Guide

## Prerequisites
- Python 3.10+
- PyTorch 2.0+ with CUDA
- 6GB+ GPU memory (for `BATCH_SIZE=4`)
- ~200MB disk space (enwik8 dataset)

## Quick Start

### 1. Clone & Install
```bash
git clone https://github.com/inkbytefo/agi-former.git
cd agi-former
pip install -r requirements.txt
```

### 2. Run Training
```bash
python train.py
```

**Expected Output:**
```
Step 10: Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
Step 20: Loss = 2.5123 | BPC = 3.6246 | LR = 6.00e-05
...
Step 5000: Loss = 1.3988 | BPC = 2.0181 | LR = 3.00e-04
-- VALIDATION: Loss = 1.5650 | BPC = 2.2578 --
Saved best_model.pth
```

**Training Time:** ~15 minutes (T4 GPU, 5000 steps)
---

## Configuration
Edit hyperparameters in `train.py`:
```python
# Model
D_MODEL = 512
N_LAYERS = 6
NUM_HEADS = 8
PATCH_SIZE = 4
WINDOW_SIZE = 128
THINKING_STEPS = 3  # System 2 iterations

# Training
BATCH_SIZE = 4
MAX_STEPS = 5000
BASE_LR = 3e-4
WARMUP_STEPS = 100
GRAD_CLIP = 0.5
```
### Hyperparameter Guide

#### Model Size
- **Small:** `D_MODEL=256, N_LAYERS=4` → Fast, lower quality
- **Medium:** `D_MODEL=512, N_LAYERS=6` → **Default** (balanced)
- **Large:** `D_MODEL=768, N_LAYERS=8` → Better BPC, slower

#### System 2
- `THINKING_STEPS=0` → Disable (baseline)
- `THINKING_STEPS=3` → **Default** (active reasoning)
- `THINKING_STEPS=5+` → More refinement, higher compute
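
For instance, running the small model as a pure baseline (no System 2) just means editing the constants shown above:

```python
# "Small" configuration with System 2 disabled
D_MODEL = 256
N_LAYERS = 4
THINKING_STEPS = 0  # skip the refinement iterations entirely
```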
---

## Dataset

### Enwik8
**Source:** First 100MB of English Wikipedia XML
**Size:** 100,000,000 bytes
**Split:**
- Train: 90MB
- Validation: 5MB
- Test: 5MB

**Auto-download:** The dataset downloads automatically on first run to `./data/enwik8`.
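
Assuming the loader slices the raw byte stream at fixed offsets (the actual logic lives in `src/data/real_data.py`), the 90/5/5 split amounts to:

```python
# Hypothetical illustration of the 90/5/5 byte-offset split
data = open("data/enwik8", "rb").read()  # 100,000,000 bytes
train = data[:90_000_000]                # first 90MB
val = data[90_000_000:95_000_000]        # next 5MB
test = data[95_000_000:]                 # final 5MB
```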
### Custom Data
To train on your own data, replace the enwik8 loader with one that yields `(batch, seq_len)` tensors of bytes (0-255):

```python
# In train.py, replace:
from src.data.real_data import get_enwik8_dataloader

# With your custom loader, e.g. this minimal sketch (the path is illustrative):
import torch

def get_custom_dataloader(batch_size, seq_len, path="my_corpus.txt"):
    # Load the corpus as raw bytes (values 0-255)
    data = torch.frombuffer(bytearray(open(path, "rb").read()), dtype=torch.uint8)
    while True:
        # Sample random windows; yield (batch, seq_len) byte tensors
        idx = torch.randint(0, len(data) - seq_len, (batch_size,))
        yield torch.stack([data[i:i + seq_len] for i in idx]).long()
```
---

## Training Process

### 1. Initialization
```
[*] Creating AGIFORMER model...
    - d_model=512, n_layers=6, thinking_steps=3
[*] Parameters: ~50M
[*] Downloading enwik8... (if first run)
```

### 2. Warmup Phase (Steps 0-100)
```
Step 10: Loss = 2.8451 | BPC = 4.1056 | LR = 3.00e-05
```
- Linear LR ramp: `0 → 3e-4` (sketched below)
- High loss is expected here (the model is still random)
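
A minimal sketch of this linear-warmup schedule (the real schedule lives in `train.py`; `lr_at` is an illustrative name):

```python
def lr_at(step, base_lr=3e-4, warmup_steps=100):
    # Linear ramp from 0 to base_lr over the warmup window, then constant
    return base_lr * min(1.0, step / warmup_steps)

print(f"{lr_at(10):.2e}")  # 3.00e-05, matching the log line above
```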
### 3. Learning Phase (Steps 100-5000)
```
Step 1000: Loss = 1.9234 | BPC = 2.7745 | LR = 3.00e-04
Step 2000: Loss = 1.7123 | BPC = 2.4701 | LR = 3.00e-04
Step 3000: Loss = 1.6234 | BPC = 2.3418 | LR = 3.00e-04
```
- Loss decreases steadily
- Validation runs every 200 steps

### 4. Checkpointing
```
-- VALIDATION: Loss = 1.5650 | BPC = 2.2578 --
Saved best_model.pth
```
- `best_model.pth` → Lowest validation loss (saved as sketched below)
- `last_model.pth` → Final checkpoint
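
The underlying logic is presumably a comparison against the best validation loss so far (a sketch; the variable names are hypothetical):

```python
# Keep the best-validation checkpoint, and always refresh the latest one
if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), "best_model.pth")
torch.save(model.state_dict(), "last_model.pth")
```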
---

## Monitoring

### Metrics
**Loss:** Cross-entropy (lower is better)
```
Loss = -log P(next_byte | context)
```

**BPC (Bits Per Character):** the loss converted from nats to bits:
```
BPC = Loss / ln(2)
```
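
Since the cross-entropy is measured in nats, dividing by ln(2) converts it to bits:

```python
import math

def bpc(loss_nats):
    # Cross-entropy in nats divided by ln(2) gives bits per character
    return loss_nats / math.log(2)

print(round(bpc(1.5650), 4))  # 2.2578, matching the validation log above
```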
- Random baseline: 8.0 BPC
- Character-level models: 1.2-1.5 BPC
- AGIFORMER (5k steps): 2.26 BPC
### Expected Progress

| Steps | BPC | Status |
|-------|---------|---------------------|
| 0-100 | 4.0-3.5 | Warmup |
| 500 | 3.0-2.8 | Learning syntax |
| 1000 | 2.8-2.6 | Basic patterns |
| 3000 | 2.5-2.3 | Word structure |
| 5000 | 2.3-2.2 | ✅ Proof of concept |
| 20k+ | <2.0 | Production quality |
---

## Troubleshooting

### NaN Loss
**Symptoms:**
```
Step 150: Loss = nan | BPC = nan
```
**Causes:**
1. Learning rate too high
2. Gradient explosion
3. Numerical instability in attention

**Solutions:**
- ✅ Already fixed in code (stability patches)
- If it persists: lower `BASE_LR` to `1e-4`
- Tighten `GRAD_CLIP` (e.g., `0.5` → `0.25`); see the clipping call below
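
The clipping itself is PyTorch's standard `clip_grad_norm_`; `train.py` presumably applies something like this between `backward()` and `optimizer.step()`:

```python
import torch

# Rescale gradients in place so their global norm never exceeds GRAD_CLIP
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=GRAD_CLIP)
```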
### Out of Memory
**Error:**
```
CUDA out of memory
```
**Solutions:**
- Reduce `BATCH_SIZE` (4 → 2 → 1)
- Reduce `D_MODEL` (512 → 256)
- Reduce `N_LAYERS` (6 → 4)

### Slow Training
If training runs slower than ~100 steps/min:

**Solutions:**
- Use the GPU, not the CPU: `DEVICE = 'cuda'`
- Enable mixed precision with `torch.cuda.amp.autocast()` (sketched below)
- Reduce `THINKING_STEPS` (3 → 1)
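
One way to enable mixed precision is the standard AMP pattern (a sketch; `dataloader`, `optimizer`, and `compute_loss` are stand-ins for whatever `train.py` actually uses):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    # Forward pass runs in mixed precision
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    # Scale the loss so fp16 gradients don't underflow
    scaler.scale(loss).backward()
    # Unscale before clipping so GRAD_CLIP applies to true gradient norms
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    scaler.step(optimizer)
    scaler.update()
```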
---

## Advanced: Multi-GPU
For distributed training, initialize the process group and wrap the model in DDP (a minimal sketch of the standard PyTorch setup):

```python
# In train.py
import os
import torch
import torch.distributed as dist

# One process per GPU; torchrun sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap the model after moving it to this process's GPU
model = model.to(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

Launch with:
```bash
torchrun --nproc_per_node=4 train.py
```

**Expected Speedup:** ~3.5× on 4 GPUs
---

## Resuming Training
To continue from a checkpoint:
```python
# In train.py, after model creation:
import os
import torch

if os.path.exists("last_model.pth"):
    model.load_state_dict(torch.load("last_model.pth", map_location=DEVICE))
    print("Resumed from checkpoint")
```
---

## Hyperparameter Tuning

### Learning Rate
- **Too High (>5e-4):** Loss spikes, NaN
- **Too Low (<1e-5):** Slow convergence
- **Sweet Spot:** `3e-4` with warmup

### Gradient Clipping
- **Too Aggressive (<0.1):** Slow learning
- **Too Loose (>2.0):** Instability
- **Default:** `0.5`

### System 2 Steps
- `0`: Baseline (no thinking)
- `1-3`: **Recommended** (active reasoning)
- `5+`: Diminishing returns (expensive)

---

## Export to Hugging Face
```bash
python upload_to_hf.py --repo YOUR_USERNAME/agiformer --token YOUR_HF_TOKEN
```
Uploads:
- `best_model.pth`
- Source code (`src/`)
- Documentation

---

## Next Steps
After training:
1. **Test Generation:** `python generate.py`
2. **Inspect System 2:** `python inspect_reasoning.py`
3. **Extend Training:** Increase `MAX_STEPS` to 20k+
4. **Fine-tune:** Swap the dataset for your own domain data