Phase 7: Curriculum Learning (20K steps, BPC 1.78)

# AGIFORMER: Byte-Level Language Model with Neuroplasticity

> **Status:** Phase 7 - Curriculum Learning ✅ **Complete**
> **Latest Achievement:** 20K curriculum training with 77% BPC reduction

A research implementation of a byte-level language model featuring:

- 🧠 **Hebbian Memory** with dynamic neuroplasticity
- 📚 **Curriculum Learning** (3-stage developmental approach)
- 🔄 **System 2 Reasoning** (iterative thinking loop)
- 🚀 **Linear Complexity** attention mechanism

## Quick Start

### Installation

```bash
pip install torch datasets tqdm
```

### Training (Curriculum Learning)

```bash
python train_curriculum.py  # 20K steps, 3 curriculum stages
```

### Inference

```bash
python generate.py best_model_curriculum.pth
```

### Testing

```bash
python test_recall.py best_model_curriculum.pth  # Memory test
python inspect_reasoning.py                      # System 2 diagnostics
```

## Architecture

```
Bytes → Encoder (RoPE) → Hebbian Memory → Reasoning Loop → Local RNN → Bytes
        (Patches)        (Dynamic λ)      (3 steps)        (Autoregressive)
```

### Core Components

- **ByteLatentEncoder:** Patches bytes into latent vectors with RoPE
- **HebbianMemory:** Fast weights with learnable decay + neuroplasticity (α)
- **RecurrentReasoningBlock:** 3-step iterative thinking loop (System 2)
- **LocalAutoregressiveHead:** GRU-based byte decoder

See [docs/architecture.md](docs/architecture.md) for technical details.

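Wired together, these four modules form the pipeline in the diagram above. A minimal sketch of that composition follows; the import path and constructor signatures are assumptions for illustration, not the repo's actual API:

```python
import torch.nn as nn
# Hypothetical import path; class names match the components listed above
from agiformer.model import (ByteLatentEncoder, HebbianMemory,
                             RecurrentReasoningBlock, LocalAutoregressiveHead)

class AGIFormer(nn.Module):
    def __init__(self, d_model: int = 512, n_reason_steps: int = 3):
        super().__init__()
        self.encoder = ByteLatentEncoder(d_model)        # bytes → patch latents (RoPE)
        self.memory = HebbianMemory(d_model)             # fast weights, dynamic λ
        self.reasoner = RecurrentReasoningBlock(d_model, n_reason_steps)  # System 2
        self.head = LocalAutoregressiveHead(d_model)     # GRU-based byte decoder

    def forward(self, byte_ids):            # byte_ids: (B, n_bytes) in [0, 255]
        z = self.encoder(byte_ids)          # (B, n_patches, d_model)
        z = self.memory(z)                  # Hebbian read/write over patch latents
        z = self.reasoner(z)                # iterative refinement loop
        return self.head(z, byte_ids)       # per-byte logits over 256 values
```
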
## Features

✅ **No Tokenization** - Universal byte-level processing
✅ **Linear Complexity** - O(N) attention with Hebbian memory (sketched below)
✅ **Neuroplasticity** - Dynamic memory consolidation (α: 0.1 → 0.99)
✅ **Curriculum Learning** - 3-stage developmental training
✅ **Active Reasoning** - Verified thinking loop (Δz = 12.7)
✅ **AMP Compatible** - Mixed precision training with stability fixes

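The Linear Complexity and Neuroplasticity bullets describe the same mechanism: a fast-weight matrix that is decayed and updated with a Hebbian outer product in a single O(N) scan. A minimal sketch, assuming a sigmoid-parameterized decay λ and a scalar plasticity α set by the curriculum (the real module's gating and normalization may differ):

```python
import torch
import torch.nn as nn

class HebbianMemory(nn.Module):
    """Sketch of Hebbian fast weights: one O(N) scan, learnable decay, plasticity α."""

    def __init__(self, d_model: int, alpha: float = 0.1):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.decay_logit = nn.Parameter(torch.tensor(2.0))  # learnable λ via sigmoid
        self.alpha = alpha  # plasticity; the curriculum schedules this 0.1 → 0.99

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        q, k, v = self.q(x), self.k(x), self.v(x)
        lam = torch.sigmoid(self.decay_logit)
        M = x.new_zeros(x.size(0), x.size(-1), x.size(-1))  # fast-weight matrix per batch
        outs = []
        for t in range(x.size(1)):  # single pass over the sequence: linear complexity
            # Hebbian update: decay old associations, write the new key-value outer product
            M = lam * M + self.alpha * torch.einsum('bd,be->bde', k[:, t], v[:, t])
            outs.append(torch.einsum('bd,bde->be', q[:, t], M))  # read with the query
        return torch.stack(outs, dim=1)
```
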
## Curriculum Learning (Phase 7)

### Training Stages

| Stage | Steps | Plasticity (α) | Data | Purpose |
|-------|-------|----------------|------|---------|
| **1. Childhood** | 0-3K | 0.10 | Dictionary | Lexical grounding |
| **2. Youth** | 3K-8K | 0.50 | Stories | Syntactic scaffolding |
| **3. Adulthood** | 8K-20K | 0.99 | Wikipedia | Semantic expansion |

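In code, the schedule reduces to a lookup from step to (α, data source). The boundaries below come straight from the table; the dataset identifiers are hypothetical placeholders for whatever `train_curriculum.py` actually loads:

```python
def curriculum_stage(step: int) -> tuple[float, str]:
    """Return (plasticity α, data source) for a training step; boundaries from the table."""
    if step < 3_000:
        return 0.10, "dictionary"  # Childhood: lexical grounding
    if step < 8_000:
        return 0.50, "stories"     # Youth: syntactic scaffolding
    return 0.99, "wikipedia"       # Adulthood: semantic expansion

alpha, source = curriculum_stage(step=9_000)  # → (0.99, "wikipedia")
```

The trainer would then set the memory module's α and swap data loaders whenever the stage changes.
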
### Results (20K Steps - Turkish Training)

**Metrics:**

- **Final BPC:** 1.85 (↓77% from initialization)
- **Best Val BPC:** 1.78
- **Training Time:** ~50 minutes (CUDA GPU)
- **Stability:** 0 NaN occurrences across 20K steps

**Progress:**

```
Step 0:   BPC = 8.04  (Random initialization)
Step 5K:  BPC = 2.23  (Initial curriculum complete)
Step 10K: BPC = 1.98  (Mid-training)
Step 20K: BPC = 1.85  (Final)
```

**Improvement:** 6.19 BPC (8.04 → 1.85), a 77% reduction.

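For reference, BPC for a byte-level model is just the per-byte cross-entropy converted from nats to bits, assuming the training loss is standard cross-entropy over 256 byte classes:

```python
import math

def bpc_from_loss(ce_loss_nats: float) -> float:
    # per-byte cross-entropy in nats → bits per character (byte)
    return ce_loss_nats / math.log(2)

print(round(bpc_from_loss(1.282), 2))  # 1.85 — a final loss of ~1.28 nats gives the BPC above
```
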
## Critical Fix: AMP Stability

**Problem:** Float16 overflow in the Hebbian memory at low plasticity (α = 0.1)
**Solution:** Force float32 computation for the memory module:

```python
@torch.amp.autocast('cuda', enabled=False)
def forward(self, x):
    input_dtype = x.dtype       # remember the caller's dtype (float16 under AMP)
    x = x.float()               # bypass AMP for numerical stability
    # ... Hebbian computation producing `out` in float32 ...
    return out.to(input_dtype)  # cast back so downstream layers see their usual dtype
```

This fix enables stable 20K+ step training with AMP enabled.

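The rest of the network still runs under autocast; only the memory module opts out. A sketch of the surrounding training step under that assumption (the loader, model, and optimizer names are hypothetical):

```python
import torch

scaler = torch.amp.GradScaler('cuda')

for x, y in loader:                       # hypothetical (input bytes, target bytes) batches
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast('cuda'):      # float16 everywhere except HebbianMemory
        logits = model(x)                 # (B, T, 256)
        loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
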
## Documentation

- [Architecture Guide](docs/architecture.md) - Technical deep dive
- [Training Guide](docs/training.md) - Training from scratch
- [Inference Guide](docs/inference.md) - Generation and sampling
- [API Reference](docs/api.md) - Code documentation
- [RFC 007: Curriculum Learning](docs/RFC_007_Curriculum_Learning.md) - Phase 7 design

## Model Files

- `best_model_curriculum.pth` - Best checkpoint (Val BPC: 1.78)
- `last_model_curriculum.pth` - Final model state (20K steps)
- `metrics_curriculum.json` - Full training metrics

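Loading the best checkpoint for inference might look like the following; whether the `.pth` file holds a raw `state_dict` or a wrapper dict with a `'model'` key depends on how the trainer saved it:

```python
import torch

ckpt = torch.load('best_model_curriculum.pth', map_location='cpu')
state = ckpt.get('model', ckpt)  # unwrap if the trainer saved a {'model': ...} dict
model.load_state_dict(state)
model.eval()
```
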
## Next Steps

### Recommended Improvements

1. **Extended Training:** 30K-50K steps for further convergence
2. **Larger Model:** Increase to d_model=768, n_layers=8 (see the config sketch after this list)
3. **Longer Context:** Extend to a 2048-token window
4. **Fine-tuning:** Domain-specific Turkish datasets

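As a rough sketch, items 1-3 amount to a config change; the baseline values here are assumptions about the Phase 7 defaults, not values taken from the repo:

```python
# Hypothetical configs; only the scaled values come from the list above
baseline = dict(d_model=512, n_layers=6, context_len=1024, train_steps=20_000)
scaled   = dict(d_model=768, n_layers=8, context_len=2048, train_steps=50_000)
```
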
### Research Directions

- Adaptive plasticity scheduling
- Multi-stage curriculum optimization
- Cross-lingual transfer learning
- Sparse Hebbian memory

## Citation

```bibtex
@software{agiformer2025,
  title={AGIFORMER: Byte-Level Language Model with Hebbian Memory and Neuroplasticity},
  author={inkbytefo},
  year={2025},
  note={Phase 7: Curriculum Learning with Dynamic Plasticity},
  url={https://github.com/inkbytefo/agi-former}
}
```

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with PyTorch
- Turkish Wikipedia dataset (trwiki)
- Turkish Dictionary dataset (TDK)
- Inspired by Fast Weights, Linear Transformers, and developmental neuroscience

---

**Developer:** inkbytefo
**Phase:** 7 (Curriculum Learning & Neuroplasticity)
**Status:** Production Ready ✅
**Last Updated:** 2025-11-23