---
library_name: pytorch
tags:
- transformer
- language-model
- long-context
- agillm
- experimental
---

# AGILLM-4

AGILLM-4 is the next training target after AGILLM-3. The current code is a
production-oriented starting point, copied from the proven single-file trainer
and extended with:

- ~1.5B parameter main preset (`agillm4_main`)
- 100 tokens per parameter target ratio (≈150B training tokens for the ~1.5B preset)
- longer block-size work on 24GB, B200, and B300 class GPUs
- AR+SAT every step with sequential backward to reduce peak VRAM (sketched below)
- SDPA and experimental sublinear local+landmark attention backends (sketched below)
- exact M-fold expansion attention harvested from n1.py, with a local verifier
- fused QKV projection harvested from n1.py, with legacy checkpoint loading (sketched below)
- profiling tools for memory, throughput, AR cost, SAT cost, and optimizer cost
- synthetic long-context curriculum generation for recall and multi-hop tests (sketched below)
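
Running the AR and SAT backward passes sequentially keeps only one objective's
activation graph alive at a time, instead of holding both for a combined
`(loss_ar + loss_sat).backward()`. A minimal sketch of the pattern, assuming a
model that takes an objective switch (the `objective` kwarg is illustrative,
not the repo's actual API):

```python
import torch

def train_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)

    # Forward + backward the AR objective first; backward() frees its
    # activation graph before the SAT forward allocates a new one.
    loss_ar = model(batch, objective="ar")
    loss_ar.backward()

    # SAT gradients accumulate into the same .grad buffers, so this is
    # numerically the same update as backwarding the summed loss.
    loss_sat = model(batch, objective="sat")
    loss_sat.backward()

    optimizer.step()
    return loss_ar.item(), loss_sat.item()
```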
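
The sublinear backend pairs a sliding local window with coarse "landmark"
summaries of distant key/value blocks, so each query attends to `window` nearby
positions plus one landmark per fully-past block rather than the full prefix. A
slow per-token reference, useful as a correctness oracle (single-head shapes
and all names are assumptions; a real backend batches this):

```python
import torch

def local_landmark_attention(q, k, v, window=256, block=64):
    """Reference version for one head: q, k, v have shape (T, D)."""
    T, D = q.shape
    nb = T // block
    # One mean-pooled landmark key/value per block.
    k_land = k[: nb * block].view(nb, block, D).mean(dim=1)
    v_land = v[: nb * block].view(nb, block, D).mean(dim=1)
    out = torch.empty_like(q)
    for t in range(T):
        lo = max(0, t - window + 1)
        n_past = lo // block  # landmark blocks fully left of the local window
        k_ctx = torch.cat([k_land[:n_past], k[lo : t + 1]], dim=0)
        v_ctx = torch.cat([v_land[:n_past], v[lo : t + 1]], dim=0)
        att = torch.softmax(q[t] @ k_ctx.T / D ** 0.5, dim=-1)
        out[t] = att @ v_ctx
    return out
```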
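
The fused QKV projection collapses three matmuls into one wider one; loading a
legacy checkpoint then reduces to concatenating the old per-projection weights
along the output dimension, in the same order `forward` splits them. A sketch
with assumed parameter names (`q_proj`/`k_proj`/`v_proj` are illustrative, not
necessarily the legacy key names):

```python
import torch
import torch.nn as nn

class FusedQKV(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # One launch produces Q, K, and V together.
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)

    def forward(self, x):
        return self.qkv(x).chunk(3, dim=-1)  # -> q, k, v

def load_legacy_qkv(module: FusedQKV, state_dict: dict, prefix: str):
    """Map separate legacy q/k/v weights onto the fused layer."""
    w = torch.cat(
        [state_dict[f"{prefix}.{p}.weight"] for p in ("q_proj", "k_proj", "v_proj")],
        dim=0,  # nn.Linear weight is (out_features, in_features)
    )
    with torch.no_grad():
        module.qkv.weight.copy_(w)
```

The concatenation order must match the order `forward` unstacks, which is why
both sides go q, then k, then v.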
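
For the recall half of the synthetic curriculum, an example can be as simple as
random filler with one key/value pair planted in the middle and the key
repeated at the end; the model is scored on producing the value at the final
position. A hypothetical generator (the token layout is an assumption, and
`key`/`value` should be reserved ids outside the filler range):

```python
import torch

def make_recall_example(seq_len: int, filler_vocab: int, key: int, value: int,
                        generator: torch.Generator | None = None):
    g = generator or torch.Generator().manual_seed(0)
    tokens = torch.randint(0, filler_vocab, (seq_len,), generator=g)
    # Plant the needle somewhere before the final query.
    pos = int(torch.randint(1, seq_len - 3, (1,), generator=g))
    tokens[pos], tokens[pos + 1] = key, value
    # Re-ask the key at the end; the target is `value` at position seq_len - 1.
    tokens[-2], tokens[-1] = key, value
    return tokens
```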

Start with [AGILLM-4.md](AGILLM-4.md) for the training plan and command
recipes. The current sublinear backend is intentionally experimental: profile it
against SDPA before using it for a real run.
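
One way to run that comparison is to time a fixed training step and read back
peak allocator stats under each backend; a minimal CUDA harness sketch (how the
backend is switched is repo-specific and not shown here):

```python
import time
import torch

def bench(step_fn, warmup: int = 3, iters: int = 10):
    """Return (seconds per step, peak GiB) for a zero-arg training step."""
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    t0 = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()
    secs = (time.perf_counter() - t0) / iters
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return secs, peak_gib
```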

Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).