---
library_name: pytorch
tags:
- pytorch
- transformer
- language-model
- long-context
- agillm
- dblock
- single-file
- experimental
---

# AGILLM-4 dblock single-file

This repo packages the live AGILLM-4 dblock trainer as one runnable Python file:

- `agillm4_dblock_single_file.py`

It was regenerated on `2026-05-31T16:07:54Z` by mechanically inlining the live VastAI training sources:

- `fused_ce.py`
- `anchor_memory.py`
- `dblocks_train.py`
- `nB300_agillm4.py`

The original live command uses `nB300_agillm4.py train`. This single-file build keeps that CLI surface, registers in-memory shims for the former helper modules, and disables helper-module smoke tests that would otherwise fire because the packed file is `__main__`.
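
The shim registration can be pictured with the standard library alone. The following is a minimal sketch, not the packed file's actual code: `register_shim`, `_demo_loss`, and the `fused_ce_demo` module name are hypothetical, and the only real APIs relied on are `types.ModuleType` and `sys.modules`.

```python
import sys
import types


def register_shim(name, namespace):
    """Register an in-memory module so `import name` resolves without a file.

    `name` and `namespace` are illustrative; the packed trainer's own
    helper may be structured differently.
    """
    module = types.ModuleType(name)
    # Copy the inlined definitions onto the synthetic module object.
    module.__dict__.update(namespace)
    # The import system checks sys.modules first, so this entry wins.
    sys.modules[name] = module
    return module


def _demo_loss(values):
    # Stand-in for a function that used to live in a helper module.
    return sum(values) / len(values)


register_shim("fused_ce_demo", {"mean_loss": _demo_loss})

# A later import now finds the shim instead of searching the filesystem.
import fused_ce_demo

print(fused_ce_demo.mean_loss([1.0, 2.0, 3.0]))
```

Code that a helper module kept under `if __name__ == "__main__":` would run once inlined into the entry file, which is why a packed build has to disable those blocks.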

See `single_file_manifest.json` for source hashes from the generated build.
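
One way to use such a manifest is to re-hash local copies of the sources and compare. The sketch below assumes a manifest layout that this README does not document: a flat JSON object mapping each filename to a SHA-256 hex digest. If the real `single_file_manifest.json` is shaped differently, the parsing step needs adjusting. `sha256_of` and `check_sources` are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sources(manifest_path, source_dir):
    """Compare local files against a manifest.

    ASSUMPTION: the manifest is a flat JSON object of
    {"filename": "sha256 hex digest"}. The real layout may differ.
    Returns {filename: "ok" | "mismatch" | "missing"}.
    """
    expected = json.loads(Path(manifest_path).read_text())
    report = {}
    for name, want in expected.items():
        path = Path(source_dir) / name
        if not path.is_file():
            report[name] = "missing"
        elif sha256_of(path) == want:
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report
```

Under that assumption, `check_sources("single_file_manifest.json", ".")` would return one status per listed source file.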

Example training shape:

```bash
python agillm4_dblock_single_file.py train --preset agillm4_floor --dblock ...
```

This is experimental training code, not a polished inference package.

## Inference Smoke Test

Validated on the live VastAI training box against `/workspace/agillm4_4090_ckpts/pretrain_step01176781.pt` using CPU-only AR inference:

```bash
CUDA_VISIBLE_DEVICES= python agillm4_dblock_single_file.py infer \
  --mode ar \
  --ckpt /workspace/agillm4_4090_ckpts/pretrain_step01176781.pt \
  --prompt "User: Say hello in one short sentence. Assistant:" \
  --max_new 8 --greedy --plain-output --attn_backend manual
```

The trainer zero-fills missing SAT/NAT bias keys during inference compatibility loading, which lets older full checkpoints run without leaving newly introduced bias tensors random.
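
The zero-fill idea can be illustrated without PyTorch. This is a simplified sketch under stated assumptions, not the trainer's loader: bias entries are recognised by a `.bias` suffix, nested lists stand in for tensors, and `fill_missing_biases`, `zeros`, and the `nat_head` key names are all hypothetical.

```python
def zeros(shape):
    """Build a nested list of zeros; a stand-in for a tensor constructor."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]


def fill_missing_biases(checkpoint_state, expected_shapes, make_zeros=zeros):
    """Return a copy of `checkpoint_state` with absent bias entries zeroed.

    ASSUMPTION: bias parameters are recognised by a ".bias" suffix; the
    real trainer may select SAT/NAT bias keys by a different rule.
    Missing non-bias keys are reported, never invented.
    """
    patched = dict(checkpoint_state)
    filled, still_missing = [], []
    for key, shape in expected_shapes.items():
        if key in patched:
            continue
        if key.endswith(".bias"):
            patched[key] = make_zeros(shape)
            filled.append(key)
        else:
            still_missing.append(key)
    return patched, filled, still_missing


# Toy example: an older checkpoint that predates `nat_head.bias`.
old_state = {"nat_head.weight": [[0.1, 0.2], [0.3, 0.4]]}
expected = {"nat_head.weight": (2, 2), "nat_head.bias": (2,)}

patched, filled, still_missing = fill_missing_biases(old_state, expected)
print(filled)
print(patched["nat_head.bias"])
```

A zero bias adds nothing to a layer's output, so the older checkpoint behaves as it did when it was saved, whereas a randomly initialized bias would shift the outputs. With real tensors, `make_zeros` could be `torch.zeros` and the expected shapes could be read from the model's own `state_dict()`.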

## NAT Decode Notes

The packed trainer includes the same NAT inference anti-collapse changes as the live trainer. NAT now applies repetition/frequency/presence penalties and sampler controls while committing masked positions, rather than filling every blank with an unconstrained argmax.
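
To show why penalties prevent an all-token collapse, here is a toy sketch in plain Python. It uses the commonly seen definitions of the three penalties and commits blanks left to right with a greedy pick. Both choices are assumptions: the trainer's formulas, commit order, and sampler controls (temperature, top-k, and so on) are not described in this README. `penalize_logits` and `commit_masked` are hypothetical names.

```python
from collections import Counter


def penalize_logits(logits, committed, repetition=1.0, frequency=0.0, presence=0.0):
    """Return a penalised copy of `logits` given already committed token ids.

    ASSUMPTION: these are the widely used definitions, not necessarily
    the trainer's exact formulas.
      repetition: divide positive logits (multiply negative ones) of seen tokens
      frequency:  subtract `frequency * count` per seen token
      presence:   subtract `presence` once per seen token
    """
    counts = Counter(committed)
    out = list(logits)
    for token, count in counts.items():
        value = out[token]
        value = value / repetition if value > 0 else value * repetition
        value -= frequency * count + presence
        out[token] = value
    return out


def commit_masked(position_logits, **penalties):
    """Greedily fill positions left to right, penalising earlier picks."""
    committed = []
    for logits in position_logits:
        adjusted = penalize_logits(logits, committed, **penalties)
        committed.append(max(range(len(adjusted)), key=adjusted.__getitem__))
    return committed


# Three blanks whose raw logits all favour token 0.
blanks = [[2.0, 1.5, 1.0]] * 3

print(commit_masked(blanks))
print(commit_masked(blanks, repetition=1.2, frequency=0.5, presence=0.3))
```

Without penalties every blank receives token 0, which is the collapse case. With the penalties shown, the three blanks receive tokens 0, 1, and 2, because each committed token lowers its own score for the remaining positions.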

Smoke result, CPU-only: about 67 tok/s and no all-token collapse. Output quality is still early-training rough; this is a decoding stability improvement, not a solved NAT head.