---
library_name: pytorch
tags:
- pytorch
- transformer
- language-model
- long-context
- agillm
- dblock
- single-file
- experimental
---
# AGILLM-4 dblock single-file
This repo packages the live AGILLM-4 dblock trainer as one runnable Python file:
- `agillm4_dblock_single_file.py`

It was regenerated on `2026-05-31T16:07:54Z` by mechanically inlining the live VastAI training sources:
- `fused_ce.py`
- `anchor_memory.py`
- `dblocks_train.py`
- `nB300_agillm4.py`

The original live command uses `nB300_agillm4.py train`. This single-file build keeps that CLI surface, registers in-memory shims for the former helper modules, and disables helper-module smoke tests that would otherwise fire because the packed file is `__main__`.
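As a rough illustration of what an in-memory shim can look like, the sketch below registers a synthetic module in `sys.modules` so that a later `import` resolves without a file on disk. The names `register_shim`, `demo_helper_module`, and `double` are hypothetical and are not taken from the packed trainer; the actual build may wire its shims differently.

```python
import sys
import types


def register_shim(name, namespace):
    """Expose `namespace` as an importable module called `name`.

    Hypothetical helper for illustration only.
    """
    module = types.ModuleType(name)
    module.__dict__.update(namespace)
    # Anything placed in sys.modules is found by `import` before the
    # file system is searched.
    sys.modules[name] = module
    return module


# Stand-in for a function that used to live in a separate helper file.
def _double(x):
    return x * 2


register_shim("demo_helper_module", {"double": _double})

# The import now succeeds even though no demo_helper_module.py exists.
import demo_helper_module

print(demo_helper_module.double(21))
```

The same inlining is why helper smoke tests need to be disabled: an `if __name__ == "__main__":` block copied from a helper would run whenever the packed file is executed as a script.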
See `single_file_manifest.json` for source hashes from the generated build.
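A minimal sketch of checking source files against recorded hashes is shown below. It assumes a flat `{filename: sha256_hex}` mapping and SHA-256 digests; both are assumptions, since the real layout and hash algorithm of `single_file_manifest.json` are not described here. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root="."):
    """Return names that are missing or whose digest differs.

    `expected` is an assumed {filename: sha256_hex} mapping; adapt the
    lookup to whatever structure the real manifest uses.
    """
    bad = []
    for name, recorded in expected.items():
        path = Path(root) / name
        if not path.is_file() or sha256_of(path) != recorded:
            bad.append(name)
    return bad
```

An empty return value means every listed file was found and matched its recorded digest.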
Example training shape:
```bash
python agillm4_dblock_single_file.py train --preset agillm4_floor --dblock ...
```
This is experimental training code, not a polished inference package.
## Inference Smoke Test
Validated on the live VastAI training box against `/workspace/agillm4_4090_ckpts/pretrain_step01176781.pt` using CPU-only AR inference:
```bash
CUDA_VISIBLE_DEVICES= python agillm4_dblock_single_file.py infer \
--mode ar \
--ckpt /workspace/agillm4_4090_ckpts/pretrain_step01176781.pt \
--prompt "User: Say hello in one short sentence. Assistant:" \
--max_new 8 --greedy --plain-output --attn_backend manual
```
The trainer zero-fills missing SAT/NAT bias keys during inference compatibility loading, which lets older full checkpoints run without leaving newly introduced bias tensors random.
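The zero-fill idea can be sketched in plain PyTorch as follows. This is not the trainer's loader: the function name, the `is_fillable` predicate, and the rule of matching keys that end in `bias` are assumptions made for illustration.

```python
import torch
from torch import nn


def load_with_zero_filled_biases(model, ckpt_state,
                                 is_fillable=lambda key: key.endswith("bias")):
    """Load `ckpt_state`, zero-filling fillable keys the checkpoint lacks.

    Hypothetical helper; returns the list of keys that were zero-filled.
    """
    patched = dict(ckpt_state)
    filled = []
    for key, reference in model.state_dict().items():
        if key not in patched and is_fillable(key):
            # Match the shape, dtype, and device the model expects.
            patched[key] = torch.zeros_like(reference)
            filled.append(key)
    # strict=True still reports any other missing or unexpected key.
    model.load_state_dict(patched, strict=True)
    return filled


# Example: an "old" checkpoint saved before the bias existed.
layer = nn.Linear(4, 2)
old_state = {"weight": torch.ones(2, 4)}
print(load_with_zero_filled_biases(layer, old_state))
```

Zero is a natural fill value for an additive bias because the layer then behaves as it did before the bias was introduced.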
## NAT Decode Notes
The packed trainer includes the same NAT inference anti-collapse changes as the live trainer. NAT now applies repetition/frequency/presence penalties and sampler controls while committing masked positions, rather than filling every blank with an unconstrained argmax.
Smoke result, CPU-only: about 67 tok/s and no all-token collapse. Output quality is still early-training rough; this is a decoding stability improvement, not a solved NAT head.
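To illustrate why penalties during commit help against collapse, here is a toy, pure-Python sketch. It is not the trainer's decoder: `penalize_logits`, `commit_masked`, and `mask_id` are hypothetical names, the logits are fixed per position instead of coming from a model, positions are committed left to right, and the repetition penalty uses the common convention of dividing positive logits and multiplying negative ones.

```python
from collections import Counter


def penalize_logits(logits, counts, rep=1.0, freq=0.0, pres=0.0):
    """Return a penalized copy of `logits` for tokens already committed."""
    out = list(logits)
    for tok, n in counts.items():
        if n <= 0:
            continue
        x = out[tok]
        # Repetition penalty: shrink positive logits, push negative ones down.
        x = x / rep if x > 0 else x * rep
        # Frequency penalty scales with the count; presence is a flat cost.
        out[tok] = x - freq * n - pres
    return out


def commit_masked(logits_per_pos, tokens, mask_id, rep=1.2, freq=0.5, pres=0.5):
    """Fill masked slots one at a time, penalizing tokens already present."""
    tokens = list(tokens)
    counts = Counter(t for t in tokens if t != mask_id)
    for i, tok in enumerate(tokens):
        if tok != mask_id:
            continue
        scores = penalize_logits(logits_per_pos[i], counts, rep, freq, pres)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens[i] = best
        counts[best] += 1  # later positions see this commit
    return tokens


# Every position slightly prefers token 0, the classic collapse setup.
logits = [[2.0, 1.9, 1.8]] * 3
blank = [-1, -1, -1]
print(commit_masked(logits, blank, mask_id=-1, rep=1.0, freq=0.0, pres=0.0))
print(commit_masked(logits, blank, mask_id=-1))
```

With neutral settings every blank receives the same argmax token, while the penalized call spreads the commits across different tokens.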