OpenTransformer
/

agillm4-dblock-single-file


	---
	library_name: pytorch
	tags:
	- pytorch
	- transformer
	- language-model
	- long-context
	- agillm
	- dblock
	- single-file
	- experimental
	---

	# AGILLM-4 dblock single-file

	This repo packages the live AGILLM-4 dblock trainer as one runnable Python file:

	- `agillm4_dblock_single_file.py`

	It was regenerated on `2026-05-31T16:07:54Z` by mechanically inlining the live VastAI training sources:

	- `fused_ce.py`
	- `anchor_memory.py`
	- `dblocks_train.py`
	- `nB300_agillm4.py`

	The original live command is `nB300_agillm4.py train`. This single-file build keeps that CLI surface, registers in-memory shims so that imports of the former helper modules still resolve, and disables the helper modules' smoke tests, which would otherwise fire because the packed file itself runs as `__main__`.
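
The shim idea described above can be sketched as follows. This is a minimal illustration, not the packed trainer's actual code: `register_shim`, `demo_helper`, `scaled_sum`, and `RUN_HELPER_SMOKE_TESTS` are hypothetical names chosen for the example. It relies only on the standard behavior that `import` consults `sys.modules` before searching the filesystem.

```python
import sys
import types

RUN_HELPER_SMOKE_TESTS = False  # hypothetical switch; kept off in a packed build


def register_shim(name: str, namespace: dict) -> types.ModuleType:
    """Expose already-defined objects under an importable module name."""
    module = types.ModuleType(name)
    module.__dict__.update(namespace)
    sys.modules[name] = module  # a later `import name` resolves to this object
    return module


def scaled_sum(values, scale=1.0):
    """Stand-in for a helper that used to live in its own file."""
    return scale * sum(values)


register_shim("demo_helper", {"scaled_sum": scaled_sum})

# Code written against the old multi-file layout keeps working unchanged.
from demo_helper import scaled_sum as imported_scaled_sum

# Once helpers are inlined, __name__ is "__main__" for the whole file, so a
# bare `if __name__ == "__main__":` guard would run every helper's smoke
# test. Gating the guard on an explicit flag is one way to keep them off.
if __name__ == "__main__" and RUN_HELPER_SMOKE_TESTS:
    assert imported_scaled_sum([1, 2, 3], scale=2.0) == 12.0
```

Registering the module object directly avoids writing temporary files and keeps the packed script a single artifact.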

	See `single_file_manifest.json` for source hashes from the generated build.

	Example training invocation (remaining arguments elided):

	```bash
	python agillm4_dblock_single_file.py train --preset agillm4_floor --dblock ...
	```

	This is experimental training code, not a polished inference package.

	## Inference Smoke Test

	Validated on the live VastAI training box against `/workspace/agillm4_4090_ckpts/pretrain_step01176781.pt` using CPU-only AR inference. Setting `CUDA_VISIBLE_DEVICES` to an empty value hides all GPUs from the process:

	```bash
	CUDA_VISIBLE_DEVICES= python agillm4_dblock_single_file.py infer \
	--mode ar \
	--ckpt /workspace/agillm4_4090_ckpts/pretrain_step01176781.pt \
	--prompt "User: Say hello in one short sentence. Assistant:" \
	--max_new 8 --greedy --plain-output --attn_backend manual
	```

	During inference compatibility loading, the trainer zero-fills any SAT/NAT bias keys that are missing from the checkpoint. This lets older full checkpoints run without leaving newly introduced bias tensors at their random initialization.
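
The zero-fill step above can be sketched as follows. This is an illustration under stated assumptions, not the trainer's loader: the function name, the `.bias` suffix rule, and the key names (`proj.weight`, `proj.bias`) are hypothetical, and plain Python lists stand in for tensors so the example runs without PyTorch.

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]


def zero_fill_missing_biases(
    state: Dict[str, object],
    expected_shapes: Dict[str, Shape],
    make_zeros: Callable[[Shape], object],
    suffix: str = ".bias",  # assumption: bias keys are recognizable by suffix
) -> List[str]:
    """Add zero tensors for expected bias keys that the checkpoint lacks.

    Mutates `state` in place and returns the names that were filled.
    """
    filled = []
    for name, shape in expected_shapes.items():
        if name.endswith(suffix) and name not in state:
            state[name] = make_zeros(shape)
            filled.append(name)
    return filled


# Demo: an "old" checkpoint that predates the bias term (1-D shapes only).
old_ckpt = {"proj.weight": [[1.0, 2.0], [3.0, 4.0]]}
expected = {"proj.weight": (2, 2), "proj.bias": (2,)}
filled = zero_fill_missing_biases(
    old_ckpt, expected, lambda shape: [0.0] * shape[0]
)
print(filled)  # ['proj.bias']
```

With PyTorch, `expected_shapes` could be built from `model.state_dict()` and `make_zeros` could be `torch.zeros`, after which `model.load_state_dict(state)` would see every key. A zero bias leaves the layer's output identical to the bias-free layer the old checkpoint was trained with, which is why zero is a safer default than a random initial value.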


	## NAT Decode Notes

	The packed trainer includes the same NAT inference anti-collapse changes as the live trainer. NAT now applies repetition/frequency/presence penalties and sampler controls while committing masked positions, rather than filling every blank with an unconstrained argmax.

	Smoke result (CPU-only): about 67 tok/s with no all-token collapse. Output quality is still rough, as expected this early in training; this is a decoding stability improvement, not a solved NAT head.
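
To illustrate why penalties applied while committing masked positions help against collapse, here is a minimal sketch. It is not the trainer's decoder. The assumptions are a left-to-right commit order, greedy selection (sampler controls omitted), and the common penalty conventions: the repetition penalty divides positive logits and multiplies negative ones, the frequency penalty scales with the count, and the presence penalty is a flat subtraction. All names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence

MASK = None  # hypothetical marker for a position that is not yet filled


def penalize(logits: Sequence[float], counts: Counter,
             repetition: float = 1.0, frequency: float = 0.0,
             presence: float = 0.0) -> List[float]:
    """Lower the scores of tokens that have already been committed."""
    out = list(logits)
    for tok, n in counts.items():
        if n <= 0:
            continue
        x = out[tok]
        x = x / repetition if x > 0 else x * repetition
        out[tok] = x - frequency * n - presence
    return out


def commit_masked(tokens: Sequence[Optional[int]],
                  logits_per_pos: Sequence[Sequence[float]],
                  **penalties: float) -> List[int]:
    """Fill masked positions one at a time, updating counts after each commit."""
    result = list(tokens)
    counts = Counter(t for t in result if t is not MASK)
    for pos, tok in enumerate(result):
        if tok is not MASK:
            continue
        scores = penalize(logits_per_pos[pos], counts, **penalties)
        choice = max(range(len(scores)), key=scores.__getitem__)
        result[pos] = choice
        counts[choice] += 1
    return result


# Three blanks whose logits all slightly prefer token 0.
logits = [[2.0, 1.9, 0.0]] * 3
blanks = [MASK, MASK, MASK]
print(commit_masked(blanks, logits))                               # [0, 0, 0]
print(commit_masked(blanks, logits, frequency=0.5, presence=0.5))  # [0, 1, 0]
```

The unconstrained argmax fills every blank with the same token, while the penalized pass lets the runner-up win once token 0 has been committed.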