---
library_name: pytorch
tags:
- pytorch
- transformer
- language-model
- long-context
- agillm
- dblock
- single-file
- experimental
---

# AGILLM-4 dblock single-file

This repo packages the live AGILLM-4 dblock trainer as one runnable Python file:

- `agillm4_dblock_single_file.py`

It was regenerated on `2026-05-31T16:07:54Z` by mechanically inlining the live VastAI training sources:

- `fused_ce.py`
- `anchor_memory.py`
- `dblocks_train.py`
- `nB300_agillm4.py`

The original live command uses `nB300_agillm4.py train`. This single-file build keeps that CLI surface, registers in-memory shims for the former helper modules, and disables helper-module smoke tests that would otherwise fire because the packed file is `__main__`.
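
As a rough illustration of the shim idea described above, the sketch below registers an in-memory module so that a plain `import` still resolves after the helper file is gone. The module name `helper_demo`, the function `double`, and the helper `register_shim` are all hypothetical; the actual generated file may wire its shims differently.

```python
import sys
import types

# Hypothetical stand-in for the source text of one inlined helper module.
HELPER_SOURCE = '''
def double(x):
    return 2 * x

if __name__ == "__main__":
    raise SystemExit("helper smoke test ran")
'''


def register_shim(name: str, source: str) -> types.ModuleType:
    """Execute `source` in a fresh module object and publish it in sys.modules."""
    module = types.ModuleType(name)
    # Inside the shim, __name__ equals `name`, so the "__main__" guard stays false
    # and the helper's smoke test does not run.
    exec(compile(source, f"<inlined {name}>", "exec"), module.__dict__)
    sys.modules[name] = module
    return module


register_shim("helper_demo", HELPER_SOURCE)

# Ordinary imports now resolve from memory; no helper_demo.py exists on disk.
import helper_demo

result = helper_demo.double(21)
print(result)
```

Executing the helper source under its own module name is one way to keep `if __name__ == "__main__":` blocks quiet; a build that pastes helper code directly into the top-level file would instead need to strip or guard those blocks explicitly.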

See `single_file_manifest.json` for source hashes from the generated build.

Example training invocation (remaining arguments elided):

```bash
python agillm4_dblock_single_file.py train --preset agillm4_floor --dblock ...
```

This is experimental training code, not a polished inference package.

## Inference Smoke Test

Validated on the live VastAI training box against `/workspace/agillm4_4090_ckpts/pretrain_step01176781.pt` using CPU-only AR inference:

```bash
CUDA_VISIBLE_DEVICES= python agillm4_dblock_single_file.py infer \
  --mode ar \
  --ckpt /workspace/agillm4_4090_ckpts/pretrain_step01176781.pt \
  --prompt "User: Say hello in one short sentence. Assistant:" \
  --max_new 8 --greedy --plain-output --attn_backend manual
```

During inference compatibility loading, the trainer zero-fills any missing SAT/NAT bias keys, so older full checkpoints load and run without the newly introduced bias tensors being left at random initialization.
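
A minimal sketch of that zero-fill behaviour, assuming a standard PyTorch `state_dict` workflow. The class `TinyHead`, the parameter name `nat_bias`, the `bias_markers` tuple, and the function `load_with_zero_filled_biases` are hypothetical and are not taken from the trainer.

```python
import torch
from torch import nn


class TinyHead(nn.Module):
    """Hypothetical module with a newly introduced, randomly initialised bias."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.nat_bias = nn.Parameter(torch.randn(4))


def load_with_zero_filled_biases(model, ckpt_state, bias_markers=("sat_bias", "nat_bias")):
    """Load `ckpt_state`, inserting zeros for matching bias keys it lacks."""
    patched = dict(ckpt_state)
    filled = []
    for name, tensor in model.state_dict().items():
        if name not in patched and any(marker in name for marker in bias_markers):
            # Zeros make the new bias a no-op instead of random noise.
            patched[name] = torch.zeros_like(tensor)
            filled.append(name)
    # strict=True still reports any other missing or unexpected key.
    model.load_state_dict(patched, strict=True)
    return filled


# Simulate an older checkpoint saved before `nat_bias` existed.
old_ckpt = {k: v for k, v in TinyHead().state_dict().items() if k != "nat_bias"}

model = TinyHead()
filled = load_with_zero_filled_biases(model, old_ckpt)
print(filled, model.nat_bias.tolist())
```

Filling only the named bias keys and keeping `strict=True` means a checkpoint that is missing anything else still fails loudly rather than running with partly random weights.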


## NAT Decode Notes

The packed trainer includes the same NAT inference anti-collapse changes as the live trainer. NAT now applies repetition/frequency/presence penalties and sampler controls while committing masked positions, rather than filling every blank with an unconstrained argmax.
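
To make the difference from an unconstrained argmax concrete, here is a small pure-Python sketch of penalised filling. It assumes positions are committed left to right and uses common penalty formulas (divide or multiply for repetition, subtract for frequency and presence); the function names, default values, and commit order are illustrative assumptions, not the trainer's actual settings, and sampler controls are left out.

```python
from collections import Counter


def penalize(logits, committed, repetition=1.2, frequency=0.3, presence=0.2):
    """Return a copy of `logits` with penalties for tokens already committed."""
    out = list(logits)
    for tok, count in Counter(committed).items():
        # Repetition penalty: shrink positive logits, push negative ones lower.
        out[tok] = out[tok] / repetition if out[tok] > 0 else out[tok] * repetition
        # Frequency penalty scales with the count; presence is a flat cost.
        out[tok] -= frequency * count + presence
    return out


def fill_masked(per_position_logits, **penalties):
    """Commit one token per masked position, penalising earlier commits."""
    committed = []
    for logits in per_position_logits:
        adjusted = penalize(logits, committed, **penalties)
        committed.append(max(range(len(adjusted)), key=adjusted.__getitem__))
    return committed


# Four masked positions whose raw logits all favour token 0.
logits = [[2.0, 1.9, 1.0] for _ in range(4)]

unconstrained = [max(range(3), key=row.__getitem__) for row in logits]
penalised = fill_masked(logits)
print(unconstrained, penalised)
```

With identical logits at every position, the unconstrained fill repeats one token everywhere, while the penalised fill alternates once a token has been used. That is the collapse mode the penalties are meant to break.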

Smoke result, CPU-only: about 67 tok/s and no all-token collapse. Output quality is still early-training rough; this is a decoding stability improvement, not a solved NAT head.