AGILLM-4.3 Independent DiffusionBlocks

AGILLM-4.3 has two different DiffusionBlocks-related paths:

Live trainer path: agillm41.py train --dblock ... runs one selected block objective per step inside the normal monolithic trainer. It already combines with --grad_checkpoint / --dblock_checkpoint_stride, so it mainly reduces activation pressure for the selected block. The process still holds the full model, optimizer, and checkpoint loop.
Independent path: agillm43_diffusionblocks_independent.py trains one block as a standalone resident model with a frozen shared stem. This is the paper-style memory trade-off: each worker needs only its own block's parameters, gradients, optimizer state, and activations, so all blocks can train independently and later be composed back into a normal AGILLM checkpoint.

Do not replace a healthy live trainer with the independent path without a quality gate. Because the shared stem is frozen during independent block training, the independent path is mainline-ready first as an offline/side-training and low-VRAM worker path.

Real-checkpoint workflow

Use the current full checkpoint from latest.json:

CKPT=$(python3 - <<'PY'
import json
print(json.load(open('/workspace/agillm4_4090_ckpts_active/latest.json'))['path'])
PY
)
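
The same lookup can be done with a little validation, so that a stale or malformed latest.json fails early instead of passing a bad path to later commands. This is a minimal sketch: the only thing taken from the workflow above is that latest.json holds a 'path' key; the function name `resolve_latest_checkpoint` and its error handling are hypothetical.

```python
import json
from pathlib import Path


def resolve_latest_checkpoint(latest_json: str) -> str:
    """Return the checkpoint path recorded in latest.json, or raise.

    Assumption: latest.json is a JSON object with a string 'path' entry,
    as in the shell snippet above. Nothing else about its schema is assumed.
    """
    with open(latest_json, encoding="utf-8") as fh:
        record = json.load(fh)
    if not isinstance(record, dict):
        raise ValueError(f"{latest_json}: expected a JSON object")
    path = record.get("path")
    if not isinstance(path, str) or not path:
        raise ValueError(f"{latest_json}: missing or empty 'path'")
    if not Path(path).is_file():
        # Fail here rather than inside a long-running training command.
        raise FileNotFoundError(path)
    return path
```

Calling `resolve_latest_checkpoint('/workspace/agillm4_4090_ckpts_active/latest.json')` would return the same string that the shell snippet stores in CKPT, provided the file it names exists.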

Check the memory model against the live config:

python3 /workspace/agillm41-mainline/agillm43_diffusionblocks_independent.py \
  mem-report --ckpt "$CKPT"
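
As a rough sanity check on what mem-report prints, the per-block figure can be estimated by hand. The sketch below is not the script's memory model. It assumes that "4P" means four parameter-sized copies (weights, gradients, and two optimizer moments), that values are 4 bytes each, and that the state splits evenly across blocks. It ignores the frozen stem and activations. Both function names are hypothetical.

```python
def training_state_bytes(num_params: int, copies: int = 4,
                         bytes_per_value: int = 4) -> int:
    """Bytes for `copies` parameter-sized buffers.

    Assumption: copies=4 stands for weights + grads + two optimizer
    moments, and each value takes 4 bytes (fp32).
    """
    return num_params * copies * bytes_per_value


def per_block_share(monolithic: float, num_blocks: int) -> float:
    """Per-worker share if the state divides evenly across blocks."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be >= 1")
    return monolithic / num_blocks


# Monolithic figure taken from the verification notes in this document.
estimate_gb = per_block_share(19.55, 4)
print(f"estimated one-block state: {estimate_gb:.4f} GB")
```

Under these assumptions, dividing the reported 19.55 GB monolithic figure by B=4 lands close to the roughly 4.89 GB one-block figure from the verification notes. That suggests the even-split approximation is reasonable for this configuration.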

Create a frozen shared stem from a full checkpoint:

python3 /workspace/agillm41-mainline/agillm43_diffusionblocks_independent.py \
  stem-from-ckpt --ckpt "$CKPT" --out /workspace/dbi_stem.pt

Train one block, initialized from a full checkpoint:

python3 /workspace/agillm41-mainline/agillm43_diffusionblocks_independent.py \
  train-block --init-ckpt "$CKPT" --B 4 --block 0 \
  --steps 200 --batch 4 --seqlen 64 --device cpu \
  --out /workspace/dbi_block0.pt

Train blocks 0..B-1 independently on separate machines, then compose them back into a checkpoint with the normal AGILLM schema:

python3 /workspace/agillm41-mainline/agillm43_diffusionblocks_independent.py \
  compose-into-ckpt --base-ckpt "$CKPT" \
  --blocks /workspace/dbi_block0.pt /workspace/dbi_block1.pt /workspace/dbi_block2.pt /workspace/dbi_block3.pt \
  --out /workspace/dbi_composed_full.pt
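
When there are several blocks, the per-block and compose command lines can be generated rather than typed by hand. The sketch below only builds argument lists from the flags shown above and does not run anything. The helper names `train_block_argv` and `compose_argv` and the dbi_block{i}.pt naming convention are assumptions for illustration.

```python
from pathlib import PurePosixPath

SCRIPT = "/workspace/agillm41-mainline/agillm43_diffusionblocks_independent.py"


def train_block_argv(ckpt, block, num_blocks, out_dir,
                     steps=200, batch=4, seqlen=64, device="cpu"):
    """argv for one train-block run, mirroring the flags used above."""
    if not 0 <= block < num_blocks:
        raise ValueError(f"block must be in 0..{num_blocks - 1}")
    out = PurePosixPath(out_dir) / f"dbi_block{block}.pt"
    return ["python3", SCRIPT, "train-block",
            "--init-ckpt", ckpt, "--B", str(num_blocks),
            "--block", str(block), "--steps", str(steps),
            "--batch", str(batch), "--seqlen", str(seqlen),
            "--device", device, "--out", str(out)]


def compose_argv(ckpt, num_blocks, out_dir, out_name="dbi_composed_full.pt"):
    """argv for compose-into-ckpt over blocks 0..num_blocks-1, in order."""
    blocks = [str(PurePosixPath(out_dir) / f"dbi_block{i}.pt")
              for i in range(num_blocks)]
    return ["python3", SCRIPT, "compose-into-ckpt",
            "--base-ckpt", ckpt, "--blocks", *blocks,
            "--out", str(PurePosixPath(out_dir) / out_name)]
```

Each list can be passed to `subprocess.run(argv, check=True)` on the worker that owns that block, or printed with `shlex.join(argv)` for a job scheduler. Keeping the block files in index order in `compose_argv` matches the ordering of the --blocks arguments in the command above.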

The composed checkpoint contains standard core/ar fields plus a diffusionblocks_independent metadata record.

Verified on 2026-06-17

python3 -m py_compile agillm43_diffusionblocks_independent.py
CLI exposes stem-from-ckpt, train-block --init-ckpt, and compose-into-ckpt
mem-report --ckpt /workspace/agillm4_4090_ckpts_active/pretrain_step02294806.pt read the live 4.3 cfg/vocab and reported, at B=4, about 4.89 GB of one-block resident memory versus 19.55 GB of monolithic 4P memory, before activations
Tiny synthetic checkpoint smoke test: exported four initialized blocks, and compose-into-ckpt restored the checkpoint with max_core_diff == 0.0
Existing selftest still passes, with ~4x optimizer-state reduction at B=4 and an exact compose round-trip
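
An exact round-trip of the kind reported above can be expressed as a maximum absolute difference over matching entries of two weight dictionaries. The sketch below is illustrative only: how the script computes max_core_diff is not shown here, and plain lists of floats stand in for real tensors. The function name `max_abs_diff` is hypothetical.

```python
from typing import Mapping, Sequence


def max_abs_diff(original: Mapping[str, Sequence[float]],
                 restored: Mapping[str, Sequence[float]]) -> float:
    """Largest elementwise |a - b| across two same-shaped weight dicts.

    Assumption: both dicts map parameter names to flat sequences of
    floats. Real checkpoints would hold tensors instead.
    """
    if original.keys() != restored.keys():
        missing = set(original) ^ set(restored)
        raise KeyError(f"parameter names differ: {sorted(missing)}")
    worst = 0.0
    for name, a in original.items():
        b = restored[name]
        if len(a) != len(b):
            raise ValueError(f"{name}: length {len(a)} != {len(b)}")
        for x, y in zip(a, b):
            worst = max(worst, abs(x - y))
    return worst
```

A result of exactly 0.0 means every value survived the split and compose steps unchanged. Any nonzero value points to a parameter that was altered or reordered along the way.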
