	# AGILLM4 Distributed Training Benchmark

	This folder contains a non-destructive benchmark harness for the real AGILLM4
	DiffusionBlock training path.

	The exporter copies block-slice leases out of a full AGILLM4 checkpoint. Each
	worker then runs its lease through the live AGILLM4 `Block`, MoE FFN, sublinear
	attention mask path, and V4-Pro tokenizer vocab, and a collector gathers the
	timing summaries. The source checkpoint is never modified.
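One way to confirm the non-destructive property independently is to hash the checkpoint before and after a run. The sketch below is not part of the harness; `sha256_of_file` and `assert_unchanged` are hypothetical helper names introduced only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_unchanged(path, run):
    """Hypothetical helper: call run() and fail if the file at path changed."""
    before = sha256_of_file(path)
    result = run()
    after = sha256_of_file(path)
    if before != after:
        raise RuntimeError(f"{path} was modified during the run")
    return result
```

In practice `run` would wrap the export step, for example a `subprocess.run` call on `agillm4_export_bench_packages.py`, and `path` would be the checkpoint passed to `--ckpt`.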

	## Files

	- `agillm4_export_bench_packages.py` - export shared frozen tensors and one
	DBlock lease per worker from a full AGILLM4 checkpoint.
	- `agillm4_slice_bench_worker.py` - run one worker lease with the live
	`nB300_agillm4.py` runtime plus its companion `dblocks_train.py`,
	`fused_ce.py`, and `anchor_memory.py` sidecars.
	- `agillm4_collect_bench_results.py` - collect update `.pt` files into a
	combined throughput summary.

	## Example

	```bash
	python agillm4_export_bench_packages.py \
	    --ckpt /workspace/agillm4_4090_ckpts/pretrain_step01317993.pt \
	    --out-dir /workspace/agillm4_dist_bench/export_test \
	    --workers geth:0,mcp:1,prime:2,communist-web:3 \
	    --dblock-blocks 4 \
	    --steps 1 \
	    --batch-size 1 \
	    --block-size 128 \
	    --attn-backend sublinear \
	    --sublinear-window 128 \
	    --sublinear-stride 128 \
	    --sublinear-max-anchors 128 \
	    --sublinear-chunk 128
	```
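The `--workers` value is a comma-separated list of `name:index` entries. The following is a minimal sketch of how such a spec could be parsed and mapped to lease filenames. It assumes the number after the colon is the block index and that lease files follow the pattern of the one name shown in this README (`lease_geth_block0_agillm4bench.pt`). Both are inferences, and `parse_workers` and `lease_filename` are hypothetical names, not functions from the exporter.

```python
def parse_workers(spec):
    """Split 'name:index,name:index,...' into a list of (name, index) pairs."""
    pairs = []
    for item in spec.split(","):
        # rpartition keeps any earlier colons inside the worker name.
        name, _, index = item.strip().rpartition(":")
        if not name or not index.isdigit():
            raise ValueError(f"bad worker entry: {item!r}")
        pairs.append((name, int(index)))
    return pairs


def lease_filename(name, index):
    # Assumed naming pattern, inferred from a single example in the README.
    return f"lease_{name}_block{index}_agillm4bench.pt"


workers = parse_workers("geth:0,mcp:1,prime:2,communist-web:3")
for name, index in workers:
    print(name, index, lease_filename(name, index))
```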

	Each worker then runs its assigned lease:

	```bash
	python agillm4_slice_bench_worker.py \
	    --package lease_geth_block0_agillm4bench.pt \
	    --shared shared_frozen.pt \
	    --runtime /root/agillm4_worker/runtime/nB300_agillm4.py \
	    --out agillm4_bench_update_geth.pt \
	    --device cpu \
	    --threads 2
	```
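When launching several workers from a script, the same command can be built programmatically. This sketch uses only the flags shown above. The `worker_command` helper is hypothetical, and the `agillm4_bench_update_<name>.pt` output pattern is an assumption generalized from the single `geth` example.

```python
import shlex


def worker_command(name, lease_path, runtime, device="cpu", threads=2):
    """Build the argv list for one worker run (hypothetical helper)."""
    return [
        "python", "agillm4_slice_bench_worker.py",
        "--package", lease_path,
        "--shared", "shared_frozen.pt",
        "--runtime", runtime,
        # Assumed output naming, inferred from agillm4_bench_update_geth.pt.
        "--out", f"agillm4_bench_update_{name}.pt",
        "--device", device,
        "--threads", str(threads),
    ]


cmd = worker_command(
    "geth",
    "lease_geth_block0_agillm4bench.pt",
    "/root/agillm4_worker/runtime/nB300_agillm4.py",
)
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell quoting issues when paths contain spaces.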

	The first live all-node run on 2026-06-02 used four 7-layer block leases from
	`pretrain_step01317993.pt` and completed 512 unique block tokens in 106.724 s,
	or 4.797 combined tok/s. The monolithic baseline on a 4090 reached 3357.167
	tok/s, roughly 700 times faster. This is a correctness and architecture
	benchmark for CPU worker slices, not a claim that the CPU mesh is competitive
	with the 4090 trainer.
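The reported figures can be checked with a few lines of arithmetic. The token count below assumes the live run used the example settings (four workers, one step, batch size 1, block size 128), which matches the stated 512 tokens but is not confirmed by the README. It also assumes combined throughput means total tokens divided by the 106.724 s elapsed time.

```python
# Assumed run settings, taken from the example export command.
workers = 4
steps = 1
batch_size = 1
block_size = 128

tokens = workers * steps * batch_size * block_size  # 512 under these assumptions
elapsed_s = 106.724        # reported wall time for the all-node run
baseline_tok_s = 3357.167  # reported 4090 monolithic baseline

combined_tok_s = tokens / elapsed_s
ratio = baseline_tok_s / combined_tok_s

print(f"{tokens} tokens, {combined_tok_s:.3f} tok/s combined")
print(f"baseline is about {ratio:.0f}x faster")
```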
