rev2: correct root cause (--agillm3_compat / AR+SAT head-schema), not attention

	# Diffusion-Block Disaggregated Training & a Catastrophic Port Bug — Research Notes

	Date: 2026-06-01 (rev 2 — corrected root cause) · Status: method validated; a catastrophic port bug found, mis-diagnosed twice, then properly isolated and fixed.

	A study of whether diffusion-block (dblock) training — training a local window of transformer layers from a detached snapshot instead of holding the full-depth backward graph — can decouple model training across heterogeneous hardware and slow networks. Includes an honest account of root-causing a catastrophic loss bug while porting an older checkpoint onto a newer framework.

	## TL;DR
	dblock makes the backward graph local, decoupling training across depth. That gives real activation-memory savings, low cross-node communication, and tolerance to rare synchronisation — the ingredients for heterogeneous / over-the-internet training. We measured the benefits and the costs. Net: the method is real and democratising (lets weak/cheap/mixed hardware participate), but does not beat a modern GPU on efficiency — it changes what hardware can take part, not the physics of matmul-per-watt.

	## What we measured
	1. Activation-memory reduction is real. With a 1-layer active window, activation memory falls ~6x at 16 layers and ~20x at 64 layers; sparse top-1 MoE reaches a ~40x activation ratio. Total VRAM reduction is bounded (~4-4.6x) because weights and optimizer state stay resident.

	2. Compression "quality loss" is mostly under-training. Short runs (180 steps) suggested more compression = worse; trained to 960 steps, 93-97% of the gap closed. The memory lever is more compression, not necessarily MoE.

	3. MoE's edge narrows with training. 21% win at 180 steps -> ~10% by 960, at +49% resident FFN params. Experimental, not a default.

	4. Parallel/decoupled training: throughput rises, but at a convergence tax. Aggregate throughput does increase with workers, but matched on updates, fully-parallel "Jacobi" needed ~2.7x more steps than sequential "Gauss-Seidel". Raw throughput != speedup. From a warm checkpoint the penalty shrinks, leaving a real speedup of ~1.3-1.6x.

	5. The synchronisation-frequency law (key for internet training). Quality tax vs sync interval H: {4:2%, 8:6%, 16:13%, 64:26%, 128:27%, 256:300%, never:collapse}. Wide usable band -> gentle slope -> sharp cliff once you sync only ~once per run. Communicate rarely (DiLoCo-style), but never never.

	6. Heterogeneous hardware works mechanically. The same decoupled trainer ran with layer-windows split across a CPU, an NVIDIA consumer GPU, and an Intel integrated GPU in one process; sequential trajectory matched single-GPU exactly. (Some backends need a finite large-negative attention mask, not -inf.)

	7. Economics, anchored to a measured number. A single RTX 4090 training a ~700M model measured ~3,136 tok/s (this corrected an earlier extrapolation that was ~25-30x too high — measure, don't guess). Cheap used consumer GPUs match/beat a 4090 on compute-per-£ (3070 ~0.40, 3060 ~0.26, 4090 ~0.25 TFLOPS/£); a CPU pool is 10-50x worse on £ and energy/token. So "any hardware can join" is real, and the efficient version is pooling cheap small-VRAM GPUs (each a slice over a normal network — no NVLink), not CPUs. CPUs only make sense when genuinely free/idle at volunteer scale.
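The ceiling on total VRAM reduction in item 1 follows from simple accounting: only the activation term shrinks, while weights and optimizer state stay resident. A minimal sketch of that arithmetic is below. The function names and the 10 GB / 36 GB figures are illustrative assumptions, not measurements from these runs; they were picked only so that the ceiling lands near the upper end of the reported ~4-4.6x band.

```python
def total_vram_reduction(resident_gb: float, activation_gb: float,
                         activation_ratio: float) -> float:
    """Whole-model VRAM reduction when only activations shrink.

    resident_gb      -- weights + optimizer state (unchanged by dblock)
    activation_gb    -- activation memory of the full-depth backward graph
    activation_ratio -- factor by which activation memory is reduced
    """
    if resident_gb <= 0 or activation_gb < 0 or activation_ratio <= 0:
        raise ValueError("memory sizes and ratio must be positive")
    before = resident_gb + activation_gb
    after = resident_gb + activation_gb / activation_ratio
    return before / after


def reduction_ceiling(resident_gb: float, activation_gb: float) -> float:
    """Limit of total_vram_reduction as activation_ratio grows without bound."""
    if resident_gb <= 0 or activation_gb < 0:
        raise ValueError("memory sizes must be positive")
    return (resident_gb + activation_gb) / resident_gb


if __name__ == "__main__":
    # Hypothetical split: 10 GB resident, 36 GB activations.
    resident, activations = 10.0, 36.0
    for ratio in (6, 20, 40):
        total = total_vram_reduction(resident, activations, ratio)
        print(f"activation ratio {ratio:>2}x -> total reduction {total:.2f}x")
    print(f"ceiling: {reduction_ceiling(resident, activations):.2f}x")
```

Under these assumed numbers, raising the activation ratio from 20x to 40x barely moves the total, which is the same diminishing-returns pattern as Amdahl's law: past a point, the resident term dominates.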

	## The bug worth sharing — and how NOT to diagnose it
	A mature ~700M model (older checkpoint head-schema) was ported onto a newer training framework. Result: catastrophic loss ~10x random — the model appeared "loaded and training" but produced garbage and degraded its own weights.

	The instructive part is the diagnostic failure. We proposed two confident root causes and both were wrong:
	1. "It's the attention algorithm (windowed vs full)." Wrong — logs showed both the broken and the working run used the same (dense) attention backend.
	2. "It's a degraded checkpoint." Wrong — that checkpoint loaded fine once the right flag was set.

	Both errors came from the same methodological sin: changing several variables at once (checkpoint, sequence length, a compatibility flag, the attention backend) and attributing the outcome to one of them.

	Actual root cause (found by a clean single-variable test): the run omitted a checkpoint-compatibility flag that tells the new framework how to interpret the old model's head schema on load. Without it, the heads/weights are mis-mapped -> confidently-wrong outputs -> ~10x-random loss. Holding everything else constant and toggling only that flag moved loss from ~113 to ~8. Every run with the flag set landed at sane loss (8-19) regardless of attention/checkpoint/length/dblock; the only catastrophic run was the one missing it.
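The single-variable discipline described above can be enforced mechanically rather than by memory. The sketch below generates ablation configs that each differ from a baseline in exactly one key, and rejects any comparison that changes more than one. The config keys mirror the four variables named in this section, but the key names and values (`ckpt_a`, `compat_flag`, and so on) are hypothetical placeholders, not the framework's real options.

```python
_MISSING = object()  # sentinel so a missing key counts as a difference


def differing_keys(a: dict, b: dict) -> set:
    """Return the set of keys whose values differ between two configs."""
    return {k for k in set(a) | set(b)
            if a.get(k, _MISSING) != b.get(k, _MISSING)}


def is_clean_comparison(a: dict, b: dict) -> bool:
    """True only if the two runs differ in exactly one variable."""
    return len(differing_keys(a, b)) == 1


def single_variable_variants(baseline: dict, alternatives: dict):
    """Yield (label, config) pairs, each changing one key of the baseline."""
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue  # not a change, so nothing to test
            variant = dict(baseline)
            variant[key] = value
            yield f"{key}={value!r}", variant


# Hypothetical baseline: the known-good run.
baseline = {
    "checkpoint": "ckpt_a",
    "seq_len": 1024,
    "compat_flag": True,
    "attention": "dense",
}
alternatives = {
    "checkpoint": ["ckpt_b"],
    "seq_len": [2048],
    "compat_flag": [False],
    "attention": ["windowed"],
}
variants = list(single_variable_variants(baseline, alternatives))

# The kind of comparison that misled us: everything changed at once.
confounded = {"checkpoint": "ckpt_b", "seq_len": 2048,
              "compat_flag": False, "attention": "windowed"}
```

Here every generated variant passes `is_clean_comparison(baseline, variant)`, while `confounded` does not, so a loss difference against it cannot be attributed to any single variable.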

	Generalisable lessons:
	1. Gate any checkpoint migration on the loss value, not "it ran." A from-scratch or mis-loaded run also "trains and saves cleanly."
	2. Random-init cross-entropy is ln(vocab size). A loss far above that means weights are loaded but mis-mapped (confidently wrong), which is a different failure from "didn't load."
	3. Isolate ONE variable per test. We violated this three times and mis-attributed the cause twice; the answer only appeared from a strict single-variable comparison. An independent reviewer rejecting the first wrong claim is what forced the proper isolation — adversarial review works.
	4. Cross-framework checkpoint ports hinge on schema/compatibility flags (head layout, tokenizer, optimizer-state format), not just tensor shapes. Optimizer state is not portable across optimizers or framework versions (e.g. Adafactor vs AdamW), so reset it on migration and keep the weights plus step/token counters.
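Lessons 1 and 2 can be combined into a small gate that runs on the first evaluation after a migration. The sketch below compares the observed loss to the uniform-prediction baseline ln(vocab size) and separates "far above random" from "no better than random". The vocabulary size of 50,000, the `blowup_factor` threshold, and the band labels are assumptions for illustration; they would need calibrating to how a given framework reports loss (for example, summed across multiple heads).

```python
import math


def random_init_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over the vocabulary."""
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    return math.log(vocab_size)


def classify_migration_loss(loss: float, vocab_size: int,
                            blowup_factor: float = 3.0) -> str:
    """Classify a post-migration loss against the ln(vocab) baseline.

    blowup_factor is a hypothetical threshold: a loss above
    blowup_factor * ln(vocab) is treated as confidently wrong.
    """
    if not math.isfinite(loss):
        return "invalid"
    baseline = random_init_loss(vocab_size)
    if loss > blowup_factor * baseline:
        # Worse than guessing: weights loaded but wired to the wrong places.
        return "mis-mapped"
    if loss >= baseline:
        # Indistinguishable from an untrained model.
        return "no-better-than-random"
    return "plausible"


def gate_migration(loss: float, vocab_size: int) -> None:
    """Raise instead of letting a bad port keep training and saving."""
    verdict = classify_migration_loss(loss, vocab_size)
    if verdict != "plausible":
        raise RuntimeError(f"migration gate failed: loss={loss} ({verdict})")
```

With the assumed 50,000-token vocabulary, the two losses reported in this section fall on opposite sides of the gate: ~113 is classified as `mis-mapped`, while ~8 passes as `plausible`.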

	## Open
	- A complete multi-node dblock run with end-to-end quality measurement: the next experiment, now that the baseline is fixed.
	- dblock at the model's native context length with full attention is memory-heavy on CPU; it runs better on a GPU.
	- Whether a disaggregated cheap-hardware swarm ever beats one good GPU. For hardware you already own, it is a research demo, not a throughput win.

	## Bottom line
	Diffusion-block training is a legitimate primitive for heterogeneous, low-communication, fault-tolerant training, in the same family as DiLoCo, SWARM and Hivemind. It democratises participation within a bounded sync band, at a measured convergence tax. It does not repeal the efficiency advantage of purpose-built accelerators. And the debugging story is its own lesson: gate on loss, and change one variable at a time.
