# Diffusion-Block Disaggregated Training & a Catastrophic Port Bug: Research Notes
Date: 2026-06-01 (rev 2, corrected root cause) · Status: method validated; a catastrophic port bug found, mis-diagnosed twice, then properly isolated and fixed.
A study of whether **diffusion-block (dblock) training**, that is, training a *local window* of transformer layers from a detached snapshot instead of holding the full-depth backward graph, can decouple model training across heterogeneous hardware and slow networks. Includes an honest account of root-causing a catastrophic loss bug while porting an older checkpoint onto a newer framework.
## TL;DR
dblock makes the backward graph **local**, decoupling training across depth. That gives real activation-memory savings, low cross-node communication, and tolerance to rare synchronisation: the ingredients for heterogeneous / over-the-internet training. We measured the benefits and the costs. Net: the method is real and democratising (lets weak/cheap/mixed hardware participate), but does not beat a modern GPU on efficiency. It changes *what hardware can take part*, not the physics of matmul-per-watt.
## What we measured
**1. Activation-memory reduction is real.** With a 1-layer active window, activation memory falls ~6x at 16 layers and up to ~20x at 64 layers; sparse top-1 MoE reaches a ~40x activation ratio. Total VRAM reduction is bounded (~4-4.6x) because weights + optimizer state stay resident.
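Why total savings plateau while activation savings keep growing can be seen with a little arithmetic. The sketch below assumes memory splits into a resident part (weights + optimizer state) that dblock leaves untouched and an activation part that shrinks by the activation ratio. The function name and the 1:3 resident-to-activation split are illustrative assumptions, not measured values.

```python
def total_vram_reduction(resident_gb: float, activation_gb: float,
                         activation_ratio: float) -> float:
    """Overall memory reduction when only activations shrink by `activation_ratio`."""
    before = resident_gb + activation_gb
    after = resident_gb + activation_gb / activation_ratio
    return before / after


# Hypothetical split: 1 GB of resident state per 3 GB of full-depth activations.
for ratio in (6, 20, 40):
    print(ratio, round(total_vram_reduction(1.0, 3.0, ratio), 2))
```

As the activation ratio grows, the result approaches (resident + activations) / resident, so under this model the ceiling is set by how much of the footprint was resident to begin with.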
**2. Compression "quality loss" is mostly under-training.** Short runs (180 steps) suggested more compression = worse; trained to 960 steps, 93-97% of the gap closed. The memory lever is *more compression*, not necessarily MoE.
**3. MoE's edge narrows with training.** 21% win at 180 steps -> ~10% by 960, at +49% resident FFN params. Experimental, not a default.
**4. Parallel/decoupled training: throughput rises, pay a convergence tax.** Aggregate throughput does increase with workers, but matched on updates, fully-parallel "Jacobi" needed ~2.7x more steps than sequential "Gauss-Seidel". Raw throughput != speedup. From a warm checkpoint the penalty shrinks -> ~1.3-1.6x real speedup.
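The gap between raw throughput and real speedup is a single division. A minimal sketch, assuming wall-clock speedup is the aggregate throughput gain divided by the extra steps needed to reach the same quality. The function name and the example throughput gains are hypothetical; only the 2.7x step inflation comes from the measurement above.

```python
def effective_speedup(throughput_gain: float, step_inflation: float) -> float:
    """Wall-clock speedup once the convergence tax is paid.

    throughput_gain: aggregate steps/sec relative to the sequential baseline.
    step_inflation:  steps needed relative to the sequential baseline.
    """
    return throughput_gain / step_inflation


# With the measured 2.7x step inflation, parallel training only breaks even
# once aggregate throughput also rises by 2.7x.
for gain in (2.0, 2.7, 4.0):  # hypothetical throughput gains
    print(gain, round(effective_speedup(gain, 2.7), 2))
```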
**5. The synchronisation-frequency law (key for internet training).** Quality tax vs sync interval H: {4:2%, 8:6%, 16:13%, 64:26%, 128:27%, 256:300%, never:collapse}. Wide usable band -> gentle slope -> sharp cliff once you sync only ~once per run. Communicate rarely (DiLoCo-style), but never *never*.
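One way to use these measurements is as a lookup when choosing H. The sketch below copies the measured points into a table and returns the largest interval that stays within a given quality budget. The helper name is hypothetical, and it does not interpolate between measured points.

```python
# Measured quality tax (percent) by sync interval H, copied from item 5.
QUALITY_TAX_PCT = {4: 2, 8: 6, 16: 13, 64: 26, 128: 27, 256: 300}


def largest_interval_within(budget_pct: float):
    """Return the largest measured H whose tax is within budget, or None."""
    ok = [h for h, tax in QUALITY_TAX_PCT.items() if tax <= budget_pct]
    return max(ok) if ok else None


for budget in (5, 15, 30):
    h = largest_interval_within(budget)
    # Syncing every H steps means roughly 1/H as many sync rounds.
    print(f"budget {budget}% -> H={h}, sync rounds x{1 / h:.4f}")
```

With a 30% budget the helper picks H=128, not 256. That is the cliff in the table: doubling the interval from 128 to 256 moves the tax from 27% to 300%.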
**6. Heterogeneous hardware works mechanically.** The same decoupled trainer ran with layer-windows split across a CPU, an NVIDIA consumer GPU, and an Intel integrated GPU in one process; sequential trajectory matched single-GPU exactly. (Some backends need a finite large-negative attention mask, not -inf.)
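The notes do not say why some backends reject `-inf` masks. One common way `-inf` goes wrong, shown here as an assumption rather than the confirmed failure, is a fully masked row: subtracting the row maximum gives `-inf - (-inf) = NaN`. The pure-Python sketch below contrasts that with a finite large-negative mask. `MASK_VALUE = -1e9` is an illustrative choice; half-precision dtypes need a smaller magnitude.

```python
import math

NEG_INF = float("-inf")
MASK_VALUE = -1e9  # illustrative finite stand-in for -inf


def softmax(scores):
    """Numerically stabilised softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Fully masked row: -inf yields NaN, the finite mask yields a uniform row.
print(softmax([NEG_INF] * 3))
print(softmax([MASK_VALUE] * 3))

# Partially masked row: the masked position underflows to zero either way.
print(softmax([0.5, MASK_VALUE, 1.5]))
```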
**7. Economics, anchored to a measured number.** A single RTX 4090 training a ~700M model measured ~3,136 tok/s (this corrected an earlier extrapolation that was ~25-30x too high; *measure, don't guess*). Cheap used consumer GPUs match/beat a 4090 on compute-per-£ (3070 ~0.40, 3060 ~0.26, 4090 ~0.25 TFLOPS/£); a CPU pool is 10-50x worse on £ and energy/token. So "any hardware can join" is real, and the efficient version is pooling cheap small-VRAM GPUs (each a slice over a normal network, no NVLink), not CPUs. CPUs only make sense when genuinely free/idle at volunteer scale.
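To make the measured throughput concrete, the arithmetic below converts it into tokens per day and into wall-clock days for a token budget. Only the 3,136 tok/s figure comes from the measurement. The 1B-token budget and the function name are hypothetical, and the calculation assumes throughput stays constant.

```python
MEASURED_TOK_PER_S = 3136  # single RTX 4090, ~700M model (figure from the text)
SECONDS_PER_DAY = 86_400

tokens_per_day = MEASURED_TOK_PER_S * SECONDS_PER_DAY


def days_to_process(total_tokens: float,
                    tok_per_s: float = MEASURED_TOK_PER_S) -> float:
    """Wall-clock days to push `total_tokens` through training at a fixed rate."""
    return total_tokens / tok_per_s / SECONDS_PER_DAY


print(f"{tokens_per_day:,} tokens/day")
print(f"{days_to_process(1e9):.2f} days for a hypothetical 1B-token budget")
```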
## The bug worth sharing, and how NOT to diagnose it
A mature ~700M model (older checkpoint head-schema) was ported onto a newer training framework. Result: **catastrophic loss ~10x random**. The model appeared "loaded and training" but produced garbage and degraded its own weights.
**The instructive part is the diagnostic failure.** We proposed two confident root causes and both were wrong:
1. *"It's the attention algorithm (windowed vs full)."* Wrong: logs showed both the broken and the working run used the **same** (dense) attention backend.
2. *"It's a degraded checkpoint."* Wrong: that checkpoint loaded fine once the right flag was set.
Both errors came from the **same methodological sin: changing several variables at once** (checkpoint, sequence length, a compatibility flag, the attention backend) and attributing the outcome to one of them.
**Actual root cause (found by a clean single-variable test):** the run omitted a **checkpoint-compatibility flag** that tells the new framework how to interpret the *old model's head schema* on load. Without it, the heads/weights are mis-mapped -> confidently-wrong outputs -> ~10x-random loss. Holding everything else constant and toggling *only that flag* moved loss from ~113 to ~8. Every run with the flag set landed at sane loss (8-19) regardless of attention/checkpoint/length/dblock; the only catastrophic run was the one missing it.
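The single-variable comparison described above can be generated mechanically. A minimal sketch: start from one known-good baseline config and emit variants that each differ from it in exactly one key. Every config key and value here (`compat_flag`, `ckpt_a`, and so on) is a hypothetical placeholder, not the framework's real option names.

```python
def single_variable_variants(baseline: dict, alternatives: dict) -> list:
    """Return (changed_key, config) pairs, each differing from baseline in one key."""
    variants = []
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to baseline, so not a test of anything
            variant = dict(baseline)  # copy so the baseline is never mutated
            variant[key] = value
            variants.append((key, variant))
    return variants


# Hypothetical configs mirroring the four variables that were confounded.
baseline = {"compat_flag": True, "attention": "dense",
            "seq_len": 1024, "checkpoint": "ckpt_a"}
alternatives = {"compat_flag": [False], "attention": ["windowed"],
                "seq_len": [2048], "checkpoint": ["ckpt_b"]}

for changed, cfg in single_variable_variants(baseline, alternatives):
    print(changed, cfg)
```

If only one variant's loss departs from the baseline, the changed key names the cause directly, with no attribution step left to get wrong.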
**Generalisable lessons:**
1. **Gate any checkpoint migration on the loss value**, not "it ran." A from-scratch or mis-loaded run also "trains and saves cleanly."
2. Random-init cross-entropy is ln(vocab). A loss far *above* that means weights are loaded but mis-mapped (confidently wrong), a different failure than "didn't load."
3. **Isolate ONE variable per test.** We violated this three times and mis-attributed the cause twice; the answer only appeared from a strict single-variable comparison. An independent reviewer rejecting the first wrong claim is what forced the proper isolation: adversarial review works.
4. Cross-framework checkpoint ports hinge on **schema/compatibility flags** (head layout, tokenizer, optimizer-state format), not just tensor shapes. Optimizer formats differ across versions (e.g. Adafactor vs AdamW), so reset optimizer state on migration and keep weights + step/token counters.
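Lessons 1 and 2 can be turned into an automatic check that runs right after a migrated checkpoint's first evaluation. The sketch below compares the observed loss with ln(vocab) and fails the migration when the loss is far above it. The function names, the 3x threshold and the 50,000-token vocabulary are assumptions for illustration; the notes do not state the real vocabulary size.

```python
import math


def random_init_loss(vocab_size: int) -> float:
    """Cross-entropy of a uniform guess over the vocabulary, i.e. ln(vocab)."""
    return math.log(vocab_size)


def passes_migration_gate(loss: float, vocab_size: int,
                          far_above_factor: float = 3.0) -> bool:
    """False when loss is far above ln(vocab), the mis-mapped-weights signature."""
    return loss <= far_above_factor * random_init_loss(vocab_size)


VOCAB = 50_000  # hypothetical vocabulary size
for observed in (8.0, 19.0, 113.0):  # loss values reported in the notes
    ratio = observed / random_init_loss(VOCAB)
    verdict = "ok" if passes_migration_gate(observed, VOCAB) else "FAIL"
    print(f"loss={observed}: {ratio:.1f}x ln(vocab), {verdict}")
```

A gate like this only catches the confidently-wrong case. A run sitting near ln(vocab) still needs a comparison against the source framework's loss on the same batch.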
## Open
- A complete multi-node dblock run with end-to-end **quality** measurement: the next experiment, now that the baseline is fixed.
- dblock at the model's native context length at full attention is memory-heavy on CPU; better on a GPU.
- Whether a disaggregated cheap-hardware swarm ever beats one good GPU: for owned hardware, a research demo, not a throughput win.
## Bottom line
Diffusion-block training is a legitimate primitive for heterogeneous, low-communication, fault-tolerant training, in the family of DiLoCo, SWARM and Hivemind. It democratises *participation* within a bounded sync band, at a measured convergence tax. It does not repeal the efficiency advantage of purpose-built accelerators. And the debugging story is its own lesson: gate on loss, and change one variable at a time.