| # Diffusion-Block Disaggregated Training & a Catastrophic Port Bug: Research Notes |
|
|
| Date: 2026-06-01 (rev 2, corrected root cause) · Status: method validated; a catastrophic port bug found, mis-diagnosed twice, then properly isolated and fixed. |
|
|
| A study of whether **diffusion-block (dblock) training**, which trains a *local window* of transformer layers from a detached snapshot instead of holding the full-depth backward graph, can decouple model training across heterogeneous hardware and slow networks. Includes an honest account of root-causing a catastrophic loss bug while porting an older checkpoint onto a newer framework. |
|
|
| ## TL;DR |
| dblock makes the backward graph **local**, decoupling training across depth. That gives real activation-memory savings, low cross-node communication, and tolerance to rare synchronisation: the ingredients for heterogeneous / over-the-internet training. We measured the benefits and the costs. Net: the method is real and democratising (it lets weak/cheap/mixed hardware participate), but it does not beat a modern GPU on efficiency. It changes *what hardware can take part*, not the physics of matmul-per-watt. |
|
|
| ## What we measured |
| **1. Activation-memory reduction is real.** 1-layer active window: ~6x@16 layers up to ~20x@64 layers; sparse top-1 MoE ~40x activation ratio. Total VRAM reduction bounded (~4-4.6x) by resident weights + optimizer state. |
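The gap between the large activation ratio and the bounded total VRAM ratio follows from simple accounting: only the activation term shrinks, while weights and optimizer state stay resident. The sketch below assumes a two-term memory model. The function name and the 4 GB / 16 GB split are hypothetical, chosen only for illustration, and are not measurements from these notes.

```python
def total_vram_reduction(resident_gb: float, activation_gb: float,
                         activation_ratio: float) -> float:
    """Overall memory reduction when only activations shrink by activation_ratio."""
    if activation_ratio <= 0:
        raise ValueError("activation_ratio must be positive")
    before = resident_gb + activation_gb
    after = resident_gb + activation_gb / activation_ratio
    return before / after


# Hypothetical split: 4 GB resident (weights + optimizer state), 16 GB activations.
# However large the activation ratio gets, the total is capped at (4 + 16) / 4 = 5x.
for ratio in (6, 20, 40):
    print(ratio, round(total_vram_reduction(4.0, 16.0, ratio), 2))
```

Under this model the ceiling is `(resident + activations) / resident`, so pushing the activation ratio higher yields diminishing returns once the resident term dominates.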
|
|
| **2. Compression "quality loss" is mostly under-training.** Short runs (180 steps) suggested more compression = worse; trained to 960 steps, 93-97% of the gap closed. The memory lever is *more compression*, not necessarily MoE. |
|
|
| **3. MoE's edge narrows with training.** 21% win at 180 steps -> ~10% by 960, at +49% resident FFN params. Experimental, not a default. |
|
|
| **4. Parallel/decoupled training: throughput rises, pay a convergence tax.** Aggregate throughput does increase with workers, but matched on updates, fully-parallel "Jacobi" needed ~2.7x more steps than sequential "Gauss-Seidel". Raw throughput != speedup. From a warm checkpoint the penalty shrinks -> ~1.3-1.6x real speedup. |
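The point that raw throughput is not speedup can be written as one division: the wall-clock gain is the throughput gain divided by the extra updates the parallel schedule needs. This is a minimal sketch. The function name is hypothetical, and the 4x throughput figure is an illustrative assumption; only the ~2.7x step tax comes from the measurement above.

```python
def effective_speedup(throughput_gain: float, step_tax: float) -> float:
    """Wall-clock speedup to reach matched quality.

    throughput_gain: updates/sec of the parallel run relative to sequential.
    step_tax: how many times more updates the parallel schedule needs.
    """
    if throughput_gain <= 0 or step_tax <= 0:
        raise ValueError("both arguments must be positive")
    return throughput_gain / step_tax


# Hypothetical 4x aggregate throughput against the measured ~2.7x step tax.
print(round(effective_speedup(4.0, 2.7), 2))
# Break-even: a throughput gain equal to the step tax buys nothing.
print(effective_speedup(2.7, 2.7))
```

Any throughput gain below the step tax makes the parallel run slower in wall-clock terms, which is why the tax has to be measured at matched quality and not inferred from tokens per second.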
|
|
| **5. The synchronisation-frequency law (key for internet training).** Quality tax vs sync interval H: {4:2%, 8:6%, 16:13%, 64:26%, 128:27%, 256:300%, never:collapse}. Wide usable band -> gentle slope -> sharp cliff once you sync only ~once per run. Communicate rarely (DiLoCo-style), but never *never*. |
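One way to use the measured table is as a lookup: given a quality-tax budget, pick the largest sync interval that stays inside it. The sketch below only selects among the measured points and does not interpolate between them. The function name is hypothetical, and the "never" row is omitted because it has no finite tax.

```python
# Quality tax by sync interval H, copied from the measurements above
# (0.02 means 2%, 3.00 means 300%).
MEASURED_TAX = {4: 0.02, 8: 0.06, 16: 0.13, 64: 0.26, 128: 0.27, 256: 3.00}


def largest_interval_within(budget: float, table=MEASURED_TAX):
    """Return the largest measured H whose tax is <= budget, or None if none fit."""
    affordable = [h for h, tax in table.items() if tax <= budget]
    return max(affordable) if affordable else None


print(largest_interval_within(0.15))  # budget of 15% quality tax
print(largest_interval_within(0.30))  # budget of 30% quality tax
```

The jump from H=128 to H=256 is the cliff: no budget between 27% and 300% buys a longer interval.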
|
|
| **6. Heterogeneous hardware works mechanically.** The same decoupled trainer ran with layer-windows split across a CPU, an NVIDIA consumer GPU, and an Intel integrated GPU in one process; sequential trajectory matched single-GPU exactly. (Some backends need a finite large-negative attention mask, not -inf.) |
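The mask remark can be illustrated in plain Python. One common way `-inf` masks go wrong is a fully masked row, where the softmax computes `-inf - (-inf)` and returns NaN; a finite large-negative fill avoids that. This is a sketch of the numerical effect only. The notes do not say which backend needed the change or why, and `NEG_LARGE` and the helper names are hypothetical.

```python
import math

NEG_LARGE = -1e9  # hypothetical fill; in float16 use a value inside the dtype's range


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, keep, fill):
    """Replace masked-out scores with `fill`, then apply softmax."""
    return softmax([s if k else fill for s, k in zip(scores, keep)])


scores = [0.5, 1.5, -0.2]
partly = masked_softmax(scores, [True, True, False], NEG_LARGE)
all_masked_inf = masked_softmax(scores, [False] * 3, float("-inf"))
all_masked_finite = masked_softmax(scores, [False] * 3, NEG_LARGE)

print(partly)             # masked position gets zero weight
print(all_masked_inf)     # NaN in every position
print(all_masked_finite)  # uniform weights, no NaN
```

With a finite fill, a fully masked row degrades to uniform attention instead of poisoning the forward pass with NaN, which is usually the easier failure to live with.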
|
|
| **7. Economics, anchored to a measured number.** A single RTX 4090 training a ~700M model measured ~3,136 tok/s (this corrected an earlier extrapolation that was ~25-30x too high: *measure, don't guess*). Cheap used consumer GPUs match/beat a 4090 on compute-per-£ (3070 ~0.40, 3060 ~0.26, 4090 ~0.25 TFLOPS/£); a CPU pool is 10-50x worse on £ and energy/token. So "any hardware can join" is real, and the efficient version is pooling cheap small-VRAM GPUs (each a slice over a normal network, no NVLink), not CPUs. CPUs only make sense when genuinely free/idle at volunteer scale. |
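The measured throughput and the compute-per-pound figures lend themselves to quick arithmetic. The sketch below uses only numbers quoted in the paragraph above. The helper names are hypothetical, and the one-billion-token budget is an illustrative assumption, not a planned run.

```python
MEASURED_TOK_PER_S = 3136  # single RTX 4090, ~700M model, from the notes
TFLOPS_PER_GBP = {"3070": 0.40, "3060": 0.26, "4090": 0.25}  # approximate, from the notes

SECONDS_PER_DAY = 86_400


def days_to_train(tokens: float, tok_per_s: float = MEASURED_TOK_PER_S) -> float:
    """Wall-clock days to process `tokens` at a constant throughput."""
    return tokens / tok_per_s / SECONDS_PER_DAY


def relative_value(card: str, baseline: str = "4090") -> float:
    """Compute-per-pound of `card` relative to `baseline`."""
    return TFLOPS_PER_GBP[card] / TFLOPS_PER_GBP[baseline]


# Hypothetical budget of one billion tokens on the measured single-GPU rate.
print(round(days_to_train(1e9), 2))
for card in TFLOPS_PER_GBP:
    print(card, round(relative_value(card), 2))
```

An extrapolation that is 25-30x too high turns a multi-day estimate into a few hours, which is why the measured figure anchors everything else in this section.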
|
|
| ## The bug worth sharing, and how NOT to diagnose it |
| A mature ~700M model (older checkpoint head-schema) was ported onto a newer training framework. Result: **catastrophic loss ~10x random**. The model appeared "loaded and training" but produced garbage and degraded its own weights. |
|
|
| **The instructive part is the diagnostic failure.** We proposed two confident root causes and both were wrong: |
| 1. *"It's the attention algorithm (windowed vs full)."* Wrong: logs showed both the broken and the working run used the **same** (dense) attention backend. |
| 2. *"It's a degraded checkpoint."* Wrong: that checkpoint loaded fine once the right flag was set. |
|
|
| Both errors came from the **same methodological sin: changing several variables at once** (checkpoint, sequence length, a compatibility flag, the attention backend) and attributing the outcome to one of them. |
|
|
| **Actual root cause (found by a clean single-variable test):** the run omitted a **checkpoint-compatibility flag** that tells the new framework how to interpret the *old model's head schema* on load. Without it, the heads/weights are mis-mapped -> confidently-wrong outputs -> ~10x-random loss. Holding everything else constant and toggling *only that flag* moved loss from ~113 to ~8. Every run with the flag set landed at sane loss (8-19) regardless of attention/checkpoint/length/dblock; the only catastrophic run was the one missing it. |
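A single-variable comparison can be enforced mechanically by diffing the two run configurations before trusting the result. The sketch below is one way to do that, not the harness used for these runs. The config keys and values (`checkpoint`, `seq_len`, `attention`, `compat_flag`) are hypothetical stand-ins for the four variables named above.

```python
_MISSING = object()  # sentinel so a key present in only one config counts as a change


def changed_keys(baseline: dict, candidate: dict) -> set:
    """Return the config keys whose values differ between two runs."""
    keys = baseline.keys() | candidate.keys()
    return {k for k in keys
            if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)}


def require_single_variable(baseline: dict, candidate: dict) -> str:
    """Return the one changed key, or raise if the comparison is confounded."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly 1 changed variable, got {sorted(diff)}")
    return next(iter(diff))


# Hypothetical run configs; key names and values are illustrative only.
baseline = {"checkpoint": "old", "seq_len": 2048,
            "attention": "dense", "compat_flag": False}
clean_test = dict(baseline, compat_flag=True)
confounded = dict(baseline, compat_flag=True, seq_len=4096)

print(require_single_variable(baseline, clean_test))
```

Calling `require_single_variable(baseline, confounded)` raises, which is the point: a comparison that changes two things at once is rejected before anyone can attribute the outcome to either.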
|
|
| **Generalisable lessons:** |
| 1. **Gate any checkpoint migration on the loss value**, not "it ran." A from-scratch or mis-loaded run also "trains and saves cleanly." |
| 2. Random-init cross-entropy is ln(vocab). A loss far *above* that means weights are loaded but mis-mapped (confidently wrong), a different failure than "didn't load." |
| 3. **Isolate ONE variable per test.** We violated this three times and mis-attributed the cause twice; the answer only appeared from a strict single-variable comparison. An independent reviewer rejecting the first wrong claim is what forced the proper isolation: adversarial review works. |
| 4. Cross-framework checkpoint ports hinge on **schema/compatibility flags** (head layout, tokenizer, optimizer-state format), not just tensor shapes. Optimizer-state formats differ across optimizers and versions (e.g. Adafactor vs AdamW), so reset the optimizer state on migration and keep the weights + step/token counters. |
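Lessons 1 and 2 combine into a small gate that compares the first logged loss against the uniform-prediction baseline ln(vocab). The sketch below is a minimal version under stated assumptions: the thresholds `near` and `far`, the verdict strings, and the 80,000-token vocabulary are all hypothetical, since the notes do not state the vocabulary size. If the reported loss includes terms beyond token cross-entropy, the baseline needs adjusting.

```python
import math


def gate_on_loss(loss: float, vocab_size: int,
                 near: float = 0.1, far: float = 3.0) -> str:
    """Classify a post-load loss relative to the random-init baseline ln(vocab).

    near: relative band around the baseline treated as "random-like" (hypothetical).
    far:  multiple of the baseline treated as "confidently wrong" (hypothetical).
    """
    baseline = math.log(vocab_size)
    ratio = loss / baseline
    if ratio >= far:
        return "mis-mapped"        # loaded, but confidently wrong
    if abs(ratio - 1.0) <= near:
        return "random-like"       # looks like weights did not load
    if ratio < 1.0:
        return "plausibly-loaded"  # better than uniform guessing
    return "above-random"          # worse than uniform: inspect before training


VOCAB = 80_000  # hypothetical vocabulary size, not taken from the notes
print(gate_on_loss(113.0, VOCAB))
print(gate_on_loss(8.0, VOCAB))
print(gate_on_loss(math.log(VOCAB), VOCAB))
```

Running the gate on the first evaluation batch after loading, before any optimizer step, keeps a mis-mapped checkpoint from degrading its own weights.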
|
|
| ## Open |
| - A complete multi-node dblock run with end-to-end **quality** measurement: the next experiment, now that the baseline is fixed. |
| - dblock at the model's native context length at full attention is memory-heavy on CPU; it is better run on a GPU. |
| - Whether a disaggregated cheap-hardware swarm ever beats one good GPU: for owned hardware it is a research demo, not a throughput win. |
|
|
| ## Bottom line |
| Diffusion-block training is a legitimate primitive for heterogeneous, low-communication, fault-tolerant training, in the family of DiLoCo, SWARM and Hivemind. It democratises *participation* within a bounded sync band, at a measured convergence tax. It does not repeal the efficiency advantage of purpose-built accelerators. And the debugging story is its own lesson: gate on loss, and change one variable at a time. |
|
|