Baladithya Balamurugan

Wave 21: deep-read critical review — 8 source clusters re-read, findings verified

	# Deep-read critical review: DiLoCo / Streaming DiLoCo family
	Date: 2026-06-09
	Sources read verbatim:
	- arXiv:2311.08105 (DiLoCo, Douillard et al. 2023/2024, HTML full text, 9503 words)
	- arXiv:2501.18512 (Streaming DiLoCo, Douillard et al. 2025, HTML full text, 19297 words)
	- arXiv:2502.12996 (Eager Updates, Kale, Douillard, Donchev 2025, abstract)
	- torchft/local_sgd.py (live main branch, fetched 2026-06-10)

	Repo artefacts cross-checked:
	- `composer_replication/diloco/__init__.py`
	- `composer_replication/diloco/serverless/allreduce.py`
	- `docs/adrs/ADR-003-diloco-impl.md`
	- `docs/adrs/ADR-005-serverless-diloco.md`
	- `research/02-diloco-family.md`
	- `research/design-F4-decoupled-diloco-s3.md`

	---

	## 1. Primary-source extraction: DiLoCo (arXiv:2311.08105)

	### 1.1 Algorithm

	Algorithm 1 (paper §2) is:

	1. Outer step t = 1…T
	2. Each worker i: θ_i^(t) ← θ^(t-1) (re-initialized from global params)
	3. H inner steps: θ_i^(t) ← InnerOpt(θ_i^(t), ∇L) using AdamW
	4. Outer gradient: Δ^(t) = (1/k) Σ_i ( θ^(t-1) − θ_i^(t) )
	5. Outer update: θ^(t) ← OuterOpt(θ^(t-1), Δ^(t))

	Sign convention (paper, verbatim): "the delta in parameters space…is computed per worker…Δ^(t) ← (1/k) Σ_i (θ^(t-1) − θ_i^(t))". This is θ_initial minus θ_local — the negative of the local update direction.
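The sign convention can be checked with a minimal pure-Python sketch of one outer round. It assumes parameters stored as plain lists of floats and skips the inner loop entirely (worker parameters are given as inputs; the paper uses AdamW there). `NesterovSGD`, `outer_gradient` and `outer_round` are illustrative names, not taken from the paper or from torchft, and the Nesterov update follows the PyTorch-style formulation.

```python
from typing import List

Params = List[float]


class NesterovSGD:
    """Outer optimizer sketch: SGD with Nesterov momentum (PyTorch-style formulation)."""

    def __init__(self, lr: float = 0.7, momentum: float = 0.9) -> None:
        self.lr = lr
        self.momentum = momentum
        self.buf: Params = []

    def step(self, theta: Params, grad: Params) -> Params:
        if not self.buf:
            self.buf = [0.0] * len(theta)
        # Momentum buffer: buf <- momentum * buf + grad
        self.buf = [self.momentum * b + g for b, g in zip(self.buf, grad)]
        # Nesterov look-ahead: effective gradient is grad + momentum * buf
        eff = [g + self.momentum * b for g, b in zip(grad, self.buf)]
        return [t - self.lr * e for t, e in zip(theta, eff)]


def outer_gradient(theta_prev: Params, worker_params: List[Params]) -> Params:
    """Delta = (1/k) * sum_i (theta_prev - theta_i): initial minus local."""
    k = len(worker_params)
    return [
        sum(theta_prev[j] - w[j] for w in worker_params) / k
        for j in range(len(theta_prev))
    ]


def outer_round(theta_prev: Params, worker_params: List[Params], opt: NesterovSGD) -> Params:
    """One outer update: theta <- OuterOpt(theta_prev, Delta)."""
    return opt.step(theta_prev, outer_gradient(theta_prev, worker_params))


# Two workers moved a single parameter from 1.0 down to 0.8 and 0.6.
theta = [1.0]
workers = [[0.8], [0.6]]
delta = outer_gradient(theta, workers)  # positive: opposite to the local update direction
plain = outer_round(theta, workers, NesterovSGD(lr=1.0, momentum=0.0))
```

With `lr=1.0` and `momentum=0.0` the outer step reduces to plain parameter averaging (`plain` is approximately `[0.7]`), which confirms that subtracting Δ moves θ toward the workers.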

	### 1.2 Outer optimizer hyperparameters (paper §3.2 ablation)

	Paper reports tuning outer optimizer across SGD, SGDM, Nesterov, Adam. Results in Table 5:

	> "the setting with outer learning rate equal to 0.7 and outer momentum equal to 0.9 is very robust, and it is adopted for all our experiments throughout."

	Chosen values (bold in Table 5): outer LR = 0.7, outer momentum = 0.9, Nesterov.

	### 1.3 Sync frequency H

	Figure 4 sweeps H ∈ {50, 100, 250, 500, 1000, 2000}. Main experiment default: H = 500.

	> "communicating more frequently than H = 500 steps leads to diminishing returns. Moreover, the performance degradation is very mild up to H = 1000 steps."

	H = 2000 causes meaningful degradation. H = 500 chosen as best trade-off.

	### 1.4 Heterogeneous / unreliable workers

	Section 3.1 "Adaptive compute pool": workers can join/leave; final perplexity tracks total compute budget regardless of schedule. "Models quality is affected by the total amount of compute, but not as much by how such computed is allocated over time."

	Section 3.1 "Asynchronous Communication": outer gradients dropped with probability up to 50%. At 50% drop rate, perplexity degrades only 2.1% relative to perfect communication.

	Limitation (§5): "The version of DiLoCo presented here assumes that all workers are homogeneous. However, in practice workers might operate at wildly different speed." Heterogeneous/async DiLoCo is explicitly listed as future work.

	### 1.5 What the original paper does NOT cover

	- No scaling laws for optimal H vs model size.
	- No quantization (the appendix covers pruning only: 50% sign-pruning costs just +0.39% PPL).
	- Models only up to 400M params.
	- All experiments start from a 24k-step pretrained checkpoint (cold-start is studied but as secondary).
	- No measurement of real wall-clock with actual network constraints.

	---

	## 2. Primary-source extraction: Streaming DiLoCo (arXiv:2501.18512)

	### 2.1 Three contributions

	1. Fragment streaming (§2.2): Partition model into P fragments. Each fragment syncs every H steps, but fragments are staggered (offsets t_p). Peak bandwidth scales by |p|/L (fragment size over total layers).

	2. Overlapping communication with computation (§2.3): τ parameter (inner overlap delay). Fragment's allreduce initiated at step t, results applied τ steps later. Workers keep training during those τ steps. For heterogeneous workers: use per-worker τ values. Robust to τ up to ~5 inner steps.

	3. Low-precision outer gradients (§2.4): Outer gradients (what's communicated, not optimizer state) quantized to FP4 = E3M0 (1 sign bit, 3 exponent bits, 0 mantissa bits). Accumulation in FP32. No regression found at 4-bit. Applied at send time, before allreduce.

	Combined result: "reducing required bandwidth by two orders of magnitude" (abstract). Table 1 shows Data-Parallel 441 TB vs Streaming DiLoCo 1.10 TB (≈400× total bits reduction for 1B model).
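The ≈400× figure follows directly from the two Table 1 totals quoted above. A short check, using only those two numbers:

```python
data_parallel_tb = 441.0     # Table 1, Data-Parallel
streaming_diloco_tb = 1.10   # Table 1, Streaming DiLoCo

total_reduction = data_parallel_tb / streaming_diloco_tb
meets_two_orders = total_reduction >= 100  # "two orders of magnitude" in the abstract

print(f"total bits reduction: {total_reduction:.0f}x")  # prints "total bits reduction: 401x"
```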

	### 2.2 Outer hyperparameters in Streaming DiLoCo

	> "The main hyperparameter of DiLoCo is its outer learning rate; we tuned it to be optimal at small scale at 0.4, and kept it fixed across all scales."

	Streaming paper uses outer LR = 0.4, not 0.7. Momentum: not explicitly restated; inherits from original paper's Nesterov momentum = 0.9.

	H values in experiments: H = 30 and H = 100 (not H = 500 from the original paper).

	### 2.3 Fragment scheduling

	Algorithm 2 uses the condition `t - t_p mod H == 0`, read as `(t - t_p) mod H == 0`, to decide which fragment syncs at step t. Staggered offsets allow continuous streaming. "As we increase model scale, the fragment definition…is maintained, which means that larger models have more fragments."
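The scheduling condition can be illustrated with a small sketch. The helper name `fragments_due` and the evenly spaced offsets are assumptions for illustration; the paper does not prescribe this exact offset choice here.

```python
from typing import Dict, List


def fragments_due(t: int, offsets: List[int], H: int) -> List[int]:
    """Return the indices p of fragments with (t - t_p) mod H == 0 at inner step t."""
    return [p for p, t_p in enumerate(offsets) if (t - t_p) % H == 0]


H = 4
offsets = [0, 2]  # hypothetical evenly spaced offsets for P = 2 fragments
schedule: Dict[int, List[int]] = {t: fragments_due(t, offsets, H) for t in range(1, 9)}
# schedule == {1: [], 2: [1], 3: [], 4: [0], 5: [], 6: [1], 7: [], 8: [0]}
```

Each fragment still syncs once every H steps, but some fragment is in flight every H/P steps, which is what spreads the traffic out.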

	### 2.4 Heterogeneous workers

	§3.3.2 (Overlapping with slack between workers): per-worker τ_m handles execution speed differences. "the loss degradation is limited under a delay of up to 5 inner steps." Above τ ≈ 5, degradation increases.

	### 2.5 Memory overhead

	DiLoCo needs 66% more memory than Data-Parallel (a copy of the outer parameters plus Nesterov state). For Streaming, only the active fragment's outer state needs to be in HBM; the rest can live on CPU. For a 100B model with a 3-layer fragment out of 108 layers, that is ~2% additional memory.
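The ~2% figure is consistent with the 66% overhead scaled by the fragment's share of layers. The check below assumes outer state is spread evenly across layers, which is a simplification:

```python
full_overhead = 0.66   # extra memory vs. Data-Parallel, as quoted above
fragment_layers = 3
total_layers = 108

# Assumption: outer state is spread evenly over layers, so keeping one
# fragment resident in HBM costs its proportional share of the full overhead.
resident_overhead = full_overhead * fragment_layers / total_layers

print(f"{resident_overhead:.1%}")  # prints "1.8%"
```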

	### 2.6 "Liu et al. 2024a" citation in Streaming paper

	The Streaming paper references "Liu et al. (2024a)" = Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato. "Asynchronous local-sgd training for language modeling." arXiv:2401.09135. This is cited in the context of per-worker slack for heterogeneous workers.

	---

	## 3. Primary-source extraction: torchft/local_sgd.py

	### 3.1 Sign convention (verbatim from `_save_grads`)

	```python
	def _save_grads(self) -> None:
	    with torch.no_grad():
	        for name, p in self._model_fragment.named_parameters():
	            ...
	            pseudogradient = (
	                self.original_parameters[name].to(p.device) - local_param
	            )
	            self._grads[name] = pseudogradient
	```

	`original_parameters[name]` = θ_initial, `local_param` = θ_local.
	torchft computes: pseudogradient = θ_initial − θ_local (same sign as paper's Δ).

	### 3.2 Outer optimizer application chain (verbatim from `perform_sync`)

	```
	1. _save_local_parameters() # saves θ_local into _local_parameters
	2. restore_parameters() # p.data ← θ_initial (from original_parameters)
	3. _set_grads() # p.grad ← averaged_pseudogradient
	4. _outer_optimizer.step() # SGD Nesterov: p.data ← θ_initial - lr*Nesterov(avg_pseudograd)
	5. save_parameters() # original_parameters ← p.data (θ_outer_updated)
	6. _merge_parameters() # p.data ← alpha*p.data + (1-alpha)*_local_parameters
	# For alpha=0.0 (vanilla): p.data ← θ_local
	```

	Post-sync state: `p.data = θ_local`, `original_parameters = θ_outer_updated`.
	Next outer round: `restore_parameters()` sets `p.data = θ_outer_updated`; H inner steps produce `θ_local_next`.
	Pseudograd_next = θ_outer_updated − θ_local_next. This is faithful to Algorithm 1.
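The six-step chain can be traced with a single scalar parameter. This is a simplified sketch, not torchft code: the function name is hypothetical, the outer step is plain SGD without momentum, and averaging is done up front across this worker and its peers.

```python
from typing import List, Tuple


def simulate_perform_sync(
    theta_initial: float,
    theta_local: float,
    peer_pseudograds: List[float],
    outer_lr: float = 0.7,
    alpha: float = 0.0,
) -> Tuple[float, float]:
    """Walk the six steps for one scalar; returns (p_data, original_parameters)."""
    # Pseudogradient as in _save_grads: initial minus local.
    own_pseudograd = theta_initial - theta_local
    avg_pseudograd = (own_pseudograd + sum(peer_pseudograds)) / (1 + len(peer_pseudograds))

    local_saved = theta_local              # 1. save θ_local
    p_data = theta_initial                 # 2. restore θ_initial
    grad = avg_pseudograd                  # 3. grad <- averaged pseudogradient
    p_data = p_data - outer_lr * grad      # 4. outer step (plain SGD here, no momentum)
    original = p_data                      # 5. original_parameters <- θ_outer_updated
    p_data = alpha * p_data + (1 - alpha) * local_saved  # 6. merge
    return p_data, original


p_data, original = simulate_perform_sync(1.0, 0.8, peer_pseudograds=[0.4])
```

With `alpha=0.0`, `p_data` ends at θ_local (0.8) while `original` holds the outer-updated value (1.0 − 0.7 × 0.3 = 0.79), matching the post-sync state described above.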

	### 3.3 Fragment rotation logic

	```python
	def _current_fragment(self) -> int:
	    step = self._manager.current_step()
	    return step % len(self._fragments)
	```

	For vanilla (1 fragment): always fragment 0. For Streaming with P fragments: round-robin by `step % P`.
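A standalone illustration of the rotation, using a free function in place of the method (the name `current_fragment` and its arguments are illustrative):

```python
def current_fragment(step: int, num_fragments: int) -> int:
    """Round-robin fragment index, mirroring `step % len(self._fragments)`."""
    return step % num_fragments


vanilla = [current_fragment(s, 1) for s in range(4)]    # [0, 0, 0, 0]
streaming = [current_fragment(s, 3) for s in range(6)]  # [0, 1, 2, 0, 1, 2]
```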

	`start_quorum()` is called at `step = sync_every - fragment_sync_delay` (prepare_sync time).
	`current_step()` is read at `step = sync_every` (perform_sync time).

	### 3.4 `_use_async_quorum` constraint

	`DiLoCo.__init__` raises `ValueError` if `manager._use_async_quorum` is truthy. Must be False for synchronous quorum (as in the paper's design).

	---

	## 4. Repo correctness findings

	### F1. Sign convention — CORRECT

	`__init__.py` lines 17–38 documents the convention: "pseudograd = θ_initial - θ_local (per torchft's `_save_grads()`)". The code's outer SGD uses standard `p.data ← p.data - lr * grad` where `p.data = θ_initial` (restored before outer step) and `grad = avg_pseudograd = avg(θ_initial - θ_local)`. The arithmetic is correct and matches Algorithm 1. The sign-convention test in `spikes/008-streaming-diloco/tests/test_diloco_smoke.py` is appropriate insurance.

	No bug. No mischaracterization.

	### F2. Outer optimizer hyperparams — MINOR DIVERGENCE

	Repo default: `outer_lr=0.7, outer_momentum=0.9, nesterov=True` (lines 69–72 of `__init__.py`). These match the original DiLoCo paper (§3.2, Table 5 bold values).

	The Streaming DiLoCo paper uses `outer_lr=0.4` tuned at small scale. The repo uses original-paper values, which is appropriate for v0.1 (vanilla DiLoCo). But if someone enables Streaming by setting `fragment_sync_delay>0` and `model_fragments=[f0,...,fN]`, they will use lr=0.7 where the Streaming paper used lr=0.4. This is not a correctness bug but may underperform Streaming DiLoCo in practice.

	Recommendation: Add a note in `make_diloco_outer_loop` docstring: "Streaming DiLoCo (Douillard et al. 2025) tunes outer_lr=0.4; the default 0.7 is optimal for vanilla DiLoCo."
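One way to make the distinction harder to miss is to keep the two papers' values as named presets. The sketch below is hypothetical: `OuterHparams` and `pick_hparams` do not exist in the repo, and the preset `sync_every` values follow the papers rather than the repo default of 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OuterHparams:
    outer_lr: float
    outer_momentum: float
    nesterov: bool
    sync_every: int


# Original DiLoCo paper: Table 5 values and the H = 500 main-experiment default.
VANILLA_DILOCO = OuterHparams(outer_lr=0.7, outer_momentum=0.9, nesterov=True, sync_every=500)

# Streaming DiLoCo paper: outer LR tuned to 0.4; H = 100 is one of its two tested values.
# Momentum is not restated there and is assumed to be inherited from the original paper.
STREAMING_DILOCO = OuterHparams(outer_lr=0.4, outer_momentum=0.9, nesterov=True, sync_every=100)


def pick_hparams(num_fragments: int) -> OuterHparams:
    """Hypothetical helper: use Streaming values when more than one fragment is configured."""
    return STREAMING_DILOCO if num_fragments > 1 else VANILLA_DILOCO
```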

	### F3. Default sync_every=100 vs paper's H=500

	Repo default: `sync_every=100` (line 72 of `__init__.py`).

	Original DiLoCo paper main experiment: H=500. OpenDiLoCo: H=125. Streaming paper: H=30 or H=100.

	The default H=100 is consistent with the Streaming paper and OpenDiLoCo but does NOT match the original paper's "main experiment default." The ADR-003 docstring says "Default hyperparams (DiLoCo paper §3.2): outer_lr = 0.7, outer_momentum = 0.9, Nesterov" — this is correct for lr/momentum but the paper's main default H is 500, not 100.

	Not a correctness bug (H=100 is within the paper's tested range and performs well). But citing "DiLoCo paper §3.2" as the source of the H default is misleading, since §3.2 chooses H=500.

	ADR-005 says "H = 500-1000 inner steps" — this is consistent with the original paper's claimed range, though the practical default in code is 100.

	### F4. Author misattribution of Streaming DiLoCo — INCORRECT

	`__init__.py` line 8: `"Streaming DiLoCo (Liu et al. 2025)"`.

	`design-F4-decoupled-diloco-s3.md` line 109: `"Streaming DiLoCo (Liu et al. 2025, 'Eager Updates for Overlapped Communication in DiLoCo', arXiv:2501.18512)"`.

	The attribution is wrong on every count (author, title, and title-to-arXiv-ID pairing):
	- arXiv:2501.18512 is authored by Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, et al. — not "Liu et al."
	- The title of arXiv:2501.18512 is "Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch" — not "Eager Updates for Overlapped Communication in DiLoCo."
	- "Eager Updates for Overlapped Communication and Computation in DiLoCo" is a separate paper, arXiv:2502.12996, by Satyen Kale, Arthur Douillard, Yanislav Donchev — also not "Liu et al."
	- "Liu et al. (2024a)" is the correct citation for the Async Local-SGD paper (arXiv:2401.09135), not Streaming DiLoCo.

	The confusion appears to be: design-F4 imported "Liu et al. 2025" from a different context (possibly confusing with Dr.GRPO or another Liu et al. paper), attached it to the wrong arXiv ID, and applied the wrong title. The `__init__.py` propagated the wrong author name from design-F4.

	Correct attributions:
	- Vanilla DiLoCo: Douillard et al. 2023/2024, arXiv:2311.08105.
	- Streaming DiLoCo: Douillard et al. 2025, arXiv:2501.18512.
	- Eager Updates: Kale, Douillard, Donchev 2025, arXiv:2502.12996 (companion/workshop paper with noted text overlap with 2501.18512).

	Files to fix: `composer_replication/diloco/__init__.py` line 8; `research/design-F4-decoupled-diloco-s3.md` line 109.

	### F5. "Streaming degrades to vanilla" — IMPRECISE BUT DEFENSIBLE

	`design-F4-decoupled-diloco-s3.md` line 116 states:

	> "prepare_sync blocks for the full S3 rendezvous and `fragment_sync_delay` buys zero overlap — Streaming degrades to vanilla, correctly but without the comm/compute overlap benefit."

	This is directionally true but imprecise. With synchronous `ObjectStoreAllReduce`:

	- What is PRESERVED: fragment streaming (per-fragment partial sync, multiple fragments synced in staggered rotation). A 4-fragment model still syncs only 1/4 of parameters per outer step boundary; peak bandwidth is still reduced.
	- What is LOST: the τ (fragment_sync_delay) overlap benefit — inner training steps can no longer run while the allreduce is in flight because the allreduce blocks.

	So the correct characterization is: "synchronous allreduce loses overlap (Contribution 2 of Streaming DiLoCo), but fragment partial sync (Contribution 1) still works. fragment_sync_delay=0 with multiple fragments still gives partial-sync bandwidth savings."

	The statement "degrades to vanilla" implies full-model-sync-at-H, which is not what happens for multi-fragment configurations. However, for the current Spike 008 configuration (single fragment, fragment_sync_delay=0), this degrades to exactly vanilla DiLoCo, which makes the statement accidentally correct for that specific case. The broader claim for multi-fragment Streaming is imprecise.

	design-F4 correctly notes the fix (non-blocking PUT returning deferred Work) and correctly defers it to Phase 5. The characterization is acceptable as long as readers understand it applies to overlap loss, not to fragment streaming.
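The bandwidth half of the argument is simple arithmetic. The sketch below assumes equal-sized fragments and reuses the 1B-parameter bf16 payload from ADR-005; the function name is illustrative.

```python
def bytes_per_sync(num_params: int, bytes_per_param: int, num_fragments: int) -> float:
    """Payload of one fragment sync, assuming equal-sized fragments."""
    return num_params * bytes_per_param / num_fragments


full_model = bytes_per_sync(1_000_000_000, 2, num_fragments=1)   # vanilla: 2 GB per sync
one_of_four = bytes_per_sync(1_000_000_000, 2, num_fragments=4)  # 4 fragments: 0.5 GB per sync
```

The per-sync payload shrinks with the fragment count whether or not the allreduce blocks, which is why only the overlap benefit is lost.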

	### F6. Quantization — NOT IMPLEMENTED, CORRECTLY OMITTED

	`MockManager.allreduce` signature: `def allreduce(self, tensor, **_kwargs)`. The `should_quantize` keyword is silently absorbed by `**_kwargs` and passed to `ObjectStoreAllReduce`, which has no quantization logic. FP32 tensors are serialized directly via `torch.save`.

	The Streaming paper's FP4 (E3M0) outer gradient quantization is not implemented. This is appropriate for v0.1 vanilla scope. The paper shows: E3M0 quantization cuts communicated bits by 8× (FP32→FP4) with no regression. The framework loses this efficiency benefit but is algorithmically correct.

	research/02 line 244 incorrectly describes Streaming DiLoCo as using "FP16 outer state" for compression. The paper uses FP4 outer gradients (what is transmitted), NOT FP16 optimizer state. These are different things. FP16 is used by OpenDiLoCo (separately) to cut payload 2×; Streaming DiLoCo uses FP4 = 8× reduction.
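The 2× and 8× factors follow from bits per transmitted value. A quick check for a 1B-parameter model, in decimal gigabytes (the dictionary keys are just labels):

```python
BITS_PER_VALUE = {"fp32": 32, "fp16": 16, "fp4_e3m0": 4}


def payload_gb(num_params: int, fmt: str) -> float:
    """Size of one full set of outer gradients in decimal gigabytes."""
    return num_params * BITS_PER_VALUE[fmt] / 8 / 1e9


sizes = {fmt: payload_gb(1_000_000_000, fmt) for fmt in BITS_PER_VALUE}
reduction_vs_fp32 = {fmt: sizes["fp32"] / size for fmt, size in sizes.items()}
# sizes == {"fp32": 4.0, "fp16": 2.0, "fp4_e3m0": 0.5}
# reduction_vs_fp32 == {"fp32": 1.0, "fp16": 2.0, "fp4_e3m0": 8.0}
```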

	### F7. H range and sync cadence claims — MINOR INACCURACY in research/02

	`research/02` Table §2 says Streaming DiLoCo uses "Continuous partial" sync frequency. This is accurate. But the bandwidth reduction column "~100× peak BW + frequency" understates the result. The paper's Table 1 shows ≈400× total bits reduction (not 100×) for a 1B model. The "two orders of magnitude" phrasing in the abstract means ≥100×; the measured result is ≈400×.

	### F8. Heterogeneous worker handling — PARTIALLY MISSING in research/02

	`research/02` says Streaming DiLoCo has "better tolerance since communication is continuous, but still synchronous." This misses the per-worker τ mechanism the Streaming paper introduces. The paper explicitly shows that with τ_1=1, varying τ_2 up to 5 inner steps causes only limited loss degradation. This is a first-class mechanism for heterogeneous workers, not just an incidental property.

	The correct characterization: Streaming DiLoCo with per-worker τ tolerates timing heterogeneity of up to ~5 inner steps without significant degradation.

	---

	## 5. ADR-003 correctness summary

	| Claim | Verdict |
	|---|---|
	| torchft computes pseudogradient = θ_initial − θ_local | CORRECT (verified from source) |
	| Outer optimizer sign: no negation needed in our wrapper | CORRECT (SGD subtracts, pseudograd is already "subtract away from θ_local") |
	| fragment_sync_delay > 0 requires CUDA streams | CORRECT (torchft uses `torch.Stream` for overlap; without a stream, overlap is serial) |
	| Spike 008 uses vanilla (single fragment, delay=0) | CORRECT |
	| Streaming is "a configuration-flag away" | CORRECT (same API, just different params) |
	| torchft is Meta-maintained BSD-3 | CORRECT |

	| Gap | Verdict |
	|---|---|
	| Default H=100 attributed to "DiLoCo paper §3.2" | MISLEADING: paper's main default is H=500; H=100 comes from OpenDiLoCo / Streaming paper range |
	| "Liu et al. 2025" for Streaming DiLoCo | WRONG: should be Douillard et al. 2025 |

	---

	## 6. ADR-005 correctness summary

	| Claim | Verdict |
	|---|---|
	| DiLoCo outer sync is once per H=500-1000 inner steps | CORRECT for original paper; Streaming paper uses H=30-100 |
	| Pseudo-gradient size ~2 GB for 1B model in bf16 | CORRECT: 1B params × 2 bytes/param = 2 GB |
	| Object-store rendezvous is bandwidth-efficient | CORRECT: well-reasoned; consistent with paper's communication profile |

	---

	## 7. DiLoCo scaling laws

	No dedicated DiLoCo scaling-laws paper exists as of search date. The Streaming DiLoCo paper (§3.2.1) shows scaling experiments from 35M to 4B parameters but does not derive scaling laws for optimal H or outer optimizer settings. Table 4 of the original paper shows modest scaling (60M, 150M, 400M). DiLoCoX (arXiv:2506.21263) scales to 107B but is a different framework.

	The repo makes no explicit scaling-law claims for DiLoCo, so no finding here.

	---

	## 8. Action items (priority-ordered)

	1. [HIGH] Fix author attribution in two files:
	- `composer_replication/diloco/__init__.py` line 8: change `"Streaming DiLoCo (Liu et al. 2025)"` to `"Streaming DiLoCo (Douillard et al. 2025)"`.
	- `research/design-F4-decoupled-diloco-s3.md` line 109: correct from `"Liu et al. 2025, 'Eager Updates…', arXiv:2501.18512"` to `"Douillard et al. 2025, 'Streaming DiLoCo…', arXiv:2501.18512"`.

	2. [MEDIUM] Add outer_lr note for Streaming:
	- `make_diloco_outer_loop` docstring: note that Streaming DiLoCo (2501.18512) uses outer_lr=0.4 tuned at small scale, while the default 0.7 is optimal for vanilla.

	3. [MEDIUM] Fix research/02 compression claim:
	- Line 244: "FP16 outer state" → "FP4 (E3M0) outer gradients" (what is communicated; accumulation stays FP32).
	- Bandwidth reduction: "~100× peak BW + frequency" → "≈400× total bits + peak BW reduction" to match Table 1 of the Streaming paper (441 TB vs 1.10 TB).

	4. [LOW] Clarify "degrades to vanilla" in design-F4:
	- The current text is accurate for the v0.1 single-fragment case. For multi-fragment configurations, tighten to: "fragment partial-sync bandwidth savings are preserved; only the τ overlap benefit is lost with synchronous allreduce."

	5. [LOW] Fix H=100 default source attribution:
	- `__init__.py` docstring says "Default hyperparams (DiLoCo paper §3.2)" — add that §3.2 uses H=500 for main experiments; H=100 matches Streaming/OpenDiLoCo range.

	6. [FUTURE] Phase-5 upgrade per design-F4:
	- Non-blocking `allreduce_async` in `ObjectStoreAllReduce` to realize genuine τ overlap on S3. ~60 LOC, deferred post-Phase-4.
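As a rough illustration of the deferred-work idea in item 6, the sketch below wraps a blocking reduce in a thread pool and returns a `Future`. Everything here is hypothetical: `AsyncAllReduceSketch` and `fake_rendezvous` are stand-ins, lists of floats replace tensors, and the real `ObjectStoreAllReduce` interface may differ.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

Tensor = List[float]


class AsyncAllReduceSketch:
    """Wraps a blocking allreduce so the caller gets a Future instead of waiting."""

    def __init__(self, blocking_allreduce: Callable[[Tensor], Tensor]) -> None:
        self._blocking_allreduce = blocking_allreduce
        self._pool = ThreadPoolExecutor(max_workers=1)

    def allreduce_async(self, tensor: Tensor) -> Future:
        # Copy the payload so inner steps can keep mutating the caller's buffer.
        return self._pool.submit(self._blocking_allreduce, list(tensor))

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def fake_rendezvous(tensor: Tensor) -> Tensor:
    """Stand-in for the object-store round trip: averages with a fixed peer."""
    peer = [1.0] * len(tensor)
    return [(a + b) / 2 for a, b in zip(tensor, peer)]


reducer = AsyncAllReduceSketch(fake_rendezvous)
work = reducer.allreduce_async([3.0, 5.0])   # returns immediately
inner_steps_done = sum(1 for _ in range(5))  # placeholder for the τ inner steps
averaged = work.result()                     # block only when the result is needed
reducer.close()
```

The caller blocks only at `work.result()`, so inner steps issued between submit and result overlap with the transfer.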

	---

	## 9. What the papers say that the repo does NOT cover (gaps, not bugs)

	- FP4 quantization of outer gradients (Streaming paper §2.4): not implemented. Worth noting as Phase-5 item alongside the async allreduce.
	- Per-worker τ for heterogeneous devices (Streaming paper §3.3.2): the API supports different `fragment_sync_delay` values but there is no orchestration layer to set per-replica τ based on observed step latency. This is more of an RL orchestration concern than a DiLoCo wrapper concern.
	- Decoupled DiLoCo (DeepMind blog 2025, Gemma 4): a Pathways-style async variant not in the literature yet; no implementation expected.
	- Async outer gradient application (arXiv:2401.09135, Liu et al. 2024a = Async Local-SGD): delayed Nesterov (DN) optimizer + Dynamic Local Updates (DyLU) for heterogeneous workers. Not needed for v0.1 but relevant if serverless executors have highly variable step times.