| # DiLoCo Reference Implementation Reconnaissance | |
| **Date:** 2026-05-25 | |
| **Purpose:** Pick ONE PyTorch reference implementation of (Streaming) DiLoCo to bolt onto | |
| the composer-replication-framework outer-loop optimizer. Feeds ADR-003. | |
| **Bias:** simple + working > fancy + theoretically-better. Library > research codebase. | |
| --- | |
| ## TL;DR — Recommendation | |
| **Use `meta-pytorch/torchft`'s `torchft.local_sgd.DiLoCo` context manager.** | |
| It is a maintained library (not a research codebase), BSD-3 licensed, supports both | |
| vanilla DiLoCo and Streaming DiLoCo through one class, and — critically — is unit-testable | |
| in a single process by passing a mocked `Manager` (built with `create_autospec(Manager)`) whose `allreduce` returns a `_DummyWork`. | |
| Their own `torchft/local_sgd_test.py` already demonstrates the exact pattern Spike 008 needs. | |
| The Streaming DiLoCo paper (Douillard et al. 2025, arXiv:2501.18512) has no separate community | |
| implementation — torchft *is* the reference implementation as of mid-2026. PrimeIntellect's | |
| two repos are either too minimal (`diloco_simple`, no LICENSE, NCCL-locked, no Streaming) or | |
| deprecated (`OpenDiLoCo`, hivemind-based, "no longer maintained" per its own README). | |
| --- | |
| ## Candidates Audited (primary sources only) | |
| ### A1. PrimeIntellect-ai/diloco_simple | |
| - URL: https://github.com/PrimeIntellect-ai/diloco_simple | |
| - License: **NONE** (no LICENSE file in repo — confirmed via `git clone` + `ls`). | |
| All-rights-reserved by default under copyright law. **Cannot legally vendor or fork.** | |
| - Last commit: **2024-05-31** (`be38ec4 add weight decay`). | |
| - Activity: 8 commits total, ever. Two main authors. Effectively abandoned. | |
| - Shape: single 180-LOC research script (`pure_torch_diloco.py`), pedagogical demo. | |
| - Streaming DiLoCo? **No.** Vanilla DiLoCo only. | |
| - Distributed: **Hard-coded NCCL via `torchrun`** + `init_process_group(backend="nccl")`. | |
| Pulls in `wandb`, `transformers`, HuggingFace `datasets`, `cyclopts`, and trains a | |
| full LlamaForCausalLM on C4. Not a library — a benchmark script. | |
| - Verdict: **REJECT.** No license, no Streaming, no library API, NCCL-only, deps on | |
| HF + wandb just to run. Useful as an *algorithm reference*, not as code to depend on. | |
| ### A2. PrimeIntellect-ai/OpenDiLoCo | |
| - URL: https://github.com/PrimeIntellect-ai/OpenDiloco | |
| - License: present (Apache-2.0 typical, not re-verified — moot, see below). | |
| - Status: **Officially deprecated.** README first paragraph: | |
| > "**Important Notice**: OpenDiLoCo is no longer maintained. For our production-ready | |
| > distributed training solution, please check out `prime`." | |
| - Built on: `hivemind` (DHT-based decentralized training). Multi-machine only. | |
| - Streaming DiLoCo? No. | |
| - Verdict: **REJECT.** Deprecated by its authors. Hivemind dependency would force us to | |
| set up DHT initial peers just to run a unit test. | |
| ### A3. PrimeIntellect-ai/prime (a.k.a. INTELLECT-1 framework) | |
| - URL: https://github.com/PrimeIntellect-ai/prime — note: the GitHub org now uses this | |
| repo for their CLI/SDK; the original training framework was rebranded. | |
| - The actual INTELLECT-1 training code uses an `ElasticDeviceMesh` abstraction and is | |
| a full distributed training stack, not an algorithm library. | |
| - Verdict: **REJECT.** Production framework, not a drop-in library. Coupling a 1.5k-LOC | |
| fault-tolerant elastic mesh into our test framework is the opposite of "simple + working". | |
| ### A4. DeepMind reference implementation (Douillard et al., arXiv:2311.08105) | |
| - **No public reference implementation exists.** The DiLoCo paper is algorithm-only. | |
| Confirmed: paper has no associated GitHub link in arXiv abstract or PDF; HuggingFace | |
| papers page links no code. DeepMind has not open-sourced their internal trainer. | |
| - Verdict: **N/A — does not exist.** | |
| ### A5. meta-pytorch/torchft ← **CHOSEN** | |
| - URL: https://github.com/meta-pytorch/torchft | |
| - License: **BSD 3-Clause** (verified: `head -5 LICENSE` → "BSD 3-Clause License"). | |
| - Last commit on main: **2026-04-03** (HEAD `7eb7087 Add torchcomms ProcessGroup shim | |
| for fault-tolerant reconfiguration`). | |
| - Activity: 312 commits, multiple Meta contributors, recent commits across 2025 and 2026, | |
| active CI, nightly PyPI builds at https://pypi.org/project/torchft-nightly/. | |
| - Shape: **library**, not a research codebase. `torchft/` is a proper Python package with | |
| `local_sgd.py`, `manager.py`, `process_group.py`, `local_sgd_test.py` (real pytest unit | |
| tests), pyproject.toml, BSD-3. | |
| - Streaming DiLoCo? **Yes** — the `DiLoCo` class is itself a Streaming DiLoCo | |
| generalization (`fragment_sync_delay`, `fragment_update_alpha`); pass a single-element | |
| `model_fragments=[model]` for vanilla DiLoCo. | |
| - Source comment confirms: `"""... DiLoCo paper: https://arxiv.org/pdf/2311.08105 / | |
| Streaming DiLoCo paper: https://arxiv.org/pdf/2501.18512 """` | |
| --- | |
| ## Deep Dive: torchft (the chosen one) | |
| ### (1) Repo metadata | |
| | Field | Value | | |
| |---|---| | |
| | URL | https://github.com/meta-pytorch/torchft | | |
| | License | BSD 3-Clause | | |
| | HEAD commit | `7eb7087` (2026-04-03) | | |
| | Total commits on main | 312 | | |
| | Activity level | **Active** — commits in 2025 + 2026, Meta-maintained, PyPI nightly builds | | |
| | Distribution | `pip install torchft-nightly` (prebuilt wheels) **OR** install from source (requires Rust + protobuf-compiler + maturin — only because of the Lighthouse/process-group Rust ext, not the algorithm code) | | |
| | Python | `requires-python = ">=3.8"`; `torch>=2.7` per `pyproject.toml` | | |
| ### (2) Exact API / extension point | |
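| The integration target is `torchft/local_sgd.py`. Two classes matter: the public `DiLoCo` context manager, whose constructor is shown here, and the internal `_StreamingDiLoCoFragment`, which holds the sync logic quoted further down. | |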
| ```python | |
| # Public class: drop-in context manager | |
| class DiLoCo: | |
|     def __init__( | |
|         self, | |
|         manager: Manager,  # we mock this | |
|         model_fragments: List[nn.Module],  # [model] for vanilla DiLoCo | |
|         inner_optimizer: optim.Optimizer, | |
|         outer_optimizer: optim.Optimizer | list[optim.Optimizer], | |
|         sync_every: int,  # N inner steps | |
|         backup_device: Optional[torch.device] = None, | |
|         pin_memory: bool = True, | |
|         use_bucketization: bool = False, | |
|         bucket_cap_mb: Optional[int] = None, | |
|         should_quantize: bool = False, | |
|         fragment_sync_delay: int = 0,  # τ in Streaming DiLoCo paper | |
|         fragment_update_alpha: float = 0.0, | |
|     ) -> None: ... | |
| ``` | |
| The **pseudo-gradient** is computed in `_StreamingDiLoCoFragment._save_grads()` | |
| (`torchft/local_sgd.py` line 324): | |
| ```python | |
| def _save_grads(self) -> None: | |
|     """Saves pseudo-gradients of the parameters""" | |
|     with torch.no_grad(): | |
|         for name, p in self._model_fragment.named_parameters(): | |
|             local_param = p.to_local() if isinstance(p, DTensor) else p | |
|             pseudogradient = self.original_parameters[name].to(p.device) - local_param | |
|             self._grads[name] = pseudogradient | |
| ``` | |
| Note the **sign**: `original − local` (i.e. `θ_initial − θ_local`). `_set_grads` later | |
| copies this into `p.grad`. With `p` reset to `θ_initial` before the outer step (which the | |
| formula below presupposes), an SGD step `p ← p − lr · grad` becomes | |
| `p ← θ_initial − lr · (θ_initial − θ_local)`, a step *toward* `θ_local` that lands exactly | |
| on `θ_local` when `lr = 1` and momentum is off. Our spec says δ = θ_local − θ_initial; | |
| torchft uses the negation. Either convention works as long as the sign handed to the outer | |
| optimizer is consistent: torchft uses a positive `outer_lr` (e.g. 0.7) and SGD, which | |
| subtracts the grad, so the math nets out. **Be careful when unit-testing the sign in Spike 008.** | |
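The sign argument above can be checked with plain arithmetic. The sketch below is a scalar illustration only: the function names are hypothetical (not torchft API), and it assumes plain SGD without momentum, starting from `θ_initial`.

```python
def outer_sgd_step(theta_initial: float, grad: float, lr: float) -> float:
    """Plain SGD without momentum: p <- p - lr * grad, starting from theta_initial."""
    return theta_initial - lr * grad


def step_torchft_sign(theta_initial: float, theta_local: float, lr: float) -> float:
    # torchft convention: pseudogradient = theta_initial - theta_local
    return outer_sgd_step(theta_initial, theta_initial - theta_local, lr)


def step_spec_sign(theta_initial: float, theta_local: float, lr: float) -> float:
    # Spec convention: delta = theta_local - theta_initial, negated before the optimizer
    delta = theta_local - theta_initial
    return outer_sgd_step(theta_initial, -delta, lr)


if __name__ == "__main__":
    theta_initial, theta_local = 1.0, 0.4
    for lr in (0.7, 1.0):
        a = step_torchft_sign(theta_initial, theta_local, lr)
        b = step_spec_sign(theta_initial, theta_local, lr)
        # Both conventions give the same parameter; it lies between the two thetas.
        print(f"lr={lr}: torchft sign -> {a:.4f}, negated spec sign -> {b:.4f}")
```

Feeding the un-negated δ to the same optimizer would move the parameter away from `θ_local`, which is the failure a direction test in Spike 008 should catch.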
| The **outer Nesterov step** is in `_StreamingDiLoCoFragment.perform_sync()` (line 423): | |
| ```python | |
| if should_commit: | |
| self._set_grads() # write pseudogradient into p.grad | |
| self._outer_optimizer.step() # Nesterov SGD step (user-provided) | |
| self.save_parameters() | |
| self._merge_parameters() | |
| self._outer_optimizer.zero_grad() | |
| ``` | |
| The Nesterov-ness lives in the user-provided outer optimizer, e.g.: | |
| ```python | |
| outer_optimizer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True) | |
| ``` | |
| This matches the DiLoCo paper exactly (Douillard §3 specifies Nesterov momentum outer). | |
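For a hand-computed reference, the outer Nesterov update can be reproduced on a single scalar. This is a sketch that assumes PyTorch's documented SGD update rule with zero dampening and no weight decay; `nesterov_sgd_step` is a hypothetical helper, and the pseudogradient values are made up for illustration.

```python
from typing import Optional, Tuple


def nesterov_sgd_step(
    param: float,
    grad: float,
    buf: Optional[float],
    lr: float = 0.7,
    momentum: float = 0.9,
) -> Tuple[float, float]:
    """One Nesterov SGD step on a scalar; returns (new_param, new_momentum_buffer)."""
    # The momentum buffer is initialised to the first gradient, then decays.
    buf = grad if buf is None else momentum * buf + grad
    # Nesterov look-ahead: apply the gradient plus the momentum-scaled buffer.
    update = grad + momentum * buf
    return param - lr * update, buf


if __name__ == "__main__":
    # Outer round 1: theta_initial = 1.0, averaged pseudogradient 0.6 (torchft sign).
    p1, b1 = nesterov_sgd_step(1.0, 0.6, None)
    # Outer round 2: a smaller pseudogradient, carrying the buffer forward.
    p2, b2 = nesterov_sgd_step(p1, 0.1, b1)
    print(p1, b1)
    print(p2, b2)
```

Under these assumptions the first outer step scales the pseudogradient by `lr · (1 + momentum) = 1.33`, so with `lr=0.7` the parameter overshoots `θ_local` on the first round. A direction test should therefore compare against these values rather than against `θ_local` itself.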
| The cross-replica all-reduce happens in `_average_grads()` (called from `prepare_sync`) | |
| via `self._manager.allreduce(...)` — which is the seam we mock for single-process tests. | |
| ### (3) torch.distributed dependency for testing? | |
| **No, not for unit tests.** The `Manager` is mockable. From `torchft/local_sgd_test.py`: | |
| ```python | |
| from unittest.mock import create_autospec, MagicMock | |
| import torch | |
| from torchft.manager import Manager | |
| from torchft.work import _DummyWork | |
| def create_manager() -> MagicMock: | |
|     manager = create_autospec(Manager) | |
|     manager.errored.return_value = None | |
|     def mock_allreduce(tensor: torch.Tensor, should_quantize: bool = False): | |
|         return _DummyWork(tensor)  # returns the same tensor unchanged | |
|     manager.allreduce.side_effect = mock_allreduce | |
|     return manager | |
| ``` | |
| This bypasses NCCL/Gloo entirely. `_DummyWork` just wraps the tensor and returns it as | |
| the "all-reduced" result, so a single-process test with `world_size=1` works directly, | |
| and a 2-replica test is achieved by running two `DiLoCo` instances with two model | |
| copies in the same process and a `mock_allreduce` that *averages* the two tensors | |
| manually before returning. (Their `test_bucketization_correctness` does exactly this.) | |
| For real distributed runs torchft uses Gloo or NCCL via `torchft.process_group` | |
| (reconfigurable PGs that wrap `torch.distributed`). We do not need this for Spike 008. | |
| ### (4) Library, research codebase, or paper-companion? | |
| **Library.** Strong evidence: | |
| - Proper Python package layout (`torchft/__init__.py`, modules per concern). | |
| - Real unit tests (`*_test.py` per module) — not "run this script" demos. | |
| - BSD-3-Clause LICENSE (vs. diloco_simple having none, signaling "personal demo"). | |
| - Nightly PyPI distribution (`torchft-nightly`) with prebuilt wheels. | |
| - Documentation site at https://pytorch.org/torchft. | |
| - `meta-pytorch` org — Meta-internally maintained; lives next to `torchtitan`. | |
| - README explicitly: *"torchft is designed to provide the primitives required to | |
| implement fault tolerance in any application/train script"* — i.e. a building block. | |
| Only friction: installing **from source** needs Rust (pyo3 + maturin) and | |
| protobuf-compiler. This is for the Rust Lighthouse/process-group extension which we | |
| **do not need** for Spike 008's mock-based tests. Two clean options: | |
| - (a) `pip install torchft-nightly` — uses prebuilt wheel, no Rust toolchain needed. | |
| - (b) Vendor `torchft/local_sgd.py` + the few helpers (`work.py::_DummyWork`, | |
| type stubs for `Manager`) into our repo under BSD-3 attribution. ~700 LOC total. | |
| ### (5) Minimum viable test pattern for Spike 008 | |
| Goal: **2 replicas × 4 inner steps × 2 outer rounds on a tiny model**, single-process, no NCCL. | |
| ```python | |
| # spikes/008-diloco-outer-loop/tests/test_diloco_two_replicas.py | |
| """ | |
| Spike 008: prove the DiLoCo outer-loop math is correct under our framework. | |
| Runs entirely in a single process (one thread per replica), no torch.distributed required. | |
| """ | |
| import copy | |
| import threading | |
| import torch | |
| import torch.nn as nn | |
| import torch.optim as optim | |
| from unittest.mock import create_autospec, MagicMock | |
| from torchft.local_sgd import DiLoCo | |
| from torchft.manager import Manager | |
| from torchft.work import _DummyWork | |
| class TinyMLP(nn.Module): | |
|     def __init__(self): | |
|         super().__init__() | |
|         self.net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)) | |
|     def forward(self, x): return self.net(x) | |
| class _PairBuffer: | |
|     """Pairs the i-th allreduce call of one replica with the i-th call of the other.""" | |
|     def __init__(self): | |
|         self._barrier = threading.Barrier(2, timeout=30) | |
|         self._slots = {} | |
|     def mean_(self, tensor): | |
|         self._slots[threading.get_ident()] = tensor.detach().clone() | |
|         self._barrier.wait()  # both replicas have deposited their tensor | |
|         mean = sum(self._slots.values()) / 2.0 | |
|         self._barrier.wait()  # both have read the slots; the next call may overwrite them | |
|         tensor.copy_(mean) | |
| def _make_avg_manager(replica_buffer): | |
|     """Manager whose allreduce averages tensors across replicas via shared buffer.""" | |
|     mgr = create_autospec(Manager) | |
|     mgr._use_async_quorum = False | |
|     mgr.errored.return_value = None | |
|     mgr.should_commit.return_value = True | |
|     mgr.current_step.return_value = 0 | |
|     def avg_allreduce(tensor, should_quantize=False): | |
|         # Cross-replica average: each replica's tensor receives the mean in place | |
|         replica_buffer.mean_(tensor) | |
|         return _DummyWork(tensor) | |
|     mgr.allreduce.side_effect = avg_allreduce | |
|     return mgr | |
| def test_diloco_two_replicas_four_inner_two_outer(): | |
|     torch.manual_seed(0) | |
|     model_a = TinyMLP() | |
|     model_b = copy.deepcopy(model_a)  # identical init = same θ_initial | |
|     # Inner optimizers (one per replica) | |
|     inner_a = optim.AdamW(model_a.parameters(), lr=1e-3) | |
|     inner_b = optim.AdamW(model_b.parameters(), lr=1e-3) | |
|     # Outer Nesterov (one per replica, same hyperparams) | |
|     outer_a = optim.SGD(model_a.parameters(), lr=0.7, momentum=0.9, nesterov=True) | |
|     outer_b = optim.SGD(model_b.parameters(), lr=0.7, momentum=0.9, nesterov=True) | |
|     # Shared buffer: both DiLoCo wrappers funnel through one "process group" of size 2 | |
|     buf = _PairBuffer() | |
|     mgr_a = _make_avg_manager(buf) | |
|     mgr_b = _make_avg_manager(buf) | |
|     SYNC_EVERY = 4  # 4 inner steps per outer round | |
|     OUTER_ROUNDS = 2 | |
|     def run_replica(model, inner, shift): | |
|         # The sync runs synchronously inside inner.step(), so each replica gets its own | |
|         # thread; in one thread, replica A would finish syncing before B contributed. | |
|         for outer_round in range(OUTER_ROUNDS): | |
|             for inner_step in range(SYNC_EVERY): | |
|                 # Replicas see DIFFERENT data, which is the whole point of DiLoCo | |
|                 x = torch.randn(8, 4) + shift * outer_round | |
|                 y = torch.randn(8, 2) | |
|                 inner.zero_grad() | |
|                 ((model(x) - y) ** 2).mean().backward() | |
|                 inner.step()  # sync fires inside the post-hook when step % SYNC_EVERY == 0 | |
|     with DiLoCo(mgr_a, [model_a], inner_a, outer_a, sync_every=SYNC_EVERY) as dla, \ | |
|          DiLoCo(mgr_b, [model_b], inner_b, outer_b, sync_every=SYNC_EVERY) as dlb: | |
|         # Snapshot θ_initial | |
|         theta_initial_a = {n: p.detach().clone() for n, p in model_a.named_parameters()} | |
|         threads = [ | |
|             threading.Thread(target=run_replica, args=(model_a, inner_a, +0.1)), | |
|             threading.Thread(target=run_replica, args=(model_b, inner_b, -0.1)), | |
|         ] | |
|         for t in threads: | |
|             t.start() | |
|         for t in threads: | |
|             t.join() | |
|     # Assertions: | |
|     # 1. Both replicas now hold IDENTICAL parameters (they were averaged via mock allreduce). | |
|     for (na, pa), (nb, pb) in zip(model_a.named_parameters(), model_b.named_parameters()): | |
|         torch.testing.assert_close(pa, pb, msg=f"Replicas diverged at {na}") | |
|     # 2. Parameters changed from θ_initial (outer optimizer actually stepped). | |
|     any_change = any( | |
|         not torch.equal(p, theta_initial_a[n]) for n, p in model_a.named_parameters() | |
|     ) | |
|     assert any_change, "outer optimizer did not move the parameters" | |
|     # 3. The outer optimizer holds Nesterov momentum state for every parameter | |
|     #    (proves the SGD(nesterov=True) actually ran). | |
|     n_params = len(list(model_a.parameters())) | |
|     assert len(outer_a.state_dict()["state"]) == n_params | |
|     # 4. Sync fired once per outer round per replica. | |
|     assert mgr_a.start_quorum.call_count == OUTER_ROUNDS | |
|     assert mgr_b.start_quorum.call_count == OUTER_ROUNDS | |
| ``` | |
| **Why this works:** | |
| - `DiLoCo` registers a post-step hook on `inner_optimizer` (see `__enter__`). The | |
| hook increments `_local_step` and triggers `prepare_sync` / `perform_sync` on every | |
| `sync_every` boundary — fully automatic, our test only calls `inner.step()`. | |
| - `_DummyWork.wait()` is a no-op. `_average_grads` calls `manager.allreduce(...)` | |
| which our `avg_allreduce` mocks to do real cross-replica averaging through `buf`. | |
| - `manager.should_commit.return_value = True` lets the outer optimizer fire on each | |
| outer round; setting it to `False` lets us also test rollback semantics. | |
| - Everything runs in one process, so plain pytest can drive it. Place the test alongside the | |
| `spikes/005-integrated-trainer-skeleton/tests/` layout or in a new `spikes/008/tests/` directory. | |
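The step-counting behaviour described in the first bullet can be illustrated without torchft. `SyncCounter` below is a hypothetical stand-in, not torchft code: it assumes the hook simply counts inner steps and fires a callback on every `sync_every` boundary. In a real trainer the `after_inner_step` call would be driven by an optimizer post-step hook, for example one registered with `torch.optim.Optimizer.register_step_post_hook`.

```python
from typing import Callable, List


class SyncCounter:
    """Counts inner steps and fires `on_sync` on every `sync_every` boundary."""

    def __init__(self, sync_every: int, on_sync: Callable[[int], None]) -> None:
        if sync_every < 1:
            raise ValueError("sync_every must be >= 1")
        self.sync_every = sync_every
        self.on_sync = on_sync
        self.local_step = 0   # steps since the last sync
        self.total_steps = 0  # steps since construction

    def after_inner_step(self) -> None:
        """Call once after each inner optimizer step."""
        self.local_step += 1
        self.total_steps += 1
        if self.local_step == self.sync_every:
            self.on_sync(self.total_steps)
            self.local_step = 0


# 2 outer rounds x 4 inner steps, matching the Spike 008 test shape.
fired: List[int] = []
counter = SyncCounter(sync_every=4, on_sync=fired.append)
for _ in range(8):
    counter.after_inner_step()
print(fired)
```

Eight inner steps with `sync_every=4` yield two syncs, which is the count that assertion 4 in the test expects from `start_quorum`.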
| **Install for this spike:** `pip install torchft-nightly` in the eidolon venv. If the | |
| nightly wheel proves brittle, fallback: vendor `local_sgd.py` + `work.py` + a | |
| minimal `manager.py` stub (≈800 LOC) into `framework/diloco/_vendored/` with BSD-3 | |
| attribution. | |
| --- | |
| ## Risks & Mitigations | |
| | Risk | Likelihood | Mitigation | | |
| |---|---|---| | |
| | `torchft-nightly` wheel breaks against torch 2.x | Med | Pin to a specific nightly hash; or vendor `local_sgd.py` directly under BSD-3. | | |
| | `torchft.manager.Manager` import pulls in Rust ext at import time | Low | Our test path needs the class only as a type annotation and as the `create_autospec` target; none of its methods run. If the import itself touches the Rust extension, we vendor. Verified: the import in `local_sgd.py` is `from torchft.manager import Manager`. | |
| | Sign convention of pseudogradient causes our outer optimizer to move the wrong way | Med | Assertion 2 in the test pattern above only checks that the parameters moved from their initial values, not in which direction. A second test should compare the direction against a hand-computed expected value. | |
| | `fragment_sync_delay > 0` (true Streaming) requires CUDA streams | Med | Spike 008 starts with `fragment_sync_delay=0` (= vanilla DiLoCo). Streaming variant deferred to Spike 009 once basic loop works. | | |
| | Requires `torch>=2.7` per pyproject | Low | Framework already on torch 2.x; check exact pin. If <2.7, we vendor. | | |
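The "check exact pin" mitigation in the last row can be done with a small version guard. This is a sketch using only the standard library; `parse_release` and `meets_minimum` are hypothetical helpers. It deliberately ignores pre-release ordering, so a dev build such as `2.7.0.dev...` counts as meeting the minimum.

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: Tuple[int, ...] = (2, 7)) -> bool:
    """True when the numeric release is at least `minimum` (tuple comparison)."""
    return parse_release(version) >= minimum


if __name__ == "__main__":
    for candidate in ("2.6.0", "2.7.0", "2.7.1+cu126", "2.10.0"):
        print(candidate, meets_minimum(candidate))
```

In the framework this would typically be called as `meets_minimum(torch.__version__)`. Comparing integer tuples rather than strings avoids ranking `2.10` below `2.7`.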
| --- | |
| ## Decision (for ADR-003) | |
| Adopt **`torchft.local_sgd.DiLoCo`** as the reference DiLoCo / Streaming DiLoCo | |
| implementation. Integrate via `pip install torchft-nightly` for Spike 008. If | |
| brittleness emerges, vendor `local_sgd.py` (BSD-3) into `framework/diloco/_vendored/`. | |
| For the framework's outer-loop optimizer abstraction (the actual ADR-003 question): | |
| mirror torchft's `DiLoCo(manager, [model_fragments], inner_opt, outer_opt, sync_every)` | |
| constructor shape so that swapping our wrapper for the upstream class is a one-line | |
| change. Compute pseudogradient as `θ_local − θ_initial` (our convention) and negate | |
| when handing to the outer optimizer, OR follow torchft's `θ_initial − θ_local` | |
| convention end-to-end. **Pick one and document it loudly.** | |