# Spike 008 — Streaming DiLoCo outer-loop smoke

**Closes**: V2 (DiLoCo "deferred to v0.2") in `docs/VISION_VALIDATION.md`.

## Goal

Bolt the DiLoCo outer-loop pseudo-gradient sync onto the framework using
`torchft.local_sgd.DiLoCo` (see `docs/adrs/ADR-003-diloco-impl.md`).

Verify:
1. Two in-process replicas converge to identical parameters after outer sync.
2. Outer Nesterov momentum is actually populated (i.e. the outer optimizer
   ran).
3. The pseudo-gradient sign convention is what we expect (a flipped sign is
   caught by an explicit unit test).
4. Importing torchft does not regress Spike 005's existing 38 tests.

Single-process, no NCCL. A mock `Manager.allreduce` performs real cross-replica
averaging through a shared buffer.
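
The shared-buffer idea can be sketched in plain PyTorch. This is an illustration only, not the mock used in `tests/test_diloco_smoke.py`: the class name `SharedBufferAllreduce` and its `contribute`/`reduce` methods are hypothetical, and it does not implement the real `torchft.Manager` interface.

```python
import torch


class SharedBufferAllreduce:
    """Hypothetical in-process stand-in for an averaging allreduce."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        # key -> tensors contributed by each replica for that key
        self._pending: dict[str, list[torch.Tensor]] = {}

    def contribute(self, key: str, tensor: torch.Tensor) -> None:
        # Each replica registers the tensor it wants averaged.
        self._pending.setdefault(key, []).append(tensor)

    def reduce(self, key: str) -> None:
        # Once every replica has contributed, overwrite each tensor
        # in place with the cross-replica mean, as an allreduce would.
        tensors = self._pending.pop(key)
        if len(tensors) != self.world_size:
            raise RuntimeError("not every replica contributed")
        mean = torch.stack(tensors).mean(dim=0)
        for t in tensors:
            t.copy_(mean)


bus = SharedBufferAllreduce(world_size=2)
a = torch.tensor([1.0, 3.0])  # replica 0's value
b = torch.tensor([3.0, 5.0])  # replica 1's value
bus.contribute("w", a)
bus.contribute("w", b)
bus.reduce("w")  # a and b now both hold the mean
```

Averaging in place mirrors how a collective overwrites its input buffer, so code written against the sketch does not need a separate "result" tensor.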

## Files

- `composer_diloco.py` — `make_diloco_outer_loop(...)` wrapper around
  `torchft.local_sgd.DiLoCo`. Documents the sign convention.
- `tests/test_diloco_smoke.py` — 3 acceptance tests.

## Acceptance

| Criterion | Status |
|---|---|
| 2 replicas converge after 2 outer rounds | ✓ test 1 |
| Nesterov momentum state populated | ✓ test 1 |
| Sync fires once per outer round per replica | ✓ test 1 |
| Pseudo-gradient sign convention verified | ✓ test 2 |
| No regression in Spike 005 imports | ✓ test 3 |
| Spike 005's 38 tests still pass after this wave | (verified separately) |
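
The first four criteria can be illustrated without torchft. The sketch below assumes the convention from the DiLoCo paper, pseudo-gradient = parameters at the last sync minus current local parameters, so that an SGD-style outer step moves the global parameters toward the replicas. The `Replica` class, the learning rate, and the momentum value are hypothetical and chosen only for the illustration; the sign convention actually used is the one documented in `composer_diloco.py`.

```python
import torch


class Replica:
    """Hypothetical single-parameter replica with its own outer optimizer."""

    def __init__(self, init: torch.Tensor) -> None:
        self.param = torch.nn.Parameter(init.clone())
        self.original = init.clone()  # snapshot taken at the last outer sync
        self.outer_opt = torch.optim.SGD(
            [self.param], lr=0.7, momentum=0.9, nesterov=True
        )

    def pseudo_gradient(self) -> torch.Tensor:
        # Assumed convention: snapshot minus current local parameters.
        return self.original - self.param.detach()

    def apply_outer(self, avg_pseudo_grad: torch.Tensor) -> None:
        with torch.no_grad():
            self.param.copy_(self.original)  # restore the snapshot
        self.param.grad = avg_pseudo_grad.clone()
        self.outer_opt.step()  # outer Nesterov step
        self.outer_opt.zero_grad()
        self.original = self.param.detach().clone()


init = torch.zeros(3)
replicas = [Replica(init), Replica(init)]

# Stand-in for the inner loop: each replica drifts to different values.
drifts = [torch.tensor([1.0, 2.0, 3.0]), torch.tensor([3.0, 2.0, 1.0])]
for replica, drift in zip(replicas, drifts):
    with torch.no_grad():
        replica.param.add_(drift)

# Average the pseudo-gradients across replicas, then step each outer optimizer.
avg = torch.stack([r.pseudo_gradient() for r in replicas]).mean(dim=0)
for replica in replicas:
    replica.apply_outer(avg)

momentum = replicas[0].outer_opt.state[replicas[0].param]["momentum_buffer"]
```

Because every replica restores the same snapshot and applies the same averaged pseudo-gradient, the replicas end the round with identical parameters. With the opposite sign, the outer step would move the parameters away from the replicas' drift, which is the failure a sign-convention test is meant to catch.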

## Future work (v0.2 Streaming DiLoCo)

- `fragment_sync_delay > 0` requires CUDA streams. Spike 008 uses
  `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke.
- Multiple fragments (`model_fragments=[frag_0, frag_1, ...]`) can be
  configured through `make_diloco_outer_loop()` but are not exercised in the
  smoke.
- A real `torch.distributed` backend (NCCL) for multi-node training is one
  config switch away: replace the mock `Manager` with the real
  `torchft.Manager`.
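
As a rough illustration of what a fragment list could look like, the sketch below splits an `nn.Sequential` into two fragments that share the original layers. The model and the split point are hypothetical, and how the fragments are handed to `make_diloco_outer_loop()` is not shown because its signature is not reproduced here.

```python
import torch.nn as nn

# Hypothetical toy model; any nn.Sequential can be split the same way.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

# Slicing an nn.Sequential returns a new nn.Sequential over the same layer
# objects, so the fragments alias the model's parameters instead of copying.
frag_0, frag_1 = model[:2], model[2:]
model_fragments = [frag_0, frag_1]

# Sanity check: the fragments partition the model's parameters.
frag_ids = [{id(p) for p in frag.parameters()} for frag in model_fragments]
all_ids = {id(p) for p in model.parameters()}
```

Keeping the fragments disjoint and exhaustive means each parameter is synced by exactly one fragment.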

## Cost / time

- Pure CPU, single process, no GPU.
- Tests run in <2 seconds total.

## Dependencies added

- `torchft-nightly` (BSD-3, Meta-maintained, `pip install torchft-nightly`)