Spike 008 — Streaming DiLoCo outer-loop smoke

Closes: V2 (DiLoCo "deferred to v0.2") in docs/VISION_VALIDATION.md.

Goal

Bolt the DiLoCo outer-loop pseudo-gradient sync onto the framework using torchft.local_sgd.DiLoCo (see docs/adrs/ADR-003-diloco-impl.md).

Verify:

Two in-process replicas converge to identical parameters after outer sync.
Outer Nesterov momentum is actually populated (i.e. the outer optimizer ran).
The pseudo-gradient sign convention is what we expect (an explicit unit test fails if the sign is flipped).
Importing torchft does not regress Spike 005's existing 38 tests.
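
The sign-convention check can be illustrated without torchft. This is a minimal sketch, assuming the convention from the DiLoCo paper (pseudo-gradient = global parameters before the inner steps minus local parameters after them). The convention actually documented in composer_diloco.py may differ, and the helper names pseudo_gradient and outer_step are hypothetical. With a plain outer SGD at lr=1.0 and no momentum, the outer step should land exactly on the local parameters; a flipped sign moves away from them instead.

```python
import torch


def pseudo_gradient(global_param: torch.Tensor, local_param: torch.Tensor) -> torch.Tensor:
    # ASSUMPTION: paper-style convention, "global minus local".
    return global_param - local_param


def outer_step(global_param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Apply one plain SGD step (lr=1.0, no momentum) starting from the global params.
    p = torch.nn.Parameter(global_param.clone())
    outer = torch.optim.SGD([p], lr=1.0)
    p.grad = grad.clone()
    outer.step()
    return p.detach()


global_p = torch.tensor([1.0, 2.0])   # parameters at the start of the outer round
local_p = torch.tensor([0.5, 2.5])    # parameters after some inner steps

# Correct sign: the outer step reproduces the local parameters.
landed = outer_step(global_p, pseudo_gradient(global_p, local_p))

# Flipped sign: the outer step moves in the opposite direction.
flipped = outer_step(global_p, -pseudo_gradient(global_p, local_p))

assert torch.allclose(landed, local_p)
assert not torch.allclose(flipped, local_p)
```

Using lr=1.0 without momentum makes the expected result exact, so the check does not depend on a tolerance tuned to a particular optimizer setting.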

The smoke runs in a single process with no NCCL. A mock Manager.allreduce performs real cross-replica averaging through a shared buffer.
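
One way to average across in-process replicas through a shared buffer is sketched below. The class name SharedBufferAverager and its behavior are hypothetical and are not taken from the spike's test code. The sketch also does not try to match the return type or asynchronous behavior of the real torchft Manager.allreduce. Each replica hands in its tensor, and once all replicas have contributed, every tensor is overwritten in place with the mean.

```python
import torch


class SharedBufferAverager:
    """Hypothetical stand-in for a cross-replica allreduce in a single process."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self._pending: list[torch.Tensor] = []  # the shared buffer

    def allreduce(self, tensor: torch.Tensor) -> None:
        # Collect contributions until every replica has called in.
        self._pending.append(tensor)
        if len(self._pending) == self.world_size:
            mean = torch.stack(self._pending).mean(dim=0)
            # Write the average back into each replica's tensor in place.
            for t in self._pending:
                t.copy_(mean)
            self._pending.clear()


shared = SharedBufferAverager(world_size=2)
a = torch.tensor([1.0, 3.0])  # replica 0's pseudo-gradient
b = torch.tensor([3.0, 5.0])  # replica 1's pseudo-gradient
shared.allreduce(a)  # buffered; nothing is averaged yet
shared.allreduce(b)  # last contribution triggers the average

assert torch.equal(a, b)
```

Because the average is only written once the last replica has contributed, callers must not read their tensor until every replica has called allreduce for that round.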

composer_diloco.py — make_diloco_outer_loop(...) wrapper around torchft.local_sgd.DiLoCo. Documents the sign convention.
tests/test_diloco_smoke.py — 3 acceptance tests.

Criterion	Status
2 replicas converge after 2 outer rounds	✓ test 1
Nesterov momentum state populated	✓ test 1
Sync fires once per outer round per replica	✓ test 1
Pseudo-gradient sign convention verified	✓ test 2
No regression in Spike 005 imports	✓ test 3
Spike 005's 38 tests still pass after this wave	(verified separately)
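
The first two criteria can be pictured with a plain-PyTorch outer loop. This is a sketch only, not the code in tests/test_diloco_smoke.py, and it does not call torchft. It assumes the "global minus local" pseudo-gradient convention, and the model size, learning rates, and step counts are illustrative. Two replicas take inner steps on different data, average their pseudo-gradients, and apply the same Nesterov outer step from the same starting point.

```python
import torch

torch.manual_seed(0)


def make_replica():
    model = torch.nn.Linear(4, 1)
    inner = torch.optim.SGD(model.parameters(), lr=0.05)
    # Outer optimizer: SGD with Nesterov momentum (illustrative hyperparameters).
    outer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
    return model, inner, outer


replicas = [make_replica() for _ in range(2)]
# Both replicas start from identical parameters.
replicas[1][0].load_state_dict(replicas[0][0].state_dict())

for outer_round in range(2):
    # Snapshot the global parameters at the start of the round.
    snapshots = [[p.detach().clone() for p in m.parameters()] for m, _, _ in replicas]

    # Inner steps: each replica trains on its own data and drifts apart.
    for rank, (model, inner, _) in enumerate(replicas):
        gen = torch.Generator().manual_seed(100 * outer_round + rank)
        for _ in range(3):
            x = torch.randn(8, 4, generator=gen)
            y = torch.randn(8, 1, generator=gen)
            inner.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()
            inner.step()

    # Pseudo-gradient per replica (ASSUMPTION: global minus local), then average.
    pgrads = [
        [s - p.detach() for s, p in zip(snap, m.parameters())]
        for snap, (m, _, _) in zip(snapshots, replicas)
    ]
    avg = [torch.stack(gs).mean(dim=0) for gs in zip(*pgrads)]

    # Outer sync: restore the global parameters, then step with the averaged pseudo-gradient.
    for snap, (model, _, outer) in zip(snapshots, replicas):
        with torch.no_grad():
            for p, s, g in zip(model.parameters(), snap, avg):
                p.copy_(s)
                p.grad = g.clone()
        outer.step()

# Replicas hold the same parameters after the outer sync.
for p0, p1 in zip(replicas[0][0].parameters(), replicas[1][0].parameters()):
    assert torch.allclose(p0, p1)

# The outer optimizer ran: its momentum buffers exist and are non-zero.
for model, _, outer in replicas:
    for p in model.parameters():
        assert outer.state[p]["momentum_buffer"].abs().sum() > 0
```

Checking the momentum buffer guards against a silent failure mode in which parameters match only because the outer step never ran and both replicas were simply reset to the snapshot.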

fragment_sync_delay > 0 requires CUDA streams. Spike 008 uses fragment_sync_delay=0 (vanilla DiLoCo) for the smoke.
Multiple fragments via model_fragments=[frag_0, frag_1, ...] can be configured through make_diloco_outer_loop() but are not exercised in the smoke.
A real torch.distributed backend (NCCL) for multi-node training is expected to need one change: replacing the mock Manager with a real torchft.Manager. That path is not exercised in the smoke.