# AGILLM-3.5 distributed master: backup scheme
`distributed/history/` holds **history-kept** backups. Existing files are never overwritten or squashed, so a corrupted upload cannot destroy older backups. There are two kinds:
- `master_r<round>_<ts>.full.pt`: the complete master, including the optimizer. Self-contained.
- `master_r<round>_<ts>.delta.pt`: ONLY `core.blocks.*` (the trained layers). Small (~1.2 GB vs 6.5 GB for a full). Everything else (emb/ln/ar/sat) is frozen across rounds, so a delta applied on top of its base full reproduces the exact model. Each delta records its `base_full`.
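The relationship between a full and a delta can be sketched as follows. This is a minimal illustration, not the project's actual code: plain dicts stand in for the state dicts stored in the `.pt` files, the `base_full` / `blocks` layout of the delta and the key names in the usage example are assumptions, and `make_delta` / `apply_delta` are hypothetical helper names.

```python
PREFIX = "core.blocks."  # per the README, a delta holds only these keys


def make_delta(state, base_full_name):
    """Keep only the trained-layer entries and record which full they patch.

    The {"base_full", "blocks"} layout is an assumption for illustration.
    """
    return {
        "base_full": base_full_name,
        "blocks": {k: v for k, v in state.items() if k.startswith(PREFIX)},
    }


def apply_delta(base_state, delta, base_full_name):
    """Overlay a delta's blocks on its base full to rebuild the model."""
    if delta["base_full"] != base_full_name:
        raise ValueError(
            f"delta expects base {delta['base_full']!r}, got {base_full_name!r}"
        )
    unknown = set(delta["blocks"]) - set(base_state)
    if unknown:
        raise KeyError(f"delta has keys missing from the base: {sorted(unknown)}")
    merged = dict(base_state)       # frozen parts come from the base full
    merged.update(delta["blocks"])  # trained layers come from the delta
    return merged
```

Usage, with made-up keys and values:

```python
base = {"emb.weight": [1, 2], "ln.weight": [1], "core.blocks.0.w": [0.1]}
later = {**base, "core.blocks.0.w": [0.5]}  # only the blocks changed

delta = make_delta(later, "base.full.pt")
assert apply_delta(base, delta, "base.full.pt") == later
```

Refusing a delta whose recorded `base_full` does not match the supplied base is the point of recording it: merging onto the wrong full would silently produce a model that never existed.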
Cadence: one backup every ~4 days; every 4th backup is a full and the rest are deltas (no size cap).
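A small sketch of the cadence rule and the naming pattern. Two things here are assumptions rather than facts from this README: that the first backup in the sequence (index 0) is a full, and that `<round>` and `<ts>` are inserted into the name verbatim. `backup_kind` and `backup_name` are hypothetical helper names.

```python
def backup_kind(index, full_every=4):
    """Return "full" or "delta" for the index-th backup (0-based).

    Assumption: backup 0 is a full, so fulls land on 0, 4, 8, ...
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    return "full" if index % full_every == 0 else "delta"


def backup_name(round_no, ts, kind):
    """Build master_r<round>_<ts>.<kind>.pt; ts is passed through as-is."""
    if kind not in ("full", "delta"):
        raise ValueError(f"unknown backup kind: {kind!r}")
    return f"master_r{round_no}_{ts}.{kind}.pt"
```

With these assumptions, `[backup_kind(i) for i in range(5)]` gives `['full', 'delta', 'delta', 'delta', 'full']`, so a restore never needs more than one full plus one delta.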
## Restore
- Full only: `restore_from_backup.py <full.pt> master.pt`
- Full+delta: `restore_from_backup.py <base_full.pt> master.pt <delta.pt>`
Then resume the disagg loop from the restored `master.pt`.
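When scripting a restore, the argument order is easy to get wrong, because the delta comes after the output path. The sketch below only builds the argument list for the two documented forms; `restore_args` is a hypothetical helper, the suffix checks are an added safeguard based on the naming scheme above, and how the script is launched (directly or through an interpreter) is left to the caller.

```python
def restore_args(full_pt, out_pt="master.pt", delta_pt=None):
    """Build the arguments for restore_from_backup.py in documented order:
    <full.pt> <output> [<delta.pt>]
    """
    if not full_pt.endswith(".full.pt"):
        raise ValueError(f"expected a .full.pt file, got {full_pt!r}")
    args = ["restore_from_backup.py", full_pt, out_pt]
    if delta_pt is not None:
        if not delta_pt.endswith(".delta.pt"):
            raise ValueError(f"expected a .delta.pt file, got {delta_pt!r}")
        args.append(delta_pt)  # the delta is the last argument
    return args
```

The returned list can be passed to `subprocess.run` (prefixed with an interpreter if needed). The full given alongside a delta should be the one named in that delta's `base_full`.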