AGILLM-3.5 distributed master — backup scheme
distributed/history/ holds history-kept backups (never overwritten/squashed,
so a corrupted upload can't destroy older ones):
master_r<round>_<ts>.full.pt: complete master (including the optimizer). Self-contained.
master_r<round>_<ts>.delta.pt: ONLY core.blocks.* (the trained layers). Tiny (~1.2 GB vs 6.5 GB). Everything else (emb/ln/ar/sat) is frozen across rounds, so a delta + its base full == the exact model. Each delta records its base_full.
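The "delta + its base full == the exact model" claim can be illustrated with plain dictionaries standing in for checkpoint state. This is a minimal sketch, not the actual backup code: the on-disk layout of the .pt files is not described here, so the "base_full" and "state" field names, the make_delta/apply_delta helpers, and the example keys are all hypothetical. Only the core.blocks. prefix comes from the note above.

```python
# Sketch only: plain dicts stand in for checkpoint state dicts.
TRAINED_PREFIX = "core.blocks."  # from the note: only these layers change between rounds


def make_delta(state, base_full_name):
    """Keep only the trained-layer entries and record the full backup they build on."""
    return {
        "base_full": base_full_name,  # hypothetical field name
        "state": {k: v for k, v in state.items() if k.startswith(TRAINED_PREFIX)},
    }


def apply_delta(base_state, delta):
    """Return a copy of the base state with the trained layers taken from the delta."""
    outside = [k for k in delta["state"] if not k.startswith(TRAINED_PREFIX)]
    if outside:
        raise ValueError(f"delta holds keys outside {TRAINED_PREFIX}*: {outside}")
    missing = [k for k in delta["state"] if k not in base_state]
    if missing:
        raise ValueError(f"delta keys absent from the base full: {missing}")
    merged = dict(base_state)  # frozen parts (emb/ln/ar/sat) are carried over unchanged
    merged.update(delta["state"])
    return merged


# Hypothetical keys, for illustration only.
base = {"emb.weight": 1, "core.blocks.0.w": 2, "ln.weight": 3}
later = {"emb.weight": 1, "core.blocks.0.w": 9, "ln.weight": 3}
delta = make_delta(later, "master_r4_example.full.pt")
restored = apply_delta(base, delta)
```

Because the frozen entries are identical in every round, restored equals later exactly, and base itself is left unmodified.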
Cadence: one backup every ~4 days; every 4th is a full, the rest deltas (no size cap).
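As a rough worked example of this cadence, the sketch below labels successive backups and compares storage per four-backup cycle. It assumes the first backup in each cycle is the full (a delta needs a base to refer to) and uses the approximate sizes quoted above; backup_kind and the constant names are hypothetical.

```python
FULL_EVERY = 4     # from the note: every 4th backup is a full
DAYS_BETWEEN = 4   # approximate, per the note
FULL_GB = 6.5      # approximate sizes quoted above
DELTA_GB = 1.2


def backup_kind(index):
    """index counts backups from 0; assumes the cycle starts with a full."""
    return "full" if index % FULL_EVERY == 0 else "delta"


kinds = [backup_kind(i) for i in range(8)]

# Storage added per cycle of four backups, versus keeping four fulls.
cycle_gb = FULL_GB + (FULL_EVERY - 1) * DELTA_GB
all_fulls_gb = FULL_EVERY * FULL_GB
days_between_fulls = FULL_EVERY * DAYS_BETWEEN
```

Under these assumptions a full lands about every 16 days, and each cycle adds roughly 10.1 GB instead of 26 GB.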
Restore
Full only: restore_from_backup.py <full.pt> master.pt
Full+delta: restore_from_backup.py <base_full.pt> master.pt <delta.pt>
Then resume the disagg loop from the restored master.pt.
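For scripted restores, the two invocations above can be wrapped in a small helper that builds the argument list. This is a sketch under assumptions: restore_command is a hypothetical name, and running the script through the current Python interpreter is assumed rather than documented. The argument order is the one shown above.

```python
import sys


def restore_command(base_full, out="master.pt", delta=None, script="restore_from_backup.py"):
    """Build argv for the documented invocations: full only, or full plus delta."""
    if not base_full.endswith(".full.pt"):
        raise ValueError(f"expected a .full.pt backup, got {base_full!r}")
    cmd = [sys.executable, script, base_full, out]
    if delta is not None:
        if not delta.endswith(".delta.pt"):
            raise ValueError(f"expected a .delta.pt backup, got {delta!r}")
        cmd.append(delta)  # the delta goes last, after the output path
    return cmd


# Hypothetical file names, for illustration only.
full_only = restore_command("master_r8_example.full.pt")
full_plus_delta = restore_command("master_r8_example.full.pt", delta="master_r9_example.delta.pt")
# To execute: subprocess.run(full_plus_delta, check=True)
```

Checking the suffixes before running guards against passing a delta where the base full is expected.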