# AGILLM-3.5 distributed master — backup scheme

`distributed/history/` holds **history-kept** backups. Files there are never
overwritten or squashed, so a corrupted upload cannot destroy older ones:

- `master_r<round>_<ts>.full.pt`: the complete master, including optimizer state.
  Self-contained.
- `master_r<round>_<ts>.delta.pt`: ONLY `core.blocks.*` (the trained layers). Small
  (~1.2 GB vs 6.5 GB for a full). Everything else (emb/ln/ar/sat) is frozen across
  rounds, so a delta plus its base full reproduces the exact model. Each delta records
  its `base_full`.
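
The delta/full relationship above can be sketched in a few lines. This is an illustration only, not the code used by the backup job: it treats a checkpoint as a flat dict of parameter names, and the helper names `make_delta` and `apply_delta` and the `{"base_full", "blocks"}` layout are assumptions. Only the `core.blocks.` prefix and the `base_full` field come from this README.

```python
PREFIX = "core.blocks."  # trained-layer prefix named in this README


def make_delta(state, base_full):
    """Keep only the trained entries and record which full backup they extend.

    `state` is a flat name -> tensor mapping (assumed layout).
    """
    blocks = {k: v for k, v in state.items() if k.startswith(PREFIX)}
    return {"base_full": base_full, "blocks": blocks}


def apply_delta(base_state, delta):
    """Overlay a delta's trained entries on a copy of its base full state."""
    unknown = [k for k in delta["blocks"] if k not in base_state]
    if unknown:
        # A delta should only touch keys that already exist in its base.
        raise KeyError(f"delta has keys missing from base: {unknown[:3]}")
    merged = dict(base_state)        # shallow copy; the base is left untouched
    merged.update(delta["blocks"])   # frozen entries keep their base values
    return merged
```

Because everything outside `core.blocks.*` is frozen, overlaying the delta on its base yields the same weights as the master the delta was taken from.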

Cadence: one backup roughly every 4 days. Every 4th backup is a full and the rest are
deltas. There is no size cap.
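
A minimal sketch of that cadence, assuming backups are counted from 0 and the cycle starts on a full (the README does not say which position in each group of four is the full). `backup_kind` and `backup_name` are hypothetical helpers, and the timestamp is passed in as an opaque string because its format is not specified here.

```python
FULL_EVERY = 4  # "every 4th is a full"


def backup_kind(index):
    """Return 'full' or 'delta' for the index-th backup (0-based, assumed to start on a full)."""
    return "full" if index % FULL_EVERY == 0 else "delta"


def backup_name(round_no, ts, kind):
    """Build a file name following the master_r<round>_<ts>.<kind>.pt pattern."""
    return f"master_r{round_no}_{ts}.{kind}.pt"
```

Under this assumption, every delta is at most three backups away from the full it builds on.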

## Restore
- Full only: `restore_from_backup.py <full.pt> master.pt`
- Full + delta: `restore_from_backup.py <base_full.pt> master.pt <delta.pt>`

Then resume the disagg loop from the restored `master.pt`.
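
The two restore forms can be picked automatically from a backup's file name. The helper below is a sketch, not part of `restore_from_backup.py`: `restore_args` and its regular expression are assumptions, and for a delta the caller must supply the `base_full` path that the delta records. It only builds the argument list in the order shown above and does not run anything.

```python
import re

# Matches master_r<round>_<ts>.full.pt / .delta.pt; <ts> is treated as opaque.
NAME_RE = re.compile(r"^master_r(?P<round>[^_]+)_(?P<ts>.+)\.(?P<kind>full|delta)\.pt$")


def restore_args(backup, base_full=None, out="master.pt"):
    """Return the restore_from_backup.py argument list for a full or delta backup."""
    match = NAME_RE.match(backup.rsplit("/", 1)[-1])
    if match is None:
        raise ValueError(f"not a recognised backup name: {backup}")
    if match["kind"] == "full":
        return ["restore_from_backup.py", backup, out]
    if base_full is None:
        raise ValueError("a delta needs the base_full it records")
    # Delta form: base full first, output second, delta last.
    return ["restore_from_backup.py", base_full, out, backup]
```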