OpenTransformer committed on
Commit 3cb7341 · verified
1 parent: 225450f

Document full/delta backup + restore

Files changed (1)
  1. distributed/BACKUPS.md +16 -0
distributed/BACKUPS.md ADDED
@@ -0,0 +1,16 @@
+ # AGILLM-3.5 distributed master — backup scheme
+
+ `distributed/history/` holds **history-kept** backups (never overwritten/squashed,
+ so a corrupted upload can't destroy older ones):
+
+ - `master_r<round>_<ts>.full.pt`: complete master, including optimizer state. Self-contained.
+ - `master_r<round>_<ts>.delta.pt`: ONLY `core.blocks.*` (the trained layers). Much smaller
+ (~1.2 GB vs 6.5 GB for a full). Everything else (emb/ln/ar/sat) is frozen across rounds, so
+ applying a delta to its base full reproduces the exact model. Each delta records its `base_full`.
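The delta idea above can be sketched in a few lines. This is an illustration only, not the repository's code: it assumes a checkpoint is a flat mapping from parameter names to values, and the delta layout (`base_full` and `state` keys) and both function names are hypothetical. Optimizer state is left out of the sketch.

```python
from typing import Any, Dict, Mapping

# From the README: deltas hold only core.blocks.* entries.
TRAINED_PREFIX = "core.blocks."


def make_delta(full_state: Mapping[str, Any], base_full: str) -> Dict[str, Any]:
    """Keep only the trained entries and record which full they build on."""
    blocks = {k: v for k, v in full_state.items() if k.startswith(TRAINED_PREFIX)}
    # Hypothetical layout: the real delta file may store this differently.
    return {"base_full": base_full, "state": blocks}


def apply_delta(base_state: Mapping[str, Any], delta: Mapping[str, Any]) -> Dict[str, Any]:
    """Overlay the delta's trained entries on a copy of the base full."""
    unexpected = [k for k in delta["state"] if not k.startswith(TRAINED_PREFIX)]
    if unexpected:
        raise ValueError(f"delta contains non-block keys: {unexpected}")
    merged = dict(base_state)      # frozen entries come from the base full
    merged.update(delta["state"])  # trained entries come from the delta
    return merged


# Toy example: plain numbers stand in for tensors.
base = {"emb.weight": 1, "core.blocks.0.w": 10}
later = {"emb.weight": 1, "core.blocks.0.w": 11}
delta = make_delta(later, "base.full.pt")
restored = apply_delta(base, delta)
```

The reconstruction is only exact if the non-block entries really are unchanged between the base full and the round the delta was taken from, which is the property the README relies on.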
+
+ Cadence: one backup roughly every 4 days; every 4th backup is a full, the rest are deltas (no size cap).
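As a rough sketch of that cadence, the helper below decides whether the n-th backup is a full or a delta and builds a file name in the documented pattern. It assumes the sequence starts with a full (a delta needs a base to refer to); the README does not say which position in each group of four is the full. The function names and the timestamp format are hypothetical.

```python
def backup_kind(index: int, full_every: int = 4) -> str:
    """Return 'full' or 'delta' for the index-th backup (0-based).

    Assumption: backup 0 is a full, then every `full_every`-th one after it.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    return "full" if index % full_every == 0 else "delta"


def backup_name(round_no: int, ts: str, kind: str) -> str:
    """Build a name following master_r<round>_<ts>.<kind>.pt."""
    if kind not in ("full", "delta"):
        raise ValueError(f"unknown backup kind: {kind}")
    return f"master_r{round_no}_{ts}.{kind}.pt"


# One full followed by three deltas, repeating.
schedule = [backup_kind(i) for i in range(8)]
```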
+
+ ## Restore
+ Full only: `restore_from_backup.py <full.pt> master.pt`
+ Full+delta: `restore_from_backup.py <base_full.pt> master.pt <delta.pt>`
+ Then resume the disagg loop from the restored `master.pt`.
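A small helper can pick the newest file in `distributed/history/` and assemble the arguments in the order shown above. This is a sketch under assumptions: it takes higher round numbers to mean newer backups, and the helper names are hypothetical. It does not read the delta's `base_full` record, so the caller has to supply the base full, and it does not say how the script is launched.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches master_r<round>_<ts>.full.pt and master_r<round>_<ts>.delta.pt
BACKUP_RE = re.compile(r"^master_r(\d+)_(.+)\.(full|delta)\.pt$")


def newest_backup(names: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (file name, kind) for the highest round, or None if nothing matches."""
    parsed = []
    for name in names:
        m = BACKUP_RE.match(name)
        if m:
            parsed.append((int(m.group(1)), name, m.group(3)))
    if not parsed:
        return None
    _, name, kind = max(parsed)  # compares by round first, then by name
    return name, kind


def restore_args(base_full: str, out: str = "master.pt",
                 delta: Optional[str] = None) -> List[str]:
    """Argument list in the README's order: <base_full.pt> master.pt [<delta.pt>]."""
    args = ["restore_from_backup.py", base_full, out]
    if delta is not None:
        args.append(delta)
    return args


names = ["master_r8_a.full.pt", "master_r12_b.delta.pt", "notes.txt"]
latest = newest_backup(names)
cmd = restore_args("master_r8_a.full.pt", delta="master_r12_b.delta.pt")
```

When the newest backup is a delta, the base full passed to `restore_args` should be the one named in that delta's `base_full` record, not simply the most recent full on disk.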