Document full/delta backup + restore
distributed/BACKUPS.md
ADDED
@@ -0,0 +1,16 @@
# AGILLM-3.5 distributed master — backup scheme

`distributed/history/` holds **history-kept** backups (never overwritten/squashed,
so a corrupted upload can't destroy older ones):

- `master_r<round>_<ts>.full.pt`: complete master, including the optimizer. Self-contained.
- `master_r<round>_<ts>.delta.pt`: ONLY `core.blocks.*` (the trained layers). Small
  (~1.2 GB vs 6.5 GB for a full). Everything else (emb/ln/ar/sat) is frozen across rounds, so a
  delta plus its base full reproduces the exact model. Each delta records its `base_full`.
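
The split described above can be sketched with plain dictionaries standing in for a checkpoint. This is a minimal illustration, not the project's code: the flat key layout, the `blocks` field, and the helper names `make_delta` and `apply_delta` are assumptions. Only the `core.blocks.` prefix and the `base_full` record come from the text.

```python
from typing import Any, Dict

PREFIX = "core.blocks."  # the trained layers, per the naming above


def make_delta(full_state: Dict[str, Any], base_full: str) -> Dict[str, Any]:
    """Keep only the trained-block entries and record the full they pair with.

    Hypothetical layout: a flat name -> value mapping.
    """
    blocks = {k: v for k, v in full_state.items() if k.startswith(PREFIX)}
    return {"base_full": base_full, "blocks": blocks}


def apply_delta(base_state: Dict[str, Any], delta: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay a delta's blocks on a copy of its base full."""
    unknown = set(delta["blocks"]) - set(base_state)
    if unknown:
        raise KeyError(f"delta has keys missing from base: {sorted(unknown)}")
    merged = dict(base_state)        # frozen entries come from the base
    merged.update(delta["blocks"])   # trained entries come from the delta
    return merged


# Stand-in values; a real checkpoint would hold tensors.
base = {"emb.weight": 1, "core.blocks.0.w": 2, "core.blocks.1.w": 3}
later = {"emb.weight": 1, "core.blocks.0.w": 20, "core.blocks.1.w": 30}
delta = make_delta(later, base_full="example.full.pt")
restored = apply_delta(base, delta)
```

Because the frozen entries are identical in `base` and `later`, `restored` equals `later`. The same reasoning is why a delta is only valid against the full it names in `base_full`.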

Cadence: one backup roughly every 4 days; every 4th is a full and the rest are deltas (no size cap).
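
As a rough sketch of that cadence, the function below labels each backup in sequence. It assumes the full comes first in each group of four, so that every delta has an earlier full to build on. The text does not say which position the full occupies, and `backup_kind` is a hypothetical name.

```python
def backup_kind(index: int, full_every: int = 4) -> str:
    """Return "full" or "delta" for the index-th backup (0-based).

    Assumption: the first backup in each group of `full_every` is the full.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    return "full" if index % full_every == 0 else "delta"


kinds = [backup_kind(i) for i in range(8)]
# kinds == ["full", "delta", "delta", "delta", "full", "delta", "delta", "delta"]
```

With one backup about every 4 days, this puts fulls roughly 16 days apart.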

## Restore

- Full only: `restore_from_backup.py <full.pt> master.pt`
- Full+delta: `restore_from_backup.py <base_full.pt> master.pt <delta.pt>`

Then resume the disagg loop from the restored `master.pt`.
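
The argument order in the two commands above (base full, output path, then the optional delta) is easy to get wrong. The sketch below builds the argument list in that order. `restore_argv` is a hypothetical helper, the suffix checks assume the `.full.pt` / `.delta.pt` naming from the backup list, and the file names in the example are placeholders.

```python
from typing import List, Optional


def restore_argv(full: str, out: str = "master.pt",
                 delta: Optional[str] = None) -> List[str]:
    """Build the restore arguments: base full, output, then optional delta."""
    if not full.endswith(".full.pt"):
        raise ValueError(f"expected a .full.pt file, got {full!r}")
    argv = ["restore_from_backup.py", full, out]
    if delta is not None:
        if not delta.endswith(".delta.pt"):
            raise ValueError(f"expected a .delta.pt file, got {delta!r}")
        argv.append(delta)
    return argv


full_only = restore_argv("example.full.pt")
full_plus_delta = restore_argv("example.full.pt", delta="example.delta.pt")
```

When restoring from a delta, the full passed here should be the one recorded in that delta's `base_full`.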