noanya committed · Commit 246cc9a · 1 Parent(s): 9d91248

feat(scripts): check_hub_checkpoints.py + recovery docs

After a Colab/Kaggle session dies mid-training, the user needs a fast way
to confirm what made it to the HF Hub before the crash and to pull the
latest checkpoint locally.

scripts/check_hub_checkpoints.py:
  --hub-model-id noanya/zombiee                 # default action: list
  --hub-model-id noanya/zombiee --info          # show step / loss / lr
                                                #   from trainer_state.json
  --hub-model-id noanya/zombiee --download DIR  # snapshot the latest
  --checkpoint N                                # operate on a specific step

Honors HUGGINGFACE_TOKEN / HF_TOKEN for private repos. Prints a ready-to-run
`python -m training.train --resume-from-checkpoint <path>` command after a
successful download so resume on the DGX is one paste away.
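
The listing behind the default action can be sketched roughly as follows. This is a minimal illustration, assuming each checkpoint sits under a top-level `checkpoint-N/` directory in the Hub repo; `checkpoint_steps` and `latest_checkpoint` are hypothetical helper names, and the input is a plain list of repo file paths such as the one `HfApi.list_repo_files` returns.

```python
import re

# Matches paths like "checkpoint-300/trainer_state.json". The pattern encodes
# the assumed repo layout; it is not copied from the script in this commit.
_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)/")


def checkpoint_steps(repo_files):
    """Return the sorted, de-duplicated step numbers found in a file listing."""
    steps = set()
    for path in repo_files:
        match = _CHECKPOINT_DIR.match(path)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)


def latest_checkpoint(repo_files):
    """Return the highest step number, or None when no checkpoint exists."""
    steps = checkpoint_steps(repo_files)
    return steps[-1] if steps else None


if __name__ == "__main__":
    files = [
        "README.md",
        "checkpoint-200/adapter_model.safetensors",
        "checkpoint-200/trainer_state.json",
        "checkpoint-1000/trainer_state.json",
    ]
    print(checkpoint_steps(files))
    print(latest_checkpoint(files))
```

Converting the step to an integer before sorting matters: a plain string sort would rank `checkpoint-200` above `checkpoint-1000`.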

notebooks/README.md:
  - "My Colab/Kaggle session died — did I lose anything?" section pointing
    at the recovery script.
  - Document scripts/dgx_autorun.sh's tunables (MIN_FREE_GB, MAX_JOBS,
    POLL_INTERVAL, etc.) so the auto-launch flow is discoverable.

notebooks/README.md CHANGED
@@ -95,3 +95,73 @@ In the *Configuration* cell of the notebook, lower:
  The DGX (V100 32 GB) can run the full Qwen2.5-3B / `NUM_GENERATIONS=8` /
  `MAX_SEQ_LENGTH=4096` config from `Dockerfile.dgx`'s default `CMD`; Kaggle's
  16 GB T4 needs the trimmed defaults shown in the notebook.
+
+ ## "My Colab/Kaggle session died — did I lose anything?"
+
+ **No**, as long as `--push-to-hub` was set (it is, in both notebooks): every
+ checkpoint that finished uploading lives on the Hub at
+ `huggingface.co/<HUB_MODEL_ID>`. The `hub_strategy="every_save"` setting in
+ `training/train.py` starts uploading each `checkpoint-N/` as soon as it is
+ written to disk. Recent `transformers` releases run that upload in the
+ background, so the newest checkpoint may be missing or incomplete if the
+ session died mid-upload.
+
+ Inspect what survived:
+
+ ```bash
+ # Just list:
+ python scripts/check_hub_checkpoints.py --hub-model-id noanya/zombiee
+
+ # Show training progress (step, loss, lr) from the latest checkpoint:
+ python scripts/check_hub_checkpoints.py --hub-model-id noanya/zombiee --info
+
+ # Pull the latest checkpoint locally:
+ python scripts/check_hub_checkpoints.py --hub-model-id noanya/zombiee \
+     --download ./recovered
+ ```
+
+ Then resume from anywhere:
+
+ | Where | How |
+ |---|---|
+ | Same Kaggle/Colab notebook | Just re-run it. Cell 6 auto-detects the Hub checkpoint and resumes. |
+ | DGX (single GPU) | `python -m training.train --resume-from-checkpoint noanya/zombiee --push-to-hub --hub-model-id noanya/zombiee --output-dir ./lora_v1` |
+ | DGX (auto-pick GPU) | `HUGGINGFACE_TOKEN=hf_xxx ./scripts/dgx_autorun.sh` (see below) |
+
+ ## DGX autorun script
+
+ `scripts/dgx_autorun.sh` watches `nvidia-smi` and launches a training
+ container as soon as a GPU has enough free memory. It survives container
+ crashes (each launch resumes from the same Hub checkpoint), and will spin up
+ **additional** containers on other GPUs as they free up, to a maximum of
+ `MAX_JOBS`.
+
+ Prerequisites:
+ 1. `docker build -f Dockerfile.dgx -t survivecity-train .` (do this once).
+ 2. Export your HF token: `export HUGGINGFACE_TOKEN=hf_xxx`.
+
+ Run:
+
+ ```bash
+ # 1 job, requires 10 GB free on a GPU before launching
+ ./scripts/dgx_autorun.sh
+
+ # Tighter memory budget, allow up to 2 parallel jobs
+ MIN_FREE_GB=8 MAX_JOBS=2 ./scripts/dgx_autorun.sh
+
+ # See what it would do without actually launching
+ DRY_RUN=1 ./scripts/dgx_autorun.sh
+ ```
+
+ Tunables (env vars):
+
+ | Var | Default | Meaning |
+ |---|---|---|
+ | `MIN_FREE_GB` | `10` | Minimum free GPU memory (GB) before considering a GPU |
+ | `MAX_JOBS` | `1` | Cap on parallel training containers |
+ | `POLL_INTERVAL` | `60` | Seconds between `nvidia-smi` scans |
+ | `HUB_MODEL_ID` | `noanya/zombiee` | HF Hub repo id |
+ | `MAX_STEPS` | `4000` | Passed through to `training/train.py` |
+ | `SAVE_STEPS` | `100` | Passed through to `training/train.py` |
+ | `OUTPUT_ROOT` | `./lora_v1` | Host dir; per-GPU subdirs are mounted into containers |
+ | `DRY_RUN` | `0` | If `1`, prints the launch command without running it |
+
+ Containers are named `survivecity-train-gpuN`. Stop everything with
+ `Ctrl-C`; the script's `trap` cleans up all launched containers.
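
The GPU selection described in the autorun section might look something like the sketch below. The source of `scripts/dgx_autorun.sh` is not part of this diff, so everything here is an assumption for illustration: the function names are hypothetical, the input is assumed to be the text printed by `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits` (free memory in MiB), and 1 GB is treated as 1024 MiB.

```python
def gpus_with_free_memory(smi_csv, min_free_gb=10):
    """Return indices of GPUs whose free memory is at least min_free_gb.

    smi_csv holds one "index, free_mib" line per GPU.
    """
    eligible = []
    for line in smi_csv.strip().splitlines():
        index, free_mib = (field.strip() for field in line.split(","))
        if int(free_mib) >= min_free_gb * 1024:
            eligible.append(int(index))
    return eligible


def pick_launch_targets(smi_csv, running, min_free_gb=10, max_jobs=1):
    """Choose GPUs to launch on without exceeding max_jobs containers overall.

    running is the set of GPU indices that already host a training container.
    """
    slots = max(0, max_jobs - len(running))
    candidates = [
        gpu for gpu in gpus_with_free_memory(smi_csv, min_free_gb)
        if gpu not in running
    ]
    return candidates[:slots]


if __name__ == "__main__":
    sample = "0, 2048\n1, 30000\n2, 9000\n"
    # Mirrors MIN_FREE_GB=8 MAX_JOBS=2 with one container already on GPU 1.
    print(pick_launch_targets(sample, running={1}, min_free_gb=8, max_jobs=2))
```

Counting the already running containers against `max_jobs` is what keeps a polling loop from launching a new container on every scan.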
scripts/check_hub_checkpoints.py ADDED
@@ -0,0 +1,242 @@
+ #!/usr/bin/env python
+ """Inspect / recover GRPO training checkpoints from the Hugging Face Hub.
+
+ Use this to answer "did Colab/Kaggle save anything before it died?" and to
+ pull the latest checkpoint locally for manual inspection or DGX resume.
+
+ Examples
+ --------
+ List what's on the Hub:
+     python scripts/check_hub_checkpoints.py --hub-model-id noanya/zombiee
+
+ Show training progress (step, loss, learning rate from trainer_state.json):
+     python scripts/check_hub_checkpoints.py --hub-model-id noanya/zombiee --info
+
+ Download the latest checkpoint to ./recovered/:
+     python scripts/check_hub_checkpoints.py --hub-model-id noanya/zombiee \\
+         --download ./recovered
+
+ Then resume training from it:
+     python -m training.train \\
+         --resume-from-checkpoint ./recovered \\
+         --push-to-hub --hub-model-id noanya/zombiee \\
+         --max-steps 4000 --output-dir ./lora_v1
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import sys
+ from datetime import datetime, timezone
+
+
+ def parse_args():
+     p = argparse.ArgumentParser(
+         description="List / download GRPO training checkpoints from HF Hub.",
+         formatter_class=argparse.RawDescriptionHelpFormatter,
+         epilog=__doc__,
+     )
+     p.add_argument(
+         "--hub-model-id", default=os.environ.get("HUB_MODEL_ID", "noanya/zombiee"),
+         help="HF Hub repo id, e.g. 'noanya/zombiee' (default: $HUB_MODEL_ID or noanya/zombiee).",
+     )
+     p.add_argument(
+         "--info", action="store_true",
+         help="Read trainer_state.json from the latest checkpoint and print training progress.",
+     )
+     p.add_argument(
+         "--download", metavar="DIR", default=None,
+         help="Download the latest checkpoint to this directory.",
+     )
+     p.add_argument(
+         "--checkpoint", metavar="N", type=int, default=None,
+         help="Operate on checkpoint-N specifically instead of the latest.",
+     )
+     p.add_argument(
+         "--token", default=os.environ.get("HUGGINGFACE_TOKEN") or os.environ.get("HF_TOKEN"),
+         help="HF token (default: $HUGGINGFACE_TOKEN / $HF_TOKEN). Required for private repos.",
+     )
+     return p.parse_args()
+
+
+ def list_checkpoints(api, repo_id, token):
+     """Return (sorted list of checkpoint step numbers, list of root files)."""
+     try:
+         files = api.list_repo_files(repo_id, token=token)
+     except Exception as e:
+         print(f"ERROR: could not list {repo_id}: {e}", file=sys.stderr)
+         print(
+             "  If the repo is private, set HUGGINGFACE_TOKEN. If it doesn't exist yet,\n"
+             "  no training run has pushed to it.",
+             file=sys.stderr,
+         )
+         sys.exit(1)
+
+     steps = set()
+     root_files = []
+     for f in files:
+         if f.startswith("checkpoint-"):
+             try:
+                 steps.add(int(f.split("/", 1)[0].split("-", 1)[1]))
+             except ValueError:
+                 pass
+         elif "/" not in f:
+             root_files.append(f)
+     return sorted(steps), sorted(root_files)
+
+
+ def fetch_trainer_state(api, repo_id, step, token, work_dir):
+     """Download trainer_state.json from checkpoint-step and return parsed dict."""
+     from huggingface_hub import hf_hub_download
+
+     path = hf_hub_download(
+         repo_id=repo_id,
+         filename=f"checkpoint-{step}/trainer_state.json",
+         local_dir=work_dir,
+         token=token,
+     )
+     with open(path) as f:
+         return json.load(f)
+
+
+ def fmt_age(iso_or_dt):
+     """Render 'X minutes/hours/days ago' from an HF datetime."""
+     if isinstance(iso_or_dt, str):
+         try:
+             dt = datetime.fromisoformat(iso_or_dt.replace("Z", "+00:00"))
+         except ValueError:
+             return iso_or_dt
+     else:
+         dt = iso_or_dt
+     if dt.tzinfo is None:
+         dt = dt.replace(tzinfo=timezone.utc)
+     delta = datetime.now(timezone.utc) - dt
+     s = int(delta.total_seconds())
+     if s < 60:
+         return f"{s}s ago"
+     if s < 3600:
+         return f"{s // 60}m ago"
+     if s < 86400:
+         return f"{s // 3600}h {(s % 3600) // 60}m ago"
+     return f"{s // 86400}d {(s % 86400) // 3600}h ago"
+
+
+ def cmd_list(api, repo_id, token):
+     steps, root_files = list_checkpoints(api, repo_id, token)
+
+     print(f"Repo: https://huggingface.co/{repo_id}")
+     try:
+         info = api.repo_info(repo_id, token=token)
+         # Newer huggingface_hub releases expose `last_modified`; older ones
+         # used `lastModified`. Accept either so the age is not silently dropped.
+         modified = getattr(info, "last_modified", None) or getattr(info, "lastModified", None)
+         age = f" ({fmt_age(modified)})" if modified else ""
+         print(f"Last commit: {info.sha[:8]}{age}")
+     except Exception:
+         pass  # Best-effort metadata; the listing below does not depend on it.
+
+     print()
+     if not steps:
+         print("No checkpoint-* directories found.")
+         if root_files:
+             print(f"Root files present: {', '.join(root_files)}")
+             print("(Looks like only a final-model push, no intermediate checkpoints.)")
+         else:
+             print("Repo is empty — training never reached the first save.")
+         return
+
+     print(f"Found {len(steps)} checkpoint(s): {', '.join(f'checkpoint-{s}' for s in steps)}")
+     print(f"Latest: checkpoint-{steps[-1]}")
+     if root_files:
+         print(f"Root files: {', '.join(root_files)}")
+
+
+ def cmd_info(api, repo_id, token, step):
+     steps, _ = list_checkpoints(api, repo_id, token)
+     if not steps:
+         print("No checkpoints to inspect.", file=sys.stderr)
+         sys.exit(1)
+     target = step if step is not None else steps[-1]
+     if target not in steps:
+         print(f"checkpoint-{target} not on hub. Available: {steps}", file=sys.stderr)
+         sys.exit(1)
+
+     print(f"Inspecting checkpoint-{target}...")
+     state = fetch_trainer_state(api, repo_id, target, token, "/tmp/_hub_inspect")
+
+     print()
+     print(f"  global_step : {state.get('global_step')}")
+     print(f"  epoch       : {state.get('epoch'):.4f}" if state.get("epoch") is not None else "  epoch       : ?")
+     print(f"  max_steps   : {state.get('max_steps')}")
+     print(f"  best_metric : {state.get('best_metric')}")
+     print(f"  total_flos  : {state.get('total_flos')}")
+
+     log_history = state.get("log_history", [])
+     if log_history:
+         print(f"  log entries : {len(log_history)}")
+         last = log_history[-1]
+         print()
+         print("  Most recent log entry:")
+         for k in ("loss", "learning_rate", "grad_norm", "reward", "kl", "step"):
+             if k in last:
+                 v = last[k]
+                 if isinstance(v, float):
+                     print(f"    {k:18}: {v:.6f}")
+                 else:
+                     print(f"    {k:18}: {v}")
+
+     pct = (target / state["max_steps"] * 100) if state.get("max_steps") else None
+     if pct is not None:
+         print()
+         print(f"Progress: {target} / {state['max_steps']} steps ({pct:.1f}% done)")
+
+
+ def cmd_download(api, repo_id, token, target_dir, step):
+     from huggingface_hub import snapshot_download
+
+     steps, _ = list_checkpoints(api, repo_id, token)
+     if not steps:
+         print("Nothing to download — no checkpoints on hub.", file=sys.stderr)
+         sys.exit(1)
+     chosen = step if step is not None else steps[-1]
+     if chosen not in steps:
+         print(f"checkpoint-{chosen} not on hub. Available: {steps}", file=sys.stderr)
+         sys.exit(1)
+
+     os.makedirs(target_dir, exist_ok=True)
+     print(f"Downloading checkpoint-{chosen} from {repo_id} -> {target_dir}/")
+     local = snapshot_download(
+         repo_id=repo_id,
+         allow_patterns=[f"checkpoint-{chosen}/*"],
+         local_dir=target_dir,
+         token=token,
+     )
+     final = os.path.join(local, f"checkpoint-{chosen}")
+     print()
+     print(f"Done. Local path: {final}")
+     print()
+     print("To resume training from this checkpoint:")
+     print("  python -m training.train \\")
+     print(f"    --resume-from-checkpoint {final} \\")
+     print(f"    --push-to-hub --hub-model-id {repo_id} \\")
+     print("    --output-dir ./lora_v1")
+
+
+ def main():
+     args = parse_args()
+     try:
+         from huggingface_hub import HfApi
+     except ImportError:
+         print("pip install huggingface_hub", file=sys.stderr)
+         sys.exit(1)
+
+     api = HfApi()
+
+     if args.download:
+         cmd_download(api, args.hub_model_id, args.token, args.download, args.checkpoint)
+     elif args.info:
+         cmd_info(api, args.hub_model_id, args.token, args.checkpoint)
+     else:
+         cmd_list(api, args.hub_model_id, args.token)
+
+
+ if __name__ == "__main__":
+     main()
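
As a companion to `cmd_info`, the progress readout can be condensed into a single line. The sketch below is illustrative only: `summarize_state` is a hypothetical helper, it takes an already parsed `trainer_state.json` dict with the same `global_step`, `max_steps`, and `log_history` keys the script reads, and it searches backwards for a loss because the final log entry may be a summary record that carries no `loss` field.

```python
def summarize_state(state):
    """Build a one-line progress summary from a parsed trainer_state.json dict."""
    step = state.get("global_step")
    max_steps = state.get("max_steps")

    # Walk the log backwards to find the newest entry that recorded a loss.
    loss = next(
        (entry["loss"] for entry in reversed(state.get("log_history", []))
         if "loss" in entry),
        None,
    )

    if step is None:
        progress = "step ?"
    elif max_steps:
        progress = f"step {step}/{max_steps} ({step / max_steps:.1%})"
    else:
        progress = f"step {step}"

    if loss is None:
        return progress
    return f"{progress}, loss {loss:.4f}"


if __name__ == "__main__":
    example = {
        "global_step": 300,
        "max_steps": 4000,
        "log_history": [
            {"loss": 0.5, "step": 290},
            {"train_runtime": 12.0, "step": 300},
        ],
    }
    print(summarize_state(example))
```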