| --- |
| name: Manage disk proactively - prefer move-to-/shm over delete; record manifest |
| description: Disk-fill-up incidents are costly. Mitigate by uploading important artifacts to HF and MOVING (not deleting) intermediate files to /shm, with a manifest of what went where. |
| type: feedback |
| originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b |
| --- |
| When a long-running session generates many GB of artifacts: |
|
|
| 1. **Upload important checkpoints to HF as soon as they are produced** instead of letting them accumulate on local disk: |
| - Final-quality oracles → `explcre/dnathinker-checkpoints/runs/<run_dir>/` |
| - Major arch-ablation checkpoints (e.g., midway-decision-point ones) |
| - Ground-truth motif targets / extracted labels (if costly to regenerate) |
| - The bundle scripts + README so others can use them |
|
|
| 2. **MOVE intermediate files to `/shm` instead of deleting them.** `/shm` is a symlink to `/dev/shm` (157 GB tmpfs). Always log every move so the user can restore. |
| - Quarantine area: `/shm/dnathinker_quarantine/<original-relative-path>` |
| - Manifest log: `/shm/dnathinker_quarantine/MANIFEST.tsv`. Append one line per move: `<ISO-timestamp>\t<source-abs-path>\t<dest-abs-path>\t<size-bytes>\t<reason>`. |
| - Priority order for moving (least-needed first): corrupt/partial ckpts, then obsolete intermediate ckpts whose eval_loss is worse than the final one's, then HF Trainer's `checkpoint-<step>/` dirs that duplicate human-named `ckpt_step<step>.pt`, then eval-output caches (FIMO TSVs). |
| - Only resort to actual `rm` when the file is truly known-recoverable AND the user explicitly approves (e.g., regenerable in minutes from versioned source). |
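Because `/shm` is a finite tmpfs whose contents occupy RAM, it may help to check free space before each move. The following is a minimal sketch, not part of the original workflow: the function name `fits_in_quarantine` and the 8 GiB default reserve are hypothetical choices made for illustration.

```python
import shutil


def fits_in_quarantine(size_bytes, quarantine_root="/dev/shm",
                       reserve_bytes=8 * 1024**3):
    """Return True if size_bytes can be moved while leaving reserve_bytes free.

    reserve_bytes is an assumed safety margin (hypothetical default of 8 GiB);
    tune it to how much RAM the running jobs need.
    """
    free = shutil.disk_usage(quarantine_root).free
    return size_bytes + reserve_bytes <= free
```

If the check fails, one option is to stop and ask the user about a durable alternative rather than fall back to deletion.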
|
|
| **Why:** a previous H100 disk fill killed an in-flight ablation (loss 5.45 → 4.57 was achieved, but ckpt-2000 was corrupted) and broke the Bash runtime itself. The user does not want a repeat and prefers reversibility ("prefer not deleting things, and record what's moved, to where"). |
|
|
| **Volatility caveat:** `/dev/shm` is RAM-backed tmpfs, so files there DO NOT survive a container/host restart. Mention this to the user when moving any file that is not regenerable from HF or source. If they want durable moves, ask first; possible alternatives: another large block-mounted volume, or just push to HF and delete. |
|
|
| **How to apply:** at every meaningful milestone (oracle done, ablation step landed, comparison done): (a) push final artifacts to HF, (b) MOVE previous intermediate copies to `/shm/dnathinker_quarantine/...` and append to the manifest. Never silently `rm` to free space. |
|
|
| **Move helper boilerplate:** |
| ```bash |
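| QUAR=/shm/dnathinker_quarantine |
| MANIFEST=$QUAR/MANIFEST.tsv |
| mkdir -p "$QUAR" |
| # Move a file or directory into the quarantine area and log the move. |
| move_to_shm() { |
|   local src="$1" reason="$2" |
|   [ -e "$src" ] || { echo "move_to_shm: no such path: $src" >&2; return 1; } |
|   case "$src" in /*) ;; *) src="$PWD/$src" ;; esac  # manifest stores absolute paths |
|   local dst="$QUAR/${src#/}"  # mirror the original path under $QUAR |
|   mkdir -p "$(dirname "$dst")" |
|   # du -sb gives apparent bytes for files and directories alike |
|   # (stat -c %s on a directory reports only the directory entry's size). |
|   local sz; sz=$(du -sb -- "$src" | cut -f1) |
|   mv -- "$src" "$dst" && \ |
|     printf "%s\t%s\t%s\t%s\t%s\n" "$(date -Iseconds)" "$src" "$dst" "$sz" "$reason" >> "$MANIFEST" |
| } |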
| ``` |
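The manifest exists so that moves can be undone. Below is a minimal sketch of the inverse operation, assuming the five-column tab-separated layout described above. The function name `restore_from_manifest` and its `only_source` filter are hypothetical and not an existing tool.

```python
import shutil
from pathlib import Path


def restore_from_manifest(manifest, only_source=None):
    """Move quarantined paths back to their original locations.

    Each manifest line is: timestamp, source, dest, size, reason (tab-separated).
    Returns the list of source paths that were restored.
    """
    restored = []
    with open(manifest) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            src, dst = Path(fields[1]), Path(fields[2])
            if only_source is not None and fields[1] != only_source:
                continue
            # Skip entries already restored, or whose original path is occupied.
            if not dst.exists() or src.exists():
                continue
            src.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(dst), str(src))
            restored.append(fields[1])
    return restored
```

The sketch never overwrites an existing file at the original location, and it leaves the manifest untouched so the history of moves stays intact.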
|
|
| **HF upload boilerplate:** |
| ```python |
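| from huggingface_hub import HfApi |
| # Placeholders: point these at the real local file and its target path in the repo. |
| local = "path/to/ckpt.pt" |
| repo_path = "runs/<run_dir>/ckpt.pt" |
| api = HfApi()  # authenticates via cached login or the HF_TOKEN environment variable |
| api.upload_file(path_or_fileobj=local, path_in_repo=repo_path, |
|                 repo_id="explcre/dnathinker-checkpoints", repo_type="model") |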
| ``` |
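When a whole run directory needs to go up at once, `HfApi.upload_folder` can replace the per-file call. The following is a sketch under assumptions: the wrapper name `upload_run_dir` is hypothetical, and it assumes the `runs/<run_dir>/` layout mentioned earlier, using the local directory name as `<run_dir>`. The `api` object is passed in so the function can be exercised without network access.

```python
from pathlib import Path


def upload_run_dir(api, run_dir, repo_id="explcre/dnathinker-checkpoints"):
    """Upload a local run directory to runs/<directory name> in the HF repo.

    `api` is expected to be a huggingface_hub.HfApi instance (or any object
    exposing a compatible upload_folder method).
    """
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        raise NotADirectoryError(str(run_dir))
    path_in_repo = f"runs/{run_dir.name}"
    api.upload_folder(folder_path=str(run_dir), path_in_repo=path_in_repo,
                      repo_id=repo_id, repo_type="model")
    return path_in_repo
```

Typical use would be `upload_run_dir(HfApi(), "<local run dir>")` after `from huggingface_hub import HfApi`. Move the local copies to quarantine only once the upload returns without error.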
|
|