---
name: Manage disk proactively - prefer move-to-/shm over delete; record manifest
description: Disk-fill-up incidents are costly. Mitigate by uploading important artifacts to HF and MOVING (not deleting) intermediate files to /shm, with a manifest of what went where.
type: feedback
originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
---

When a long-running session generates many GB of artifacts:

1. **Upload important checkpoints to HF immediately** when they are achieved; don't let them accumulate on local disk:
   - Final-quality oracles → `explcre/dnathinker-checkpoints/runs//`
   - Major arch-ablation checkpoints (e.g., midway-decision-point ones)
   - Ground-truth motif targets / extracted labels (if costly to regenerate)
   - The bundle scripts + README so others can use them
2. **MOVE intermediate files to `/shm` instead of deleting them.** `/shm` is a symlink to `/dev/shm` (157 GB tmpfs). Always log every move so the user can restore.
   - Quarantine area: `/shm/dnathinker_quarantine/`
   - Manifest log: `/shm/dnathinker_quarantine/MANIFEST.tsv`; append `\t\t\t\t` per move.
   - Priority order for moving (least-needed first): corrupt/partial ckpts, then obsolete intermediate ckpts with eval_loss > final, then HF Trainer's `checkpoint-/` dirs that duplicate human-named `ckpt_step.pt`, then eval-output caches (FIMO TSVs).
   - Only resort to an actual `rm` when the file is known to be recoverable AND the user explicitly approves (e.g., regenerable in minutes from versioned source).

**Why:** a previous H100 disk fill killed an in-flight ablation (loss 5.45→4.57 was achieved but ckpt-2000 was corrupted) and broke the Bash runtime itself. The user does not want a repeat and prefers reversibility ("prefer not deleting things, and record what's moved, to where").

**Volatility caveat:** `/dev/shm` is RAM-backed tmpfs, so files there DO NOT survive a container/host restart. Mention this to the user when moving any file that is not regenerable from HF or source.
If they want durable moves, ask first; possible alternatives: another large block-mounted volume, or just push to HF and delete.

**How to apply:** at every meaningful milestone (oracle done, ablation step landed, comparison done): (a) push final artifacts to HF, (b) MOVE previous intermediate copies to `/shm/dnathinker_quarantine/...` and append to the manifest. Never silently `rm` to free space.

**Move helper boilerplate:**

```bash
QUAR=/shm/dnathinker_quarantine
MANIFEST=$QUAR/MANIFEST.tsv
mkdir -p "$QUAR"

# Move a file or directory into the quarantine area and log the move.
# Usage: move_to_shm <src> <reason>
move_to_shm() {
  local src reason="$2"
  src=$(realpath -s "$1") || return 1   # absolute path (symlinks not resolved) so the manifest entry is unambiguous
  local dst="$QUAR/${src#/}"            # mirror the original path under the quarantine root
  mkdir -p "$(dirname "$dst")"
  local sz
  sz=$(du -sb "$src" | cut -f1)         # total bytes; also correct for directories, unlike `stat -c %s`
  mv "$src" "$dst" && \
    printf "%s\t%s\t%s\t%s\t%s\n" "$(date -Iseconds)" "$src" "$dst" "$sz" "$reason" >> "$MANIFEST"
}
```

**HF upload boilerplate:**

```python
from huggingface_hub import HfApi

api = HfApi()
# `local`: path of the file on disk; `repo_path`: destination path inside the repo.
api.upload_file(path_or_fileobj=local, path_in_repo=repo_path,
                repo_id="explcre/dnathinker-checkpoints", repo_type="model")
```
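
**Restore helper sketch:**

The manifest exists so that moves can be undone, but the boilerplate covers only the forward direction. The following is a minimal sketch of the reverse step, assuming every manifest line carries the five tab-separated fields that `move_to_shm` writes (timestamp, source, destination, size, reason). The names `read_manifest` and `restore_from_manifest` are hypothetical, and the `dry_run` default is an illustrative choice rather than an established convention.

```python
import shutil
from pathlib import Path

# Assumed column order, taken from the printf call in move_to_shm.
FIELDS = ["timestamp", "src", "dst", "size", "reason"]


def read_manifest(manifest_path):
    """Return manifest rows as dicts, skipping lines that do not have five fields."""
    rows = []
    with open(manifest_path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == len(FIELDS):
                rows.append(dict(zip(FIELDS, parts)))
    return rows


def restore_from_manifest(manifest_path, dry_run=True):
    """Move quarantined paths back to where they came from.

    Returns the (quarantined, original) pairs that were restored, or that
    would be restored when dry_run is True.
    """
    restored = []
    for row in read_manifest(manifest_path):
        quarantined = Path(row["dst"])
        original = Path(row["src"])
        # Skip entries that are already restored, were lost (e.g. tmpfs wiped
        # on restart), or whose original location is occupied again.
        if not quarantined.exists() or original.exists():
            continue
        if not dry_run:
            original.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(quarantined), str(original))
        restored.append((str(quarantined), str(original)))
    return restored
```

Calling `restore_from_manifest("/shm/dnathinker_quarantine/MANIFEST.tsv")` with the default `dry_run=True` only reports what would move, which fits the reversibility preference above; pass `dry_run=False` once the list looks right. The sketch never overwrites an existing path at the original location.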