name: Manage disk proactively — prefer move-to-/shm over delete; record manifest
description: >-
  Disk-fill-up incidents are costly. Mitigate by uploading important artifacts
  to HF and MOVING (not deleting) intermediate files to /shm, with a manifest of
  what went where.
type: feedback
originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
When a long-running session generates many GB of artifacts:
Upload important checkpoints to HF as soon as they are achieved; don't let them accumulate on local disk:
- Final-quality oracles → explcre/dnathinker-checkpoints/runs/<run_dir>/
- Major arch-ablation checkpoints (e.g., midway-decision-point ones)
- Ground-truth motif targets / extracted labels (if costly to regenerate)
- The bundle scripts + README so others can use them
MOVE intermediate files to /shm instead of deleting them. /shm is a symlink to /dev/shm (157 GB tmpfs). Always log every move so the user can restore.
- Quarantine area: /shm/dnathinker_quarantine/<original-relative-path>
- Manifest log: /shm/dnathinker_quarantine/MANIFEST.tsv; append one line per move: <ISO-timestamp>\t<source-abs-path>\t<dest-abs-path>\t<size-bytes>\t<reason>
- Order priority for moving (most-unneeded first): corrupt/partial ckpts, then obsolete intermediate ckpts (eval_loss > final), then HF Trainer's checkpoint-<step>/ dirs that duplicate human-named ckpt_step<step>.pt, then eval-output caches (FIMO TSVs).
- Only resort to actual rm when the file is truly known-recoverable AND the user explicitly approves (e.g., regenerable in minutes from versioned source).
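The third priority bucket above (HF Trainer checkpoint-<step>/ dirs that duplicate a human-named ckpt_step<step>.pt) can be found mechanically. A minimal sketch, assuming both live directly inside the same run directory; the function name duplicate_trainer_dirs is hypothetical, and only the two naming patterns come from this note:

```python
import re
from pathlib import Path

# Naming patterns are taken from the note; the helper itself is illustrative.
_TRAINER_DIR = re.compile(r"^checkpoint-(\d+)$")


def duplicate_trainer_dirs(run_dir):
    """Return checkpoint-<step>/ dirs that have a sibling ckpt_step<step>.pt file."""
    run_dir = Path(run_dir)
    dupes = []
    for child in sorted(run_dir.iterdir()):
        match = _TRAINER_DIR.match(child.name)
        if not match or not child.is_dir():
            continue
        # Assumption: the human-named checkpoint sits next to the Trainer dir.
        if (run_dir / f"ckpt_step{match.group(1)}.pt").is_file():
            dupes.append(child)
    return dupes
```

The result is only a candidate list. Each entry would still go through the move helper with a reason such as "duplicate of ckpt_step<step>.pt", so it ends up in the manifest.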
Why: a previous disk fill on the H100 machine killed an in-flight ablation (the loss had improved 5.45→4.57, but ckpt-2000 was corrupted) and broke the Bash runtime itself. The user does not want a repeat and prefers reversibility ("prefer not deleting things, and record what's moved, to where").
Volatility caveat: /dev/shm is RAM-backed tmpfs, so files there DO NOT survive a container/host restart. Mention this to the user when moving any file that is not regenerable from HF or source. If they want durable moves, ask first; possible alternatives: another large block-mounted volume, or just push to HF and delete.
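Because tmpfs pages are held in memory, it may also be worth checking headroom before a large move. A minimal sketch using shutil.disk_usage; the function name fits_in_tmpfs and the 10% reserve are assumptions for illustration, not figures from this note:

```python
import shutil


def fits_in_tmpfs(size_bytes, dest_dir="/dev/shm", reserve_fraction=0.10):
    """Return True if size_bytes fits in dest_dir while keeping a reserve free.

    reserve_fraction is an arbitrary safety margin (an assumption, tune as needed).
    """
    usage = shutil.disk_usage(dest_dir)
    reserve = int(usage.total * reserve_fraction)
    return size_bytes <= usage.free - reserve
```

If the check fails, that is a natural point to ask the user about a durable alternative instead of moving anyway.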
How to apply: at every meaningful milestone (oracle done, ablation step landed, comparison done): (a) push final artifacts to HF, (b) MOVE previous intermediate copies to /shm/dnathinker_quarantine/... and append to the manifest. Never silently rm to free space.
Move helper boilerplate:
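QUAR=/shm/dnathinker_quarantine
MANIFEST=$QUAR/MANIFEST.tsv
mkdir -p "$QUAR"
# Move a file or directory into the quarantine area and log it in the manifest.
# Usage: move_to_shm <path> <reason>
move_to_shm() {
  local src reason="$2" dst sz
  [ -e "$1" ] || { echo "move_to_shm: no such path: $1" >&2; return 1; }
  src=$(realpath -s -- "$1")             # absolute path, symlinks not expanded
  dst="$QUAR/${src#/}"                   # mirror the original path under $QUAR
  [ -e "$dst" ] && { echo "move_to_shm: refusing to overwrite $dst" >&2; return 1; }
  mkdir -p "$(dirname "$dst")"
  sz=$(du -sb -- "$src" | cut -f1)       # apparent size in bytes; also correct for dirs (GNU du)
  mv -- "$src" "$dst" && \
    printf "%s\t%s\t%s\t%s\t%s\n" "$(date -Iseconds)" "$src" "$dst" "$sz" "$reason" >> "$MANIFEST"
}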
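The manifest exists so that moves can be undone. A restore pass might look like the sketch below, assuming the five-column tab-separated format described above; restore_from_manifest is a hypothetical name, and the function defaults to a dry run so nothing moves until that is explicitly requested:

```python
import shutil
from pathlib import Path


def restore_from_manifest(manifest_path, dry_run=True):
    """Move quarantined items back to their source paths, newest entry first.

    Returns the (dest, source) pairs that were restored, or would be in a dry run.
    """
    with open(manifest_path, encoding="utf-8") as fh:
        rows = [line.rstrip("\n").split("\t") for line in fh if line.strip()]
    restored = []
    for row in reversed(rows):
        if len(row) < 3:
            continue  # malformed line, skip rather than guess
        src, dst = Path(row[1]), Path(row[2])
        if not dst.exists() or src.exists():
            continue  # already restored, lost on restart, or would overwrite
        if not dry_run:
            src.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(dst), str(src))
        restored.append((row[2], row[1]))
    return restored
```

Calling it once with the default dry_run=True gives a list to show the user before anything is moved back.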
HF upload boilerplate:
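from huggingface_hub import HfApi

api = HfApi()  # uses the token from a prior login or the HF_TOKEN env var
# local: path of the file on disk; repo_path: destination path inside the repo
# (for a whole checkpoint directory, api.upload_folder(folder_path=..., ...) is the analogue)
api.upload_file(path_or_fileobj=local, path_in_repo=repo_path,
                repo_id="explcre/dnathinker-checkpoints", repo_type="model")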
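The milestone routine from "How to apply" (push finals first, then move the superseded intermediates) could be wired together roughly as follows. This is a sketch under assumptions: land_milestone is a hypothetical name, api is expected to be an HfApi instance, and quarantine is any callable that moves one path and logs it (for example a Python port of move_to_shm, not shown here):

```python
def land_milestone(api, finals, intermediates, quarantine,
                   repo_id="explcre/dnathinker-checkpoints"):
    """Upload every (local, repo_path) pair in finals, then quarantine intermediates.

    Uploads run first; if any upload raises, no intermediate is moved.
    """
    for local, repo_path in finals:
        api.upload_file(path_or_fileobj=local, path_in_repo=repo_path,
                        repo_id=repo_id, repo_type="model")
    # Only reached when all uploads succeeded.
    return [quarantine(path) for path in intermediates]
```

Passing api and quarantine in as arguments keeps the ordering logic testable without touching the network or /shm.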