---
name: "Manage disk proactively: prefer move-to-/shm over delete; record manifest"
description: >-
  Disk-fill-up incidents are costly. Mitigate by uploading important artifacts
  to HF and MOVING (not deleting) intermediate files to /shm, with a manifest
  of what went where.
type: feedback
originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
---

When a long-running session generates many GB of artifacts:

  1. Upload important checkpoints to HF immediately when they are achieved; don't let them accumulate on local disk:

    • Final-quality oracles → explcre/dnathinker-checkpoints/runs/<run_dir>/
    • Major arch-ablation checkpoints (e.g., midway-decision-point ones)
    • Ground-truth motif targets / extracted labels (if costly to regenerate)
    • The bundle scripts + README so others can use them
  2. MOVE intermediate files to /shm instead of deleting them. /shm is a symlink to /dev/shm (157 GB tmpfs). Always log every move so the user can restore.

    • Quarantine area: /shm/dnathinker_quarantine/<original-relative-path>
    • Manifest log: /shm/dnathinker_quarantine/MANIFEST.tsv — append <ISO-timestamp>\t<source-abs-path>\t<dest-abs-path>\t<size-bytes>\t<reason> per move.
    • Order of priority for moving (most-unneeded first): corrupt/partial checkpoints; then obsolete intermediates whose eval_loss is worse than the final checkpoint's; then HF Trainer's checkpoint-<step>/ dirs that duplicate a human-named ckpt_step<step>.pt (see the sketch after this list); then eval-output caches (FIMO TSVs).
    • Only resort to an actual rm when the file is truly known to be recoverable AND the user explicitly approves (e.g., regenerable in minutes from versioned source).
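
As a concrete example of the third tier, here is a minimal sketch in Python, assuming the quarantine layout described above; it moves HF Trainer checkpoint-<step>/ dirs whose human-named ckpt_step<step>.pt twin already exists. The helper names (tree_size, quarantine, quarantine_duplicate_trainer_ckpts) are hypothetical, not part of the original workflow:

```python
# Hypothetical sketch: quarantine Trainer checkpoint-<step>/ dirs that
# duplicate a human-named ckpt_step<step>.pt (third tier in the list above).
import datetime
import os
import re
import shutil

QUAR = "/shm/dnathinker_quarantine"
MANIFEST = os.path.join(QUAR, "MANIFEST.tsv")

def tree_size(path):
    """Size in bytes of a file, or the total of a directory tree."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(os.path.getsize(os.path.join(root, f))
               for root, _dirs, files in os.walk(path) for f in files)

def quarantine(src, reason):
    """Move src under QUAR, mirroring its absolute path, and log the move."""
    src = os.path.abspath(src)
    dst = os.path.join(QUAR, src.lstrip("/"))
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    size = tree_size(src)  # compute before the move
    shutil.move(src, dst)
    with open(MANIFEST, "a") as m:
        m.write(f"{datetime.datetime.now().isoformat()}\t{src}\t{dst}\t{size}\t{reason}\n")

def quarantine_duplicate_trainer_ckpts(run_dir):
    """Quarantine checkpoint-<step>/ only when ckpt_step<step>.pt sits beside it."""
    for name in sorted(os.listdir(run_dir)):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m and os.path.exists(os.path.join(run_dir, f"ckpt_step{m.group(1)}.pt")):
            quarantine(os.path.join(run_dir, name),
                       f"duplicates ckpt_step{m.group(1)}.pt")
```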

Why: a previous H100 disk-fill killed an in-flight ablation (loss 5.45→4.57 had been reached, but ckpt-2000 was corrupted) and broke the Bash runtime itself. The user does not want a repeat and prefers reversibility ("prefer not deleting things, and record what's moved, to where").

Volatility caveat: /dev/shm is a RAM-backed tmpfs; files there DO NOT survive a container/host restart. Mention this to the user when moving any file that is not regenerable from HF or source. If they want durable moves, ask first; possible alternatives: another large block-mounted volume, or pushing to HF and then deleting.

How to apply: at every meaningful milestone (oracle done, ablation step landed, comparison done), (a) push final artifacts to HF, then (b) MOVE the superseded intermediate copies to /shm/dnathinker_quarantine/... and append each move to the manifest. Never silently rm to free space.

Move helper boilerplate:

```bash
QUAR=/shm/dnathinker_quarantine
MANIFEST=$QUAR/MANIFEST.tsv
mkdir -p "$QUAR"
# Move a file or directory into quarantine, mirroring its absolute path,
# and append a TSV row to the manifest only if the move succeeds.
move_to_shm() {
  local src="$1" reason="$2"
  local rel="${src#/}"; local dst="$QUAR/$rel"
  mkdir -p "$(dirname "$dst")"
  # du -sb gives the tree total for directories as well as files;
  # stat -c %s would misreport a directory (inode size, not contents).
  local sz; sz=$(du -sb "$src" | cut -f1)
  mv "$src" "$dst" && \
    printf '%s\t%s\t%s\t%s\t%s\n' "$(date -Iseconds)" "$src" "$dst" "$sz" "$reason" >> "$MANIFEST"
}
```
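
The manifest also makes moves reversible. A minimal restore sketch, assuming the MANIFEST.tsv layout above; restore_all is a hypothetical helper, not part of the original workflow:

```python
# Hypothetical restore helper: undo quarantine moves recorded in MANIFEST.tsv.
import os
import shutil

MANIFEST = "/shm/dnathinker_quarantine/MANIFEST.tsv"

def restore_all(manifest=MANIFEST):
    """Move every quarantined path back to its original location, newest first."""
    with open(manifest) as f:
        rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    for _ts, src, dst, _size, _reason in reversed(rows):
        if os.path.exists(dst):  # skip rows that were already restored
            os.makedirs(os.path.dirname(src), exist_ok=True)
            shutil.move(dst, src)
```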

HF upload boilerplate:

```python
from huggingface_hub import HfApi

api = HfApi()  # authenticates via HF_TOKEN or cached `huggingface-cli login`
local = "ckpt_step2000.pt"                     # placeholder: local artifact
repo_path = "runs/<run_dir>/ckpt_step2000.pt"  # placeholder: repo destination
api.upload_file(path_or_fileobj=local, path_in_repo=repo_path,
                repo_id="explcre/dnathinker-checkpoints", repo_type="model")
```
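
For whole checkpoint directories or the bundle scripts + README, the same library's upload_folder does the directory-level equivalent; the paths below are illustrative placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()
# Illustrative paths; substitute the real run directory.
api.upload_folder(folder_path="outputs/<run_dir>",
                  path_in_repo="runs/<run_dir>",
                  repo_id="explcre/dnathinker-checkpoints", repo_type="model")
```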