---
name: Manage disk proactively — prefer move-to-/shm over delete; record manifest
description: Disk-fill-up incidents are costly. Mitigate by uploading important artifacts to HF and MOVING (not deleting) intermediate files to /shm, with a manifest of what went where.
type: feedback
originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
---
When a long-running session generates many GB of artifacts:
1. **Upload important checkpoints to HF immediately** as they are produced; don't let them accumulate on local disk:
   - Final-quality oracles → `explcre/dnathinker-checkpoints/runs/<run_dir>/`
- Major arch-ablation checkpoints (e.g., midway-decision-point ones)
- Ground-truth motif targets / extracted labels (if costly to regenerate)
- The bundle scripts + README so others can use them
2. **MOVE intermediate files to `/shm` instead of deleting them.** `/shm` is a symlink to `/dev/shm` (157 GB tmpfs). Always log every move so the user can restore.
- Quarantine area: `/shm/dnathinker_quarantine/<original-relative-path>`
   - Manifest log: `/shm/dnathinker_quarantine/MANIFEST.tsv` — append `<ISO-timestamp>\t<source-abs-path>\t<dest-abs-path>\t<size-bytes>\t<reason>` per move.
   - Priority order for moving (most expendable first): corrupt/partial ckpts; then obsolete intermediate checkpoints whose eval_loss is worse than the final's; then HF Trainer `checkpoint-<step>/` dirs that duplicate human-named `ckpt_step<step>.pt`; then eval-output caches (FIMO TSVs).
- Only resort to actual `rm` when the file is truly known-recoverable AND the user explicitly approves (e.g., regenerable in minutes from versioned source).
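The third bucket above can be enumerated mechanically. A sketch, assuming the `checkpoint-<N>/` and `ckpt_step<N>.pt` naming conventions from the list (`duplicate_trainer_ckpts` is a hypothetical helper, not part of the existing scripts):

```bash
# Sketch: list Trainer checkpoint dirs whose step already has a
# human-named ckpt_step<N>.pt next to them (assumed layout:
# <run>/checkpoint-<N>/ alongside <run>/ckpt_step<N>.pt).
duplicate_trainer_ckpts() {
  local run_dir="$1"
  for d in "$run_dir"/checkpoint-*/; do
    [ -d "$d" ] || continue               # no matches: glob stays literal
    local step="${d%/}"
    step="${step##*checkpoint-}"          # extract <N> from the dir name
    [ -f "$run_dir/ckpt_step${step}.pt" ] && echo "${d%/}"
  done
}
```

The printed paths can be fed straight to `move_to_shm`, one per line.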
**Why:** a previous H100 disk fill killed an in-flight ablation (loss had improved 5.45→4.57, but ckpt-2000 was corrupted) and broke the Bash runtime itself. The user does not want a repeat and prefers reversibility ("prefer not deleting things, and record what's moved, to where").
**Volatility caveat:** `/dev/shm` is a RAM-backed tmpfs — files there DO NOT survive a container/host restart. Mention this to the user when moving any file that is not regenerable from HF or source. If they want durable moves, ask first; possible alternatives: another large block-mounted volume, or pushing to HF and then deleting locally.
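A quick pre-flight check makes this caveat concrete. A sketch assuming GNU coreutils on Linux (`shm_info` is a hypothetical helper): report the destination's filesystem type and free space, and warn when it is tmpfs.

```bash
# Sketch: print a directory's filesystem type and free space;
# warn when it is RAM-backed tmpfs (contents vanish on restart).
shm_info() {
  local dir="$1"
  local fstype avail_kb
  fstype=$(stat -f -c %T "$dir")                      # "tmpfs" for /dev/shm
  avail_kb=$(df -k --output=avail "$dir" | tail -1 | tr -d ' ')
  echo "$dir: $fstype, ${avail_kb} KiB free"
  if [ "$fstype" = "tmpfs" ]; then
    echo "WARNING: $dir is RAM-backed; files will NOT survive a restart" >&2
  fi
}
```

Run `shm_info /shm` before quarantining anything that is not regenerable from HF or source.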
**How to apply:** at every meaningful milestone (oracle done, ablation step landed, comparison done): (a) push final artifacts to HF, (b) MOVE previous intermediate copies to `/shm/dnathinker_quarantine/...` and append to the manifest. Never silently `rm` to free space.
**Move helper boilerplate:**
```bash
QUAR=/shm/dnathinker_quarantine
MANIFEST=$QUAR/MANIFEST.tsv
mkdir -p "$QUAR"
# Move a file or directory into the quarantine area and log the move.
move_to_shm() {
  local src="$1" reason="$2"
  local rel="${src#/}"                 # mirror the absolute path under $QUAR
  local dst="$QUAR/$rel"
  mkdir -p "$(dirname "$dst")"
  local sz
  if [ -d "$src" ]; then
    sz=$(du -sb "$src" | cut -f1)      # recursive size for directories
  else
    sz=$(stat -c %s "$src")
  fi
  mv "$src" "$dst" && \
    printf "%s\t%s\t%s\t%s\t%s\n" "$(date -Iseconds)" "$src" "$dst" "$sz" "$reason" >> "$MANIFEST"
}
```
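Since the point of the manifest is reversibility, a matching restore sketch (a hypothetical helper, mirroring the move helper's TSV columns: timestamp, src, dst, size, reason):

```bash
# Sketch: replay MANIFEST.tsv to move quarantined files back where they came from.
QUAR=/shm/dnathinker_quarantine
restore_from_manifest() {
  local manifest="${1:-$QUAR/MANIFEST.tsv}"
  [ -f "$manifest" ] || { echo "no manifest at $manifest" >&2; return 1; }
  # tac replays newest moves first, so later moves are undone before earlier ones
  tac "$manifest" | while IFS=$'\t' read -r ts src dst sz reason; do
    [ -e "$dst" ] || continue          # already restored or deliberately removed
    mkdir -p "$(dirname "$src")"
    mv "$dst" "$src" && echo "restored $src"
  done
}
```

Entries are left in the manifest after restoring; a missing `dst` is simply skipped, so replaying is idempotent.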
**HF upload boilerplate:**
```python
from huggingface_hub import HfApi

api = HfApi()
local = "path/to/ckpt.pt"             # placeholder: local artifact to push
repo_path = "runs/<run_dir>/ckpt.pt"  # placeholder: destination path in the repo
api.upload_file(path_or_fileobj=local, path_in_repo=repo_path,
                repo_id="explcre/dnathinker-checkpoints", repo_type="model")
```
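For whole run directories (the `runs/<run_dir>/` layout in step 1), the same API's `upload_folder` pushes everything in one call. A sketch with the run directory as an illustrative placeholder, guarded so it only attempts the upload when credentials are configured:

```python
import os

from huggingface_hub import HfApi

api = HfApi()

def push_run_dir(run_dir: str) -> None:
    """Push a whole local run directory to the checkpoints repo."""
    api.upload_folder(
        folder_path=run_dir,                      # local directory to push
        path_in_repo=run_dir,                     # mirrored path inside the repo
        repo_id="explcre/dnathinker-checkpoints",
        repo_type="model",
    )

# Only attempt the upload when an HF token is available in the environment.
if os.environ.get("HF_TOKEN"):
    push_run_dir("runs/my_run")  # placeholder run directory
```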