---
name: Manage disk proactively; prefer move-to-/shm over delete; record manifest
description: Disk-fill-up incidents are costly. Mitigate by uploading important artifacts to HF and MOVING (not deleting) intermediate files to /shm, with a manifest of what went where.
type: feedback
originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
---
When a long-running session generates many GB of artifacts:

1. **Upload important checkpoints to HF immediately** once they are produced; do not let them accumulate on local disk:
   - Final-quality oracles → `explcre/dnathinker-checkpoints/runs/<run_dir>/`
   - Major arch-ablation checkpoints (e.g., midway-decision-point ones)
   - Ground-truth motif targets / extracted labels (if costly to regenerate)
   - The bundle scripts + README so others can use them

2. **MOVE intermediate files to `/shm` instead of deleting them.** `/shm` is a symlink to `/dev/shm` (157 GB tmpfs). Always log every move so the user can restore.
   - Quarantine area: `/shm/dnathinker_quarantine/<original-relative-path>`
   - Manifest log: `/shm/dnathinker_quarantine/MANIFEST.tsv` — append `<ISO-timestamp>\t<source-abs-path>\t<dest-abs-path>\t<size-bytes>\t<reason>` per move.
   - Priority order for moving (least-needed first): corrupt/partial ckpts, then obsolete intermediate ckpts whose eval_loss is worse than the final one's, then HF Trainer's `checkpoint-<step>/` dirs that duplicate human-named `ckpt_step<step>.pt`, then eval-output caches (FIMO TSVs).
   - Only resort to actual `rm` when the file is truly known-recoverable AND the user explicitly approves (e.g., regenerable in minutes from versioned source).
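
The manifest format above is what makes a move reversible. A minimal sketch of the restore side, assuming every manifest line follows the five tab-separated columns described above; `restore_from_manifest` and its `dry_run` flag are hypothetical names, not an existing tool:

```python
import shutil
from pathlib import Path


def restore_from_manifest(manifest: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Move quarantined paths back to their original locations.

    Returns the (dest, source) pairs that were restored, or that would be
    restored when dry_run is True.
    """
    restored = []
    with open(manifest) as fh:
        for line in fh:
            # Columns: timestamp, source, dest, size-bytes, reason
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            src, dst = Path(fields[1]), Path(fields[2])
            # Restore only if the quarantined copy still exists and nothing
            # has been recreated at the original location in the meantime.
            if dst.exists() and not src.exists():
                if not dry_run:
                    src.parent.mkdir(parents=True, exist_ok=True)
                    shutil.move(str(dst), str(src))
                restored.append((str(dst), str(src)))
    return restored
```

Run it with `dry_run=True` first and show the list to the user before restoring anything.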

**Why:** A previous disk fill-up on the H100 machine killed an in-flight ablation (loss had improved from 5.45 to 4.57, but ckpt-2000 was corrupted) and broke the Bash runtime itself. The user does not want a repeat and prefers reversibility ("prefer not deleting things, and record what's moved, to where").

**Volatility caveat:** `/dev/shm` is RAM-backed tmpfs, so files there DO NOT survive a container or host restart. Mention this to the user when moving any file that is not regenerable from HF or source. If they want durable moves, ask first; possible alternatives: another large block-mounted volume, or just push to HF and delete.
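
Because tmpfs pages live in RAM, it may also be worth checking that a move will not exhaust `/dev/shm` before starting it. A minimal sketch using the standard library; the function name and the 8 GiB default reserve are assumptions for illustration, not measured requirements:

```python
import shutil


def fits_in_tmpfs(size_bytes: int, mount: str = "/dev/shm",
                  reserve_bytes: int = 8 * 1024**3) -> bool:
    """Return True if size_bytes fits on mount while leaving reserve_bytes free."""
    free = shutil.disk_usage(mount).free
    return size_bytes + reserve_bytes <= free
```

If this returns False, fall back to the durable options above rather than forcing the move.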

**How to apply:** at every meaningful milestone (oracle done, ablation step landed, comparison done): (a) push final artifacts to HF, (b) MOVE previous intermediate copies to `/shm/dnathinker_quarantine/...` and append to the manifest. Never silently `rm` to free space.
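
One of the move candidates in the priority order, HF Trainer `checkpoint-<step>/` dirs that duplicate `ckpt_step<step>.pt`, can be listed mechanically. A minimal sketch, assuming both live directly under the same run directory (that layout is an assumption) and matching on step number only, without comparing contents; `duplicated_trainer_ckpts` is a hypothetical helper name:

```python
import re
from pathlib import Path


def duplicated_trainer_ckpts(run_dir: str) -> list[Path]:
    """List checkpoint-<step>/ dirs whose step also has a ckpt_step<step>.pt file."""
    run = Path(run_dir)
    dupes = []
    for d in run.glob("checkpoint-*"):
        m = re.fullmatch(r"checkpoint-(\d+)", d.name)
        if m and d.is_dir() and (run / f"ckpt_step{m.group(1)}.pt").is_file():
            dupes.append(d)
    # Sort numerically by step so the output is stable and easy to review.
    return sorted(dupes, key=lambda p: int(p.name.split("-")[1]))
```

Treat the result as a list of candidates to show the user, not as approval to move them.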

**Move helper boilerplate:**
```bash
QUAR=/shm/dnathinker_quarantine
MANIFEST=$QUAR/MANIFEST.tsv
mkdir -p "$QUAR"
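move_to_shm() {
  local src="$1" reason="$2"
  # Record absolute paths so the manifest stays unambiguous.
  case "$src" in /*) ;; *) src="$PWD/$src" ;; esac
  [ -e "$src" ] || { echo "move_to_shm: no such path: $src" >&2; return 1; }
  local rel="${src#/}"; local dst="$QUAR/$rel"
  # Refuse to clobber (or nest inside) an earlier quarantined copy.
  [ -e "$dst" ] && { echo "move_to_shm: already exists: $dst" >&2; return 1; }
  mkdir -p "$(dirname "$dst")"
  # du -sb totals directory contents; stat -c %s would report only the directory entry size.
  local sz; sz=$(du -sb "$src" | cut -f1)
  mv "$src" "$dst" && \
    printf "%s\t%s\t%s\t%s\t%s\n" "$(date -Iseconds)" "$src" "$dst" "$sz" "$reason" >> "$MANIFEST"
}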
```
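
To report how much space the moves have freed, the manifest can be totalled per reason. A minimal sketch, assuming the five tab-separated columns described earlier and skipping any line that does not match; `bytes_by_reason` is a hypothetical helper name:

```python
from collections import defaultdict


def bytes_by_reason(manifest: str) -> dict[str, int]:
    """Sum the size-bytes column of the manifest, grouped by the reason column."""
    totals = defaultdict(int)
    with open(manifest) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            # Expect: timestamp, source, dest, size-bytes, reason
            if len(fields) != 5 or not fields[3].isdigit():
                continue
            totals[fields[4]] += int(fields[3])
    return dict(totals)
```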

**HF upload boilerplate:**
```python
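from huggingface_hub import HfApi

api = HfApi()  # uses the locally saved login token or the HF_TOKEN env var

def upload_ckpt(local: str, repo_path: str) -> None:
    """Upload one local file to repo_path inside the checkpoints repo."""
    api.upload_file(path_or_fileobj=local, path_in_repo=repo_path,
                    repo_id="explcre/dnathinker-checkpoints", repo_type="model")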
```
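
For a whole run directory, `HfApi.upload_folder` avoids looping over files. A minimal sketch that follows the `runs/<run_dir>/` layout from step 1; `upload_run_dir` is a hypothetical wrapper, and taking `api` as a parameter is a design choice so the function can be exercised without network access:

```python
from pathlib import Path


def upload_run_dir(api, run_dir: str,
                   repo_id: str = "explcre/dnathinker-checkpoints"):
    """Upload a local run directory to runs/<directory name>/ in the repo."""
    name = Path(run_dir).name  # a trailing slash on run_dir is ignored
    return api.upload_folder(
        folder_path=run_dir,
        path_in_repo=f"runs/{name}",
        repo_id=repo_id,
        repo_type="model",
    )

# Usage (requires network access and a valid token):
#   from huggingface_hub import HfApi
#   upload_run_dir(HfApi(), "/path/to/run_dir")
```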