R8 Pre-Flight Checklist

Run: Round 8 (final deleaked 5-class CyberNER training)
GPU: RTX PRO 6000 (96GB VRAM), Vast.ai
SSH: ssh -p 18732 root@199.126.203.145

1. Sync (run from local machine)

Run dry-run first: bash scripts/sync_to_gpu.sh --dry-run
Review output — confirm only expected files listed
Run real sync: bash scripts/sync_to_gpu.sh
Verify post-sync checks all show ✅
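
The dry-run / review / real-sync order above can be enforced with a small wrapper. This is only a sketch: the function name and the SYNC_SCRIPT override are hypothetical and not part of scripts/sync_to_gpu.sh.

```shell
# Wrapper around the two sync commands above: dry run, manual review, real sync.
# SYNC_SCRIPT is a hypothetical override for illustration; the sync script
# itself does not read it.
sync_with_review() (
    script="${SYNC_SCRIPT:-scripts/sync_to_gpu.sh}"

    # Step 1: dry run; stop here if the script itself fails.
    bash "$script" --dry-run || exit 1

    # Step 2: a human confirms that only expected files were listed.
    printf 'Only expected files listed? Run the real sync? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) bash "$script" ;;
        *)   echo "Aborted; nothing was synced."; exit 1 ;;
    esac
)
```

Usage, from the repo root on the local machine: sync_with_review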

2. Vendor Code (CRITICAL)

--o-downsample flag exists in opf/_train/args.py on GPU
_apply_o_downsample function exists in opf/_train/runner.py on GPU
opf reinstalled after sync: cd ~/alkyline/vendor/privacy-filter && pip install -e .
Verify: opf train --help | grep o-downsample shows the flag
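
The two source-level checks above can be run in one pass on the GPU box. The sketch below assumes that opf/_train/ sits directly under the vendor checkout (~/alkyline/vendor/privacy-filter), that the flag appears literally as --o-downsample in args.py, and that _apply_o_downsample is a plain Python def. The function name is hypothetical.

```shell
# Check that the synced vendor tree contains the o-downsample patch.
# Usage: check_o_downsample_patch /path/to/vendor/checkout
check_o_downsample_patch() (
    root="$1"
    status=0

    # Assumption: the flag string appears literally in args.py.
    if grep -q -e '--o-downsample' "$root/opf/_train/args.py" 2>/dev/null; then
        echo "OK       --o-downsample flag in opf/_train/args.py"
    else
        echo "MISSING  --o-downsample flag in opf/_train/args.py"
        status=1
    fi

    # Assumption: the helper is defined with an ordinary "def".
    if grep -q 'def _apply_o_downsample' "$root/opf/_train/runner.py" 2>/dev/null; then
        echo "OK       _apply_o_downsample in opf/_train/runner.py"
    else
        echo "MISSING  _apply_o_downsample in opf/_train/runner.py"
        status=1
    fi

    exit "$status"
)
```

Usage: check_o_downsample_patch ~/alkyline/vendor/privacy-filter. A pass here only shows the source is present; the opf train --help check is still needed to confirm the reinstall picked it up.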

3. Data Files on GPU

data/processed/r8_5class_train.jsonl exists (~18MB, deleaked training set)
data/processed/r8_5class_valid.jsonl exists (~1.3MB, deleaked validation set)
data/processed/aptner_5class_test_clean.jsonl exists (~56KB, independent test)
data/processed/securebert2_5class_test.jsonl exists (~64KB)
data/processed/enriched_5class_test.jsonl exists (should already be on GPU)
data/processed/cyner_test.jsonl exists (should already be on GPU)
data/label_spaces/cyner_5class.json exists
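
A loop can check the seven files above in one go. This sketch only tests that each file exists and is non-empty, and prints its size so it can be compared by eye with the approximate sizes listed. The function name is hypothetical.

```shell
# Report presence and size of every R8 data file under a repo root.
# Usage: check_data_files [repo_root]   (defaults to the current directory)
check_data_files() (
    root="${1:-.}"
    status=0
    for f in \
        data/processed/r8_5class_train.jsonl \
        data/processed/r8_5class_valid.jsonl \
        data/processed/aptner_5class_test_clean.jsonl \
        data/processed/securebert2_5class_test.jsonl \
        data/processed/enriched_5class_test.jsonl \
        data/processed/cyner_test.jsonl \
        data/label_spaces/cyner_5class.json
    do
        # -s: file exists and is not empty.
        if [ -s "$root/$f" ]; then
            size=$(du -h "$root/$f" | cut -f1)
            echo "OK       $f ($size)"
        else
            echo "MISSING  $f"
            status=1
        fi
    done
    exit "$status"
)
```

Usage: check_data_files ~/alkyline (assuming the repo root on the GPU is ~/alkyline, as the checkpoint path in this checklist suggests).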

4. Scripts on GPU

scripts/run_train_v8.sh is executable: chmod +x scripts/run_train_v8.sh
scripts/early_stop_monitor.sh is executable: chmod +x scripts/early_stop_monitor.sh
scripts/viterbi_grid_search.py exists
scripts/checkpoint_avg.py exists

5. GPU Environment

Python 3.12+ available
PyTorch with CUDA: python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
OPF installed: opf --help
CUDA memory clear: nvidia-smi — no other processes using GPU
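
The Python-version and idle-GPU checks can be made pass/fail instead of read by eye. A minimal sketch with hypothetical function names. gpu_is_idle reads nvidia-smi's per-process listing on stdin. On a containerized instance that listing may be incomplete, so an empty list is a hint rather than proof.

```shell
# Succeeds if python3 is at least MAJOR.MINOR.
# Usage: python_at_least 3 12
python_at_least() {
    python3 -c 'import sys; sys.exit(0 if sys.version_info[:2] >= (int(sys.argv[1]), int(sys.argv[2])) else 1)' "$1" "$2"
}

# Reads the per-process listing from nvidia-smi on stdin and fails if any
# compute process is listed.
gpu_is_idle() (
    # Count non-blank lines; each one is a process holding GPU memory.
    busy=$(grep -c '[^[:space:]]')
    if [ "$busy" -eq 0 ]; then
        echo "GPU idle"
    else
        echo "GPU busy: $busy compute process(es)"
        exit 1
    fi
)
```

Usage:
python_at_least 3 12 && echo "Python OK"
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader | gpu_is_idle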

6. Disk Space

At least 15GB free: df -h / (R8 checkpoints need ~10.5GB: 15 epochs × ~700MB each)
Consider pruning old checkpoints: du -sh ~/alkyline/checkpoints/*/
Current state: 29GB used by checkpoints, 30GB free → above the 15GB minimum, though pruning old checkpoints is still advisable for headroom
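
The 15GB threshold can be checked mechanically rather than read off df -h. A sketch with hypothetical function names. Sizes are in the same binary units that df -h prints as "G".

```shell
# Free space, in whole GB, on the filesystem that holds the given path.
free_gb() {
    # -P forces one POSIX-format line per filesystem; -k reports 1K blocks.
    # Column 4 is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Fails if fewer than MIN_GB are free.
# Usage: require_free_gb PATH MIN_GB
require_free_gb() (
    path="$1"
    min_gb="$2"
    have=$(free_gb "$path")
    if [ "$have" -ge "$min_gb" ]; then
        echo "OK: ${have}GB free on $path (need ${min_gb}GB)"
    else
        echo "LOW: ${have}GB free on $path (need ${min_gb}GB)"
        exit 1
    fi
)
```

Usage: require_free_gb / 15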

7. Quick Smoke Test

Run a 1-epoch test to confirm everything works end-to-end:

opf train data/processed/r8_5class_train.jsonl \
  --validation-dataset data/processed/r8_5class_valid.jsonl \
  --label-space-json data/label_spaces/cyner_5class.json \
  --output-dir /tmp/r8_smoke \
  --epochs 1 --batch-size 4 --grad-accum-steps 2 \
  --learning-rate 5e-5 --lr-schedule cosine \
  --loss-fn focal --focal-gamma 2.0 \
  --llrd-factor 0.9 --o-downsample 0.7 \
  --device cuda

Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint
Delete smoke test: rm -rf /tmp/r8_smoke
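
Before deleting /tmp/r8_smoke, the two smoke-test conditions can be checked with a script. This sketch assumes the smoke-test output was also captured to a log file (for example by appending 2>&1 | tee /tmp/r8_smoke.log to the command; that path is hypothetical) and that the string val_loss appears literally in the output. Checkpoint file names are not known here, so it only checks that the output directory contains at least one file.

```shell
# Usage: smoke_test_ok LOG_FILE OUTPUT_DIR
smoke_test_ok() (
    log="$1"
    out_dir="$2"

    # Assumption: the trainer prints "val_loss" somewhere in its output.
    if ! grep -q 'val_loss' "$log" 2>/dev/null; then
        echo "FAIL: no val_loss line in $log"
        exit 1
    fi

    # Any regular file under the output dir counts as "something was saved".
    if [ -z "$(find "$out_dir" -type f 2>/dev/null | head -n 1)" ]; then
        echo "FAIL: no files under $out_dir"
        exit 1
    fi

    echo "OK: val_loss logged and output written to $out_dir"
)
```

Usage: smoke_test_ok /tmp/r8_smoke.log /tmp/r8_smoke && rm -rf /tmp/r8_smoke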

8. Launch

Start in tmux/screen: tmux new -s r8
Run: bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log
Verify first epoch completes and val_loss is logged
Detach: Ctrl-B d

Known Risks

Risk	Mitigation
Disk fills during training (29GB checkpoints + 15 new epochs)	Delete old R5/R6 checkpoints before starting
`--o-downsample` not recognized	Sync script reinstalls opf; verify with `opf train --help`
Training killed by Vast.ai timeout	Use tmux; check `vast show instances` periodically
OOM at `--batch-size 4`	Unlikely on 96GB; if it happens, use `--batch-size 2 --grad-accum-steps 4` (same effective batch size of 8)