R8 Pre-Flight Checklist
Run: Round 8 — Final deleaked 5-class CyberNER training
GPU: RTX PRO 6000 (96GB VRAM), Vast.ai
SSH: ssh -p 18732 root@199.126.203.145
1. Sync (run from local machine)
- Run dry-run first: `bash scripts/sync_to_gpu.sh --dry-run`
- Review output: confirm only expected files are listed
- Run real sync: `bash scripts/sync_to_gpu.sh`
- Verify post-sync checks all show ✅
2. Vendor Code (CRITICAL)
- `--o-downsample` flag exists in `opf/_train/args.py` on GPU
- `_apply_o_downsample` function exists in `opf/_train/runner.py` on GPU
- `opf` reinstalled after sync: `cd ~/alkyline/vendor/privacy-filter && pip install -e .`
- Verify: `opf train --help | grep o-downsample` shows the flag
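
The two source-level checks above can be scripted. This is a minimal sketch, not part of `opf`: `require_pattern` is a hypothetical helper, and the usage lines assume the `opf/_train/...` paths are relative to `~/alkyline/vendor/privacy-filter`.

```shell
# require_pattern FILE PATTERN
# Hypothetical helper: succeeds only if FILE exists and contains PATTERN
# as a literal string.
require_pattern() {
  local file="$1" pattern="$2"
  if [ ! -f "$file" ]; then
    echo "MISSING FILE: $file"
    return 1
  fi
  # -F: literal match, -q: quiet, --: lets PATTERN start with a dash
  if grep -Fq -- "$pattern" "$file"; then
    echo "OK: '$pattern' found in $file"
  else
    echo "NOT FOUND: '$pattern' in $file"
    return 1
  fi
}

# Usage (assumed working directory: ~/alkyline/vendor/privacy-filter):
#   require_pattern opf/_train/args.py   "--o-downsample"
#   require_pattern opf/_train/runner.py "_apply_o_downsample"
```

A source match only shows the patch was synced; the `opf train --help` check is still needed to confirm the installed package picked it up.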
3. Data Files on GPU
- `data/processed/r8_5class_train.jsonl` exists (~18MB, deleaked training set)
- `data/processed/r8_5class_valid.jsonl` exists (~1.3MB, deleaked validation set)
- `data/processed/aptner_5class_test_clean.jsonl` exists (~56KB, independent test)
- `data/processed/securebert2_5class_test.jsonl` exists (~64KB)
- `data/processed/enriched_5class_test.jsonl` exists (should already be on GPU)
- `data/processed/cyner_test.jsonl` exists (should already be on GPU)
- `data/label_spaces/cyner_5class.json` exists
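
One way to check the whole list in a single pass is sketched below. `check_files` is a hypothetical helper; it reports byte sizes so they can be compared by eye against the approximate sizes listed above, and it treats an empty file as a failure.

```shell
# check_files PATH...
# Hypothetical helper: prints the size of each file and returns 1 if any
# path is missing or empty.
check_files() {
  local status=0 f size
  for f in "$@"; do
    if [ -s "$f" ]; then              # -s: exists and is non-empty
      size=$(wc -c < "$f" | tr -d ' ')
      echo "OK      $f ($size bytes)"
    else
      echo "MISSING $f"
      status=1
    fi
  done
  return "$status"
}

# Usage (run from the project root on the GPU):
#   check_files \
#     data/processed/r8_5class_train.jsonl \
#     data/processed/r8_5class_valid.jsonl \
#     data/processed/aptner_5class_test_clean.jsonl \
#     data/processed/securebert2_5class_test.jsonl \
#     data/processed/enriched_5class_test.jsonl \
#     data/processed/cyner_test.jsonl \
#     data/label_spaces/cyner_5class.json
```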
4. Scripts on GPU
- `scripts/run_train_v8.sh` is executable: `chmod +x scripts/run_train_v8.sh`
- `scripts/early_stop_monitor.sh` is executable
- `scripts/viterbi_grid_search.py` exists
- `scripts/checkpoint_avg.py` exists
5. GPU Environment
- Python 3.12+ available
- PyTorch with CUDA: `python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"`
- OPF installed: `opf --help`
- CUDA memory clear: `nvidia-smi` shows no other processes using GPU
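
The Python version requirement can be checked mechanically rather than by eye. The sketch below assumes a `sort` that supports `-V` (GNU coreutils does); `version_ge` is a hypothetical helper name.

```shell
# version_ge HAVE WANT
# Hypothetical helper: succeeds if version HAVE is at least version WANT.
# Relies on `sort -V` (version sort), assumed to be available.
version_ge() {
  local have="$1" want="$2"
  # If WANT sorts first (or the two are equal), HAVE is new enough.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Usage on the GPU:
#   have=$(python3 -c 'import platform; print(platform.python_version())')
#   version_ge "$have" 3.12 && echo "Python $have OK" || echo "Python $have too old"
```

A plain string comparison would rank 3.9 above 3.12, which is why a version-aware sort is used here.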
6. Disk Space
- At least 15GB free: `df -h /` (R8 checkpoints ~10GB: 15 epochs × ~700MB each)
- Consider pruning old checkpoints: `du -sh ~/alkyline/checkpoints/*/`
- Current state: 29GB used by checkpoints, 30GB free → may need cleanup
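
The checkpoint estimate is simple arithmetic: 15 epochs × ~700MB ≈ 10,500MB, or about 10.5GB, which the 15GB threshold covers with some headroom. A minimal sketch of both the estimate and the free-space check follows; the helper names are hypothetical, and `df` reports in GiB-sized units here, so treat the comparison as approximate.

```shell
# estimate_mb EPOCHS MB_PER_CHECKPOINT: total checkpoint size in MB.
estimate_mb() {
  echo $(( $1 * $2 ))
}

# free_gb PATH: whole GiB available on the filesystem that holds PATH.
free_gb() {
  # -P: one line per filesystem, -k: 1024-byte blocks; column 4 is "Available"
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_free_gb PATH MIN_GB: succeeds only if at least MIN_GB GiB are free.
has_free_gb() {
  local avail
  avail=$(free_gb "$1")
  [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Usage:
#   estimate_mb 15 700        # prints 10500
#   has_free_gb / 15 || echo "Less than 15GB free: prune old checkpoints first"
```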
7. Quick Smoke Test
- Run a 1-epoch test to confirm everything works end-to-end: `opf train data/processed/r8_5class_train.jsonl --validation-dataset data/processed/r8_5class_valid.jsonl --label-space-json data/label_spaces/cyner_5class.json --output-dir /tmp/r8_smoke --epochs 1 --batch-size 4 --grad-accum-steps 2 --learning-rate 5e-5 --lr-schedule cosine --loss-fn focal --focal-gamma 2.0 --llrd-factor 0.9 --o-downsample 0.7 --device cuda`
- Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint
- Delete smoke test: `rm -rf /tmp/r8_smoke`
8. Launch
- Start in tmux/screen: `tmux new -s r8`
- Run: `bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log`
- Verify first epoch completes and val_loss is logged
- Detach: `Ctrl-B d`
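
After detaching, the run can be checked from a fresh SSH session without reattaching. This is a sketch under assumptions: the exact log line format is not shown in this checklist, so `val_loss_seen` (a hypothetical helper) only looks for the literal string `val_loss` in the log file.

```shell
# session_alive NAME: succeeds if a tmux session called NAME exists.
session_alive() {
  tmux has-session -t "$1" 2>/dev/null
}

# val_loss_seen LOGFILE
# Hypothetical helper: succeeds once the literal string "val_loss" appears
# in LOGFILE. Assumes the trainer writes that string when it logs validation.
val_loss_seen() {
  local log="$1"
  [ -f "$log" ] && grep -q 'val_loss' "$log"
}

# Usage:
#   session_alive r8 || echo "tmux session r8 is gone"
#   val_loss_seen train_r8_full.log && grep 'val_loss' train_r8_full.log | tail -n 1
```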
Known Risks
| Risk | Mitigation |
|---|---|
| Disk fills during training (29GB checkpoints + 15 new epochs) | Delete old R5/R6 checkpoints before starting |
| `--o-downsample` not recognized | Sync script reinstalls opf; verify with `opf train --help` |
| Training killed by Vast.ai timeout | Use tmux; check vast show instances periodically |
| OOM on batch_size=4 | Unlikely on 96GB; reduce to 2 + grad_accum=4 if needed |