# R8 Pre-Flight Checklist

**Run:** `Round 8 — Final deleaked 5-class CyberNER training`  
**GPU:** RTX PRO 6000 (96GB VRAM), Vast.ai  
**SSH:** `ssh -p 18732 root@199.126.203.145`

---

## 1. Sync (run from local machine)

- [ ] Run dry-run first: `bash scripts/sync_to_gpu.sh --dry-run`
- [ ] Review output: confirm only expected files listed
- [ ] Run real sync: `bash scripts/sync_to_gpu.sh`
- [ ] Verify post-sync checks all show ✅

## 2. Vendor Code (CRITICAL)

- [ ] `--o-downsample` flag exists in `opf/_train/args.py` on GPU
- [ ] `_apply_o_downsample` function exists in `opf/_train/runner.py` on GPU
- [ ] `opf` reinstalled after sync: `cd ~/alkyline/vendor/privacy-filter && pip install -e .`
- [ ] Verify: `opf train --help | grep o-downsample` shows the flag

## 3. Data Files on GPU

- [ ] `data/processed/r8_5class_train.jsonl` exists (~18MB, deleaked training set)
- [ ] `data/processed/r8_5class_valid.jsonl` exists (~1.3MB, deleaked validation set)
- [ ] `data/processed/aptner_5class_test_clean.jsonl` exists (~56KB, independent test)
- [ ] `data/processed/securebert2_5class_test.jsonl` exists (~64KB)
- [ ] `data/processed/enriched_5class_test.jsonl` exists (should already be on GPU)
- [ ] `data/processed/cyner_test.jsonl` exists (should already be on GPU)
- [ ] `data/label_spaces/cyner_5class.json` exists

## 4. Scripts on GPU

- [ ] `scripts/run_train_v8.sh` is executable: `chmod +x scripts/run_train_v8.sh`
- [ ] `scripts/early_stop_monitor.sh` is executable
- [ ] `scripts/viterbi_grid_search.py` exists
- [ ] `scripts/checkpoint_avg.py` exists

## 5. GPU Environment

- [ ] Python 3.12+ available
- [ ] PyTorch with CUDA: `python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"`
- [ ] OPF installed: `opf --help`
- [ ] CUDA memory clear: `nvidia-smi` shows no other processes using GPU

## 6. Disk Space

- [ ] At least 15GB free: `df -h /` (R8 checkpoints ~10GB, 15 epochs × ~700MB each)
- [ ] Consider pruning old checkpoints: `du -sh ~/alkyline/checkpoints/*/`
- [ ] Current state: 29GB used by checkpoints, 30GB free → **may need cleanup**

## 7. Quick Smoke Test

- [ ] Run a 1-epoch test to confirm everything works end-to-end:

```bash
opf train data/processed/r8_5class_train.jsonl \
  --validation-dataset data/processed/r8_5class_valid.jsonl \
  --label-space-json data/label_spaces/cyner_5class.json \
  --output-dir /tmp/r8_smoke \
  --epochs 1 --batch-size 4 --grad-accum-steps 2 \
  --learning-rate 5e-5 --lr-schedule cosine \
  --loss-fn focal --focal-gamma 2.0 \
  --llrd-factor 0.9 --o-downsample 0.7 \
  --device cuda
```

- [ ] Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint
- [ ] Delete smoke test: `rm -rf /tmp/r8_smoke`

## 8. Launch

- [ ] Start in tmux/screen: `tmux new -s r8`
- [ ] Run: `bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log`
- [ ] Verify first epoch completes and val_loss is logged
- [ ] Detach: `Ctrl-B d`

---

## Known Risks

| Risk | Mitigation |
|------|-----------|
| Disk fills during training (29GB checkpoints + 15 new epochs) | Delete old R5/R6 checkpoints before starting |
| `--o-downsample` not recognized | Sync script reinstalls opf; verify with `opf train --help` |
| Training killed by Vast.ai timeout | Use tmux; check `vast show instances` periodically |
| OOM on batch_size=4 | Unlikely on 96GB; reduce to 2 + grad_accum=4 if needed |
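
The file checks in sections 3 and 4 and the free-space check in section 6 could be automated with a small helper along these lines. This is a minimal sketch, not part of the existing scripts: the function names (`missing_files`, `free_gb`, `preflight`) are hypothetical, the repo root is assumed to be `~/alkyline` based on the paths used above, and free space is measured on the filesystem containing that root rather than strictly on `/`. It does not check the executable bit on the shell scripts.

```python
import shutil
from pathlib import Path

# Paths copied from sections 3 and 4 of this checklist.
REQUIRED_FILES = [
    "data/processed/r8_5class_train.jsonl",
    "data/processed/r8_5class_valid.jsonl",
    "data/processed/aptner_5class_test_clean.jsonl",
    "data/processed/securebert2_5class_test.jsonl",
    "data/processed/enriched_5class_test.jsonl",
    "data/processed/cyner_test.jsonl",
    "data/label_spaces/cyner_5class.json",
    "scripts/run_train_v8.sh",
    "scripts/early_stop_monitor.sh",
    "scripts/viterbi_grid_search.py",
    "scripts/checkpoint_avg.py",
]

MIN_FREE_GB = 15  # threshold from section 6


def missing_files(root, required=REQUIRED_FILES):
    """Return the required relative paths that are absent or empty under root."""
    root = Path(root)
    missing = []
    for rel in required:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


def free_gb(path="/"):
    """Free space on the filesystem containing path, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(root, min_free_gb=MIN_FREE_GB):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = [f"missing or empty: {rel}" for rel in missing_files(root)]
    available = free_gb(root)
    if available < min_free_gb:
        problems.append(f"only {available:.1f} GB free, need {min_free_gb} GB")
    return problems
```

On the GPU box, calling `preflight(Path.home() / "alkyline")` and printing the returned list would give a quick pass/fail summary before launching; the manual checkboxes above remain the source of truth.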