| # R8 Pre-Flight Checklist |
|
|
| **Run:** `Round 8 - Final deleaked 5-class CyberNER training` |
| **GPU:** RTX PRO 6000 (96GB VRAM), Vast.ai |
| **SSH:** `ssh -p 18732 root@199.126.203.145` |
|
|
| --- |
|
|
| ## 1. Sync (run from local machine) |
|
|
| - [ ] Run dry-run first: `bash scripts/sync_to_gpu.sh --dry-run` |
| - [ ] Review output: confirm only expected files are listed |
| - [ ] Run real sync: `bash scripts/sync_to_gpu.sh` |
| - [ ] Verify post-sync checks all show ✅ |
|
|
| ## 2. Vendor Code (CRITICAL) |
|
|
| - [ ] `--o-downsample` flag exists in `opf/_train/args.py` on GPU |
| - [ ] `_apply_o_downsample` function exists in `opf/_train/runner.py` on GPU |
| - [ ] `opf` reinstalled after sync: `cd ~/alkyline/vendor/privacy-filter && pip install -e .` |
| - [ ] Verify: `opf train --help | grep o-downsample` shows the flag |
|
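The two source-level checks above can be scripted with `grep`. This is a minimal sketch, not part of the repo: `require_pattern` is a hypothetical helper, and the assumption that `opf/` sits directly under `~/alkyline/vendor/privacy-filter` should be adjusted (via `VENDOR_DIR`) if the layout differs.

```shell
#!/usr/bin/env bash
# require_pattern (hypothetical helper): succeeds if the fixed string $1 occurs in file $2.
require_pattern() {
  # -F treats the pattern as a literal string; "--" stops option parsing so
  # a pattern starting with "--" is not mistaken for a grep flag.
  if grep -qF -- "$1" "$2" 2>/dev/null; then
    echo "found    '$1' in $2"
  else
    echo "MISSING  '$1' in $2"
    return 1
  fi
}

# Assumed location of the vendored package; override with VENDOR_DIR if needed.
vendor_dir="${VENDOR_DIR:-$HOME/alkyline/vendor/privacy-filter}"

status=0
require_pattern '--o-downsample' "$vendor_dir/opf/_train/args.py" || status=1
require_pattern '_apply_o_downsample' "$vendor_dir/opf/_train/runner.py" || status=1

if [ "$status" -eq 0 ]; then
  echo "Vendor patch present; reinstall opf and check 'opf train --help'."
else
  echo "Vendor patch incomplete; re-sync before reinstalling opf."
fi
```

A source-level match only shows the files were synced. The `opf train --help | grep o-downsample` check is still needed to confirm the installed package picks them up.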
|
| ## 3. Data Files on GPU |
|
|
| - [ ] `data/processed/r8_5class_train.jsonl` exists (~18MB, deleaked training set) |
| - [ ] `data/processed/r8_5class_valid.jsonl` exists (~1.3MB, deleaked validation set) |
| - [ ] `data/processed/aptner_5class_test_clean.jsonl` exists (~56KB, independent test) |
| - [ ] `data/processed/securebert2_5class_test.jsonl` exists (~64KB) |
| - [ ] `data/processed/enriched_5class_test.jsonl` exists (should already be on GPU) |
| - [ ] `data/processed/cyner_test.jsonl` exists (should already be on GPU) |
| - [ ] `data/label_spaces/cyner_5class.json` exists |
|
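Checking seven paths by hand is easy to get wrong, so the list above can be verified in one pass. The sketch below assumes it is run from the project root on the GPU machine; `check_files` is a hypothetical helper, not an existing script in the repo.

```shell
#!/usr/bin/env bash
# check_files (hypothetical helper): print size or MISSING for each path,
# and return 1 if any path is not a regular file.
check_files() {
  local missing=0
  local f
  for f in "$@"; do
    if [ -f "$f" ]; then
      # du -h prints "<size><TAB><path>"; keep only the size column.
      printf 'OK       %s (%s)\n' "$f" "$(du -h "$f" | cut -f1)"
    else
      printf 'MISSING  %s\n' "$f"
      missing=1
    fi
  done
  return "$missing"
}

check_files \
  data/processed/r8_5class_train.jsonl \
  data/processed/r8_5class_valid.jsonl \
  data/processed/aptner_5class_test_clean.jsonl \
  data/processed/securebert2_5class_test.jsonl \
  data/processed/enriched_5class_test.jsonl \
  data/processed/cyner_test.jsonl \
  data/label_spaces/cyner_5class.json \
  || echo "At least one file is missing; re-run the sync before training."
```

Compare the printed sizes against the approximate sizes in the checklist. A file that is far smaller than expected may be a truncated transfer.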
|
| ## 4. Scripts on GPU |
|
|
| - [ ] `scripts/run_train_v8.sh` is executable: `chmod +x scripts/run_train_v8.sh` |
| - [ ] `scripts/early_stop_monitor.sh` is executable |
| - [ ] `scripts/viterbi_grid_search.py` exists |
| - [ ] `scripts/checkpoint_avg.py` exists |
|
|
| ## 5. GPU Environment |
|
|
| - [ ] Python 3.12+ available |
| - [ ] PyTorch with CUDA: `python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"` |
| - [ ] OPF installed: `opf --help` |
| - [ ] CUDA memory clear: `nvidia-smi` shows no other processes using the GPU |
|
|
| ## 6. Disk Space |
|
|
| - [ ] At least 15GB free: `df -h /` (R8 checkpoints ~10GB, 15 epochs × ~700MB each) |
| - [ ] Consider pruning old checkpoints: `du -sh ~/alkyline/checkpoints/*/` |
| - [ ] Current state: 29GB used by checkpoints, 30GB free; **may need cleanup** |
|
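The disk budget above can be checked numerically rather than by reading `df -h` output. This is a sketch under the checklist's own figures (15 epochs, roughly 700MB per checkpoint, 15GB threshold); `free_gb` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# free_gb (hypothetical helper): whole gigabytes available on the filesystem holding $1.
free_gb() {
  # -P forces one line per filesystem, -k reports 1024-byte blocks;
  # column 4 of the second line is the available space.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

epochs=15
mb_per_checkpoint=700   # approximate figure from the checklist
required_gb=15          # threshold from the checklist, leaves headroom

echo "Estimated checkpoint footprint: $((epochs * mb_per_checkpoint))MB"

available_gb=$(free_gb /)
if [ "$available_gb" -ge "$required_gb" ]; then
  echo "OK: ${available_gb}GB free (need ${required_gb}GB)"
else
  echo "LOW: ${available_gb}GB free (need ${required_gb}GB); prune old checkpoints first"
fi
```

If checkpoints are written to a different mount than `/`, pass that path to `free_gb` instead.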
|
| ## 7. Quick Smoke Test |
|
|
| - [ ] Run a 1-epoch test to confirm everything works end-to-end: |
| ```bash |
| opf train data/processed/r8_5class_train.jsonl \ |
| --validation-dataset data/processed/r8_5class_valid.jsonl \ |
| --label-space-json data/label_spaces/cyner_5class.json \ |
| --output-dir /tmp/r8_smoke \ |
| --epochs 1 --batch-size 4 --grad-accum-steps 2 \ |
| --learning-rate 5e-5 --lr-schedule cosine \ |
| --loss-fn focal --focal-gamma 2.0 \ |
| --llrd-factor 0.9 --o-downsample 0.7 \ |
| --device cuda |
| ``` |
| - [ ] Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint |
| - [ ] Delete smoke test: `rm -rf /tmp/r8_smoke` |
|
|
| ## 8. Launch |
|
|
| - [ ] Start in tmux/screen: `tmux new -s r8` |
| - [ ] Run: `bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log` |
| - [ ] Verify first epoch completes and val_loss is logged |
| - [ ] Detach: `Ctrl-B d` |
| |
| --- |
| |
| ## Known Risks |
| |
| | Risk | Mitigation | |
| |------|-----------| |
| | Disk fills during training (29GB checkpoints + 15 new epochs) | Delete old R5/R6 checkpoints before starting | |
| | `--o-downsample` not recognized | Sync script reinstalls opf; verify with `opf train --help` | |
| | Training killed by Vast.ai timeout | Use tmux; check `vastai show instances` periodically | 
| | OOM on batch_size=4 | Unlikely on 96GB; reduce to 2 + grad_accum=4 if needed | |
| |