# R8 Pre-Flight Checklist
**Run:** `Round 8: Final deleaked 5-class CyberNER training`
**GPU:** RTX PRO 6000 (96GB VRAM), Vast.ai
**SSH:** `ssh -p 18732 root@199.126.203.145`
---
## 1. Sync (run from local machine)
- [ ] Run dry-run first: `bash scripts/sync_to_gpu.sh --dry-run`
- [ ] Review output: confirm only expected files are listed
- [ ] Run real sync: `bash scripts/sync_to_gpu.sh`
- [ ] Verify post-sync checks all show ✅
## 2. Vendor Code (CRITICAL)
- [ ] `--o-downsample` flag exists in `opf/_train/args.py` on GPU
- [ ] `_apply_o_downsample` function exists in `opf/_train/runner.py` on GPU
- [ ] `opf` reinstalled after sync: `cd ~/alkyline/vendor/privacy-filter && pip install -e .`
- [ ] Verify: `opf train --help | grep o-downsample` shows the flag
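
The two source-level checks above can be scripted. This is a minimal sketch, assuming the `opf` package sits directly under the vendor directory used in the reinstall step; `has_pattern` and `VENDOR_DIR` are hypothetical names introduced only for illustration.

```bash
# Sketch: confirm the vendored opf sources contain the R8 patch.
# Assumption: opf/ lives directly under the vendor directory from the reinstall step.
VENDOR_DIR="${VENDOR_DIR:-$HOME/alkyline/vendor/privacy-filter}"

# has_pattern PATTERN FILE: succeed if FILE exists and contains the fixed string PATTERN.
has_pattern() {
  local pattern="$1" file="$2"
  if [ -f "$file" ] && grep -qF -- "$pattern" "$file"; then
    echo "OK       $pattern in $file"
  else
    echo "MISSING  $pattern in $file"
    return 1
  fi
}

has_pattern "--o-downsample" "$VENDOR_DIR/opf/_train/args.py" || true
has_pattern "_apply_o_downsample" "$VENDOR_DIR/opf/_train/runner.py" || true
```

A `MISSING` line here suggests the sync did not reach the vendor tree. Passing this check does not replace the `opf train --help` check, which confirms the installed package (not just the source files) carries the flag.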
## 3. Data Files on GPU
- [ ] `data/processed/r8_5class_train.jsonl` exists (~18MB, deleaked training set)
- [ ] `data/processed/r8_5class_valid.jsonl` exists (~1.3MB, deleaked validation set)
- [ ] `data/processed/aptner_5class_test_clean.jsonl` exists (~56KB, independent test)
- [ ] `data/processed/securebert2_5class_test.jsonl` exists (~64KB)
- [ ] `data/processed/enriched_5class_test.jsonl` exists (should already be on GPU)
- [ ] `data/processed/cyner_test.jsonl` exists (should already be on GPU)
- [ ] `data/label_spaces/cyner_5class.json` exists
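
Rather than checking each path by hand, the list above can be looped over. A minimal sketch, assuming it is run from the project root on the GPU machine; `check_files` is a hypothetical helper, and the size it prints is only meant for comparison against the approximate sizes listed above.

```bash
# check_files ROOT FILE...: report each FILE under ROOT; return 1 if any is missing or empty.
check_files() {
  local root="$1" missing=0 f
  shift
  for f in "$@"; do
    if [ -s "$root/$f" ]; then
      # du -h prints "<size><TAB><path>"; keep only the size column.
      echo "OK       $f ($(du -h "$root/$f" | cut -f1))"
    else
      echo "MISSING  $f"
      missing=1
    fi
  done
  return "$missing"
}

# Assumption: the current directory is the project root.
check_files . \
  data/processed/r8_5class_train.jsonl \
  data/processed/r8_5class_valid.jsonl \
  data/processed/aptner_5class_test_clean.jsonl \
  data/processed/securebert2_5class_test.jsonl \
  data/processed/enriched_5class_test.jsonl \
  data/processed/cyner_test.jsonl \
  data/label_spaces/cyner_5class.json \
  || echo "Fix the missing files before launching."
```

The `-s` test treats a zero-byte file as missing, which catches a sync that created the file but transferred no content.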
## 4. Scripts on GPU
- [ ] `scripts/run_train_v8.sh` is executable: `chmod +x scripts/run_train_v8.sh`
- [ ] `scripts/early_stop_monitor.sh` is executable
- [ ] `scripts/viterbi_grid_search.py` exists
- [ ] `scripts/checkpoint_avg.py` exists
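
The executable-bit checks can be combined with the fix. A minimal sketch, assuming it is run from the project root; `ensure_executable` is a hypothetical helper that applies the same `chmod +x` shown above only where it is needed.

```bash
# ensure_executable FILE...: add the execute bit where it is missing;
# return 1 if any FILE does not exist.
ensure_executable() {
  local status=0 f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "MISSING  $f"
      status=1
    elif [ -x "$f" ]; then
      echo "OK       $f"
    else
      chmod +x "$f" && echo "FIXED    $f (added execute bit)"
    fi
  done
  return "$status"
}

ensure_executable scripts/run_train_v8.sh scripts/early_stop_monitor.sh || true

# The Python helpers only need to exist.
for f in scripts/viterbi_grid_search.py scripts/checkpoint_avg.py; do
  if [ -f "$f" ]; then echo "OK       $f"; else echo "MISSING  $f"; fi
done
```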
## 5. GPU Environment
- [ ] Python 3.12+ available
- [ ] PyTorch with CUDA: `python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"`
- [ ] OPF installed: `opf --help`
- [ ] CUDA memory clear: `nvidia-smi` shows no other processes using the GPU
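
The version and GPU-occupancy checks can be done without reading output by eye. A minimal sketch, assuming GNU `sort` (for `-V`) is available on the instance; `version_at_least` is a hypothetical helper.

```bash
# version_at_least HAVE WANT: succeed if version HAVE is >= WANT (version-aware sort).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if command -v python3 >/dev/null 2>&1; then
  py_version="$(python3 -c 'import platform; print(platform.python_version())')"
  if version_at_least "$py_version" 3.12; then
    echo "OK       python $py_version"
  else
    echo "TOO OLD  python $py_version (need 3.12+)"
  fi
else
  echo "MISSING  python3"
fi

# List compute processes currently holding GPU memory; empty output means the GPU is free.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
else
  echo "MISSING  nvidia-smi"
fi
```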
## 6. Disk Space
- [ ] At least 15GB free: `df -h /` (R8 checkpoints ~10GB, 15 epochs × ~700MB each)
- [ ] Consider pruning old checkpoints: `du -sh ~/alkyline/checkpoints/*/`
- [ ] Current state: 29GB used by checkpoints, 30GB free → **may need cleanup**
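
The 15GB threshold follows from the checkpoint estimate (15 epochs × ~700MB ≈ 10.5GB) plus headroom. A minimal sketch of turning it into a pass/fail check; `free_gb` is a hypothetical helper, and `/` is used only because the `df -h /` check above uses it.

```bash
# free_gb PATH: print the free space, in whole GB, on the filesystem holding PATH.
free_gb() {
  # -P forces one line per filesystem; column 4 is available space in 1K blocks.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

required_gb=15
available_gb="$(free_gb /)"

if [ "$available_gb" -ge "$required_gb" ]; then
  echo "OK   ${available_gb}GB free (need ${required_gb}GB)"
else
  echo "LOW  ${available_gb}GB free (need ${required_gb}GB); prune old checkpoints first"
fi
```

If the checkpoints directory is on a different mount than `/`, pass that directory to `free_gb` instead.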
## 7. Quick Smoke Test
- [ ] Run a 1-epoch test to confirm everything works end-to-end:
```bash
opf train data/processed/r8_5class_train.jsonl \
--validation-dataset data/processed/r8_5class_valid.jsonl \
--label-space-json data/label_spaces/cyner_5class.json \
--output-dir /tmp/r8_smoke \
--epochs 1 --batch-size 4 --grad-accum-steps 2 \
--learning-rate 5e-5 --lr-schedule cosine \
--loss-fn focal --focal-gamma 2.0 \
--llrd-factor 0.9 --o-downsample 0.7 \
--device cuda
```
- [ ] Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint
- [ ] Delete smoke test: `rm -rf /tmp/r8_smoke`
## 8. Launch
- [ ] Start in tmux/screen: `tmux new -s r8`
- [ ] Run: `bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log`
- [ ] Verify first epoch completes and val_loss is logged
- [ ] Detach: `Ctrl-B d`
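
After detaching, progress can be checked from the log without reattaching. A minimal sketch, assuming the training output contains the literal string `val_loss` once per validation pass (the exact log format is not specified here); `count_val_loss` is a hypothetical helper.

```bash
# count_val_loss LOGFILE: print how many lines of LOGFILE mention val_loss (0 if the file is absent).
count_val_loss() {
  if [ -f "$1" ]; then
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
    grep -c 'val_loss' "$1" || true
  else
    echo 0
  fi
}

echo "val_loss lines so far: $(count_val_loss train_r8_full.log)"
```

To return to the session itself, `tmux attach -t r8` reattaches to the session created above.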
---
## Known Risks
| Risk | Mitigation |
|------|-----------|
| Disk fills during training (29GB checkpoints + 15 new epochs) | Delete old R5/R6 checkpoints before starting |
| `--o-downsample` not recognized | Sync script reinstalls opf; verify with `opf train --help` |
| Training killed by Vast.ai timeout | Use tmux; check `vast show instances` periodically |
| OOM on batch_size=4 | Unlikely on 96GB; reduce to 2 + grad_accum=4 if needed |
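
The OOM fallback keeps the number of samples per optimizer step unchanged, assuming `opf` accumulates gradients in the usual way (per-step batch size times accumulation steps). A small arithmetic sketch; `effective_batch` is a hypothetical helper.

```bash
# effective_batch BATCH_SIZE GRAD_ACCUM_STEPS: samples contributing to one optimizer step.
effective_batch() {
  echo $(( $1 * $2 ))
}

echo "default  (--batch-size 4 --grad-accum-steps 2): $(effective_batch 4 2)"
echo "fallback (--batch-size 2 --grad-accum-steps 4): $(effective_batch 2 4)"
```

Both settings give 8, so under that assumption the fallback trades throughput for memory without changing the effective batch size.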