# R8 Pre-Flight Checklist

**Run:** `Round 8 — Final deleaked 5-class CyberNER training`
**GPU:** RTX PRO 6000 (96GB VRAM), Vast.ai
**SSH:** `ssh -p 18732 root@199.126.203.145`

---

## 1. Sync (run from local machine)

- [ ] Run dry-run first: `bash scripts/sync_to_gpu.sh --dry-run`
- [ ] Review output — confirm only expected files listed
- [ ] Run real sync: `bash scripts/sync_to_gpu.sh`
- [ ] Verify post-sync checks all show ✅

## 2. Vendor Code (CRITICAL)

- [ ] `--o-downsample` flag exists in `opf/_train/args.py` on GPU
- [ ] `_apply_o_downsample` function exists in `opf/_train/runner.py` on GPU
- [ ] `opf` reinstalled after sync: `cd ~/alkyline/vendor/privacy-filter && pip install -e .`
- [ ] Verify: `opf train --help | grep o-downsample` shows the flag
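
The first two checks above can be scripted with `grep`. This is a minimal sketch, not part of the repo: `check_patch` is a hypothetical helper, and it assumes `opf/_train/` sits directly under the vendor directory used in the reinstall step and that both names appear literally in the source files.

```bash
# Sketch: confirm the vendored patch is present before reinstalling opf.
# VENDOR_DIR comes from the reinstall step; the opf/_train/ location under it is assumed.
VENDOR_DIR="${VENDOR_DIR:-$HOME/alkyline/vendor/privacy-filter}"

check_patch() {
  # usage: check_patch <file> <fixed-string>
  if grep -qF -- "$2" "$1" 2>/dev/null; then
    echo "OK       $2 in $1"
  else
    echo "MISSING  $2 in $1"
    return 1
  fi
}

# "|| true" keeps the script going so both results are always printed.
check_patch "$VENDOR_DIR/opf/_train/args.py" "--o-downsample" || true
check_patch "$VENDOR_DIR/opf/_train/runner.py" "_apply_o_downsample" || true
```

A `MISSING` line means the sync did not carry the patched vendor code, so reinstalling `opf` would not add the flag.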

## 3. Data Files on GPU

- [ ] `data/processed/r8_5class_train.jsonl` exists (~18MB, deleaked training set)
- [ ] `data/processed/r8_5class_valid.jsonl` exists (~1.3MB, deleaked validation set)
- [ ] `data/processed/aptner_5class_test_clean.jsonl` exists (~56KB, independent test)
- [ ] `data/processed/securebert2_5class_test.jsonl` exists (~64KB)
- [ ] `data/processed/enriched_5class_test.jsonl` exists (should already be on GPU)
- [ ] `data/processed/cyner_test.jsonl` exists (should already be on GPU)
- [ ] `data/label_spaces/cyner_5class.json` exists
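
The file checks above can be run in one pass. The sketch below uses a hypothetical helper, `check_files`, and assumes it is run from the repository root on the GPU host (taken here to be `~/alkyline`, which this checklist does not state outright). It prints each file's size so the rough sizes listed above can be compared by eye.

```bash
# Sketch: list each required file with its size, or flag it as missing/empty.
check_files() {
  local missing=0 f
  for f in "$@"; do
    if [ -s "$f" ]; then
      # -s is true only for files that exist and are non-empty
      printf 'OK       %8s  %s\n' "$(du -h "$f" | cut -f1)" "$f"
    else
      printf 'MISSING  %8s  %s\n' "-" "$f"
      missing=$((missing + 1))
    fi
  done
  return "$missing"   # 0 when every file is present
}

check_files \
  data/processed/r8_5class_train.jsonl \
  data/processed/r8_5class_valid.jsonl \
  data/processed/aptner_5class_test_clean.jsonl \
  data/processed/securebert2_5class_test.jsonl \
  data/processed/enriched_5class_test.jsonl \
  data/processed/cyner_test.jsonl \
  data/label_spaces/cyner_5class.json \
  || echo "One or more data files are missing or empty"
```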

## 4. Scripts on GPU

- [ ] `scripts/run_train_v8.sh` is executable (if not: `chmod +x scripts/run_train_v8.sh`)
- [ ] `scripts/early_stop_monitor.sh` is executable (if not: `chmod +x scripts/early_stop_monitor.sh`)
- [ ] `scripts/viterbi_grid_search.py` exists
- [ ] `scripts/checkpoint_avg.py` exists

## 5. GPU Environment

- [ ] Python 3.12+ available: `python3 --version`
- [ ] PyTorch with CUDA: `python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"`
- [ ] OPF installed: `opf --help`
- [ ] GPU memory clear: `nvidia-smi` shows no other processes using the GPU
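
The Python version and GPU-idle checks can also be made non-interactive. This is a sketch under assumptions: `gpu_is_idle` is a hypothetical helper that treats any line of `nvidia-smi --query-compute-apps` output as a running compute process, and the block skips the GPU query when `nvidia-smi` is not on the path.

```bash
# Sketch: non-interactive versions of the Python and GPU-idle checks.

# Exit status 0 only when python3 exists and is at least 3.12.
python3 -c 'import sys; raise SystemExit(0 if sys.version_info >= (3, 12) else 1)' \
  && echo "Python OK" \
  || echo "Python missing or older than 3.12"

# Returns 0 when stdin (one compute process per line) contains no non-empty lines.
gpu_is_idle() {
  local n
  n=$(grep -c . || true)
  [ "$n" -eq 0 ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  if nvidia-smi --query-compute-apps=pid,process_name,used_memory \
       --format=csv,noheader | gpu_is_idle; then
    echo "GPU idle"
  else
    echo "GPU busy: inspect nvidia-smi before launching"
  fi
fi
```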

## 6. Disk Space

- [ ] At least 15GB free: `df -h /` (R8 checkpoints total ~10.5GB: 15 epochs × ~700MB each)
- [ ] Consider pruning old checkpoints: `du -sh ~/alkyline/checkpoints/*/`
- [ ] Current state: 29GB used by checkpoints, 30GB free → **may need cleanup**
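
The arithmetic behind the 15GB floor, and the free-space comparison, can be sketched as follows. The constants come from this checklist. `enough_disk` is a hypothetical helper, and the check assumes checkpoints are written to the filesystem mounted at `/`.

```bash
# Sketch: compare the checkpoint estimate and free space against the 15GB floor.
EPOCHS=15          # from the checklist
MB_PER_CKPT=700    # approximate size of one checkpoint, from the checklist
REQUIRED_GB=15     # minimum free space, from the checklist

need_mb=$((EPOCHS * MB_PER_CKPT))   # 15 * 700 = 10500 MB

enough_disk() {
  # usage: enough_disk <free_gb> <required_gb>
  [ "$1" -ge "$2" ]
}

# POSIX df: column 4 of the second line is available space in 1K blocks.
free_gb=$(df -Pk / | awk 'NR==2 {print int($4 / 1024 / 1024)}')

echo "Estimated new checkpoints: ${need_mb} MB"
if enough_disk "$free_gb" "$REQUIRED_GB"; then
  echo "OK: ${free_gb} GB free on /"
else
  echo "LOW: ${free_gb} GB free on /, prune old checkpoints first"
fi
```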

## 7. Quick Smoke Test

- [ ] Run a 1-epoch test to confirm everything works end-to-end:
  ```bash
  opf train data/processed/r8_5class_train.jsonl \
    --validation-dataset data/processed/r8_5class_valid.jsonl \
    --label-space-json data/label_spaces/cyner_5class.json \
    --output-dir /tmp/r8_smoke \
    --epochs 1 --batch-size 4 --grad-accum-steps 2 \
    --learning-rate 5e-5 --lr-schedule cosine \
    --loss-fn focal --focal-gamma 2.0 \
    --llrd-factor 0.9 --o-downsample 0.7 \
    --device cuda
  ```
- [ ] Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint
- [ ] Delete smoke test: `rm -rf /tmp/r8_smoke`

## 8. Launch

- [ ] Start in tmux/screen: `tmux new -s r8`
- [ ] Run: `bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log`
- [ ] Verify first epoch completes and val_loss is logged
- [ ] Detach: `Ctrl-B d`
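
After detaching, progress can be checked without reattaching. The sketch below assumes the training log contains the literal string `val_loss` on the relevant lines. The exact log format is not specified here, and `last_val_loss` is a hypothetical helper.

```bash
# Reattach later with:  tmux attach -t r8
# Follow the log live:  tail -f train_r8_full.log

# Print the most recent log line mentioning val_loss, if any.
last_val_loss() {
  # usage: last_val_loss <logfile>
  grep -F 'val_loss' "$1" 2>/dev/null | tail -n 1
}

last_val_loss train_r8_full.log
```

Empty output means either the first epoch has not finished or the log uses a different key for validation loss.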

---

## Known Risks

| Risk | Mitigation |
|------|-----------|
| Disk fills during training (29GB of existing checkpoints plus ~10.5GB for 15 new epochs) | Delete old R5/R6 checkpoints before starting |
| `--o-downsample` not recognized | Sync script reinstalls opf; verify with `opf train --help` |
| Training killed by Vast.ai timeout | Use tmux; check `vastai show instances` periodically |
| OOM at `--batch-size 4` | Unlikely on 96GB; if it happens, use `--batch-size 2 --grad-accum-steps 4`, which keeps the effective batch size at 8 |