	# R8 Pre-Flight Checklist

	Run: `Round 8 — Final deleaked 5-class CyberNER training`
	GPU: RTX PRO 6000 (96GB VRAM), Vast.ai
	SSH: `ssh -p 18732 root@199.126.203.145`

	---

	## 1. Sync (run from local machine)

	- [ ] Run dry-run first: `bash scripts/sync_to_gpu.sh --dry-run`
	- [ ] Review output — confirm only expected files listed
	- [ ] Run real sync: `bash scripts/sync_to_gpu.sh`
	- [ ] Verify post-sync checks all show ✅

	## 2. Vendor Code (CRITICAL)

	- [ ] `--o-downsample` flag exists in `opf/_train/args.py` on GPU
	- [ ] `_apply_o_downsample` function exists in `opf/_train/runner.py` on GPU
	- [ ] `opf` reinstalled after sync: `cd ~/alkyline/vendor/privacy-filter && pip install -e .`
	- [ ] Verify: `opf train --help | grep o-downsample` shows the flag
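The two source-level checks above can be scripted with `grep`. This is a minimal sketch, assuming the `opf/_train/...` paths are relative to the vendored checkout at `~/alkyline/vendor/privacy-filter` (the checklist does not state the base directory). `check_contains`, `verify_vendor_code`, and `VENDOR_DIR` are hypothetical names introduced here, not part of `opf`.

```shell
# check_contains FILE PATTERN: succeed if FILE contains the fixed string PATTERN.
# Hypothetical helper for this checklist, not part of opf.
check_contains() {
  local file="$1" pattern="$2"
  # -F: literal match; "--" stops option parsing so "--o-downsample" is a pattern.
  if grep -qF -- "$pattern" "$file" 2>/dev/null; then
    echo "OK       $pattern in $file"
  else
    echo "MISSING  $pattern in $file"
    return 1
  fi
}

# Assumed location of the vendored code; override VENDOR_DIR if yours differs.
VENDOR_DIR="${VENDOR_DIR:-$HOME/alkyline/vendor/privacy-filter}"

# verify_vendor_code: run both checks and fail if either string is absent.
verify_vendor_code() {
  local rc=0
  check_contains "$VENDOR_DIR/opf/_train/args.py" "--o-downsample" || rc=1
  check_contains "$VENDOR_DIR/opf/_train/runner.py" "_apply_o_downsample" || rc=1
  return "$rc"
}
```

Running `verify_vendor_code` on the GPU host prints one line per check and returns non-zero if either string is missing. It only inspects the source files, so the reinstall and `opf train --help` steps are still needed.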

	## 3. Data Files on GPU

	- [ ] `data/processed/r8_5class_train.jsonl` exists (~18MB, deleaked training set)
	- [ ] `data/processed/r8_5class_valid.jsonl` exists (~1.3MB, deleaked validation set)
	- [ ] `data/processed/aptner_5class_test_clean.jsonl` exists (~56KB, independent test)
	- [ ] `data/processed/securebert2_5class_test.jsonl` exists (~64KB)
	- [ ] `data/processed/enriched_5class_test.jsonl` exists (should already be on GPU)
	- [ ] `data/processed/cyner_test.jsonl` exists (should already be on GPU)
	- [ ] `data/label_spaces/cyner_5class.json` exists
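The file checks above can be run in one pass. The sketch below assumes it is called with the project root on the GPU host as its argument. `check_data_files` and `R8_DATA_FILES` are hypothetical names; the paths are copied from the list above.

```shell
# Files from the checklist, relative to the project root.
R8_DATA_FILES="
data/processed/r8_5class_train.jsonl
data/processed/r8_5class_valid.jsonl
data/processed/aptner_5class_test_clean.jsonl
data/processed/securebert2_5class_test.jsonl
data/processed/enriched_5class_test.jsonl
data/processed/cyner_test.jsonl
data/label_spaces/cyner_5class.json
"

# check_data_files ROOT: print each file's size, or flag it as missing or empty.
check_data_files() {
  local root="$1" rc=0 f
  for f in $R8_DATA_FILES; do
    if [ -s "$root/$f" ]; then
      # du -h prints "<size><TAB><path>"; keep only the size column.
      echo "OK       $(du -h "$root/$f" | cut -f1)  $f"
    else
      echo "MISSING  $f"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `check_data_files .` from the project root. The printed sizes can be compared by eye against the approximate sizes listed above; the function itself only checks that each file exists and is non-empty.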

	## 4. Scripts on GPU

	- [ ] `scripts/run_train_v8.sh` is executable: `chmod +x scripts/run_train_v8.sh`
	- [ ] `scripts/early_stop_monitor.sh` is executable: `chmod +x scripts/early_stop_monitor.sh`
	- [ ] `scripts/viterbi_grid_search.py` exists
	- [ ] `scripts/checkpoint_avg.py` exists
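A short sketch of the same checks as a function, assuming it is given the project root. `check_scripts` is a hypothetical helper: it requires the execute bit on the two shell scripts and only the presence of the two Python scripts, matching the list above.

```shell
# check_scripts ROOT: shell scripts must be executable, Python helpers must exist.
check_scripts() {
  local root="$1" rc=0 f
  for f in scripts/run_train_v8.sh scripts/early_stop_monitor.sh; do
    [ -x "$root/$f" ] || { echo "NOT EXECUTABLE  $f (fix: chmod +x $f)"; rc=1; }
  done
  for f in scripts/viterbi_grid_search.py scripts/checkpoint_avg.py; do
    [ -f "$root/$f" ] || { echo "MISSING  $f"; rc=1; }
  done
  return "$rc"
}
```

`check_scripts .` prints nothing and returns 0 when all four checks pass.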

	## 5. GPU Environment

	- [ ] Python 3.12+ available: `python3 --version`
	- [ ] PyTorch with CUDA: `python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"`
	- [ ] OPF installed: `opf --help`
	- [ ] CUDA memory clear: `nvidia-smi` — no other processes using GPU
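The version and "no other processes" checks can be made pass/fail instead of read by eye. This is a sketch with two hypothetical helpers: `python_at_least` asks the interpreter for its own version, and `gpu_idle` reads the process list that `nvidia-smi --query-compute-apps` prints, treating any non-blank line as a running compute process.

```shell
# python_at_least MAJOR MINOR: succeed if python3 is at least that version.
python_at_least() {
  python3 -c "import sys; sys.exit(0 if sys.version_info >= ($1, $2) else 1)"
}

# gpu_idle: read the output of
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
# on stdin and succeed only if it lists no compute process.
gpu_idle() {
  local procs
  # Drop blank lines; grep exits non-zero when nothing is left, which is fine here.
  procs=$(grep -v '^[[:space:]]*$' || true)
  if [ -n "$procs" ]; then
    echo "GPU busy:"
    echo "$procs"
    return 1
  fi
  echo "GPU idle"
}
```

On the GPU host this would be used as `python_at_least 3 12` and `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader | gpu_idle`.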

	## 6. Disk Space

	- [ ] At least 15GB free: `df -h /` (R8 checkpoints need ~10.5GB: 15 epochs × ~700MB each)
	- [ ] Consider pruning old checkpoints: `du -sh ~/alkyline/checkpoints/*/`
	- [ ] Current state: 29GB used by checkpoints, 30GB free → meets the 15GB minimum, but may still need cleanup
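The estimate and the free-space threshold above can be checked with a few lines of shell. This is a sketch: `checkpoint_mb`, `free_gb`, and `enough_disk` are hypothetical helpers, and the 15-epoch and 700MB figures come from the checklist, not from a measurement.

```shell
# checkpoint_mb EPOCHS MB_PER_CHECKPOINT: estimated checkpoint footprint in MB.
checkpoint_mb() {
  echo $(( $1 * $2 ))
}

# free_gb PATH: free space on the filesystem holding PATH, in whole GB.
free_gb() {
  # df -P gives a stable one-line-per-filesystem layout; column 4 is available KB.
  df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# enough_disk PATH MIN_GB: succeed if at least MIN_GB is free under PATH.
enough_disk() {
  local free
  free=$(free_gb "$1")
  if [ "$free" -ge "$2" ]; then
    echo "OK: ${free}GB free (need ${2}GB)"
  else
    echo "LOW: ${free}GB free (need ${2}GB)"
    return 1
  fi
}
```

`checkpoint_mb 15 700` prints `10500`, i.e. about 10.5GB, and `enough_disk / 15` applies the 15GB threshold to the root filesystem.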

	## 7. Quick Smoke Test

	- [ ] Run a 1-epoch test to confirm everything works end-to-end:
	```bash
	opf train data/processed/r8_5class_train.jsonl \
	--validation-dataset data/processed/r8_5class_valid.jsonl \
	--label-space-json data/label_spaces/cyner_5class.json \
	--output-dir /tmp/r8_smoke \
	--epochs 1 --batch-size 4 --grad-accum-steps 2 \
	--learning-rate 5e-5 --lr-schedule cosine \
	--loss-fn focal --focal-gamma 2.0 \
	--llrd-factor 0.9 --o-downsample 0.7 \
	--device cuda
	```
	- [ ] Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint
	- [ ] Delete the smoke test output: `rm -rf /tmp/r8_smoke`

	## 8. Launch

	- [ ] Start in tmux/screen: `tmux new -s r8`
	- [ ] Run: `bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log`
	- [ ] Verify first epoch completes and val_loss is logged
	- [ ] Detach: `Ctrl-B d`
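After detaching, the "val_loss is logged" check can be done against the log file without reattaching to tmux. A minimal sketch, assuming output is captured in `train_r8_full.log` as in the launch command and that each validation line contains the literal string `val_loss` (the exact log format is not shown in this checklist). `count_val_loss` and `first_epoch_logged` are hypothetical helpers.

```shell
# count_val_loss LOG: print how many lines of LOG mention val_loss (0 if LOG is missing).
count_val_loss() {
  if [ -f "$1" ]; then
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -c 'val_loss' "$1" || true
  else
    echo 0
  fi
}

# first_epoch_logged LOG: succeed once at least one val_loss line has been written.
first_epoch_logged() {
  [ "$(count_val_loss "$1")" -ge 1 ]
}
```

For example, `first_epoch_logged train_r8_full.log && echo "epoch 1 validated"`. If each epoch logs `val_loss` exactly once, `count_val_loss` also gives a rough count of completed epochs, but that is an assumption about the log format.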

	---

	## Known Risks

	| Risk | Mitigation |
	|------|-----------|
	| Disk fills during training (29GB checkpoints + 15 new epochs) | Delete old R5/R6 checkpoints before starting |
	| `--o-downsample` not recognized | Sync script reinstalls opf; verify with `opf train --help` |
	| Training killed by Vast.ai timeout | Use tmux; check `vast show instances` periodically |
	| OOM on batch_size=4 | Unlikely on 96GB; reduce to 2 + grad_accum=4 if needed |