R8 Pre-Flight Checklist

Run: Round 8 (final deleaked 5-class CyberNER training)
GPU: RTX PRO 6000 (96GB VRAM), Vast.ai
SSH: ssh -p 18732 root@199.126.203.145


1. Sync (run from local machine)

  • Run dry-run first: bash scripts/sync_to_gpu.sh --dry-run
  • Review output — confirm only expected files listed
  • Run real sync: bash scripts/sync_to_gpu.sh
  • Verify post-sync checks all show ✅

2. Vendor Code (CRITICAL)

  • --o-downsample flag exists in opf/_train/args.py on GPU
  • _apply_o_downsample function exists in opf/_train/runner.py on GPU
  • opf reinstalled after sync: cd ~/alkyline/vendor/privacy-filter && pip install -e .
  • Verify: opf train --help | grep o-downsample shows the flag
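
The two source-level checks in this section can be scripted. The following is a minimal sketch, not part of the repository: the function name `check_vendor_patch` is hypothetical, and it assumes that `opf/_train/` sits directly under the `~/alkyline/vendor/privacy-filter` directory used in the reinstall command.

```shell
# Sketch: confirm both vendor patches are present in the synced source tree.
# Assumption: opf/_train/ lives under the vendor/privacy-filter checkout.
check_vendor_patch() {
    local root="${1:-$HOME/alkyline/vendor/privacy-filter}" status=0
    # -F matches a fixed string; "--" stops option parsing so the
    # pattern itself may begin with dashes.
    if ! grep -qF -- '--o-downsample' "$root/opf/_train/args.py" 2>/dev/null; then
        echo "MISSING: --o-downsample in opf/_train/args.py"
        status=1
    fi
    if ! grep -qF -- '_apply_o_downsample' "$root/opf/_train/runner.py" 2>/dev/null; then
        echo "MISSING: _apply_o_downsample in opf/_train/runner.py"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "vendor patch present"
    return "$status"
}
```

Run `check_vendor_patch` on the GPU host after the sync. A source-level match does not prove the installed package is current, so the `opf train --help | grep o-downsample` check above is still needed.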

3. Data Files on GPU

  • data/processed/r8_5class_train.jsonl exists (~18MB, deleaked training set)
  • data/processed/r8_5class_valid.jsonl exists (~1.3MB, deleaked validation set)
  • data/processed/aptner_5class_test_clean.jsonl exists (~56KB, independent test)
  • data/processed/securebert2_5class_test.jsonl exists (~64KB)
  • data/processed/enriched_5class_test.jsonl exists (should already be on GPU)
  • data/processed/cyner_test.jsonl exists (should already be on GPU)
  • data/label_spaces/cyner_5class.json exists
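
Checking seven paths by hand is easy to get wrong, so a small helper may be useful. This is a sketch only: `check_data_files` is a hypothetical name, and the project root is passed as an argument because this checklist does not state it.

```shell
# Sketch: report which expected R8 data files are present under a project root.
# Prints one line per file; returns non-zero if any file is missing or empty.
check_data_files() {
    local root="$1" missing=0 f
    for f in \
        data/processed/r8_5class_train.jsonl \
        data/processed/r8_5class_valid.jsonl \
        data/processed/aptner_5class_test_clean.jsonl \
        data/processed/securebert2_5class_test.jsonl \
        data/processed/enriched_5class_test.jsonl \
        data/processed/cyner_test.jsonl \
        data/label_spaces/cyner_5class.json
    do
        if [ -s "$root/$f" ]; then
            # du -h prints "<size><TAB><path>"; keep only the size column.
            echo "ok      $f ($(du -h "$root/$f" | cut -f1))"
        else
            echo "MISSING $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Compare the printed sizes against the approximate figures listed above. A train file far from ~18MB suggests the wrong (non-deleaked) version was synced.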

4. Scripts on GPU

  • scripts/run_train_v8.sh is executable: chmod +x scripts/run_train_v8.sh
  • scripts/early_stop_monitor.sh is executable
  • scripts/viterbi_grid_search.py exists
  • scripts/checkpoint_avg.py exists
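
The same pattern covers the script checks. This sketch (with the hypothetical name `check_scripts`) tests the execute bit for the two shell scripts and plain existence for the two Python scripts, mirroring the list above.

```shell
# Sketch: verify launch scripts are executable and helper scripts exist.
# The project root is an argument; the checklist does not state it.
check_scripts() {
    local root="$1" bad=0 f
    for f in scripts/run_train_v8.sh scripts/early_stop_monitor.sh; do
        [ -x "$root/$f" ] || { echo "NOT EXECUTABLE $f"; bad=1; }
    done
    for f in scripts/viterbi_grid_search.py scripts/checkpoint_avg.py; do
        [ -f "$root/$f" ] || { echo "MISSING $f"; bad=1; }
    done
    [ "$bad" -eq 0 ]
}
```

If either shell script is reported, `chmod +x` on that path fixes it.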

5. GPU Environment

  • Python 3.12+ available
  • PyTorch with CUDA: python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
  • OPF installed: opf --help
  • CUDA memory clear: nvidia-smi — no other processes using GPU
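
Two of these checks are easy to misread by eye: a plain string comparison ranks "3.9" above "3.12", and a busy `nvidia-smi` table is hard to scan. The sketch below uses hypothetical helper names. It assumes GNU `sort` for the `-V` option, and uses the `--query-compute-apps` mode of `nvidia-smi`, which prints nothing when no compute process holds the GPU.

```shell
# True when version $1 >= version $2 under version ordering.
# Assumption: sort supports -V (GNU coreutils).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report compute processes currently on the GPU, if any.
# Returns 0 when idle, 1 when in use, 2 when nvidia-smi is unavailable.
gpu_is_idle() {
    command -v nvidia-smi >/dev/null 2>&1 || { echo "nvidia-smi not found"; return 2; }
    local procs
    procs=$(nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader)
    if [ -n "$procs" ]; then
        echo "GPU in use:"
        echo "$procs"
        return 1
    fi
    echo "GPU idle"
}

# Usage:
#   version_at_least "$(python3 -c 'import platform; print(platform.python_version())')" 3.12 \
#       && echo "python ok"
#   gpu_is_idle
```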

6. Disk Space

  • At least 15GB free: df -h / (R8 checkpoints ~10GB, 15 epochs × ~700MB each)
  • Consider pruning old checkpoints: du -sh ~/alkyline/checkpoints/*/
  • Current state: 29GB used by checkpoints, 30GB free. This clears the 15GB minimum, but cleanup may still be needed for headroom
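
The ~10GB estimate is 15 epochs × ~700MB = 10,500MB. A sketch of that arithmetic, plus a free-space check built on `df -Pk` (POSIX output format, sizes in 1KB blocks), follows. Both function names are hypothetical.

```shell
# Estimated checkpoint footprint in MB: epochs x per-checkpoint size.
estimate_ckpt_mb() {
    echo $(( $1 * $2 ))
}

# True when the filesystem holding path $1 has at least $2 GiB available.
has_free_gib() {
    local avail_kb
    # With -P, line 2 is the filesystem row and column 4 is "Available" (KB).
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Usage:
#   estimate_ckpt_mb 15 700        # figures from the checklist
#   has_free_gib / 15 || echo "free space below 15GB: prune old checkpoints"
```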

7. Quick Smoke Test

  • Run a 1-epoch test to confirm everything works end-to-end:
    opf train data/processed/r8_5class_train.jsonl \
      --validation-dataset data/processed/r8_5class_valid.jsonl \
      --label-space-json data/label_spaces/cyner_5class.json \
      --output-dir /tmp/r8_smoke \
      --epochs 1 --batch-size 4 --grad-accum-steps 2 \
      --learning-rate 5e-5 --lr-schedule cosine \
      --loss-fn focal --focal-gamma 2.0 \
      --llrd-factor 0.9 --o-downsample 0.7 \
      --device cuda
    
  • Confirm it starts training, logs val_loss after epoch 1, and saves a checkpoint
  • Delete smoke test: rm -rf /tmp/r8_smoke

8. Launch

  • Start in tmux/screen: tmux new -s r8
  • Run: bash scripts/run_train_v8.sh 2>&1 | tee -a train_r8_full.log
  • Verify first epoch completes and val_loss is logged
  • Detach: Ctrl-B d
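
To confirm from outside the tmux session that validation has run, the log written by `tee` can be searched. This sketch assumes that the training output contains the literal string `val_loss` on each validation line; the actual log format is not shown in this checklist, and `val_loss_count` is a hypothetical helper.

```shell
# Count log lines that mention val_loss; prints 0 if the log does not exist yet.
# Assumption: validation lines contain the literal text "val_loss".
val_loss_count() {
    if [ -f "$1" ]; then
        grep -c 'val_loss' "$1"
    else
        echo 0
    fi
}

# Usage:
#   val_loss_count train_r8_full.log
```

A count of at least 1 suggests the first epoch has finished validating. `tmux attach -t r8` reattaches to the session for a closer look.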

Known Risks

| Risk | Mitigation |
|------|------------|
| Disk fills during training (29GB checkpoints + 15 new epochs) | Delete old R5/R6 checkpoints before starting |
| --o-downsample not recognized | Sync script reinstalls opf; verify with opf train --help |
| Training killed by Vast.ai timeout | Use tmux; check vastai show instances periodically |
| OOM on batch_size=4 | Unlikely on 96GB; reduce to 2 + grad_accum=4 if needed |
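
The OOM fallback halves the per-step batch and doubles gradient accumulation. This keeps the effective batch size unchanged, provided the full run uses the same batch size 4 and 2 accumulation steps as the smoke test. That is an assumption, since `scripts/run_train_v8.sh` is not shown here. A one-line sketch of the arithmetic, using a hypothetical helper name:

```shell
# Effective batch size = per-step batch size x gradient accumulation steps.
effective_batch() {
    echo $(( $1 * $2 ))
}

# Usage:
#   effective_batch 4 2   # smoke-test settings
#   effective_batch 2 4   # OOM fallback
```

Both calls print 8, so the fallback should not require retuning the learning rate on batch-size grounds.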