Humanlearning committed
Commit d8f4c34 · verified · 1 Parent(s): 9b94607

Add training script notes to blog

Files changed (1)
  1. blog/blog.md +11 -0
blog/blog.md CHANGED
@@ -26,6 +26,17 @@ The first plain-RL rubric looked promising on paper, but the agent learned a che
 
  The useful jump came from changing the recipe: first teach the model successful trajectories with SFT, then run GRPO on that LoRA. With the updated rubric, the SFT-warm-started agent spends less time gaming the interface and more time doing the real job: find the auth bug, patch it, and keep valid behavior alive.
 
+ The plot is backed by repeatable scripts, not a one-off notebook. `scripts/modal_train_sft.py` trains the warm-start LoRA on verified trajectories, `scripts/modal_train_grpo.py` continues from that adapter with live OpenEnv rewards, and `scripts/launch_reward_ablations.ps1` launches comparable rubric trials using the YAML configs in `training/configs/reward_ablations/`.
+
+ The core handoff is intentionally simple:
+
+ ```bash
+ uv run --extra modal modal run --detach scripts/modal_train_sft.py --push-to-hub --detach
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+ --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
+ --max-steps 300 --difficulty 0 --trace-log-every 10 --detach
+ ```
+
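A minimal dry-run sketch of the two-stage handoff described in the added lines (not part of this commit). The helper names `sft_cmd` and `grpo_cmd` and the `ADAPTER_REPO` variable are hypothetical; the flags are copied from the commands in the diff, and the sketch only prints the commands rather than launching anything.

```bash
#!/usr/bin/env bash
# Dry-run sketch: print the SFT and GRPO launch commands in handoff order.
# Helper names and ADAPTER_REPO are illustrative assumptions, not repo code.
set -euo pipefail

ADAPTER_REPO="Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora"

# Stage 1: train the warm-start LoRA and push it to the Hub.
sft_cmd() {
  echo "uv run --extra modal modal run --detach scripts/modal_train_sft.py --push-to-hub --detach"
}

# Stage 2: continue from the pushed adapter with GRPO.
# Optional first argument overrides the step count (defaults to 300).
grpo_cmd() {
  echo "uv run --extra modal modal run --detach scripts/modal_train_grpo.py" \
       "--initial-adapter-repo-id ${ADAPTER_REPO}" \
       "--max-steps ${1:-300} --difficulty 0 --trace-log-every 10 --detach"
}

sft_cmd
grpo_cmd 300
```

Printing instead of executing keeps the sketch safe to run anywhere. Whether stage 2 has to wait for the stage 1 adapter to appear on the Hub depends on how the scripts handle `--detach`, which the diff does not show.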
  ## Why OWASP A01?
 
  The first target is **OWASP A01:2025 - Broken Access Control**.