Spaces: Humanlearning/CyberSecurity_OWASP
Humanlearning committed on Apr 26

Commit d8f4c34 (verified)

1 Parent(s): 9b94607

Add training script notes to blog

Files changed (1)

blog/blog.md CHANGED

@@ -26,6 +26,17 @@ The first plain-RL rubric looked promising on paper, but the agent learned a che
 The useful jump came from changing the recipe: first teach the model successful trajectories with SFT, then run GRPO on that LoRA. With the updated rubric, the SFT-warm-started agent spends less time gaming the interface and more time doing the real job: find the auth bug, patch it, and keep valid behavior alive.
+The plot is backed by repeatable scripts, not a one-off notebook. `scripts/modal_train_sft.py` trains the warm-start LoRA on verified trajectories, `scripts/modal_train_grpo.py` continues from that adapter with live OpenEnv rewards, and `scripts/launch_reward_ablations.ps1` launches comparable rubric trials using the YAML configs in `training/configs/reward_ablations/`.
+The core handoff is intentionally simple:
+```bash
+uv run --extra modal modal run --detach scripts/modal_train_sft.py --push-to-hub --detach
+uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
+  --max-steps 300 --difficulty 0 --trace-log-every 10 --detach
+```
 ## Why OWASP A01?
 The first target is **OWASP A01:2025 - Broken Access Control**.