Humanlearning committed
Commit 543a845 · verified · 1 Parent(s): d8f4c34

Add SFT and GRPO run commands to README

Files changed (1)
  1. README.md +524 -452
README.md CHANGED
@@ -1,321 +1,225 @@
1
- ---
2
- title: CyberSecurity_OWASP Environment Server
3
- emoji: 🛡️
4
- colorFrom: blue
5
- colorTo: gray
6
- sdk: docker
7
- pinned: false
8
- app_port: 8000
9
- base_path: /web
10
- tags:
11
- - openenv
12
- - cybersecurity
13
- - owasp
14
- ---
15
-
16
- # CyberSecurity_OWASP
17
-
18
- [Hugging Face Space](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP) | [Mini-blog](blog/blog.md)
19
-
20
- `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
21
-
22
- ```text
23
- inspect generated app + policy -> discover authorization bug -> submit diagnosis -> patch code -> preserve intended behavior
24
- ```
25
-
26
- The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, config-driven curriculum settings, cache-backed scenario reset, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and decomposed reward.
27
-
28
- ## Diagrams
29
-
30
- ![CyberSecurity_OWASP architecture](assets/architecture_diagram.svg)
31
-
32
- ![CyberSecurity_OWASP RL training flow](assets/env_rl_training_flow_diagram.svg)
33
-
34
- Editable Mermaid sources are available in `assets/architecture_diagram.mmd` and `assets/env_rl_training_flow_diagram.mmd`.
35
-
36
- ## Quick Start
37
-
38
- ```bash
39
- uv sync --extra dev
40
- uv run --extra dev pytest
41
- uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
42
- uv run server --port 8000
43
- ```
44
-
45
- Then connect with the OpenEnv client:
46
-
47
- ```python
48
- from CyberSecurity_OWASP import CyberSecurityOWASPAction, CyberSecurityOWASPEnv
49
-
50
- with CyberSecurityOWASPEnv(base_url="http://localhost:8000") as env:
51
- result = env.reset(seed=7)
52
- print(result.observation.task_brief)
53
- result = env.step(CyberSecurityOWASPAction(tool_name="list_routes"))
54
- print(result.observation.last_tool_result)
55
- ```
56
-
57
- ## Action Space
58
-
59
- The agent emits one JSON action at a time:
60
-
61
- ```json
62
- {"tool_name":"read_file","arguments":{"path":"app/routes/invoices.py"}}
63
- ```
64
-
65
- Supported tools:
66
-
67
- - `inspect_policy_graph`
68
- - `list_routes`
69
- - `read_openapi`
70
- - `read_file`
71
- - `search_code`
72
- - `send_local_request`
73
- - `compare_identities`
74
- - `submit_diagnosis`
75
- - `patch_file`
76
- - `run_visible_tests`
77
- - `submit_fix`
78
- - `noop`
79
-
80
- Tools are phase-gated:
81
-
82
- - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit diagnosis.
83
- - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
84
- - `done`: stable terminal observation only.
85
-
86
- ## Reward
87
-
88
- Terminal reward uses stable components:
89
-
90
- ```python
91
- {
92
- "discovery": 0.0,
93
- "security": 0.0,
94
- "regression": 0.0,
95
- "public_routes": 0.0,
96
- "patch_quality": 0.0,
97
- "visible_tests": 0.0,
98
- "safety": 0.0,
99
- "anti_cheat": 0.0,
100
- "terminal_total": 0.0,
101
- "progressive": 0.0,
102
- "step_penalty": 0.0,
103
- "speed_bonus": 0.0,
104
- "token_penalty": 0.0,
105
- "behavior_penalty": 0.0,
106
- "train_total": 0.0,
107
- "total": 0.0,
108
- }
109
- ```
110
-
111
- The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring requires visible checks, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.
112
-
113
- Training can enable dense rewards with `CYBERSECURITY_OWASP_REWARD_MODE=dense_train`.
114
- Dense mode adds configurable progressive rewards, small efficiency penalties, and capped behavior penalties from `training/configs/grpo_small.yaml`; evaluation defaults to sparse terminal scoring.
115
-
116
- ## Scenario Cache And Generation
117
-
118
- Scenario generation is an offline/cache-prep concern. `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then loads a validated executable bundle from the scenario cache when `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`. Local development defaults to `fallback`, which compiles deterministically on a cache miss.
119
-
120
- The scenario/curriculum author is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro` with Hugging Face provider settings, thinking mode enabled, `temperature=1.0`, and `top_p=1.0`. This model config is for scenario authoring, not the RL policy model.
121
-
122
- The cache bundle contract is:
123
-
124
- - `scenario.json`
125
- - `app_source/`
126
- - `policy_graph.json`
127
- - `visible_tests.py`
128
- - `hidden_tests.py`
129
- - `oracle_tests.py`
130
- - `expected_exploit_trace.json`
131
- - `reward_config.json`
132
- - `metadata.json`
133
-
134
- Cache keys include difficulty, authorization bug type, app family, framework, policy shape, tenant model, exploit depth, patch scope, regression risk, generator version, verifier version, and scenario hash.
135
-
136
- The MVP compiler currently generates:
137
-
138
- - invoices domain policy graph;
139
- - bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
140
- - randomized users, tenants, invoices, and IDs;
141
- - generated app files under `app/`;
142
- - visible tests under `tests/test_visible.py`;
143
- - hidden facts, oracle tuples, scenario family metadata, and verifier targets kept out of observations.
144
-
145
- Additional domains and bug families are scaffolded for extension.
146
-
147
- ## Runtime Components
148
-
149
- The OpenEnv runtime is split into small server modules:
150
-
151
- - `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
152
- - `server/scenario_cache.py` writes and loads validated executable scenario bundles.
153
- - `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
154
- - `server/scenario_factory.py` compiles the generated app during cache prep or local fallback.
155
- - `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
156
- - `server/action_tools.py` dispatches typed tools through the sandbox.
157
- - `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix.
158
- - `server/verifier.py` aggregates visible tests, hidden tests, oracle matrix, regression/public-route checks, and patch quality.
159
- - `server/episode_logger.py` appends JSONL rollouts under `outputs/rollouts/`.
160
-
161
- The agent sees partial observations only: product rules, fixture aliases, route summaries, visible test results, and action errors. Hidden tests, oracle tuples, injected bug labels, and held-out scenario-family labels stay internal.
162
-
163
- ## Testing
164
-
165
- ```bash
166
- uv run --extra dev pytest
167
- ```
168
-
169
- The suite covers model serialization, reset/step/state behavior, seed reproducibility, invalid actions, reward outcomes, anti-cheat checks, scripted rollout policies, curriculum selection, adversarial targeting, held-out scenario families, oracle checks, verifier aggregation, and episode artifact logging.
170
-
171
- ## Training Scaffold
172
-
173
- Training files are under `training/`:
174
-
175
- - `rollout.py`
176
- - `reward_funcs.py`
177
- - `train_grpo.py`
178
- - `eval_before_after.py`
179
- - `trackio_utils.py`
180
- - `configs/grpo_small.yaml`
181
-
182
- The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.
183
-
184
  `training/train_grpo.py` in this repo is a config helper only; it does not execute training locally.
185
  Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
186
  `scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
187
 
188
- Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` and mount the persistent `CyberSecurity_OWASP-scenario-cache` volume. Prepare that cache before smoke/training:
189
-
190
- ```bash
191
- uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
192
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
193
- ```
194
-
195
- If the cache slice is missing or below the configured per-bucket minimum, Modal training fails before rollouts rather than compiling scenarios during the run.
196
- The persistent GRPO launcher runs a CPU-only scenario-cache preflight before it starts the L4 GPU function, so missing cache coverage fails before GPU allocation.
197
-
198
- ## Trackio Run Tracking
199
-
200
- Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
201
-
202
- ```bash
203
- export TRACKIO_SPACE_ID=<hf-user>/CyberSecurity_OWASP-trackio
204
- export TRACKIO_PROJECT=CyberSecurity_OWASP-grpo
205
- ```
206
-
207
- Use the tracked smoke wrapper instead of invoking pytest directly when producing run artifacts:
208
-
209
- ```bash
210
- bash scripts/smoke_test.sh
211
- uv run python scripts/track_pytest.py tests
212
- ```
213
-
214
- Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.
215
-
216
- Training, baseline, and smoke runs also log the effective reward config at step
217
- 0. In Trackio, open **Media & Tables** and select the `reward_config` table to
218
- see the actual values for each reward key, including stage-specific values,
219
- caps, thresholds, terminate flags, and descriptions. Scalar metrics under
220
- `reward_config/<key>/<field>` expose the same numeric values for plotting and
221
- filtering, for example `reward_config/policy_inspected/value` and
222
- `reward_config/shaping_weight/resolved`.
223
-
224
- Each run config includes `reward_config_id`, `reward_config_hash`,
225
- `reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
226
- compare runs with the same scenario/model settings and different
227
- `reward_config_hash` values to see which reward weights produced each training
228
- curve.
229
 
230
- ## Modal Ephemeral Runs
231
 
232
- Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
233
-
234
- Install the optional local Modal client:
235
 
236
  ```bash
237
  uv sync --extra modal
238
  ```
239
 
240
- Run a temporary Modal app for a cheap environment/training smoke check:
241
-
242
- ```bash
243
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
244
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
245
- ```
246
-
247
- The app is ephemeral: Modal starts it for the command and stops it when the command exits. The remote result is written locally under `outputs/rollouts/` and the summary metrics are logged to Trackio.
248
-
249
- You can also validate the GRPO config construction remotely:
250
-
251
- ```bash
252
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode grpo-config
253
- ```
254
-
255
- The shell wrapper is equivalent:
256
-
257
- ```bash
258
- MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
259
- ```
260
-
261
- ## Synthetic SFT Before GRPO
262
-
263
- Use supervised fine-tuning to warm-start `unsloth/gemma-4-E2B-it` before GRPO.
264
- The SFT generator executes every teacher action in the real environment and
265
- keeps only trajectories that pass the deterministic reward verifier.
266
-
267
- Generate a 300-train-episode curriculum SFT dataset across levels `0,1,2,3`:
268
 
269
  ```bash
270
  uv run python scripts/generate_sft_dataset.py \
271
  --teacher-model deepseek-ai/DeepSeek-V4-Pro \
272
  --target-model unsloth/gemma-4-E2B-it \
273
  --difficulty-levels 0,1,2,3 \
274
- --difficulty-buckets 4 \
275
  --episodes 75 \
276
  --validation-episodes 20 \
277
  --workers 8 \
278
  --out-dir outputs/sft
279
- ```
280
-
281
- `--episodes` is per difficulty level when `--difficulty-levels` is set, so
282
- `--episodes 75` across four levels gives 300 total train episodes. Expect
283
- roughly 2,400-4,500 chat-format JSONL rows because each successful trajectory
284
- contributes one row per action step. The script writes JSONL rows under
285
- `outputs/sft/`, trajectory artifacts under `outputs/sft/trajectories/`, a
286
- dataset card at `outputs/sft/README.md`, and `outputs/sft/manifest.json` with
287
- reward summaries and curriculum coverage.
288
 
289
- Verify reward metadata before any training run:
290
-
291
- ```bash
292
  uv run python scripts/generate_sft_dataset.py \
293
  --verify-only \
294
  --difficulty-levels 0,1,2,3 \
295
  --out-dir outputs/sft
296
  ```
297
 
298
- Push the verified dataset to Hugging Face Hub:
299
-
300
- ```bash
301
- uv run python scripts/generate_sft_dataset.py \
302
- --push-only \
303
- --difficulty-levels 0,1,2,3 \
304
- --out-dir outputs/sft \
305
- --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
306
- ```
307
-
308
- The canonical dataset repo name is
309
- `Humanlearning/CyberSecurity_OWASP-sft-dataset`. The upload is refused if
310
- reward verification fails or `HF_TOKEN` is missing.
311
-
312
- You can also generate and push in one command by adding `--push-to-hub` to the
313
- generation command.
314
-
315
- For local CI or smoke checks, add `--dry-run-oracle`; official SFT data should
316
- use the teacher path and still pass the verifier gate above.
317
-
318
- Launch SFT on Modal after reward verification passes:
319
 
320
  ```bash
321
  uv run --extra modal modal run --detach scripts/modal_train_sft.py \
@@ -330,184 +234,352 @@ uv run --extra modal modal run --detach scripts/modal_train_sft.py \
330
  --detach
331
  ```
332
 
333
- `scripts/modal_train_sft.py` re-checks the JSONL reward metadata locally before
334
- upload and again inside Modal before loading the model. It refuses to start SFT
335
- unless all required curriculum difficulties are represented and the verifier
336
- reward metadata passes. The default SFT config trains the full dataset
337
- (`--max-steps -1`) with bf16/tf32, LoRA rank 32, and Modal GPU fallback
338
- `H200 -> H100 -> A100-80GB -> L40S`. TRL does not support packing or
339
- assistant-only loss for the Gemma 4 vision-language loader, so both remain
340
- disabled for this model. The script pre-tokenizes the small JSONL dataset
341
- serially before constructing `SFTTrainer`, which avoids TRL multiprocessing
342
- around the Gemma/Unsloth config object. It also uses the base Transformers loss
343
- path to avoid a TRL entropy-metric incompatibility with Gemma 4 lazy logits. A
344
- warm run for the 300-400 episode dataset should usually finish in about 20-60
345
- minutes; first image or model-cache builds can push that closer to 45-90
346
- minutes.
347
-
348
- Continue GRPO from the SFT LoRA:
349
-
350
- The GRPO launcher downloads the Hub adapter, attaches a matching trainable
351
- Unsloth LoRA to Gemma 4, and then loads the adapter safetensors. This keeps the
352
- SFT handoff compatible with Gemma 4's Unsloth linear wrappers.
353
 
354
  ```bash
355
  uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
356
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
357
- --max-steps 300 \
358
- --dataset-size 64 \
359
- --num-generations 8 \
360
- --difficulty 0 \
361
- --trace-log-every 10 \
362
- --detach
363
- ```
364
-
365
- ## Modal GRPO Training
366
-
367
- The persistent GPU training launcher packages this local repo into Modal, trains
368
- a small LoRA GRPO run, logs metrics and traces to Trackio, stores checkpoints in
369
- the `CyberSecurity_OWASP-grpo-runs` Modal volume, and pushes the output adapter
370
- to Hugging Face Hub.
371
-
372
- Create a Modal secret named `CyberSecurity_OWASP-secrets` with `HF_TOKEN`, then
373
- run the import/config check:
374
-
375
- ```bash
376
- uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
377
- ```
378
-
379
- Run the default smoke GRPO job:
380
-
381
- ```bash
382
- uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
383
- uv run --extra modal modal run scripts/modal_train_grpo.py \
384
- --max-steps 10 \
385
- --dataset-size 16 \
386
- --num-generations 6 \
387
- --difficulty 0
388
- ```
389
-
390
- For GPU-utilization tuning on the same single L4, start with a larger but still
391
- bounded no-code trial:
392
-
393
- ```bash
394
- uv run --extra modal modal run scripts/modal_train_grpo.py \
395
- --max-steps 30 \
396
- --dataset-size 64 \
397
- --num-generations 8 \
398
- --max-completion-length 256 \
399
- --difficulty 0
400
- ```
401
-
402
- The launcher exposes GRPO throughput knobs for follow-up trials:
403
-
404
- ```bash
405
- # larger generation group, no vLLM
406
- uv run --extra modal modal run scripts/modal_train_grpo.py \
407
- --max-steps 30 --dataset-size 64 --num-generations 8 \
408
- --max-completion-length 256 --trace-log-every 5
409
-
410
- # vLLM colocate on the same L4
411
- uv run --extra modal modal run scripts/modal_train_grpo.py \
412
- --max-steps 30 --dataset-size 64 --num-generations 8 \
413
- --max-completion-length 256 --use-vllm \
414
- --vllm-gpu-memory-utilization 0.35 --trace-log-every 5
415
-
416
- # larger microbatch if the vLLM trial does not OOM
417
- uv run --extra modal modal run scripts/modal_train_grpo.py \
418
- --max-steps 30 --dataset-size 64 --num-generations 8 \
419
- --per-device-train-batch-size 2 --gradient-accumulation-steps 4 \
420
- --max-completion-length 256 --use-vllm \
421
- --vllm-gpu-memory-utilization 0.45 --trace-log-every 5
422
- ```
423
-
424
- `per_device_train_batch_size * gradient_accumulation_steps * world_size` must
425
- be divisible by `num_generations`; the launcher validates this before the GPU
426
- container starts. Scalar Trackio metrics still log every reward callback, while
427
- sample trace tables and Trace objects are throttled by `--trace-log-every`
428
- (`1` restores every-callback logging, `0` disables trace artifacts).
429
-
430
- ### Parallel Modal GRPO Runs
431
-
432
- Parallel Modal GRPO runs are safe when each run has its own seed range, run
433
- name, and output target, while the shared cache volumes remain read-only.
434
- Before launching another job, check what is already active:
435
-
436
- ```bash
437
- uv run --extra modal modal app list
438
- uv run --extra modal modal app logs <app-id>
439
- ```
440
-
441
- Launch long-running parallel jobs with both Modal CLI detach and the launcher
442
- detach flag. The CLI-level `--detach` keeps the remote function alive after the
443
- local entrypoint exits; the launcher `--detach` prevents the parent Modal
444
- function from waiting on the GPU call.
445
-
446
- ```bash
447
- uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
448
  --max-steps 300 \
449
  --dataset-size 64 \
450
  --num-generations 8 \
451
  --max-completion-length 768 \
452
  --difficulty 0 \
453
  --trace-log-every 10 \
454
- --seed-start 10000 \
455
  --detach
456
  ```
457
 
458
- For multiple concurrent experiments:
459
-
460
- - Use a unique `--seed-start` range for every run, normally spaced by at least
461
- 10,000 seeds.
462
- - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
463
- scenarios during training.
464
- - Do not run `prepare-cache --cache-force` while training jobs are active.
465
- - Keep `--push-to-hub` disabled unless each run has a unique
466
- `--output-repo-id`.
467
- - Let the launcher generate unique timestamped Trackio run names, or set an
468
- explicit `RUN_NAME` only when it is globally unique.
469
- - Use the same Trackio Space/project for comparable metrics, but never reuse a
470
- run name.
471
- - Treat `CyberSecurity_OWASP-model-cache` and
472
- `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
473
- during training. Run outputs and checkpoints should stay under each run's
474
- unique output directory.
475
-
476
- If a Windows shell fails with a Unicode `charmap` encoding error during Modal
477
- startup, rerun with UTF-8 enabled for that command:
478
 
479
  ```powershell
480
- $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
481
  ```
482
 
483
- If running from a public repository and you do not want Modal to package the
484
- local workspace, use public source mode:
485
-
486
- ```bash
487
- uv run --extra modal modal run scripts/modal_train_grpo.py \
488
- --source-mode public \
489
- --repo-url https://github.com/humandotlearning/CyberSecurity_OWASP.git \
490
- --repo-branch master \
491
- --max-steps 10 \
492
- --dataset-size 16 \
493
- --num-generations 6 \
494
- --difficulty 0
495
- ```
496
-
497
- Defaults are derived from `HF_TOKEN`:
498
-
499
- - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
500
- - Trackio project: `CyberSecurity_OWASP-grpo`
501
- - Training model: `unsloth/gemma-4-E2B-it`
502
- - Output repo: `<hf-user>/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-grpo-lora`
503
-
504
- Override these with `--trackio-space-id`, `--trackio-project`, and
505
- `--output-repo-id` when needed. The persistent GRPO launcher intentionally rejects non-Gemma model overrides so smoke runs match the Unsloth Gemma 4 E2B RL notebook.
506
-
507
- ## Docker / Spaces
508
 
509
  ```bash
510
- docker build -t CyberSecurity_OWASP:latest -f server/Dockerfile .
511
- docker run --rm -p 8000:8000 CyberSecurity_OWASP:latest
512
- openenv push --repo-id <username>/CyberSecurity_OWASP
513
- ```
1
+ ---
2
+ title: CyberSecurity_OWASP Environment Server
3
+ emoji: 🛡️
4
+ colorFrom: blue
5
+ colorTo: gray
6
+ sdk: docker
7
+ pinned: false
8
+ app_port: 8000
9
+ base_path: /web
10
+ tags:
11
+ - openenv
12
+ - cybersecurity
13
+ - owasp
14
+ ---
15
+
16
+ # CyberSecurity_OWASP
17
+
18
+ [Hugging Face Space](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP) | [Mini-blog](blog/blog.md)
19
+
20
+ `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
21
+
22
+ ```text
23
+ inspect generated app + policy -> discover authorization bug -> submit diagnosis -> patch code -> preserve intended behavior
24
+ ```
25
+
26
+ The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, config-driven curriculum settings, cache-backed scenario reset, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and decomposed reward.
27
+
28
+ ## Diagrams
29
+
30
+ [Architecture diagram](assets/architecture_diagram.svg) | [RL training flow diagram](assets/env_rl_training_flow_diagram.svg)
31
+
32
+ ![CyberSecurity_OWASP architecture](assets/architecture_diagram.svg)
33
+
34
+ ![CyberSecurity_OWASP RL training flow](assets/env_rl_training_flow_diagram.svg)
35
+
36
+ Editable Mermaid sources are available in `assets/architecture_diagram.mmd` and `assets/env_rl_training_flow_diagram.mmd`.
37
+
38
+ ## Quick Start
39
+
40
+ ```bash
41
+ uv sync --extra dev
42
+ uv run --extra dev pytest
43
+ uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
44
+ uv run server --port 8000
45
+ ```
46
+
47
+ Then connect with the OpenEnv client:
48
+
49
+ ```python
50
+ from CyberSecurity_OWASP import CyberSecurityOWASPAction, CyberSecurityOWASPEnv
51
+
52
+ with CyberSecurityOWASPEnv(base_url="http://localhost:8000") as env:
53
+ result = env.reset(seed=7)
54
+ print(result.observation.task_brief)
55
+ result = env.step(CyberSecurityOWASPAction(tool_name="list_routes"))
56
+ print(result.observation.last_tool_result)
57
+ ```
58
+
59
+ ## Action Space
60
+
61
+ The agent emits one JSON action at a time:
62
+
63
+ ```json
64
+ {"tool_name":"read_file","arguments":{"path":"app/routes/invoices.py"}}
65
+ ```
66
+
67
+ Supported tools:
68
+
69
+ - `inspect_policy_graph`
70
+ - `list_routes`
71
+ - `read_openapi`
72
+ - `read_file`
73
+ - `search_code`
74
+ - `send_local_request`
75
+ - `compare_identities`
76
+ - `submit_diagnosis`
77
+ - `patch_file`
78
+ - `run_visible_tests`
79
+ - `submit_fix`
80
+ - `noop`
81
+
82
+ Tools are phase-gated:
83
+
84
+ - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit diagnosis.
85
+ - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
86
+ - `done`: stable terminal observation only.
87
+
88
+ ## Reward
89
+
90
+ Terminal reward uses stable components:
91
+
92
+ ```python
93
+ {
94
+ "discovery": 0.0,
95
+ "security": 0.0,
96
+ "regression": 0.0,
97
+ "public_routes": 0.0,
98
+ "patch_quality": 0.0,
99
+ "visible_tests": 0.0,
100
+ "safety": 0.0,
101
+ "anti_cheat": 0.0,
102
+ "terminal_total": 0.0,
103
+ "progressive": 0.0,
104
+ "step_penalty": 0.0,
105
+ "speed_bonus": 0.0,
106
+ "token_penalty": 0.0,
107
+ "behavior_penalty": 0.0,
108
+ "train_total": 0.0,
109
+ "total": 0.0,
110
+ }
111
+ ```
112
+
113
+ The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring requires visible checks, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.
114
+
115
+ Training can enable dense rewards with `CYBERSECURITY_OWASP_REWARD_MODE=dense_train`.
116
+ Dense mode adds configurable progressive rewards, small efficiency penalties, and capped behavior penalties from `training/configs/grpo_small.yaml`; evaluation defaults to sparse terminal scoring.
117
+
118
+ ## Scenario Cache And Generation
119
+
120
+ Scenario generation is an offline/cache-prep concern. `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then loads a validated executable bundle from the scenario cache when `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`. Local development defaults to `fallback`, which compiles deterministically on a cache miss.
121
+
122
+ The scenario/curriculum author is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro` with Hugging Face provider settings, thinking mode enabled, `temperature=1.0`, and `top_p=1.0`. This model config is for scenario authoring, not the RL policy model.
123
+
124
+ The cache bundle contract is:
125
+
126
+ - `scenario.json`
127
+ - `app_source/`
128
+ - `policy_graph.json`
129
+ - `visible_tests.py`
130
+ - `hidden_tests.py`
131
+ - `oracle_tests.py`
132
+ - `expected_exploit_trace.json`
133
+ - `reward_config.json`
134
+ - `metadata.json`
135
+
136
+ Cache keys include difficulty, authorization bug type, app family, framework, policy shape, tenant model, exploit depth, patch scope, regression risk, generator version, verifier version, and scenario hash.
137
+
138
+ The MVP compiler currently generates:
139
+
140
+ - invoices domain policy graph;
141
+ - bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
142
+ - randomized users, tenants, invoices, and IDs;
143
+ - generated app files under `app/`;
144
+ - visible tests under `tests/test_visible.py`;
145
+ - hidden facts, oracle tuples, scenario family metadata, and verifier targets kept out of observations.
146
+
147
+ Additional domains and bug families are scaffolded for extension.
148
+
149
+ ## Runtime Components
150
+
151
+ The OpenEnv runtime is split into small server modules:
152
+
153
+ - `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
154
+ - `server/scenario_cache.py` writes and loads validated executable scenario bundles.
155
+ - `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
156
+ - `server/scenario_factory.py` compiles the generated app during cache prep or local fallback.
157
+ - `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
158
+ - `server/action_tools.py` dispatches typed tools through the sandbox.
159
+ - `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix.
160
+ - `server/verifier.py` aggregates visible tests, hidden tests, oracle matrix, regression/public-route checks, and patch quality.
161
+ - `server/episode_logger.py` appends JSONL rollouts under `outputs/rollouts/`.
162
+
163
+ The agent sees partial observations only: product rules, fixture aliases, route summaries, visible test results, and action errors. Hidden tests, oracle tuples, injected bug labels, and held-out scenario-family labels stay internal.
164
+
165
+ ## Testing
166
+
167
+ ```bash
168
+ uv run --extra dev pytest
169
+ ```
170
+
171
+ The suite covers model serialization, reset/step/state behavior, seed reproducibility, invalid actions, reward outcomes, anti-cheat checks, scripted rollout policies, curriculum selection, adversarial targeting, held-out scenario families, oracle checks, verifier aggregation, and episode artifact logging.
172
+
173
+ ## Training Scaffold
174
+
175
+ Training files are under `training/`:
176
+
177
+ - `rollout.py`
178
+ - `reward_funcs.py`
179
+ - `train_grpo.py`
180
+ - `eval_before_after.py`
181
+ - `trackio_utils.py`
182
+ - `configs/grpo_small.yaml`
183
+
184
+ The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.
185
+
186
  `training/train_grpo.py` in this repo is a config helper only; it does not execute training locally.
187
  Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
188
  `scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
189
 
190
+ ### Run SFT And GRPO Training Scripts
191
 
192
+ Training runs on Modal. Do not run the GRPO loop directly on the local machine;
193
+ use the launcher scripts so scenario cache preflight, Trackio logging, Modal
194
+ volumes, and Hub uploads stay consistent.
195
 
196
+ First install the Modal extra and prepare the scenario cache:
197
 
198
  ```bash
199
  uv sync --extra modal
200
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
201
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
202
  ```
203
 
204
+ Generate and verify SFT trajectories before supervised fine-tuning:
205
 
206
  ```bash
207
  uv run python scripts/generate_sft_dataset.py \
208
  --teacher-model deepseek-ai/DeepSeek-V4-Pro \
209
  --target-model unsloth/gemma-4-E2B-it \
210
  --difficulty-levels 0,1,2,3 \
 
211
  --episodes 75 \
212
  --validation-episodes 20 \
213
  --workers 8 \
214
  --out-dir outputs/sft
215
 
 
 
 
216
  uv run python scripts/generate_sft_dataset.py \
217
  --verify-only \
218
  --difficulty-levels 0,1,2,3 \
219
  --out-dir outputs/sft
220
  ```
221
 
222
+ Run SFT on Modal and push the warm-start LoRA:
223
 
224
  ```bash
225
  uv run --extra modal modal run --detach scripts/modal_train_sft.py \
 
234
  --detach
235
  ```
236
 
237
+ Continue with GRPO from the SFT adapter:
238
 
239
  ```bash
240
  uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
241
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
242
  --max-steps 300 \
243
  --dataset-size 64 \
244
  --num-generations 8 \
245
  --max-completion-length 768 \
246
  --difficulty 0 \
247
  --trace-log-every 10 \
248
+ --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
249
+ --trackio-project CyberSecurity_OWASP-grpo \
250
  --detach
251
  ```
252
 
253
+ For reward-rubric ablations, use the PowerShell launcher and configs under
254
+ `training/configs/reward_ablations/`:
255
 
256
  ```powershell
257
+ .\scripts\launch_reward_ablations.ps1
258
  ```
259
 
260
+ Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` and mount the persistent `CyberSecurity_OWASP-scenario-cache` volume. Prepare that cache before smoke/training:
261
 
262
  ```bash
263
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
264
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
265
+ ```
266
+
267
+ If the cache slice is missing or below the configured per-bucket minimum, Modal training fails before rollouts rather than compiling scenarios during the run.
268
+ The persistent GRPO launcher runs a CPU-only scenario-cache preflight before it starts the L4 GPU function, so missing cache coverage fails before GPU allocation.
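
The fail-fast behavior described above can be sketched as a per-bucket coverage check. This is a minimal illustration only: the bucket naming (`train/0`, ...), the minimum of 3, and the function names are assumptions, not the launcher's actual code.

```python
def missing_cache_coverage(bucket_counts, required_buckets, min_per_bucket):
    """Return {bucket: shortfall} for buckets below the per-bucket minimum."""
    shortfalls = {}
    for bucket in required_buckets:
        have = bucket_counts.get(bucket, 0)  # a missing bucket counts as zero
        if have < min_per_bucket:
            shortfalls[bucket] = min_per_bucket - have
    return shortfalls


def preflight(bucket_counts, required_buckets, min_per_bucket):
    """Raise before any GPU work if the cached scenario slice is too small."""
    shortfalls = missing_cache_coverage(
        bucket_counts, required_buckets, min_per_bucket
    )
    if shortfalls:
        raise RuntimeError(f"scenario cache below minimum: {shortfalls}")


# Hypothetical cache contents: one bucket is short and one is absent.
counts = {"train/0": 3, "train/1": 1}
required = ["train/0", "train/1", "train/2"]

try:
    preflight(counts, required, min_per_bucket=3)
except RuntimeError as err:
    print(err)
```

Running a check like this on CPU is cheap, which is why it makes sense to place it ahead of GPU allocation.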
269
+
270
+ ## Trackio Run Tracking
271
+
272
+ Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
273
+
274
+ ```bash
275
+ export TRACKIO_SPACE_ID=<hf-user>/CyberSecurity_OWASP-trackio
276
+ export TRACKIO_PROJECT=CyberSecurity_OWASP-grpo
277
+ ```
278
+
279
+ Use the tracked smoke wrapper instead of invoking pytest directly when producing run artifacts:
280
+
281
+ ```bash
282
+ bash scripts/smoke_test.sh
283
+ uv run python scripts/track_pytest.py tests
284
+ ```
285
+
286
+ Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.
287
+
288
+ Training, baseline, and smoke runs also log the effective reward config at step
289
+ 0. In Trackio, open **Media & Tables** and select the `reward_config` table to
290
+ see the actual values for each reward key, including stage-specific values,
291
+ caps, thresholds, terminate flags, and descriptions. Scalar metrics under
292
+ `reward_config/<key>/<field>` expose the same numeric values for plotting and
293
+ filtering, for example `reward_config/policy_inspected/value` and
294
+ `reward_config/shaping_weight/resolved`.
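
The `reward_config/<key>/<field>` naming can be illustrated by flattening a nested config into scalar metric names. The config shape and the values below are invented for illustration; only the naming pattern comes from the text above.

```python
from numbers import Real


def flatten_reward_config(config, prefix="reward_config"):
    """Map {key: {field: value}} to {'prefix/key/field': float} for numeric fields."""
    metrics = {}
    for key, fields in config.items():
        for field, value in fields.items():
            # bool is a subclass of int, so exclude it explicitly; flags and
            # descriptions belong in the table view, not in scalar plots.
            if isinstance(value, Real) and not isinstance(value, bool):
                metrics[f"{prefix}/{key}/{field}"] = float(value)
    return metrics


# Hypothetical reward config; the numbers are placeholders, not project defaults.
example = {
    "policy_inspected": {"value": 0.05, "description": "agent read the policy"},
    "shaping_weight": {"value": 1.0, "resolved": 0.5, "terminate": False},
}

for name, value in sorted(flatten_reward_config(example).items()):
    print(name, value)
```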
295
+
296
+ Each run config includes `reward_config_id`, `reward_config_hash`,
297
+ `reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
298
+ compare runs with the same scenario/model settings and different
299
+ `reward_config_hash` values to see which reward weights produced each training
300
+ curve.
301
+
302
+ ## Modal Ephemeral Runs
303
+
304
+ Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
305
+
306
+ Install the optional local Modal client:
307
+
308
+ ```bash
309
+ uv sync --extra modal
310
+ ```
311
+
312
+ Run a temporary Modal app for a cheap environment/training smoke check:
313
+
314
+ ```bash
315
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
316
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
317
+ ```
318
+
319
+ The app is ephemeral: Modal starts it for the command and stops it when the command exits. The remote result is written locally under `outputs/rollouts/` and the summary metrics are logged to Trackio.
320
+
321
+ You can also validate the GRPO config construction remotely:
322
+
323
+ ```bash
324
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode grpo-config
325
+ ```
326
+
327
+ The shell wrapper is equivalent:
328
+
329
+ ```bash
330
+ MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
331
+ ```
332
+
333
+ ## Synthetic SFT Before GRPO
334
+
335
+ Use supervised fine-tuning to warm-start `unsloth/gemma-4-E2B-it` before GRPO.
336
+ The SFT generator executes every teacher action in the real environment and
337
+ keeps only trajectories that pass the deterministic reward verifier.
338
+
339
+ Generate a 300-train-episode curriculum SFT dataset across levels `0,1,2,3`:
340
+
341
+ ```bash
342
+ uv run python scripts/generate_sft_dataset.py \
343
+ --teacher-model deepseek-ai/DeepSeek-V4-Pro \
344
+ --target-model unsloth/gemma-4-E2B-it \
345
+ --difficulty-levels 0,1,2,3 \
346
+ --difficulty-buckets 4 \
347
+ --episodes 75 \
348
+ --validation-episodes 20 \
349
+ --workers 8 \
350
+ --out-dir outputs/sft
351
+ ```
352
+
353
+ `--episodes` is per difficulty level when `--difficulty-levels` is set, so
354
+ `--episodes 75` across four levels gives 300 total train episodes. Expect
355
+ roughly 2,400-4,500 chat-format JSONL rows because each successful trajectory
356
+ contributes one row per action step. The script writes JSONL rows under
357
+ `outputs/sft/`, trajectory artifacts under `outputs/sft/trajectories/`, a
358
+ dataset card at `outputs/sft/README.md`, and `outputs/sft/manifest.json` with
359
+ reward summaries and curriculum coverage.
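
The episode arithmetic above can be checked in a few lines. The steps-per-episode figure is an inference from the quoted row range and assumes every generated episode passes the verifier; it is not a number stated by the script.

```python
def total_train_episodes(episodes_per_level, difficulty_levels):
    """--episodes applies per level when --difficulty-levels is set."""
    return episodes_per_level * len(difficulty_levels)


def implied_steps_per_episode(row_range, episodes):
    """Rows are one per action step, so rows / episodes bounds the step count."""
    low, high = row_range
    return low / episodes, high / episodes


episodes = total_train_episodes(75, [0, 1, 2, 3])
print(episodes)  # 300
print(implied_steps_per_episode((2400, 4500), episodes))  # (8.0, 15.0)
```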
360
+
361
+ Verify reward metadata before any training run:
362
+
363
+ ```bash
364
+ uv run python scripts/generate_sft_dataset.py \
365
+ --verify-only \
366
+ --difficulty-levels 0,1,2,3 \
367
+ --out-dir outputs/sft
368
+ ```
369
+
370
+ Push the verified dataset to Hugging Face Hub:
371
+
372
+ ```bash
373
+ uv run python scripts/generate_sft_dataset.py \
374
+ --push-only \
375
+ --difficulty-levels 0,1,2,3 \
376
+ --out-dir outputs/sft \
377
+ --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
378
+ ```
379
+
380
+ The canonical dataset repo name is
381
+ `Humanlearning/CyberSecurity_OWASP-sft-dataset`. The upload is refused if
382
+ reward verification fails or `HF_TOKEN` is missing.
383
+
384
+ You can also generate and push in one command by adding `--push-to-hub` to the
385
+ generation command.
386
+
387
+ For local CI or smoke checks, add `--dry-run-oracle`; official SFT data should
388
+ use the teacher path and still pass the verifier gate above.
389
+
390
+ Launch SFT on Modal after reward verification passes:
391
+
392
+ ```bash
393
+ uv run --extra modal modal run --detach scripts/modal_train_sft.py \
394
+ --local-train-path outputs/sft/train.jsonl \
395
+ --local-validation-path outputs/sft/validation.jsonl \
396
+ --local-manifest-path outputs/sft/manifest.json \
397
+ --required-difficulties 0,1,2,3 \
398
+ --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
399
+ --trackio-project CyberSecurity_OWASP-sft \
400
+ --output-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
401
+ --push-to-hub \
402
+ --detach
403
+ ```
404
+
405
+ `scripts/modal_train_sft.py` re-checks the JSONL reward metadata locally before
406
+ upload and again inside Modal before loading the model. It refuses to start SFT
407
+ unless all required curriculum difficulties are represented and the verifier
408
+ reward metadata passes. The default SFT config trains the full dataset
409
+ (`--max-steps -1`) with bf16/tf32, LoRA rank 32, and Modal GPU fallback
410
+ `H200 -> H100 -> A100-80GB -> L40S`. TRL does not support packing or
411
+ assistant-only loss for the Gemma 4 vision-language loader, so both remain
412
+ disabled for this model. The script pre-tokenizes the small JSONL dataset
413
+ serially before constructing `SFTTrainer`, which avoids TRL multiprocessing
414
+ around the Gemma/Unsloth config object. It also uses the base Transformers loss
415
+ path to avoid a TRL entropy-metric incompatibility with Gemma 4 lazy logits. A
416
+ warm run for the 300-400 episode dataset should usually finish in about 20-60
417
+ minutes; first image or model-cache builds can push that closer to 45-90
418
+ minutes.
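
The curriculum gate described above (refuse to start unless every required difficulty is represented) can be sketched as follows. The `difficulty` field name and the row layout are assumptions for illustration; the real JSONL schema may differ.

```python
import json


def missing_difficulties(jsonl_lines, required):
    """Return the required difficulty levels that no JSONL row covers."""
    seen = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        seen.add(int(row["difficulty"]))  # hypothetical field name
    return sorted(set(required) - seen)


# Hypothetical rows standing in for outputs/sft/train.jsonl.
rows = [
    '{"difficulty": 0, "messages": []}',
    '{"difficulty": 1, "messages": []}',
    '{"difficulty": 2, "messages": []}',
]

gaps = missing_difficulties(rows, required=[0, 1, 2, 3])
if gaps:
    print(f"refusing to start SFT, missing difficulties: {gaps}")
```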
419
+
420
+ Continue GRPO from the SFT LoRA.
421
+
422
+ The GRPO launcher downloads the Hub adapter, attaches a matching trainable
423
+ Unsloth LoRA to Gemma 4, and then loads the adapter safetensors. This keeps the
424
+ SFT handoff compatible with Gemma 4's Unsloth linear wrappers.
425
+
426
+ ```bash
427
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
428
+ --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
429
+ --max-steps 300 \
430
+ --dataset-size 64 \
431
+ --num-generations 8 \
432
+ --difficulty 0 \
433
+ --trace-log-every 10 \
434
+ --detach
435
+ ```
436
+
437
+ ## Modal GRPO Training
438
+
439
+ The persistent GPU training launcher packages this local repo into Modal, trains
440
+ a small LoRA GRPO run, logs metrics and traces to Trackio, stores checkpoints in
441
+ the `CyberSecurity_OWASP-grpo-runs` Modal volume, and pushes the output adapter
442
+ to Hugging Face Hub.
443
+
444
+ Create a Modal secret named `CyberSecurity_OWASP-secrets` with `HF_TOKEN`, then
445
+ run the import/config check:
446
+
447
+ ```bash
448
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
449
+ ```
450
+
451
+ Run the default smoke GRPO job:
452
+
453
+ ```bash
454
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
455
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
456
+ --max-steps 10 \
457
+ --dataset-size 16 \
458
+ --num-generations 6 \
459
+ --difficulty 0
460
+ ```
461
+
462
+ For GPU-utilization tuning on the same single L4, start with a larger but still
463
+ bounded trial that needs no code changes:
464
+
465
+ ```bash
466
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
467
+ --max-steps 30 \
468
+ --dataset-size 64 \
469
+ --num-generations 8 \
470
+ --max-completion-length 256 \
471
+ --difficulty 0
472
+ ```
473
+
474
+ The launcher exposes GRPO throughput knobs for follow-up trials:
475
+
476
+ ```bash
477
+ # larger generation group, no vLLM
478
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
479
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
480
+ --max-completion-length 256 --trace-log-every 5
481
+
482
+ # vLLM colocate on the same L4
483
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
484
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
485
+ --max-completion-length 256 --use-vllm \
486
+ --vllm-gpu-memory-utilization 0.35 --trace-log-every 5
487
+
488
+ # larger microbatch if the vLLM trial does not OOM
489
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
490
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
491
+ --per-device-train-batch-size 2 --gradient-accumulation-steps 4 \
492
+ --max-completion-length 256 --use-vllm \
493
+ --vllm-gpu-memory-utilization 0.45 --trace-log-every 5
494
+ ```
495
+
496
+ `per_device_train_batch_size * gradient_accumulation_steps * world_size` must
497
+ be divisible by `num_generations`; the launcher validates this before the GPU
498
+ container starts. Scalar Trackio metrics still log every reward callback, while
499
+ sample trace tables and Trace objects are throttled by `--trace-log-every`
500
+ (`1` restores every-callback logging, `0` disables trace artifacts).
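
The divisibility rule above is easy to check before launching. The sketch below only restates that rule; the function name is hypothetical and the example values are not claimed to be the launcher's defaults.

```python
def generation_batch_is_valid(
    per_device_train_batch_size,
    gradient_accumulation_steps,
    world_size,
    num_generations,
):
    """True when the effective batch splits evenly into generation groups."""
    effective = (
        per_device_train_batch_size * gradient_accumulation_steps * world_size
    )
    return effective % num_generations == 0


# 2 * 4 * 1 = 8 samples, which is exactly one group of 8 generations.
print(generation_batch_is_valid(2, 4, 1, 8))  # True

# 1 * 4 * 1 = 4 samples cannot be split into groups of 6.
print(generation_batch_is_valid(1, 4, 1, 6))  # False
```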
501
+
502
+ ### Parallel Modal GRPO Runs
503
+
504
+ Parallel Modal GRPO runs are safe when each run has its own seed range, run
505
+ name, and output target, while the shared cache volumes remain read-only.
506
+ Before launching another job, check what is already active:
507
+
508
+ ```bash
509
+ uv run --extra modal modal app list
510
+ uv run --extra modal modal app logs <app-id>
511
+ ```
512
+
513
+ Launch long-running parallel jobs with both Modal CLI detach and the launcher
514
+ detach flag. The CLI-level `--detach` keeps the remote function alive after the
515
+ local entrypoint exits; the launcher `--detach` prevents the parent Modal
516
+ function from waiting on the GPU call.
517
+
518
+ ```bash
519
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
520
+ --max-steps 300 \
521
+ --dataset-size 64 \
522
+ --num-generations 8 \
523
+ --max-completion-length 768 \
524
+ --difficulty 0 \
525
+ --trace-log-every 10 \
526
+ --seed-start 10000 \
527
+ --detach
528
+ ```
529
+
530
+ For multiple concurrent experiments:
531
+
532
+ - Use a unique `--seed-start` range for every run, normally spaced by at least
533
+ 10,000 seeds.
534
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
535
+ scenarios during training.
536
+ - Do not run `prepare-cache --cache-force` while training jobs are active.
537
+ - Keep `--push-to-hub` disabled unless each run has a unique
538
+ `--output-repo-id`.
539
+ - Let the launcher generate unique timestamped Trackio run names, or set an
540
+ explicit `RUN_NAME` only when it is globally unique.
541
+ - Use the same Trackio Space/project for comparable metrics, but never reuse a
542
+ run name.
543
+ - Treat `CyberSecurity_OWASP-model-cache` and
544
+ `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
545
+ during training. Run outputs and checkpoints should stay under each run's
546
+ unique output directory.
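
The seed-spacing guideline from the list above can be sketched as a small planning helper. The function names and the per-run seed budget are hypothetical; only the 10,000-seed spacing comes from the text.

```python
def seed_starts(first, runs, spacing=10_000):
    """Return one --seed-start value per run, spaced evenly."""
    return [first + i * spacing for i in range(runs)]


def ranges_overlap(starts, seeds_per_run):
    """True if any two runs would draw from overlapping seed ranges."""
    ordered = sorted(starts)
    return any(b - a < seeds_per_run for a, b in zip(ordered, ordered[1:]))


starts = seed_starts(10_000, runs=3)
print(starts)  # [10000, 20000, 30000]

# Assumed upper bound on seeds one run consumes; adjust to the real budget.
print(ranges_overlap(starts, seeds_per_run=5_000))  # False
```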
547
+
548
+ If a Windows shell fails with a Unicode `charmap` encoding error during Modal
549
+ startup, rerun with UTF-8 enabled for that command:
550
+
551
+ ```powershell
552
+ $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
553
+ ```
554
+
555
+ If running from a public repository and you do not want Modal to package the
556
+ local workspace, use public source mode:
557
+
558
+ ```bash
559
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
560
+ --source-mode public \
561
+ --repo-url https://github.com/humandotlearning/CyberSecurity_OWASP.git \
562
+ --repo-branch master \
563
+ --max-steps 10 \
564
+ --dataset-size 16 \
565
+ --num-generations 6 \
566
+ --difficulty 0
567
+ ```
568
+
569
+ Defaults, with `<hf-user>` resolved from `HF_TOKEN`:
570
+
571
+ - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
572
+ - Trackio project: `CyberSecurity_OWASP-grpo`
573
+ - Training model: `unsloth/gemma-4-E2B-it`
574
+ - Output repo: `<hf-user>/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-grpo-lora`
575
+
576
+ Override these with `--trackio-space-id`, `--trackio-project`, and
577
+ `--output-repo-id` when needed. The persistent GRPO launcher intentionally rejects non-Gemma model overrides so smoke runs match the Unsloth Gemma 4 E2B RL notebook.
578
+
579
+ ## Docker / Spaces
580
+
581
+ ```bash
582
+ docker build -t cybersecurity_owasp:latest -f server/Dockerfile .
583
+ docker run --rm -p 8000:8000 cybersecurity_owasp:latest
584
+ openenv push --repo-id <username>/CyberSecurity_OWASP
585
+ ```