mahithakur committed on
Commit
4bb4e67
·
1 Parent(s): d25e8b9

Readme cleanup

Files changed (2)
  1. README.md +86 -429
  2. uv.lock +0 -0
README.md CHANGED
@@ -12,493 +12,150 @@ tags:
12
  - code-review
13
  - rl-training
14
  - grpo
15
- - world-modeling
16
  - probe
17
  ---
18
 
19
- # PRobe — Train an Agent to Investigate Code, Not Just Scan It
20
 
21
- After training in PRobe, an agent can read a Python pull request, pinpoint real bugs and deliberate security backdoors line-by-line, classify each flaw as an honest mistake or an intentional attack, and know when to escalate to a security team — all from a reward signal with no LLM judge.
22
 
23
- ---
24
-
25
- ## The Problem
26
-
27
- The XZ Utils backdoor (CVE-2024-3094) slipped through two years of open-source review. SolarWinds compromised 18,000 organisations via a tampered build pipeline. In both cases the malicious change *looked* like a legitimate contribution — the kind of PR that lands in a code-review queue every day.
28
-
29
- Today's LLMs scan code like a linter. They find style issues, flag known CVE patterns, and produce plausible-sounding comments. What they don't do is *investigate* — reason about intent, distinguish an honest off-by-one from a planted authentication bypass, or know when to escalate rather than request changes. Reward signals for code generation are everywhere; reward signals for critical code *evaluation* barely exist.
30
-
31
- PRobe closes that gap. Its fully deterministic grader — keyword + line-range matching, no LLM judge — separates investigation quality from keyword spam. An agent that dumps every security term at random lines scores *negative*. One that reads carefully, probes for context, finds the right lines, and correctly labels each flaw as an honest bug or a deliberate backdoor scores close to `+1.0`.
32
-
33
- ---
34
-
35
- ## What the Agent Sees, Does, and Gets Rewarded For
36
-
37
- ### Plain English
38
-
39
- The agent is handed a Python source file and asked to review it like a senior security engineer. It can annotate suspicious lines, probe specific regions for more context, run a simulated scanner (which, like real tools, misses things and occasionally lies), and finally submit a verdict. On adversarial tasks it must also decide whether the code contains a deliberate backdoor and escalate to a security team if so. Every episode the code surface changes — variable names, line numbers, constants — so the agent cannot memorise answers; it has to read.
40
-
41
- ### What the Agent Observes (`ProbeObservation`)
42
-
43
- | Field | Description |
44
- |---|---|
45
- | `code_snippet` | Mutated Python source for this episode |
46
- | `task_description` | Review instructions and goals |
47
- | `file_name` | Name of the file being reviewed |
48
- | `task_id` / `task_difficulty` | Current task index (0–9) and difficulty label |
49
- | `review_history` | All actions taken so far this episode |
50
- | `step_count` / `max_steps` | Steps used vs. budget |
51
- | `issues_found_count` / `total_issues` | Progress tracker |
52
- | `context_hints` | Causal hints unlocked by finding key issues |
53
- | `reward` | Most recent step reward in `[-1.0, 1.0]` |
54
- | `done` | Whether the episode has ended |
55
-
56
- ### What Actions the Agent Can Take (`ProbeAction`)
57
-
58
- | Action | Effect |
59
- |---|---|
60
- | `add_comment` | Annotate a line with text, severity, category, and optional backdoor classification |
61
- | `get_context` | Reveal ±5 lines of context around a chosen line number |
62
- | `run_scanner` | Invoke simulated static-analysis tool (70 % recall, up to 2 false positives injected) |
63
- | `request_changes` | Mark PR as requiring fixes (correct terminal action for tasks 0–6) |
64
- | `approve` | Approve the PR (penalised if issues remain) |
65
- | `submit_review` | Finalise the review and end the episode |
66
- | `escalate_to_security_review` | Flag PR as containing a deliberate attack (required for tasks 7–9) |
67
-
68
- ### Reward Formula
69
-
70
- Reward accumulates across steps and is finalised at submission:
71
-
72
- ```
73
- Episode reward =
74
-
75
- Σ per-comment (ADD_COMMENT):
76
- issue_credit = (weight_i / total_weight) × 0.40 ← found a real issue
77
- classification_credit = (weight_i / total_weight) × 0.20 ← correct bug/backdoor label
78
- misclassify_penalty = −0.05 ← found it but labelled it wrong
79
- false_positive_penalty = −0.05 ← substantive comment, no issue matched
80
-
81
- + on terminal (SUBMIT_REVIEW or ESCALATE):
82
- coverage_bonus = weighted_coverage × 0.15 ← proportional to issues found
83
- decision_score = +0.15 / −0.15 ← correct / wrong final action
84
- (bonus gated: requires coverage ≥ 30 %)
85
- efficiency_bonus = (1 − steps_used/max_steps) × 0.10 ← unlocked only if coverage ≥ 60 %
86
-
87
- Maximum achievable: ~1.0 Minimum: −1.0
88
- ```
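The terminal part of the formula above can be sketched in a few lines of Python. This is an illustrative reading of the README text, not the project's grader: the function name `terminal_reward` is hypothetical, and it assumes that a correct decision below the 30 % coverage gate earns zero while a wrong decision is always penalised.

```python
def terminal_reward(weighted_coverage: float, decision_correct: bool,
                    steps_used: int, max_steps: int) -> float:
    """Terminal reward terms as described in the formula (hypothetical helper)."""
    # Coverage bonus is proportional to the weighted share of issues found.
    coverage_bonus = weighted_coverage * 0.15

    if not decision_correct:
        decision_score = -0.15
    elif weighted_coverage >= 0.30:
        decision_score = 0.15
    else:
        # Assumption: a correct guess without enough coverage earns nothing.
        decision_score = 0.0

    # Efficiency bonus unlocks only once coverage reaches 60 %.
    efficiency_bonus = 0.0
    if weighted_coverage >= 0.60:
        efficiency_bonus = (1 - steps_used / max_steps) * 0.10

    return coverage_bonus + decision_score + efficiency_bonus


# Full coverage, correct decision, 18 of 25 steps used.
print(round(terminal_reward(1.0, True, 18, 25), 3))
```

Adding the per-comment credits (up to 0.40 + 0.20) to the terminal maximum (0.15 + 0.15 + 0.10) gives the stated ceiling of about 1.0.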
89
-
90
- ### Anti-Exploit Verifier
91
-
92
- A comment earns `issue_credit` only when **all three** conditions hold simultaneously:
93
-
94
- 1. **`keyword_hit`** — at least one issue keyword appears in the comment text
95
- 2. **`line_hit`** — `line_number` is within ±2 lines of the declared issue range
96
- 3. **`substantive`** — comment body is longer than 15 characters
97
-
98
- This closes three common reward-hacking paths: keyword spam (fails `line_hit`), wide-net line fishing (fails `keyword_hit`), and one-word dumps (fails `substantive`). The decision bonus additionally requires weighted coverage ≥ 30 % before it can be earned, so an agent that never reads code and always guesses `request_changes` earns zero — not a bonus.
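The three-condition check can be sketched as follows. This is a minimal illustration under assumptions, not the project's verifier: the name `earns_issue_credit` is hypothetical, and case-insensitive substring matching and whitespace stripping are guesses about details the README does not specify.

```python
def earns_issue_credit(comment: str, line_number: int,
                       keywords: list[str], line_range: tuple[int, int]) -> bool:
    """Return True only when all three verifier conditions hold."""
    text = comment.lower()
    # keyword_hit: at least one issue keyword appears in the comment text.
    keyword_hit = any(k.lower() in text for k in keywords)

    # line_hit: the commented line is within 2 lines of the declared range.
    start, end = line_range
    line_hit = start - 2 <= line_number <= end + 2

    # substantive: the comment body is longer than 15 characters.
    substantive = len(comment.strip()) > 15

    return keyword_hit and line_hit and substantive


issue = {"keywords": ["sql injection"], "line_range": (12, 13)}
print(earns_issue_credit("SQL injection via f-string query", 12, **issue))
print(earns_issue_credit("SQL injection via f-string query", 9999, **issue))
```

The second call fails only `line_hit`, which is the keyword-spam case described above.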
99
-
100
- ### Perfect Episode vs. Failing Episode
101
 
102
- **Perfect:** The agent reads the code, annotates every real issue at the correct line with a substantive, keyword-bearing comment, correctly labels each as `accidental_bug` or `intentional_backdoor`, escalates when required, and submits with steps to spare. Score approaches `1.0`.
103
 
104
- **Failing:** The agent spams generic comments on random lines, never co-locates a keyword with a real issue line, triggers false-positive penalties on every step, and submits the wrong terminal action. Score approaches `−1.0`.
105
-
106
- ---
107
-
108
- ## Environment Design
109
-
110
- ### Difficulty Tiers
111
-
112
- | Tier | Tasks | Max Steps | Issues | What Changes |
113
- |---|---|---|---|---|
114
- | **Ultra-Easy** (bootstrap) | 0 | 6 | 2 | Category names spelled out in code comments above each bug; guarantees GRPO positive trajectories from step 1 |
115
- | **Easy** | 1 | 15 | 3 | Clean logic bugs, no hints, no distractors |
116
- | **Medium** | 2, 4 | 20 | 5 | Security and async-concurrency issues; wider surface area |
117
- | **Hard** | 3, 5, 6 | 30–35 | 6–7 | Mixed bug classes; task 6 adds a causal unlock chain |
118
- | **Adversarial** | 7, 8, 9 | 25–35 | 2–3 | Deliberate backdoors requiring `intentional_backdoor` classification and `escalate_to_security_review` |
119
-
120
- ### Tasks (10 total)
121
-
122
- | ID | Difficulty | File | Issues | Domain |
123
- |---|---|---|---|---|
124
- | 0 | Ultra-easy | `bootstrap.py` | 2 | Off-by-one + hardcoded credential (hinted in comments) |
125
- | 1 | Easy | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
126
- | 2 | Medium | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
127
- | 3 | Hard | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
128
- | 4 | Medium | `async_worker.py` | 5 | Race condition, missing await, resource leak |
129
- | 5 | Hard | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
130
- | 6 | Hard | `auth_service.py` | 6 | **Causal chain** — JWT forgery → privilege escalation |
131
- | 7 | Adversarial | `session_manager.py` | 2 | Planted auth bypass + hardcoded bypass credential |
132
- | 8 | Adversarial | `jwt_validator.py` | 2 | **Compound backdoor** — fixed-seed JWT secret + timing oracle |
133
- | 9 | Adversarial | `compat_shim.py` | 3 | **Supply chain** — import-time env-var exfiltration to attacker domain |
134
-
135
- ### GRPO Curriculum (5 Phases in `train_grpo.py`)
136
-
137
- | Phase | Steps | Tasks in Pool |
138
- |---|---|---|
139
- | 0 | 0 – 40 | 0–1 (ultra-easy / easy) |
140
- | 1 | 40 – 80 | 0–3 (adds medium / hard) |
141
- | 2 | 80 – 120 | 0–6 (adds causal chain) |
142
- | 3 | 120 – 160 | 0–8 (adds adversarial) |
143
- | 4 | 160 – 200 | 0–9 (full curriculum) |
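A lookup from training step to task pool, following the table above, might look like the sketch below. The names `PHASES` and `task_pool` are hypothetical and not taken from `train_grpo.py`; the sketch assumes half-open step intervals (step 40 already belongs to phase 1) and that steps past 200 stay in the final phase.

```python
# (exclusive upper step bound, task IDs in the pool), one entry per phase
PHASES = [
    (40, range(0, 2)),    # phase 0: tasks 0-1
    (80, range(0, 4)),    # phase 1: tasks 0-3
    (120, range(0, 7)),   # phase 2: tasks 0-6
    (160, range(0, 9)),   # phase 3: tasks 0-8
    (200, range(0, 10)),  # phase 4: tasks 0-9
]


def task_pool(step: int) -> list[int]:
    """Return the task IDs available at a given training step."""
    for upper, pool in PHASES:
        if step < upper:
            return list(pool)
    # Assumption: anything beyond the last boundary uses the full curriculum.
    return list(PHASES[-1][1])


print(task_pool(0))
print(task_pool(125))
```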
144
-
145
- ### Reward Components with Weights
146
-
147
- | Component | Weight | Trigger |
148
- |---|---|---|
149
- | `issue_credit` | up to **0.40** cumulative | `add_comment` matches a real issue (keyword + line + length) |
150
- | `classification_credit` | up to **0.20** cumulative | correct `accidental_bug` / `intentional_backdoor` label |
151
- | `misclassify_penalty` | **−0.05** per issue | issue found but wrong classification label |
152
- | `false_positive_penalty` | **−0.05** per comment | substantive comment, zero issues matched |
153
- | `coverage_bonus` | up to **0.15** terminal | `weighted_coverage × 0.15` |
154
- | `decision_score` | **±0.15** terminal | correct / wrong `request_changes` vs `escalate` decision |
155
- | `efficiency_bonus` | up to **0.10** terminal | `(1 − steps/max_steps) × 0.10` when coverage ≥ 60 % |
156
- | `format_bonus` | **+0.02** once | response contains a valid non-empty JSON array |
157
-
158
- ### Dynamic World (Anti-Memorisation)
159
-
160
- Each episode `mutate_task()` applies three seed-controlled transforms:
161
-
162
- | Mutation | Example |
163
  |---|---|
164
- | Variable rename | `total` → `acc`, `data` → `payload`, `password` → `passwd` |
165
- | Line shift | Blank line inserted above first issue; all `line_range` values shift +1 |
166
- | Constant variance | `range(len(data) + 1)` → `range(len(data) + 2)` |
167
-
168
- Mutations are deterministic given the episode seed: runs are reproducible, yet each new seed produces a fresh code surface.
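To make the seed-controlled idea concrete, here is a rough sketch of two of the three transforms (variable rename and line shift). It is not the project's `mutate_task()`: the function `mutate`, the `RENAMES` table, and the choice to rename a single variable per episode are assumptions for illustration, and constant variance is left out.

```python
import random
import re

# Rename pairs taken from the example row in the table above.
RENAMES = {"total": "acc", "data": "payload", "password": "passwd"}


def mutate(source: str, line_ranges: list[tuple[int, int]], seed: int):
    """Apply a seeded rename and a one-line shift; return (source, ranges)."""
    rng = random.Random(seed)  # same seed, same mutation

    # Variable rename: pick one pair and replace whole-word matches only.
    old = rng.choice(sorted(RENAMES))
    source = re.sub(rf"\b{re.escape(old)}\b", RENAMES[old], source)

    # Line shift: insert a blank line above the first issue (1-indexed),
    # so every declared issue range moves down by one line.
    lines = source.split("\n")
    first_issue = min(start for start, _ in line_ranges)
    lines.insert(first_issue - 1, "")
    shifted = [(start + 1, end + 1) for start, end in line_ranges]

    return "\n".join(lines), shifted


snippet = "def f(data):\n    total = 0\n    for i in range(len(data) + 1):\n        total += data[i]\n    return total"
mutated, ranges = mutate(snippet, [(3, 3)], seed=7)
print(ranges)
```

Because the grader's `line_range` values are shifted together with the source, the ground truth stays aligned with the mutated code.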
169
-
170
- ### Scanner Noise Model (`scanner.py`)
171
-
172
- `run_scanner()` simulates a real lint/security tool:
173
- - **Recall: 70 %** — each real issue is reported with probability 0.70; ~30 % silently missed
174
- - **False-positive rate: 40 %** — up to 2 injected plausible-but-wrong findings per run
175
- - Scanner output is **not auto-graded** — the agent must still call `add_comment` with a correct line + keyword to earn reward
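The noise model can be approximated with a short simulation. This sketch is not the code in `scanner.py`: the function `run_scanner_sketch` and its parameters are hypothetical, and it assumes that "40 % false-positive rate, up to 2 injected" means each of at most two decoy findings is included independently with probability 0.40.

```python
import random


def run_scanner_sketch(real_issues, decoys, rng,
                       recall=0.70, fp_rate=0.40, max_fp=2):
    """Return a noisy list of findings: some real issues missed, some decoys added."""
    # Each real issue is reported with probability `recall`.
    findings = [issue for issue in real_issues if rng.random() < recall]

    # Assumption: draw at most `max_fp` decoys, each kept with probability `fp_rate`.
    candidates = rng.sample(decoys, k=min(max_fp, len(decoys)))
    findings += [decoy for decoy in candidates if rng.random() < fp_rate]
    return findings


real = ["sql_injection", "md5_hash", "eval_call"]
decoys = ["unused_import", "long_line", "shadowed_name"]
print(run_scanner_sketch(real, decoys, random.Random(42)))
```

Passing a seeded `random.Random` keeps scanner output reproducible per episode, in line with the deterministic design described elsewhere in this README.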
176
-
177
- ### Causal Unlock Chain (Task 6)
178
-
179
- Finding certain issues appends new context hints to the observation, modelling real investigations where one discovery leads to a deeper one:
180
 
181
- ```
182
- Find hardcoded JWT secret → DB schema revealed → agent can reason: forge token → privilege escalation
183
- Find missing rate-limit → nginx config shown → confirms /auth fully exposed with no IP filtering
184
- ```
185
 
186
- ### OpenEnv Interface
187
 
188
- | Method | Returns | Notes |
189
- |---|---|---|
190
- | `reset()` | `ProbeObservation` | Starts new episode; advances task cursor; applies mutation |
191
- | `step(action)` | `(ProbeObservation, RewardType, bool, dict)` | Executes action; returns obs, structured reward, done flag, info dict |
192
- | `state` (sync property) | `State(episode_id, step_count)` | Lightweight snapshot for `create_app` |
193
- | `async_state()` | `dict` | Full async snapshot with all episode fields |
194
 
195
- ---
196
 
197
- ## Quickstart
198
 
199
  ```bash
200
- # 1. Install all dependencies
201
  uv sync
202
-
203
- # 2. Start the server + frontend in one command
204
  uv run python run.py
205
-
206
- # The terminal will print:
207
- # ==========================================================
208
- # PRobe — AI Code Review Training Environment
209
- # ==========================================================
210
- # Frontend → http://localhost:8000/ui/
211
- # API docs → http://localhost:8000/docs
212
- # WebSocket → ws://localhost:8000/ws
213
- # ==========================================================
214
-
215
- # 3. Open your browser
216
- open http://localhost:8000/ui/
217
-
218
- # Run zero-shot GPT-4o-mini baseline (requires OPENAI_API_KEY)
219
- export OPENAI_API_KEY=sk-...
220
- uv run python training/baseline.py
221
-
222
- # Smoke-test reward function (no GPU, no API key)
223
- uv run python training/train_grpo.py --test
224
  ```
225
 
226
- ---
227
-
228
- ## Interactive Frontend Dashboard
229
 
230
- PRobe ships with a **zero-dependency browser UI** that turns the RL environment into a live, interactive demo.
231
- No npm, no build step — just start the server and open your browser.
232
 
233
- ### What It Looks Like
234
 
235
- ```
236
- +----------------------------------------------------------------------------+
237
- | 🔍 PRobe Adversarial Code Review — RL Training Environment |
238
- | 🟢 Connected [New Ep] |
239
- +------------------------------+-------------------+-------------------------+
240
- | Task 2 — auth.py | Actions | Reward Dashboard |
241
- | medium • Step 3 / 20 | | |
242
- | | 💬 Add Comment | O +0.24 |
243
- | ⚠️ External contributor, | .------------. | cumulative |
244
- | no prior commit history | | Line: [12] | | |
245
- | | | Comment: | | Issue credit ##... |
246
- | Review this auth module. | | SQL inject.| | Classif. #.... |
247
- | Identify bugs and decide | | Severity: | | FP penalty ..... |
248
- | whether to escalate or | | [critical] | | Coverage ###.. |
249
- | request changes. | | Category: | | Decision ####. |
250
- | | | [security] | | Efficiency ##... |
251
- | .-- auth.py -------------. | .------------. | |
252
- | | 1: import hashlib | | [Submit Comment] | Issues Found |
253
- | | 2: | | | ######.... 2 / 5 |
254
- | | 3: DB_PASS = 's3cr' | | Quick Actions | |
255
- | |12: cursor.execute( | | [Get Context] | Episode History |
256
- | | f"SELECT * FROM | | [Run Scanner] | .---------------. |
257
- | | users WHERE | | --------------- | |ADD_COMMENT+0.12| |
258
- | |13: username='{u}'" | | [Req Changes] | |sql inj. L12 | |
259
- | |14: ) | | [Approve PR] | |---------------| |
260
- | '------------------------' | [Submit Review] | |RUN_SCANNER 0.00| |
261
- | | [Escalate!] | '---------------' |
262
- +------------------------------+-------------------+-------------------------+
263
- ```
264
 
265
- ### Three-Column Layout
266
-
267
- **Left — Code Viewer**
268
- - Full source code with **line numbers** for every episode
269
- - Lines are **colour-coded** as you act:
270
- - 🔵 Blue — line you just commented on
271
- - 🟡 Yellow — line flagged by the scanner
272
- - 🟢 Green — line you probed with Get Context
273
- - **Unlocked hints** appear below the code as green panels whenever a key issue is found
274
- - The **adversarial hint** banner tells you whether the PR is from a trusted team member or an unknown external contributor
275
-
276
- **Centre — Action Panel**
277
- - **Add Comment** form: line number, free-text comment, severity, category, and bug/backdoor classification
278
- - **Quick Actions**: single-click buttons for all 7 action types
279
-
280
- | Button | Action | What Happens |
281
- |---|---|---|
282
- | 🔍 Get Context | `get_context` | Reveals ±5 lines around the probed line number |
283
- | 🤖 Run Scanner | `run_scanner` | Runs the simulated static-analysis tool |
284
- | 🔄 Request Changes | `request_changes` | Records your review decision |
285
- | ✅ Approve PR | `approve` | Approves (−0.15 penalty if < 50 % issues found) |
286
- | 📤 Submit Review | `submit_review` | Ends the episode; triggers terminal scoring |
287
- | 🚨 Escalate to Security | `escalate_to_security_review` | Correct only on adversarial tasks 7–9 |
288
-
289
- **Right — Reward Dashboard**
290
- - **Animated ring** showing cumulative episode reward (green above zero, red below)
291
- - **Six component bars** updating in real time after every action:
292
- - Issue credit, Classification credit, FP penalty
293
- - Coverage bonus, Decision score, Efficiency bonus
294
- - **Issues progress bar** showing how many ground-truth issues you have found
295
- - **Episode history feed** — every action with its reward delta and explanation
296
-
297
- ### Episode End Modal
298
-
299
- When the episode terminates (via Submit Review or Escalate), a modal pops up showing:
300
 
301
- ```
302
- 🏆 Episode Passed!
303
 
304
- "Found 5/5 issues (weighted coverage 100%).
305
- Decision 'escalate_to_security_review' was correct."
 
 
306
 
307
- ┌───────────────────────────────────┐
308
- │ Cumulative reward +0.874 │
309
- │ Issues found 5 / 5 │
310
- │ Steps used 18 / 25 │
311
- │ Decision escalate │
312
- │ Escalation required Yes │
313
- └───────────────────────────────────┘
314
 
315
- [Start New Episode]
316
- ```
317
 
318
- Clicking **Start New Episode** automatically loads the next task in the difficulty ladder.
319
 
320
- ### How to Run
321
 
322
  ```bash
323
- # Install dependencies (one-time)
324
- uv sync
325
-
326
- # Start the server — this also serves the frontend
327
- uv run python run.py
328
  ```
329
 
330
- Then open **`http://localhost:8000/ui/`** in any browser. No additional setup, no separate frontend server.
331
-
332
- **Optional flags:**
333
 
334
  ```bash
335
- # Different port
336
- uv run python run.py --port 9000
337
-
338
- # Bind to localhost only (do not expose on the network)
339
- uv run python run.py --host 127.0.0.1
340
-
341
- # Dev mode: auto-reload Python files on save
342
- uv run python run.py --reload
343
  ```
344
 
345
- ### How the Frontend Connects
346
-
347
- The browser communicates with the backend over a **persistent WebSocket** at `ws://localhost:8000/ws`.
348
- Each browser tab gets its own isolated environment instance — concurrent sessions do not share state.
349
- The WebSocket URL is auto-detected from `window.location.hostname` so the UI works on any host or port without editing any file.
350
-
351
- ### Why a Frontend Helps the Story
352
-
353
- | Without Frontend | With Frontend |
354
- |---|---|
355
- | `total=0.345` in a log file | Animated reward ring filling green in real time |
356
- | `issues_found: ['sql_injection']` | Line 12 highlighted blue in the code viewer |
357
- | `decision: escalate_to_security_review` | 🚨 Escalate button, modal with final score and stats |
358
- | Understanding the anti-exploit rule | Watching a keyword-spam comment score −0.05 FP penalty |
359
- | Explaining the causal chain mechanic | Green hint panel appearing after finding the JWT issue |
360
-
361
- The dashboard makes the reward signal **tangible** — a visitor can play one episode in two minutes and immediately understand what makes PRobe different from a linter.
362
-
363
- ---
364
-
365
- ## Training
366
-
367
- | | |
368
- |---|---|
369
- | **Training script** | [`training/train_grpo.py`](training/train_grpo.py) |
370
- | **Notebook** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/FILL_COLAB_LINK) — replace with your Colab link |
371
- | **Model** | `Qwen/Qwen2.5-1.5B-Instruct` (default) — swap via `--model` flag |
372
- | **Algorithm** | GRPO via HuggingFace TRL + optional Unsloth 4-bit LoRA |
373
- | **Hardware** | Single T4 GPU (Kaggle free tier) or A10/A100 (Colab Pro) |
374
- | **Training time** | ~3 hours for 200 steps on T4 with Unsloth 4-bit |
375
- | **Curriculum** | 5 phases: ultra-easy → easy → medium/hard → causal chain → adversarial |
376
 
377
  ```bash
378
- # Install training dependencies (works on Kaggle/Colab too)
379
- pip install -e ".[training]"
380
-
381
- # Full training — Unsloth 4-bit (recommended, single T4/A10)
382
- uv run python training/train_grpo.py --model Qwen/Qwen2.5-1.5B-Instruct --use-unsloth
383
-
384
- # Plain TRL without Unsloth (more VRAM needed)
385
- uv run python training/train_grpo.py --model Qwen/Qwen2.5-1.5B-Instruct
386
-
387
- # Resume from checkpoint
388
- uv run python training/train_grpo.py --model Qwen/Qwen2.5-1.5B-Instruct --resume-from ./outputs/checkpoint-80
389
-
390
- # Smoke-test reward function only (no GPU, no model download, < 5 seconds)
391
- uv run python training/train_grpo.py --test
392
  ```
393
 
394
- ---
395
-
396
- ## Results
397
 
398
- ### Baseline Agent Comparison
399
-
400
- The chart below shows reward scores for four scripted (non-ML) agents across all 10 tasks.
401
- This validates the reward function *before* any ML training: only the oracle — which reads
402
- ground-truth issue data and constructs the ideal output — scores high. Every gaming attempt
403
- scores near zero or negative, proving the reward cannot be exploited.
404
-
405
- ![Baseline Comparison](outputs/baseline_comparison.svg)
406
 
407
- *Per-task and mean reward for four deterministic agents (seed = 42). Adversarial tasks 7–9
408
- require correct `intentional_backdoor` classification AND `escalate_to_security_review` —
409
- only the oracle achieves this.*
410
 
411
- ### Reward Component Breakdown
412
 
413
- ![Reward Breakdown](outputs/reward_breakdown.svg)
 
 
414
 
415
- *Stacked reward components for the Perfect Oracle agent. Tasks 7–9 unlock Classification
416
- Credit because they contain deliberate backdoors requiring `intentional_backdoor` labelling.
417
- No other agent earns classification credit — it requires correctly locating the issue AND
418
- understanding the attacker's intent.*
419
 
420
- ### Scripted Baseline Summary (seed = 42, all 10 tasks)
421
 
422
- | Agent | Mean Reward | Min | Max | What It Represents |
423
- |---|---|---|---|---|
424
- | **Perfect Oracle** | **+0.778** | 0.000 | +1.000 | Upper bound — reads ground truth, constructs ideal output |
425
- | Line Flooder | −0.025 | −0.130 | +0.020 | Comments on every 5th line with vague generic text |
426
- | Keyword Spammer | −0.075 | −0.180 | −0.030 | Dumps all security keywords at a deliberately wrong line |
427
- | Random Agent | −0.260 | −0.480 | −0.080 | Random lines, random categories, random terminal actions |
428
 
429
- **Anti-gaming result:** both exploit agents score less than −3 % of oracle reward.
430
- An agent *must* read the code to earn positive reward — keyword spam alone scores negative.
431
 
432
- ### Anti-Gaming Verifier Results
 
 
433
 
434
- | Exploit Attempt | Score | Why It Fails |
435
- |---|---|---|
436
- | All keywords, wrong line (9999) | −0.075 | `line_hit` fails — no issue within ±2 lines of 9999 |
437
- | Correct line, 8-char comment | 0.00 | `substantive` fails — comment < 15 characters |
438
- | Correct keyword + line, no classification | partial | Earns `issue_credit` but misses `classification_credit` |
439
- | Guess `escalate` on non-adversarial task | −0.15 | `decision_score` = −DECISION_REWARD (false alarm penalty) |
440
- | Guess `request_changes` on adversarial task | −0.15 | Wrong terminal action even if issues are found |
441
 
442
- ### Oracle Per-Task Scores
443
 
444
- | Task | Difficulty | Oracle Score | Notes |
445
- |---|---|---|---|
446
- | T0 | ultra-easy | +0.787 | Hinted in comments; bootstraps GRPO positive trajectories |
447
- | T1 | easy | +0.800 | Clean logic bugs |
448
- | T2 | medium | +0.795 | SQL injection, eval(), hardcoded creds |
449
- | T3 | hard | +0.000 | Known edge case: mutation shifts lines beyond oracle's fixed midpoint |
450
- | T4 | medium | +0.795 | Async concurrency bugs |
451
- | T5 | hard | +0.800 | Command injection, path traversal, pickle RCE |
452
- | T6 | hard | +0.803 | Causal chain — JWT forgery → privilege escalation |
453
- | T7 | adversarial | **+1.000** | Perfect: both backdoors found + escalated + classified |
454
- | T8 | adversarial | **+1.000** | Compound JWT backdoor chain |
455
- | T9 | adversarial | **+1.000** | Supply-chain import hook — 3 coordinated exfiltration components |
456
 
457
- ### GRPO Training Curves
458
 
459
- > **To populate:** run `uv run python training/train_grpo.py --model Qwen/Qwen2.5-1.5B-Instruct --use-unsloth`
460
- > The script writes `outputs/training_curves.png` on completion. Commit it and the image appears here.
461
 
462
  ![Training Curves](outputs/training_curves.png)
463
 
464
- *Reward, policy loss, and classification accuracy vs. training step (200 steps, 5-phase
465
- curriculum). Smoothed curve (window = 10) over raw per-episode values.*
466
-
467
- ### Before vs. After Training
468
-
469
- | Metric | Random Baseline | Scripted Oracle (ceiling) | Trained Agent |
470
- |---|---|---|---|
471
- | Average episode reward | −0.260 | +0.778 | *run training to populate* |
472
- | Issue detection rate | 0 % | 100 % | *run training to populate* |
473
- | False positive rate | high | 0 % | *run training to populate* |
474
- | Backdoor classification accuracy | 0 % | 100 % | *run training to populate* |
475
- | Escalation recall (adversarial tasks 7–9) | ~33 % (random) | 100 % | *run training to populate* |
476
-
477
- > Random baseline (−0.260) and oracle ceiling (+0.778) are **real measured numbers** from
478
- > `outputs/scripted_baseline.jsonl`. Replace the *"run training to populate"* cells with your
479
- > actual numbers after training completes.
480
-
481
- ---
482
-
483
- ## Why This Matters
484
-
485
- Security code review is a high-stakes task performed by a small number of specialists — it does not scale to the volume of code that modern teams ship. An agent that can reliably read a PR, flag bugs with accurate line references, distinguish honest mistakes from deliberate backdoors (the XZ Utils and SolarWinds failure mode), and escalate with justification would directly accelerate secure software delivery for any team using AI-assisted development. This is also a largely unexplored domain for RL: existing code benchmarks reward *generating* correct outputs, not *critically evaluating* someone else's work, leaving the oversight and adversarial-detection capabilities of LLMs essentially untrained.
486
-
487
- ---
488
-
489
- ## Links
490
-
491
- > **Before submitting:** replace every placeholder below with a real URL.
492
- > Judges follow these links directly — missing links are a non-negotiable submission disadvantage.
493
-
494
- | Resource | URL |
495
- |---|---|
496
- | 🤗 HuggingFace Space (live environment) | Replace with your HF Space URL |
497
- | 📓 Training notebook (Colab / Kaggle) | Replace with your Colab or Kaggle link |
498
- | 📝 Mini-blog / writeup (HuggingFace) | Replace with your HF blog post URL |
499
- | 🎥 Demo video (YouTube, < 2 min) | Replace with your YouTube URL |
500
- | 📊 Slides / presentation | Replace with your slides URL |
501
- | 📈 WandB training run | Replace with your WandB run URL |
502
 
503
  ---
504
 
 
12
  - code-review
13
  - rl-training
14
  - grpo
 
15
  - probe
16
  ---
17
 
18
+ # PRobe — an AI code reviewer that can spot backdoors
19
 
20
+ ## Submission links (judge quick access)
21
 
22
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/FILL_COLAB_LINK)
23
 
24
+ > Replace each placeholder below with a real URL before submission.
25
 
26
+ | Resource | URL |
27
  |---|---|
28
+ | 🤗 HuggingFace Space (live environment) | Replace with your HF Space URL |
29
+ | 📓 Training notebook (Colab / Kaggle) | Replace with your Colab or Kaggle link |
30
+ | 📝 Mini-blog / writeup (HuggingFace) | Replace with your HF blog post URL |
31
+ | 🎥 Demo video (YouTube, < 2 min) | Replace with your YouTube URL |
32
+ | 📊 Slides / presentation | Replace with your slides URL |
33
+ | 📈 WandB training run | Replace with your WandB run URL |
 
 
 
 
 
 
 
 
 
 
34
 
35
+ ## TL;DR
 
 
 
36
 
37
+ PRobe is a training environment where an AI learns to **review Python code like a careful security engineer**:
38
 
39
+ - Find real bugs and security issues (with correct line numbers)
40
+ - Tell the difference between an honest mistake and a deliberate backdoor
41
+ - Decide whether to **approve**, **request changes**, or **escalate to security**
 
 
 
42
 
43
+ Unlike many demos, PRobe uses a **deterministic reward** (no “LLM judge”). Keyword-spam on random lines gets penalized; careful, accurate findings score high.
44
 
45
+ ## Try it in 60 seconds
46
 
47
  ```bash
 
48
  uv sync
 
 
49
  uv run python run.py
50
  ```
51
 
52
+ Then open `http://localhost:8000/ui/` and click **New Episode**.
 
 
53
 
54
+ ## Why it exists (simple version)
 
55
 
56
+ Real supply-chain attacks (like XZ Utils / SolarWinds) often look like normal code changes. A useful AI reviewer must do more than “scan” — it must **investigate intent** and know when to escalate.
57
 
58
+ ## What’s novel (in plain English)
59
 
60
+ - **No LLM judge**: reward is deterministic and reproducible.
61
+ - **Anti-gaming**: keyword spam on random lines gets penalized.
62
+ - **Backdoor escalation**: some tasks require choosing “escalate to security”, not just listing bugs.
63
 
64
+ ## What’s inside (high level)
 
65
 
66
+ - **10 tasks** that simulate real review situations (bugs + adversarial backdoors)
67
+ - A **mutator** that changes variable names/line numbers so the model can’t memorize answers
68
+ - A **grader** that scores outputs based on “right issue + right place + good explanation”
69
+ - A lightweight **web UI** so anyone can try an episode in the browser
70
 
71
+ If you want the full technical design, see `docs/design.md`.
 
 
 
 
 
 
72
 
73
+ ## Training (GRPO)
 
74
 
75
+ The training entrypoint is `training/train_grpo.py`.
76
 
77
+ ### Install training dependencies
78
 
79
  ```bash
80
+ pip install -e ".[training]"
 
 
 
 
81
  ```
82
 
83
+ ### Smoke test (no GPU, no model download)
 
 
84
 
85
  ```bash
86
+ python training/train_grpo.py --test
 
 
 
 
 
 
 
87
  ```
88
 
89
+ ### Train (example)
90
 
91
  ```bash
92
+ python training/train_grpo.py \
93
+ --model Qwen/Qwen2.5-1.5B-Instruct \
94
+ --steps 200 \
95
+ --group-size 2 \
96
+ --batch-size 2 \
97
+ --grad-accum 1 \
98
+ --max-seq-len 1024 \
99
+ --max-completion-len 128 \
100
+ --save-steps 50
 
 
 
 
 
101
  ```
102
 
103
+ ### Resume from a checkpoint
 
 
104
 
105
+ ```bash
106
+ python training/train_grpo.py \
107
+ --model Qwen/Qwen2.5-1.5B-Instruct \
108
+ --steps 200 \
109
+ --resume-from outputs/checkpoint-100
110
+ ```
 
 
111
 
112
+ ### Reproduce our run (copy/paste template)
 
 
113
 
114
+ Fill these before submission:
115
 
116
+ - **Hardware**: (T4 / A100 / …)
117
+ - **Steps**: (100 / 200)
118
+ - **Runtime**: (~__ minutes)
119
 
120
+ Example command (200 steps, checkpoints every 50 steps):
 
 
 
121
 
122
+ ```bash
123
+ python training/train_grpo.py \
124
+ --model Qwen/Qwen2.5-1.5B-Instruct \
125
+ --steps 200 \
126
+ --group-size 2 \
127
+ --batch-size 2 \
128
+ --grad-accum 1 \
129
+ --max-seq-len 1024 \
130
+ --max-completion-len 128 \
131
+ --save-steps 50 \
132
+ --output-dir outputs
133
+ ```
134
 
135
+ ## Outputs
 
 
 
 
 
136
 
137
+ Training writes artifacts under `outputs/` (or your `--output-dir`), including:
 
138
 
139
+ - Checkpoints: `checkpoint-*`
140
+ - Curves: `training_curves.png`, `per_task_reward.png`
141
+ - Demo traces (adversarial tasks): `demo/before_task*.json`, `demo/after_task*.json`
142
 
143
+ ## Before vs. after training (images)
 
 
 
 
 
 
144
 
145
+ Fill these before submission (numbers judges can scan fast):
146
 
147
+ - **Mean reward**: before __ → after __
148
+ - **Escalation recall (tasks 7–9)**: before __ → after __
149
+ - **False positives per episode**: before __ → after __
 
 
 
 
 
 
 
 
 
150
 
151
+ After training, these images are written to `outputs/` and help show improvement:
152
 
153
+ - `outputs/training_curves.png` (reward / loss over steps)
154
+ - `outputs/per_task_reward.png` (per-task reward before vs after)
155
 
156
  ![Training Curves](outputs/training_curves.png)
157
 
158
+ ![Per-task Reward](outputs/per_task_reward.png)
159
 
160
  ---
161
 
uv.lock CHANGED
The diff for this file is too large to render. See raw diff