Commit 4bb4e67
Parent(s): d25e8b9
Readme cleanup
README.md
CHANGED
|
@@ -12,493 +12,150 @@ tags:
|
|
| 12 |
- code-review
|
| 13 |
- rl-training
|
| 14 |
- grpo
|
| 15 |
-
- world-modeling
|
| 16 |
- probe
|
| 17 |
---
|
| 18 |
|
| 19 |
-
# PRobe —
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
-
|
| 24 |
-
|
| 25 |
-
## The Problem
|
| 26 |
-
|
| 27 |
-
The XZ Utils backdoor (CVE-2024-3094) slipped through two years of open-source review. The SolarWinds attack reached roughly 18,000 organisations via a tampered build pipeline. In both cases the malicious change *looked* like a legitimate contribution, the kind of PR that lands in a code-review queue every day.
|
| 28 |
-
|
| 29 |
-
Today's LLMs scan code like a linter. They find style issues, flag known CVE patterns, and produce plausible-sounding comments. What they don't do is *investigate* — reason about intent, distinguish an honest off-by-one from a planted authentication bypass, or know when to escalate rather than request changes. Reward signals for code generation are everywhere; reward signals for critical code *evaluation* barely exist.
|
| 30 |
-
|
| 31 |
-
PRobe closes that gap. Its fully deterministic grader — keyword + line-range matching, no LLM judge — separates investigation quality from keyword spam. An agent that dumps every security term at random lines scores *negative*. One that reads carefully, probes for context, finds the right lines, and correctly labels each flaw as an honest bug or a deliberate backdoor scores close to `+1.0`.
|
| 32 |
-
|
| 33 |
-
---
|
| 34 |
-
|
| 35 |
-
## What the Agent Sees, Does, and Gets Rewarded For
|
| 36 |
-
|
| 37 |
-
### Plain English
|
| 38 |
-
|
| 39 |
-
The agent is handed a Python source file and asked to review it like a senior security engineer. It can annotate suspicious lines, probe specific regions for more context, run a simulated scanner (which, like real tools, misses things and occasionally lies), and finally submit a verdict. On adversarial tasks it must also decide whether the code contains a deliberate backdoor and escalate to a security team if so. Every episode the code surface changes — variable names, line numbers, constants — so the agent cannot memorise answers; it has to read.
|
| 40 |
-
|
| 41 |
-
### What the Agent Observes (`ProbeObservation`)
|
| 42 |
-
|
| 43 |
-
| Field | Description |
|
| 44 |
-
|---|---|
|
| 45 |
-
| `code_snippet` | Mutated Python source for this episode |
|
| 46 |
-
| `task_description` | Review instructions and goals |
|
| 47 |
-
| `file_name` | Name of the file being reviewed |
|
| 48 |
-
| `task_id` / `task_difficulty` | Current task index (0–9) and difficulty label |
|
| 49 |
-
| `review_history` | All actions taken so far this episode |
|
| 50 |
-
| `step_count` / `max_steps` | Steps used vs. budget |
|
| 51 |
-
| `issues_found_count` / `total_issues` | Progress tracker |
|
| 52 |
-
| `context_hints` | Causal hints unlocked by finding key issues |
|
| 53 |
-
| `reward` | Most recent step reward in `[-1.0, 1.0]` |
|
| 54 |
-
| `done` | Whether the episode has ended |
|
| 55 |
-
|
| 56 |
-
### What Actions the Agent Can Take (`ProbeAction`)
|
| 57 |
-
|
| 58 |
-
| Action | Effect |
|
| 59 |
-
|---|---|
|
| 60 |
-
| `add_comment` | Annotate a line with text, severity, category, and optional backdoor classification |
|
| 61 |
-
| `get_context` | Reveal ±5 lines of context around a chosen line number |
|
| 62 |
-
| `run_scanner` | Invoke simulated static-analysis tool (70 % recall, up to 2 false positives injected) |
|
| 63 |
-
| `request_changes` | Mark PR as requiring fixes (correct terminal action for tasks 0–6) |
|
| 64 |
-
| `approve` | Approve the PR (penalised if issues remain) |
|
| 65 |
-
| `submit_review` | Finalise the review and end the episode |
|
| 66 |
-
| `escalate_to_security_review` | Flag PR as containing a deliberate attack (required for tasks 7–9) |
|
| 67 |
-
|
| 68 |
-
### Reward Formula
|
| 69 |
-
|
| 70 |
-
Reward accumulates across steps and is finalised at submission:
|
| 71 |
-
|
| 72 |
-
```
|
| 73 |
-
Episode reward =
|
| 74 |
-
|
| 75 |
-
Σ per-comment (ADD_COMMENT):
|
| 76 |
-
issue_credit = (weight_i / total_weight) × 0.40 ← found a real issue
|
| 77 |
-
classification_credit = (weight_i / total_weight) × 0.20 ← correct bug/backdoor label
|
| 78 |
-
misclassify_penalty = −0.05 ← found it but labelled it wrong
|
| 79 |
-
false_positive_penalty = −0.05 ← substantive comment, no issue matched
|
| 80 |
-
|
| 81 |
-
+ on terminal (SUBMIT_REVIEW or ESCALATE):
|
| 82 |
-
coverage_bonus = weighted_coverage × 0.15 ← proportional to issues found
|
| 83 |
-
decision_score = +0.15 / −0.15 ← correct / wrong final action
|
| 84 |
-
(bonus gated: requires coverage ≥ 30 %)
|
| 85 |
-
efficiency_bonus = (1 − steps_used/max_steps) × 0.10 ← unlocked only if coverage ≥ 60 %
|
| 86 |
-
|
| 87 |
-
Maximum achievable: ~1.0 Minimum: −1.0
|
| 88 |
-
```
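As a rough illustration of the terminal terms in the formula above, the sketch below computes the coverage bonus, the gated decision score, and the efficiency bonus. The function name `terminal_reward` and its parameters are hypothetical and not taken from PRobe's source. It also assumes that a wrong decision is penalised at any coverage level, which the formula does not state explicitly.

```python
def terminal_reward(weighted_coverage: float, decision_correct: bool,
                    steps_used: int, max_steps: int) -> float:
    """Hypothetical sketch of the terminal reward terms (not PRobe's actual code)."""
    # Coverage bonus is proportional to the weighted share of issues found.
    coverage_bonus = weighted_coverage * 0.15

    # A correct decision only pays out once coverage reaches 30 %.
    # Assumption: a wrong decision is penalised at any coverage level.
    if decision_correct:
        decision_score = 0.15 if weighted_coverage >= 0.30 else 0.0
    else:
        decision_score = -0.15

    # Efficiency bonus is unlocked only at 60 % coverage or more.
    efficiency_bonus = 0.0
    if weighted_coverage >= 0.60:
        efficiency_bonus = (1 - steps_used / max_steps) * 0.10

    return coverage_bonus + decision_score + efficiency_bonus
```

Under these assumptions, full coverage with a correct decision and half the step budget used gives 0.15 + 0.15 + 0.05 = 0.35 on top of the per-comment credits.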
|
| 89 |
-
|
| 90 |
-
### Anti-Exploit Verifier
|
| 91 |
-
|
| 92 |
-
A comment earns `issue_credit` only when **all three** conditions hold simultaneously:
|
| 93 |
-
|
| 94 |
-
1. **`keyword_hit`** — at least one issue keyword appears in the comment text
|
| 95 |
-
2. **`line_hit`** — `line_number` is within ±2 lines of the declared issue range
|
| 96 |
-
3. **`substantive`** — comment body is longer than 15 characters
|
| 97 |
-
|
| 98 |
-
This closes three common reward-hacking paths: keyword spam (fails `line_hit`), wide-net line fishing (fails `keyword_hit`), and one-word dumps (fails `substantive`). The decision bonus additionally requires weighted coverage ≥ 30 % before it can be earned, so an agent that never reads code and always guesses `request_changes` earns zero — not a bonus.
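The three-condition check can be sketched as follows. This is a minimal illustration, not the grader's real code: the `Issue` dataclass, its field names, and `earns_issue_credit` are hypothetical, and case-insensitive substring matching is an assumption about how keywords are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical ground-truth record for one planted issue."""
    keywords: tuple[str, ...]
    line_start: int
    line_end: int


def earns_issue_credit(comment: str, line_number: int, issue: Issue) -> bool:
    text = comment.lower()
    # keyword_hit: at least one issue keyword appears in the comment text.
    keyword_hit = any(k.lower() in text for k in issue.keywords)
    # line_hit: the commented line is within 2 lines of the declared range.
    line_hit = issue.line_start - 2 <= line_number <= issue.line_end + 2
    # substantive: the comment body is longer than 15 characters.
    substantive = len(comment.strip()) > 15
    return keyword_hit and line_hit and substantive
```

With an issue declared on lines 12 to 13, a keyword dump at line 9999 fails `line_hit`, a vague but long comment at line 12 fails `keyword_hit`, and a short note at line 12 fails `substantive`.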
|
| 99 |
-
|
| 100 |
-
### Perfect Episode vs. Failing Episode
|
| 101 |
|
| 102 |
-
|
| 103 |
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
---
|
| 107 |
-
|
| 108 |
-
## Environment Design
|
| 109 |
-
|
| 110 |
-
### Difficulty Tiers
|
| 111 |
-
|
| 112 |
-
| Tier | Tasks | Max Steps | Issues | What Changes |
|
| 113 |
-
|---|---|---|---|---|
|
| 114 |
-
| **Ultra-Easy** (bootstrap) | 0 | 6 | 2 | Category names spelled out in code comments above each bug; guarantees GRPO positive trajectories from step 1 |
|
| 115 |
-
| **Easy** | 1 | 15 | 3 | Clean logic bugs, no hints, no distractors |
|
| 116 |
-
| **Medium** | 2, 4 | 20 | 5 | Security and async-concurrency issues; wider surface area |
|
| 117 |
-
| **Hard** | 3, 5, 6 | 30–35 | 6–7 | Mixed bug classes; task 6 adds a causal unlock chain |
|
| 118 |
-
| **Adversarial** | 7, 8, 9 | 25–35 | 2–3 | Deliberate backdoors requiring `intentional_backdoor` classification and `escalate_to_security_review` |
|
| 119 |
-
|
| 120 |
-
### Tasks (10 total)
|
| 121 |
-
|
| 122 |
-
| ID | Difficulty | File | Issues | Domain |
|
| 123 |
-
|---|---|---|---|---|
|
| 124 |
-
| 0 | Ultra-easy | `bootstrap.py` | 2 | Off-by-one + hardcoded credential (hinted in comments) |
|
| 125 |
-
| 1 | Easy | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
|
| 126 |
-
| 2 | Medium | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
|
| 127 |
-
| 3 | Hard | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
|
| 128 |
-
| 4 | Medium | `async_worker.py` | 5 | Race condition, missing await, resource leak |
|
| 129 |
-
| 5 | Hard | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
|
| 130 |
-
| 6 | Hard | `auth_service.py` | 6 | **Causal chain** — JWT forgery → privilege escalation |
|
| 131 |
-
| 7 | Adversarial | `session_manager.py` | 2 | Planted auth bypass + hardcoded bypass credential |
|
| 132 |
-
| 8 | Adversarial | `jwt_validator.py` | 2 | **Compound backdoor** — fixed-seed JWT secret + timing oracle |
|
| 133 |
-
| 9 | Adversarial | `compat_shim.py` | 3 | **Supply chain** — import-time env-var exfiltration to attacker domain |
|
| 134 |
-
|
| 135 |
-
### GRPO Curriculum (5 Phases in `train_grpo.py`)
|
| 136 |
-
|
| 137 |
-
| Phase | Steps | Tasks in Pool |
|
| 138 |
-
|---|---|---|
|
| 139 |
-
| 0 | 0 – 40 | 0–1 (ultra-easy / easy) |
|
| 140 |
-
| 1 | 40 – 80 | 0–3 (adds medium / hard) |
|
| 141 |
-
| 2 | 80 – 120 | 0–6 (adds causal chain) |
|
| 142 |
-
| 3 | 120 – 160 | 0–8 (adds adversarial) |
|
| 143 |
-
| 4 | 160 – 200 | 0–9 (full curriculum) |
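A phase lookup matching the table above might look like the sketch below. The names `PHASES` and `task_pool` are hypothetical and do not come from `train_grpo.py`. Because the step ranges in the table share their endpoints, the sketch assumes each phase covers a half-open interval, so step 40 already belongs to phase 1.

```python
# (exclusive end step, highest task ID in the pool), one entry per phase.
PHASES = [(40, 1), (80, 3), (120, 6), (160, 8), (200, 9)]


def task_pool(step: int) -> list[int]:
    """Return the task IDs that may be sampled at a given training step."""
    for end_step, max_task_id in PHASES:
        if step < end_step:
            return list(range(max_task_id + 1))
    # Past the last phase boundary, keep sampling from the full curriculum.
    return list(range(PHASES[-1][1] + 1))
```

For example, `task_pool(0)` returns `[0, 1]`, while any step from 160 onwards returns all ten task IDs.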
|
| 144 |
-
|
| 145 |
-
### Reward Components with Weights
|
| 146 |
-
|
| 147 |
-
| Component | Weight | Trigger |
|
| 148 |
-
|---|---|---|
|
| 149 |
-
| `issue_credit` | up to **0.40** cumulative | `add_comment` matches a real issue (keyword + line + length) |
|
| 150 |
-
| `classification_credit` | up to **0.20** cumulative | correct `accidental_bug` / `intentional_backdoor` label |
|
| 151 |
-
| `misclassify_penalty` | **−0.05** per issue | issue found but wrong classification label |
|
| 152 |
-
| `false_positive_penalty` | **−0.05** per comment | substantive comment, zero issues matched |
|
| 153 |
-
| `coverage_bonus` | up to **0.15** terminal | `weighted_coverage × 0.15` |
|
| 154 |
-
| `decision_score` | **±0.15** terminal | correct / wrong `request_changes` vs `escalate` decision |
|
| 155 |
-
| `efficiency_bonus` | up to **0.10** terminal | `(1 − steps/max_steps) × 0.10` when coverage ≥ 60 % |
|
| 156 |
-
| `format_bonus` | **+0.02** once | response contains a valid non-empty JSON array |
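The per-comment rows of this table can be read as the following sketch. The function `comment_reward` and its arguments are hypothetical. It assumes the comment is already substantive, that `issue_weight` is `None` when no issue matched, and that `label_correct` is `None` when the agent gave no classification.

```python
from typing import Optional


def comment_reward(issue_weight: Optional[float], total_weight: float,
                   label_correct: Optional[bool]) -> float:
    """Hypothetical sketch of the reward for one substantive add_comment."""
    if issue_weight is None:
        # Substantive comment that matched no issue: false-positive penalty.
        return -0.05

    share = issue_weight / total_weight
    reward = share * 0.40          # issue_credit
    if label_correct is True:
        reward += share * 0.20     # classification_credit
    elif label_correct is False:
        reward -= 0.05             # misclassify_penalty
    return reward
```

If the five issues of a task were weighted equally (an assumption), one correctly found and correctly labelled issue would earn 0.08 + 0.04 = 0.12, which agrees with the `ADD_COMMENT+0.12` entry in the dashboard mock-up.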
|
| 157 |
-
|
| 158 |
-
### Dynamic World (Anti-Memorisation)
|
| 159 |
-
|
| 160 |
-
Each episode `mutate_task()` applies three seed-controlled transforms:
|
| 161 |
-
|
| 162 |
-
| Mutation | Example |
|
| 163 |
|---|---|
|
| 164 |
-
|
|
| 165 |
-
|
|
| 166 |
-
|
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
### Scanner Noise Model (`scanner.py`)
|
| 171 |
-
|
| 172 |
-
`run_scanner()` simulates a real lint/security tool:
|
| 173 |
-
- **Recall: 70 %** — each real issue is reported with probability 0.70; ~30 % silently missed
|
| 174 |
-
- **False-positive rate: 40 %** — up to 2 injected plausible-but-wrong findings per run
|
| 175 |
-
- Scanner output is **not auto-graded** — the agent must still call `add_comment` with a correct line + keyword to earn reward
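One way to picture this noise model is the seeded sketch below. It is not the contents of `scanner.py`: the function name, its parameters, and the `"real"` / `"decoy"` tags are hypothetical, and it assumes that each of up to two decoy lines is reported independently with probability 0.40, which is only one possible reading of the false-positive bullet.

```python
import random


def run_scanner_sketch(real_issue_lines, decoy_lines, seed,
                       recall=0.70, fp_rate=0.40, max_fp=2):
    """Return (line, tag) findings from a noisy, seed-controlled scanner."""
    rng = random.Random(seed)

    # Each real issue is reported with probability `recall`; the rest are missed.
    findings = [(line, "real") for line in real_issue_lines
                if rng.random() < recall]

    # Assumption: pick at most `max_fp` decoy lines, each injected with
    # probability `fp_rate`.
    candidates = rng.sample(list(decoy_lines), k=min(max_fp, len(decoy_lines)))
    for line in candidates:
        if rng.random() < fp_rate:
            findings.append((line, "decoy"))
    return findings
```

The tags are for illustration only. An agent would see findings without knowing which ones are decoys, which is why it still has to verify each one before commenting.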
|
| 176 |
-
|
| 177 |
-
### Causal Unlock Chain (Task 6)
|
| 178 |
-
|
| 179 |
-
Finding certain issues appends new context hints to the observation, modelling real investigations where one discovery leads to a deeper one:
|
| 180 |
|
| 181 |
-
|
| 182 |
-
Find hardcoded JWT secret → DB schema revealed → agent can reason: forge token → privilege escalation
|
| 183 |
-
Find missing rate-limit → nginx config shown → confirms /auth fully exposed with no IP filtering
|
| 184 |
-
```
|
| 185 |
|
| 186 |
-
|
| 187 |
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
| `step(action)` | `(ProbeObservation, RewardType, bool, dict)` | Executes action; returns obs, structured reward, done flag, info dict |
|
| 192 |
-
| `state` (sync property) | `State(episode_id, step_count)` | Lightweight snapshot for `create_app` |
|
| 193 |
-
| `async_state()` | `dict` | Full async snapshot with all episode fields |
|
| 194 |
|
| 195 |
-
-
|
| 196 |
|
| 197 |
-
##
|
| 198 |
|
| 199 |
```bash
|
| 200 |
-
# 1. Install all dependencies
|
| 201 |
uv sync
|
| 202 |
-
|
| 203 |
-
# 2. Start the server + frontend in one command
|
| 204 |
uv run python run.py
|
| 205 |
-
|
| 206 |
-
# The terminal will print:
|
| 207 |
-
# ==========================================================
|
| 208 |
-
# PRobe — AI Code Review Training Environment
|
| 209 |
-
# ==========================================================
|
| 210 |
-
# Frontend → http://localhost:8000/ui/
|
| 211 |
-
# API docs → http://localhost:8000/docs
|
| 212 |
-
# WebSocket → ws://localhost:8000/ws
|
| 213 |
-
# ==========================================================
|
| 214 |
-
|
| 215 |
-
# 3. Open your browser
|
| 216 |
-
open http://localhost:8000/ui/
|
| 217 |
-
|
| 218 |
-
# Run zero-shot GPT-4o-mini baseline (requires OPENAI_API_KEY)
|
| 219 |
-
export OPENAI_API_KEY=sk-...
|
| 220 |
-
uv run python training/baseline.py
|
| 221 |
-
|
| 222 |
-
# Smoke-test reward function (no GPU, no API key)
|
| 223 |
-
uv run python training/train_grpo.py --test
|
| 224 |
```
|
| 225 |
|
| 226 |
-
|
| 227 |
-
|
| 228 |
-
## Interactive Frontend Dashboard
|
| 229 |
|
| 230 |
-
|
| 231 |
-
No npm, no build step — just start the server and open your browser.
|
| 232 |
|
| 233 |
-
|
| 234 |
|
| 235 |
-
|
| 236 |
-
+----------------------------------------------------------------------------+
|
| 237 |
-
| 🔍 PRobe Adversarial Code Review — RL Training Environment |
|
| 238 |
-
| 🟢 Connected [New Ep] |
|
| 239 |
-
+------------------------------+-------------------+-------------------------+
|
| 240 |
-
| Task 2 — auth.py | Actions | Reward Dashboard |
|
| 241 |
-
| medium • Step 3 / 20 | | |
|
| 242 |
-
| | 💬 Add Comment | O +0.24 |
|
| 243 |
-
| ⚠️ External contributor, | .------------. | cumulative |
|
| 244 |
-
| no prior commit history | | Line: [12] | | |
|
| 245 |
-
| | | Comment: | | Issue credit ##... |
|
| 246 |
-
| Review this auth module. | | SQL inject.| | Classif. #.... |
|
| 247 |
-
| Identify bugs and decide | | Severity: | | FP penalty ..... |
|
| 248 |
-
| whether to escalate or | | [critical] | | Coverage ###.. |
|
| 249 |
-
| request changes. | | Category: | | Decision ####. |
|
| 250 |
-
| | | [security] | | Efficiency ##... |
|
| 251 |
-
| .-- auth.py -------------. | .------------. | |
|
| 252 |
-
| | 1: import hashlib | | [Submit Comment] | Issues Found |
|
| 253 |
-
| | 2: | | | ######.... 2 / 5 |
|
| 254 |
-
| | 3: DB_PASS = 's3cr' | | Quick Actions | |
|
| 255 |
-
| |12: cursor.execute( | | [Get Context] | Episode History |
|
| 256 |
-
| | f"SELECT * FROM | | [Run Scanner] | .---------------. |
|
| 257 |
-
| | users WHERE | | --------------- | |ADD_COMMENT+0.12| |
|
| 258 |
-
| |13: username='{u}'" | | [Req Changes] | |sql inj. L12 | |
|
| 259 |
-
| |14: ) | | [Approve PR] | |---------------| |
|
| 260 |
-
| '------------------------' | [Submit Review] | |RUN_SCANNER 0.00| |
|
| 261 |
-
| | [Escalate!] | '---------------' |
|
| 262 |
-
+------------------------------+-------------------+-------------------------+
|
| 263 |
-
```
|
| 264 |
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
**
|
| 268 |
-
- Full source code with **line numbers** for every episode
|
| 269 |
-
- Lines are **colour-coded** as you act:
|
| 270 |
-
- 🔵 Blue — line you just commented on
|
| 271 |
-
- 🟡 Yellow — line flagged by the scanner
|
| 272 |
-
- 🟢 Green — line you probed with Get Context
|
| 273 |
-
- **Unlocked hints** appear below the code as green panels whenever a key issue is found
|
| 274 |
-
- The **adversarial hint** banner tells you whether the PR is from a trusted team member or an unknown external contributor
|
| 275 |
-
|
| 276 |
-
**Centre — Action Panel**
|
| 277 |
-
- **Add Comment** form: line number, free-text comment, severity, category, and bug/backdoor classification
|
| 278 |
-
- **Quick Actions**: single-click buttons for all 7 action types
|
| 279 |
-
|
| 280 |
-
| Button | Action | What Happens |
|
| 281 |
-
|---|---|---|
|
| 282 |
-
| 🔍 Get Context | `get_context` | Reveals ±5 lines around the probed line number |
|
| 283 |
-
| 🤖 Run Scanner | `run_scanner` | Runs the simulated static-analysis tool |
|
| 284 |
-
| 🔄 Request Changes | `request_changes` | Records your review decision |
|
| 285 |
-
| ✅ Approve PR | `approve` | Approves (−0.15 penalty if < 50 % issues found) |
|
| 286 |
-
| 📤 Submit Review | `submit_review` | Ends the episode; triggers terminal scoring |
|
| 287 |
-
| 🚨 Escalate to Security | `escalate_to_security_review` | Correct only on adversarial tasks 7–9 |
|
| 288 |
-
|
| 289 |
-
**Right — Reward Dashboard**
|
| 290 |
-
- **Animated ring** showing cumulative episode reward (green above zero, red below)
|
| 291 |
-
- **Six component bars** updating in real time after every action:
|
| 292 |
-
- Issue credit, Classification credit, FP penalty
|
| 293 |
-
- Coverage bonus, Decision score, Efficiency bonus
|
| 294 |
-
- **Issues progress bar** showing how many ground-truth issues you have found
|
| 295 |
-
- **Episode history feed** — every action with its reward delta and explanation
|
| 296 |
-
|
| 297 |
-
### Episode End Modal
|
| 298 |
-
|
| 299 |
-
When the episode terminates (via Submit Review or Escalate), a modal pops up showing:
|
| 300 |
|
| 301 |
-
|
| 302 |
-
🏆 Episode Passed!
|
| 303 |
|
| 304 |
-
|
| 305 |
-
|
| 306 |
|
| 307 |
-
|
| 308 |
-
│ Cumulative reward +0.874 │
|
| 309 |
-
│ Issues found 5 / 5 │
|
| 310 |
-
│ Steps used 18 / 25 │
|
| 311 |
-
│ Decision escalate │
|
| 312 |
-
│ Escalation required Yes │
|
| 313 |
-
└───────────────────────────────────┘
|
| 314 |
|
| 315 |
-
|
| 316 |
-
```
|
| 317 |
|
| 318 |
-
|
| 319 |
|
| 320 |
-
###
|
| 321 |
|
| 322 |
```bash
|
| 323 |
-
|
| 324 |
-
uv sync
|
| 325 |
-
|
| 326 |
-
# Start the server — this also serves the frontend
|
| 327 |
-
uv run python run.py
|
| 328 |
```
|
| 329 |
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
**Optional flags:**
|
| 333 |
|
| 334 |
```bash
|
| 335 |
-
|
| 336 |
-
uv run python run.py --port 9000
|
| 337 |
-
|
| 338 |
-
# Bind to localhost only (do not expose on the network)
|
| 339 |
-
uv run python run.py --host 127.0.0.1
|
| 340 |
-
|
| 341 |
-
# Dev mode: auto-reload Python files on save
|
| 342 |
-
uv run python run.py --reload
|
| 343 |
```
|
| 344 |
|
| 345 |
-
###
|
| 346 |
-
|
| 347 |
-
The browser communicates with the backend over a **persistent WebSocket** at `ws://localhost:8000/ws`.
|
| 348 |
-
Each browser tab gets its own isolated environment instance — concurrent sessions do not share state.
|
| 349 |
-
The WebSocket URL is auto-detected from `window.location.hostname` so the UI works on any host or port without editing any file.
|
| 350 |
-
|
| 351 |
-
### Why a Frontend Helps the Story
|
| 352 |
-
|
| 353 |
-
| Without Frontend | With Frontend |
|
| 354 |
-
|---|---|
|
| 355 |
-
| `total=0.345` in a log file | Animated reward ring filling green in real time |
|
| 356 |
-
| `issues_found: ['sql_injection']` | Line 12 highlighted blue in the code viewer |
|
| 357 |
-
| `decision: escalate_to_security_review` | 🚨 Escalate button, modal with final score and stats |
|
| 358 |
-
| Understanding the anti-exploit rule | Watching a keyword-spam comment score −0.05 FP penalty |
|
| 359 |
-
| Explaining the causal chain mechanic | Green hint panel appearing after finding the JWT issue |
|
| 360 |
-
|
| 361 |
-
The dashboard makes the reward signal **tangible** — a visitor can play one episode in two minutes and immediately understand what makes PRobe different from a linter.
|
| 362 |
-
|
| 363 |
-
---
|
| 364 |
-
|
| 365 |
-
## Training
|
| 366 |
-
|
| 367 |
-
| | |
|
| 368 |
-
|---|---|
|
| 369 |
-
| **Training script** | [`training/train_grpo.py`](training/train_grpo.py) |
|
| 370 |
-
| **Notebook** | [](https://colab.research.google.com/drive/FILL_COLAB_LINK) — replace with your Colab link |
|
| 371 |
-
| **Model** | `Qwen/Qwen2.5-1.5B-Instruct` (default) — swap via `--model` flag |
|
| 372 |
-
| **Algorithm** | GRPO via HuggingFace TRL + optional Unsloth 4-bit LoRA |
|
| 373 |
-
| **Hardware** | Single T4 GPU (Kaggle free tier) or A10/A100 (Colab Pro) |
|
| 374 |
-
| **Training time** | ~3 hours for 200 steps on T4 with Unsloth 4-bit |
|
| 375 |
-
| **Curriculum** | 5 phases: ultra-easy → easy → medium/hard → causal chain → adversarial |
|
| 376 |
|
| 377 |
```bash
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
|
| 383 |
-
|
| 384 |
-
|
| 385 |
-
|
| 386 |
-
|
| 387 |
-
# Resume from checkpoint
|
| 388 |
-
uv run python training/train_grpo.py --model Qwen/Qwen2.5-1.5B-Instruct --resume-from ./outputs/checkpoint-80
|
| 389 |
-
|
| 390 |
-
# Smoke-test reward function only (no GPU, no model download, < 5 seconds)
|
| 391 |
-
uv run python training/train_grpo.py --test
|
| 392 |
```
|
| 393 |
|
| 394 |
-
|
| 395 |
-
|
| 396 |
-
## Results
|
| 397 |
|
| 398 |
-
|
| 399 |
-
|
| 400 |
-
|
| 401 |
-
|
| 402 |
-
|
| 403 |
-
|
| 404 |
-
|
| 405 |
-

|
| 406 |
|
| 407 |
-
|
| 408 |
-
require correct `intentional_backdoor` classification AND `escalate_to_security_review` —
|
| 409 |
-
only the oracle achieves this.*
|
| 410 |
|
| 411 |
-
|
| 412 |
|
| 413 |
-
|
| 414 |
|
| 415 |
-
|
| 416 |
-
Credit because they contain deliberate backdoors requiring `intentional_backdoor` labelling.
|
| 417 |
-
No other agent earns classification credit — it requires correctly locating the issue AND
|
| 418 |
-
understanding the attacker's intent.*
|
| 419 |
|
| 420 |
-
|
| 421 |
|
| 422 |
-
|
| 423 |
-
|---|---|---|---|---|
|
| 424 |
-
| **Perfect Oracle** | **+0.778** | 0.000 | +1.000 | Upper bound — reads ground truth, constructs ideal output |
|
| 425 |
-
| Line Flooder | −0.025 | −0.130 | +0.020 | Comments on every 5th line with vague generic text |
|
| 426 |
-
| Keyword Spammer | −0.075 | −0.180 | −0.030 | Dumps all security keywords at a deliberately wrong line |
|
| 427 |
-
| Random Agent | −0.260 | −0.480 | −0.080 | Random lines, random categories, random terminal actions |
|
| 428 |
|
| 429 |
-
|
| 430 |
-
An agent *must* read the code to earn positive reward — keyword spam alone scores negative.
|
| 431 |
|
| 432 |
-
|
| 433 |
|
| 434 |
-
|
| 435 |
-
|---|---|---|
|
| 436 |
-
| All keywords, wrong line (9999) | −0.075 | `line_hit` fails — no issue within ±2 lines of 9999 |
|
| 437 |
-
| Correct line, 8-char comment | 0.00 | `substantive` fails — comment < 15 characters |
|
| 438 |
-
| Correct keyword + line, no classification | partial | Earns `issue_credit` but misses `classification_credit` |
|
| 439 |
-
| Guess `escalate` on non-adversarial task | −0.15 | `decision_score` = −DECISION_REWARD (false alarm penalty) |
|
| 440 |
-
| Guess `request_changes` on adversarial task | −0.15 | Wrong terminal action even if issues are found |
|
| 441 |
|
| 442 |
-
|
| 443 |
|
| 444 |
-
|
| 445 |
-
|
| 446 |
-
|
| 447 |
-
| T1 | easy | +0.800 | Clean logic bugs |
|
| 448 |
-
| T2 | medium | +0.795 | SQL injection, eval(), hardcoded creds |
|
| 449 |
-
| T3 | hard | +0.000 | Known edge case: mutation shifts lines beyond oracle's fixed midpoint |
|
| 450 |
-
| T4 | medium | +0.795 | Async concurrency bugs |
|
| 451 |
-
| T5 | hard | +0.800 | Command injection, path traversal, pickle RCE |
|
| 452 |
-
| T6 | hard | +0.803 | Causal chain — JWT forgery → privilege escalation |
|
| 453 |
-
| T7 | adversarial | **+1.000** | Perfect: both backdoors found + escalated + classified |
|
| 454 |
-
| T8 | adversarial | **+1.000** | Compound JWT backdoor chain |
|
| 455 |
-
| T9 | adversarial | **+1.000** | Supply-chain import hook — 3 coordinated exfiltration components |
|
| 456 |
|
| 457 |
-
|
| 458 |
|
| 459 |
-
|
| 460 |
-
|
| 461 |
|
| 462 |

|
| 463 |
|
| 464 |
-
|
| 465 |
-
curriculum). Smoothed curve (window = 10) over raw per-episode values.*
|
| 466 |
-
|
| 467 |
-
### Before vs. After Training
|
| 468 |
-
|
| 469 |
-
| Metric | Random Baseline | Scripted Oracle (ceiling) | Trained Agent |
|
| 470 |
-
|---|---|---|---|
|
| 471 |
-
| Average episode reward | −0.260 | +0.778 | *run training to populate* |
|
| 472 |
-
| Issue detection rate | 0 % | 100 % | *run training to populate* |
|
| 473 |
-
| False positive rate | high | 0 % | *run training to populate* |
|
| 474 |
-
| Backdoor classification accuracy | 0 % | 100 % | *run training to populate* |
|
| 475 |
-
| Escalation recall (adversarial tasks 7–9) | ~33 % (random) | 100 % | *run training to populate* |
|
| 476 |
-
|
| 477 |
-
> Random baseline (−0.260) and oracle ceiling (+0.778) are **real measured numbers** from
|
| 478 |
-
> `outputs/scripted_baseline.jsonl`. Replace the *"run training to populate"* cells with your
|
| 479 |
-
> actual numbers after training completes.
|
| 480 |
-
|
| 481 |
-
---
|
| 482 |
-
|
| 483 |
-
## Why This Matters
|
| 484 |
-
|
| 485 |
-
Security code review is a high-stakes task performed by a small number of specialists — it does not scale to the volume of code that modern teams ship. An agent that can reliably read a PR, flag bugs with accurate line references, distinguish honest mistakes from deliberate backdoors (the XZ Utils and SolarWinds failure mode), and escalate with justification would directly accelerate secure software delivery for any team using AI-assisted development. This is also a largely unexplored domain for RL: existing code benchmarks reward *generating* correct outputs, not *critically evaluating* someone else's work, leaving the oversight and adversarial-detection capabilities of LLMs essentially untrained.
|
| 486 |
-
|
| 487 |
-
---
|
| 488 |
-
|
| 489 |
-
## Links
|
| 490 |
-
|
| 491 |
-
> **Before submitting:** replace every placeholder below with a real URL.
|
| 492 |
-
> Judges follow these links directly — missing links are a non-negotiable submission disadvantage.
|
| 493 |
-
|
| 494 |
-
| Resource | URL |
|
| 495 |
-
|---|---|
|
| 496 |
-
| 🤗 HuggingFace Space (live environment) | Replace with your HF Space URL |
|
| 497 |
-
| 📓 Training notebook (Colab / Kaggle) | Replace with your Colab or Kaggle link |
|
| 498 |
-
| 📝 Mini-blog / writeup (HuggingFace) | Replace with your HF blog post URL |
|
| 499 |
-
| 🎥 Demo video (YouTube, < 2 min) | Replace with your YouTube URL |
|
| 500 |
-
| 📊 Slides / presentation | Replace with your slides URL |
|
| 501 |
-
| 📈 WandB training run | Replace with your WandB run URL |
|
| 502 |
|
| 503 |
---
|
| 504 |
|
|
|
|
| 12 |
- code-review
|
| 13 |
- rl-training
|
| 14 |
- grpo
|
| 15 |
- probe
|
| 16 |
---
|
| 17 |
|
| 18 |
+
# PRobe — an AI code reviewer that can spot backdoors
|
| 19 |
|
| 20 |
+
## Submission links (judge quick access)
|
| 21 |
|
| 22 |
+
[](https://colab.research.google.com/drive/FILL_COLAB_LINK)
|
| 23 |
|
| 24 |
+
> Replace each placeholder below with a real URL before submission.
|
| 25 |
|
| 26 |
+
| Resource | URL |
|
| 27 |
|---|---|
|
| 28 |
+
| 🤗 HuggingFace Space (live environment) | Replace with your HF Space URL |
|
| 29 |
+
| 📓 Training notebook (Colab / Kaggle) | Replace with your Colab or Kaggle link |
|
| 30 |
+
| 📝 Mini-blog / writeup (HuggingFace) | Replace with your HF blog post URL |
|
| 31 |
+
| 🎥 Demo video (YouTube, < 2 min) | Replace with your YouTube URL |
|
| 32 |
+
| 📊 Slides / presentation | Replace with your slides URL |
|
| 33 |
+
| 📈 WandB training run | Replace with your WandB run URL |
|
| 34 |
|
| 35 |
+
## TL;DR
|
| 36 |
|
| 37 |
+
PRobe is a training environment where an AI learns to **review Python code like a careful security engineer**:
|
| 38 |
|
| 39 |
+
- Find real bugs and security issues (with correct line numbers)
|
| 40 |
+
- Tell the difference between an honest mistake and a deliberate backdoor
|
| 41 |
+
- Decide whether to **approve**, **request changes**, or **escalate to security**
|
| 42 |
|
| 43 |
+
PRobe uses a **deterministic reward** (no “LLM judge”): keyword spam on random lines is penalized, while accurate findings on the right lines score high.
|
| 44 |
|
| 45 |
+
## Try it in 60 seconds
|
| 46 |
|
| 47 |
```bash
|
| 48 |
uv sync
|
| 49 |
uv run python run.py
|
| 50 |
```
|
| 51 |
|
| 52 |
+
Then open `http://localhost:8000/ui/` and click **New Episode**.
|
| 53 |
|
| 54 |
+
## Why it exists (simple version)
|
| 55 |
|
| 56 |
+
Real supply-chain attacks (like XZ Utils / SolarWinds) often look like normal code changes. A useful AI reviewer must do more than “scan” — it must **investigate intent** and know when to escalate.
|
| 57 |
|
| 58 |
+
## What’s novel (in plain English)
|
| 59 |
|
| 60 |
+
- **No LLM judge**: reward is deterministic and reproducible.
|
| 61 |
+
- **Anti-gaming**: keyword spam on random lines gets penalized.
|
| 62 |
+
- **Backdoor escalation**: some tasks require choosing “escalate to security”, not just listing bugs.
|
| 63 |
|
| 64 |
+
## What’s inside (high level)
|
| 65 |
|
| 66 |
+
- **10 tasks** that simulate real review situations (bugs + adversarial backdoors)
|
| 67 |
+
- A **mutator** that changes variable names/line numbers so the model can’t memorize answers
|
| 68 |
+
- A **grader** that scores each comment on “right keyword + right line + substantive text”, plus the final approve / request-changes / escalate decision
|
| 69 |
+
- A lightweight **web UI** so anyone can try an episode in the browser
|
| 70 |
|
| 71 |
+
If you want the full technical design, see `docs/design.md`.
|
| 72 |
|
| 73 |
+
## Training (GRPO)
|
| 74 |
|
| 75 |
+
The training entrypoint is `training/train_grpo.py`.
|
| 76 |
|
| 77 |
+
### Install training dependencies
|
| 78 |
|
| 79 |
```bash
|
| 80 |
+
pip install -e ".[training]"
|
| 81 |
```
|
| 82 |
|
| 83 |
+
### Smoke test (no GPU, no model download)
|
| 84 |
|
| 85 |
```bash
|
| 86 |
+
python training/train_grpo.py --test
|
| 87 |
```
|
| 88 |
|
| 89 |
+
### Train (example)
|
| 90 |
|
| 91 |
```bash
|
| 92 |
+
python training/train_grpo.py \
|
| 93 |
+
--model Qwen/Qwen2.5-1.5B-Instruct \
|
| 94 |
+
--steps 200 \
|
| 95 |
+
--group-size 2 \
|
| 96 |
+
--batch-size 2 \
|
| 97 |
+
--grad-accum 1 \
|
| 98 |
+
--max-seq-len 1024 \
|
| 99 |
+
--max-completion-len 128 \
|
| 100 |
+
--save-steps 50
|
| 101 |
```
|
| 102 |
|
| 103 |
+
### Resume from a checkpoint
|
| 104 |
|
| 105 |
+
```bash
|
| 106 |
+
python training/train_grpo.py \
|
| 107 |
+
--model Qwen/Qwen2.5-1.5B-Instruct \
|
| 108 |
+
--steps 200 \
|
| 109 |
+
--resume-from outputs/checkpoint-100
|
| 110 |
+
```
|
| 111 |
|
| 112 |
+
### Reproduce our run (copy/paste template)
|
| 113 |
|
| 114 |
+
Fill these before submission:
|
| 115 |
|
| 116 |
+
- **Hardware**: (T4 / A100 / …)
|
| 117 |
+
- **Steps**: (100 / 200)
|
| 118 |
+
- **Runtime**: (~__ minutes)
|
| 119 |
|
| 120 |
+
Example command (200 steps, checkpoints every 50 steps):
|
| 121 |
|
| 122 |
+
```bash
|
| 123 |
+
python training/train_grpo.py \
|
| 124 |
+
--model Qwen/Qwen2.5-1.5B-Instruct \
|
| 125 |
+
--steps 200 \
|
| 126 |
+
--group-size 2 \
|
| 127 |
+
--batch-size 2 \
|
| 128 |
+
--grad-accum 1 \
|
| 129 |
+
--max-seq-len 1024 \
|
| 130 |
+
--max-completion-len 128 \
|
| 131 |
+
--save-steps 50 \
|
| 132 |
+
--output-dir outputs
|
| 133 |
+
```
|
| 134 |
|
| 135 |
+
## Outputs
|
| 136 |
|
| 137 |
+
Training writes artifacts under `outputs/` (or your `--output-dir`), including:
|
| 138 |
|
| 139 |
+
- Checkpoints: `checkpoint-*`
|
| 140 |
+
- Curves: `training_curves.png`, `per_task_reward.png`
|
| 141 |
+
- Demo traces (adversarial tasks): `demo/before_task*.json`, `demo/after_task*.json`
|
| 142 |
|
| 143 |
+
## Before vs. after training (images)
|
| 144 |
|
| 145 |
+
Fill these before submission (numbers judges can scan fast):
|
| 146 |
|
| 147 |
+
- **Mean reward**: before __ → after __
|
| 148 |
+
- **Escalation recall (tasks 7–9)**: before __ → after __
|
| 149 |
+
- **False positives per episode**: before __ → after __
|
| 150 |
|
| 151 |
+
After training, these images are written to `outputs/` and help show improvement:
|
| 152 |
|
| 153 |
+
- `outputs/training_curves.png` (reward / loss over steps)
|
| 154 |
+
- `outputs/per_task_reward.png` (per-task reward before vs after)
|
| 155 |
|
| 156 |

|
| 157 |
|
| 158 |
+

|
| 159 |
|
| 160 |
---
|
| 161 |
|
uv.lock
CHANGED
|
The diff for this file is too large to render.