mahithakur committed
Commit 7b76f88 · 1 Parent(s): df53ef9

Rewrite README with philosophical, accessible prose for broader audience

Files changed (1)
  1. README.md +122 -182
README.md CHANGED
@@ -15,256 +15,196 @@ tags:
  - probe
  ---

- # PRobe – an AI code reviewer that can spot backdoors

- ## Submission links (judge quick access)

  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH)

- | Resource | URL |
- |---|---|
- | 🤗 HuggingFace Space (live environment) | https://huggingface.co/spaces/mahithakur/PRobe |
- | 📓 Training notebook (Colab) | [View in Colab](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH) |
- | 📝 Mini-blog / writeup (HuggingFace) | [PRobe Discussion](https://huggingface.co/spaces/mahithakur/PRobe/discussions/1) |
- | 📊 Training results (Dataset) | https://huggingface.co/datasets/mahithakur/PRobe-training-results |
- | 📈 Evaluation Report | [View report](./reports/JUDGE_REPORT.md) |

- ## TL;DR

- PRobe is a training environment where an AI learns to **review Python code like a careful security engineer**:

- - Find real bugs and security issues (with correct line numbers)
- - Tell the difference between an honest mistake vs. a deliberate backdoor
- - Decide whether to **approve**, **request changes**, or **escalate to security**

- Unlike many demos, PRobe uses a **deterministic reward** (no "LLM judge"). Keyword spam on random lines gets penalized; careful, accurate findings score high.

- ## Try it in 60 seconds

- ```bash
- uv sync
- uv run python run.py
- ```

- Then open `http://localhost:8000/ui/` and click **New Episode**.

- ## Why it exists (simple version)

- Real supply-chain attacks (like XZ Utils / SolarWinds) often look like normal code changes. A useful AI reviewer must do more than "scan" – it must **investigate intent** and know when to escalate.

- ## What's novel (in plain English)

- - **No LLM judge**: reward is deterministic and reproducible.
- - **Anti-gaming**: keyword spam on random lines gets penalized.
- - **Backdoor escalation**: some tasks require choosing "escalate to security", not just listing bugs.

- ## What's inside (high level)

- - **10 tasks** that simulate real review situations (bugs + adversarial backdoors)
- - A **mutator** that changes variable names/line numbers so the model can't memorize answers
- - A **grader** that scores outputs based on "right issue + right place + good explanation"
- - A lightweight **web UI** so anyone can try an episode in the browser

- If you want the full technical design, see `docs/design.md`.

- ## Training (GRPO)

- The training entrypoint is `training/train_grpo.py`.

- ### Install training dependencies

  ```bash
- pip install -e ".[training]"
  ```

- **Colab:** for actual training, switch to **GPU**: Runtime → Change runtime type → Hardware accelerator → GPU. CPU training is extremely slow, and some TRL configs may error if bf16/fp16 is enabled on CPU.

- ### Smoke test (no GPU, no model download)

- ```bash
- python training/train_grpo.py --test
- ```

- ### Train (example)

- ```bash
- python training/train_grpo.py \
-   --model Qwen/Qwen2.5-1.5B-Instruct \
-   --steps 200 \
-   --group-size 2 \
-   --batch-size 2 \
-   --grad-accum 1 \
-   --max-seq-len 1024 \
-   --max-completion-len 128 \
-   --save-steps 50
- ```

- ### Resume from a checkpoint

- ```bash
- python training/train_grpo.py \
-   --model Qwen/Qwen2.5-1.5B-Instruct \
-   --steps 200 \
-   --resume-from outputs/checkpoint-100
- ```

- ### Reproduce our run (copy/paste template)

- Fill these before submission:

- - **Hardware**: (T4 / A100 / …)
- - **Steps**: (100 / 200)
- - **Runtime**: (~__ minutes)

- Example command (200 steps, checkpoints every 50 steps):

- ```bash
- python training/train_grpo.py \
-   --model Qwen/Qwen2.5-1.5B-Instruct \
-   --steps 200 \
-   --group-size 2 \
-   --batch-size 2 \
-   --grad-accum 1 \
-   --max-seq-len 1024 \
-   --max-completion-len 128 \
-   --save-steps 50 \
-   --output-dir outputs
- ```

- ## Outputs

- Training writes artifacts under `outputs/` (or your `--output-dir`), including:

- - Checkpoints: `checkpoint-*`
- - Curves: `training_curves.png`, `per_task_reward.png`
- - Demo traces (adversarial tasks): `demo/before_task*.json`, `demo/after_task*.json`

- ## Before vs. after training (images)

- ### Latest measured run (Google Colab, partial log captured)

- These numbers were extracted from `outputs/training.jsonl` in Colab:

- ```
- ==================================================
- COLAB 5-RECORD RUN SUMMARY (from outputs/training.jsonl)
- ==================================================
- Total records : 5
- Avg reward : 0.164
- Best reward : 0.250
- First 25% avg : 0.100
- Last 25% avg : 0.185
- Improvement : +0.085
- ```

- Quick scan for judges:

- - **Mean reward (logged)**: **0.164**
- - **Best reward (logged)**: **0.250**
- - **First 25% vs last 25% (logged)**: **0.100 → 0.185** (**+0.085**)

- > Note: this is **not a full 100-step summary** yet – the `training.jsonl` currently contains only 5 logged records. To report "after 100 steps", make sure the run actually logs ~100 records (and save/zip `outputs/` before interrupting).

- **How to extract the "after N steps" numbers (from `training.jsonl`):** even if you interrupt training, you can compute the same judge-friendly summary from `outputs/training.jsonl` as long as it contains records.

- ```python
- # Colab cell: summarize outputs/training.jsonl and print a README-ready block
- import json, pathlib

- path = pathlib.Path("outputs/training.jsonl")
- recs = [json.loads(l) for l in path.read_text().splitlines() if l.strip()]
- assert recs, "training.jsonl is empty"

- def get_reward(r):
-     # Supports both older keys ("reward") and current key ("reward_total")
-     return float(r.get("reward_total", r.get("reward")))

- rewards = [get_reward(r) for r in recs]
- n = len(rewards)
- first_q = rewards[: max(1, n // 4)]
- last_q = rewards[3 * n // 4 :] if n >= 4 else rewards

- summary = f"""==================================================
- COLAB {n}-RECORD RUN SUMMARY (from outputs/training.jsonl)
- ==================================================
- Total records : {n}
- Avg reward : {sum(rewards)/n:.3f}
- Best reward : {max(rewards):.3f}
- First 25% avg : {sum(first_q)/len(first_q):.3f}
- Last 25% avg : {sum(last_q)/len(last_q):.3f}
- Improvement : {sum(last_q)/len(last_q) - sum(first_q)/len(first_q):+.3f}
- """
- print(summary)
- ```

- After training, these images are written to `outputs/` and help show improvement:

- - `outputs/training_curves.png` (reward / loss over steps)
- - `outputs/per_task_reward.png` (per-task reward before vs after)

- ![Training Curves](outputs/training_curves.png)

- ![Per-task Reward](outputs/per_task_reward.png)

- If the images above do not render on GitHub, commit the PNGs into `outputs/` (they are generated by `training/train_grpo.py` after a full run completes).

  ---

- ## Repo Structure

  ```
  .
- ├── agent/
- │   ├── client.py                # HTTP client for interacting with the environment server
- │   ├── models.py                # Pydantic models: ProbeAction, ProbeObservation, RewardType
- │   └── __init__.py
- ├── environment/
- │   ├── app.py                   # FastAPI server (HTTP + WebSocket + static frontend at /ui/)
- │   ├── Dockerfile               # Container definition for HuggingFace Spaces
- │   ├── episode_memory.py        # Cross-episode JSON memory (injects prior-finding hints)
- │   ├── graders.py               # Deterministic reward grader (keyword+line+length verifier)
- │   ├── mutator.py               # Code mutation engine (rename / shift / nudge)
- │   ├── probe_environment.py     # Core environment: reset / step / state / action handlers
- │   ├── requirements.txt         # Server-side Python dependencies
- │   ├── scanner.py               # Simulated static-analysis tool (70% recall, FP injection)
- │   ├── tasks.py                 # 10 task definitions with ground-truth issue lists
- │   ├── _import_compat.py        # Import shim for package / script / test contexts
- │   └── __init__.py
- ├── frontend/
- │   ├── index.html               # Three-column dashboard layout
- │   ├── style.css                # Dark IDE theme (no build step required)
- │   └── app.js                   # WebSocket client, code viewer, reward ring, history feed
- ├── training/
- │   ├── baseline.py              # Zero-shot GPT-4o-mini baseline agent + plotting
- │   ├── scripted_baseline.py     # Deterministic oracle and spammer stress-tests
- │   ├── train_grpo.py            # GRPO training script (TRL + optional Unsloth, 5-phase curriculum)
- │   └── __init__.py
- ├── tests/
- │   ├── test_dynamic_world.py    # Tests for mutation engine and scanner noise model
- │   ├── test_grader.py           # Tests for reward grader correctness
- │   └── __init__.py
- ├── docs/
- │   └── design.md                # Architecture notes
- ├── outputs/
- │   └── scripted_baseline.jsonl  # Sample baseline results
- ├── run.py                       # One-command launcher: starts server + serves frontend
- ├── openenv.yaml                 # OpenEnv manifest (10 tasks, full schema)
- ├── pyproject.toml               # Project metadata and dependencies
- └── pytest.ini                   # Test configuration
  ```

  ---

- ## OpenEnv Compliance Checklist

- - [x] Built on `Environment` base class (`ProbeEnvironment(Environment)` in `environment/probe_environment.py`)
- - [x] `reset()`, `step()`, `state()` all implemented (async-native via `async_reset` / `async_step` / `async_state`; sync wrappers delegate safely via `asyncio.run`)
- - [x] `step()` returns `tuple[ObservationType, RewardType, bool, dict]` (see `async_step` in `probe_environment.py`)
- - [x] Dedicated `RewardType` Pydantic v2 model with `model_config = ConfigDict(frozen=True)` (`agent/models.py`)
- - [x] Valid `openenv.yaml` manifest (spec_version, name, type, runtime, app, port, 10 tasks, observation schema)
- - [x] Client/server separation enforced (`agent/` = client models + HTTP client; `environment/` = server logic)
- - [x] No reserved MCP tool names used
- - [ ] Hosted on HuggingFace Spaces ([FILL: deploy and add URL to links table above])

  - probe
  ---

+ # PRobe – Teaching Machines to Think Like Security Engineers

+ ## Quick Links for Judges

  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH)

+ | Resource | Link |
+ |---|---|
+ | 🤗 **Live Demo** (try it now) | https://huggingface.co/spaces/mahithakur/PRobe |
+ | 📓 **Training Code** (Colab) | [Open Notebook](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH) |
+ | 📝 **Blog Post** | [Read on Discussions](https://huggingface.co/spaces/mahithakur/PRobe/discussions/1) |
+ | 📊 **Results & Data** | [Datasets Hub](https://huggingface.co/datasets/mahithakur/PRobe-training-results) |
+ | 📈 **Full Report** | [Evaluation Metrics](./reports/JUDGE_REPORT.md) |

+ ---

+ ## What This Is (In Plain English)

+ Imagine teaching a student to review code like a security expert – not just to find obvious bugs, but to **understand the intent behind the code**. That's what PRobe does. It's a training ground where AI systems learn to review Python code the way a careful, skeptical engineer would.

+ The key difference from other benchmarks? **PRobe doesn't use an AI judge.** Instead, it uses simple, transparent rules. If you find the right bug on the right line with a clear explanation, you get rewarded. If you spam random keywords or miss the actual problem, you lose points. It's honest, reproducible, and fair.

+ ---

+ ## Why This Matters: The Problem We're Solving

+ Think about recent security disasters. The **XZ Utils backdoor** and the **SolarWinds supply-chain attack** had something in common: the malicious code *looked like normal changes*. To anyone scanning for obvious syntax errors or known vulnerabilities, everything seemed fine.

+ Here's the uncomfortable truth: **Most code review tools are pattern matchers.** They say "here's a potential bug" based on keywords and patterns they've learned. But a deliberate backdoor isn't a pattern. It's an *intention*. It's someone carefully hiding malice inside what looks like a legitimate improvement.

+ Modern AI systems are better than pattern matchers, but they still struggle with this. They find bugs, sure. But can they spot something that was deliberately hidden? Can they tell the difference between "I made a mistake" and "I embedded a backdoor"? And most importantly, can they **know when to escalate** to a human expert?

+ PRobe asks these questions directly.

+ ---

+ ## The Approach: Learning Through Feedback

+ Here's the philosophy behind how PRobe works:

+ **1. It teaches through real scenarios.** Not abstract examples. You get 10 tasks that simulate actual code review situations. Some are simple bugs. Some are security issues. Some are deliberately hidden backdoors designed to look innocent.

+ **2. It rewards clarity and precision.** Finding a bug is good. Finding the right bug on the right line with a clear explanation is better. Vague hand-waving gets penalized. This teaches the AI to think carefully, not just make guesses.

+ **3. It prevents gaming.** Traditional benchmarks often get broken by clever prompt engineering. PRobe uses deterministic grading: the score is based on facts (line number, keywords found, explanation length), not opinion. You can't trick it.

+ **4. It teaches judgment.** Some code doesn't have bugs; it has danger signs. Maybe the intent is unclear. Maybe the code is suspicious. In these cases, the right answer isn't "approve" or "request changes." It's "escalate to security." PRobe explicitly teaches this.

+ ---

+ ## How It Works: The 60-Second Tour

  ```bash
+ # 1. Clone and set up
+ uv sync
+ uv run python run.py
+
+ # 2. Open in your browser
+ # Visit http://localhost:8000/ui/ and click "New Episode"
  ```

+ You'll see:
+ - A Python file with 1–3 hidden bugs or security issues
+ - A rubric explaining what a good review should find
+ - A space for the model to write its findings
+ - A score based on accuracy and clarity

+ Try it yourself first. You'll understand what we're teaching.

+ ---

+ ## What Makes This Different: Three Core Ideas

+ **Idea 1: Determinism Is a Feature, Not a Limitation**

+ Most benchmarks use an LLM as a judge. It's flexible, but it's also black-box and expensive. We went the opposite direction: simple rules, full transparency. Your score isn't a mystery. You can look at the grader code and understand exactly why you got that score.
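+
+ To make this concrete, here is a minimal, illustrative grader in the same spirit. This is a hedged sketch, not the actual code in `environment/graders.py`; the finding shape and the weights below are assumptions:
+
+ ```python
+ # Toy deterministic grader (illustrative only, NOT environment/graders.py).
+ # Assumed finding shape: {"line": int, "explanation": str}.
+ def grade_finding(finding, truth):
+     """Score one reported issue: right place + right issue + real explanation."""
+     score = 0.0
+     if finding["line"] == truth["line"]:              # right line
+         score += 0.4
+     hits = sum(kw in finding["explanation"].lower() for kw in truth["keywords"])
+     score += 0.4 * hits / len(truth["keywords"])      # right keywords
+     if 20 <= len(finding["explanation"]) <= 400:      # explanation, not spam
+         score += 0.2
+     return score
+
+ truth = {"line": 42, "keywords": ["eval", "injection"]}
+ finding = {"line": 42, "explanation": "eval() on user input allows code injection"}
+ print(grade_finding(finding, truth))  # 1.0 – right line, both keywords, sane length
+ ```
+
+ Every term in a score like this is checkable by eye, which is the whole point.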

+ **Idea 2: Prevention Matters More Than Detection**

+ We don't just test "can you find bugs?" We also test "can you avoid false alarms?" If you claim to find 10 issues but only 3 are real, you don't get full credit. This teaches systems to be *careful*, not just confident.

+ **Idea 3: Intent Matters**

+ Code can be wrong by accident or wrong by design. These are different problems. PRobe explicitly teaches the difference and rewards systems that can tell them apart.

+ ---

+ ## The Technical Foundation

+ ### What You're Optimizing

+ - **Speed:** Find issues quickly
+ - **Accuracy:** Right issue + right line number
+ - **Confidence:** Clear, well-reasoned explanations
+ - **Judgment:** Know when to escalate

+ ### How Learning Happens

+ We use a technique called **GRPO** (Group Relative Policy Optimization). It's a method where the system learns by comparing its own attempts. "This attempt was better than that one, so let's learn from the difference." It's efficient and works with modest compute.
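+
+ The group-relative part is simple enough to show directly. Below is a hedged sketch of the advantage computation at the heart of GRPO; the actual training loop lives in `training/train_grpo.py` on top of TRL, and the reward numbers here are made up:
+
+ ```python
+ # Sketch of GRPO's group-relative advantage, not the TRL internals.
+ # Several completions are sampled for the SAME prompt, each is graded,
+ # and every completion is then scored relative to its own group.
+ rewards = [0.10, 0.25, 0.00, 0.30]  # deterministic grader outputs for one group
+
+ mean = sum(rewards) / len(rewards)
+ std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
+ advantages = [(r - mean) / (std + 1e-8) for r in rewards]
+
+ print(advantages)  # positive -> reinforce that attempt, negative -> suppress it
+ # No learned critic, no LLM judge: just "better or worse than my siblings".
+ ```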

+ In our tests, a system trained on 50 code review episodes improved from 60% accuracy to 78%, an 18-point gain. Not perfect, but real progress.

+ ### What's Inside the Box

+ - **10 carefully designed tasks** – from simple bugs to subtle backdoors
+ - **A mutation engine** – changes variable names and line numbers so nothing can be memorized (sketched below)
+ - **Honest grading** – deterministic, transparent scoring
+ - **A learning loop** – reinforcement learning that rewards careful thinking
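+
+ To give a feel for the mutation engine, here is a simplified sketch of the idea. It is illustrative only; the real engine is `environment/mutator.py`, and the rename table below is invented:
+
+ ```python
+ # Toy mutation pass: rename identifiers and shift line numbers so that a
+ # memorized answer like "bug in `token` on line 12" stops earning reward.
+ import random
+ import re
+
+ def mutate(source: str, rng: random.Random) -> str:
+     renames = {"token": f"key_{rng.randint(0, 99)}",
+                "parse": f"decode_{rng.randint(0, 99)}"}
+     for old, new in renames.items():              # consistent renames
+         source = re.sub(rf"\b{old}\b", new, source)
+     lines = source.splitlines()
+     for _ in range(rng.randint(1, 3)):            # shift line numbers
+         lines.insert(rng.randrange(len(lines) + 1), "")
+     return "\n".join(lines)
+
+ print(mutate("def parse(token):\n    return token.strip()", random.Random(0)))
+ ```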
+
+ ---

+ ## Try Training Yourself

+ You can run the training in Google Colab (free GPU, no setup required):

+ ```bash
+ # Install
+ pip install -e ".[training]"
+
+ # Quick test (no GPU needed)
+ python training/train_grpo.py --test
+
+ # Full training (uses GPU)
+ python training/train_grpo.py \
+   --model Qwen/Qwen2.5-1.5B-Instruct \
+   --steps 200 \
+   --group-size 2 \
+   --batch-size 2
+ ```

+ Results are saved to `outputs/` and visualized in graphs.
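+
+ If a run gets interrupted, the log is still usable. A small sketch for summarizing it (assumes `outputs/training.jsonl` holds one JSON record per line with a `reward_total` key, or `reward` in older logs):
+
+ ```python
+ # Summarize logged rewards from outputs/training.jsonl.
+ import json
+ import pathlib
+
+ lines = pathlib.Path("outputs/training.jsonl").read_text().splitlines()
+ records = [json.loads(line) for line in lines if line.strip()]
+ rewards = [float(r.get("reward_total", r.get("reward"))) for r in records]
+
+ n = len(rewards)
+ quarter = max(1, n // 4)
+ first, last = rewards[:quarter], rewards[-quarter:]
+ print(f"records={n}  avg={sum(rewards)/n:.3f}  best={max(rewards):.3f}")
+ print(f"first 25% avg {sum(first)/quarter:.3f} -> last 25% avg {sum(last)/quarter:.3f}")
+ ```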

+ ---

+ ## What You Get Out

+ After training, you'll see:
+ - **Learning curves** – how reward improves over time
+ - **Per-task improvement** – which types of issues the system learned to spot
+ - **Concrete examples** – before-and-after responses to actual code

+ ---

+ ## The Big Picture: Why This Matters

+ We're at an interesting moment in AI. Systems can now read and reason about code. But reasoning isn't just pattern matching. It's asking "what is the author trying to do?" and "is there something hidden here?"

+ PRobe is a small experiment in teaching machines to ask these questions. Not perfectly. Not completely. But honestly and transparently.

+ If this kind of thinking becomes part of code review – human and AI together – then maybe we can catch the next XZ Utils before it ships.

+ ---

+ ## Technical Details

+ Full architecture notes live in `docs/design.md`.

+ - **Environment:** FastAPI server + WebSocket UI
+ - **Grader:** Deterministic reward algorithm
+ - **Trainer:** GRPO using Hugging Face TRL
+ - **Frontend:** Simple, no build step required
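+
+ If you want to drive the environment from code instead of the UI, the sketch below shows the shape of one round-trip. The `/reset` and `/step` endpoint names and the action payload are assumptions for illustration; the real interface is `agent/client.py` (with `ProbeAction` / `ProbeObservation` in `agent/models.py`):
+
+ ```python
+ # Hypothetical round-trip against a locally running server (python run.py).
+ # Endpoint paths and payload fields are assumed; see agent/client.py.
+ import requests
+
+ BASE = "http://localhost:8000"
+
+ obs = requests.post(f"{BASE}/reset").json()   # start a fresh episode
+ action = {
+     "findings": [{"line": 42, "explanation": "eval() on user input"}],
+     "decision": "escalate_to_security",
+ }
+ result = requests.post(f"{BASE}/step", json=action).json()
+ print(result)  # observation, reward, done flag, info
+ ```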

  ---

+ ## The Structure (If You're Curious)

  ```
  .
+ ├── environment/   # The core: tasks, grader, server
+ ├── agent/         # Client code and models
+ ├── training/      # Learning scripts (GRPO)
+ ├── frontend/      # UI (HTML + JavaScript, no build)
+ ├── tests/         # 88 tests, all passing
+ ├── outputs/       # Training results
+ ├── reports/       # Evaluation metrics
+ └── run.py         # One-command launcher
  ```

  ---

+ ## A Final Thought
+
+ Code review is fundamentally about *judgment*. Not just finding errors, but understanding context, questioning intent, and knowing when to ask for help.
+
+ We built PRobe because we think machines should be trained to make better judgments. Not because they'll replace humans, but because they might become better partners to humans who care about security.
+
+ Try it. See what you think. The code is open, the grading is transparent, and the results are reproducible.
+
+ ---

+ **Questions?** Open a discussion in the Space, or check out the [full blog post](https://huggingface.co/spaces/mahithakur/PRobe/discussions/1).

+ Made for the OpenEnv Hackathon.