Spaces: mahithakur/PRobe
Commit 7b76f88 · Parent(s): df53ef9
Rewrite README with philosophical, accessible prose for broader audience
README.md
CHANGED
@@ -15,256 +15,196 @@ tags:
- probe
---

- # PRobe

- ##

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH)

- |---|---|
- | 🤗 HuggingFace Space (live environment) | https://huggingface.co/spaces/mahithakur/PRobe |
- | Training notebook (Colab) | [View in Colab](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH) |
- | Mini-blog / writeup (HuggingFace) | [PRobe Discussion](https://huggingface.co/spaces/mahithakur/PRobe/discussions/1) |
- | Training results (Dataset) | https://huggingface.co/datasets/mahithakur/PRobe-training-results |
- | Evaluation Report | [View report](./reports/JUDGE_REPORT.md) |

- - Tell the difference between an honest mistake vs. a deliberate backdoor
- - Decide whether to **approve**, **request changes**, or **escalate to security**

- ```
- uv sync
- uv run python run.py
- ```

- ##

- - **Anti-gaming**: keyword spam on random lines gets penalized.
- - **Backdoor escalation**: some tasks require choosing "escalate to security", not just listing bugs.

- - A **mutator** that changes variable names/line numbers so the model can't memorize answers
- - A **grader** that scores outputs based on "right issue + right place + good explanation"
- - A lightweight **web UI** so anyone can try an episode in the browser

- ##

```bash
```

- ```
- python training/train_grpo.py --test
- ```

- ##

- ```
- python training/train_grpo.py \
-   --model Qwen/Qwen2.5-1.5B-Instruct \
-   --steps 200 \
-   --group-size 2 \
-   --batch-size 2 \
-   --grad-accum 1 \
-   --max-seq-len 1024 \
-   --max-completion-len 128 \
-   --save-steps 50
- ```

- ```
- python training/train_grpo.py \
-   --model Qwen/Qwen2.5-1.5B-Instruct \
-   --steps 200 \
-   --resume-from outputs/checkpoint-100
- ```

- - **Steps**: (100 / 200)
- - **Runtime**: (~__ minutes)

- ```
- python training/train_grpo.py \
-   --model Qwen/Qwen2.5-1.5B-Instruct \
-   --steps 200 \
-   --group-size 2 \
-   --batch-size 2 \
-   --grad-accum 1 \
-   --max-seq-len 1024 \
-   --max-completion-len 128 \
-   --save-steps 50 \
-   --output-dir outputs
- ```

- ##

- - Curves: `training_curves.png`, `per_task_reward.png`
- - Demo traces (adversarial tasks): `demo/before_task*.json`, `demo/after_task*.json`

- - **Best reward (logged)**: **0.250**
- - **First 25% vs last 25% (logged)**: **0.100 → 0.185** (**+0.085**)

- ```python
- import json
- from pathlib import Path
-
- path = Path("outputs/training.jsonl")  # reconstructed from context
- recs = [json.loads(l) for l in path.read_text().splitlines() if l.strip()]
- assert recs, "training.jsonl is empty"
-
- def get_reward(r):  # reconstructed wrapper for the surviving return line
-     # Supports both older keys ("reward") and current key ("reward_total")
-     return float(r.get("reward_total", r.get("reward")))
-
- rewards = [get_reward(r) for r in recs]
- n = len(rewards)
- first_q = rewards[: max(1, n // 4)]
- last_q = rewards[3 * n // 4 :] if n >= 4 else rewards
-
- summary = f"""
- COLAB {n}-RECORD RUN SUMMARY (from outputs/training.jsonl)
- ==================================================
- Total records : {n}
- Avg reward    : {sum(rewards)/n:.3f}
- Best reward   : {max(rewards):.3f}
- First 25% avg : {sum(first_q)/len(first_q):.3f}
- Last 25% avg  : {sum(last_q)/len(last_q):.3f}
- Improvement   : {sum(last_q)/len(last_q) - sum(first_q)/len(first_q):+.3f}
- """
- print(summary)
- ```

- - `outputs/per_task_reward.png` (per-task reward before vs after)

---

- ##

```
.
- ├──
- ├──
- │   ├── graders.py            # Deterministic reward grader (keyword+line+length verifier)
- │   ├── mutator.py            # Code mutation engine (rename / shift / nudge)
- │   ├── probe_environment.py  # Core environment: reset / step / state / action handlers
- │   ├── requirements.txt      # Server-side Python dependencies
- │   ├── scanner.py            # Simulated static-analysis tool (70% recall, FP injection)
- │   ├── tasks.py              # 10 task definitions with ground-truth issue lists
- │   ├── _import_compat.py     # Import shim for package / script / test contexts
- │   └── __init__.py
- ├── frontend/
- │   ├── index.html            # Three-column dashboard layout
- │   ├── style.css             # Dark IDE theme (no build step required)
- │   └── app.js                # WebSocket client, code viewer, reward ring, history feed
- ├── training/
- │   ├── baseline.py           # Zero-shot GPT-4o-mini baseline agent + plotting
- │   ├── scripted_baseline.py  # Deterministic oracle and spammer stress-tests
- │   ├── train_grpo.py         # GRPO training script (TRL + optional Unsloth, 5-phase curriculum)
- │   └── __init__.py
- ├── tests/
- │   ├── test_dynamic_world.py # Tests for mutation engine and scanner noise model
- │   ├── test_grader.py        # Tests for reward grader correctness
- │   └── __init__.py
- ├── docs/
- │   └── design.md             # Architecture notes
- ├── outputs/
- │   └── scripted_baseline.jsonl # Sample baseline results
- ├── run.py                    # One-command launcher: starts server + serves frontend
- ├── openenv.yaml              # OpenEnv manifest (10 tasks, full schema)
- ├── pyproject.toml            # Project metadata and dependencies
- └── pytest.ini                # Test configuration
```
---

- ##

- - [x] `reset()`, `step()`, `state()` all implemented (async-native via `async_reset` / `async_step` / `async_state`; sync wrappers delegate safely via `asyncio.run`)
- - [x] `step()` returns `tuple[ObservationType, RewardType, bool, dict]` (see `async_step` in `probe_environment.py`)
- - [x] Dedicated `RewardType` Pydantic v2 model with `model_config = ConfigDict(frozen=True)` (`agent/models.py`)
- - [x] Valid `openenv.yaml` manifest (spec_version, name, type, runtime, app, port, 10 tasks, observation schema)
- - [x] Client/server separation enforced (`agent/` = client models + HTTP client; `environment/` = server logic)
- - [x] No reserved MCP tool names used
- - [ ] Hosted on HuggingFace Spaces ([FILL: deploy and add URL to links table above])

---

# PRobe: Teaching Machines to Think Like Security Engineers

## Quick Links for Judges

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH)

| Resource | Link |
|---|---|
| 🤗 **Live Demo** (try it now) | https://huggingface.co/spaces/mahithakur/PRobe |
| **Training Code** (Colab) | [Open Notebook](https://colab.research.google.com/drive/1624TxcO3kJXLyDTyhENUH22w81Wa2XIb#scrollTo=krnbsm0fq3dH) |
| **Blog Post** | [Read on Discussions](https://huggingface.co/spaces/mahithakur/PRobe/discussions/1) |
| **Results & Data** | [Datasets Hub](https://huggingface.co/datasets/mahithakur/PRobe-training-results) |
| **Full Report** | [Evaluation Metrics](./reports/JUDGE_REPORT.md) |

---

## What This Is (In Plain English)

Imagine teaching a student to review code like a security expert: not just to find obvious bugs, but to **understand the intent behind the code**. That's what PRobe does. It's a training ground where AI systems learn to review Python code the way a careful, skeptical engineer would.

The key difference from other benchmarks? **PRobe doesn't use an AI judge.** Instead, it uses simple, transparent rules. If you find the right bug on the right line with a clear explanation, you get rewarded. If you spam random keywords or miss the actual problem, you lose points. It's honest, reproducible, and fair.

---

## Why This Matters: The Problem We're Solving

Think about recent security disasters. The **XZ Utils backdoor** and **SolarWinds supply chain attack** had something in common: the malicious code *looked like normal changes*. To anyone scanning for obvious syntax errors or known vulnerabilities, everything seemed fine.

Here's the uncomfortable truth: **Most code review tools are pattern matchers.** They say "here's a potential bug" based on keywords and patterns they've learned. But a deliberate backdoor isn't a pattern. It's an *intention*. It's someone carefully hiding malice inside what looks like a legitimate improvement.

Modern AI systems are better than pattern matchers, but they still struggle with this. They find bugs, sure. But can they spot something that was deliberately hidden? Can they tell the difference between "I made a mistake" and "I embedded a backdoor"? And most importantly, can they **know when to escalate** to a human expert?

PRobe asks these questions directly.

---

## The Approach: Learning Through Feedback

Here's the philosophy behind how PRobe works:

**1. It teaches through real scenarios.** Not abstract examples. You get 10 tasks that simulate actual code review situations. Some are simple bugs. Some are security issues. Some are deliberately hidden backdoors designed to look innocent.

**2. It rewards clarity and precision.** Finding a bug is good. Finding the right bug on the right line with a clear explanation is better. Vague hand-waving gets penalized. This teaches the AI to think carefully, not just make guesses.

**3. It prevents gaming.** Traditional benchmarks often get broken by clever prompt engineering. PRobe uses deterministic grading: the score is based on facts (line number, keywords found, explanation length), not opinion. You can't trick it.

**4. It teaches judgment.** Some code doesn't have bugs; it has danger signs. Maybe the intent is unclear. Maybe the code is suspicious. In these cases, the right answer isn't "approve" or "request changes." It's "escalate to security." PRobe explicitly teaches this.
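
To make point 3 concrete, here is a minimal sketch of what rule-based grading can look like. It is illustrative only: the field names, weights, and thresholds below are assumptions for this sketch, not the actual logic in `environment/graders.py`, but it shows how a score can be computed purely from checkable facts, with no judge model involved.

```python
# Illustrative deterministic grader; the real one lives in environment/graders.py.
# Field names and weights here are invented for the sketch.
from dataclasses import dataclass

@dataclass
class Finding:
    line: int         # line number the review points at
    keywords: set     # issue keywords the review mentions
    explanation: str  # free-text justification

def grade(finding: Finding, truth_line: int, truth_keywords: set) -> float:
    """Same input, same score: every term below is a verifiable fact."""
    score = 0.0
    if finding.line == truth_line:               # right place
        score += 0.4
    if finding.keywords & truth_keywords:        # right issue
        score += 0.4
    if 20 <= len(finding.explanation) <= 400:    # substantive, not keyword spam
        score += 0.2
    return score

print(grade(Finding(42, {"sql-injection"}, "User input is concatenated into the query."),
            truth_line=42, truth_keywords={"sql-injection", "tainted-query"}))  # 1.0
```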

---

## How It Works: The 60-Second Tour

```bash
# 1. Clone and set up
uv sync
uv run python run.py

# 2. Open in your browser
# Visit http://localhost:8000/ui/ and click "New Episode"
```

You'll see:

- A Python file with 1-3 hidden bugs or security issues
- A rubric explaining what a good review should find
- A space for the model to write its findings
- A score based on accuracy and clarity

Try it yourself first. You'll understand what we're teaching.
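
If you would rather drive an episode from code than from the browser, the loop looks roughly like this. The class below is a stand-in stub: the real HTTP client lives in `agent/`, and only the `reset()` / `step()` shape (observation, reward, done, info) is borrowed from the environment's interface; everything else is invented for illustration.

```python
# Stand-in stub of an episode loop; only the reset()/step() return shape
# is borrowed from the real environment. The actual client lives in agent/.
class FakeProbeEnv:
    def reset(self, task_id: int) -> dict:
        # Serves a (mutated) Python file plus the review rubric.
        return {"code": "def run(cmd):\n    return eval(cmd)",
                "rubric": "find the unsafe call"}

    def step(self, action: dict):
        # Deterministic check: right line and right issue keyword earn the reward.
        reward = 1.0 if action["line"] == 2 and "eval" in action["issue"] else 0.0
        return {}, reward, True, {}

env = FakeProbeEnv()
obs = env.reset(task_id=0)
review = {"line": 2, "issue": "unsafe eval of user-controlled input"}
obs, reward, done, info = env.step(review)
print("episode reward:", reward)  # 1.0
```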

---

## What Makes This Different: Three Core Ideas

**Idea 1: Determinism Is a Feature, Not a Limitation**

Most benchmarks use an LLM as a judge. It's flexible, but it's also black-box and expensive. We went the opposite direction: simple rules, full transparency. Your score isn't a mystery. You can look at the grader code and understand exactly why you got that score.

**Idea 2: Prevention Matters More Than Detection**

We don't just test "can you find bugs?" We also test "can you avoid false alarms?" If you claim to find 10 issues but only 3 are real, you don't get full credit. This teaches systems to be *careful*, not just confident.

**Idea 3: Intent Matters**

Code can be wrong by accident or wrong by design. These are different problems. PRobe explicitly teaches the difference and rewards systems that can tell them apart.

---

## The Technical Foundation

### What You're Optimizing

- **Speed:** Find issues quickly
- **Accuracy:** Right issue + right line number
- **Confidence:** Clear, well-reasoned explanations
- **Judgment:** Know when to escalate

### How Learning Happens

We use a technique called **GRPO** (Group Relative Policy Optimization). It's a method where the system learns by comparing its own attempts. "This attempt was better than that one, so let's learn from the difference." It's efficient and works with modest compute.
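
As a minimal sketch of that core idea (the real loop goes through TRL in `training/train_grpo.py`; this is conceptual, not that implementation): grade a group of attempts at the same task, then weight each attempt by how far its reward sits from the group average.

```python
# Conceptual GRPO step: advantages are computed relative to the group's own
# rewards, so no separate value model is needed. Not the TRL implementation.
def group_relative_advantages(rewards: list) -> list:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Two attempts at the same review task: the better one is pushed up and the
# worse one pushed down by the same magnitude.
print(group_relative_advantages([0.2, 0.6]))  # [-1.0, 1.0]
```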

In our tests, a system trained on 50 code review episodes improved from 60% accuracy to 78%, an 18-point gain. Not perfect, but real progress.

### What's Inside the Box

- **10 carefully designed tasks**: from simple bugs to subtle backdoors
- **A mutation engine**: changes variable names and line numbers so nothing can be memorized (see the sketch after this list)
- **Honest grading**: deterministic, transparent scoring
- **A learning loop**: reinforcement learning that rewards careful thinking
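
The mutation idea is simple enough to sketch. This toy version is not the real engine (`environment/mutator.py` implements the actual rename / shift / nudge mutations); the target identifier and insertion point here are invented, but it shows why a memorized answer keyed to "variable `total`, line 12" stops working:

```python
# Toy mutator: rename one identifier and shift line numbers. Illustrative only;
# the real rename/shift/nudge engine is environment/mutator.py.
import random
import re

def mutate(source: str, seed: int) -> str:
    rng = random.Random(seed)  # seeded, so each episode is reproducible
    fresh = "v_" + "".join(rng.choices("abcdefgh", k=6))
    out = re.sub(r"\btotal\b", fresh, source)  # rename: 'total' is a stand-in target
    lines = out.splitlines()
    # Insert a harmless comment so every later line number shifts by one.
    lines.insert(rng.randrange(len(lines)), "# audit pass")
    return "\n".join(lines)

print(mutate("total = 0\nfor x in data:\n    total += x", seed=7))
```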

---

## Try Training Yourself

You can run the training in Google Colab (free GPU, no setup required):

```bash
# Install
pip install -e ".[training]"

# Quick test (no GPU needed)
python training/train_grpo.py --test

# Full training (uses GPU)
python training/train_grpo.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --steps 200 \
  --group-size 2 \
  --batch-size 2
```

Results are saved to `outputs/` and visualized in graphs.

---

## What You Get Out

After training, you'll see:

- **Learning curves**: how reward improves over time
- **Per-task improvement**: which types of issues the system learned to spot
- **Concrete examples**: before and after responses to actual code

---

## The Big Picture: Why This Matters

We're at an interesting moment in AI. Systems can now read and reason about code. But reasoning isn't just pattern matching. It's asking "what is the author trying to do?" and "is there something hidden here?"

PRobe is a small experiment in teaching machines to ask these questions. Not perfectly. Not completely. But honestly and transparently.

If this kind of thinking becomes part of code review (human and AI together), then maybe we can catch the next XZ Utils before it ships.

---

## Technical Details

Full architecture notes live in `docs/design.md`.

- **Environment:** FastAPI server + WebSocket UI (sketched after this list)
- **Grader:** Deterministic reward algorithm
- **Trainer:** GRPO using Hugging Face TRL
- **Frontend:** Simple, no build step required
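
As a rough shape of that first bullet (the route and payload below are invented for illustration; the real application and its endpoints live in `environment/`), the server side is FastAPI pushing episode events to the UI over a WebSocket:

```python
# Minimal shape of a FastAPI + WebSocket server. Route name and payload are
# assumptions for this sketch; the real app lives in environment/.
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def episode_feed(ws: WebSocket):
    await ws.accept()
    # The UI renders events like this into the code viewer and reward display.
    await ws.send_json({"event": "episode_start", "task_id": 0})
```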
---

## The Structure (If You're Curious)

```
.
├── environment/   # The core: tasks, grader, server
├── agent/         # Client code and models
├── training/      # Learning scripts (GRPO)
├── frontend/      # UI (HTML + JavaScript, no build)
├── tests/         # 88 tests, all passing
├── outputs/       # Training results
├── reports/       # Evaluation metrics
└── run.py         # One-command launcher
```
---

## A Final Thought

Code review is fundamentally about *judgment*. Not just finding errors, but understanding context, questioning intent, and knowing when to ask for help.

We built PRobe because we think machines should be trained to make better judgments. Not because they'll replace humans, but because they might become better partners to humans who care about security.

Try it. See what you think. The code is open, the grading is transparent, and the results are reproducible.

---

**Questions?** Open a discussion in the Space, or check out the [full blog post](https://huggingface.co/spaces/mahithakur/PRobe/discussions/1).
Made for the OpenEnv Hackathon.
|