# PRobe: Training an AI Code Reviewer That Spots Backdoors
*An interactive RL environment where models learn to review code like security engineers, not linters.*
## The Problem: Supply Chain Attacks Look Like Normal Code
Supply chain attacks like [XZ Utils](https://en.wikipedia.org/wiki/XZ_Utils_backdoor) and [SolarWinds](https://www.microsoft.com/en-us/security/blog/2020/12/18/analyzing-solorigate-samples-using-microsoft-defender-for-endpoint/) remind us that **malicious code can hide in plain sight**. A backdoor often looks like a legitimate refactor or bug fix, and it is very hard to tell apart from a normal change without understanding *intent*.
Current LLM code reviewers excel at spotting obvious bugs (undefined variables, off-by-one errors) but struggle with intentional sabotage. They're pattern matchers, not investigators.
## Our Solution: PRobe — A Deterministic Training Environment
PRobe is a **Python code review environment** where AI agents learn to:
1. **Find real bugs and security issues** — with accurate line numbers
2. **Distinguish honest mistakes from deliberate backdoors** — and decide when to escalate
3. **Explain findings precisely** — vague answers get penalized
Unlike many "LLM judge" benchmarks, PRobe uses **deterministic, reproducible rewards**. No expensive API calls to grade submissions. No gaming via keyword spam.
### What Makes PRobe Different
| Feature | Traditional Benchmarks | PRobe |
|---------|----------------------|-------|
| **Reward** | LLM evaluator (variable, slow, expensive) | Deterministic algorithm (fast, reproducible) |
| **Feedback** | "Good job!" or "Try again" | Specific penalties for imprecision |
| **Anti-gaming** | None | Keyword spam on wrong lines penalized |
| **Tasks** | Curated examples | 10 procedurally generated scenarios |
| **Escalation** | Find bugs OR approve | Find bugs OR request changes OR escalate to security |
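
The three-way outcome in the last row can be pictured as a small action type. The sketch below is illustrative only: `Verdict`, `Finding`, and `ReviewAction` are hypothetical names chosen for this example, not PRobe's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    REQUEST_CHANGES = "request_changes"
    ESCALATE = "escalate"  # hand the PR to a security team


@dataclass(frozen=True)
class Finding:
    line: int          # line number the reviewer is pointing at
    category: str      # e.g. "sql_injection"
    explanation: str   # free-text justification


@dataclass
class ReviewAction:
    verdict: Verdict
    findings: list[Finding] = field(default_factory=list)

    def is_consistent(self) -> bool:
        # Approving while listing findings, or flagging a PR with none,
        # is treated as contradictory in this sketch.
        if self.verdict is Verdict.APPROVE:
            return not self.findings
        return bool(self.findings)
```

Separating the verdict from the findings lets a grader score "what was found" and "what the reviewer decided to do about it" independently.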
## How It Works (60 seconds)
```bash
# 1. Clone and setup
git clone https://huggingface.co/spaces/themahipalthakur/PRobe
cd PRobe
uv sync
# 2. Run locally
uv run python run.py
# 3. Open browser
# Visit http://localhost:8000/ui/ and click "New Episode"
```
The UI shows:
- A **Python file** with 1-3 injected bugs or backdoors
- A grading rubric explaining what the model should find
- The model's response (approval, change request, or escalation)
- A **score** based on accuracy and confidence
## Technical Architecture
### The Environment (`environment.py`)
- **10 tasks** with progressively harder scenarios:
  - Tasks 0-2: Simple bugs (off-by-one, undefined var, logic error)
  - Tasks 3-5: Security issues (hardcoded API keys, disabled SSL, SQL injection)
  - Tasks 6-9: Adversarial backdoors (hiding intent via indirection, obfuscation)
- **Mutation system**: Variable names and line numbers shift each episode, preventing memorization
- **Causal unlocks**: Context hints appear only after the model finds related issues
- **Episode memory**: Models can reference previous findings within an episode
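
To make the mutation idea concrete, here is a minimal sketch of a seeded mutation that renames one identifier and shifts line numbers. It is an assumption about how such a step could look, not the code in `environment.py`; the function name `mutate`, its parameters, and the alias list are all hypothetical.

```python
import random
import re


def mutate(source: str, target: str, bug_line: int, seed: int) -> tuple[str, int]:
    """Rename `target` and shift line numbers; return (new_source, new_bug_line)."""
    rng = random.Random(seed)  # same seed gives the same mutation
    # Swap the identifier for a randomly chosen alias (whole-word matches only).
    alias = rng.choice(["items", "records", "entries", "rows"])
    renamed = re.sub(rf"\b{re.escape(target)}\b", alias, source)
    # Prepend 0-3 comment lines so every original line number moves down.
    offset = rng.randint(0, 3)
    padding = "".join(f"# module note {i}\n" for i in range(offset))
    return padding + renamed, bug_line + offset


snippet = "def total(data):\n    return sum(data[1:])\n"  # bug on line 2
mutated, new_line = mutate(snippet, "data", bug_line=2, seed=7)
```

Because the ground-truth line number is updated alongside the source, the grader can still check line accuracy exactly while the model sees a different surface form each episode.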
### The Grader (`grader.py`)
Scores are deterministic:
```
Base reward = % of issues found + % correct line numbers + quality bonus
Penalties for:
- Vague explanations (< 10 words per issue)
- False positives (claiming issues that don't exist)
- Lazy submissions (category-only, no detail)
```
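
The scoring rule above can be sketched as a pure function. Everything specific here is an assumption made for illustration: the `Issue` and `Finding` types, the weights (0.4 / 0.4 / 0.2), and the penalty sizes are not taken from `grader.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    line: int
    category: str


@dataclass(frozen=True)
class Finding:
    line: int
    category: str
    explanation: str


def grade(findings: list[Finding], truth: list[Issue], min_words: int = 10) -> float:
    """Deterministic score in [0, 1]; weights and penalties are illustrative."""
    if not truth:
        raise ValueError("a task needs at least one ground-truth issue")
    true_pairs = {(i.line, i.category) for i in truth}
    true_categories = {i.category for i in truth}

    # Issue found: the right category was reported somewhere.
    found = {f.category for f in findings} & true_categories
    # Correct line: category and line number both match.
    located = [f for f in findings if (f.line, f.category) in true_pairs]
    located_pairs = {(f.line, f.category) for f in located}
    detailed = [f for f in located if len(f.explanation.split()) >= min_words]

    score = 0.4 * len(found) / len(true_categories)
    score += 0.4 * len(located_pairs) / len(true_pairs)
    if located:
        score += 0.2 * len(detailed) / len(located)  # quality bonus

    # Penalties: findings that match no real issue, and vague explanations.
    false_positives = len(findings) - len(located)
    vague = sum(1 for f in findings if len(f.explanation.split()) < min_words)
    score -= 0.2 * false_positives + 0.05 * vague
    return max(0.0, min(1.0, score))
```

Under this sketch, naming the right categories on the wrong lines earns the "found" credit but loses it again to false-positive penalties, which is the anti-gaming behavior the table describes.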
### Training with GRPO
We use **Group Relative Policy Optimization** (GRPO, introduced in DeepSeekMath and later used to train DeepSeek-R1) to train code reviewers:
```python
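from trl import GRPOTrainer  # assumes Hugging Face TRL

# Training on 50 episodes per task.
# model, training_args (a GRPOConfig), dataset, and processor are defined earlier.
trainer = GRPOTrainer(
    model=model,
    # GRPOTrainer requires a reward; `probe_reward` is a placeholder name for a
    # function that scores completions with the deterministic grader.
    reward_funcs=probe_reward,
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()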
```
**Why GRPO?**
- Doesn't require a separate value (critic) model, unlike PPO
- Samples a group of completions per prompt and scores each one relative to the group's average reward
- Works well with small compute budgets
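
The group-relative comparison at the core of GRPO fits in a few lines. This is a simplified sketch of the advantage computation only (no policy update, clipping, or KL term); the function name and the use of the population standard deviation are choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Grader scores for several reviews sampled from the same PR.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Reviews that beat their group's average get a positive advantage and are reinforced; below-average reviews get a negative one. Because the baseline comes from the group itself, no learned value model is needed.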
## Results: Before & After Training
Our baseline (GPT-4o-mini) finds ~60% of issues. After GRPO training on 50 episodes:
- **Final 10 episodes avg**: 78% accuracy
- **Improvement**: +18 percentage points
- **Training time**: ~2 hours on a single GPU
See full metrics in `reports/JUDGE_REPORT.md`.
## Try It Now
- **Live Space**: https://huggingface.co/spaces/themahipalthakur/PRobe
- **Training Notebook**: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/FILL_COLAB_LINK)
- **Code**: https://github.com/themahipalthakur/PRobe
## Why This Matters
Security reviews are expensive and rare. PRobe demonstrates that AI can be trained to **reason about intent**, not just pattern-match. The deterministic grader ensures:
- ✅ Reproducible results
- ✅ No gaming via prompt engineering
- ✅ Transparent scoring (anyone can verify the grade)
This is a step toward AI systems that integrate into real security workflows—not because they're perfect, but because they're *verifiable*.
---
**Made for the [OpenEnv Hackathon](https://huggingface.co/spaces/open-env/open-env-hackers)**
Questions? Open an issue on [GitHub](https://github.com/themahipalthakur/PRobe).