PRobe: Training an AI Code Reviewer That Spots Backdoors
An interactive RL environment where models learn to review code like security engineers, not linters.
The Problem: Supply Chain Attacks Look Like Normal Code
Attacks like XZ Utils and SolarWinds show that malicious code can hide in plain sight. A backdoor often looks like a legitimate refactor or bug fix, and a reviewer who does not reason about intent cannot tell it apart from a normal change.
Current LLM code reviewers excel at spotting obvious bugs (undefined variables, off-by-one errors) but struggle with intentional sabotage. They're pattern matchers, not investigators.
Our Solution: PRobe — A Deterministic Training Environment
PRobe is a Python code review environment where AI agents learn to:
- Find real bugs and security issues — with accurate line numbers
- Distinguish honest mistakes from deliberate backdoors — and decide when to escalate
- Explain findings precisely — vague answers get penalized
Unlike many "LLM judge" benchmarks, PRobe uses deterministic, reproducible rewards. No expensive API calls to grade submissions. No gaming via keyword spam.
What Makes PRobe Different
| Feature | Traditional Benchmarks | PRobe |
|---|---|---|
| Reward | LLM evaluator (variable, slow, expensive) | Deterministic algorithm (fast, reproducible) |
| Feedback | "Good job!" or "Try again" | Specific penalties for imprecision |
| Anti-gaming | None | Keyword spam on wrong lines penalized |
| Tasks | Curated examples | 10 procedurally generated scenarios |
| Escalation | Find bugs OR approve | Find bugs OR request changes OR escalate to security |
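
The three-way outcome in the Escalation row can be modeled as a small typed structure. This is a minimal sketch: the names `Verdict`, `Finding`, `Review`, and `choose_verdict` are illustrative assumptions, not PRobe's actual API, and the escalation rule is deliberately simplistic.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    """Possible review outcomes (hypothetical names)."""
    APPROVE = "approve"
    REQUEST_CHANGES = "request_changes"
    ESCALATE = "escalate"


@dataclass
class Finding:
    line: int          # 1-based line number in the reviewed file
    category: str      # e.g. "bug", "security", "backdoor"
    explanation: str   # free-text justification


@dataclass
class Review:
    verdict: Verdict
    findings: list[Finding] = field(default_factory=list)


def choose_verdict(findings: list[Finding]) -> Verdict:
    """Escalate on any suspected backdoor, request changes on other issues."""
    if any(f.category == "backdoor" for f in findings):
        return Verdict.ESCALATE
    return Verdict.REQUEST_CHANGES if findings else Verdict.APPROVE
```

Keeping the verdict separate from the findings lets a grader score the decision (approve, request changes, escalate) independently of how precisely each issue was located.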
How It Works (60 seconds)
```bash
# 1. Clone and set up
git clone https://huggingface.co/spaces/themahipalthakur/PRobe
cd PRobe
uv sync

# 2. Run locally
uv run python run.py

# 3. Open browser
# Visit http://localhost:8000/ui/ and click "New Episode"
```
The UI shows:
- A Python file with 1-3 injected bugs or backdoors
- A grading rubric explaining what the model should find
- The model's response (approval, change request, or escalation)
- A score based on accuracy and confidence
Technical Architecture
The Environment (environment.py)
10 tasks with progressively harder scenarios:
- Tasks 0-2: Simple bugs (off-by-one, undefined var, logic error)
- Tasks 3-5: Security issues (hardcoded API keys, disabled SSL, SQL injection)
- Tasks 6-9: Adversarial backdoors (hiding intent via indirection, obfuscation)
Mutation system: Variable names and line numbers shift each episode, preventing memorization
Causal unlocks: Context hints appear only after the model finds related issues
Episode memory: Models can reference previous findings within an episode
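
One way to get the mutation behavior described above is a seeded transform that renames identifiers and shifts line numbers, so the same seed always yields the same variant. The sketch below is an assumption about how this could work, not PRobe's implementation: `mutate` is a hypothetical helper, and the regex rename is naive (it would also rewrite matching words inside strings and comments).

```python
import random
import re


def mutate(source: str, bug_line: int, names: list[str], seed: int) -> tuple[str, int]:
    """Return a renamed, shifted copy of `source` and the bug's new line number."""
    rng = random.Random(seed)  # seeded, so each episode is reproducible

    # Rename each listed identifier to a fresh, randomly suffixed name.
    for name in names:
        new_name = f"{name}_{rng.randrange(1000, 10000)}"
        source = re.sub(rf"\b{re.escape(name)}\b", new_name, source)

    # Prepend 0-3 comment lines so absolute line numbers shift.
    offset = rng.randint(0, 3)
    padding = "".join(f"# filler {i}\n" for i in range(offset))
    return padding + source, bug_line + offset
```

Because the ground-truth line number is returned alongside the mutated source, the grader can stay deterministic even though the file changes every episode.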
The Grader (grader.py)
Scores are deterministic:
Base reward = % of issues found + % correct line numbers + quality bonus
Penalties for:
- Vague explanations (< 10 words per issue)
- False positives (claiming issues that don't exist)
- Lazy submissions (category-only, no detail)
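
The scoring rule and penalties above can be sketched as a pure function. Everything numeric here is an assumption made for illustration: the weights (0.5, 0.3, 0.2), the penalty sizes, and the two-line tolerance are not taken from `grader.py`. Only the 10-word threshold comes from the rubric above.

```python
from dataclasses import dataclass

MIN_WORDS = 10       # from the rubric: under 10 words counts as vague
LINE_TOLERANCE = 2   # assumption: a finding within 2 lines still counts as "found"


@dataclass(frozen=True)
class Issue:
    line: int
    category: str
    explanation: str = ""


def grade(found: list[Issue], truth: list[Issue]) -> float:
    """Score a review deterministically; weights are illustrative assumptions."""
    def near(f: Issue, t: Issue) -> bool:
        return f.category == t.category and abs(f.line - t.line) <= LINE_TOLERANCE

    def exact(f: Issue, t: Issue) -> bool:
        return f.category == t.category and f.line == t.line

    def detailed(f: Issue) -> bool:
        return len(f.explanation.split()) >= MIN_WORDS

    n = max(len(truth), 1)
    found_frac = sum(any(near(f, t) for f in found) for t in truth) / n
    line_frac = sum(any(exact(f, t) for f in found) for t in truth) / n

    matched = [f for f in found if any(near(f, t) for t in truth)]
    false_positives = len(found) - len(matched)
    vague = sum(not detailed(f) for f in found)
    quality = sum(detailed(f) for f in matched) / len(matched) if matched else 0.0

    # Base reward, then penalties for false positives and vague findings.
    score = 0.5 * found_frac + 0.3 * line_frac + 0.2 * quality
    score -= 0.1 * false_positives + 0.05 * vague
    return round(min(1.0, max(0.0, score)), 4)
```

With this shape, repeating a category keyword on the wrong lines only accumulates false-positive and vagueness penalties, and a category-only submission with no explanation counts as vague.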
Training with GRPO
We use Group Relative Policy Optimization (GRPO, introduced in DeepSeekMath and later used to train DeepSeek-R1) to train code reviewers:
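```python
from trl import GRPOTrainer  # assumes Hugging Face TRL

# Training on 50 episodes per task.
# model, training_args (a GRPOConfig), dataset, and processor are defined earlier.
# probe_reward is a placeholder name for a function that wraps the PRobe grader.
trainer = GRPOTrainer(
    model=model,
    reward_funcs=probe_reward,  # required argument in TRL's GRPOTrainer
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()
```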
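
Assuming the trainer is Hugging Face TRL's `GRPOTrainer`, the grader reaches it through a reward function that receives a batch of completions and returns one float per completion; extra dataset columns arrive as keyword arguments. The sketch below is hypothetical: `probe_reward`, the `expected_lines` column, and the JSON output format are assumptions, and the scoring is a simplified stand-in for the real grader.

```python
import json


def probe_reward(completions, expected_lines, **kwargs):
    """Return one float per completion (hypothetical reward, not PRobe's grader)."""
    rewards = []
    for completion, expected in zip(completions, expected_lines):
        try:
            # Assumed output format: {"lines": [4, 9]}
            reported = set(json.loads(completion)["lines"])
        except (ValueError, KeyError, TypeError):
            rewards.append(0.0)  # unparseable output earns nothing
            continue
        expected = set(expected)
        hits = len(reported & expected)
        misses = len(reported - expected)  # wrong lines cost half a hit each
        rewards.append(max(0.0, (hits - 0.5 * misses) / max(len(expected), 1)))
    return rewards
```

This version expects plain-string completions; with a conversational dataset each completion would be a list of message dicts, and the text would need to be extracted first.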
Why GRPO?
- Doesn't require a separate value (critic) model, unlike PPO
- Samples a group of completions per prompt and scores each relative to the group average
- Works well with small compute budgets
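
The group-relative comparison at the core of GRPO is small enough to show directly: each completion's advantage is its reward minus the group mean, divided by the group standard deviation. This is a minimal sketch of that one step, not a full training loop; the function name and the `eps` value are illustrative, and implementations differ on whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that score above their group's average get a positive advantage and are reinforced; when every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.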
Results: Before & After Training
Our baseline (GPT-4o-mini) finds ~60% of issues. After GRPO training on 50 episodes:
- Final 10 episodes avg: 78% accuracy
- Improvement: +18 percentage points
- Training time: ~2 hours on single GPU
See full metrics in reports/JUDGE_REPORT.md.
Try It Now
- Live Space: https://huggingface.co/spaces/themahipalthakur/PRobe
- Training Notebook:
- Code: https://github.com/themahipalthakur/PRobe
Why This Matters
Security reviews are expensive and rare. PRobe demonstrates that AI can be trained to reason about intent, not just pattern-match. The deterministic grader ensures:
- ✅ Reproducible results
- ✅ No gaming via prompt engineering
- ✅ Transparent scoring (anyone can verify the grade)
This is a step toward AI systems that integrate into real security workflows—not because they're perfect, but because they're verifiable.
Made for the OpenEnv Hackathon
Questions? Open an issue on GitHub.