Training an AI Code Reviewer That Spots Backdoors
An interactive RL environment where models learn to review code like security engineers, not linters.
The Problem: Supply Chain Attacks Look Like Normal Code
Recent attacks like XZ Utils and SolarWinds remind us that malicious code can hide in plain sight. A backdoor often looks like a
legitimate refactor or bug fix—it's indistinguishable from normal changes without understanding intent.
Current LLM code reviewers excel at spotting obvious bugs but struggle with intentional sabotage. They're pattern matchers, not
investigators.
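To make this concrete, here is a hypothetical example (not taken from PRobe's task set) of the kind of change that reads like a harmless cleanup but quietly weakens security:

```python
import hmac

def check_token_before(supplied: str, expected: str) -> bool:
    # Constant-time comparison: resistant to timing attacks.
    return hmac.compare_digest(supplied, expected)

def check_token_after(supplied: str, expected: str) -> bool:
    # Looks like a simplification of the same logic, but `==` can
    # short-circuit on the first mismatched byte, leaking timing
    # information an attacker can measure to recover the token.
    return supplied == expected
```

Both versions return the same results on every input, so a pattern matcher sees a no-op refactor; only a reviewer reasoning about intent flags the downgrade.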
Our Solution: PRobe — A Deterministic Training Environment
PRobe is a Python code review environment where AI agents learn to:
- Find real bugs and security issues — with accurate line numbers
- Distinguish honest mistakes from deliberate backdoors — and decide when to escalate
- Explain findings precisely — vague answers get penalized
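A submission that satisfies these three requirements might look like the sketch below. The field names are illustrative assumptions, not PRobe's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One reviewer finding; hypothetical shape, for illustration only."""
    issue: str              # e.g. "sql-injection", "backdoor"
    line: int               # 1-indexed line in the reviewed file
    explanation: str        # vague explanations are penalized
    escalate: bool = False  # flag suspected deliberate sabotage

# Example finding for a suspicious auth change on line 42:
finding = Finding(
    issue="backdoor",
    line=42,
    explanation="Token comparison replaced with non-constant-time ==, "
                "enabling a timing side channel.",
    escalate=True,
)
```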
Unlike many "LLM judge" benchmarks, PRobe uses deterministic, reproducible rewards. No expensive API calls to grade submissions.
Key Features
- No LLM judge: reward is deterministic and reproducible
- Anti-gaming: keyword spam on wrong lines gets penalized
- 10 tasks that simulate real review situations (bugs + adversarial backdoors)
- Mutator: changes variable names/line numbers so the model can't memorize answers
- Deterministic grader: scores based on "right issue + right place + good explanation"
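The grading logic above can be sketched roughly as follows. The weights, line tolerance, and "good explanation" heuristic here are assumptions for illustration; PRobe's actual scoring rules may differ:

```python
def grade(findings, expected, line_tol=1, spam_penalty=0.25):
    """Deterministic grader sketch: reward right issue + right place +
    substantive explanation; penalize findings that match nothing,
    which is what defeats keyword spam on wrong lines."""
    score = 0.0
    matched = set()
    for issue, line, explanation in findings:
        hit = next(
            (i for i, (e_issue, e_line) in enumerate(expected)
             if i not in matched
             and issue == e_issue
             and abs(line - e_line) <= line_tol),
            None,
        )
        if hit is None:
            score -= spam_penalty  # wrong issue or wrong place
        else:
            matched.add(hit)
            score += 1.0
            if len(explanation.split()) >= 8:  # crude explanation check
                score += 0.5
    # Normalize so a perfect review scores 1.0
    return score / (1.5 * len(expected))
```

Because the score is a pure function of the submission and the answer key, the same review always gets the same reward, with no LLM judge in the loop.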
Results
Our baseline (GPT-4o-mini) finds ~60% of issues. After 50 episodes of GRPO training:
- Final 10 episodes avg: 78% accuracy
- Improvement: +18 percentage points
- Training time: ~2 hours on single GPU
See full metrics in /reports/JUDGE_REPORT.md
Try It Now
- Live Space: https://huggingface.co/spaces/mahithakur/PRobe
- Code: https://huggingface.co/spaces/mahithakur/PRobe
Made for the OpenEnv Hackathon