# PRobe: Training an AI Code Reviewer That Spots Backdoors

*An interactive RL environment where models learn to review code like security engineers, not linters.*

## The Problem: Supply Chain Attacks Look Like Normal Code

Recent attacks like [XZ Utils](https://en.wikipedia.org/wiki/XZ_Utils_backdoor) and [SolarWinds](https://www.microsoft.com/en-us/security/blog/2020/12/18/analyzing-solorigate-samples-using-microsoft-defender-for-endpoint/) remind us that **malicious code can hide in plain sight**. A backdoor often looks like a legitimate refactor or bug fix; without understanding *intent*, it can be indistinguishable from a normal change.

Current LLM code reviewers excel at spotting obvious bugs (undefined variables, off-by-one errors) but struggle with intentional sabotage. They're pattern matchers, not investigators.

## Our Solution: PRobe, a Deterministic Training Environment

PRobe is a **Python code review environment** where AI agents learn to:

1. **Find real bugs and security issues**, with accurate line numbers
2. **Distinguish honest mistakes from deliberate backdoors**, and decide when to escalate
3. **Explain findings precisely**: vague answers get penalized

Unlike many "LLM judge" benchmarks, PRobe uses **deterministic, reproducible rewards**. No expensive API calls to grade submissions. No gaming via keyword spam.

### What Makes PRobe Different

| Feature | Traditional Benchmarks | PRobe |
|---------|----------------------|-------|
| **Reward** | LLM evaluator (variable, slow, expensive) | Deterministic algorithm (fast, reproducible) |
| **Feedback** | "Good job!" or "Try again" | Specific penalties for imprecision |
| **Anti-gaming** | None | Keyword spam on wrong lines penalized |
| **Tasks** | Curated examples | 10 procedurally generated scenarios |
| **Escalation** | Find bugs OR approve | Find bugs OR request changes OR escalate to security |

## How It Works (60 seconds)

```bash
# 1. Clone and setup
git clone https://huggingface.co/spaces/themahipalthakur/PRobe
cd PRobe
uv sync

# 2. Run locally
uv run python run.py

# 3. Open browser
# Visit http://localhost:8000/ui/ and click "New Episode"
```

The UI shows:

- A **Python file** with 1-3 injected bugs or backdoors
- A grading rubric explaining what the model should find
- The model's response (approval, change request, or escalation)
- A **score** based on accuracy and confidence

## Technical Architecture

### The Environment (`environment.py`)

- **10 tasks** with progressively harder scenarios:
  - Tasks 0-2: Simple bugs (off-by-one, undefined variable, logic error)
  - Tasks 3-5: Security issues (hardcoded API keys, disabled SSL verification, SQL injection)
  - Tasks 6-9: Adversarial backdoors (intent hidden via indirection or obfuscation)
- **Mutation system**: Variable names and line numbers shift each episode, preventing memorization
- **Causal unlocks**: Context hints appear only after the model finds related issues
- **Episode memory**: Models can reference previous findings within an episode

### The Grader (`grader.py`)

Scores are deterministic:

```
Base reward = % of issues found + % correct line numbers + quality bonus

Penalties for:
- Vague explanations (< 10 words per issue)
- False positives (claiming issues that don't exist)
- Lazy submissions (category-only, no detail)
```

### Training with GRPO

We use **Group Relative Policy Optimization** (GRPO, introduced in DeepSeekMath and later used to train DeepSeek-R1) to train code reviewers:

```python
from trl import GRPOTrainer

# `model`, `training_args` (a GRPOConfig), `dataset`, and `processor` are
# assumed to be defined earlier in the training notebook.
# Training on 50 episodes per task
trainer = GRPOTrainer(
    model=model,
    # TRL requires a reward function; `probe_reward` is a hypothetical
    # wrapper around the deterministic grader, not a name from the repo.
    reward_funcs=probe_reward,
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()
```

**Why GRPO?**

- Doesn't require a separate value (critic) model, unlike PPO
- Samples a group of completions per prompt and scores each one relative to the group's average reward
- Works well with small compute budgets

## Results: Before & After Training

Our baseline (GPT-4o-mini) finds ~60% of issues. After GRPO training on 50 episodes:

- **Final 10 episodes avg**: 78% accuracy
- **Improvement**: +18 percentage points
- **Training time**: ~2 hours on a single GPU

See full metrics in `reports/JUDGE_REPORT.md`.

## Try It Now

- **Live Space**: https://huggingface.co/spaces/themahipalthakur/PRobe
- **Training Notebook**: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/FILL_COLAB_LINK)
- **Code**: https://github.com/themahipalthakur/PRobe

## Why This Matters

Security reviews are expensive and rare. PRobe demonstrates that AI can be trained to **reason about intent**, not just pattern-match. The deterministic grader ensures:

- ✅ Reproducible results
- ✅ No gaming via prompt engineering
- ✅ Transparent scoring (anyone can verify the grade)

This is a step toward AI systems that integrate into real security workflows, not because they're perfect, but because they're *verifiable*.

---

**Made for the [OpenEnv Hackathon](https://huggingface.co/spaces/open-env/open-env-hackers)**

Questions? Open an issue on [GitHub](https://github.com/themahipalthakur/PRobe).
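
## Appendix: Sketch of a Deterministic Grader

The scoring rule described under "The Grader" can be made concrete with a small example. What follows is a minimal sketch under stated assumptions, not the contents of `grader.py`: the `Issue` and `Finding` types, the `match_issue` and `grade` functions, and every weight and penalty size are hypothetical and chosen only for illustration. The one value taken from the post is the 10-word threshold for vague explanations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """A ground-truth issue injected into the file under review."""
    line: int
    category: str


@dataclass(frozen=True)
class Finding:
    """One issue claimed by the model."""
    line: int
    category: str
    explanation: str


# All weights and penalty sizes are illustrative assumptions.
W_FOUND, W_LINE, W_QUALITY = 0.5, 0.4, 0.1
FALSE_POSITIVE_PENALTY = 0.10
VAGUE_PENALTY = 0.05
MIN_WORDS = 10  # from the "< 10 words per issue" rule in the post


def match_issue(finding, issues):
    """Prefer an exact (category, line) match, else any same-category issue."""
    same_category = [i for i in issues if i.category == finding.category]
    for issue in same_category:
        if issue.line == finding.line:
            return issue
    return same_category[0] if same_category else None


def grade(findings, issues):
    """Return a score in [0, 1]; identical inputs always give the same score."""
    if not issues:
        return 0.0
    found, located, detailed = set(), set(), set()
    penalty = 0.0
    for finding in findings:
        issue = match_issue(finding, issues)
        if issue is None:
            penalty += FALSE_POSITIVE_PENALTY  # claimed issue does not exist
            continue
        found.add(issue)
        if issue.line == finding.line:
            located.add(issue)
        if len(finding.explanation.split()) >= MIN_WORDS:
            detailed.add(issue)
        else:
            penalty += VAGUE_PENALTY  # category-only or too-short explanation
    n = len(issues)
    base = (W_FOUND * len(found) + W_LINE * len(located)
            + W_QUALITY * len(detailed)) / n
    return max(0.0, min(1.0, base - penalty))
```

Because the function uses only set membership and arithmetic on its inputs, the same submission always receives the same score, which is the property that lets anyone re-run and verify a grade. A category keyword on the wrong line earns only partial credit for detection and nothing for location, so keyword spam scores poorly in this sketch.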