	# PRobe: Training an AI Code Reviewer That Spots Backdoors

	An interactive RL environment where models learn to review code like security engineers, not linters.

	## The Problem: Supply Chain Attacks Look Like Normal Code

	Recent attacks like [XZ Utils](https://en.wikipedia.org/wiki/XZ_Utils_backdoor) and [SolarWinds](https://www.microsoft.com/en-us/security/blog/2020/12/18/analyzing-solorigate-samples-using-microsoft-defender-for-endpoint/) remind us that malicious code can hide in plain sight. A backdoor often looks like a legitimate refactor or bug fix, and a reviewer who doesn't understand the intent behind a change can't tell the two apart.

	Current LLM code reviewers excel at spotting obvious bugs (undefined variables, off-by-one errors) but struggle with intentional sabotage. They're pattern matchers, not investigators.

	## Our Solution: PRobe — A Deterministic Training Environment

	PRobe is a Python code review environment where AI agents learn to:

	1. Find real bugs and security issues — with accurate line numbers
	2. Distinguish honest mistakes from deliberate backdoors — and decide when to escalate
	3. Explain findings precisely — vague answers get penalized

	Unlike many "LLM judge" benchmarks, PRobe uses deterministic, reproducible rewards. No expensive API calls to grade submissions. No gaming via keyword spam.
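
To make the three skills concrete, a review can be pictured as a structured action: a verdict (approve, request changes, or escalate to security) plus a list of line-anchored findings. The sketch below is illustrative only. `Verdict`, `Finding`, and `ReviewAction` are hypothetical names chosen for this post, not PRobe's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    """The three outcomes a reviewer can choose (names are assumptions)."""
    APPROVE = "approve"
    REQUEST_CHANGES = "request_changes"
    ESCALATE = "escalate_to_security"


@dataclass(frozen=True)
class Finding:
    line: int          # 1-indexed line the reviewer points at
    category: str      # e.g. "sql_injection" (hypothetical label)
    explanation: str   # free-text justification


@dataclass
class ReviewAction:
    verdict: Verdict
    findings: list[Finding] = field(default_factory=list)

    def is_consistent(self) -> bool:
        """An approval carries no findings; any other verdict needs at least one."""
        if self.verdict is Verdict.APPROVE:
            return not self.findings
        return bool(self.findings)
```

A structured action like this is what makes deterministic grading possible: line numbers and categories can be compared exactly, with no second model in the loop.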

	### What Makes PRobe Different

	| Feature | Traditional Benchmarks | PRobe |
	|---------|------------------------|-------|
	| Reward | LLM evaluator (variable, slow, expensive) | Deterministic algorithm (fast, reproducible) |
	| Feedback | "Good job!" or "Try again" | Specific penalties for imprecision |
	| Anti-gaming | None | Keyword spam on wrong lines penalized |
	| Tasks | Curated examples | 10 procedurally generated scenarios |
	| Escalation | Find bugs OR approve | Find bugs OR request changes OR escalate to security |
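
The anti-gaming row is easiest to see in code. The following is a minimal sketch, not PRobe's grader: the function name, the one-line tolerance, and the 0.25 penalty are all assumptions picked for illustration. A claim earns credit only when its category matches a real issue near the reported line, and every other claim costs points.

```python
def score_claims(claims, truth, tolerance=1, wrong_line_penalty=0.25):
    """Score (line, category) claims against ground truth.

    claims: list of (line, category) tuples from the reviewer.
    truth:  dict mapping the true line number to its category.
    All parameter names and default values here are illustrative assumptions.
    """
    matched = set()
    penalty = 0.0
    for line, category in claims:
        # A claim counts only if it names the right category near the right line,
        # and each true issue can be matched at most once.
        hit = next(
            (
                t for t in truth
                if t not in matched
                and truth[t] == category
                and abs(t - line) <= tolerance
            ),
            None,
        )
        if hit is None:
            penalty += wrong_line_penalty
        else:
            matched.add(hit)
    recall = len(matched) / len(truth) if truth else 0.0
    return recall - penalty
```

Under these assumptions, repeating the right keyword on every line of a file scores far below a single precise claim, because each miss is charged individually.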

	## How It Works (60 seconds)

	```bash
	# 1. Clone and setup
	git clone https://huggingface.co/spaces/themahipalthakur/PRobe
	cd PRobe
	uv sync

	# 2. Run locally
	uv run python run.py

	# 3. Open browser
	# Visit http://localhost:8000/ui/ and click "New Episode"
	```

	The UI shows:
	- A Python file with 1-3 injected bugs or backdoors
	- A grading rubric explaining what the model should find
	- The model's response (approval, change request, or escalation)
	- A score based on accuracy and confidence

	## Technical Architecture

	### The Environment (`environment.py`)

	- 10 tasks with progressively harder scenarios:
	  - Tasks 0-2: Simple bugs (off-by-one, undefined var, logic error)
	  - Tasks 3-5: Security issues (hardcoded API keys, disabled SSL, SQL injection)
	  - Tasks 6-9: Adversarial backdoors (hiding intent via indirection, obfuscation)

	- Mutation system: Variable names and line numbers shift each episode, preventing memorization
	- Causal unlocks: Context hints appear only after the model finds related issues
	- Episode memory: Models can reference previous findings within an episode
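
The mutation idea can be sketched in a few lines. This is an assumption-laden illustration rather than the code in `environment.py`: the `mutate` function, its parameters, and the choice to shift lines by prepending comment lines are all hypothetical. The key property is that a seeded random generator renames identifiers and moves line numbers while keeping the ground truth in sync.

```python
import random
import re


def mutate(source: str, bug_lines: list[int], rename: dict[str, list[str]], seed: int):
    """Return (mutated_source, shifted_bug_lines) for one episode.

    Hypothetical helper: renames identifiers and shifts line numbers so that a
    model cannot memorize "the bug is on line 3 in variable acc".
    """
    rng = random.Random(seed)  # seeded, so an episode can be replayed exactly

    # 1. Rename identifiers, matching whole words only.
    for old, candidates in rename.items():
        new = rng.choice(candidates)
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)

    # 2. Prepend a random number of comment lines, shifting every line number.
    offset = rng.randint(1, 5)
    header = "\n".join("# autogenerated header" for _ in range(offset))
    mutated = header + "\n" + source

    # 3. Keep the ground-truth line numbers aligned with the mutated file.
    return mutated, [n + offset for n in bug_lines]
```

Because the generator is seeded, the same seed always yields the same file and the same answer key, which keeps grading reproducible even though the surface form changes between episodes.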

	### The Grader (`grader.py`)

	Scores are deterministic:

	```
	Base reward = % of issues found + % correct line numbers + quality bonus
	Penalties for:
	- Vague explanations (< 10 words per issue)
	- False positives (claiming issues that don't exist)
	- Lazy submissions (category-only, no detail)
	```
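
As a rough sketch of how such a formula could be computed, the function below combines the issue-found and line-number terms and subtracts the three penalties. It is not the contents of `grader.py`: the equal 0.5 weights and the penalty sizes are assumptions, and the quality bonus is left out because this post does not define it.

```python
def grade(findings, truth, min_words=10):
    """Deterministically score a review.

    findings: list of dicts with "line", "category", and "explanation".
    truth:    list of dicts with "line" and "category".
    Weights and penalty sizes are illustrative assumptions.
    """
    # Fraction of true issues whose category was reported at all.
    found = sum(any(f["category"] == t["category"] for f in findings) for t in truth)
    # Fraction of true issues reported with the exact line number.
    located = sum(
        any(f["category"] == t["category"] and f["line"] == t["line"] for f in findings)
        for t in truth
    )
    n = max(len(truth), 1)
    reward = 0.5 * found / n + 0.5 * located / n

    for f in findings:
        words = len(f.get("explanation", "").split())
        if words == 0:
            reward -= 0.2  # lazy: category only, no detail
        elif words < min_words:
            reward -= 0.1  # vague: fewer than min_words words
        if not any(f["category"] == t["category"] for t in truth):
            reward -= 0.2  # false positive: no such issue exists
    return round(reward, 4)
```

The same inputs always produce the same score, so anyone can re-run the grader on a logged review and check the number.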

	### Training with GRPO

	We train code reviewers with Group Relative Policy Optimization (GRPO), the algorithm introduced in the DeepSeekMath paper and later used for DeepSeek-R1:

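	```python
	from trl import GRPOTrainer

	# `model`, `training_args`, `dataset`, and `processor` are defined elsewhere.
	# `probe_reward` is an assumed name for a function that wraps the
	# deterministic grader and returns one float per sampled completion.

	# Training on 50 episodes per task
	trainer = GRPOTrainer(
	    model=model,
	    reward_funcs=probe_reward,  # GRPOTrainer requires at least one reward function
	    args=training_args,
	    train_dataset=dataset,
	    processing_class=processor,
	)
	trainer.train()
	```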
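
TRL's `GRPOTrainer` scores sampled completions through one or more reward functions passed as `reward_funcs`. Each function receives the batch of completions, plus any extra dataset columns as keyword arguments, and returns one float per completion. The sketch below shows that shape for plain-text completions. The JSON output format and the `expected_line` dataset column are assumptions made for illustration, and a real adapter would call the full grader instead of comparing a single line number.

```python
import json


def probe_reward(completions, expected_line, **kwargs):
    """Return one reward per completion, in the shape GRPOTrainer expects.

    `expected_line` is a hypothetical dataset column holding the true line
    number for each prompt; `**kwargs` absorbs the other columns TRL passes.
    """
    rewards = []
    for completion, target in zip(completions, expected_line):
        try:
            review = json.loads(completion)
            reported = int(review["line"])
        except (ValueError, KeyError, TypeError):
            rewards.append(-1.0)  # unparseable output gets a fixed penalty
            continue
        rewards.append(1.0 if reported == target else 0.0)
    return rewards
```

Keeping the reward function pure (no network calls, no randomness) is what lets the training signal stay as reproducible as the grader itself.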

	Why GRPO?
	- Doesn't require a separate value (critic) model, unlike PPO
	- Samples a group of completions per prompt and scores each one relative to the group's mean reward
	- Works well with small compute budgets
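
The "group relative" part comes down to a small normalization step. For each prompt, GRPO samples several completions, scores them, and turns each reward into an advantage by subtracting the group mean and dividing by the group standard deviation. The sketch below shows only that step. The function name is made up, and using the population standard deviation with a small `eps` is an assumption, since implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that beat their siblings get a positive advantage and the rest get a negative one, so no learned value function is needed to provide a baseline.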

	## Results: Before & After Training

	Our baseline (GPT-4o-mini) finds ~60% of issues. After GRPO training on 50 episodes:

	- Final 10 episodes avg: 78% accuracy
	- Improvement: +18 percentage points
	- Training time: ~2 hours on single GPU

	See full metrics in `reports/JUDGE_REPORT.md`.

	## Try It Now

	- Live Space: https://huggingface.co/spaces/themahipalthakur/PRobe
	- Training Notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/FILL_COLAB_LINK)
	- Code: https://github.com/themahipalthakur/PRobe

	## Why This Matters

	Security reviews are expensive and rare. PRobe demonstrates that AI can be trained to reason about intent, not just pattern-match. The deterministic grader ensures:

	- ✅ Reproducible results
	- ✅ No gaming via prompt engineering
	- ✅ Transparent scoring (anyone can verify the grade)

	This is a step toward AI systems that integrate into real security workflows—not because they're perfect, but because they're verifiable.

	---

	Made for the [OpenEnv Hackathon](https://huggingface.co/spaces/open-env/open-env-hackers)

	Questions? Open an issue on [GitHub](https://github.com/themahipalthakur/PRobe).