mahithakur


PRobe: Training an AI Code Reviewer That Spots Backdoors

An interactive RL environment where models learn to review code like security engineers, not linters.

The Problem: Supply Chain Attacks Look Like Normal Code

Attacks like SolarWinds and XZ Utils show that malicious code can hide in plain sight. A backdoor often looks like a legitimate refactor or bug fix, and it is hard to tell apart from a normal change unless the reviewer understands its intent.

Current LLM code reviewers excel at spotting obvious bugs (undefined variables, off-by-one errors) but struggle with intentional sabotage. They're pattern matchers, not investigators.

Our Solution: PRobe — A Deterministic Training Environment

PRobe is a Python code review environment where AI agents learn to:

Find real bugs and security issues — with accurate line numbers
Distinguish honest mistakes from deliberate backdoors — and decide when to escalate
Explain findings precisely — vague answers get penalized

Unlike many "LLM judge" benchmarks, PRobe uses deterministic, reproducible rewards. No expensive API calls to grade submissions. No gaming via keyword spam.

What Makes PRobe Different

Feature	Traditional Benchmarks	PRobe
Reward	LLM evaluator (variable, slow, expensive)	Deterministic algorithm (fast, reproducible)
Feedback	"Good job!" or "Try again"	Specific penalties for imprecision
Anti-gaming	None	Keyword spam on wrong lines penalized
Tasks	Curated examples	10 procedurally generated scenarios
Escalation	Find bugs OR approve	Find bugs OR request changes OR escalate to security

How It Works (60 seconds)

# 1. Clone and setup
git clone https://huggingface.co/spaces/themahipalthakur/PRobe
cd PRobe
uv sync

# 2. Run locally
uv run python run.py

# 3. Open browser
# Visit http://localhost:8000/ui/ and click "New Episode"

The UI shows:

A Python file with 1-3 injected bugs or backdoors
A grading rubric explaining what the model should find
The model's response (approval, change request, or escalation)
A score based on accuracy and confidence
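
To make the three response types concrete, here is a minimal sketch of how a review could be represented. The names `Verdict`, `Finding`, and `Review` are hypothetical and chosen for illustration; PRobe's actual schema may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    """The three outcomes a reviewer can choose (names are assumptions)."""
    APPROVE = "approve"
    REQUEST_CHANGES = "request_changes"
    ESCALATE = "escalate"


@dataclass
class Finding:
    line: int          # 1-indexed line number in the reviewed file
    category: str      # e.g. "sql_injection"
    explanation: str   # free-text justification


@dataclass
class Review:
    verdict: Verdict
    findings: list[Finding] = field(default_factory=list)

    def is_consistent(self) -> bool:
        # Approving a file while also reporting findings is contradictory.
        return not (self.verdict is Verdict.APPROVE and self.findings)
```

Keeping the verdict separate from the findings lets a grader check each one on its own: a review can name the right lines and still be wrong about whether to escalate.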

Technical Architecture

The Environment (`environment.py`)

10 tasks with progressively harder scenarios:
- Tasks 0-2: Simple bugs (off-by-one, undefined var, logic error)
- Tasks 3-5: Security issues (hardcoded API keys, disabled SSL, SQL injection)
- Tasks 6-9: Adversarial backdoors (hiding intent via indirection, obfuscation)
Mutation system: Variable names and line numbers shift each episode, preventing memorization
Causal unlocks: Context hints appear only after the model finds related issues
Episode memory: Models can reference previous findings within an episode
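
As an illustration of the mutation idea, the sketch below renames identifiers and shifts line numbers from a seed. It is not PRobe's implementation: the `RENAMES` table, the `mutate` function, and the header-padding trick are all assumptions made for the example.

```python
import random
import re

# Hypothetical rename table: template identifier -> candidate replacements.
RENAMES = {
    "items": ["values", "records", "entries"],
    "total": ["acc", "running_sum", "result"],
}

SAMPLE = (
    "def total_of(items):\n"
    "    total = 0\n"
    "    for i in range(len(items) - 1):\n"  # injected off-by-one on line 3
    "        total += items[i]\n"
    "    return total"
)


def mutate(source: str, bug_lines: list[int], seed: int) -> tuple[str, list[int]]:
    """Return a mutated copy of `source` and the shifted 1-indexed bug lines."""
    rng = random.Random(seed)  # local RNG: the same seed gives the same episode
    # Rename whole-word identifiers only; \b leaves "total_of" untouched.
    for old in sorted(RENAMES):
        new = rng.choice(RENAMES[old])
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    # Prepend 0-3 comment lines so every line number shifts by `pad`.
    pad = rng.randint(0, 3)
    header = [f"# generated header line {i + 1}" for i in range(pad)]
    mutated = "\n".join(header + source.splitlines())
    return mutated, [n + pad for n in bug_lines]


mutated_source, mutated_bug_lines = mutate(SAMPLE, bug_lines=[3], seed=7)
```

Because the ground-truth line numbers are shifted together with the code, the grader can stay deterministic even though each episode looks different.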

The Grader (`grader.py`)

Scores are deterministic:

Base reward = % of issues found + % correct line numbers + quality bonus
Penalties for:
  - Vague explanations (< 10 words per issue)
  - False positives (claiming issues that don't exist)
  - Lazy submissions (category-only, no detail)
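
The scoring rule above can be sketched roughly as follows. The weights, penalty sizes, and the `Issue`, `Finding`, and `score` names are assumptions for illustration only; the real values live in `grader.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    line: int
    category: str


@dataclass(frozen=True)
class Finding:
    line: int
    category: str
    explanation: str


# Illustrative weights and penalties (assumed, not PRobe's actual numbers).
W_FOUND, W_LINE, W_QUALITY = 0.5, 0.3, 0.2
VAGUE_PENALTY = 0.1
FALSE_POSITIVE_PENALTY = 0.15
MIN_WORDS = 10


def score(findings: list[Finding], issues: list[Issue]) -> float:
    """Deterministic score in [0, 1]: same inputs always give the same result."""
    if not issues:
        return 0.0
    remaining = list(issues)
    found = exact = detailed = 0
    penalty = 0.0
    for f in findings:
        same_cat = [i for i in remaining if i.category == f.category]
        if not same_cat:
            # Nothing left to match: a false positive or repeated keyword spam.
            penalty += FALSE_POSITIVE_PENALTY
            continue
        # Prefer an exact line match, otherwise take the first same-category issue.
        match = next((i for i in same_cat if i.line == f.line), same_cat[0])
        remaining.remove(match)  # each real issue can be claimed only once
        found += 1
        exact += match.line == f.line
        if len(f.explanation.split()) >= MIN_WORDS:
            detailed += 1
        else:
            penalty += VAGUE_PENALTY
    n = len(issues)
    raw = (W_FOUND * found + W_LINE * exact + W_QUALITY * detailed) / n - penalty
    return max(0.0, min(1.0, raw))
```

Since each real issue can be matched only once, repeating the same category on many lines earns credit at most once and collects a false-positive penalty for every extra claim.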

Training with GRPO

We use Group Relative Policy Optimization (GRPO, introduced in DeepSeekMath and later used to train DeepSeek-R1) to train code reviewers:

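# Training on 50 episodes per task
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    # GRPOTrainer requires reward_funcs; probe_reward is a hypothetical
    # callable that wraps the PRobe grader and returns one float per completion.
    reward_funcs=probe_reward,
    args=training_args,  # typically a trl.GRPOConfig
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()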

Why GRPO?

Doesn't require a separate value (critic) model (unlike PPO)
Groups examples for relative reward comparison
Works well with small compute budgets
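
The "relative reward comparison" can be sketched in a few lines: each completion sampled for the same prompt is scored, and its advantage is its reward normalized against the rest of the group. The function name and the choice of population standard deviation are assumptions for this example; training libraries may normalize slightly differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four reviews of the same file, scored by the grader.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Reviews that score above the group average get a positive advantage and are reinforced, and those below average are discouraged, so no learned value model is needed to provide a baseline.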

Results: Before & After Training

Our baseline (GPT-4o-mini) finds ~60% of issues. After GRPO training on 50 episodes:

Final 10 episodes avg: 78% accuracy
Improvement: +18 percentage points
Training time: ~2 hours on single GPU

See full metrics in reports/JUDGE_REPORT.md.

Try It Now

Live Space: https://huggingface.co/spaces/themahipalthakur/PRobe
Training Notebook:
Code: https://github.com/themahipalthakur/PRobe

Why This Matters

Security reviews are expensive and rare. PRobe's results suggest that AI can be trained to reason about intent, not just pattern-match. The deterministic grader ensures:

✅ Reproducible results
✅ No gaming via prompt engineering
✅ Transparent scoring (anyone can verify the grade)

This is a step toward AI systems that integrate into real security workflows—not because they're perfect, but because they're verifiable.

Made for the OpenEnv Hackathon

Questions? Open an issue on GitHub.