Training an AI Code Reviewer That Spots Backdoors
An interactive RL environment where models learn to review code like security engineers, not linters.
The Problem: Supply Chain Attacks Look Like Normal Code
Recent attacks like XZ Utils and SolarWinds remind us that malicious code can hide in plain sight. A backdoor often looks like a
legitimate refactor or bug fix—it's indistinguishable from normal changes without understanding intent.
Current LLM code reviewers excel at spotting obvious bugs but struggle with intentional sabotage. They're pattern matchers, not
investigators.
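To make this concrete, here is a hypothetical example (not taken from PRobe's task set) of the kind of change that reads like a harmless cleanup but quietly weakens security:

```python
import hmac

def check_token_before(supplied: str, expected: str) -> bool:
    # Constant-time comparison: resistant to timing attacks.
    return hmac.compare_digest(supplied, expected)

def check_token_after(supplied: str, expected: str) -> bool:
    # Looks like a simplification of the same logic, but `==` can
    # short-circuit on the first mismatched byte, leaking timing
    # information an attacker can measure to recover the token.
    return supplied == expected
```

Both versions return the same results on every input, so a pattern matcher sees a no-op refactor; only a reviewer reasoning about intent flags the downgrade.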
Our Solution: PRobe — A Deterministic Training Environment
PRobe is a Python code review environment where AI agents learn to:
- Find real bugs and security issues — with accurate line numbers
- Distinguish honest mistakes from deliberate backdoors — and decide when to escalate
- Explain findings precisely — vague answers get penalized
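A submission that satisfies these three requirements might look like the sketch below. The field names are illustrative assumptions, not PRobe's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One reviewer finding; hypothetical shape, for illustration only."""
    issue: str              # e.g. "sql-injection", "backdoor"
    line: int               # 1-indexed line in the reviewed file
    explanation: str        # vague explanations are penalized
    escalate: bool = False  # flag suspected deliberate sabotage

# Example finding for a suspicious auth change on line 42:
finding = Finding(
    issue="backdoor",
    line=42,
    explanation="Token comparison replaced with non-constant-time ==, "
                "enabling a timing side channel.",
    escalate=True,
)
```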
Unlike many "LLM judge" benchmarks, PRobe uses deterministic, reproducible rewards. No expensive API calls to grade submissions.
Key Features
- No LLM judge: reward is deterministic and reproducible
- Anti-gaming: keyword spam on wrong lines gets penalized
- 10 tasks that simulate real review situations (bugs + adversarial backdoors)
- Mutator: changes variable names/line numbers so the model can't memorize answers
- Deterministic grader: scores based on "right issue + right place + good explanation"
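The grading logic above can be sketched roughly as follows. The weights, line tolerance, and "good explanation" heuristic here are assumptions for illustration; PRobe's actual scoring rules may differ:

```python
def grade(findings, expected, line_tol=1, spam_penalty=0.25):
    """Deterministic grader sketch: reward right issue + right place +
    substantive explanation; penalize findings that match nothing,
    which is what defeats keyword spam on wrong lines."""
    score = 0.0
    matched = set()
    for issue, line, explanation in findings:
        hit = next(
            (i for i, (e_issue, e_line) in enumerate(expected)
             if i not in matched
             and issue == e_issue
             and abs(line - e_line) <= line_tol),
            None,
        )
        if hit is None:
            score -= spam_penalty  # wrong issue or wrong place
        else:
            matched.add(hit)
            score += 1.0
            if len(explanation.split()) >= 8:  # crude explanation check
                score += 0.5
    # Normalize so a perfect review scores 1.0
    return score / (1.5 * len(expected))
```

Because the score is a pure function of the submission and the answer key, the same review always gets the same reward, with no LLM judge in the loop.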
Results
Our baseline (GPT-4o-mini) finds ~60% of issues. After 50 episodes of GRPO training:
- Final 10 episodes avg: 78% accuracy
- Improvement: +18 percentage points
- Training time: ~2 hours on single GPU
See full metrics in /reports/JUDGE_REPORT.md
Try It Now
- Live Space: https://huggingface.co/spaces/mahithakur/PRobe
- Code: https://huggingface.co/spaces/mahithakur/PRobe
Made for the OpenEnv Hackathon