| # I Taught a Small AI to Review Code Better Than a Model 10x Its Size | |
| *And I did it in 48 hours at the Meta × HuggingFace OpenEnv Grand Finale, Bangalore* | |
| --- | |
| Imagine you're a junior developer. You just spent three days writing a feature. You open a pull request. A senior engineer reviews it in 10 minutes and finds five critical bugs you completely missed: a SQL injection vulnerability, a race condition, a memory leak, and two logic errors that would have taken down production on a Friday night. | |
| Now imagine that senior engineer never sleeps, never gets tired, and gets *better* at code review every single time they review code. | |
| That's what I built. | |
| --- | |
| ## The Problem Nobody Talks About | |
| Code review is broken. | |
| Not because developers don't care; they do. But the volume of code being written has outpaced the human capacity to review it carefully. Studies estimate this costs the software industry **$50 billion annually** in bugs that reach production. | |
| The uncomfortable truth: **every production bug was approved by at least one human reviewer.** | |
| Existing AI tools can suggest code. They can autocomplete. But none of them *learn* from feedback. None of them get better at reviewing the more they review. | |
| I wanted to change that. | |
| --- | |
| ## The Idea: What If Code Review Was a Game? | |
| Here's the insight that drove everything: | |
| > Code review has a verifiable ground truth. I *know* which lines have bugs. I *know* how severe they are. I *know* what a correct verdict looks like. | |
| If you have ground truth, you have a reward signal. If you have a reward signal, you can train an AI agent using reinforcement learning. | |
| So I built **CodeReviewEnv**, an OpenEnv-compliant RL environment where an AI agent plays the role of a senior code reviewer. The agent reads a buggy code diff, submits structured comments with line numbers and severity ratings, issues a verdict, and then (here's what makes it novel) **tries to fix the bugs it found**. | |
| The environment scores every move. Find a critical bug? +0.20. Submit a false alarm? -0.08. Fix the bug correctly? +0.40. The agent learns, episode by episode, what good code review actually looks like. | |
| --- | |
| ## What the Agent Sees | |
| Every episode, the agent receives a real buggy code diff. Here's an excerpt from our `hard` task, a threaded task queue with subtle concurrency bugs: | |
| ```python | |
| class TaskQueue: | |
|     def __init__(self, max_workers=4): | |
|         self.queue = []  # no lock on this list | |
|         self.workers = [] | |
|         self.running = False | |
|  | |
|     def _worker(self): | |
|         while self.running: | |
|             if self.queue: | |
|                 task_fn, args = self.queue.pop(0)  # race condition! | |
|                 try: | |
|                     task_fn(*args) | |
|                 except Exception: | |
|                     pass  # silent failure, bug is lost forever | |
|  | |
|     def stop(self): | |
|         self.running = False | |
|         # no thread.join(): threads keep running after stop() returns! | |
| ``` | |
| A junior developer might look at this and think: *"Looks fine to me."* | |
| A senior engineer immediately spots four problems: the unprotected list access, the race condition on `pop(0)`, the swallowed exception, and the missing `join()`. | |
| Our trained agent now spots all four. | |
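For reference, here is one way those four issues could be addressed. This is a minimal sketch, not the task's reference solution: the class name `SafeTaskQueue` and the `add_task` and `start` methods are assumptions (the excerpt doesn't show how tasks are added or how workers are started), and draining the queue inside `stop()` is a design choice rather than a requirement.

```python
import logging
import queue
import threading

logger = logging.getLogger(__name__)


class SafeTaskQueue:
    """Hypothetical repaired version of the TaskQueue excerpt."""

    def __init__(self, max_workers=4):
        self.queue = queue.Queue()  # thread-safe, replaces the bare list
        self.max_workers = max_workers
        self.workers = []
        self.running = False

    def add_task(self, task_fn, *args):
        self.queue.put((task_fn, args))

    def start(self):
        self.running = True
        for _ in range(self.max_workers):
            worker = threading.Thread(target=self._worker, daemon=True)
            worker.start()
            self.workers.append(worker)

    def _worker(self):
        while self.running:
            try:
                # get() is atomic; the timeout lets the loop re-check
                # self.running instead of blocking forever.
                task_fn, args = self.queue.get(timeout=0.1)
            except queue.Empty:
                continue
            try:
                task_fn(*args)
            except Exception:
                # Log the failure instead of swallowing it.
                logger.exception("task %r failed", task_fn)
            finally:
                self.queue.task_done()

    def stop(self):
        # Assumes start() was called; otherwise join() would wait forever.
        self.queue.join()  # wait until every queued task has been processed
        self.running = False
        for worker in self.workers:
            worker.join()  # stop() returns only after the threads exit
        self.workers.clear()
```

Swapping the list for `queue.Queue` removes both the unprotected access and the `pop(0)` race in one move, which is usually simpler than adding a lock around every list operation.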
| --- | |
| ## The Environment: 13 Tasks, 6 Languages | |
| I built 13 carefully curated tasks spanning real bug patterns across the full stack: | |
| | Language | Task | What's Hidden | | |
| |----------|------|---------------| | |
| | Python | Basic Bug Detection | ZeroDivisionError, IndexError, O(n²) complexity | | |
| | Python | Security Review | SQL injection, MD5 hashing, pickle RCE | | |
| | Python | Concurrency Hunt | Race conditions, dict mutation during iteration | | |
| | Python | API Security | JWT bypass, CORS wildcard, IDOR, debug mode | | |
| | Python | JWT Auth System | No token expiry, session fixation, timing attack | | |
| | Python | ORM Bugs | N+1 queries, missing indexes, no commit() | | |
| | Python | Data Pipeline | int8 overflow, bare except, file handle leak | | |
| | JavaScript | Async Flow | Missing await, callback hell, memory leak | | |
| | JavaScript | Async Handler | Unhandled Promise.all, event listener leak | | |
| | JavaScript | Race Condition | Inventory oversell via stale state | | |
| | SQL | Injection Hunt | ORDER BY injection, LIMIT injection | | |
| | React/JSX | Component Security | XSS via dangerouslySetInnerHTML, token leak in URL | | |
| | Django | Auth Logic | Plaintext comparison, timing attack, DoesNotExist crash | | |
| Each task has **known ground truth**: exact line numbers, severity levels, and keywords that tell the grader what a correct review looks like. | |
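To make the idea concrete, here is a sketch of how ground truth like this could be stored and matched against a review comment. The `KnownIssue` fields, the `matches` helper, the one-line tolerance, and the sample issues are all illustrative assumptions, not the environment's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    line: int
    severity: str    # "critical", "major", or "minor"
    keywords: tuple  # at least one must appear in the comment text


def matches(issue, line_number, description, tolerance=1):
    """True if a review comment plausibly refers to a known issue."""
    near = abs(issue.line - line_number) <= tolerance
    text = description.lower()
    return near and any(k.lower() in text for k in issue.keywords)


# Hypothetical ground truth for an easy task.
GROUND_TRUTH = [
    KnownIssue(line=5, severity="critical", keywords=("zerodivision", "empty")),
    KnownIssue(line=12, severity="minor", keywords=("o(n", "quadratic")),
]
```

Requiring both a nearby line number and a keyword hit is what keeps a vague comment on the right line, or a precise comment on the wrong line, from counting as a find.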
| --- | |
| ## The Reward Signal: Dense, Anti-Gameable, Fair | |
| Good RL environments have good reward signals. Here's ours: | |
| ``` | |
| +0.20 Found a critical bug (the ones that take down production) | |
| +0.12 Found a major bug (the ones that cause data loss) | |
| +0.05 Found a minor bug (the ones that cause confusion) | |
| -0.08 False positive (flagging code that isn't broken) | |
| +0.10 Correct verdict (approve vs request_changes) | |
| -0.15 Wrong verdict (approving broken code) | |
| +0.40 Correct fix for critical (actually fixed what you found) | |
| -0.02 Per step (efficiency: be fast, be right) | |
| ``` | |
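As a rough sketch, the components listed above could be combined like this. The numbers come from the list; the function name, its arguments, and the assumption that the components are simply summed are mine.

```python
SEVERITY_REWARD = {"critical": 0.20, "major": 0.12, "minor": 0.05}
FALSE_POSITIVE = -0.08
CORRECT_VERDICT = 0.10
WRONG_VERDICT = -0.15
CRITICAL_FIX = 0.40
STEP_COST = -0.02


def episode_reward(found, false_positives, verdict_correct,
                   critical_fixes=0, steps=1):
    """Sum the reward components for one episode (assumed to be additive).

    found: severities of the known issues the agent matched.
    """
    total = sum(SEVERITY_REWARD[severity] for severity in found)
    total += FALSE_POSITIVE * false_positives
    total += CORRECT_VERDICT if verdict_correct else WRONG_VERDICT
    total += CRITICAL_FIX * critical_fixes
    total += STEP_COST * steps
    return round(total, 2)


# One critical and one minor find, one false alarm, correct verdict,
# one correct critical fix, one step.
print(episode_reward(["critical", "minor"], 1, True, critical_fixes=1))  # 0.65
```

Because finds and fixes stack, a thorough review of a task with several critical bugs can score above 1.0.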
| But here's the part most RL environments skip: **anti-reward hacking**. | |
| A naive agent quickly discovers it can spam 30 vague comments, hope some accidentally match known issues, and always say "request_changes" to grab the verdict bonus. I caught this early and built four server-side checks: | |
| - **Spam detection**: more than 12 comments triggers a proportional penalty | |
| - **Duplicate detection**: copy-pasting the same comment triggers -0.20 | |
| - **Quality check**: descriptions under 15 characters are penalized | |
| - **Verdict gaming**: `request_changes` with zero comments gets caught | |
| The result: the agent *had* to actually understand the code to score well. | |
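A sketch of what these four checks might look like in code. The 12-comment limit, the 15-character minimum, and the -0.20 duplicate penalty come from the list above; the other three penalty values and the function name are placeholders I chose for illustration.

```python
MAX_COMMENTS = 12
MIN_DESCRIPTION_LEN = 15
DUPLICATE_PENALTY = -0.20       # stated in the list above
SPAM_PENALTY_PER_EXTRA = -0.05  # assumed value
SHORT_DESC_PENALTY = -0.05      # assumed value
EMPTY_VERDICT_PENALTY = -0.10   # assumed value


def hacking_penalty(descriptions, verdict):
    """Return (penalty, reasons) for one submitted review."""
    penalty, reasons = 0.0, []

    # Spam: penalty grows with every comment past the limit.
    extra = len(descriptions) - MAX_COMMENTS
    if extra > 0:
        penalty += SPAM_PENALTY_PER_EXTRA * extra
        reasons.append("spam")

    # Duplicates: the same text submitted more than once.
    normalized = [d.strip().lower() for d in descriptions]
    if len(set(normalized)) < len(normalized):
        penalty += DUPLICATE_PENALTY
        reasons.append("duplicate")

    # Quality: descriptions too short to say anything useful.
    short = sum(1 for d in normalized if len(d) < MIN_DESCRIPTION_LEN)
    if short:
        penalty += SHORT_DESC_PENALTY * short
        reasons.append("low_quality")

    # Verdict gaming: asking for changes without naming any.
    if verdict == "request_changes" and not descriptions:
        penalty += EMPTY_VERDICT_PENALTY
        reasons.append("verdict_gaming")

    return round(penalty, 2), reasons
```

Running the checks server-side matters: the agent never sees the rules, only the reward, so it can't special-case its way around them.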
| --- | |
| ## Curriculum Learning: The Agent Earns Harder Tasks | |
| I didn't just throw the agent at concurrency bugs on day one. | |
| CodeReviewEnv implements adaptive curriculum learning. The agent starts on easy Python bugs. When it averages above 0.75 for three consecutive episodes, the environment automatically promotes it to medium difficulty. Then hard. Then security tasks. | |
| ``` | |
| Episode 1-20: easy Python bugs → agent builds pattern recognition | |
| avg 0.78 → promoted to medium! | |
| Episode 20-60: SQL injection, hardcoded secrets → security patterns | |
| avg 0.72 → promoted to hard! | |
| Episode 60-200: Race conditions, JWT bypass → complex reasoning | |
| scores climbing from 0.30 → 0.65... | |
| Episode 200-500: All tasks → generalizing across 6 languages | |
| ``` | |
| No human decides when to increase difficulty. The environment does, based purely on the agent's performance. | |
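The promotion rule can be sketched in a few lines. This is an illustration, not the environment's code: the `Curriculum` class, the level names, and the reading of "averages above 0.75 for three consecutive episodes" as a mean over a three-episode window are assumptions.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard", "security"]


class Curriculum:
    def __init__(self, threshold=0.75, window=3):
        self.threshold = threshold
        self.level = 0
        self.recent = deque(maxlen=window)

    @property
    def difficulty(self):
        return LEVELS[self.level]

    def record(self, score):
        """Record one episode score; return True if the agent was promoted."""
        self.recent.append(score)
        window_full = len(self.recent) == self.recent.maxlen
        mean = sum(self.recent) / len(self.recent)
        can_promote = self.level < len(LEVELS) - 1
        if window_full and mean > self.threshold and can_promote:
            self.level += 1
            self.recent.clear()  # the next level has to be earned from scratch
            return True
        return False


curriculum = Curriculum()
for score in (0.9, 0.6, 0.8):
    curriculum.record(score)
print(curriculum.difficulty)  # medium
```

Clearing the window after a promotion prevents a streak on easy tasks from carrying the agent straight through medium.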
| --- | |
| ## GRPO Training: The Agent Actually Learns | |
| I trained **Qwen2.5-Coder-7B-Instruct** using GRPO (Group Relative Policy Optimization) on an A100 GPU, with CodeReviewEnv as the reward signal. | |
| 500 episodes. 2 hours 43 minutes. One A100. | |
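The "group relative" part of GRPO is easy to show in isolation: several reviews are sampled for the same diff, each is scored by the environment, and every reward is normalized against its own group. The sketch below covers only that step, not the training loop; the function name is mine, and it uses the population standard deviation, where some libraries use the sample version.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reviews of the same diff, scored by the environment.
advantages = group_relative_advantages([0.65, 0.08, 0.30, 0.65])
```

Reviews that beat their siblings get a positive advantage and are reinforced; if every review in a group scores the same, all advantages are zero and that group contributes no gradient.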
| The learning curve tells the story better than any table: | |
| - **Start**: reward ~0.60, agent finds some bugs but misses subtle ones | |
| - **Episode 50**: reward ~0.85, agent learned security patterns | |
| - **Episode 150**: reward ~1.05, agent consistently beats baseline | |
| - **Episode 250**: reward ~1.15, agent mastered most task types | |
| The red smoothed curve in the plot goes up. Consistently. That's not luck; that's genuine learning. | |
| --- | |
| ## The Result That Surprised Me | |
| I expected the trained model to be better than the baseline. I didn't expect *this*: | |
| | Task | Groq llama-3.3-**70B** (Baseline) | Qwen2.5-Coder-**7B** (After GRPO) | Change | | |
| |------|----------------------------------|-----------------------------------|--------| | |
| | easy | 0.95 | 1.13 | +0.18 | | |
| | medium | 0.90 | 1.28 | +0.38 | | |
| | hard | 0.15 | 0.48 | **+0.33 (3x!)** | | |
| | api_security | 0.90 | 1.20 | +0.30 | | |
| | auth_system | 0.00 | 1.13 | **+1.13 (from zero!)** | | |
| | **AVERAGE** | **0.58** | **1.04** | **+0.46** | | |
| A **7 billion parameter model** outperformed a **70 billion parameter model** by **0.46 reward points on average** (1.04 vs. 0.58, roughly 80% higher). | |
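The averages can be recomputed directly from the per-task scores in the table. The gap is 0.46 reward points in absolute terms, which works out to about 80% relative to the baseline average.

```python
baseline = {"easy": 0.95, "medium": 0.90, "hard": 0.15,
            "api_security": 0.90, "auth_system": 0.00}
trained = {"easy": 1.13, "medium": 1.28, "hard": 0.48,
           "api_security": 1.20, "auth_system": 1.13}

base_avg = sum(baseline.values()) / len(baseline)
trained_avg = sum(trained.values()) / len(trained)
absolute_gain = trained_avg - base_avg
relative_gain = absolute_gain / base_avg

print(f"{base_avg:.2f} -> {trained_avg:.2f} "
      f"(+{absolute_gain:.2f}, {relative_gain:.0%} relative)")
```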
| The `auth_system` task is the most dramatic: the 70B baseline scored **0.00**, completely failing to identify any JWT vulnerabilities. Our trained 7B model scored **1.13**, finding all seven vulnerabilities and fixing most of them correctly. | |
| This is what environment-driven RL training does. More parameters don't matter if you've never practiced. Our agent practiced 500 times against ground truth. The 70B model never did. | |
| --- | |
| ## The Bug Fixing Agent: Closing the Loop | |
| Most code review tools stop at *finding* bugs. Ours doesn't. | |
| After submitting a review, the agent calls our `/fix` endpoint with suggested code corrections: | |
| ``` | |
| [COMMENT] | |
| line: 5 | |
| severity: critical | |
| type: bug | |
| message: ZeroDivisionError when numbers list is empty: len(numbers) returns 0 | |
| fix: return total / len(numbers) if numbers else 0 | |
| [/COMMENT] | |
| [VERDICT] | |
| decision: request_changes | |
| [/VERDICT] | |
| ``` | |
| The fix verifier checks whether the correction addresses the known issue. Correct fix on a critical bug? +0.40 bonus reward. Wrong fix? -0.10 penalty. | |
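Turning the tagged output shown above into structured data takes only a small parser. This is a sketch under the assumption that every field sits on its own `key: value` line, as in the example; the function names are hypothetical and not part of the environment's API.

```python
import re

COMMENT_RE = re.compile(r"\[COMMENT\](.*?)\[/COMMENT\]", re.DOTALL)
VERDICT_RE = re.compile(r"\[VERDICT\](.*?)\[/VERDICT\]", re.DOTALL)


def parse_fields(block):
    """Parse 'key: value' lines; only the first colon splits key from value."""
    fields = {}
    for line in block.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_review(text):
    """Return (comments, decision) extracted from the model's raw output."""
    comments = []
    for block in COMMENT_RE.findall(text):
        fields = parse_fields(block)
        if fields.get("line", "").isdigit():
            fields["line"] = int(fields["line"])
        comments.append(fields)
    verdict = VERDICT_RE.search(text)
    decision = parse_fields(verdict.group(1)).get("decision") if verdict else None
    return comments, decision
```

Splitting on the first colon only keeps a `message` or `fix` value intact when it contains colons of its own.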
| This closes the loop from **detection** to **remediation**: the agent doesn't just tell you what's broken, it tells you how to fix it. | |
| --- | |
| ## What I Learned Building This | |
| **1. Ground truth is everything.** | |
| The reason this works is that code bugs have verifiable ground truth. The grader always knows the right answer. This is what separates code review from tasks like "write a good essay": there's no ambiguity in whether line 25 has a race condition. | |
| **2. Reward hacking appears faster than you expect.** | |
| Within 20 episodes of our first training run, the model discovered it could output `request_changes` with no comments and collect the +0.10 verdict bonus every time. I saw reward plateau at exactly +0.08 (the +0.10 bonus minus the 0.02 step penalty). The anti-hacking checks were not optional; they were essential. | |
| **3. Curriculum matters more than I thought.** | |
| Without curriculum, reward variance was so high the model couldn't find a learning signal. With curriculum, the training curve became smooth and consistent. This one change was the difference between a confused model and a learning one. | |
| **4. A small model with good training beats a big model without it.** | |
| This is the most important lesson. The 70B model has seen more code than our 7B model ever will. But it has never been corrected 500 times for missing a race condition. Practice beats knowledge. | |
| --- | |
| ## Try It Yourself | |
| The environment is live. You can test it right now. | |
| **Run the baseline agent on any task:** | |
| ```bash | |
| curl -X POST https://lucifer0077-code-review-env.hf.space/baseline \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_id": "hard"}' | |
| ``` | |
| **Submit your own review:** | |
| ```bash | |
| curl -X POST https://lucifer0077-code-review-env.hf.space/step \ | |
| -H "Content-Type: application/json" \ | |
|   -d '{ | |
|     "comments": [{ | |
|       "line_number": 25, | |
|       "issue_type": "bug", | |
|       "severity": "critical", | |
|       "description": "Race condition: queue.pop(0) not thread-safe" | |
|     }], | |
|     "verdict": "request_changes" | |
|   }' | |
| ``` | |
| **Check curriculum progress:** | |
| ```bash | |
| curl https://lucifer0077-code-review-env.hf.space/curriculum/state | |
| ``` | |
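The same `/step` submission can be built from Python with only the standard library. The sketch below constructs the request without sending it. The payload mirrors the curl example above, `build_step_request` is a hypothetical helper, and the response schema isn't described here, so inspect the raw body before relying on any field.

```python
import json
import urllib.request

BASE_URL = "https://lucifer0077-code-review-env.hf.space"


def build_step_request(comments, verdict, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /step endpoint."""
    payload = {"comments": comments, "verdict": verdict}
    return urllib.request.Request(
        f"{base_url}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_step_request(
    comments=[{
        "line_number": 25,
        "issue_type": "bug",
        "severity": "critical",
        "description": "Race condition: queue.pop(0) not thread-safe",
    }],
    verdict="request_changes",
)

# To actually send it:
#   with urllib.request.urlopen(request, timeout=30) as response:
#       print(response.read().decode("utf-8"))
```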
| --- | |
| ## Links | |
| **Live Demo**: https://lucifer0077-code-review-env.hf.space | |
| **Trained Model**: https://huggingface.co/lucifer0077/code-review-agent-grpo | |
| **GitHub**: https://github.com/Lucifer-cyber007/meta-hackathon-open-env | |
| --- | |
| ## What's Next | |
| This is version 1. Here's what version 2 looks like: | |
| - **Multi-step episodes**: agent reviews, gets feedback, revises. Like a real code review conversation. | |
| - **GitHub webhook**: environment hooks directly into real pull requests. Agent reviews your actual PRs. | |
| - **More languages**: Go, Rust, Java. Because bugs don't care what language you write in. | |
| - **Larger model**: 34B with full fine-tuning. If 7B beats 70B, what does 34B do? | |
| --- | |
| ## One Last Thing | |
| I built this in 48 hours. One developer. Limited coding experience. With AI assistance. | |
| The environment works. The training pipeline works. The agent genuinely improved. The results are real. | |
| If a beginner can build a system where a 7B model beats a 70B model at code review in 48 hours using open-source tools, imagine what a team of engineers can do with more time. | |
| That's the promise of OpenEnv. That's why I built CodeReviewEnv. | |
| --- | |
| *Built at Meta × HuggingFace × PyTorch OpenEnv Grand Finale, April 2026, Bangalore* | |
| *Aditya Sharma (lucifer0077)* | |
| *Theme 4: Self-Improving Agent | Theme 3.1: Professional Tasks* | |