# CodeArena: Teaching LLMs to Debug Code Through Reinforcement Learning
**An OpenEnv-compatible RL environment for iterative code repair with adaptive difficulty, hybrid grading, and self-improving agent memory.**
[Live Demo on Hugging Face Spaces](https://huggingface.co/spaces/ceoavinash/codearena-rl)
[Training Notebook on Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
[Source Code on GitHub](https://github.com/havinashpatil/meta)
---
## The Problem: Why We Built CodeArena
Every major AI coding assistant (GitHub Copilot, Cursor, Devin) is benchmarked on **code generation**. Can it write a function? Can it complete a snippet?
But here's the gap nobody is talking about: **what happens when the code breaks?**
In production, code breaks constantly. A real developer doesn't just generate code; they spend the majority of their time **reading error logs, reasoning about failure, iterating on fixes, and recovering from mistakes.** This iterative debugging loop is the core skill that separates a junior developer from a senior one.
Yet there is no standardized RL environment to train or evaluate an LLM on this capability. HumanEval measures one-shot generation. MBPP measures function completion. Neither measures what happens across multiple repair attempts when the first fix doesn't work.
**CodeArena** is the first open-source, OpenEnv-compatible reinforcement learning environment built specifically for **iterative code repair**.
---
## How CodeArena Works
### The Loop
CodeArena simulates the real-world debugging workflow:
```
1. Agent receives buggy Python code + error log
2. Agent proposes a fix
3. Environment executes the fix in a sandboxed subprocess
4. Environment runs unit tests and scores the fix
5. Agent receives reward + updated error log
6. Repeat up to 5 steps
```
This is fundamentally different from one-shot code generation benchmarks. The agent must:
- **Read and interpret error messages** from previous attempts
- **Track what it has already tried** (repeated fixes are penalized)
- **Decide whether to patch locally or rewrite entirely**
- **Optimize for efficiency**, not just correctness
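In OpenEnv terms, a minimal driver for this loop might look like the following. This is a sketch using `httpx` against a locally running server; `propose_fix` is a hypothetical stand-in for whatever agent you plug in.

```python
import httpx

BASE = "http://localhost:7860"  # assumes a locally running CodeArena server

# Start an episode; task_id="auto" hands task selection to the adaptive curriculum.
obs = httpx.post(f"{BASE}/reset", json={"task_id": "auto"}).json()

for step in range(5):  # episodes cap out at 5 repair attempts
    fix = propose_fix(obs)  # hypothetical: your agent reads buggy_code + error_log
    result = httpx.post(f"{BASE}/step", json={"proposed_fix": fix}).json()
    print(f"step {step}: reward={result['reward']:.2f}")
    if result.get("done"):
        break
    obs = result  # the next observation carries the updated error log
```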
### Architecture
```
Agent ──── POST /reset ───→ CodeArena Server ───→ Returns buggy_code + error_log
  │                               │
  │                               ├── Task Loader (9 tasks across 5 categories)
  │                               ├── Sandboxed Executor (subprocess + timeout)
  │                               ├── Hybrid Grader (tests + LLM judge)
  │                               ├── Algorithm Detector (complexity analysis)
  │                               └── Agent Memory (self-improving store)
  │
  └── POST /step ───────────────→ Returns observation, reward, done, info
```
The server is a standard FastAPI application that implements the OpenEnv specification (`/reset`, `/step`, `/state`). The `openenv.yaml` manifest defines the observation space (buggy code, error log, test results, previous attempts) and the action space (proposed fix).
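A stripped-down version of those endpoints might look like this. It is a sketch only; `load_task` and `grade` are hypothetical stand-ins for the server's internals.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ResetRequest(BaseModel):
    task_id: str = "auto"

class StepRequest(BaseModel):
    proposed_fix: str

@app.post("/reset")
def reset(req: ResetRequest):
    # Pick a task (or defer to the adaptive curriculum) and return the
    # initial observation.
    task = load_task(req.task_id)  # hypothetical helper
    return {"buggy_code": task["buggy_code"], "error_log": task["error_log"]}

@app.post("/step")
def step(req: StepRequest):
    # Execute the proposed fix in the sandbox, grade it, return the transition.
    result = grade(req.proposed_fix)  # hypothetical helper
    return {"observation": result["observation"], "reward": result["reward"],
            "done": result["done"], "info": result["info"]}
```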
---
## What Makes CodeArena Special (Environment Innovation)
### 1. Hybrid Grader: Tests + LLM-as-Judge
Most coding benchmarks use a single signal: did the tests pass? This creates a fundamental problem: agents learn to produce code that passes weak tests through reward-hacking (e.g., hardcoding expected outputs, or producing syntactically correct but semantically broken code).
CodeArena uses a **Hybrid Grader** with six weighted components:
| Component | Weight | What It Measures |
|---|---|---|
| `compile_score` | 15% | Code compiles without syntax errors |
| `test_pass_ratio` | 35% | Fraction of unit tests passed |
| `efficiency_score` | 30% | Execution time vs. optimal runtime |
| `llm_correctness` | 10% | LLM judge: is the fix logically correct? |
| `llm_security` | 5% | LLM judge: does the fix introduce vulnerabilities? |
| `llm_quality` | 5% | LLM judge: is the code readable and maintainable? |
Additionally, two penalties are applied:
- **Step penalty** (`-0.01 Γ step_count`): Rewards faster fixes
- **Novelty penalty** (`-0.10`): Penalizes submitting the same fix twice
The LLM judge is called via an OpenAI-compatible API (configurable to GPT-4o-mini, local Ollama, or HuggingFace Inference). When no API key is available, it falls back to neutral scores (0.5), so the environment always runs.
**Why this matters for training:** The heavy 30% weight on efficiency means that an agent that passes all tests with an O(n²) brute-force solution gets a significantly lower reward than one that uses an O(n) algorithm. This forces the model to learn *algorithmic reasoning*, not just syntax repair.
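Put together, the reward computation looks roughly like this. The weights and penalties mirror the table above; the helper itself is a sketch, not the server's actual code.

```python
def hybrid_reward(scores: dict, step_count: int, is_repeat: bool) -> float:
    """Weighted hybrid grade plus penalties, per the table above (a sketch)."""
    weights = {
        "compile_score":    0.15,
        "test_pass_ratio":  0.35,
        "efficiency_score": 0.30,
        "llm_correctness":  0.10,  # LLM-judge scores fall back to a neutral 0.5
        "llm_security":     0.05,  # when no API key is configured
        "llm_quality":      0.05,
    }
    reward = sum(w * scores[k] for k, w in weights.items())
    reward -= 0.01 * step_count   # step penalty: faster fixes score higher
    if is_repeat:
        reward -= 0.10            # novelty penalty for resubmitting the same fix
    return reward
```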
### 2. Adaptive Curriculum (Theme #4: Self-Improvement)
CodeArena doesn't use a fixed task set. It features an **Adaptive Curriculum** that tracks the agent's rolling average reward over recent episodes and automatically adjusts difficulty:
| Condition | Transition |
|---|---|
| avg reward > 0.80 on Easy | → Medium |
| avg reward > 0.75 on Medium | → Hard |
| avg reward < 0.35 on Hard | → Medium (de-escalate) |
| avg reward < 0.35 on Medium | → Easy (de-escalate) |
This is activated by passing `task_id: "auto"` to the `/reset` endpoint.
**Why this matters:** The agent cannot plateau by memorizing solutions to easy tasks. As soon as it masters syntax errors, the environment pushes it to algorithmic logic bugs. If it struggles, it recovers on easier tasks before trying again. This creates a natural *recursive skill amplification* loop: the environment drives the agent's own capability growth.
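A minimal difficulty controller implementing the table above might look like this. The thresholds come from the table; the rolling-window size is an assumption.

```python
from collections import deque

class AdaptiveCurriculum:
    """Sketch of the curriculum controller; window size is an assumption."""
    LEVELS = ["easy", "medium", "hard"]

    def __init__(self, window: int = 10):
        self.level = 0
        self.recent = deque(maxlen=window)

    def record(self, reward: float) -> str:
        """Record an episode reward and return the difficulty for the next one."""
        self.recent.append(reward)
        avg = sum(self.recent) / len(self.recent)
        if self.level == 0 and avg > 0.80:
            self.level = 1      # Easy -> Medium
        elif self.level == 1 and avg > 0.75:
            self.level = 2      # Medium -> Hard
        elif self.level == 2 and avg < 0.35:
            self.level = 1      # de-escalate to Medium
        elif self.level == 1 and avg < 0.35:
            self.level = 0      # de-escalate to Easy
        return self.LEVELS[self.level]
```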
### 3. Algorithm Detection + Adaptive Prompting
CodeArena includes a built-in **Algorithm Detector** (`server/algorithm_detector.py`) that:
1. **Classifies the problem type** (max subarray, two-sum, binary search, sliding window, etc.) from code patterns
2. **Estimates time complexity** by analyzing loop nesting depth (O(1) → O(n) → O(n²) → O(n³))
3. **Generates targeted optimization hints** (e.g., "Use Kadane's Algorithm O(n): `curr = max(num, curr+num)`")
When the AI fixer generates a repair, the algorithm detector provides **adaptive prompt suffixes** based on the current reward level:
- Low reward (< 0.4): "Focus on correctness. Fix syntax errors first."
- Medium reward (0.4–0.7): "Fix edge cases and logic bugs."
- High reward (> 0.7): "Optimize for performance. Use O(n) algorithms."
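The nesting-depth heuristic from step 2 above can be sketched in a few lines with the standard `ast` module. The real detector also pattern-matches problem types; this sketch shows only the depth estimate.

```python
import ast

def estimate_complexity(source: str) -> str:
    """Rough complexity estimate from maximum loop nesting depth (a sketch
    of the idea behind server/algorithm_detector.py, not its actual code)."""
    tree = ast.parse(source)

    def depth(node: ast.AST) -> int:
        # Count how deeply for/while loops nest beneath this node.
        is_loop = isinstance(node, (ast.For, ast.While))
        child = max((depth(c) for c in ast.iter_child_nodes(node)), default=0)
        return child + (1 if is_loop else 0)

    return {0: "O(1)", 1: "O(n)", 2: "O(n²)"}.get(depth(tree), "O(n³) or worse")

# estimate_complexity("for i in a:\n    for j in a:\n        pass")  -> "O(n²)"
```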
### 4. Self-Improving Agent Memory
CodeArena includes a persistent **Agent Memory** system (`server/memory.py`) that stores the best solution found for each task. When the agent encounters the same task type again, it can retrieve its previous best solution as a starting point.
This creates a genuine self-improvement loop:
- Episode 1: Agent fixes syntax → reward 0.45
- Episode 5: Agent recalls its best previous fix, optimizes further → reward 0.72
- Episode 10: Agent has accumulated enough memory to skip basic fixes entirely → reward 0.88
The memory is persisted to `agent_memory.json` and survives server restarts.
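In outline, such a store only needs two operations: recall the best known fix for a task, and remember a new one if it scores higher. This sketch assumes a simple JSON schema; the real `server/memory.py` may differ.

```python
import json
from pathlib import Path

class AgentMemory:
    """Sketch of a persistent best-solution store (schema is an assumption)."""

    def __init__(self, path: str = "agent_memory.json"):
        self.path = Path(path)
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    def recall(self, task_id: str) -> str | None:
        """Return the best known solution for a task, if any."""
        entry = self.store.get(task_id)
        return entry["solution"] if entry else None

    def remember(self, task_id: str, solution: str, reward: float) -> None:
        """Overwrite the stored solution only if the new fix beats the old best."""
        best = self.store.get(task_id, {"reward": float("-inf")})
        if reward > best["reward"]:
            self.store[task_id] = {"solution": solution, "reward": reward}
            self.path.write_text(json.dumps(self.store, indent=2))
```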
### 5. Rich Task Diversity
CodeArena ships with **9 tasks across 5 categories**:
| Category | Tasks | What It Tests |
|---|---|---|
| Easy (syntax) | Missing colons, wrong indentation | Basic Python syntax repair |
| Medium (logic) | Off-by-one errors, wrong conditions | Algorithmic reasoning |
| Hard (optimization) | O(n²) → O(n) refactoring | Algorithm design |
| Type Errors | Wrong types, missing conversions | Type system understanding |
| Security Bugs | SQL injection, path traversal | Security awareness |
Each task includes:
- Buggy source code
- Multiple unit tests
- An optimal execution time baseline (for efficiency scoring)
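Concretely, a task definition might look like the following. All field names and the example task are illustrative; see the repository's task loader for the actual schema.

```python
# Hypothetical task record; field names are assumptions, not the real schema.
task = {
    "id": "two_sum_offbyone",
    "category": "medium",               # one of the five categories above
    "buggy_code": "def two_sum(nums, target):\n    ...",
    "tests": [
        "assert two_sum([2, 7, 11], 9) == (0, 1)",
        "assert two_sum([3, 3], 6) == (0, 1)",
    ],
    "optimal_time_s": 0.002,            # baseline for the efficiency score
    "timeout_s": 5,                     # per-task sandbox timeout
}
```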
---
## Training Pipeline: TRL GRPO on CodeArena
We trained a coding model using **Hugging Face TRL's GRPO (Group Relative Policy Optimization)** trainer, connecting it directly to the CodeArena environment as a live reward signal.
### How It Works
```python
import httpx
from trl import GRPOConfig, GRPOTrainer

# The reward function queries CodeArena's /step endpoint: each completion is
# executed and graded by the live environment, and the score becomes the reward.
# (Simplified: assumes an episode has already been started via /reset.)
def codearena_reward_func(completions, **kwargs):
    rewards = []
    for completion in completions:
        # Conversational format: each completion is a list of message dicts.
        proposed_fix = completion[0].get('content', '').strip()
        res = httpx.post("http://localhost:7860/step",
                         json={"proposed_fix": proposed_fix})
        rewards.append(res.json().get('reward', 0.0))
    return rewards

# GRPO training with CodeArena as the reward environment
# (model and dataset are loaded earlier in the notebook)
trainer = GRPOTrainer(
    model=model,
    reward_funcs=codearena_reward_func,
    args=GRPOConfig(
        output_dir="./codearena-grpo",
        learning_rate=1e-5,
        max_steps=50,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
)
trainer.train()
```
The key insight is that **the reward is not static**: it comes from actually executing the agent's proposed code against real unit tests in a sandboxed environment, then grading it with the hybrid scorer. This is true environment-in-the-loop RL, not reward modeling on a frozen dataset.
### Training Results
We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.

*Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*

*Fig 2: Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging β exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*

*Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*

*Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*

*Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*

*Fig 6: Cumulative Reward over time, highlighting the total accumulated reward across multiple episodes.*

*Fig 7: LLM Fixer Method Performance Comparison scatter plot showing the individual performance data points of Ollama vs Builtin methods.*
### Reproducing the Training
The complete training pipeline is available as a Colab notebook:
👉 **[Open in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
The notebook:
1. Installs all dependencies (`trl`, `transformers`, `httpx`)
2. Clones the CodeArena repository
3. Starts the FastAPI backend server
4. Loads `Qwen2.5-Coder-1.5B` with GRPO configuration
5. Trains against the live environment
6. Logs rewards per step
---
## Live Demo: Try It Now
The fully-functional CodeArena environment is deployed on Hugging Face Spaces with a React frontend dashboard:
👉 **[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
### What You Can Do on the Live Demo:
1. **Start an Episode**: Select Easy/Medium/Hard difficulty and load a buggy code task
2. **Manual Fix**: Edit the code yourself and click "Run Step" to see your reward
3. **AI Fix**: Click the ✨ AI FIX button to have the built-in AI repair agent (powered by `Qwen2.5-Coder-3B-Instruct` via HuggingFace Serverless Inference) generate a fix
4. **Agent Mode**: Toggle auto-pilot to watch the agent autonomously fix → test → fix → test in a loop
5. **Sandbox Mode**: Paste your own arbitrary Python code and watch the environment evaluate it
The dashboard shows real-time reward components (compile score, test ratio, efficiency), a terminal log of every step, and a reward chart that updates live.
---
## Technical Deep Dive
### Sandboxed Execution
All agent-submitted code runs in an isolated subprocess with:
- **AST syntax validation** before execution (catches syntax errors without running code)
- **Timeout enforcement** (configurable per task, default 5s)
- **Temporary file execution** (code is written to a temp file, executed, then deleted)
- **Structured output parsing** (test results are communicated via a `|CODEARENA_STATS|` sentinel)
This ensures that malicious or infinite-loop code cannot crash the server.
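The core of that pipeline can be sketched as follows. Details of the real executor differ (in particular, it parses test results via the `|CODEARENA_STATS|` sentinel); this only shows the validate/write/run/timeout skeleton.

```python
import ast
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_s: float = 5.0) -> dict:
    """Sketch of the sandboxed execution pipeline described above."""
    try:
        ast.parse(code)  # AST validation: catch syntax errors without running code
    except SyntaxError as e:
        return {"ok": False, "error": f"SyntaxError: {e}"}

    # Write to a temp file, execute in a subprocess with a hard timeout, clean up.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout_s)
        return {"ok": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    finally:
        os.unlink(path)
```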
### AI Code Fixer Pipeline
The built-in AI fixer (`server/ai_fixer.py`, 600+ lines) implements a sophisticated multi-fallback pipeline:
1. **TGI / HuggingFace Serverless API** (Priority 1): Calls `Qwen2.5-Coder-3B-Instruct` for high-quality fixes
2. **Local Ollama** (Priority 2): Falls back to a local LLM if available
3. **AST Pattern-Based Fixer** (Priority 3): 20+ pattern rules for common Python bugs:
- Missing colons after `def`, `if`, `for`, `while`
- Missing `return` statements
- Wrong comparison operators (`=` → `==`)
- Missing `self` parameter in class methods
- Incorrect indentation repair
- And many more
The fixer also includes a **code validator** that rejects fixes that are worse than the original (e.g., ones that introduce new syntax errors), and a **self-critique loop** that re-checks the generated code before returning it.
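As one illustration of a pattern rule from the list above, a naive missing-colon repair might look like this. It is a sketch, not the repository's actual rule, and deliberately handles only bare `def`/`if`/`for`/`while` headers.

```python
import re

# Matches def/if/for/while headers whose last non-space character is not a colon.
_HEADER = re.compile(r"^(\s*(?:def|if|for|while)\b.*[^:\s])\s*$")

def fix_missing_colons(code: str) -> str:
    """Naive example of one pattern rule: restore missing trailing colons."""
    fixed = []
    for line in code.splitlines():
        m = _HEADER.match(line)
        fixed.append(m.group(1) + ":" if m else line)
    return "\n".join(fixed)

# fix_missing_colons("for i in range(3)\n    print(i)")
# -> "for i in range(3):\n    print(i)"
```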
### Complexity-Reward Tracking
Every fix is logged to `complexity_rewards.csv` with:
- Task ID
- Reward achieved
- Detected time complexity
- Fix method (TGI/Ollama/built-in)
This creates a research dataset that supports our core hypothesis: **agents that produce O(n) solutions consistently receive higher rewards than those producing O(n²) solutions.**
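Appending to that log is a one-liner per fix; a minimal sketch (column names are assumptions) follows.

```python
import csv
from pathlib import Path

def log_complexity_reward(task_id: str, reward: float,
                          complexity: str, method: str) -> None:
    """Append one row to complexity_rewards.csv (column names assumed)."""
    path = Path("complexity_rewards.csv")
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["task_id", "reward", "complexity", "method"])
        writer.writerow([task_id, reward, complexity, method])
```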
---
## Why CodeArena Matters
**Writing code is a solved problem.** GPT-4, Claude, Gemini: they can all generate working functions from natural language descriptions.
**Debugging code autonomously (reasoning about failure, iterating on fixes, recovering from wrong turns) is not solved.**
Every production coding system will eventually face broken code. There is no other standardized RL environment that trains and benchmarks iterative repair at this level. CodeArena fills that gap with:
- A **hybrid grader** that prevents reward-hacking
- An **adaptive curriculum** for continuous self-improvement
- A **persistent memory** for cross-episode learning
- A **rich task library** spanning syntax, logic, algorithms, types, and security
- Full **OpenEnv compatibility** for plug-and-play evaluation
CodeArena is infrastructure. Plug any model in. Run it. Get a number. Compare it against the baseline. Train on it. Watch it improve.
---
## Links & Resources
| Resource | Link |
|---|---|
| 🤗 Live Demo (HF Space) | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
| 📓 Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
| 💻 Source Code (GitHub) | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
| 📄 OpenEnv Manifest | [openenv.yaml](https://github.com/havinashpatil/meta/blob/main/openenv.yaml) |
| 📺 YouTube Video | [https://youtu.be/LmsJvAMTdCY](https://youtu.be/LmsJvAMTdCY) |
---
*Built for the OpenEnv Hackathon India 2026 (Theme #4: Self-Improvement)*