havinashpatil committed on
Commit ·
90be6c7
1
Parent(s): 5e35378
Final hackathon submission: polished README + detailed blog writeup
BLOG.md
ADDED
|
@@ -0,0 +1,294 @@
| 1 |
+
# CodeArena: Teaching LLMs to Debug Code Through Reinforcement Learning
|
| 2 |
+
|
| 3 |
+
**An OpenEnv-compatible RL environment for iterative code repair with adaptive difficulty, hybrid grading, and self-improving agent memory.**
|
| 4 |
+
|
| 5 |
+
[Hugging Face Space](https://huggingface.co/spaces/ceoavinash/codearena-rl)
[Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
[GitHub repository](https://github.com/havinashpatil/meta)
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## The Problem: Why We Built CodeArena
|
| 12 |
+
|
| 13 |
+
Every major AI coding assistant (GitHub Copilot, Cursor, Devin) is benchmarked on **code generation**. Can it write a function? Can it complete a snippet?
|
| 14 |
+
|
| 15 |
+
But here's the gap nobody is talking about: **what happens when the code breaks?**
|
| 16 |
+
|
| 17 |
+
In production, code breaks constantly. A real developer doesn't just generate code; they spend the majority of their time **reading error logs, reasoning about failure, iterating on fixes, and recovering from mistakes.** This iterative debugging loop is the core skill that separates a junior developer from a senior one.
|
| 18 |
+
|
| 19 |
+
Yet there is no standardized RL environment to train or evaluate an LLM on this capability. HumanEval measures one-shot generation. MBPP measures function completion. Neither measures what happens across multiple repair attempts when the first fix doesn't work.
|
| 20 |
+
|
| 21 |
+
**CodeArena** is the first open-source, OpenEnv-compatible reinforcement learning environment built specifically for **iterative code repair**.
|
| 22 |
+
|
| 23 |
+
---
|
| 24 |
+
|
| 25 |
+
## How CodeArena Works
|
| 26 |
+
|
| 27 |
+
### The Loop
|
| 28 |
+
|
| 29 |
+
CodeArena simulates the real-world debugging workflow:
|
| 30 |
+
|
| 31 |
+
```
|
| 32 |
+
1. Agent receives buggy Python code + error log
|
| 33 |
+
2. Agent proposes a fix
|
| 34 |
+
3. Environment executes the fix in a sandboxed subprocess
|
| 35 |
+
4. Environment runs unit tests and scores the fix
|
| 36 |
+
5. Agent receives reward + updated error log
|
| 37 |
+
6. Repeat up to 5 steps
|
| 38 |
+
```
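The loop above can be sketched as a small client. This is a minimal sketch, not the project's actual client: the `post` callable stands in for an HTTP client such as `httpx.post(...).json()`, and the response keys (`buggy_code`, `error_log`, `observation`, `reward`, `done`) are assumed from the fields this writeup names, not confirmed against `openenv.yaml`.

```python
from typing import Any, Callable, Dict

MAX_STEPS = 5  # the loop repeats up to 5 steps


def run_episode(
    post: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    propose_fix: Callable[[str, str], str],
    task_id: str = "easy",
) -> float:
    """Run one repair episode and return the best reward seen.

    `post(path, payload)` is a hypothetical transport that returns the
    decoded JSON body; `propose_fix(code, error_log)` is the agent.
    """
    obs = post("/reset", {"task_id": task_id})
    best = 0.0
    for _ in range(MAX_STEPS):
        fix = propose_fix(obs["buggy_code"], obs["error_log"])
        result = post("/step", {"proposed_fix": fix})
        best = max(best, result["reward"])
        if result["done"]:
            break
        # Feed the updated error log back into the next attempt.
        obs = result["observation"]
    return best
```

Passing the transport in as a callable keeps the loop testable without a running server; with a live environment, `post` could wrap `httpx.post("http://localhost:7860" + path, json=payload).json()`.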
|
| 39 |
+
|
| 40 |
+
This is fundamentally different from one-shot code generation benchmarks. The agent must:
|
| 41 |
+
- **Read and interpret error messages** from previous attempts
|
| 42 |
+
- **Track what it has already tried** (repeated fixes are penalized)
|
| 43 |
+
- **Decide whether to patch locally or rewrite entirely**
|
| 44 |
+
- **Optimize for efficiency**, not just correctness
|
| 45 |
+
|
| 46 |
+
### Architecture
|
| 47 |
+
|
| 48 |
+
```
Agent ──→ POST /reset ──→ CodeArena Server ──→ Returns buggy_code + error_log
  │                            │
  │                            ├── Task Loader (9 tasks across 5 categories)
  │                            ├── Sandboxed Executor (subprocess + timeout)
  │                            ├── Hybrid Grader (tests + LLM judge)
  │                            ├── Algorithm Detector (complexity analysis)
  │                            └── Agent Memory (self-improving store)
  │
  └── POST /step ──────────→ Returns observation, reward, done, info
```
|
| 59 |
+
|
| 60 |
+
The server is a standard FastAPI application that implements the OpenEnv specification (`/reset`, `/step`, `/state`). The `openenv.yaml` manifest defines the observation space (buggy code, error log, test results, previous attempts) and the action space (proposed fix).
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
## What Makes CodeArena Special (Environment Innovation)
|
| 65 |
+
|
| 66 |
+
### 1. Hybrid Grader: Tests + LLM-as-Judge
|
| 67 |
+
|
| 68 |
+
Most coding benchmarks use a single signal: did the tests pass? This creates a fundamental problem: agents learn to produce code that passes weak tests through reward-hacking (e.g., hardcoding expected outputs, or producing syntactically correct but semantically broken code).
|
| 69 |
+
|
| 70 |
+
CodeArena uses a **Hybrid Grader** with six weighted components:
|
| 71 |
+
|
| 72 |
+
| Component | Weight | What It Measures |
|
| 73 |
+
|---|---|---|
|
| 74 |
+
| `compile_score` | 15% | Code compiles without syntax errors |
|
| 75 |
+
| `test_pass_ratio` | 35% | Fraction of unit tests passed |
|
| 76 |
+
| `efficiency_score` | 30% | Execution time vs. optimal runtime |
|
| 77 |
+
| `llm_correctness` | 10% | LLM judge: is the fix logically correct? |
|
| 78 |
+
| `llm_security` | 5% | LLM judge: does the fix introduce vulnerabilities? |
|
| 79 |
+
| `llm_quality` | 5% | LLM judge: is the code readable and maintainable? |
|
| 80 |
+
|
| 81 |
+
Additionally, two penalties are applied:
|
| 82 |
+
- **Step penalty** (`-0.01 × step_count`): Rewards faster fixes
|
| 83 |
+
- **Novelty penalty** (`-0.10`): Penalizes submitting the same fix twice
|
| 84 |
+
|
| 85 |
+
The LLM judge is called via the OpenAI-compatible API (configurable to GPT-4o-mini, local Ollama, or HuggingFace Inference). When no API key is available, it falls back to neutral scores (0.5), ensuring the environment always runs.
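As a rough illustration of how the weights and penalties combine, here is a minimal sketch. The weights and penalty values come from the table and list above; the function name, the assumption that every component lies in [0, 1], and the final clamp to [0, 1] are assumptions, not the grader's confirmed behavior.

```python
# Component weights from the table above (they sum to 1.0).
WEIGHTS = {
    "compile_score": 0.15,
    "test_pass_ratio": 0.35,
    "efficiency_score": 0.30,
    "llm_correctness": 0.10,
    "llm_security": 0.05,
    "llm_quality": 0.05,
}
STEP_PENALTY = 0.01     # subtracted once per step taken
NOVELTY_PENALTY = 0.10  # subtracted when the same fix is resubmitted


def hybrid_reward(components, step_count, is_repeat):
    """Weighted sum of component scores minus penalties.

    Clamping the result to [0, 1] is an assumption of this sketch.
    """
    base = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    penalty = STEP_PENALTY * step_count
    if is_repeat:
        penalty += NOVELTY_PENALTY
    return max(0.0, min(1.0, base - penalty))
```

With the neutral 0.5 fallback for all three LLM scores, a fix that compiles, passes every test, and matches the optimal runtime tops out at a base of 0.90 before penalties under these assumptions.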
|
| 86 |
+
|
| 87 |
+
**Why this matters for training:** The heavy 30% weight on efficiency means that an agent that passes all tests with an O(n²) brute-force solution gets a significantly lower reward than one that uses an O(n) algorithm. This forces the model to learn *algorithmic reasoning*, not just syntax repair.
|
| 88 |
+
|
| 89 |
+
### 2. Adaptive Curriculum (Theme #4: Self-Improvement)
|
| 90 |
+
|
| 91 |
+
CodeArena doesn't use a fixed task set. It features an **Adaptive Curriculum** that tracks the agent's rolling average reward over recent episodes and automatically adjusts difficulty:
|
| 92 |
+
|
| 93 |
+
| Condition | Transition |
|---|---|
| avg reward > 0.80 on Easy | → Medium |
| avg reward > 0.75 on Medium | → Hard |
| avg reward < 0.35 on Hard | → Medium (de-escalate) |
| avg reward < 0.35 on Medium | → Easy (de-escalate) |
|
| 99 |
+
|
| 100 |
+
This is activated by passing `task_id: "auto"` to the `/reset` endpoint.
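The transition rules can be sketched as a small state machine. The thresholds are the ones in the table; the class name, the window size of 5, and resetting the window after each transition are assumptions made for illustration only.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]
PROMOTE = {"easy": 0.80, "medium": 0.75}  # avg reward above this escalates
DEMOTE = {"medium": 0.35, "hard": 0.35}   # avg reward below this de-escalates


class Curriculum:
    """Tracks a rolling average reward and adjusts the difficulty level."""

    def __init__(self, window=5):  # window size is an assumption
        self.level = "easy"
        self.rewards = deque(maxlen=window)

    def record(self, reward):
        """Record one episode reward and return the (possibly new) level."""
        self.rewards.append(reward)
        if len(self.rewards) < self.rewards.maxlen:
            return self.level  # not enough episodes to judge yet
        avg = sum(self.rewards) / len(self.rewards)
        idx = LEVELS.index(self.level)
        if avg > PROMOTE.get(self.level, float("inf")):
            self.level = LEVELS[idx + 1]
            self.rewards.clear()  # start a fresh window on the new level
        elif avg < DEMOTE.get(self.level, float("-inf")):
            self.level = LEVELS[idx - 1]
            self.rewards.clear()
        return self.level
```

Clearing the window on each transition avoids judging the new level by rewards earned on the old one; whether the real environment does this is not stated in the source.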
|
| 101 |
+
|
| 102 |
+
**Why this matters:** The agent cannot plateau by memorizing solutions to easy tasks. As soon as it masters syntax errors, the environment pushes it to algorithmic logic bugs. If it struggles, it recovers on easier tasks before trying again. This creates a natural *recursive skill amplification* loop: the environment drives the agent's own capability growth.
|
| 103 |
+
|
| 104 |
+
### 3. Algorithm Detection + Adaptive Prompting
|
| 105 |
+
|
| 106 |
+
CodeArena includes a built-in **Algorithm Detector** (`server/algorithm_detector.py`) that:
|
| 107 |
+
|
| 108 |
+
1. **Classifies the problem type** (max subarray, two-sum, binary search, sliding window, etc.) from code patterns
|
| 109 |
+
2. **Estimates time complexity** by analyzing loop nesting depth (O(1) → O(n) → O(n²) → O(n³))
|
| 110 |
+
3. **Generates targeted optimization hints** (e.g., "Use Kadane's Algorithm O(n): `curr = max(num, curr+num)`")
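To make the loop-nesting heuristic in step 2 concrete, here is a minimal sketch using Python's standard `ast` module. The function names are hypothetical and this is not the code in `server/algorithm_detector.py`; it only counts `for` and `while` statements, so comprehensions, recursion, and library calls are ignored.

```python
import ast


def max_loop_depth(source: str) -> int:
    """Return the deepest nesting of for/while loops in the source."""
    tree = ast.parse(source)

    def depth(node, current):
        # Entering a loop statement increases the nesting level by one.
        here = current + 1 if isinstance(node, (ast.For, ast.While)) else current
        children = [depth(child, here) for child in ast.iter_child_nodes(node)]
        return max([here] + children)

    return depth(tree, 0)


def estimate_complexity(source: str) -> str:
    """Map loop nesting depth to a rough big-O label."""
    labels = {0: "O(1)", 1: "O(n)", 2: "O(n^2)", 3: "O(n^3)"}
    d = max_loop_depth(source)
    return labels.get(d, f"O(n^{d})")
```

Nesting depth is only a proxy: a loop over a constant range or a hidden `sorted()` call will be misclassified, which is one reason to pair it with measured execution time.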
|
| 111 |
+
|
| 112 |
+
When the AI fixer generates a repair, the algorithm detector provides **adaptive prompt suffixes** based on the current reward level:
|
| 113 |
+
- Low reward (< 0.4): "Focus on correctness. Fix syntax errors first."
|
| 114 |
+
- Medium reward (0.4–0.7): "Fix edge cases and logic bugs."
|
| 115 |
+
- High reward (> 0.7): "Optimize for performance. Use O(n) algorithms."
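The three reward bands map to a simple selector. This sketch uses the suffix strings quoted above; the function name and the choice to treat exactly 0.4 and 0.7 as "medium" are assumptions.

```python
def prompt_suffix(reward: float) -> str:
    """Pick a prompt suffix from the current reward level."""
    if reward < 0.4:
        return "Focus on correctness. Fix syntax errors first."
    if reward <= 0.7:
        return "Fix edge cases and logic bugs."
    return "Optimize for performance. Use O(n) algorithms."
```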
|
| 116 |
+
|
| 117 |
+
### 4. Self-Improving Agent Memory
|
| 118 |
+
|
| 119 |
+
CodeArena includes a persistent **Agent Memory** system (`server/memory.py`) that stores the best solution found for each task. When the agent encounters the same task type again, it can retrieve its previous best solution as a starting point.
|
| 120 |
+
|
| 121 |
+
This creates a genuine self-improvement loop:
|
| 122 |
+
- Episode 1: Agent fixes syntax → reward 0.45
- Episode 5: Agent recalls its best previous fix, optimizes further → reward 0.72
- Episode 10: Agent has accumulated enough memory to skip basic fixes entirely → reward 0.88
|
| 125 |
+
|
| 126 |
+
The memory is persisted to `agent_memory.json` and survives server restarts.
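A best-solution store of this kind can be sketched in a few lines. The file name `agent_memory.json` is from the text; the class name, method names, and JSON layout are assumptions and may differ from `server/memory.py`.

```python
import json
from pathlib import Path


class AgentMemory:
    """Keeps the highest-reward solution per task and persists it as JSON."""

    def __init__(self, path="agent_memory.json"):
        self.path = Path(path)
        # Reload earlier results so memory survives a restart.
        self.best = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, task_id, code, reward):
        """Store the fix only if it beats the current best; return True if stored."""
        entry = self.best.get(task_id)
        if entry is not None and entry["reward"] >= reward:
            return False
        self.best[task_id] = {"code": code, "reward": reward}
        self.path.write_text(json.dumps(self.best, indent=2))
        return True

    def recall(self, task_id):
        """Return the best known solution for a task, or None."""
        entry = self.best.get(task_id)
        return entry["code"] if entry else None
```

Writing the file on every improvement is the simplest way to survive restarts; a busier server would likely batch writes or guard the file with a lock.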
|
| 127 |
+
|
| 128 |
+
### 5. Rich Task Diversity
|
| 129 |
+
|
| 130 |
+
CodeArena ships with **9 tasks across 5 categories**:
|
| 131 |
+
|
| 132 |
+
| Category | Tasks | What It Tests |
|
| 133 |
+
|---|---|---|
|
| 134 |
+
| Easy (syntax) | Missing colons, wrong indentation | Basic Python syntax repair |
|
| 135 |
+
| Medium (logic) | Off-by-one errors, wrong conditions | Algorithmic reasoning |
|
| 136 |
+
| Hard (optimization) | O(n²) → O(n) refactoring | Algorithm design |
|
| 137 |
+
| Type Errors | Wrong types, missing conversions | Type system understanding |
|
| 138 |
+
| Security Bugs | SQL injection, path traversal | Security awareness |
|
| 139 |
+
|
| 140 |
+
Each task includes:
|
| 141 |
+
- Buggy source code
|
| 142 |
+
- Multiple unit tests
|
| 143 |
+
- An optimal execution time baseline (for efficiency scoring)
|
| 144 |
+
|
| 145 |
+
---
|
| 146 |
+
|
| 147 |
+
## Training Pipeline: TRL GRPO on CodeArena
|
| 148 |
+
|
| 149 |
+
We trained a coding model using **Hugging Face TRL's GRPO (Group Relative Policy Optimization)** trainer, connecting it directly to the CodeArena environment as a live reward signal.
|
| 150 |
+
|
| 151 |
+
### How It Works
|
| 152 |
+
|
| 153 |
+
```python
import httpx
from trl import GRPOConfig, GRPOTrainer

# The reward function queries CodeArena's /step endpoint.
# TRL calls reward functions with keyword arguments, so accept **kwargs.
def codearena_reward_func(completions, prompts=None, **kwargs):
    rewards = []
    for completion in completions:
        # Conversational completions arrive as a list of message dicts
        proposed_fix = completion[0].get('content', '').strip()
        res = httpx.post("http://localhost:7860/step",
                         json={"proposed_fix": proposed_fix})
        reward = res.json().get('reward', 0.0)
        rewards.append(reward)
    return rewards

# GRPO training with CodeArena as the reward environment
# (`model` and `dataset` are assumed to be loaded earlier in the notebook)
trainer = GRPOTrainer(
    model=model,
    reward_funcs=codearena_reward_func,
    args=GRPOConfig(
        output_dir="./codearena-grpo",
        learning_rate=1e-5,
        max_steps=50,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
)
trainer.train()
```
|
| 179 |
+
|
| 180 |
+
The key insight is that **the reward is not static**: it comes from actually executing the agent's proposed code against real unit tests in a sandboxed environment, then grading it with the hybrid scorer. This is true environment-in-the-loop RL, not reward modeling on a frozen dataset.
|
| 181 |
+
|
| 182 |
+
### Training Results
|
| 183 |
+
|
| 184 |
+
We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.
|
| 185 |
+
|
| 186 |
+

|
| 187 |
+
*Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
|
| 188 |
+
|
| 189 |
+

|
| 190 |
+
*Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging, which is the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
|
| 191 |
+
|
| 192 |
+
### Reproducing the Training
|
| 193 |
+
|
| 194 |
+
The complete training pipeline is available as a Colab notebook:
|
| 195 |
+
**[Open in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
|
| 196 |
+
|
| 197 |
+
The notebook:
|
| 198 |
+
1. Installs all dependencies (`trl`, `transformers`, `httpx`)
|
| 199 |
+
2. Clones the CodeArena repository
|
| 200 |
+
3. Starts the FastAPI backend server
|
| 201 |
+
4. Loads `Qwen2.5-Coder-1.5B` with GRPO configuration
|
| 202 |
+
5. Trains against the live environment
|
| 203 |
+
6. Logs rewards per step
|
| 204 |
+
|
| 205 |
+
---
|
| 206 |
+
|
| 207 |
+
## Live Demo: Try It Now
|
| 208 |
+
|
| 209 |
+
The fully-functional CodeArena environment is deployed on Hugging Face Spaces with a React frontend dashboard:
|
| 210 |
+
|
| 211 |
+
**[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
|
| 212 |
+
|
| 213 |
+
### What You Can Do on the Live Demo:
|
| 214 |
+
|
| 215 |
+
1. **Start an Episode**: Select Easy/Medium/Hard difficulty and load a buggy code task
|
| 216 |
+
2. **Manual Fix**: Edit the code yourself and click "Run Step" to see your reward
|
| 217 |
+
3. **AI Fix**: Click the AI FIX button to have the built-in AI repair agent (powered by `Qwen2.5-Coder-3B-Instruct` via HuggingFace Serverless Inference) generate a fix
|
| 218 |
+
4. **Agent Mode**: Toggle auto-pilot to watch the agent autonomously fix → test → fix → test in a loop
|
| 219 |
+
5. **Sandbox Mode**: Paste your own arbitrary Python code and watch the environment evaluate it
|
| 220 |
+
|
| 221 |
+
The dashboard shows real-time reward components (compile score, test ratio, efficiency), a terminal log of every step, and a reward chart that updates live.
|
| 222 |
+
|
| 223 |
+
---
|
| 224 |
+
|
| 225 |
+
## Technical Deep Dive
|
| 226 |
+
|
| 227 |
+
### Sandboxed Execution
|
| 228 |
+
|
| 229 |
+
All agent-submitted code runs in an isolated subprocess with:
|
| 230 |
+
- **AST syntax validation** before execution (catches syntax errors without running code)
|
| 231 |
+
- **Timeout enforcement** (configurable per task, default 5s)
|
| 232 |
+
- **Temporary file execution** (code is written to a temp file, executed, then deleted)
|
| 233 |
+
- **Structured output parsing** (test results are communicated via a `|CODEARENA_STATS|` sentinel)
|
| 234 |
+
|
| 235 |
+
This ensures that malicious or infinite-loop code cannot crash the server.
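The four safeguards listed above fit together roughly as follows. This is a sketch under assumptions, not the contents of `server/executor.py`: the function name and return shape are hypothetical, and the payload after the `|CODEARENA_STATS|` sentinel is treated as an opaque string because its format is not given here.

```python
import ast
import os
import subprocess
import sys
import tempfile

SENTINEL = "|CODEARENA_STATS|"


def run_sandboxed(code: str, timeout: float = 5.0) -> dict:
    """Validate, execute, and collect results for one submission."""
    # 1. AST validation: reject syntax errors without running anything.
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"ok": False, "error": f"SyntaxError: {exc.msg}", "stats": None}

    # 2. Write the code to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = tmp.name

    try:
        # 3. Run in a separate interpreter process with a hard timeout.
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout", "stats": None}
    finally:
        os.remove(path)  # always delete the temp file

    # 4. Pull the structured result out of stdout via the sentinel.
    stats = None
    for line in proc.stdout.splitlines():
        if line.startswith(SENTINEL):
            stats = line[len(SENTINEL):].strip()
    return {
        "ok": proc.returncode == 0,
        "error": proc.stderr.strip() or None,
        "stats": stats,
    }
```

For example, `run_sandboxed("while True: pass", timeout=0.5)` returns a timeout result instead of hanging the caller.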
|
| 236 |
+
|
| 237 |
+
### AI Code Fixer Pipeline
|
| 238 |
+
|
| 239 |
+
The built-in AI fixer (`server/ai_fixer.py`, 600+ lines) implements a sophisticated multi-fallback pipeline:
|
| 240 |
+
|
| 241 |
+
1. **TGI / HuggingFace Serverless API** (Priority 1): Calls `Qwen2.5-Coder-3B-Instruct` for high-quality fixes
|
| 242 |
+
2. **Local Ollama** (Priority 2): Falls back to a local LLM if available
|
| 243 |
+
3. **AST Pattern-Based Fixer** (Priority 3): 20+ pattern rules for common Python bugs:
|
| 244 |
+
- Missing colons after `def`, `if`, `for`, `while`
|
| 245 |
+
- Missing `return` statements
|
| 246 |
+
- Wrong comparison operators (`=` → `==`)
|
| 247 |
+
- Missing `self` parameter in class methods
|
| 248 |
+
- Incorrect indentation repair
|
| 249 |
+
- And many more
|
| 250 |
+
|
| 251 |
+
The fixer also includes a **code validator** that rejects fixes worse than the original (e.g., ones that introduce new syntax errors), and a **self-critique loop** that re-checks the generated code before returning it.
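The priority order and the validation step can be sketched as a generic fallback chain. Everything here is illustrative: the function names are hypothetical, the validator only checks that the candidate parses, and the single regex rule stands in for the 20+ pattern rules mentioned above.

```python
import ast
import re


def is_valid_python(code: str) -> bool:
    """Cheap validator: does the candidate at least parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def add_missing_colons(code: str) -> str:
    """One example rule: add ':' after def/if/for/while headers lacking it."""
    pattern = re.compile(
        r"^([ \t]*(?:def|if|for|while)\b.*[^:\s])[ \t]*$", re.MULTILINE
    )
    return pattern.sub(r"\1:", code)


def fix_with_fallbacks(buggy_code, fixers):
    """Try each (name, fixer) in priority order; return the first valid fix."""
    for name, fixer in fixers:
        try:
            candidate = fixer(buggy_code)
        except Exception:
            continue  # backend unavailable or failed; fall through
        if candidate and is_valid_python(candidate):
            return name, candidate
    return "none", buggy_code  # nothing worked; keep the original
```

In this arrangement a remote model, a local model, and the rule-based fixer would be passed as three entries in `fixers`, so an outage in one backend degrades the result rather than failing the request.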
|
| 252 |
+
|
| 253 |
+
### Complexity-Reward Tracking
|
| 254 |
+
|
| 255 |
+
Every fix is logged to `complexity_rewards.csv` with:
|
| 256 |
+
- Task ID
|
| 257 |
+
- Reward achieved
|
| 258 |
+
- Detected time complexity
|
| 259 |
+
- Fix method (TGI/Ollama/built-in)
|
| 260 |
+
|
| 261 |
+
This creates a research dataset for checking our core hypothesis: **agents that produce O(n) solutions consistently receive higher rewards than those producing O(n²) solutions.**
|
| 262 |
+
|
| 263 |
+
---
|
| 264 |
+
|
| 265 |
+
## Why CodeArena Matters
|
| 266 |
+
|
| 267 |
+
**Writing code is a solved problem.** GPT-4, Claude, and Gemini can all generate working functions from natural language descriptions.
|
| 268 |
+
|
| 269 |
+
**Debugging code autonomously (reasoning about failure, iterating on fixes, recovering from wrong turns) is not solved.**
|
| 270 |
+
|
| 271 |
+
Every production coding system will eventually face broken code. There is no other standardized RL environment that trains and benchmarks iterative repair at this level. CodeArena fills that gap with:
|
| 272 |
+
|
| 273 |
+
- A **hybrid grader** that prevents reward-hacking
|
| 274 |
+
- An **adaptive curriculum** for continuous self-improvement
|
| 275 |
+
- A **persistent memory** for cross-episode learning
|
| 276 |
+
- A **rich task library** spanning syntax, logic, algorithms, types, and security
|
| 277 |
+
- Full **OpenEnv compatibility** for plug-and-play evaluation
|
| 278 |
+
|
| 279 |
+
CodeArena is infrastructure. Plug any model in. Run it. Get a number. Compare it against the baseline. Train on it. Watch it improve.
|
| 280 |
+
|
| 281 |
+
---
|
| 282 |
+
|
| 283 |
+
## Links & Resources
|
| 284 |
+
|
| 285 |
+
| Resource | Link |
|---|---|
| Live Demo (HF Space) | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
| Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
| Source Code (GitHub) | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
| OpenEnv Manifest | [openenv.yaml](https://github.com/havinashpatil/meta/blob/main/openenv.yaml) |
|
| 291 |
+
|
| 292 |
+
---
|
| 293 |
+
|
| 294 |
+
*Built for the OpenEnv Hackathon India 2026, Theme #4: Self-Improvement*
|
README.md
CHANGED
|
@@ -6,124 +6,232 @@ colorTo: purple
|
|
| 6 |
sdk: docker
|
| 7 |
pinned: true
|
| 8 |
---
|
| 9 |
-
|
|
|
|
| 10 |
[](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
|
| 11 |
-
[]()
|
|
|
|
| 13 |
|
| 14 |
-
# CodeArena:
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
-
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
|
|
|
|
|
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
|
|
|
|
|
|
| 31 |
|
| 32 |
---
|
| 33 |
|
| 34 |
-
## Environment Innovation (
|
|
|
|
|
|
|
|
|
|
| 35 |
|
| 36 |
-
|
| 37 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 38 |
|
| 39 |
-
|
| 40 |
-
Most benchmarks ask a binary question: *did the tests pass?* CodeArena uses a rich **Hybrid Grader**. A deterministic test runner checks correctness, while a built-in LLM Judge (powered by TGI/Hugging Face Serverless) scores the fix on security, readability, and algorithmic complexity (O(N) vs O(N²)). This prevents reward-hacking where agents produce syntactically correct but fundamentally broken code just to pass a weak test.
|
| 41 |
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
|
| 50 |
---
|
| 51 |
|
| 52 |
-
##
|
| 53 |
|
| 54 |
-
|
|
|
|
|
|
|
|
|
|
| 55 |
|
| 56 |
-
|
| 57 |
|
| 58 |

|
| 59 |
-
*Episode reward over training steps. The rolling 10-step average shows clear learning
|
| 60 |
|
| 61 |

|
| 62 |
-
*Average reward
|
| 63 |
|
| 64 |
-
###
|
| 65 |
-
|
| 66 |
-
|
|
|
|
|
|
|
| 67 |
|
| 68 |
---
|
| 69 |
|
| 70 |
-
##
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
-
π **[
|
| 75 |
|
| 76 |
-
The
|
| 77 |
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
|
| 83 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
|
| 85 |
-
|
| 86 |
|
| 87 |
-
|
| 88 |
|
| 89 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
|
| 99 |
-
|
| 100 |
-
```bash
|
| 101 |
-
python create_tasks.py
|
| 102 |
-
```
|
| 103 |
|
| 104 |
-
|
| 105 |
-
The backend acts as the OpenEnv entrypoint and serves the compiled React dashboard.
|
| 106 |
-
```bash
|
| 107 |
-
uvicorn server.app:app --port 7860
|
| 108 |
-
```
|
| 109 |
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
|
|
|
| 116 |
|
| 117 |
---
|
| 118 |
|
| 119 |
-
## π
|
| 120 |
|
| 121 |
| Resource | URL |
|
| 122 |
|---|---|
|
| 123 |
-
| **
|
| 124 |
-
| **
|
| 125 |
-
| **
|
| 126 |
-
| **
|
|
|
|
| 127 |
|
| 128 |
---
|
| 129 |
-
|
|
|
|
|
|
| 6 |
sdk: docker
|
| 7 |
pinned: true
|
| 8 |
---
|
| 9 |
+
|
| 10 |
+
[Hugging Face Space](https://huggingface.co/spaces/ceoavinash/codearena-rl)
[Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
[OpenEnv manifest](./openenv.yaml)
[Blog writeup](./BLOG.md)
|
| 15 |
|
| 16 |
+
# CodeArena: Iterative Code Repair as an RL Environment
|
| 17 |
|
| 18 |
+
> **TL;DR**: An OpenEnv-compatible RL environment where an LLM agent debugs Python code across multiple attempts, graded by unit tests + LLM-as-Judge + algorithmic efficiency. Features adaptive difficulty, agent memory, and a full TRL GRPO training pipeline.
|
| 19 |
|
| 20 |
+
---
|
| 21 |
|
| 22 |
+
## The Problem
|
| 23 |
|
| 24 |
+
Every coding AI is benchmarked on **generation**: write a function, complete a snippet. **Nobody benchmarks what happens when the code breaks.** In production, developers spend the majority of their time reading error logs, reasoning about failures, iterating on fixes, and recovering from wrong turns. There is no standardized RL environment for this iterative debugging loop.
|
| 25 |
|
| 26 |
+
**CodeArena fills that gap.** It is the first open-source RL environment built specifically for *iterative code repair*, where an agent must fix buggy Python code over multiple steps, learning from execution feedback after each attempt.
|
| 27 |
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
## Theme Alignment: #4, Self-Improvement
|
| 31 |
|
| 32 |
+
CodeArena directly targets **Theme #4: Self-Improvement** through three mechanisms:
|
| 33 |
|
| 34 |
+
1. **Adaptive Curriculum**: Difficulty escalates automatically when the agent's rolling avg reward exceeds 0.80, and de-escalates when it drops below 0.35. The agent drives its own training progression.
|
| 35 |
+
2. **Persistent Agent Memory**: Best solutions per task are stored in `agent_memory.json` and retrieved in future episodes, creating cross-episode learning.
|
| 36 |
+
3. **Adaptive Prompting**: The AI fixer adjusts its strategy based on current reward level: syntax focus at low rewards, algorithm optimization at high rewards.
|
| 37 |
|
| 38 |
---
|
| 39 |
|
| 40 |
+
## Environment Innovation (40%)
|
| 41 |
+
|
| 42 |
+
### Hybrid Grader: Tests + LLM-as-Judge
|
| 43 |
+
Most benchmarks ask: *did the tests pass?* CodeArena also asks: *is the fix correct, secure, efficient, and readable?*
|
| 44 |
|
| 45 |
+
| Component | Weight | Signal |
|
| 46 |
+
|---|---|---|
|
| 47 |
+
| `compile_score` | 15% | Code compiles without error |
|
| 48 |
+
| `test_pass_ratio` | 35% | Fraction of unit tests passed |
|
| 49 |
+
| `efficiency_score` | 30% | Execution time vs optimal (O(n) rewarded, O(n²) penalized) |
|
| 50 |
+
| `llm_correctness` | 10% | LLM judge: logical correctness |
|
| 51 |
+
| `llm_security` | 5% | LLM judge: no vulnerabilities introduced |
|
| 52 |
+
| `llm_quality` | 5% | LLM judge: readability and maintainability |
|
| 53 |
|
| 54 |
+
**Penalties:** `-0.01/step` (rewards faster fixes) and `-0.10` for repeating an identical fix (prevents reward-hacking via repetition).
|
|
|
|
| 55 |
|
| 56 |
+
The 30% efficiency weight means an agent that passes all tests with O(n²) brute-force gets a significantly lower reward than one using O(n). This forces the model to learn *algorithmic reasoning*, not just syntax repair.
|
| 57 |
+
|
| 58 |
+
### Algorithm Detector
|
| 59 |
+
A built-in classifier (`server/algorithm_detector.py`) identifies the problem type (Kadane's, Two-Sum, Sliding Window, etc.) and estimates time complexity from loop nesting. This drives targeted optimization hints during repair.
|
| 60 |
+
|
| 61 |
+
### Sandboxed Execution
|
| 62 |
+
All code runs in isolated subprocesses with AST pre-validation, timeout enforcement, and temporary file cleanup. Malicious or infinite-loop code cannot crash the server.
|
| 63 |
+
|
| 64 |
+
### 9 Tasks Across 5 Categories
|
| 65 |
+
|
| 66 |
+
| Category | Example | Tests |
|
| 67 |
+
|---|---|---|
|
| 68 |
+
| Easy (syntax) | Missing colons, indentation | Basic repair |
|
| 69 |
+
| Medium (logic) | Off-by-one, wrong conditions | Reasoning |
|
| 70 |
+
| Hard (algorithms) | O(n²) → O(n) refactoring | Optimization |
|
| 71 |
+
| Type Errors | Wrong types, missing casts | Type safety |
|
| 72 |
+
| Security Bugs | SQL injection, path traversal | Security awareness |
|
| 73 |
|
| 74 |
---
|
| 75 |
|
| 76 |
+
## Storytelling (30%): How It Works
|
| 77 |
+
|
| 78 |
+
**Data Flow:** `Agent` → `POST /reset` → receives `buggy_code + error_log` → `POST /step` with `proposed_fix` → sandboxed execution → hybrid grading → `reward + updated error_log` → repeat up to 5 steps.
|
| 79 |
+
|
| 80 |
+
```
Episode Walkthrough:
────────────────────
Step 1: Agent receives   def solve(n) print(n)
        → Proposes:      def solve(n): print(n)
        → Result:        ✓ Compiles, 1/3 tests pass
        → Reward:        0.35

Step 2: Agent reads error: "AssertionError: solve(5) != 25"
        → Proposes:      def solve(n): return sum(n for _ in range(n))
        → Result:        ✓ 3/3 tests pass, but runs in O(n)
        → Reward:        0.72

Step 3: Agent reads hint: "Optimize to O(1)"
        → Proposes:      def solve(n): return n*n
        → Result:        ✓ 3/3 pass, O(1) optimal
        → Reward:        0.95 ✓
```
|
| 98 |
+
|
| 99 |
+
The agent must learn to **read error messages**, **avoid repeating failed fixes**, and **optimize for efficiency**, not just correctness. This mirrors real-world software engineering.
|
| 100 |
+
|
| 101 |
+
---
|
| 102 |
+
|
| 103 |
+
## Showing Improvement in Rewards (20%)
|
| 104 |
+
|
| 105 |
+
We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.
|
| 106 |
|
| 107 |

|
| 108 |
+
*Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
|
| 109 |
|
| 110 |

|
| 111 |
+
*Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging, which is the curriculum behavior we designed for.*
|
| 112 |
|
| 113 |
+
### Key Observations:
|
| 114 |
+
- **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
- **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
- **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
- **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
|
| 118 |
|
| 119 |
---
|
| 120 |
|
| 121 |
+
## Reward & Training Pipeline (10%)
|
| 122 |
|
| 123 |
+
### Training Script (Colab)
|
| 124 |
|
| 125 |
+
**[Open Training Notebook in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
|
| 126 |
|
| 127 |
+
The notebook demonstrates environment-in-the-loop RL:
|
| 128 |
|
| 129 |
+
```python
import httpx
from trl import GRPOConfig, GRPOTrainer

# TRL calls reward functions with keyword arguments, so accept **kwargs.
def codearena_reward_func(completions, prompts=None, **kwargs):
    """Reward function that queries the live CodeArena environment."""
    rewards = []
    for completion in completions:
        proposed_fix = completion[0].get('content', '').strip()
        res = httpx.post("http://localhost:7860/step",
                         json={"proposed_fix": proposed_fix})
        reward = res.json().get('reward', 0.0)
        rewards.append(reward)
    return rewards

# `model` and `dataset` are assumed to be loaded earlier in the notebook
trainer = GRPOTrainer(
    model=model,  # Qwen2.5-Coder-1.5B
    reward_funcs=codearena_reward_func,
    args=GRPOConfig(output_dir="./codearena-grpo",
                    learning_rate=1e-5, max_steps=50),
    train_dataset=dataset,  # m-a-p/Code-Feedback
)
trainer.train()
```
|
| 150 |
|
| 151 |
+
The reward is **not static**: it comes from actually executing the agent's code in a sandboxed environment, running real unit tests, and scoring with the hybrid grader. This is true environment-in-the-loop RL.
|
| 152 |
|
| 153 |
+
### Inference Evaluation
|
| 154 |
|
| 155 |
+
```bash
|
| 156 |
+
# Evaluate any model against CodeArena
|
| 157 |
+
export MODEL_NAME="codellama:7b-instruct"
|
| 158 |
+
python inference.py --backend openai
|
| 159 |
+
```
|
| 160 |
|
| 161 |
+
Results are logged to `rewards_log.csv` and can be visualized with `python plot_rewards.py`.
|
| 162 |
|
| 163 |
+
---
|
| 164 |
+
|
| 165 |
+
## Architecture (OpenEnv Compatible)
|
| 166 |
+
|
| 167 |
+
```
codearena-rl/
├── openenv.yaml              # OpenEnv manifest (observation/action spaces)
├── server/
│   ├── app.py                # FastAPI entrypoint (/reset, /step, /state)
│   ├── models.py             # Pydantic schemas (Observation, Action, Task)
│   ├── executor.py           # Sandboxed subprocess execution
│   ├── grader.py             # Hybrid reward (tests + LLM judge)
│   ├── ai_fixer.py           # Multi-fallback AI repair (TGI → Ollama → AST)
│   ├── algorithm_detector.py # Problem classification + complexity detection
│   ├── memory.py             # Persistent agent memory (best solutions)
│   └── raw_runner.py         # Sandbox mode executor
├── tasks/
│   ├── easy.py, medium.py, hard.py
│   ├── type_errors/          # 3 type error tasks
│   └── security_bugs/        # 3 security bug tasks
├── frontend/                 # React + Vite dashboard
├── train_grpo.ipynb          # TRL GRPO training notebook
├── inference.py              # CLI evaluation runner
├── plot_rewards.py           # Reward visualization
└── Dockerfile                # HF Spaces deployment
```
|
| 189 |
+
|
| 190 |
+
### Quick Start
|
| 191 |
+
|
| 192 |
+
```bash
|
| 193 |
+
pip install -r requirements.txt
|
| 194 |
+
python create_tasks.py # Generate task database
|
| 195 |
+
uvicorn server.app:app --port 7860 # Start environment
|
| 196 |
+
```
|
| 197 |
+
|
| 198 |
+
### OpenEnv API
|
| 199 |
+
|
| 200 |
+
| Endpoint | Method | Description |
|
| 201 |
+
|---|---|---|
|
| 202 |
+
| `/reset` | POST | Initialize environment with `{"task_id": "easy\|medium\|hard\|auto"}` |
|
| 203 |
+
| `/step` | POST | Submit fix with `{"proposed_fix": "..."}` → reward + observation |
|
| 204 |
+
| `/state` | GET | Current observation |
|
| 205 |
+
| `/health` | GET | Server health check |
|
| 206 |
+
| `/fix` | POST | AI code repair endpoint |
|
| 207 |
+
| `/curriculum` | GET | Adaptive difficulty state |
|
| 208 |
+
| `/stats` | GET | Complexity vs reward analytics |
|
| 209 |
+
| `/memory` | GET | Agent memory contents |
|
| 210 |
|
| 211 |
+
---
|
|
|
|
|
|
|
|
|
|
| 212 |
|
| 213 |
+
## Live Demo
|
|
|
|
|
|
|
|
|
|
|
|
|
| 214 |
|
| 215 |
+
**[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
|
| 216 |
+
|
| 217 |
+
Features:
|
| 218 |
+
- **Real-time dashboard** with reward charts, terminal logs, and code editor
|
| 219 |
+
- **AI Fix button** powered by HuggingFace Serverless Inference (`Qwen2.5-Coder-3B-Instruct`)
|
| 220 |
+
- **Agent Mode** toggle for autonomous fix → test → fix loops
|
| 221 |
+
- **Sandbox Mode** for arbitrary Python code evaluation
|
| 222 |
|
| 223 |
---
|
| 224 |
|
| 225 |
+
## All Links
|
| 226 |
|
| 227 |
| Resource | URL |
|---|---|
| **HuggingFace Space (Live)** | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
| **Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
| **Blog / Writeup** | [BLOG.md](./BLOG.md) |
| **GitHub Repository** | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
| **OpenEnv Manifest** | [openenv.yaml](./openenv.yaml) |
|
| 234 |
|
| 235 |
---
|
| 236 |
+
|
| 237 |
+
*Built for the OpenEnv Hackathon India 2026, Theme #4: Self-Improvement*
|