havinashpatil committed on
Commit
90be6c7
·
1 Parent(s): 5e35378

Final hackathon submission: polished README + detailed blog writeup

Files changed (2)
  1. BLOG.md +294 -0
  2. README.md +178 -70
BLOG.md ADDED
@@ -0,0 +1,294 @@
1
+ # CodeArena: Teaching LLMs to Debug Code Through Reinforcement Learning
2
+
3
+ **An OpenEnv-compatible RL environment for iterative code repair with adaptive difficulty, hybrid grading, and self-improving agent memory.**
4
+
5
+ [![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live%20Demo-brightgreen)](https://huggingface.co/spaces/ceoavinash/codearena-rl)
6
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
7
+ [![GitHub](https://img.shields.io/badge/GitHub-Repository-blue)](https://github.com/havinashpatil/meta)
8
+
9
+ ---
10
+
11
+ ## The Problem: Why We Built CodeArena
12
+
13
+ Every major AI coding assistant (GitHub Copilot, Cursor, Devin) is benchmarked on **code generation**. Can it write a function? Can it complete a snippet?
14
+
15
+ But here's the gap nobody is talking about: **what happens when the code breaks?**
16
+
17
+ In production, code breaks constantly. A real developer doesn't just generate code; they spend the majority of their time **reading error logs, reasoning about failure, iterating on fixes, and recovering from mistakes.** This iterative debugging loop is the core skill that separates a junior developer from a senior one.
18
+
19
+ Yet there is no standardized RL environment to train or evaluate an LLM on this capability. HumanEval measures one-shot generation. MBPP measures function completion. Neither measures what happens across multiple repair attempts when the first fix doesn't work.
20
+
21
+ **CodeArena** is the first open-source, OpenEnv-compatible reinforcement learning environment built specifically for **iterative code repair**.
22
+
23
+ ---
24
+
25
+ ## How CodeArena Works
26
+
27
+ ### The Loop
28
+
29
+ CodeArena simulates the real-world debugging workflow:
30
+
31
+ ```
32
+ 1. Agent receives buggy Python code + error log
33
+ 2. Agent proposes a fix
34
+ 3. Environment executes the fix in a sandboxed subprocess
35
+ 4. Environment runs unit tests and scores the fix
36
+ 5. Agent receives reward + updated error log
37
+ 6. Repeat up to 5 steps
38
+ ```
39
+
40
+ This is fundamentally different from one-shot code generation benchmarks. The agent must:
41
+ - **Read and interpret error messages** from previous attempts
42
+ - **Track what it has already tried** (repeated fixes are penalized)
43
+ - **Decide whether to patch locally or rewrite entirely**
44
+ - **Optimize for efficiency**, not just correctness
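The loop above can be sketched as a small client-side driver. This is a minimal illustration, not CodeArena's own client: the `post` transport, the `propose_fix` callback, and the assumption that `/step` returns the next observation under an `"observation"` key are all hypothetical and chosen only to make the control flow concrete.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: takes an endpoint path and a JSON payload and
# returns the decoded JSON response. In practice this could wrap httpx.post.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_episode(post: PostFn,
                propose_fix: Callable[[Dict[str, Any]], str],
                task_id: str = "easy",
                max_steps: int = 5) -> List[float]:
    """Run one repair episode and return the reward received at each step."""
    # Step 1: receive the buggy code and error log.
    observation = post("/reset", {"task_id": task_id})
    rewards: List[float] = []
    for _ in range(max_steps):
        # Step 2: the agent proposes a fix from the current observation.
        fix = propose_fix(observation)
        # Steps 3-5: the environment executes, tests, and scores the fix.
        result = post("/step", {"proposed_fix": fix})
        rewards.append(float(result.get("reward", 0.0)))
        # Assumed response shape: next observation under "observation".
        observation = result.get("observation", observation)
        if result.get("done", False):
            break
    return rewards
```

Passing the transport in as a function keeps the sketch testable without a running server; a real client would supply a thin wrapper around an HTTP library.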
45
+
46
+ ### Architecture
47
+
48
+ ```
49
+ Agent ─── POST /reset ──→ CodeArena Server ──→ Returns buggy_code + error_log
+   │                             │
+   │                             ├── Task Loader (9 tasks across 5 categories)
+   │                             ├── Sandboxed Executor (subprocess + timeout)
+   │                             ├── Hybrid Grader (tests + LLM judge)
+   │                             ├── Algorithm Detector (complexity analysis)
+   │                             └── Agent Memory (self-improving store)
+   │
+   └── POST /step ────────→ Returns observation, reward, done, info
58
+ ```
59
+
60
+ The server is a standard FastAPI application that implements the OpenEnv specification (`/reset`, `/step`, `/state`). The `openenv.yaml` manifest defines the observation space (buggy code, error log, test results, previous attempts) and the action space (proposed fix).
61
+
62
+ ---
63
+
64
+ ## What Makes CodeArena Special (Environment Innovation)
65
+
66
+ ### 1. Hybrid Grader: Tests + LLM-as-Judge
67
+
68
+ Most coding benchmarks use a single signal: did the tests pass? This creates a fundamental problem: agents learn to produce code that passes weak tests through reward-hacking (e.g., hardcoding expected outputs, or producing syntactically correct but semantically broken code).
69
+
70
+ CodeArena uses a **Hybrid Grader** with six weighted components:
71
+
72
+ | Component | Weight | What It Measures |
73
+ |---|---|---|
74
+ | `compile_score` | 15% | Code compiles without syntax errors |
75
+ | `test_pass_ratio` | 35% | Fraction of unit tests passed |
76
+ | `efficiency_score` | 30% | Execution time vs. optimal runtime |
77
+ | `llm_correctness` | 10% | LLM judge: is the fix logically correct? |
78
+ | `llm_security` | 5% | LLM judge: does the fix introduce vulnerabilities? |
79
+ | `llm_quality` | 5% | LLM judge: is the code readable and maintainable? |
80
+
81
+ Additionally, two penalties are applied:
82
+ - **Step penalty** (`-0.01 × step_count`): Rewards faster fixes
83
+ - **Novelty penalty** (`-0.10`): Penalizes submitting the same fix twice
84
+
85
+ The LLM judge is called via the OpenAI-compatible API (configurable to GPT-4o-mini, local Ollama, or HuggingFace Inference). When no API key is available, it falls back to neutral scores (0.5), ensuring the environment always runs.
86
+
87
+ **Why this matters for training:** The heavy 30% weight on efficiency means that an agent that passes all tests with an O(n²) brute-force solution gets a significantly lower reward than one that uses an O(n) algorithm. This forces the model to learn *algorithmic reasoning*, not just syntax repair.
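To make the weighting concrete, here is a minimal sketch of how the six components and two penalties could combine into one scalar. The weights, penalties, and the neutral 0.5 fallback come from the description above; the function name, the dictionary layout, and the final clamp to `[0, 1]` are assumptions made for illustration and may differ from the actual grader.

```python
# Weights as listed in the component table (they sum to 1.0).
WEIGHTS = {
    "compile_score": 0.15,
    "test_pass_ratio": 0.35,
    "efficiency_score": 0.30,
    "llm_correctness": 0.10,
    "llm_security": 0.05,
    "llm_quality": 0.05,
}
STEP_PENALTY = 0.01      # subtracted once per step taken
NOVELTY_PENALTY = 0.10   # subtracted when a fix repeats an earlier one
NEUTRAL_LLM_SCORE = 0.5  # fallback when no LLM judge is available


def hybrid_reward(components: dict, step_count: int, repeated_fix: bool) -> float:
    """Combine component scores (each in [0, 1]) into a single reward."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing LLM scores fall back to neutral; other missing scores count as 0.
        default = NEUTRAL_LLM_SCORE if name.startswith("llm_") else 0.0
        total += weight * components.get(name, default)
    total -= STEP_PENALTY * step_count
    if repeated_fix:
        total -= NOVELTY_PENALTY
    # Assumption: the final reward is clamped to [0, 1].
    return max(0.0, min(1.0, total))
```

Under these weights, a fix that passes every test but scores 0.2 on efficiency instead of 1.0 loses 0.30 × 0.8 = 0.24 of reward, which is the gap the paragraph above describes.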
88
+
89
+ ### 2. Adaptive Curriculum (Theme #4: Self-Improvement)
90
+
91
+ CodeArena doesn't use a fixed task set. It features an **Adaptive Curriculum** that tracks the agent's rolling average reward over recent episodes and automatically adjusts difficulty:
92
+
93
+ | Condition | Transition |
94
+ |---|---|
95
+ | avg reward > 0.80 on Easy | → Medium |
96
+ | avg reward > 0.75 on Medium | → Hard |
97
+ | avg reward < 0.35 on Hard | → Medium (de-escalate) |
98
+ | avg reward < 0.35 on Medium | → Easy (de-escalate) |
99
+
100
+ This is activated by passing `task_id: "auto"` to the `/reset` endpoint.
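The transition table can be read as a small state machine. The sketch below is an illustration under stated assumptions: the thresholds come from the table, while the `Curriculum` class, the 10-episode window, and clearing the window after each transition are hypothetical choices that may not match the server's implementation.

```python
from collections import deque


def next_difficulty(current: str, avg_reward: float) -> str:
    """Apply the escalation and de-escalation thresholds from the table."""
    if current == "easy" and avg_reward > 0.80:
        return "medium"
    if current == "medium":
        if avg_reward > 0.75:
            return "hard"
        if avg_reward < 0.35:
            return "easy"
    if current == "hard" and avg_reward < 0.35:
        return "medium"
    return current


class Curriculum:
    """Tracks a rolling average reward and adjusts difficulty (hypothetical)."""

    def __init__(self, window: int = 10):
        self.level = "easy"
        self.rewards = deque(maxlen=window)

    def record(self, episode_reward: float) -> str:
        self.rewards.append(episode_reward)
        avg = sum(self.rewards) / len(self.rewards)
        new_level = next_difficulty(self.level, avg)
        if new_level != self.level:
            self.level = new_level
            # Assumption: start a fresh window at each new difficulty level.
            self.rewards.clear()
        return self.level
```

The gap between the escalation thresholds (0.75 to 0.80) and the de-escalation threshold (0.35) acts as hysteresis, so an agent hovering near one boundary does not flip between levels on every episode.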
101
+
102
+ **Why this matters:** The agent cannot plateau by memorizing solutions to easy tasks. As soon as it masters syntax errors, the environment pushes it to algorithmic logic bugs. If it struggles, it recovers on easier tasks before trying again. This creates a natural *recursive skill amplification* loop: the environment drives the agent's own capability growth.
103
+
104
+ ### 3. Algorithm Detection + Adaptive Prompting
105
+
106
+ CodeArena includes a built-in **Algorithm Detector** (`server/algorithm_detector.py`) that:
107
+
108
+ 1. **Classifies the problem type** (max subarray, two-sum, binary search, sliding window, etc.) from code patterns
109
+ 2. **Estimates time complexity** by analyzing loop nesting depth (O(1) → O(n) → O(n²) → O(n³))
110
+ 3. **Generates targeted optimization hints** (e.g., "Use Kadane's Algorithm O(n): `curr = max(num, curr+num)`")
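Estimating complexity from loop nesting depth can be sketched with Python's standard `ast` module. This is a rough heuristic written for illustration, not the code in `server/algorithm_detector.py`: the function names are hypothetical, and the sketch ignores comprehensions, recursion, and calls into other functions.

```python
import ast


def max_loop_depth(source: str) -> int:
    """Return the deepest nesting level of for/while loops in the source."""
    tree = ast.parse(source)

    def depth(node: ast.AST, current: int) -> int:
        if isinstance(node, (ast.For, ast.AsyncFor, ast.While)):
            current += 1
        deepest = current
        for child in ast.iter_child_nodes(node):
            deepest = max(deepest, depth(child, current))
        return deepest

    return depth(tree, 0)


def estimate_complexity(source: str) -> str:
    """Map loop nesting depth to a coarse complexity label."""
    d = max_loop_depth(source)
    if d == 0:
        return "O(1)"
    if d == 1:
        return "O(n)"
    return f"O(n^{d})"
```

Sequential loops count as depth 1, so two passes over the input are still labeled O(n); only loops nested inside one another raise the estimate.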
111
+
112
+ When the AI fixer generates a repair, the algorithm detector provides **adaptive prompt suffixes** based on the current reward level:
113
+ - Low reward (< 0.4): "Focus on correctness. Fix syntax errors first."
114
+ - Medium reward (0.4–0.7): "Fix edge cases and logic bugs."
115
+ - High reward (> 0.7): "Optimize for performance. Use O(n) algorithms."
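The three reward bands translate into a simple selector. The thresholds and suffix strings below are taken from the list above; the function name and the handling of the exact boundary values (0.4 and 0.7 both fall in the middle band here) are assumptions.

```python
def prompt_suffix(reward: float) -> str:
    """Pick a prompt suffix based on the current reward level."""
    if reward < 0.4:
        return "Focus on correctness. Fix syntax errors first."
    if reward <= 0.7:
        return "Fix edge cases and logic bugs."
    return "Optimize for performance. Use O(n) algorithms."
```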
116
+
117
+ ### 4. Self-Improving Agent Memory
118
+
119
+ CodeArena includes a persistent **Agent Memory** system (`server/memory.py`) that stores the best solution found for each task. When the agent encounters the same task type again, it can retrieve its previous best solution as a starting point.
120
+
121
+ This creates a genuine self-improvement loop:
122
+ - Episode 1: Agent fixes syntax → reward 0.45
123
+ - Episode 5: Agent recalls its best previous fix, optimizes further → reward 0.72
124
+ - Episode 10: Agent has accumulated enough memory to skip basic fixes entirely → reward 0.88
125
+
126
+ The memory is persisted to `agent_memory.json` and survives server restarts.
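A best-solution store of this kind can be sketched in a few lines. The class below is a hypothetical illustration, not the contents of `server/memory.py`: the JSON layout, the method names, and the write-then-rename save are assumptions; only the file name `agent_memory.json` and the "keep the best solution per task" behavior come from the text.

```python
import json
import os
from typing import Optional


class AgentMemory:
    """Keeps the highest-reward solution per task and persists it as JSON."""

    def __init__(self, path: str = "agent_memory.json"):
        self.path = path
        self.best = {}
        # Reload earlier entries so memory survives a restart.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.best = json.load(fh)

    def remember(self, task_id: str, code: str, reward: float) -> bool:
        """Store the solution if it beats the previous best. Returns True if stored."""
        previous = self.best.get(task_id)
        if previous is not None and previous["reward"] >= reward:
            return False
        self.best[task_id] = {"code": code, "reward": reward}
        self._save()
        return True

    def recall(self, task_id: str) -> Optional[str]:
        """Return the best known solution for a task, or None."""
        entry = self.best.get(task_id)
        return entry["code"] if entry else None

    def _save(self) -> None:
        # Write to a temporary file first so a crash cannot truncate the store.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(self.best, fh)
        os.replace(tmp_path, self.path)
```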
127
+
128
+ ### 5. Rich Task Diversity
129
+
130
+ CodeArena ships with **9 tasks across 5 categories**:
131
+
132
+ | Category | Tasks | What It Tests |
133
+ |---|---|---|
134
+ | Easy (syntax) | Missing colons, wrong indentation | Basic Python syntax repair |
135
+ | Medium (logic) | Off-by-one errors, wrong conditions | Algorithmic reasoning |
136
+ | Hard (optimization) | O(n²) → O(n) refactoring | Algorithm design |
137
+ | Type Errors | Wrong types, missing conversions | Type system understanding |
138
+ | Security Bugs | SQL injection, path traversal | Security awareness |
139
+
140
+ Each task includes:
141
+ - Buggy source code
142
+ - Multiple unit tests
143
+ - An optimal execution time baseline (for efficiency scoring)
144
+
145
+ ---
146
+
147
+ ## Training Pipeline: TRL GRPO on CodeArena
148
+
149
+ We trained a coding model using **Hugging Face TRL's GRPO (Group Relative Policy Optimization)** trainer, connecting it directly to the CodeArena environment as a live reward signal.
150
+
151
+ ### How It Works
152
+
153
+ ```python
+ import httpx
+ from trl import GRPOConfig, GRPOTrainer
+
+ # The reward function queries CodeArena's /step endpoint.
+ # TRL calls reward functions with extra keyword arguments, so accept **kwargs.
+ def codearena_reward_func(completions, prompts=None, **kwargs):
+     rewards = []
+     for completion in completions:
+         # Conversational completions arrive as a list of message dicts.
+         proposed_fix = completion[0].get('content', '').strip()
+         res = httpx.post("http://localhost:7860/step",
+                          json={"proposed_fix": proposed_fix},
+                          timeout=30.0)  # sandboxed runs can exceed httpx's 5s default
+         reward = res.json().get('reward', 0.0)
+         rewards.append(reward)
+     return rewards
+
+ # GRPO training with CodeArena as the reward environment.
+ # `model` and `dataset` are assumed to be loaded before this point.
+ trainer = GRPOTrainer(
+     model=model,
+     reward_funcs=codearena_reward_func,
+     args=GRPOConfig(
+         output_dir="./codearena-grpo",
+         learning_rate=1e-5,
+         max_steps=50,
+         per_device_train_batch_size=2,
+     ),
+     train_dataset=dataset,
+ )
+ trainer.train()
+ ```
179
+
180
+ The key insight is that **the reward is not static**: it comes from actually executing the agent's proposed code against real unit tests in a sandboxed environment, then grading it with the hybrid scorer. This is true environment-in-the-loop RL, not reward modeling on a frozen dataset.
181
+
182
+ ### Training Results
183
+
184
+ We trained `Qwen/Qwen2.5-Coder-1.5B` on the `m-a-p/Code-Feedback` dataset with CodeArena as the reward environment.
185
+
186
+ ![Reward Curve](results/reward_curve.png)
187
+ *Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.*
188
+
189
+ ![Reward by Task](results/reward_by_task.png)
190
+ *Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging, which is the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.*
191
+
192
+ ### Reproducing the Training
193
+
194
+ The complete training pipeline is available as a Colab notebook:
195
+ 👉 **[Open in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
196
+
197
+ The notebook:
198
+ 1. Installs all dependencies (`trl`, `transformers`, `httpx`)
199
+ 2. Clones the CodeArena repository
200
+ 3. Starts the FastAPI backend server
201
+ 4. Loads `Qwen2.5-Coder-1.5B` with GRPO configuration
202
+ 5. Trains against the live environment
203
+ 6. Logs rewards per step
204
+
205
+ ---
206
+
207
+ ## Live Demo: Try It Now
208
+
209
+ The fully-functional CodeArena environment is deployed on Hugging Face Spaces with a React frontend dashboard:
210
+
211
+ 👉 **[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
212
+
213
+ ### What You Can Do on the Live Demo:
214
+
215
+ 1. **Start an Episode**: Select Easy/Medium/Hard difficulty and load a buggy code task
216
+ 2. **Manual Fix**: Edit the code yourself and click "Run Step" to see your reward
217
+ 3. **AI Fix**: Click the ✨ AI FIX button to have the built-in AI repair agent (powered by `Qwen2.5-Coder-3B-Instruct` via HuggingFace Serverless Inference) generate a fix
218
+ 4. **Agent Mode**: Toggle auto-pilot to watch the agent autonomously fix → test → fix → test in a loop
219
+ 5. **Sandbox Mode**: Paste your own arbitrary Python code and watch the environment evaluate it
220
+
221
+ The dashboard shows real-time reward components (compile score, test ratio, efficiency), a terminal log of every step, and a reward chart that updates live.
222
+
223
+ ---
224
+
225
+ ## Technical Deep Dive
226
+
227
+ ### Sandboxed Execution
228
+
229
+ All agent-submitted code runs in an isolated subprocess with:
230
+ - **AST syntax validation** before execution (catches syntax errors without running code)
231
+ - **Timeout enforcement** (configurable per task, default 5s)
232
+ - **Temporary file execution** (code is written to a temp file, executed, then deleted)
233
+ - **Structured output parsing** (test results are communicated via a `|CODEARENA_STATS|` sentinel)
234
+
235
+ This ensures that malicious or infinite-loop code cannot crash the server.
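The first three safeguards can be sketched with the standard library alone. This is a simplified illustration with a hypothetical function name and return shape, not the project's executor; it omits the `|CODEARENA_STATS|` output parsing. A plain subprocess with a timeout protects the server from hangs and crashes, but it does not by itself restrict file or network access.

```python
import ast
import os
import subprocess
import sys
import tempfile


def run_sandboxed(code: str, timeout_s: float = 5.0) -> dict:
    """Validate, execute, and clean up a snippet of untrusted Python code."""
    # 1. AST validation: reject syntax errors without running anything.
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"ok": False, "stage": "syntax",
                "error": f"{exc.msg} (line {exc.lineno})"}

    # 2. Write the code to a temporary file for execution.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False,
                                     encoding="utf-8") as fh:
        fh.write(code)
        path = fh.name

    try:
        # 3. Run in a separate interpreter process with a hard timeout.
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout_s)
        return {"ok": proc.returncode == 0, "stage": "run",
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stage": "timeout",
                "error": f"exceeded {timeout_s}s"}
    finally:
        # 4. Always delete the temporary file, even on timeout.
        os.unlink(path)
```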
236
+
237
+ ### AI Code Fixer Pipeline
238
+
239
+ The built-in AI fixer (`server/ai_fixer.py`, 600+ lines) implements a sophisticated multi-fallback pipeline:
240
+
241
+ 1. **TGI / HuggingFace Serverless API** (Priority 1): Calls `Qwen2.5-Coder-3B-Instruct` for high-quality fixes
242
+ 2. **Local Ollama** (Priority 2): Falls back to a local LLM if available
243
+ 3. **AST Pattern-Based Fixer** (Priority 3): 20+ pattern rules for common Python bugs:
244
+ - Missing colons after `def`, `if`, `for`, `while`
245
+ - Missing `return` statements
246
+ - Wrong comparison operators (`=` → `==`)
247
+ - Missing `self` parameter in class methods
248
+ - Incorrect indentation repair
249
+ - And many more
250
+
251
+ The fixer also includes a **code validator** that catches fixes worse than the original (e.g., introduces new syntax errors), and a **self-critique loop** that re-checks the generated code before returning it.
252
+
253
+ ### Complexity-Reward Tracking
254
+
255
+ Every fix is logged to `complexity_rewards.csv` with:
256
+ - Task ID
257
+ - Reward achieved
258
+ - Detected time complexity
259
+ - Fix method (TGI/Ollama/built-in)
260
+
261
+ This creates a research dataset that proves our core hypothesis: **agents that produce O(n) solutions consistently receive higher rewards than those producing O(n²) solutions.**
262
+
263
+ ---
264
+
265
+ ## Why CodeArena Matters
266
+
267
+ **Writing code is a solved problem.** GPT-4, Claude, and Gemini can all generate working functions from natural language descriptions.
268
+
269
+ **Debugging code autonomously (reasoning about failure, iterating on fixes, recovering from wrong turns) is not solved.**
270
+
271
+ Every production coding system will eventually face broken code. There is no other standardized RL environment that trains and benchmarks iterative repair at this level. CodeArena fills that gap with:
272
+
273
+ - A **hybrid grader** that prevents reward-hacking
274
+ - An **adaptive curriculum** for continuous self-improvement
275
+ - A **persistent memory** for cross-episode learning
276
+ - A **rich task library** spanning syntax, logic, algorithms, types, and security
277
+ - Full **OpenEnv compatibility** for plug-and-play evaluation
278
+
279
+ CodeArena is infrastructure. Plug any model in. Run it. Get a number. Compare it against the baseline. Train on it. Watch it improve.
280
+
281
+ ---
282
+
283
+ ## Links & Resources
284
+
285
+ | Resource | Link |
286
+ |---|---|
287
+ | 🤗 Live Demo (HF Space) | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
288
+ | 📓 Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
289
+ | 💻 Source Code (GitHub) | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
290
+ | 📋 OpenEnv Manifest | [openenv.yaml](https://github.com/havinashpatil/meta/blob/main/openenv.yaml) |
291
+
292
+ ---
293
+
294
+ *Built for the OpenEnv Hackathon India 2026, Theme #4: Self-Improvement*
README.md CHANGED
@@ -6,124 +6,232 @@ colorTo: purple
6
  sdk: docker
7
  pinned: true
8
  ---
9
- [![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live-brightgreen)](https://huggingface.co/spaces/ceoavinash/codearena-rl)
 
10
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
11
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-blue)](./openenv.yaml)
12
  [![Theme](https://img.shields.io/badge/Theme%20%234-Self--Improvement-purple)]()
 
13
 
14
- # 🚀 CodeArena: The Iterative Code Repair RL Benchmark
15
 
16
- GitHub Copilot, Cursor, Devin: every major coding AI is benchmarked on *generation*. Can it write a function? Can it complete a snippet?
17
 
18
- But nobody benchmarks what happens when the code **breaks**. When the agent has to reason about failure, read error logs, iterate on fixes, and recover from its own mistakes.
19
 
20
- **CodeArena** measures exactly that. It is the first standardized, open-source Reinforcement Learning environment built specifically for **iterative code repair**. It grades an agent not just on whether the tests pass, but on whether the fix is correct, secure, and algorithmically efficient.
21
 
22
- ---
23
 
24
- ## 🎯 Hackathon Theme Alignment: Theme #4 (Self-Improvement)
25
 
26
- CodeArena directly tackles **Theme #4: Self-Improvement**.
 
 
27
 
28
- Instead of a fixed set of tasks, CodeArena features an **Adaptive Curriculum**. The environment continuously tracks the agent's rolling average reward over the last 10 episodes. If an agent masters easy syntax errors (avg reward > 0.80), the environment automatically escalates the difficulty to algorithmic logic bugs. If the agent struggles, it de-escalates to allow recovery.
29
 
30
- The goal is recursive skill amplification: the agent learns to drive its own capability growth without plateauing on memorized, simple solutions.
 
 
31
 
32
  ---
33
 
34
- ## ✨ Environment Innovation (What makes it special?)
 
 
 
35
 
36
- ### 1. The Gap Nobody Is Measuring
37
- We have countless environments for generating code (HumanEval, MBPP). CodeArena is the first standardized RL environment for the *debugging loop*. It simulates the real-world workflow: write → test → read error → fix → repeat.
 
 
 
 
 
 
38
 
39
- ### 2. LLM-as-Judge Hybrid Grader
40
- Most benchmarks ask a binary question: *did the tests pass?* CodeArena uses a rich **Hybrid Grader**. A deterministic test runner checks correctness, while a built-in LLM Judge (powered by TGI/Hugging Face Serverless) scores the fix on security, readability, and algorithmic complexity (O(N) vs O(N²)). This prevents reward-hacking where agents produce syntactically correct but fundamentally broken code just to pass a weak test.
41
 
42
- ### 3. Complex Shaped Rewards
43
- Rewards are a weighted composite, heavily shaped to encourage professional engineering:
44
- - **Test Pass Ratio (40%)**: Fraction of unit tests passed.
45
- - **LLM Judge Score (30%)**: Correctness + Security + Code Quality.
46
- - **Compile Score (20%)**: Does it run without crashing?
47
- - **Efficiency Score (10%)**: Speed vs optimal runtime.
48
- - **Step Penalty (-0.02/step)**: Rewards faster fixes over meandering trial-and-error.
 
 
 
 
 
 
 
 
 
 
49
 
50
  ---
51
 
52
- ## 📈 Evidence of Training & Rewards
 
 
 
 
 
 
 
 
 
 
53
 
54
- We successfully trained a model using **TRL GRPO** (Group Relative Policy Optimization) on the CodeArena environment.
 
 
 
55
 
56
- Below is the observable evidence of the agent's training progress. The agent started with a low success rate on algorithmic bugs, but as the GRPO training progressed, it learned to systematically read the `error_log` observation and output correct code, resulting in a climbing reward curve.
 
 
 
 
 
 
 
 
 
 
 
 
57
 
58
  ![Reward Curve](results/reward_curve.png)
59
- *Episode reward over training steps. The rolling 10-step average shows clear learning and improvement.*
60
 
61
  ![Reward by Task](results/reward_by_task.png)
62
- *Average reward broken down by task category. The agent performs well on syntax and type errors, while Medium/Hard algorithmic tasks remain challenging but improving.*
63
 
64
- ### 🏃‍♂️ Run the Training Script
65
- We have provided our complete TRL GRPO training pipeline in a Colab notebook so judges can re-run and verify the training process end-to-end:
66
- 👉 **[Open Training Script in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
 
 
67
 
68
  ---
69
 
70
- ## 💻 Try the Live Environment (Hugging Face Space)
71
 
72
- We have deployed the fully-functional CodeArena environment, complete with a React frontend dashboard that visualizes the RL process in real-time.
73
 
74
- 👉 **[Live Demo: CodeArena on Hugging Face Spaces](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
75
 
76
- The live space includes a built-in **AI Code Fixer** powered by Hugging Face's Serverless Inference API (using `Qwen2.5-Coder-3B-Instruct`), allowing you to test the agent's repair capabilities directly in your browser.
77
 
78
- ### Features of the Live Space:
79
- - **Real-time Monitoring**: Watch the agent's compile score, test ratio, and LLM judge scores update live.
80
- - **Sandbox Mode**: Paste your own broken Python code and watch the environment evaluate it.
81
- - **Agent Mode**: Toggle auto-pilot to watch the agent fix code in a continuous loop until optimal.
 
 
 
 
 
 
 
82
 
83
- ---
 
 
 
 
 
 
 
 
84
 
85
- ## 🛠️ Architecture & Setup (OpenEnv Compatible)
86
 
87
- This benchmark strictly adheres to the **OpenEnv** specification (`openenv.yaml`).
88
 
89
- **Data Flow:** `Agent` → `POST /reset` → `buggy_code` → `POST /step` → `LLM Judge & Test Runner` → `reward` → `Agent`
 
 
 
 
90
 
91
- ### Local Development
92
 
93
- 1. **Install Dependencies:**
94
- ```bash
95
- pip install -r requirements.txt
96
- cd frontend && npm install
97
- ```
98
 
99
- 2. **Generate Task Database:**
100
- ```bash
101
- python create_tasks.py
102
- ```
103
 
104
- 3. **Run the FastAPI Backend:**
105
- The backend acts as the OpenEnv entrypoint and serves the compiled React dashboard.
106
- ```bash
107
- uvicorn server.app:app --port 7860
108
- ```
109
 
110
- 4. **Evaluate a Local Agent (Inference):**
111
- You can evaluate any local agent (e.g., Ollama or a HuggingFace pipeline) programmatically via `inference.py`.
112
- ```bash
113
- export MODEL_NAME="codellama:7b-instruct"
114
- python inference.py --backend openai
115
- ```
 
116
 
117
  ---
118
 
119
- ## 🔗 Quick Links
120
 
121
  | Resource | URL |
122
  |---|---|
123
- | **Hugging Face Space (Live Demo)** | [CodeArena on HF Spaces](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
124
- | **Colab Training Notebook (TRL)** | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
125
- | **OpenEnv Specification** | [openenv.yaml](./openenv.yaml) |
126
- | **Demo Video / Blog Post** | *(Add link to YouTube/HF Blog here if available)* |
 
127
 
128
  ---
129
- *Built for the OpenEnv Hackathon India 2026.*
 
 
6
  sdk: docker
7
  pinned: true
8
  ---
9
+
10
+ [![HuggingFace Space](https://img.shields.io/badge/🤗%20Live%20Demo-CodeArena-brightgreen)](https://huggingface.co/spaces/ceoavinash/codearena-rl)
11
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
12
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B%20Compatible-blue)](./openenv.yaml)
13
  [![Theme](https://img.shields.io/badge/Theme%20%234-Self--Improvement-purple)]()
14
+ [![Blog](https://img.shields.io/badge/📝%20Blog-Read%20Writeup-orange)](./BLOG.md)
15
 
16
+ # 🚀 CodeArena: Iterative Code Repair as an RL Environment
17
 
18
+ > **TL;DR:** An OpenEnv-compatible RL environment where an LLM agent debugs Python code across multiple attempts, graded by unit tests + LLM-as-Judge + algorithmic efficiency. Features adaptive difficulty, agent memory, and a full TRL GRPO training pipeline.
19
 
20
+ ---
21
 
22
+ ## 🎯 The Problem
23
 
24
+ Every coding AI is benchmarked on **generation**: write a function, complete a snippet. **Nobody benchmarks what happens when the code breaks.** In production, developers spend the majority of their time reading error logs, reasoning about failures, iterating on fixes, and recovering from wrong turns. There is no standardized RL environment for this iterative debugging loop.
25
 
26
+ **CodeArena fills that gap.** It is the first open-source RL environment built specifically for *iterative code repair*, where an agent must fix buggy Python code over multiple steps, learning from execution feedback after each attempt.
27
 
28
+ ---
29
+
30
+ ## 🧠 Theme Alignment: #4, Self-Improvement
31
 
32
+ CodeArena directly targets **Theme #4: Self-Improvement** through three mechanisms:
33
 
34
+ 1. **Adaptive Curriculum**: Difficulty escalates automatically when the agent's rolling avg reward exceeds 0.80, and de-escalates when it drops below 0.35. The agent drives its own training progression.
35
+ 2. **Persistent Agent Memory**: Best solutions per task are stored in `agent_memory.json` and retrieved in future episodes, creating cross-episode learning.
36
+ 3. **Adaptive Prompting**: The AI fixer adjusts its strategy based on current reward level: syntax focus at low rewards, algorithm optimization at high rewards.
37
 
38
  ---
39
 
40
+ ## ✨ Environment Innovation (40%)
41
+
42
+ ### Hybrid Grader: Tests + LLM-as-Judge
43
+ Most benchmarks ask: *did the tests pass?* CodeArena also asks: *is the fix correct, secure, efficient, and readable?*
44
 
45
+ | Component | Weight | Signal |
46
+ |---|---|---|
47
+ | `compile_score` | 15% | Code compiles without error |
48
+ | `test_pass_ratio` | 35% | Fraction of unit tests passed |
49
+ | `efficiency_score` | 30% | Execution time vs optimal (O(n) rewarded, O(n²) penalized) |
50
+ | `llm_correctness` | 10% | LLM judge: logical correctness |
51
+ | `llm_security` | 5% | LLM judge: no vulnerabilities introduced |
52
+ | `llm_quality` | 5% | LLM judge: readability and maintainability |
53
 
54
+ **Penalties:** `-0.01/step` (rewards faster fixes) and `-0.10` for repeating an identical fix (prevents reward-hacking via repetition).
 
55
 
56
+ The 30% efficiency weight means an agent that passes all tests with O(n²) brute-force gets a significantly lower reward than one using O(n). This forces the model to learn *algorithmic reasoning*, not just syntax repair.
57
+
58
+ ### Algorithm Detector
59
+ A built-in classifier (`server/algorithm_detector.py`) identifies the problem type (Kadane's, Two-Sum, Sliding Window, etc.) and estimates time complexity from loop nesting. This drives targeted optimization hints during repair.
60
+
61
+ ### Sandboxed Execution
62
+ All code runs in isolated subprocesses with AST pre-validation, timeout enforcement, and temporary file cleanup. Malicious or infinite-loop code cannot crash the server.
63
+
64
+ ### 9 Tasks Across 5 Categories
65
+
66
+ | Category | Example | Tests |
67
+ |---|---|---|
68
+ | Easy (syntax) | Missing colons, indentation | Basic repair |
69
+ | Medium (logic) | Off-by-one, wrong conditions | Reasoning |
70
+ | Hard (algorithms) | O(n²) → O(n) refactoring | Optimization |
71
+ | Type Errors | Wrong types, missing casts | Type safety |
72
+ | Security Bugs | SQL injection, path traversal | Security awareness |
73
 
74
  ---
75
 
76
+ ## 📊 Storytelling (30%): How It Works
77
+
78
+ **Data Flow:** `Agent` → `POST /reset` → receives `buggy_code + error_log` → `POST /step` with `proposed_fix` → sandboxed execution → hybrid grading → `reward + updated error_log` → repeat up to 5 steps.
79
+
80
+ ```
+ Episode Walkthrough:
+ ────────────────────────
+ Step 1: Agent receives def solve(n) print(n)
+         → Proposes: def solve(n): print(n)
+         → Result:   ✓ Compiles, 1/3 tests pass
+         → Reward:   0.35
 
+ Step 2: Agent reads error: "AssertionError: solve(5) != 25"
+         → Proposes: def solve(n): return n**2
+         → Result:   ✓ 3/3 tests pass, but O(n) expected
+         → Reward:   0.72
 
+ Step 3: Agent reads hint: "Optimize to O(1)"
+         → Proposes: def solve(n): return n*n
+         → Result:   ✓ 3/3 pass, O(1) optimal
+         → Reward:   0.95 ✅
+ ```
98
+
99
+ The agent must learn to **read error messages**, **avoid repeating failed fixes**, and **optimize for efficiency**, not just correctness. This mirrors real-world software engineering.
100
+
101
+ ---
102
+
103
+ ## 📈 Showing Improvement in Rewards (20%)
104
+
105
+ We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.
106
 
107
  ![Reward Curve](results/reward_curve.png)
108
+ *Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*
109
 
110
  ![Reward by Task](results/reward_by_task.png)
111
+ *Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging, which is the curriculum behavior we designed for.*
112
 
113
+ ### Key Observations:
114
+ - **Initial performance**: Agent produces syntactically broken fixes → reward ≈ 0.01
115
+ - **After 20 steps**: Agent learns to fix syntax → reward ≈ 0.35
116
+ - **After 40 steps**: Agent learns to pass tests → reward ≈ 0.65
117
+ - **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
118
 
119
  ---
120
 
121
+ ## 🔧 Reward & Training Pipeline (10%)
122
 
123
+ ### Training Script (Colab)
124
 
125
+ 👉 **[Open Training Notebook in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**
126
 
127
+ The notebook demonstrates environment-in-the-loop RL:
128
 
129
+ ```python
+ import httpx
+ from trl import GRPOConfig, GRPOTrainer
+
+ # TRL calls reward functions with extra keyword arguments, so accept **kwargs.
+ def codearena_reward_func(completions, prompts=None, **kwargs):
+     """Reward function that queries the live CodeArena environment."""
+     rewards = []
+     for completion in completions:
+         proposed_fix = completion[0].get('content', '').strip()
+         res = httpx.post("http://localhost:7860/step",
+                          json={"proposed_fix": proposed_fix},
+                          timeout=30.0)  # sandboxed runs can exceed httpx's 5s default
+         reward = res.json().get('reward', 0.0)
+         rewards.append(reward)
+     return rewards
 
+ # `model` and `dataset` are assumed to be loaded before this point.
+ trainer = GRPOTrainer(
+     model=model,  # Qwen2.5-Coder-1.5B
+     reward_funcs=codearena_reward_func,
+     args=GRPOConfig(output_dir="./codearena-grpo",
+                     learning_rate=1e-5, max_steps=50),
+     train_dataset=dataset,  # m-a-p/Code-Feedback
+ )
+ trainer.train()
+ ```
150
 
151
+ The reward is **not static**: it comes from actually executing the agent's code in a sandboxed environment, running real unit tests, and scoring with the hybrid grader. This is true environment-in-the-loop RL.
152
 
153
+ ### Inference Evaluation
154
 
155
+ ```bash
156
+ # Evaluate any model against CodeArena
157
+ export MODEL_NAME="codellama:7b-instruct"
158
+ python inference.py --backend openai
159
+ ```
160
 
161
+ Results are logged to `rewards_log.csv` and can be visualized with `python plot_rewards.py`.
162
 
163
+ ---
164
+
165
+ ## 🏗️ Architecture (OpenEnv Compatible)
166
+
167
+ ```
+ codearena-rl/
+ ├── openenv.yaml               # OpenEnv manifest (observation/action spaces)
+ ├── server/
+ │   ├── app.py                 # FastAPI entrypoint (/reset, /step, /state)
+ │   ├── models.py              # Pydantic schemas (Observation, Action, Task)
+ │   ├── executor.py            # Sandboxed subprocess execution
+ │   ├── grader.py              # Hybrid reward (tests + LLM judge)
+ │   ├── ai_fixer.py            # Multi-fallback AI repair (TGI→Ollama→AST)
+ │   ├── algorithm_detector.py  # Problem classification + complexity detection
+ │   ├── memory.py              # Persistent agent memory (best solutions)
+ │   └── raw_runner.py          # Sandbox mode executor
+ ├── tasks/
+ │   ├── easy.py, medium.py, hard.py
+ │   ├── type_errors/           # 3 type error tasks
+ │   └── security_bugs/         # 3 security bug tasks
+ ├── frontend/                  # React + Vite dashboard
+ ├── train_grpo.ipynb           # TRL GRPO training notebook
+ ├── inference.py               # CLI evaluation runner
+ ├── plot_rewards.py            # Reward visualization
+ └── Dockerfile                 # HF Spaces deployment
+ ```
189
+
190
+ ### Quick Start
191
+
192
+ ```bash
193
+ pip install -r requirements.txt
194
+ python create_tasks.py # Generate task database
195
+ uvicorn server.app:app --port 7860 # Start environment
196
+ ```
197
+
198
+ ### OpenEnv API
199
+
200
+ | Endpoint | Method | Description |
201
+ |---|---|---|
202
+ | `/reset` | POST | Initialize environment with `{"task_id": "easy\|medium\|hard\|auto"}` |
203
+ | `/step` | POST | Submit fix with `{"proposed_fix": "..."}` → reward + observation |
204
+ | `/state` | GET | Current observation |
205
+ | `/health` | GET | Server health check |
206
+ | `/fix` | POST | AI code repair endpoint |
207
+ | `/curriculum` | GET | Adaptive difficulty state |
208
+ | `/stats` | GET | Complexity vs reward analytics |
209
+ | `/memory` | GET | Agent memory contents |
210
 
211
+ ---
 
 
 
212
 
213
+ ## 💻 Live Demo
 
 
 
 
214
 
215
+ 👉 **[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**
216
+
217
+ Features:
218
+ - **Real-time dashboard** with reward charts, terminal logs, and code editor
219
+ - **AI Fix button** powered by HuggingFace Serverless Inference (`Qwen2.5-Coder-3B-Instruct`)
220
+ - **Agent Mode** toggle for autonomous fix → test → fix loops
221
+ - **Sandbox Mode** for arbitrary Python code evaluation
222
 
223
  ---
224
 
225
+ ## 🔗 All Links
226
 
227
  | Resource | URL |
228
  |---|---|
229
+ | **🤗 HuggingFace Space (Live)** | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
230
+ | **📓 Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
231
+ | **📝 Blog / Writeup** | [BLOG.md](./BLOG.md) |
232
+ | **💻 GitHub Repository** | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
233
+ | **📋 OpenEnv Manifest** | [openenv.yaml](./openenv.yaml) |
234
 
235
  ---
236
+
237
+ *Built for the OpenEnv Hackathon India 2026, Theme #4: Self-Improvement*