---
title: CodeArena RL Benchmark
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: true
---

[![HuggingFace Space](https://img.shields.io/badge/πŸ€—%20Live%20Demo-CodeArena-brightgreen)](https://huggingface.co/spaces/ceoavinash/codearena-rl)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)
[![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B%20Compatible-blue)](./openenv.yaml)
[![Theme](https://img.shields.io/badge/Theme%20%234-Self--Improvement-purple)]()
[![Blog](https://img.shields.io/badge/πŸ“%20Blog-Read%20Writeup-orange)](https://huggingface.co/spaces/ceoavinash/codearena-rl/discussions/1)

# πŸš€ CodeArena: Iterative Code Repair as an RL Environment

> **TL;DR** β€” An OpenEnv-compatible RL environment where an LLM agent debugs Python code across multiple attempts, graded by unit tests + LLM-as-Judge + algorithmic efficiency. Features adaptive difficulty, agent memory, and a full TRL GRPO training pipeline.

---

## 🎯 The Problem

Every coding AI is benchmarked on **generation** β€” write a function, complete a snippet. **Nobody benchmarks what happens when the code breaks.** In production, developers spend the majority of their time reading error logs, reasoning about failures, iterating on fixes, and recovering from wrong turns. There is no standardized RL environment for this iterative debugging loop.

**CodeArena fills that gap.** It is the first open-source RL environment built specifically for *iterative code repair*, where an agent must fix buggy Python code over multiple steps, learning from execution feedback after each attempt.

---

## 🧠 Theme Alignment: #4 β€” Self-Improvement

CodeArena directly targets **Theme #4: Self-Improvement** through three mechanisms:

1. **Adaptive Curriculum**: Difficulty escalates automatically when the agent's rolling average reward exceeds 0.80 and de-escalates when it drops below 0.35 (see the sketch after this list). The agent drives its own training progression.
2. **Persistent Agent Memory**: Best solutions per task are stored in `agent_memory.json` and retrieved in future episodes, creating cross-episode learning.
3. **Adaptive Prompting**: The AI fixer adjusts its strategy based on current reward level β€” syntax focus at low rewards, algorithm optimization at high rewards.
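
As a concrete reading of the curriculum rule, here is a minimal sketch under stated assumptions (`DIFFICULTIES` and `adjust_difficulty` are hypothetical names for illustration, not the server's actual API):

```python
# Hypothetical sketch of the adaptive curriculum thresholds described above.
DIFFICULTIES = ["easy", "medium", "hard"]

def adjust_difficulty(level: int, rolling_avg_reward: float) -> int:
    """Escalate above 0.80, de-escalate below 0.35, clamp to the valid range."""
    if rolling_avg_reward > 0.80:
        level += 1
    elif rolling_avg_reward < 0.35:
        level -= 1
    return max(0, min(level, len(DIFFICULTIES) - 1))
```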

---

## ✨ Environment Innovation (40%)

### Hybrid Grader β€” Tests + LLM-as-Judge
Most benchmarks ask: *did the tests pass?* CodeArena also asks: *is the fix correct, secure, efficient, and readable?*

| Component | Weight | Signal |
|---|---|---|
| `compile_score` | 15% | Code compiles without error |
| `test_pass_ratio` | 35% | Fraction of unit tests passed |
| `efficiency_score` | 30% | Execution time vs optimal (O(n) rewarded, O(nΒ²) penalized) |
| `llm_correctness` | 10% | LLM judge: logical correctness |
| `llm_security` | 5% | LLM judge: no vulnerabilities introduced |
| `llm_quality` | 5% | LLM judge: readability and maintainability |

**Penalties:** `-0.01/step` (rewards faster fixes) and `-0.10` for repeating an identical fix (prevents reward-hacking via repetition).

The 30% efficiency weight means an agent that passes all tests with O(nΒ²) brute-force gets a significantly lower reward than one using O(n). This forces the model to learn *algorithmic reasoning*, not just syntax repair.
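
To make the weighting concrete, here is a minimal sketch of how the six components and the two penalties could combine (component scores assumed normalized to [0, 1]; `grade` is a hypothetical helper, not the actual `grader.py` API):

```python
# Hypothetical reward combination using the weights from the table above.
WEIGHTS = {
    "compile_score": 0.15, "test_pass_ratio": 0.35, "efficiency_score": 0.30,
    "llm_correctness": 0.10, "llm_security": 0.05, "llm_quality": 0.05,
}

def grade(scores: dict[str, float], step: int, repeated_fix: bool) -> float:
    """Weighted sum of component scores minus the step and repetition penalties."""
    reward = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    reward -= 0.01 * step       # per-step penalty rewards faster fixes
    if repeated_fix:
        reward -= 0.10          # discourages resubmitting an identical fix
    return reward
```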

### Algorithm Detector
A built-in classifier (`server/algorithm_detector.py`) identifies the problem type (Kadane's, Two-Sum, Sliding Window, etc.) and estimates time complexity from loop nesting. This drives targeted optimization hints during repair.
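
To illustrate the loop-nesting heuristic, here is a sketch only; the real detector in `server/algorithm_detector.py` does more:

```python
import ast

def max_loop_depth(source: str) -> int:
    """Estimate for/while nesting depth: 1 suggests O(n), 2 suggests O(n^2)."""
    def depth(node: ast.AST) -> int:
        deepest_child = max((depth(c) for c in ast.iter_child_nodes(node)), default=0)
        return deepest_child + isinstance(node, (ast.For, ast.While))
    return depth(ast.parse(source))
```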

### Sandboxed Execution
All code runs in isolated subprocesses with AST pre-validation, timeout enforcement, and temporary file cleanup. Malicious or infinite-loop code cannot crash the server.
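
The pattern, stripped to its core (a sketch assuming plain `subprocess` isolation; the actual `executor.py` also performs AST pre-validation):

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run untrusted code in a separate interpreter and kill it on timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout: possible infinite loop"
    finally:
        os.unlink(path)  # temporary file cleanup
```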

### 9 Tasks Across 5 Categories

| Category | Example | Tests |
|---|---|---|
| Easy (syntax) | Missing colons, indentation | Basic repair |
| Medium (logic) | Off-by-one, wrong conditions | Reasoning |
| Hard (algorithms) | O(nΒ²) β†’ O(n) refactoring | Optimization |
| Type Errors | Wrong types, missing casts | Type safety |
| Security Bugs | SQL injection, path traversal | Security awareness |

---

## πŸ“Š Storytelling (30%) β€” How It Works

**Data Flow:** `Agent` β†’ `POST /reset` β†’ receives `buggy_code + error_log` β†’ `POST /step` with `proposed_fix` β†’ sandboxed execution β†’ hybrid grading β†’ `reward + updated error_log` β†’ repeat up to 5 steps.
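
From the client side, that loop could look like the sketch below (`propose_fix` is a hypothetical stand-in for your agent; the `done` field is an assumption beyond what the API table specifies):

```python
import httpx

BASE = "http://localhost:7860"
obs = httpx.post(f"{BASE}/reset", json={"task_id": "auto"}).json()

for step in range(5):                      # episodes cap at 5 steps
    fix = propose_fix(obs)                 # hypothetical: your agent/model here
    obs = httpx.post(f"{BASE}/step", json={"proposed_fix": fix}).json()
    print(step, obs.get("reward"))
    if obs.get("done"):                    # assumed termination flag
        break
```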

```
Episode Walkthrough:
────────────────────────
Step 1: Agent receives def solve(n) print(n)
        β†’ Proposes:     def solve(n): print(n)
        β†’ Result:       βœ“ Compiles, 1/3 tests pass
        β†’ Reward:       0.35

Step 2: Agent reads error: "AssertionError: solve(5) != 25"
        β†’ Proposes:     def solve(n): return n**2
        β†’ Result:       βœ“ 3/3 tests pass, but flagged as suboptimal
        β†’ Reward:       0.72

Step 3: Agent reads hint: "Optimize to O(1)"
        β†’ Proposes:     def solve(n): return n*n
        β†’ Result:       βœ“ 3/3 pass, O(1) optimal
        β†’ Reward:       0.95 βœ…
```

The agent must learn to **read error messages**, **avoid repeating failed fixes**, and **optimize for efficiency** β€” not just correctness. This mirrors real-world software engineering.

---

## πŸ“ˆ Showing Improvement in Rewards (20%)

We trained `Qwen/Qwen2.5-Coder-1.5B` using **TRL GRPO** (Group Relative Policy Optimization) with CodeArena as the live reward environment.

![Fig 1: Reward Curve](results/reward_curve.png)
*Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning progression from near-zero to consistent 0.65+ rewards.*

![Fig 2: Reward by Task](results/reward_by_task.png)
*Fig 2: Average reward by task category. Easy/type-error tasks are mastered first; algorithmic optimization remains challenging β€” exactly the curriculum behavior we designed for.*

![Fig 3: Task Performance Matrix](results/task_performance_matrix.png)
*Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.*

![Fig 4: Complexity Distribution](results/complexity_distribution.png)
*Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.*

![Fig 5: Fixer Method Boxplot](results/method_boxplot.png)
*Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.*

![Fig 6: Cumulative Reward](results/cumulative_reward.png)
*Fig 6: Cumulative Reward over time, highlighting the total accumulated reward across multiple episodes.*

![Fig 7: Method Performance Comparison](results/method_performance.png)
*Fig 7: LLM Fixer Method Performance Comparison scatter plot showing the individual performance data points of Ollama vs Builtin methods.*

### Key Observations
- **Initial performance**: Agent produces syntactically broken fixes β†’ reward β‰ˆ 0.01
- **After 20 steps**: Agent learns to fix syntax β†’ reward β‰ˆ 0.35
- **After 40 steps**: Agent learns to pass tests β†’ reward β‰ˆ 0.65
- **Steady improvement**: Rolling average trends upward, with hard tasks remaining the frontier challenge
- **Method Effectiveness (Fig 5)**: The LLM-based fixer significantly outperforms the static pattern-based approach.

---

## πŸ”§ Reward & Training Pipeline (10%)

### Training Script (Colab)

πŸ‘‰ **[Open Training Notebook in Google Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb)**

The notebook demonstrates environment-in-the-loop RL:

```python
import httpx
from trl import GRPOConfig, GRPOTrainer

def codearena_reward_func(completions, prompts, **kwargs):
    """Reward function that queries the live CodeArena environment."""
    rewards = []
    for completion in completions:
        # Conversational format: grade the content of the generated turn.
        proposed_fix = completion[0].get('content', '').strip()
        res = httpx.post("http://localhost:7860/step",
                         json={"proposed_fix": proposed_fix})
        rewards.append(res.json().get('reward', 0.0))
    return rewards

trainer = GRPOTrainer(
    model=model,  # Qwen2.5-Coder-1.5B
    reward_funcs=codearena_reward_func,
    args=GRPOConfig(output_dir="./codearena-grpo",
                    learning_rate=1e-5, max_steps=50),
    train_dataset=dataset,  # m-a-p/Code-Feedback
)
trainer.train()
```

The reward is **not static** β€” it comes from actually executing the agent's code in a sandboxed environment, running real unit tests, and scoring with the hybrid grader. This is true environment-in-the-loop RL.

### Inference Evaluation

```bash
# Evaluate any model against CodeArena
export MODEL_NAME="codellama:7b-instruct"
python inference.py --backend openai
```

Results are logged to `rewards_log.csv` and can be visualized with `python plot_rewards.py`.

---

## πŸ—οΈ Architecture (OpenEnv Compatible)

```
codearena-rl/
β”œβ”€β”€ openenv.yaml              # OpenEnv manifest (observation/action spaces)
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py                # FastAPI entrypoint (/reset, /step, /state)
β”‚   β”œβ”€β”€ models.py             # Pydantic schemas (Observation, Action, Task)
β”‚   β”œβ”€β”€ executor.py           # Sandboxed subprocess execution
β”‚   β”œβ”€β”€ grader.py             # Hybrid reward (tests + LLM judge)
β”‚   β”œβ”€β”€ ai_fixer.py           # Multi-fallback AI repair (TGIβ†’Ollamaβ†’AST)
β”‚   β”œβ”€β”€ algorithm_detector.py # Problem classification + complexity detection
β”‚   β”œβ”€β”€ memory.py             # Persistent agent memory (best solutions)
β”‚   └── raw_runner.py         # Sandbox mode executor
β”œβ”€β”€ tasks/
β”‚   β”œβ”€β”€ easy.py, medium.py, hard.py
β”‚   β”œβ”€β”€ type_errors/          # 3 type error tasks
β”‚   └── security_bugs/        # 3 security bug tasks
β”œβ”€β”€ frontend/                 # React + Vite dashboard
β”œβ”€β”€ train_grpo.ipynb          # TRL GRPO training notebook
β”œβ”€β”€ inference.py              # CLI evaluation runner
β”œβ”€β”€ plot_rewards.py           # Reward visualization
└── Dockerfile                # HF Spaces deployment
```

### Quick Start

```bash
pip install -r requirements.txt
python create_tasks.py           # Generate task database
uvicorn server.app:app --port 7860  # Start environment
```

### OpenEnv API

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Initialize environment with `{"task_id": "easy\|medium\|hard\|auto"}` |
| `/step` | POST | Submit fix with `{"proposed_fix": "..."}` β†’ reward + observation |
| `/state` | GET | Current observation |
| `/health` | GET | Server health check |
| `/fix` | POST | AI code repair endpoint |
| `/curriculum` | GET | Adaptive difficulty state |
| `/stats` | GET | Complexity vs reward analytics |
| `/memory` | GET | Agent memory contents |
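
A quick smoke test over the read-only endpoints (the response schemas are not documented here, so this just prints the raw JSON):

```python
import httpx

BASE = "http://localhost:7860"
for path in ("/health", "/state", "/curriculum", "/stats", "/memory"):
    print(path, httpx.get(f"{BASE}{path}").json())
```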

---

## πŸ’» Live Demo

πŸ‘‰ **[https://huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl)**

Features:
- **Real-time dashboard** with reward charts, terminal logs, and code editor
- **AI Fix button** powered by HuggingFace Serverless Inference (`Qwen2.5-Coder-3B-Instruct`)
- **Agent Mode** toggle for autonomous fix β†’ test β†’ fix loops
- **Sandbox Mode** for arbitrary Python code evaluation

---

## πŸ”— All Links

| Resource | URL |
|---|---|
| **πŸ€— HuggingFace Space (Live)** | [huggingface.co/spaces/ceoavinash/codearena-rl](https://huggingface.co/spaces/ceoavinash/codearena-rl) |
| **πŸ““ Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/havinashpatil/meta/blob/main/train_grpo.ipynb) |
| **πŸ“ Blog / Writeup** | [Read on HuggingFace](https://huggingface.co/spaces/ceoavinash/codearena-rl/discussions/1) |
| **πŸ’» GitHub Repository** | [github.com/havinashpatil/meta](https://github.com/havinashpatil/meta) |
| **πŸ“‹ OpenEnv Manifest** | [openenv.yaml](./openenv.yaml) |
| **πŸ“Ί YouTube Video** | [https://youtu.be/LmsJvAMTdCY](https://youtu.be/LmsJvAMTdCY) |

---

*Built for the OpenEnv Hackathon India 2026 β€” Theme #4: Self-Improvement*