# πŸ” CodeReview OpenEnv

An **OpenEnv-compliant AI training environment** that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps β€” exactly as a senior engineer would in a real pull-request workflow.

---

## Why Code Review?

Code review is one of the highest-leverage tasks in software engineering. It is:

- **Real-world**: Every professional software team does it daily
- **Structured enough to grade**: Issues have objectively correct or incorrect assessments
- **Rich in partial signal**: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
- **Scalable in difficulty**: Easy (bugs only) β†’ Hard (all categories + written summary)

This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.

---

## Environment Description

```
CodeReviewEnv
β”œβ”€β”€ Task 1 – Easy    : Bug detection + Code style        (calculator.py, 31 lines)
β”œβ”€β”€ Task 2 – Medium  : Security + Performance audit      (user_service.py, 55 lines)
└── Task 3 – Hard    : Full review, all 5 categories     (data_pipeline.py, 49 lines)
```

Each task presents a Python snippet containing intentional flaws. The agent submits `ReviewComment` objects across one or more steps, then finalises with `submit=True`. A deterministic grader scores the review against ground-truth issues.

---

## Observation Space

What the agent sees on each step:

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step (0-indexed) |
| `snippet.file_name` | `str` | Logical file name (e.g. `calculator.py`) |
| `snippet.source` | `str` | Full Python source code |
| `instructions` | `str` | Review scope, difficulty, and guidance |
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
| `feedback` | `str \| None` | Env feedback on the last action |
| `done` | `bool` | Whether the episode has ended |

---

## Action Space

What the agent submits on each step:

```json
{
  "comments": [
    {
      "line": 10,
      "category": "security",
      "severity": "critical",
      "message": "SQL injection via string interpolation in query.",
      "suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
    }
  ],
  "summary": "Overall review summary (required for task_3_hard)",
  "submit": true
}
```

| Field | Type | Values |
|---|---|---|
| `comments[].line` | `int \| null` | 1-indexed line number; `null` for file-level |
| `comments[].category` | `enum` | `bug`, `security`, `performance`, `style`, `documentation` |
| `comments[].severity` | `enum` | `low`, `medium`, `high`, `critical` |
| `comments[].message` | `str` | 5–500 chars |
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
| `summary` | `str \| null` | Required for `task_3_hard`, optional otherwise |
| `submit` | `bool` | `true` finalises the review and triggers the grader |
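
The same schema, with the validation rules from the table, can be sketched in Python. This is an illustrative model only (the served environment uses Pydantic models in `env/models.py`):

```python
from dataclasses import dataclass
from typing import Optional

CATEGORIES = {"bug", "security", "performance", "style", "documentation"}
SEVERITIES = {"low", "medium", "high", "critical"}

@dataclass
class ReviewComment:
    message: str                     # 5-500 chars
    category: str
    severity: str
    line: Optional[int] = None       # 1-indexed; None for file-level comments
    suggestion: Optional[str] = None # optional fix suggestion

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if not 5 <= len(self.message) <= 500:
            raise ValueError("message must be 5-500 characters")

@dataclass
class ReviewAction:
    comments: list
    summary: Optional[str] = None    # required for task_3_hard
    submit: bool = False             # True finalises and triggers the grader
```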

---

## Reward Function

Rewards are shaped to provide signal over the **full trajectory**, not just on terminal submit.

### Per-step (incremental) rewards

| Event | Reward |
|---|---|
| New valid comment added | `+0.05` per comment (max `+0.15`) |
| Progress signal (grader score delta) | `+0.5 Γ— Ξ”score` |
| Empty step (no new comments) | `βˆ’0.05` |
| Spam (> 2.5Γ— expected comments) | `βˆ’0.10` |

### On `submit=True` (terminal)

```
submit_reward = score Γ— 0.8 + (0.2 if score β‰₯ threshold else βˆ’0.2)
```

### Per-category penalties (applied to terminal grader score)

| Event | Penalty |
|---|---|
| False positive (fabricated issue) | `βˆ’0.08–0.12` per comment |
| Missed CRITICAL security issue | `βˆ’0.15–0.20` |
| Missed HIGH issue | `βˆ’0.08–0.10` |
| No summary on task 3 | `βˆ’0.10` |

All rewards are clipped to `[βˆ’1.0, 1.0]`.
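
The shaping above can be condensed into two small functions. A minimal sketch mirroring the tables (the authoritative logic lives in `env/environment.py`):

```python
def step_reward(new_comments: int, score_delta: float,
                total_comments: int, expected_comments: int) -> float:
    """Per-step shaped reward (illustrative)."""
    r = 0.0
    if new_comments == 0:
        r -= 0.05                              # empty-step penalty
    else:
        r += min(0.05 * new_comments, 0.15)    # capped per-comment bonus
    r += 0.5 * score_delta                     # progress signal from grader
    if total_comments > 2.5 * expected_comments:
        r -= 0.10                              # spam penalty
    return max(-1.0, min(1.0, r))              # clip to [-1, 1]

def submit_reward(score: float, threshold: float) -> float:
    """Terminal reward on submit=True (illustrative)."""
    r = score * 0.8 + (0.2 if score >= threshold else -0.2)
    return max(-1.0, min(1.0, r))
```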

---

## Task Descriptions

### Task 1 – Easy: Bug Detection & Style Review
**File**: `calculator.py` (31 lines) | **Max steps**: 5 | **Pass threshold**: 0.55

Covers basic utility functions: `divide`, `average`, `celsius_to_fahrenheit`, `find_max`, `count_words`.

**Ground-truth issues (6)**:
- `divide()` β€” no zero-division guard (HIGH bug)
- `average()` β€” crashes on empty list (HIGH bug)
- `celsius_to_fahrenheit` β€” off-by-one (+31 vs +32) (MEDIUM bug)
- `find_max()` β€” crashes on empty list (MEDIUM bug)
- `for i in range(len(lst))` β€” unpythonic iteration (LOW style)
- Manual `Counter` reimplementation (LOW style)
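
The real snippet lives in `corpus/snippets.py`; flaws of the kinds listed above look roughly like this (illustrative, not the actual corpus file):

```python
# Illustrative only -- the graded snippet is in corpus/snippets.py.
def divide(a, b):
    return a / b               # HIGH bug: no zero-division guard

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 31      # MEDIUM bug: off-by-one, should be + 32

def find_max(lst):
    best = lst[0]              # MEDIUM bug: crashes on an empty list
    for i in range(len(lst)):  # LOW style: unpythonic; iterate directly
        if lst[i] > best:
            best = lst[i]
    return best
```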

---

### Task 2 – Medium: Security & Performance Audit
**File**: `user_service.py` (55 lines) | **Max steps**: 7 | **Pass threshold**: 0.60

A SQLite-backed user management service with authentication.

**Ground-truth issues (6)**:
- SQL injection in `get_user()` β€” f-string query (CRITICAL security)
- MD5 password hashing in `create_user()` (CRITICAL security)
- SQL injection in `delete_user()` (CRITICAL security)
- MD5 reuse in `authenticate()` (HIGH security)
- `fetchall()` on unbounded table (HIGH performance)
- New DB connection per query, no pooling (MEDIUM performance)
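
The headline issue, f-string SQL, is worth seeing side by side with its fix. A minimal sketch (not the corpus file) showing why the grader treats it as CRITICAL:

```python
import sqlite3

def get_user_vulnerable(conn, username):
    # CRITICAL: f-string interpolation allows SQL injection
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{username}'"
    ).fetchone()

def get_user_fixed(conn, username):
    # Parameterised query: the driver treats the value as data, not SQL
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

# A crafted input: matches every row in the vulnerable version,
# matches nothing in the parameterised one.
payload = "nobody' OR '1'='1"
```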

---

### Task 3 – Hard: Comprehensive Code Review
**File**: `data_pipeline.py` (49 lines) | **Max steps**: 10 | **Pass threshold**: 0.65

An analytics data pipeline with CSV loading, row transformation, caching, and stats.

**Ground-truth issues (13 across all 5 categories)**:
- `subprocess.run(shell=True)` with user input β€” OS command injection (CRITICAL security)
- `pickle.loads()` on arbitrary cache data β€” RCE risk (CRITICAL security)
- Pickling into module-level dict (HIGH security)
- `compute_stats()` ZeroDivisionError on empty data (HIGH bug)
- Missing `"value"` key β†’ silent KeyError (MEDIUM bug)
- `open()` without encoding (MEDIUM bug)
- Two-pass iteration in `compute_stats` (MEDIUM performance)
- Subprocess per row instead of batching (MEDIUM performance)
- `str(stats)` instead of JSON export (LOW style)
- Module-level mutable global cache (LOW style)
- `load_data()` missing docstring (LOW documentation)
- `process_row()` missing docstring (LOW documentation)
- Insufficient module-level docstring (LOW documentation)

A **written summary** is required (`summary` field) β€” absence incurs a `βˆ’0.10` score penalty.
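
Several of these issues have small mechanical fixes. An illustrative repair for the stats-related bugs (empty-data crash, missing `"value"` key, two-pass iteration); the hypothetical `compute_stats` below is not the corpus code:

```python
def compute_stats(rows):
    """Single-pass stats with guards (illustrative fix, not corpus code).

    Addresses three listed issues at once: the ZeroDivisionError on
    empty data, the silent KeyError on a missing "value" key, and the
    two-pass iteration over the rows.
    """
    count = total = 0
    for row in rows:
        value = row.get("value")   # tolerate a missing "value" key
        if value is None:
            continue
        count += 1
        total += value
    if count == 0:                 # guard the empty-data case
        return {"count": 0, "mean": None}
    return {"count": count, "mean": total / count}
```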

---

## Expected Baseline Scores (gpt-4o)

| Task | Score | Pass? | Notes |
|---|---|---|---|
| `task_1_easy` | ~0.75 | βœ… | GPT-4o reliably spots ZeroDivisionError and off-by-one |
| `task_2_medium` | ~0.65 | βœ… | SQL injection found; MD5 usually flagged; perf issues partial |
| `task_3_hard` | ~0.55 | ❌ | Pickle RCE and shell injection found; docs often missed; below the 0.65 threshold |

---

## Setup & Usage

### Option A β€” Docker (recommended)

```bash
# Build
docker build -t code-review-env .

# Run (port 7860)
docker run -p 7860:7860 code-review-env

# Test it
curl http://localhost:7860/health
```

### Option B β€” Local Python

```bash
# Install dependencies
pip install -r requirements.txt

# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Open docs
open http://localhost:7860/docs
```

### Run the test suite

```bash
pytest tests/ -v
# Expected: 25 passed
```

### Run the baseline agent

```bash
export OPENAI_API_KEY=sk-...

# All tasks (direct mode β€” no server needed)
python baseline_agent.py

# Single task
python baseline_agent.py --task task_2_medium

# Against a running HTTP server
python baseline_agent.py --mode http --base-url http://localhost:7860
```

---

## API Reference

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | HTML landing page |
| `/health` | GET | Health check |
| `/tasks` | GET | List all task specs |
| `/reset` | POST | Start or restart an episode |
| `/step` | POST | Submit an action |
| `/state` | GET | Get full serialisable state |
| `/docs` | GET | Interactive Swagger UI |

### Example: Full episode via curl

```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
  -H 'Content-Type: application/json' \
  -d '{"task_id": "task_1_easy", "session_id": "demo"}'

# 2. Step
curl -X POST http://localhost:7860/step \
  -H 'Content-Type: application/json' \
  -d '{
    "session_id": "demo",
    "action": {
      "comments": [
        {
          "line": 2,
          "category": "bug",
          "severity": "high",
          "message": "divide() will raise ZeroDivisionError when b is 0.",
          "suggestion": "Guard with: if b == 0: raise ValueError"
        }
      ],
      "submit": true
    }
  }'

# 3. Check state
curl "http://localhost:7860/state?session_id=demo"
```
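
The same episode can be driven from Python. A minimal stdlib-only client sketch, assuming a server on `localhost:7860` and the request shapes shown in the curl example (helper names here are illustrative, not part of the project):

```python
import json
import urllib.request

BASE = "http://localhost:7860"

def build_step_payload(session_id, comments, submit=True, summary=None):
    """Assemble a /step request body matching the action schema above."""
    action = {"comments": comments, "submit": submit}
    if summary is not None:
        action["summary"] = summary
    return {"session_id": session_id, "action": action}

def post(path, payload):
    """POST a JSON payload and decode the JSON reply (needs a running server)."""
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the server up):
# post("/reset", {"task_id": "task_1_easy", "session_id": "demo"})
# post("/step", build_step_payload("demo", [
#     {"line": 2, "category": "bug", "severity": "high",
#      "message": "divide() will raise ZeroDivisionError when b is 0."}
# ]))
```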

---

## Project Structure

```
openenv-code-review/
β”œβ”€β”€ app.py                  # FastAPI HTTP server
β”œβ”€β”€ openenv.yaml            # OpenEnv spec metadata
β”œβ”€β”€ Dockerfile              # Container definition
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ baseline_agent.py       # gpt-4o baseline inference script
β”‚
β”œβ”€β”€ env/
β”‚   β”œβ”€β”€ models.py           # Pydantic typed models (Observation, Action, Reward, …)
β”‚   └── environment.py      # CodeReviewEnv β€” step() / reset() / state()
β”‚
β”œβ”€β”€ corpus/
β”‚   └── snippets.py         # Python snippets with ground-truth issues
β”‚
β”œβ”€β”€ graders/
β”‚   └── graders.py          # Task1Grader, Task2Grader, Task3Grader
β”‚
└── tests/
    └── test_env.py         # 25-test pytest suite (all passing)
```

---

## Deploying to Hugging Face Spaces

1. Create a new Space with **Docker** SDK
2. Push this repository to the Space
3. Set `OPENAI_API_KEY` as a Space secret (only needed for baseline script)
4. The Space will auto-build and expose port 7860

```yaml
# README.md frontmatter for HF Spaces
---
title: CodeReview OpenEnv
emoji: πŸ”
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
  - openenv
  - code-review
  - ai-agent
  - evaluation
---
```

---

## License

MIT