---
title: Code Review Environment
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
  - openenv
  - reinforcement-learning
  - code-review
---

# Code Review OpenEnv Benchmark

## 🚀 Scaler March 2026 Hackathon Submission

- **Author:** Dolphin-Syndrom
- **Type:** OpenEnv Benchmark Environment
- **Focus:** Evaluating LLM agents on security-aware code review tasks

---

## ⚡ TL;DR

A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.

- **5 tasks** with progressive difficulty (extra_easy → easy → medium → hard → expert)
- **12-tag issue taxonomy** covering security, logic, and robustness flaws
- **Multi-dimensional grading**: recall + quality bonus + severity bonus − precision penalty
- **Iterative refinement**: feedback-driven multi-step improvement within episodes
- **32 unit tests** covering graders, environment lifecycle, and task coverage
- Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
- Fully OpenEnv compliant

---

> Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.
>
> Suitable for benchmarking agent performance, reward-shaping strategies, and an agent's ability to detect issues without hallucinating false positives.

## What Makes This Environment Unique

### 1. Iterative Refinement Mechanic

Unlike single-shot evaluation environments, this benchmark provides **structured feedback after each step** that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:

```
Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
Step 3: Final attempt with all accumulated feedback
```

This models how real code review works: reviewers iterate based on discussion and feedback.
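
A minimal sketch of how category-level hints like the one in Step 1 could be derived. The grouping follows the taxonomy table in this README; the function name `refinement_hints` and the hint wording for the non-security categories are hypothetical and not taken from the environment's source.

```python
# Category labels follow the taxonomy table in this README; the exact hint
# wording produced by the real environment is an assumption.
CATEGORY_OF = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness flaw",
    "timing_attack": "robustness flaw",
    "improper_error_handling": "robustness flaw",
    "type_error": "input handling flaw",
    "integer_overflow": "input handling flaw",
    "missing_input_validation": "input handling flaw",
}


def refinement_hints(planted, submitted):
    """Return one hint per missed category, never naming the missed tags."""
    missed = set(planted) - set(submitted)
    categories = sorted({CATEGORY_OF[tag] for tag in missed})
    return [f"Hint: look for {category}" for category in categories]


hints = refinement_hints(
    planted=["sql_injection", "hardcoded_secret"],
    submitted=["hardcoded_secret"],
)
print(hints)  # ['Hint: look for security vulnerability']
```

Hinting at the category rather than the tag keeps the refinement step non-trivial: the agent still has to decide which of the tags in that category applies.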

### 2. Multi-Dimensional Reward Function

The grading system evaluates four orthogonal dimensions simultaneously:

| Component | Value | Signal |
|---|---|---|
| **Recall reward** | `\|correct\| / \|planted\|` | Comprehensive detection |
| **Quality bonus** | +0.05 per issue | Keyword-rich explanations |
| **Severity bonus** | +0.05 | Correct risk assessment |
| **Precision penalty** | −0.10 per false positive | Anti-hallucination |

This forces agents to balance thoroughness against precision, a core tension in real code review.

### 3. Full 12-Tag Taxonomy Coverage

Every tag in the taxonomy is exercised across the 5 tasks:

| Category | Tags | Task Coverage |
|---|---|---|
| Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy |
| Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert |
| Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard |
| Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert |

## Architecture

```mermaid
graph TB
    Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
    Agent -->|POST /step| Server
    Server --> Env[CodeReviewEnvironment]
    Env --> Tasks[Task Registry - 5 tasks]
    Env --> Grader[Deterministic Grader]
    Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
    Score -->|observation + reward + feedback| Agent
    Server -->|GET /health| Health[Health Check]
    Server -->|POST /grader| Grader
    Server -->|POST /baseline| Baseline[Rule-Based Baseline]
    Server -->|Gradio UI| Dashboard[Analytics Dashboard]

    style Agent fill:#58a6ff,stroke:#333
    style Server fill:#3fb950,stroke:#333
    style Grader fill:#f0883e,stroke:#333
    style Dashboard fill:#bc8cff,stroke:#333
```

## Environment Specification

### Objective

For each episode, the agent sees a Python code snippet containing planted issues and must:

1. Identify issues using tags from a 12-item `ISSUE_TAXONOMY`
2. Assess overall severity (`low`, `medium`, `high`, `critical`)
3. Articulate findings in a human-readable `review_comment`
4. Iteratively refine based on environment feedback across up to 3 steps

### Observation Space

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Current task identifier |
| `file_name` | string | File under review |
| `task_description` | string | Review instructions |
| `code_snippet` | string | Python code with planted issues |
| `feedback` | string | Previous step feedback with refinement hints |
| `step_number` | integer | Current step (0 after reset) |
| `available_issue_tags` | array | Allowed taxonomy tags |

### Action Space

| Field | Type | Description |
|---|---|---|
| `issues_found` | list[str] | Tags from ISSUE_TAXONOMY |
| `severity` | enum | `low` / `medium` / `high` / `critical` |
| `review_comment` | string | Explanation of identified issues |
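
The action fields above can be sketched with standard-library types. The project's own models are Pydantic classes in `models.py`; the `ReviewAction` and `Severity` names below are hypothetical stand-ins, and rejecting unknown tags at construction time is an assumption (the real environment may instead score them as false positives).

```python
from dataclasses import dataclass
from enum import Enum

# The 12 tags listed in this README's taxonomy table.
ISSUE_TAXONOMY = frozenset({
    "null_pointer", "missing_return", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "race_condition", "timing_attack", "improper_error_handling",
    "type_error", "integer_overflow", "missing_input_validation",
})


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class ReviewAction:
    issues_found: list[str]
    severity: Severity
    review_comment: str = ""

    def __post_init__(self):
        # Accept plain strings such as "high"; unknown values raise ValueError.
        self.severity = Severity(self.severity)
        unknown = [tag for tag in self.issues_found if tag not in ISSUE_TAXONOMY]
        if unknown:
            raise ValueError(f"tags outside ISSUE_TAXONOMY: {unknown}")


action = ReviewAction(
    issues_found=["sql_injection", "hardcoded_secret"],
    severity="high",
    review_comment="Query is built by string concatenation; API key is hardcoded.",
)
print(action.severity)
```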

### Episode Flow

1. `reset(task_id)` loads a task and returns the initial observation
2. Agent receives code snippet and available tags
3. Agent submits review via `step(action)`
4. Environment returns observation with score, feedback, and refinement hints
5. Agent can use feedback to improve on subsequent steps
6. The episode ends when the score is ≥ 0.95 or the step limit (3) is reached
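
The flow above can be illustrated with a toy loop. `ToyEnv` is a hypothetical in-memory stand-in that scores on recall only; the tuple returned by `step` and the dictionary-shaped observation are assumptions for this sketch, not the actual interface of `CodeReviewEnvironment`.

```python
MAX_STEPS = 3             # step limit from the episode flow
SUCCESS_THRESHOLD = 0.95  # score at which the episode ends early


class ToyEnv:
    """Stand-in environment: recall-only scoring, no bonuses or penalties."""

    def __init__(self, planted):
        self.planted = set(planted)
        self.task_id = None
        self.step_number = 0

    def reset(self, task_id):
        self.task_id = task_id
        self.step_number = 0
        return {"task_id": task_id, "step_number": 0, "feedback": ""}

    def step(self, issues_found):
        self.step_number += 1
        found = self.planted & set(issues_found)
        score = len(found) / len(self.planted)
        done = score >= SUCCESS_THRESHOLD or self.step_number >= MAX_STEPS
        missed = len(self.planted) - len(found)
        obs = {
            "task_id": self.task_id,
            "step_number": self.step_number,
            "feedback": f"{missed} issue(s) still missed",
        }
        return obs, score, done


def run_episode(env, policy, task_id):
    """Run one episode and return the score observed at each step."""
    obs = env.reset(task_id)
    scores, done = [], False
    while not done:
        obs, score, done = env.step(policy(obs))
        scores.append(score)
    return scores


def policy(obs):
    # Toy policy: reports one more tag on every step.
    guesses = ["sql_injection", "hardcoded_secret"]
    return guesses[: obs["step_number"] + 1]


scores = run_episode(ToyEnv(["sql_injection", "hardcoded_secret"]), policy, "task_medium")
print(scores)  # [0.5, 1.0]
```

Here the episode stops after two steps because the second score exceeds the 0.95 threshold; a policy that never improves would be cut off after the third step.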

## Tasks

| Task | Difficulty | Planted Issues | File |
|---|---|---|---|
| `task_extra_easy` | Extra Easy | `index_out_of_bounds` | data_utils.py |
| `task_easy` | Easy | `null_pointer`, `missing_return` | user_service.py |
| `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | auth.py |
| `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | payments.py |
| `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | file_processor.py |
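
As a quick consistency check, the table above can be written as a plain mapping and tested for full taxonomy coverage. The `TASKS` dictionary is a hypothetical representation for illustration; the actual structure in `server/tasks.py` may differ.

```python
# Planted issues per task, copied from the table above.
TASKS = {
    "task_extra_easy": {"index_out_of_bounds"},
    "task_easy": {"null_pointer", "missing_return"},
    "task_medium": {"sql_injection", "hardcoded_secret"},
    "task_hard": {"race_condition", "improper_error_handling", "timing_attack"},
    "task_expert": {
        "path_traversal",
        "integer_overflow",
        "missing_input_validation",
        "type_error",
    },
}

covered_tags = set().union(*TASKS.values())
planted_total = sum(len(tags) for tags in TASKS.values())

print(len(covered_tags))  # 12
# Equal counts mean every tag is planted in exactly one task.
print(planted_total == len(covered_tags))  # True
```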

## Reward Design

**Summary:** A correct review scores close to 1.0, while random tagging is pulled down by the false-positive penalty, so the reward remains a meaningful learning signal.

The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.

Core components:

- **Recall reward**: fractional points for correctly identified issues
- **Quality bonus**: +0.05 per correct issue with a matching keyword in the comment
- **Severity bonus**: +0.05 when severity matches expected level for task difficulty
- **Precision penalty**: −0.10 per hallucinated (false-positive) issue
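
A minimal sketch of how these four components might combine into one score. The function `grade`, the `KEYWORDS` map, and the expected severity used in the example are hypothetical; clamping to the 0.0–1.0 range is an assumption based on the stated score range, and the real logic lives in `server/graders.py`.

```python
# Hypothetical keyword map; the real grader's keywords are not documented here.
KEYWORDS = {
    "sql_injection": ("sql", "injection", "parameterized"),
    "hardcoded_secret": ("secret", "hardcoded", "credential"),
}


def grade(planted, submitted, comment, severity, expected_severity):
    """Combine recall, quality bonus, severity bonus, and precision penalty."""
    planted, submitted = set(planted), set(submitted)
    correct = planted & submitted
    false_positives = submitted - planted
    text = comment.lower()

    recall = len(correct) / len(planted)
    # +0.05 for each correct issue whose keywords appear in the comment.
    quality = 0.05 * sum(
        1 for tag in correct if any(word in text for word in KEYWORDS.get(tag, ()))
    )
    severity_bonus = 0.05 if severity == expected_severity else 0.0
    penalty = 0.10 * len(false_positives)

    # Assumption: the final score is clamped to [0.0, 1.0].
    return max(0.0, min(1.0, recall + quality + severity_bonus - penalty))


score = grade(
    planted=["sql_injection", "hardcoded_secret"],
    submitted=["sql_injection", "type_error"],  # one hit, one false positive
    comment="User input is concatenated into the SQL query.",
    severity="high",
    expected_severity="high",  # assumed expected level for this example
)
print(round(score, 2))  # 0.5
```

In this example the partial recall (0.5) plus both bonuses (0.05 + 0.05) is exactly offset by the single false positive (−0.10), which shows how guessing extra tags can cancel out well-written explanations.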

## Project Structure

```text
.
├── __init__.py              # Package exports
├── client.py                # WebSocket client for agent interaction
├── models.py                # Typed Pydantic models (Action, Observation, State)
├── inference.py             # Baseline inference script with LLM + rule fallback
├── openenv.yaml             # OpenEnv specification
├── pyproject.toml           # Project config with pytest setup
├── requirements.txt         # Pip dependencies
├── Dockerfile               # Production container with health check
├── conftest.py              # Pytest root configuration
├── README.md
├── scripts/
│   └── validate-submission.sh
├── server/
│   ├── __init__.py
│   ├── app.py               # FastAPI + Gradio dashboard
│   ├── code_review_env_environment.py  # Environment with iterative refinement
│   ├── graders.py            # Multi-dimensional deterministic grader
│   ├── tasks.py              # 5 task definitions with planted issues
│   ├── requirements.txt
│   └── Dockerfile
└── tests/
    ├── conftest.py
    ├── __init__.py
    ├── test_graders.py       # 19 grader tests
    └── test_environment.py   # 13 environment lifecycle tests
```

## Setup

```bash
uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt
```

## Running

### Start the server

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Run tests

```bash
uv run pytest tests/ -v
```

### Run baseline inference

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py
```

## Docker

```bash
docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv
```

## 🔌 API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List all tasks with schemas |
| `POST` | `/reset` | Reset environment for a task |
| `POST` | `/step` | Submit a review action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Score a review against a task |
| `POST` | `/baseline` | Run rule-based baseline |
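
A sketch of how a client might prepare requests for `/reset` and `/step` using only the standard library. The JSON body shapes are assumptions (in particular the `task_id` key and the `action` wrapper); the schemas returned by `GET /tasks` are the authoritative reference. The helper `build_post` is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port from the uvicorn command in this README


def build_post(path, payload):
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed request bodies; verify them against the running server.
reset_req = build_post("/reset", {"task_id": "task_expert"})
step_req = build_post(
    "/step",
    {
        "action": {
            "issues_found": ["path_traversal", "type_error"],
            "severity": "critical",
            "review_comment": "File path is joined without normalization.",
        }
    },
)

# With the server running, a request can be sent like this:
# with urllib.request.urlopen(reset_req) as resp:
#     observation = json.load(resp)
print(reset_req.get_method(), reset_req.full_url)
```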

## Validation

```bash
openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .
```

## 🏁 Submission Status

- All 5 OpenEnv validation checks passing
- 32/32 unit tests passing
- Docker build and deployment verified
- End-to-end inference and grading pipeline tested

---

## 🔗 Links

- GitHub: https://github.com/Dolphin-Syndrom/code-review-env
- Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env

## License

BSD-3-Clause