# 🔍 CodeReview OpenEnv
An **OpenEnv-compliant AI training environment** that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps — exactly as a senior engineer would in a real pull-request workflow.
---
## Why Code Review?
Code review is one of the highest-leverage tasks in software engineering. It is:
- **Real-world**: Every professional software team does it daily
- **Structured enough to grade**: Issues have objectively correct or incorrect assessments
- **Rich in partial signal**: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
- **Scalable in difficulty**: Easy (bugs only) → Hard (all categories + written summary)
This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.
---
## Environment Description
```
CodeReviewEnv
├── Task 1 — Easy   : Bug detection + Code style (calculator.py, 31 lines)
├── Task 2 — Medium : Security + Performance audit (user_service.py, 55 lines)
└── Task 3 — Hard   : Full review, all 5 categories (data_pipeline.py, 49 lines)
```
Each task presents a Python snippet containing intentional flaws. The agent submits `ReviewComment` objects across one or more steps, then finalises with `submit=True`. A deterministic grader scores the review against ground-truth issues.
---
## Observation Space
What the agent sees on each step:
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step (0-indexed) |
| `snippet.file_name` | `str` | Logical file name (e.g. `auth.py`) |
| `snippet.source` | `str` | Full Python source code |
| `instructions` | `str` | Review scope, difficulty, and guidance |
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
| `feedback` | `str \| None` | Env feedback on the last action |
| `done` | `bool` | Whether the episode has ended |
---
## Action Space
What the agent submits on each step:
```json
{
"comments": [
{
"line": 10,
"category": "security",
"severity": "critical",
"message": "SQL injection via string interpolation in query.",
"suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
}
],
"summary": "Overall review summary (required for task_3_hard)",
"submit": true
}
```
| Field | Type | Values |
|---|---|---|
| `comments[].line` | `int \| null` | 1-indexed line number; `null` for file-level |
| `comments[].category` | `enum` | `bug`, `security`, `performance`, `style`, `documentation` |
| `comments[].severity` | `enum` | `low`, `medium`, `high`, `critical` |
| `comments[].message` | `str` | 5–500 chars |
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
| `summary` | `str \| null` | Required for `task_3_hard`, optional otherwise |
| `submit` | `bool` | `true` finalises the review and triggers the grader |
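The constraints in the table can be checked client-side before sending an action. A minimal validator sketch (the env performs its own validation; this just mirrors the table):

```python
CATEGORIES = {"bug", "security", "performance", "style", "documentation"}
SEVERITIES = {"low", "medium", "high", "critical"}

def validate_comment(c: dict) -> list[str]:
    """Return a list of validation errors for one comment dict (empty if valid)."""
    errors = []
    line = c.get("line")
    if line is not None and (not isinstance(line, int) or line < 1):
        errors.append("line must be a positive 1-indexed int or null")
    if c.get("category") not in CATEGORIES:
        errors.append("unknown category")
    if c.get("severity") not in SEVERITIES:
        errors.append("unknown severity")
    if not (5 <= len(c.get("message", "")) <= 500):
        errors.append("message must be 5-500 chars")
    return errors
```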
---
## Reward Function
Rewards are shaped to provide signal over the **full trajectory**, not just on terminal submit.
### Per-step (incremental) rewards
| Event | Reward |
|---|---|
| New valid comment added | `+0.05` per comment (max `+0.15`) |
| Progress signal (grader score delta) | `+0.5 × Δscore` |
| Empty step (no new comments) | `−0.05` |
| Spam (> 2.5× expected comments) | `−0.10` |
### On `submit=True` (terminal)
```
submit_reward = score × 0.8 + (0.2 if score ≥ threshold else −0.2)
```
### Per-category penalties (applied to terminal grader score)
| Event | Penalty |
|---|---|
| False positive (fabricated issue) | `−0.08` to `−0.12` per comment |
| Missed CRITICAL security issue | `−0.15` to `−0.20` |
| Missed HIGH issue | `−0.08` to `−0.10` |
| No summary on task 3 | `−0.10` |
All rewards are clipped to `[−1.0, 1.0]`.
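The terminal formula translates directly to Python; how the per-step bonuses combine is our reading of the table (the actual implementation lives in the env and graders):

```python
def clip(r: float) -> float:
    """All rewards are clipped to [-1.0, 1.0]."""
    return max(-1.0, min(1.0, r))

def submit_reward(score: float, threshold: float) -> float:
    """Terminal reward on submit=True, per the formula above."""
    return clip(score * 0.8 + (0.2 if score >= threshold else -0.2))

def step_reward(new_comments: int, score_delta: float) -> float:
    """Per-step shaping: capped comment bonus plus progress signal,
    or a small penalty for an empty step (combination is an assumption)."""
    if new_comments == 0:
        return clip(-0.05)
    return clip(min(0.05 * new_comments, 0.15) + 0.5 * score_delta)
```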
---
## Task Descriptions
### Task 1 — Easy: Bug Detection & Style Review
**File**: `calculator.py` (31 lines) | **Max steps**: 5 | **Pass threshold**: 0.55
Covers basic utility functions: `divide`, `average`, `celsius_to_fahrenheit`, `find_max`, `count_words`.
**Ground-truth issues (6)**:
- `divide()` — no zero-division guard (HIGH bug)
- `average()` — crashes on empty list (HIGH bug)
- `celsius_to_fahrenheit()` — off-by-one (+31 vs +32) (MEDIUM bug)
- `find_max()` — crashes on empty list (MEDIUM bug)
- `for i in range(len(lst))` — unpythonic iteration (LOW style)
- Manual `Counter` reimplementation (LOW style)
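The first three bugs admit one-line fixes. A hedged sketch (function signatures are assumptions, since the README does not reproduce `calculator.py`):

```python
def divide(a: float, b: float) -> float:
    """Division with the missing zero guard added."""
    if b == 0:
        raise ValueError("b must be non-zero")
    return a / b

def average(values: list) -> float:
    """Average that no longer crashes on an empty list."""
    if not values:
        raise ValueError("average() of an empty list is undefined")
    return sum(values) / len(values)

def celsius_to_fahrenheit(c: float) -> float:
    """Correct conversion constant: +32, not the original +31."""
    return c * 9 / 5 + 32
```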
---
### Task 2 — Medium: Security & Performance Audit
**File**: `user_service.py` (55 lines) | **Max steps**: 7 | **Pass threshold**: 0.60
A SQLite-backed user management service with authentication.
**Ground-truth issues (6)**:
- SQL injection in `get_user()` — f-string query (CRITICAL security)
- MD5 password hashing in `create_user()` (CRITICAL security)
- SQL injection in `delete_user()` (CRITICAL security)
- MD5 reuse in `authenticate()` (HIGH security)
- `fetchall()` on unbounded table (HIGH performance)
- New DB connection per query, no pooling (MEDIUM performance)
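The two critical fixes follow standard patterns; a minimal sketch, assuming a `users` table (the real schema in `user_service.py` is not shown here):

```python
import hashlib
import sqlite3

def get_user(conn: sqlite3.Connection, username: str):
    # Placeholder binding instead of an f-string removes the injection vector
    cur = conn.execute("SELECT * FROM users WHERE username = ?", (username,))
    return cur.fetchone()

def hash_password(password: str, salt: bytes) -> bytes:
    # Salted PBKDF2 from the stdlib in place of unsalted MD5
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
```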
---
### Task 3 — Hard: Comprehensive Code Review
**File**: `data_pipeline.py` (49 lines) | **Max steps**: 10 | **Pass threshold**: 0.65
An analytics data pipeline with CSV loading, row transformation, caching, and stats.
**Ground-truth issues (13 across all 5 categories)**:
- `subprocess.run(shell=True)` with user input — OS command injection (CRITICAL security)
- `pickle.loads()` on arbitrary cache data — RCE risk (CRITICAL security)
- Pickling into module-level dict (HIGH security)
- `compute_stats()` ZeroDivisionError on empty data (HIGH bug)
- Missing `"value"` key — silent KeyError (MEDIUM bug)
- `open()` without encoding (MEDIUM bug)
- Two-pass iteration in `compute_stats` (MEDIUM performance)
- Subprocess per row instead of batching (MEDIUM performance)
- `str(stats)` instead of JSON export (LOW style)
- Module-level mutable global cache (LOW style)
- `load_data()` missing docstring (LOW documentation)
- `process_row()` missing docstring (LOW documentation)
- Insufficient module-level docstring (LOW documentation)
A **written summary** is required (`summary` field) — absence incurs a `−0.10` score penalty.
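The two critical issues have direct stdlib remedies; a sketch (`wc -l` is a hypothetical stand-in for whatever command the pipeline actually runs):

```python
import json
import subprocess

def count_lines(path: str) -> int:
    # Argument list + shell=False: `path` can no longer inject shell commands
    result = subprocess.run(["wc", "-l", path],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.split()[0])

_CACHE: dict = {}

def cache_put(key: str, value: dict) -> None:
    # JSON instead of pickle: loading untrusted JSON cannot execute code
    _CACHE[key] = json.dumps(value)

def cache_get(key: str) -> dict:
    return json.loads(_CACHE[key])
```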
---
## Expected Baseline Scores (gpt-4o)
| Task | Score | Pass? | Notes |
|---|---|---|---|
| `task_1_easy` | ~0.75 | ✅ | GPT-4o reliably spots ZeroDivisionError and off-by-one |
| `task_2_medium` | ~0.65 | ✅ | SQL injection found; MD5 usually flagged; perf issues partial |
| `task_3_hard` | ~0.55 | ❌ | Pickle RCE and shell injection found; docs often missed |
---
## Setup & Usage
### Option A — Docker (recommended)
```bash
# Build
docker build -t code-review-env .
# Run (port 7860)
docker run -p 7860:7860 code-review-env
# Test it
curl http://localhost:7860/health
```
### Option B — Local Python
```bash
# Install dependencies
pip install -r requirements.txt
# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Open docs
open http://localhost:7860/docs
```
### Run the test suite
```bash
pytest tests/ -v
# Expected: 25 passed
```
### Run the baseline agent
```bash
export OPENAI_API_KEY=sk-...
# All tasks (direct mode — no server needed)
python baseline_agent.py
# Single task
python baseline_agent.py --task task_2_medium
# Against a running HTTP server
python baseline_agent.py --mode http --base-url http://localhost:7860
```
---
## API Reference
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | HTML landing page |
| `/health` | GET | Health check |
| `/tasks` | GET | List all task specs |
| `/reset` | POST | Start or restart an episode |
| `/step` | POST | Submit an action |
| `/state` | GET | Get full serialisable state |
| `/docs` | GET | Interactive Swagger UI |
### Example: Full episode via curl
```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
-H 'Content-Type: application/json' \
-d '{"task_id": "task_1_easy", "session_id": "demo"}'
# 2. Step
curl -X POST http://localhost:7860/step \
-H 'Content-Type: application/json' \
-d '{
"session_id": "demo",
"action": {
"comments": [
{
"line": 2,
"category": "bug",
"severity": "high",
"message": "divide() will raise ZeroDivisionError when b is 0.",
"suggestion": "Guard with: if b == 0: raise ValueError"
}
],
"submit": true
}
}'
# 3. Check state
curl "http://localhost:7860/state?session_id=demo"
```
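The same flow can be scripted with only the standard library; a sketch that mirrors the curl example (the response shape is not asserted here, and `BASE_URL` is whatever your deployment uses):

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"

def build_step_payload(session_id: str, comments: list, submit: bool = True) -> dict:
    """Assemble a /step request body matching the curl example above."""
    return {"session_id": session_id,
            "action": {"comments": comments, "submit": submit}}

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
# post("/reset", {"task_id": "task_1_easy", "session_id": "demo"})
# post("/step", build_step_payload("demo", [{
#     "line": 2, "category": "bug", "severity": "high",
#     "message": "divide() will raise ZeroDivisionError when b is 0."}]))
```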
---
## Project Structure
```
openenv-code-review/
├── app.py               # FastAPI HTTP server
├── openenv.yaml         # OpenEnv spec metadata
├── Dockerfile           # Container definition
├── requirements.txt
├── baseline_agent.py    # gpt-4o baseline inference script
│
├── env/
│   ├── models.py        # Pydantic typed models (Observation, Action, Reward, …)
│   └── environment.py   # CodeReviewEnv — step() / reset() / state()
│
├── corpus/
│   └── snippets.py      # Python snippets with ground-truth issues
│
├── graders/
│   └── graders.py       # Task1Grader, Task2Grader, Task3Grader
│
└── tests/
    └── test_env.py      # 25-test pytest suite (all passing)
```
---
## Deploying to Hugging Face Spaces
1. Create a new Space with **Docker** SDK
2. Push this repository to the Space
3. Set `OPENAI_API_KEY` as a Space secret (only needed for baseline script)
4. The Space will auto-build and expose port 7860
```yaml
# README.md frontmatter for HF Spaces
---
title: CodeReview OpenEnv
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
- openenv
- code-review
- ai-agent
- evaluation
---
```
---
## License
MIT