title: CodeReview OpenEnv
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
CodeReview OpenEnv
An OpenEnv-compliant AI training environment that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps, as a senior engineer would in a real pull-request workflow.
Why Code Review?
Code review is one of the highest-leverage tasks in software engineering. It is:
- Real-world: Every professional software team does it daily
- Structured enough to grade: Issues have objectively correct or incorrect assessments
- Rich in partial signal: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
- Scalable in difficulty: Easy (bugs and style) → Hard (all five categories + written summary)
This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.
Environment Description
CodeReviewEnv
├── Task 1 - Easy   : Bug detection + Code style (calculator.py, 31 lines)
├── Task 2 - Medium : Security + Performance audit (user_service.py, 55 lines)
└── Task 3 - Hard   : Full review, all 5 categories (data_pipeline.py, 49 lines)
Each task presents a Python snippet containing intentional flaws. The agent submits ReviewComment objects across one or more steps, then finalises with submit=True. A deterministic grader scores the review against ground-truth issues.
Observation Space
What the agent sees on each step:
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step (0-indexed) |
| `snippet.file_name` | `str` | Logical file name (e.g. auth.py) |
| `snippet.source` | `str` | Full Python source code |
| `instructions` | `str` | Review scope, difficulty, and guidance |
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
| `feedback` | `str \| None` | Env feedback on the last action |
| `done` | `bool` | Whether the episode has ended |
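Because comments refer to 1-indexed line numbers, it can help to show the agent the snippet with those numbers attached. The sketch below assumes the observation arrives as a JSON object whose nesting mirrors the dotted field names in the table (`snippet.file_name` becomes `observation["snippet"]["file_name"]`). The helper name `render_numbered_source` and the example payload are hypothetical.

```python
from typing import Any


def render_numbered_source(observation: dict[str, Any]) -> str:
    """Prefix every source line with its 1-indexed line number."""
    snippet = observation["snippet"]
    lines = snippet["source"].splitlines()
    width = len(str(len(lines)))  # pad so the separators line up
    header = f"# {snippet['file_name']}"
    body = [f"{number:>{width}} | {text}" for number, text in enumerate(lines, start=1)]
    return "\n".join([header, *body])


# Hypothetical observation payload, trimmed to the fields listed in the table.
example_observation = {
    "task_id": "task_1_easy",
    "step": 0,
    "snippet": {
        "file_name": "calculator.py",
        "source": "def divide(a, b):\n    return a / b\n",
    },
    "instructions": "Review for bugs and style issues.",
    "previous_comments": [],
    "feedback": None,
    "done": False,
}

print(render_numbered_source(example_observation))
```

The numbers printed here are the values an agent would put in `comments[].line`.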
Action Space
What the agent submits on each step:
{
  "comments": [
    {
      "line": 10,
      "category": "security",
      "severity": "critical",
      "message": "SQL injection via string interpolation in query.",
      "suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
    }
  ],
  "summary": "Overall review summary (required for task_3_hard)",
  "submit": true
}
| Field | Type | Values |
|---|---|---|
| `comments[].line` | `int \| null` | 1-indexed line number; null for file-level |
| `comments[].category` | enum | bug, security, performance, style, documentation |
| `comments[].severity` | enum | low, medium, high, critical |
| `comments[].message` | `str` | 5–500 chars |
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
| `summary` | `str \| null` | Required for task_3_hard, optional otherwise |
| `submit` | `bool` | true finalises the review and triggers the grader |
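The constraints in the table can be checked on the client before an action is sent. The sketch below is a standard-library stand-in, not the project's Pydantic models in `env/models.py`. The class names follow the README, but the validation code and the `to_json` helper are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

CATEGORIES = {"bug", "security", "performance", "style", "documentation"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ReviewComment:
    line: Optional[int]  # 1-indexed; None means a file-level comment
    category: str
    severity: str
    message: str
    suggestion: Optional[str] = None

    def __post_init__(self) -> None:
        if self.line is not None and self.line < 1:
            raise ValueError("line is 1-indexed; use None for file-level comments")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not 5 <= len(self.message) <= 500:
            raise ValueError("message must be 5 to 500 characters long")


@dataclass
class Action:
    comments: list = field(default_factory=list)  # list of ReviewComment
    summary: Optional[str] = None
    submit: bool = False

    def to_json(self) -> str:
        """Serialise to the JSON shape shown in the action example."""
        return json.dumps(asdict(self))


comment = ReviewComment(
    line=10,
    category="security",
    severity="critical",
    message="SQL injection via string interpolation in query.",
)
print(Action(comments=[comment], submit=True).to_json())
```

Rejecting malformed comments locally avoids spending one of the task's limited steps on an action the environment would not accept.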
Reward Function
Rewards are shaped to provide signal over the full trajectory, not just on terminal submit.
Per-step (incremental) rewards
| Event | Reward |
|---|---|
| New valid comment added | +0.05 per comment (max +0.15) |
| Progress signal (grader score delta) | +0.5 × Δscore |
| Empty step (no new comments) | −0.05 |
| Spam (> 2.5× expected comments) | −0.10 |
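As a rough illustration of how the per-step rows might combine, the sketch below sums the four components and clips the result to [−1.0, 1.0]. The README does not state how the environment combines them, so the simple sum, the function name `step_reward`, and its parameters are assumptions.

```python
def step_reward(
    new_valid_comments: int,
    score_delta: float,
    total_comments: int,
    expected_comments: int,
) -> float:
    """Hypothetical per-step reward built from the table rows."""
    reward = 0.0
    if new_valid_comments == 0:
        reward -= 0.05  # empty step
    else:
        reward += min(0.05 * new_valid_comments, 0.15)  # capped comment bonus
    reward += 0.5 * score_delta  # progress signal
    if total_comments > 2.5 * expected_comments:
        reward -= 0.10  # spam penalty
    return max(-1.0, min(1.0, reward))  # clip to [-1.0, 1.0]


# Two new valid comments that raise the grader score by 0.2:
# 0.10 (comments) + 0.10 (progress), about 0.20 in total.
print(f"{step_reward(2, 0.2, 2, 6):.2f}")
```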
On submit=True (terminal)
submit_reward = score × 0.8 + (0.2 if score ≥ threshold else −0.2)
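The terminal formula can be worked through with two of the baseline figures quoted in this README. The function below is a direct transcription of the formula plus the stated clipping to [−1.0, 1.0]; the function signature is an assumption.

```python
def submit_reward(score: float, threshold: float) -> float:
    """Terminal reward: scaled score plus a pass bonus or fail penalty."""
    bonus = 0.2 if score >= threshold else -0.2
    return max(-1.0, min(1.0, score * 0.8 + bonus))


# Score 0.75 against the task 1 threshold of 0.55: 0.60 + 0.20
print(f"{submit_reward(0.75, 0.55):.2f}")  # 0.80
# Score 0.55 against the task 3 threshold of 0.65: 0.44 - 0.20
print(f"{submit_reward(0.55, 0.65):.2f}")  # 0.24
```

A score just above or below the threshold therefore moves the terminal reward by 0.4, which makes the pass threshold the dominant term near the boundary.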
Per-category penalties (applied to terminal grader score)
| Event | Penalty |
|---|---|
| False positive (fabricated issue) | −0.08 to −0.12 per comment |
| Missed CRITICAL security issue | −0.15 to −0.20 |
| Missed HIGH issue | −0.08 to −0.10 |
| No summary on task 3 | −0.10 |
All rewards are clipped to [−1.0, 1.0].
Task Descriptions
Task 1 - Easy: Bug Detection & Style Review
File: calculator.py (31 lines) | Max steps: 5 | Pass threshold: 0.55
Covers basic utility functions: divide, average, celsius_to_fahrenheit, find_max, count_words.
Ground-truth issues (6):
- `divide()`: no zero-division guard (HIGH bug)
- `average()`: crashes on empty list (HIGH bug)
- `celsius_to_fahrenheit`: off-by-one (+31 vs +32) (MEDIUM bug)
- `find_max()`: crashes on empty list (MEDIUM bug)
- `for i in range(len(lst))`: unpythonic iteration (LOW style)
- Manual `Counter` reimplementation (LOW style)
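For orientation, the sketch below shows what corrected versions of the five functions could look like. The actual `calculator.py` source is not reproduced in this README, so the signatures, error types, and return types here are assumptions. Only the function names and the listed flaws come from the task description.

```python
from collections import Counter


def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("b must be non-zero")  # explicit zero-division guard
    return a / b


def average(values: list) -> float:
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32  # offset is 32, not 31


def find_max(values: list) -> float:
    if not values:
        raise ValueError("values must not be empty")
    return max(values)  # no index-based loop needed


def count_words(text: str) -> Counter:
    return Counter(text.split())  # replaces a hand-rolled counting dict


print(celsius_to_fahrenheit(100))  # 212.0
print(count_words("to be or not to be")["to"])  # 2
```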
Task 2 - Medium: Security & Performance Audit
File: user_service.py (55 lines) | Max steps: 7 | Pass threshold: 0.60
A SQLite-backed user management service with authentication.
Ground-truth issues (6):
- SQL injection in `get_user()`: f-string query (CRITICAL security)
- MD5 password hashing in `create_user()` (CRITICAL security)
- SQL injection in `delete_user()` (CRITICAL security)
- MD5 reuse in `authenticate()` (HIGH security)
- `fetchall()` on unbounded table (HIGH performance)
- New DB connection per query, no pooling (MEDIUM performance)
Task 3 - Hard: Comprehensive Code Review
File: data_pipeline.py (49 lines) | Max steps: 10 | Pass threshold: 0.65
An analytics data pipeline with CSV loading, row transformation, caching, and stats.
Ground-truth issues (13 across all 5 categories):
- `subprocess.run(shell=True)` with user input: OS command injection (CRITICAL security)
- `pickle.loads()` on arbitrary cache data: RCE risk (CRITICAL security)
- Pickling into module-level dict (HIGH security)
- `compute_stats()` ZeroDivisionError on empty data (HIGH bug)
- Missing `"value"` key: silent KeyError (MEDIUM bug)
- `open()` without encoding (MEDIUM bug)
- Two-pass iteration in `compute_stats` (MEDIUM performance)
- Subprocess per row instead of batching (MEDIUM performance)
- `str(stats)` instead of JSON export (LOW style)
- Module-level mutable global cache (LOW style)
- `load_data()` missing docstring (LOW documentation)
- `process_row()` missing docstring (LOW documentation)
- Insufficient module-level docstring (LOW documentation)
A written summary is required (summary field); omitting it incurs a −0.10 score penalty.
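A few of the bug, performance, and style findings can be illustrated together. The sketch below is a hypothetical rewrite of `compute_stats` plus a JSON export helper; `data_pipeline.py` is not shown in this README, so the row shape, return keys, and the `export_stats` name are assumptions. It does not cover the subprocess or pickle findings.

```python
import json


def compute_stats(rows: list) -> dict:
    """Summarise the "value" field in a single pass over the rows.

    Rows without a "value" key are skipped explicitly, and empty input
    yields a mean of None instead of raising ZeroDivisionError.
    """
    count = 0
    total = 0.0
    for row in rows:
        value = row.get("value")
        if value is None:
            continue
        count += 1
        total += value
    mean = total / count if count else None
    return {"count": count, "total": total, "mean": mean}


def export_stats(stats: dict) -> str:
    """Serialise stats as JSON rather than relying on str(dict)."""
    return json.dumps(stats, sort_keys=True)


print(export_stats(compute_stats([{"value": 2}, {"value": 4}, {}])))
print(export_stats(compute_stats([])))
```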
Expected Baseline Scores (gpt-4o)
| Task | Score | Pass? | Notes |
|---|---|---|---|
| `task_1_easy` | ~0.75 | Yes | GPT-4o reliably spots ZeroDivisionError and off-by-one |
| `task_2_medium` | ~0.65 | Yes | SQL injection found; MD5 usually flagged; perf issues partial |
| `task_3_hard` | ~0.55 | No | Pickle RCE and shell injection found; docs often missed |
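The pass column follows from comparing each approximate baseline score with the task's pass threshold. The sketch below reproduces that comparison using only the figures quoted in this README; the dictionary names are illustrative.

```python
# Pass thresholds from the task descriptions.
PASS_THRESHOLDS = {"task_1_easy": 0.55, "task_2_medium": 0.60, "task_3_hard": 0.65}
# Approximate gpt-4o baseline scores from the table above.
BASELINE_SCORES = {"task_1_easy": 0.75, "task_2_medium": 0.65, "task_3_hard": 0.55}

verdicts = {}
for task_id, threshold in PASS_THRESHOLDS.items():
    score = BASELINE_SCORES[task_id]
    verdicts[task_id] = "pass" if score >= threshold else "fail"
    print(f"{task_id}: ~{score:.2f} vs threshold {threshold:.2f} -> {verdicts[task_id]}")
```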
Setup & Usage
Option A: Docker (recommended)
# Build
docker build -t code-review-env .
# Run (port 7860)
docker run -p 7860:7860 code-review-env
# Test it
curl http://localhost:7860/health
Option B: Local Python
# Install dependencies
pip install -r requirements.txt
# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Open docs
open http://localhost:7860/docs
Run the test suite
pytest tests/ -v
# Expected: 25 passed
Run the baseline agent
export OPENAI_API_KEY=sk-...
# All tasks (direct mode, no server needed)
python baseline_agent.py
# Single task
python baseline_agent.py --task task_2_medium
# Against a running HTTP server
python baseline_agent.py --mode http --base-url http://localhost:7860
API Reference
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | HTML landing page |
| `/health` | GET | Health check |
| `/tasks` | GET | List all task specs |
| `/reset` | POST | Start or restart an episode |
| `/step` | POST | Submit an action |
| `/state` | GET | Get full serialisable state |
| `/docs` | GET | Interactive Swagger UI |
Example: Full episode via curl
# 1. Reset
curl -X POST http://localhost:7860/reset \
  -H 'Content-Type: application/json' \
  -d '{"task_id": "task_1_easy", "session_id": "demo"}'
# 2. Step
curl -X POST http://localhost:7860/step \
  -H 'Content-Type: application/json' \
  -d '{
    "session_id": "demo",
    "action": {
      "comments": [
        {
          "line": 2,
          "category": "bug",
          "severity": "high",
          "message": "divide() will raise ZeroDivisionError when b is 0.",
          "suggestion": "Guard with: if b == 0: raise ValueError"
        }
      ],
      "submit": true
    }
  }'
# 3. Check state
curl "http://localhost:7860/state?session_id=demo"
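The same reset and step calls can be made from Python with only the standard library. The sketch below mirrors the request bodies from the curl example. The helper names are hypothetical, and the response is returned as parsed JSON without assuming its fields, since the response schema is not documented here. `run_episode` is defined but not called, because it needs a server listening on port 7860.

```python
import json
import urllib.request
from typing import Optional


def build_step_payload(
    session_id: str, comments: list, submit: bool, summary: Optional[str] = None
) -> dict:
    """Build the /step request body in the shape used by the curl example."""
    action = {"comments": comments, "submit": submit}
    if summary is not None:
        action["summary"] = summary
    return {"session_id": session_id, "action": action}


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_episode(base_url: str = "http://localhost:7860") -> dict:
    """Reset task_1_easy, submit one comment, and return the /step response."""
    post_json(f"{base_url}/reset", {"task_id": "task_1_easy", "session_id": "demo"})
    comment = {
        "line": 2,
        "category": "bug",
        "severity": "high",
        "message": "divide() will raise ZeroDivisionError when b is 0.",
        "suggestion": "Guard with: if b == 0: raise ValueError",
    }
    return post_json(
        f"{base_url}/step", build_step_payload("demo", [comment], submit=True)
    )
```

With the server running, `print(run_episode())` would show whatever the environment returns for the final step.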
Project Structure
openenv-code-review/
├── app.py                 # FastAPI HTTP server
├── openenv.yaml           # OpenEnv spec metadata
├── Dockerfile             # Container definition
├── requirements.txt
├── baseline_agent.py      # gpt-4o baseline inference script
│
├── env/
│   ├── models.py          # Pydantic typed models (Observation, Action, Reward, …)
│   └── environment.py     # CodeReviewEnv: step() / reset() / state()
│
├── corpus/
│   └── snippets.py        # Python snippets with ground-truth issues
│
├── graders/
│   └── graders.py         # Task1Grader, Task2Grader, Task3Grader
│
└── tests/
    └── test_env.py        # 25-test pytest suite (all passing)
Deploying to Hugging Face Spaces
- Create a new Space with Docker SDK
- Push this repository to the Space
- Set `OPENAI_API_KEY` as a Space secret (only needed for baseline script)
- The Space will auto-build and expose port 7860
# README.md frontmatter for HF Spaces
---
title: CodeReview OpenEnv
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
- openenv
- code-review
- ai-agent
- evaluation
---
License
MIT