# πŸ” CodeReview OpenEnv
An **OpenEnv-compliant AI training environment** that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps β€” exactly as a senior engineer would in a real pull-request workflow.
---
## Why Code Review?
Code review is one of the highest-leverage tasks in software engineering. It is:
- **Real-world**: Every professional software team does it daily
- **Structured enough to grade**: Issues have objectively correct or incorrect assessments
- **Rich in partial signal**: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
- **Scalable in difficulty**: Easy (bugs only) β†’ Hard (all categories + written summary)
This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.
---
## Environment Description
```
CodeReviewEnv
β”œβ”€β”€ Task 1 – Easy : Bug detection + Code style (calculator.py, 31 lines)
β”œβ”€β”€ Task 2 – Medium : Security + Performance audit (user_service.py, 55 lines)
└── Task 3 – Hard : Full review, all 5 categories (data_pipeline.py, 49 lines)
```
Each task presents a Python snippet containing intentional flaws. The agent submits `ReviewComment` objects across one or more steps, then finalises with `submit=True`. A deterministic grader scores the review against ground-truth issues.
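The grading logic itself lives in `graders/graders.py`; as a rough sketch of what deterministic matching against ground truth could look like (the function name, matching rule, and line tolerance below are illustrative assumptions, not the actual grader):

```python
# Illustrative sketch only: match submitted comments against ground-truth
# issues by category and nearby line number, returning a coverage fraction.
def match_comments(submitted, ground_truth, line_tolerance=2):
    """Fraction of ground-truth issues covered by the submitted comments."""
    matched = 0
    remaining = list(ground_truth)
    for c in submitted:
        for issue in remaining:
            same_category = c["category"] == issue["category"]
            # File-level comments (line=None) match on category alone.
            close_line = (
                c["line"] is None
                or issue["line"] is None
                or abs(c["line"] - issue["line"]) <= line_tolerance
            )
            if same_category and close_line:
                matched += 1
                remaining.remove(issue)  # each issue can be matched once
                break
    return matched / len(ground_truth) if ground_truth else 0.0
```

The real graders also weight by severity and penalise false positives, as described under "Reward Function" below.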
---
## Observation Space
What the agent sees on each step:
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step (0-indexed) |
| `snippet.file_name` | `str` | Logical file name (e.g. `auth.py`) |
| `snippet.source` | `str` | Full Python source code |
| `instructions` | `str` | Review scope, difficulty, and guidance |
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
| `feedback` | `str \| None` | Env feedback on the last action |
| `done` | `bool` | Whether the episode has ended |
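The actual typed models live in `env/models.py` as Pydantic classes; the shape of the observation can be sketched with plain dataclasses:

```python
# Sketch of the observation schema above (illustrative; the env's real
# models are Pydantic classes in env/models.py).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Snippet:
    file_name: str  # logical file name, e.g. "calculator.py"
    source: str     # full Python source under review

@dataclass
class Observation:
    task_id: str
    step: int                       # 0-indexed step counter
    snippet: Snippet
    instructions: str               # review scope, difficulty, guidance
    previous_comments: list = field(default_factory=list)
    feedback: Optional[str] = None  # env feedback on the last action
    done: bool = False
```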
---
## Action Space
What the agent submits on each step:
```json
{
"comments": [
{
"line": 10,
"category": "security",
"severity": "critical",
"message": "SQL injection via string interpolation in query.",
"suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
}
],
"summary": "Overall review summary (required for task_3_hard)",
"submit": true
}
```
| Field | Type | Values |
|---|---|---|
| `comments[].line` | `int \| null` | 1-indexed line number; `null` for file-level |
| `comments[].category` | `enum` | `bug`, `security`, `performance`, `style`, `documentation` |
| `comments[].severity` | `enum` | `low`, `medium`, `high`, `critical` |
| `comments[].message` | `str` | 5–500 chars |
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
| `summary` | `str \| null` | Required for `task_3_hard`, optional otherwise |
| `submit` | `bool` | `true` finalises the review and triggers the grader |
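Validation of these constraints is handled by the Pydantic models in `env/models.py`; a hand-rolled sketch of the same rules (the helper name `make_action` is hypothetical):

```python
# Illustrative builder that enforces the enum and length constraints from
# the table above before assembling an action payload.
def make_action(comments, summary=None, submit=False):
    """Assemble an action dict, validating each comment's fields."""
    categories = {"bug", "security", "performance", "style", "documentation"}
    severities = {"low", "medium", "high", "critical"}
    for c in comments:
        if c["category"] not in categories:
            raise ValueError(f"unknown category: {c['category']}")
        if c["severity"] not in severities:
            raise ValueError(f"unknown severity: {c['severity']}")
        if not 5 <= len(c["message"]) <= 500:
            raise ValueError("message must be 5-500 characters")
    return {"comments": comments, "summary": summary, "submit": submit}
```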
---
## Reward Function
Rewards are shaped to provide signal over the **full trajectory**, not just on terminal submit.
### Per-step (incremental) rewards
| Event | Reward |
|---|---|
| New valid comment added | `+0.05` per comment (max `+0.15`) |
| Progress signal (grader score delta) | `+0.5 Γ— Ξ”score` |
| Empty step (no new comments) | `βˆ’0.05` |
| Spam (> 2.5Γ— expected comments) | `βˆ’0.10` |
### On `submit=True` (terminal)
```
submit_reward = score Γ— 0.8 + (0.2 if score β‰₯ threshold else βˆ’0.2)
```
### Per-category penalties (applied to terminal grader score)
| Event | Penalty |
|---|---|
| False positive (fabricated issue) | `βˆ’0.08–0.12` per comment |
| Missed CRITICAL security issue | `βˆ’0.15–0.20` |
| Missed HIGH issue | `βˆ’0.08–0.10` |
| No summary on task 3 | `βˆ’0.10` |
All rewards are clipped to `[βˆ’1.0, 1.0]`.
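Putting the terminal formula and the clip together (a sketch of the arithmetic, not the environment's actual implementation):

```python
# Terminal reward sketch: 0.8 * grader score plus a pass/fail bonus,
# clipped to [-1, 1] as stated above.
def submit_reward(score, threshold):
    """Reward on submit=True for a grader score in [0, 1]."""
    bonus = 0.2 if score >= threshold else -0.2
    return max(-1.0, min(1.0, score * 0.8 + bonus))
```

For example, a grader score of 0.75 on task 1 (threshold 0.55) yields 0.75 × 0.8 + 0.2 = 0.8.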
---
## Task Descriptions
### Task 1 – Easy: Bug Detection & Style Review
**File**: `calculator.py` (31 lines) | **Max steps**: 5 | **Pass threshold**: 0.55
Covers basic utility functions: `divide`, `average`, `celsius_to_fahrenheit`, `find_max`, `count_words`.
**Ground-truth issues (6)**:
- `divide()` β€” no zero-division guard (HIGH bug)
- `average()` β€” crashes on empty list (HIGH bug)
- `celsius_to_fahrenheit` β€” off-by-one (+31 vs +32) (MEDIUM bug)
- `find_max()` β€” crashes on empty list (MEDIUM bug)
- `for i in range(len(lst))` β€” unpythonic iteration (LOW style)
- Manual `Counter` reimplementation (LOW style)
---
### Task 2 – Medium: Security & Performance Audit
**File**: `user_service.py` (55 lines) | **Max steps**: 7 | **Pass threshold**: 0.60
A SQLite-backed user management service with authentication.
**Ground-truth issues (6)**:
- SQL injection in `get_user()` β€” f-string query (CRITICAL security)
- MD5 password hashing in `create_user()` (CRITICAL security)
- SQL injection in `delete_user()` (CRITICAL security)
- MD5 reuse in `authenticate()` (HIGH security)
- `fetchall()` on unbounded table (HIGH performance)
- New DB connection per query, no pooling (MEDIUM performance)
---
### Task 3 – Hard: Comprehensive Code Review
**File**: `data_pipeline.py` (49 lines) | **Max steps**: 10 | **Pass threshold**: 0.65
An analytics data pipeline with CSV loading, row transformation, caching, and stats.
**Ground-truth issues (13 across all 5 categories)**:
- `subprocess.run(shell=True)` with user input β€” OS command injection (CRITICAL security)
- `pickle.loads()` on arbitrary cache data β€” RCE risk (CRITICAL security)
- Pickling into module-level dict (HIGH security)
- `compute_stats()` ZeroDivisionError on empty data (HIGH bug)
- Missing `"value"` key β†’ unhandled KeyError (MEDIUM bug)
- `open()` without encoding (MEDIUM bug)
- Two-pass iteration in `compute_stats` (MEDIUM performance)
- Subprocess per row instead of batching (MEDIUM performance)
- `str(stats)` instead of JSON export (LOW style)
- Module-level mutable global cache (LOW style)
- `load_data()` missing docstring (LOW documentation)
- `process_row()` missing docstring (LOW documentation)
- Insufficient module-level docstring (LOW documentation)
A **written summary** is required (`summary` field) β€” absence incurs a `βˆ’0.10` score penalty.
---
## Expected Baseline Scores (gpt-4o)
| Task | Score | Pass? | Notes |
|---|---|---|---|
| `task_1_easy` | ~0.75 | βœ… | GPT-4o reliably spots ZeroDivisionError and off-by-one |
| `task_2_medium` | ~0.65 | βœ… | SQL injection found; MD5 usually flagged; perf issues partial |
| `task_3_hard` | ~0.55 | ❌ | Pickle RCE and shell injection found; docs often missed (below 0.65 threshold) |
---
## Setup & Usage
### Option A β€” Docker (recommended)
```bash
# Build
docker build -t code-review-env .
# Run (port 7860)
docker run -p 7860:7860 code-review-env
# Test it
curl http://localhost:7860/health
```
### Option B β€” Local Python
```bash
# Install dependencies
pip install -r requirements.txt
# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Open docs
open http://localhost:7860/docs
```
### Run the test suite
```bash
pytest tests/ -v
# Expected: 25 passed
```
### Run the baseline agent
```bash
export OPENAI_API_KEY=sk-...
# All tasks (direct mode β€” no server needed)
python baseline_agent.py
# Single task
python baseline_agent.py --task task_2_medium
# Against a running HTTP server
python baseline_agent.py --mode http --base-url http://localhost:7860
```
---
## API Reference
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | HTML landing page |
| `/health` | GET | Health check |
| `/tasks` | GET | List all task specs |
| `/reset` | POST | Start or restart an episode |
| `/step` | POST | Submit an action |
| `/state` | GET | Get full serialisable state |
| `/docs` | GET | Interactive Swagger UI |
### Example: Full episode via curl
```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
-H 'Content-Type: application/json' \
-d '{"task_id": "task_1_easy", "session_id": "demo"}'
# 2. Step
curl -X POST http://localhost:7860/step \
-H 'Content-Type: application/json' \
-d '{
"session_id": "demo",
"action": {
"comments": [
{
"line": 2,
"category": "bug",
"severity": "high",
"message": "divide() will raise ZeroDivisionError when b is 0.",
"suggestion": "Guard with: if b == 0: raise ValueError"
}
],
"submit": true
}
}'
# 3. Check state
curl "http://localhost:7860/state?session_id=demo"
```
---
## Project Structure
```
openenv-code-review/
β”œβ”€β”€ app.py # FastAPI HTTP server
β”œβ”€β”€ openenv.yaml # OpenEnv spec metadata
β”œβ”€β”€ Dockerfile # Container definition
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ baseline_agent.py # gpt-4o baseline inference script
β”‚
β”œβ”€β”€ env/
β”‚ β”œβ”€β”€ models.py # Pydantic typed models (Observation, Action, Reward, …)
β”‚ └── environment.py # CodeReviewEnv β€” step() / reset() / state()
β”‚
β”œβ”€β”€ corpus/
β”‚ └── snippets.py # Python snippets with ground-truth issues
β”‚
β”œβ”€β”€ graders/
β”‚ └── graders.py # Task1Grader, Task2Grader, Task3Grader
β”‚
└── tests/
└── test_env.py # 25-test pytest suite (all passing)
```
---
## Deploying to Hugging Face Spaces
1. Create a new Space with **Docker** SDK
2. Push this repository to the Space
3. Set `OPENAI_API_KEY` as a Space secret (only needed for baseline script)
4. The Space will auto-build and expose port 7860
```yaml
# README.md frontmatter for HF Spaces
---
title: CodeReview OpenEnv
emoji: πŸ”
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
- openenv
- code-review
- ai-agent
- evaluation
---
```
---
## License
MIT