# 🔍 CodeReview OpenEnv

An **OpenEnv-compliant AI training environment** that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps, exactly as a senior engineer would in a real pull-request workflow.

---

## Why Code Review?

Code review is one of the highest-leverage tasks in software engineering. It is:

- **Real-world**: Every professional software team does it daily
- **Structured enough to grade**: Issues have objectively correct or incorrect assessments
- **Rich in partial signal**: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
- **Scalable in difficulty**: Easy (bugs only) → Hard (all categories + written summary)

This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.

---

## Environment Description

```
CodeReviewEnv
├── Task 1 – Easy   : Bug detection + Code style      (calculator.py, 31 lines)
├── Task 2 – Medium : Security + Performance audit    (user_service.py, 55 lines)
└── Task 3 – Hard   : Full review, all 5 categories   (data_pipeline.py, 49 lines)
```

Each task presents a Python snippet containing intentional flaws. The agent submits `ReviewComment` objects across one or more steps, then finalises with `submit=True`. A deterministic grader scores the review against ground-truth issues.

---

## Observation Space

What the agent sees on each step:

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step (0-indexed) |
| `snippet.file_name` | `str` | Logical file name (e.g. `auth.py`) |
| `snippet.source` | `str` | Full Python source code |
| `instructions` | `str` | Review scope, difficulty, and guidance |
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
| `feedback` | `str \| None` | Env feedback on the last action |
| `done` | `bool` | Whether the episode has ended |

---

## Action Space

What the agent submits on each step:

```json
{
  "comments": [
    {
      "line": 10,
      "category": "security",
      "severity": "critical",
      "message": "SQL injection via string interpolation in query.",
      "suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
    }
  ],
  "summary": "Overall review summary (required for task_3_hard)",
  "submit": true
}
```

| Field | Type | Values |
|---|---|---|
| `comments[].line` | `int \| null` | 1-indexed line number; `null` for file-level |
| `comments[].category` | `enum` | `bug`, `security`, `performance`, `style`, `documentation` |
| `comments[].severity` | `enum` | `low`, `medium`, `high`, `critical` |
| `comments[].message` | `str` | 5–500 chars |
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
| `summary` | `str \| null` | Required for `task_3_hard`, optional otherwise |
| `submit` | `bool` | `true` finalises the review and triggers the grader |

---

## Reward Function

Rewards are shaped to provide signal over the **full trajectory**, not just on terminal submit.
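As a rough orientation before the detailed tables in this section, the per-step and terminal reward terms might be combined as in the sketch below. This is a minimal illustration, not the environment's actual implementation: the function names `step_reward`, `submit_reward`, and `clip`, their parameters, and the assumption that the per-step terms simply add up are all hypothetical. Only the numeric constants come from this README.

```python
def clip(value, low=-1.0, high=1.0):
    """Clip a reward to the documented range [-1.0, 1.0]."""
    return max(low, min(high, value))


def step_reward(new_valid_comments, score_delta, total_comments, expected_comments):
    """Hypothetical per-step reward; additive combination is an assumption."""
    reward = 0.0
    if new_valid_comments == 0:
        # Empty step: no new comments were added.
        reward -= 0.05
    else:
        # +0.05 per new valid comment, capped at +0.15.
        reward += min(0.05 * new_valid_comments, 0.15)
    # Progress signal: half of the change in grader score.
    reward += 0.5 * score_delta
    # Spam penalty when the comment count exceeds 2.5x the expected count.
    if total_comments > 2.5 * expected_comments:
        reward -= 0.10
    return clip(reward)


def submit_reward(score, threshold):
    """Terminal reward: score * 0.8 plus or minus 0.2 depending on the pass threshold."""
    bonus = 0.2 if score >= threshold else -0.2
    return clip(score * 0.8 + bonus)
```

For example, `submit_reward(0.75, 0.55)` applies the pass bonus because the score clears the threshold, while `submit_reward(0.4, 0.55)` applies the penalty instead.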
### Per-step (incremental) rewards

| Event | Reward |
|---|---|
| New valid comment added | `+0.05` per comment (max `+0.15`) |
| Progress signal (grader score delta) | `+0.5 × Δscore` |
| Empty step (no new comments) | `−0.05` |
| Spam (> 2.5× expected comments) | `−0.10` |

### On `submit=True` (terminal)

```
submit_reward = score × 0.8 + (0.2 if score ≥ threshold else −0.2)
```

### Per-category penalties (applied to terminal grader score)

| Event | Penalty |
|---|---|
| False positive (fabricated issue) | `−0.08–0.12` per comment |
| Missed CRITICAL security issue | `−0.15–0.20` |
| Missed HIGH issue | `−0.08–0.10` |
| No summary on task 3 | `−0.10` |

All rewards are clipped to `[−1.0, 1.0]`.

---

## Task Descriptions

### Task 1 – Easy: Bug Detection & Style Review

**File**: `calculator.py` (31 lines) | **Max steps**: 5 | **Pass threshold**: 0.55

Covers basic utility functions: `divide`, `average`, `celsius_to_fahrenheit`, `find_max`, `count_words`.

**Ground-truth issues (6)**:
- `divide()`: no zero-division guard (HIGH bug)
- `average()`: crashes on empty list (HIGH bug)
- `celsius_to_fahrenheit`: off-by-one (+31 vs +32) (MEDIUM bug)
- `find_max()`: crashes on empty list (MEDIUM bug)
- `for i in range(len(lst))`: unpythonic iteration (LOW style)
- Manual `Counter` reimplementation (LOW style)

---

### Task 2 – Medium: Security & Performance Audit

**File**: `user_service.py` (55 lines) | **Max steps**: 7 | **Pass threshold**: 0.60

A SQLite-backed user management service with authentication.
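To illustrate the most severe flaw class this task targets (SQL injection through an f-string query), here is a hypothetical sketch contrasting a vulnerable lookup with a parameterised one. It is not the actual contents of `user_service.py`: the table schema, the in-memory database, and the names `get_user_unsafe` and `get_user_safe` are assumptions chosen for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory SQLite database with two example users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


def get_user_unsafe(conn, username):
    # Vulnerable: user input is interpolated directly into the SQL text.
    query = f"SELECT id, username FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()


def get_user_safe(conn, username):
    # Parameterised: the driver passes the value separately from the SQL text.
    query = "SELECT id, username FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()


conn = make_db()
payload = "x' OR '1'='1"

leaked = get_user_unsafe(conn, payload)  # the injected OR clause matches every row
safe = get_user_safe(conn, payload)      # the payload is treated as a literal username
```

A review comment for the unsafe variant would use category `security` and severity `critical`, with the parameterised form as its `suggestion`.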
**Ground-truth issues (6)**:
- SQL injection in `get_user()`: f-string query (CRITICAL security)
- MD5 password hashing in `create_user()` (CRITICAL security)
- SQL injection in `delete_user()` (CRITICAL security)
- MD5 reuse in `authenticate()` (HIGH security)
- `fetchall()` on unbounded table (HIGH performance)
- New DB connection per query, no pooling (MEDIUM performance)

---

### Task 3 – Hard: Comprehensive Code Review

**File**: `data_pipeline.py` (49 lines) | **Max steps**: 10 | **Pass threshold**: 0.65

An analytics data pipeline with CSV loading, row transformation, caching, and stats.

**Ground-truth issues (13 across all 5 categories)**:
- `subprocess.run(shell=True)` with user input: OS command injection (CRITICAL security)
- `pickle.loads()` on arbitrary cache data: RCE risk (CRITICAL security)
- Pickling into module-level dict (HIGH security)
- `compute_stats()` ZeroDivisionError on empty data (HIGH bug)
- Missing `"value"` key → silent KeyError (MEDIUM bug)
- `open()` without encoding (MEDIUM bug)
- Two-pass iteration in `compute_stats` (MEDIUM performance)
- Subprocess per row instead of batching (MEDIUM performance)
- `str(stats)` instead of JSON export (LOW style)
- Module-level mutable global cache (LOW style)
- `load_data()` missing docstring (LOW documentation)
- `process_row()` missing docstring (LOW documentation)
- Insufficient module-level docstring (LOW documentation)

A **written summary** is required (`summary` field); its absence incurs a `−0.10` score penalty.

---

## Expected Baseline Scores (gpt-4o)

| Task | Score | Pass? | Notes |
|---|---|---|---|
| `task_1_easy` | ~0.75 | ✅ | GPT-4o reliably spots ZeroDivisionError and off-by-one |
| `task_2_medium` | ~0.65 | ✅ | SQL injection found; MD5 usually flagged; perf issues partial |
| `task_3_hard` | ~0.55 | ❌ | Pickle RCE and shell injection found; docs often missed; below the 0.65 pass threshold |

---

## Setup & Usage

### Option A: Docker (recommended)

```bash
# Build
docker build -t code-review-env .

# Run (port 7860)
docker run -p 7860:7860 code-review-env

# Test it
curl http://localhost:7860/health
```

### Option B: Local Python

```bash
# Install dependencies
pip install -r requirements.txt

# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Open docs
open http://localhost:7860/docs
```

### Run the test suite

```bash
pytest tests/ -v
# Expected: 25 passed
```

### Run the baseline agent

```bash
export OPENAI_API_KEY=sk-...

# All tasks (direct mode, no server needed)
python baseline_agent.py

# Single task
python baseline_agent.py --task task_2_medium

# Against a running HTTP server
python baseline_agent.py --mode http --base-url http://localhost:7860
```

---

## API Reference

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | HTML landing page |
| `/health` | GET | Health check |
| `/tasks` | GET | List all task specs |
| `/reset` | POST | Start or restart an episode |
| `/step` | POST | Submit an action |
| `/state` | GET | Get full serialisable state |
| `/docs` | GET | Interactive Swagger UI |

### Example: Full episode via curl

```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
  -H 'Content-Type: application/json' \
  -d '{"task_id": "task_1_easy", "session_id": "demo"}'

# 2. Step
curl -X POST http://localhost:7860/step \
  -H 'Content-Type: application/json' \
  -d '{
    "session_id": "demo",
    "action": {
      "comments": [
        {
          "line": 2,
          "category": "bug",
          "severity": "high",
          "message": "divide() will raise ZeroDivisionError when b is 0.",
          "suggestion": "Guard with: if b == 0: raise ValueError"
        }
      ],
      "submit": true
    }
  }'

# 3. Check state
curl "http://localhost:7860/state?session_id=demo"
```

---

## Project Structure

```
openenv-code-review/
├── app.py                  # FastAPI HTTP server
├── openenv.yaml            # OpenEnv spec metadata
├── Dockerfile              # Container definition
├── requirements.txt
├── baseline_agent.py       # gpt-4o baseline inference script
│
├── env/
│   ├── models.py           # Pydantic typed models (Observation, Action, Reward, …)
│   └── environment.py      # CodeReviewEnv: step() / reset() / state()
│
├── corpus/
│   └── snippets.py         # Python snippets with ground-truth issues
│
├── graders/
│   └── graders.py          # Task1Grader, Task2Grader, Task3Grader
│
└── tests/
    └── test_env.py         # 25-test pytest suite (all passing)
```

---

## Deploying to Hugging Face Spaces

1. Create a new Space with **Docker** SDK
2. Push this repository to the Space
3. Set `OPENAI_API_KEY` as a Space secret (only needed for baseline script)
4. The Space will auto-build and expose port 7860

```yaml
# README.md frontmatter for HF Spaces
---
title: CodeReview OpenEnv
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
  - openenv
  - code-review
  - ai-agent
  - evaluation
---
```

---

## License

MIT