| title: Code Review Environment Server | |
| emoji: 🐳 | |
| colorFrom: green | |
| colorTo: gray | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| # Code Review Environment | |
| A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks, spanning missing imports, logic errors, and security vulnerabilities. | |
| --- | |
| ## Motivation | |
| Code review is a high-stakes, multi-step reasoning task that requires an agent to: | |
| - **Detect bugs and security vulnerabilities** from raw code diffs | |
| - **Generate corrective code** that resolves identified issues | |
| - **Make a final judgment** (approve or reject) backed by technical reasoning | |
| Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop (detection, remediation, and decision-making) in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in real software development pipelines. | |
| --- | |
| ## Setup and Usage | |
| ### Install dependencies | |
| ```bash | |
| pip install openenv-core | |
| ``` | |
| ```bash | |
| git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review | |
| uv pip install -e . | |
| ``` | |
| ### Run the server locally (optional) | |
| ```bash | |
| uv run server --host 0.0.0.0 --port 8000 | |
| ``` | |
| ### Run the agent | |
| ```bash | |
| uv run python inference.py | |
| ``` | |
| ### Environment variables | |
| Set the following before running: | |
| | Variable | Description | | |
| |----------|-------------| | |
| | `API_BASE_URL` | The API endpoint for the LLM (e.g. `https://router.huggingface.co/v1`) | | |
| | `MODEL_NAME` | The model identifier to use for inference | | |
| | `HF_TOKEN` | Your Hugging Face / API key | | |
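
A minimal sketch of how a script might read these three variables and fail early when one is missing. The `load_settings` helper and its error message are illustrative assumptions, not code taken from `inference.py`.

```python
import os


def load_settings(env=None):
    """Read the three required variables, raising if any is unset or empty.

    Hypothetical helper for illustration; inference.py may read them differently.
    """
    env = os.environ if env is None else env
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}


# Example values only; substitute your own endpoint, model, and key.
demo_env = {
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "HF_TOKEN": "hf_example_token",
}
print(load_settings(demo_env)["MODEL_NAME"])
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to exercise without touching the real environment.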
| ### Key constants in `inference.py` | |
| | Constant | Default | Description | | |
| |----------|---------|-------------| | |
| | `MAX_STEPS` | `3` | Steps per episode | | |
| | `NUM_EPISODES` | `16` | Number of PRs to review | | |
| | `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) | | |
| | `MAX_TOKENS` | `512` | Max tokens per LLM response | | |
| | `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum score to count as success | | |
| --- | |
| ## Environment Description | |
| The agent receives a pull request observation at each step and must respond with a structured JSON action. Each episode runs for up to `MAX_STEPS = 3` steps following a fixed workflow: | |
| | Step | Expected Action | Purpose | | |
| |------|----------------|---------| | |
| | 1 | `comment` | Identify all issues in the diff | | |
| | 2 | `suggest_fix` | Provide corrected code | | |
| | 3 | `final_decision` | Approve or reject the PR | | |
| Each step is independently scored. The final episode score is the maximum score achieved across all steps. | |
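
The episode-level score described above can be sketched as follows. The `summarize_episode` helper is a hypothetical name, and treating an empty reward list as a failed episode is an assumption.

```python
SUCCESS_SCORE_THRESHOLD = 0.1  # default listed for inference.py


def summarize_episode(rewards, threshold=SUCCESS_SCORE_THRESHOLD):
    """Return (score, success) for one episode's per-step rewards.

    The score is the maximum step reward; an empty list is treated as a
    failed episode (an assumption made for this sketch).
    """
    if not rewards:
        return 0.0, False
    score = max(rewards)
    return score, score >= threshold


score, success = summarize_episode([0.55, 0.72, 0.85])
print(f"score={score:.3f} success={str(success).lower()}")
# prints: score=0.850 success=true
```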
| The environment automatically selects a grader tier (`easy`, `medium`, or `hard`) based on the `task_type` field of each dataset sample. No manual configuration is needed: the grader switches per episode as `reset()` is called. | |
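
One plausible way to implement this per-episode selection is a lookup table keyed by `task_type`. The classes below are empty placeholders standing in for the real graders, and `select_grader` is a hypothetical function name.

```python
class EasyGrader:
    tier = "easy"


class MediumGrader:
    tier = "medium"


class HardGrader:
    tier = "hard"


# Placeholder classes only; the real implementations live in graders.py.
GRADERS = {"easy": EasyGrader, "medium": MediumGrader, "hard": HardGrader}


def select_grader(sample):
    """Instantiate the grader named by the sample's task_type field."""
    task_type = sample["task_type"]
    try:
        return GRADERS[task_type]()
    except KeyError:
        raise ValueError(f"Unknown task_type: {task_type!r}") from None


print(select_grader({"task_type": "medium"}).tier)
```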
| --- | |
| ## Action Space | |
| Actions must be returned as JSON with the following fields: | |
| ```json | |
| { | |
| "action_type": "comment | suggest_fix | final_decision", | |
| "comment": "Detailed description of identified issues (>30 characters)", | |
| "suggested_code": "Corrected code snippet, or null if not applicable", | |
| "decision": "approve | reject | null" | |
| } | |
| ``` | |
| | Field | Type | Required | Description | | |
| |-------|------|----------|-------------| | |
| | `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` | | |
| | `comment` | `str` | Recommended | Technical description of issues found | | |
| | `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff | | |
| | `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise | | |
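
A small validation sketch for the schema above, useful for catching malformed model output before it is sent to the environment. `parse_action` is a hypothetical helper; the environment's own validation may be stricter or looser.

```python
import json

ACTION_TYPES = {"comment", "suggest_fix", "final_decision"}
DECISIONS = {"approve", "reject"}


def parse_action(raw):
    """Parse a JSON action string and check it against the field table."""
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError("action_type must be one of " + ", ".join(sorted(ACTION_TYPES)))
    if action_type == "suggest_fix" and not action.get("suggested_code"):
        raise ValueError("suggest_fix requires suggested_code")
    if action_type == "final_decision" and action.get("decision") not in DECISIONS:
        raise ValueError("final_decision requires decision to be approve or reject")
    # Fill optional fields so downstream code can rely on all four keys.
    return {
        "action_type": action_type,
        "comment": action.get("comment", ""),
        "suggested_code": action.get("suggested_code"),
        "decision": action.get("decision"),
    }


print(parse_action('{"action_type": "final_decision", "decision": "reject"}'))
```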
| --- | |
| ## Observation Space | |
| Each step returns a `CodeReviewObservation` with the following fields: | |
| | Field | Type | Description | | |
| |-------|------|-------------| | |
| | `pr` | `CodeReviewPullRequest` | The pull request under review | | |
| | `pr.id` | `str` | Unique PR identifier | | |
| | `pr.title` | `str` | Short title of the PR | | |
| | `pr.description` | `str` | Brief description of intent | | |
| | `pr.language` | `str` | Programming language (e.g. `python`) | | |
| | `pr.diffs` | `List[CodeDiff]` | List of file diffs | | |
| | `pr.diffs[].file_name` | `str` | Name of the changed file | | |
| | `pr.diffs[].diff` | `str` | The actual code change | | |
| | `previous_comments` | `List[str]` | Comments made in prior steps | | |
| | `step_count` | `int` | Current step number | | |
| | `max_steps` | `int` | Maximum steps per episode (default: 3) | | |
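
The field table can be mirrored with plain dataclasses, as in the sketch below. These are stand-ins written for illustration: the real `CodeReviewObservation` may be defined differently, and `render_prompt` is a hypothetical helper showing one way to flatten an observation into prompt text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeDiff:
    file_name: str
    diff: str


@dataclass
class CodeReviewPullRequest:
    id: str
    title: str
    description: str
    language: str
    diffs: List[CodeDiff] = field(default_factory=list)


@dataclass
class CodeReviewObservation:
    pr: CodeReviewPullRequest
    previous_comments: List[str] = field(default_factory=list)
    step_count: int = 0
    max_steps: int = 3


def render_prompt(obs):
    """Flatten an observation into plain text suitable for an LLM prompt."""
    lines = [
        f"PR {obs.pr.id}: {obs.pr.title} ({obs.pr.language})",
        obs.pr.description,
        f"Step {obs.step_count} of {obs.max_steps}",
    ]
    for d in obs.pr.diffs:
        lines.append(f"--- {d.file_name}")
        lines.append(d.diff)
    for c in obs.previous_comments:
        lines.append(f"Previous comment: {c}")
    return "\n".join(lines)
```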
| --- | |
| ## Scoring | |
| ### Grader tiers | |
| The dataset contains three difficulty levels, each backed by a dedicated grader class in `graders.py`. The grader is selected automatically from `task_type` in the dataset sample. | |
| | Tier | Class | Issue matching | Wrong decision | Done scoring | | |
| |------|-------|---------------|---------------|--------------| | |
| | `easy` | `EasyGrader` | Substring match | 0.2 partial credit | Max over full history | | |
| | `medium` | `MediumGrader` | Token overlap + substring fallback | 0.1 partial credit | Recency-weighted max | | |
| | `hard` | `HardGrader` | Token overlap + sequence similarity (threshold 0.3) | No credit | Final step only | |
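
The three matching strategies named in the table can be sketched with the standard library. This is an approximation under stated assumptions, not the code in `graders.py`: the tokenizer, the use of `difflib.SequenceMatcher`, and taking the larger of the two signals in `hard_match` are all illustrative choices.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercased word tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def substring_match(issue, comment):
    """Easy tier: the issue phrase appears anywhere in the comment."""
    return issue.lower() in comment.lower()


def token_overlap(issue, comment):
    """Fraction of the issue's tokens that also appear in the comment."""
    issue_tokens = tokens(issue)
    if not issue_tokens:
        return 0.0
    return len(issue_tokens & tokens(comment)) / len(issue_tokens)


def sequence_similarity(issue, comment):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, issue.lower(), comment.lower()).ratio()


def hard_match(issue, comment, threshold=0.3):
    """Assumed combination: best of both signals, zeroed below the threshold."""
    score = max(token_overlap(issue, comment), sequence_similarity(issue, comment))
    return score if score >= threshold else 0.0


print(token_overlap("division by zero", "the code may divide by zero"))
```

In this sketch a paraphrase such as "divide by zero" still earns two thirds of the token-overlap credit, while the substring check would reject it outright.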
| ### Score components per tier | |
| | Component | Easy | Medium | Hard | | |
| |-----------|------|--------|------| | |
| | Issue detection | 40% | 42% | 45% | | |
| | Fix quality | 30% | 30% | 28% | | |
| | Decision accuracy | 30% | 28% | 27% | | |
| **Fix quality** is computed as a weighted combination of token overlap, sequence similarity, and (for medium/hard) line-level exact matching. **Issue detection** checks how many ground-truth issues appear in the agent's comment. All scores are clamped to `[0.01, 0.99]`. | |
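
The weighting and clamping described above amount to a weighted sum, sketched here with the percentages from the table. The `combine` and `clamp` names are hypothetical, and each component score is assumed to lie in `[0, 1]`.

```python
WEIGHTS = {
    "easy": {"issues": 0.40, "fix": 0.30, "decision": 0.30},
    "medium": {"issues": 0.42, "fix": 0.30, "decision": 0.28},
    "hard": {"issues": 0.45, "fix": 0.28, "decision": 0.27},
}


def clamp(value, low=0.01, high=0.99):
    """Keep a score inside the documented [0.01, 0.99] range."""
    return max(low, min(high, value))


def combine(tier, issues, fix, decision):
    """Weighted sum of the three component scores, then clamped."""
    w = WEIGHTS[tier]
    raw = w["issues"] * issues + w["fix"] * fix + w["decision"] * decision
    return clamp(raw)


# A perfect step cannot exceed 0.99, and an empty one cannot drop below 0.01.
print(combine("easy", 1.0, 1.0, 1.0), combine("hard", 0.0, 0.0, 0.0))
```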
| ### Bonuses and penalties | |
| | Condition | Easy | Medium | Hard | | |
| |-----------|------|--------|------| | |
| | Comment length > 30 chars | +0.15 | +0.10 | n/a | |
| | Correct decision at step 1 | +0.10 | +0.10 | +0.05 | |
| | Correct decision at step 2 | +0.10 | +0.05 | n/a | |
| | No comment on non-decision step | −0.05 | −0.08 | −0.12 | |
| | Step count > 3 | n/a | −0.04/step | −0.05 × (steps − 2) | |
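
A rough sketch of how the comment and step-count adjustments might be applied to a base score. The table leaves some details open, so several points here are assumptions: the medium penalty is read as 0.04 for each step beyond the third, an empty cell is treated as no adjustment, and the `adjust` and `step_penalty` names are hypothetical. The early-decision bonuses are omitted for brevity.

```python
COMMENT_BONUS = {"easy": 0.15, "medium": 0.10, "hard": 0.0}
MISSING_COMMENT_PENALTY = {"easy": 0.05, "medium": 0.08, "hard": 0.12}


def step_penalty(tier, step_count):
    """Penalty for running past three steps (one reading of the table)."""
    if step_count <= 3:
        return 0.0
    if tier == "medium":
        # Assumption: only the steps beyond the third are charged.
        return 0.04 * (step_count - 3)
    if tier == "hard":
        return 0.05 * (step_count - 2)
    return 0.0


def adjust(tier, base, comment, action_type, step_count):
    """Apply comment bonus or penalty and the step penalty, then clamp."""
    score = base
    if len(comment) > 30:
        score += COMMENT_BONUS[tier]
    elif not comment and action_type != "final_decision":
        score -= MISSING_COMMENT_PENALTY[tier]
    score -= step_penalty(tier, step_count)
    return max(0.01, min(0.99, score))
```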
| --- | |
| ## Task Descriptions | |
| ### Easy | |
| Straightforward single-file issues with an obvious fix. The `EasyGrader` uses simple substring matching: the agent gets full issue credit if the issue phrase appears anywhere in the comment. | |
| | PR | Issue | Expected Decision | | |
| |----|-------|------------------| | |
| | Missing import | `datetime` used without import | reject | | |
| **What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import line. | |
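
For illustration, a step-2 action for this task might look like the sketch below. The `current_timestamp` function is a hypothetical stand-in for the PR's code, since the dataset's actual diff is not reproduced here; only the import line comes from the task description.

```python
import json

# Hypothetical stand-in for the PR's code; the dataset's actual diff may differ.
FIXED_CODE = (
    "from datetime import datetime\n"
    "\n"
    "def current_timestamp():\n"
    "    return datetime.now().isoformat()\n"
)

action = {
    "action_type": "suggest_fix",
    "comment": "datetime is used without being imported; add the missing import.",
    "suggested_code": FIXED_CODE,
    "decision": None,
}

# Sanity check: the suggested snippet compiles and defines the function.
namespace = {}
exec(compile(FIXED_CODE, "<suggested_code>", "exec"), namespace)
print(json.dumps(action, indent=2))
```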
| --- | |
| ### Medium | |
| Logical or performance issues that require understanding of Python semantics. The `MediumGrader` uses token overlap so paraphrased descriptions still score well. | |
| | PR | Issue | Expected Decision | | |
| |----|-------|------------------| | |
| | Division function | No guard against division by zero | reject | | |
| | Inefficient loop | `range(len(arr))` indexing pattern; the loop can iterate over the elements directly | approve | |
| **What the agent must do:** For the division task, add an `if b == 0: return None` guard. For the loop task, recognise it as a style issue rather than a correctness bug, so the correct decision is **approve**. | |
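
Both medium tasks can be illustrated in a few lines. The function names are hypothetical and the loop body (a running total) is an assumed example; the guard itself is the one named in the task description.

```python
def divide(a, b):
    """Return a / b, or None when b is zero."""
    if b == 0:
        return None
    return a / b


def total_indexed(arr):
    # Index-based style flagged in the PR: correct, but not idiomatic.
    total = 0
    for i in range(len(arr)):
        total += arr[i]
    return total


def total_direct(arr):
    # Equivalent form that iterates over the elements directly.
    total = 0
    for value in arr:
        total += value
    return total


print(divide(6, 3), divide(1, 0), total_indexed([1, 2, 3]) == total_direct([1, 2, 3]))
# prints: 2.0 None True
```

The two loop functions return identical results, which is why the expected decision for that PR is approve rather than reject.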
| --- | |
| ### Hard | |
| Security vulnerabilities, injection attacks, and cross-file null-handling bugs. The `HardGrader` applies a minimum similarity threshold: vague or generic comments receive zero issue credit. | |
| | PR | Issue | Expected Decision | | |
| |----|-------|------------------| | |
| | Authentication logic | Hardcoded plaintext password `admin123` | reject | | |
| | SQL query | String concatenation exposes SQL injection | reject | | |
| | Cross-file null bug | `get_user(None)` called without input validation | reject | | |
| **What the agent must do:** | |
| - **Auth:** Detect the hardcoded secret and propose `bcrypt`-based password comparison. | |
| - **SQL:** Detect string concatenation and replace it with a parameterised query that passes values to `cursor.execute` through `%s` placeholders. | |
| - **Null bug:** Validate `id is not None` before the `db[id]` lookup and fix the call site in `controller.py`. | |
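
The SQL and null-handling fixes can be sketched with the standard library. Two caveats: the example uses `sqlite3`, whose placeholder is `?`, whereas drivers such as psycopg2 use the `%s` style mentioned above; and the table, column, and function names are assumptions made for the demo. The `bcrypt` fix is not shown because it needs a third-party package.

```python
import sqlite3


def find_user_by_name(conn, name):
    """Parameterised lookup: the driver binds `name`, so it is never spliced into the SQL."""
    cursor = conn.cursor()
    # sqlite3 uses "?" placeholders; psycopg2-style drivers use "%s" instead.
    cursor.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


def get_user(db, user_id):
    """Validate the id before the dictionary lookup."""
    if user_id is None:
        raise ValueError("user_id must not be None")
    return db[user_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

print(find_user_by_name(conn, "alice"))             # [(1, 'alice')]
print(find_user_by_name(conn, "alice' OR '1'='1"))  # []
```

The second query returns nothing because the injection string is compared as a literal name instead of being interpreted as SQL.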
| --- | |
| ## Baseline Scores | |
| Expected performance ranges by model capability: | |
| | Score Range | Interpretation | | |
| |-------------|---------------| | |
| | 0.00 – 0.20 | Failing: agent cannot follow the JSON schema or identify basic issues | |
| | 0.20 – 0.50 | Partial: agent detects some issues but misses security vulnerabilities or gives wrong decisions | |
| | 0.50 – 0.75 | Competent: agent handles easy and medium tasks; struggles with hard security/null cases | |
| | 0.75 – 1.00 | Strong: agent reliably detects all issue types, generates correct fixes, and makes sound decisions | |
| ### Step-level log format | |
| ``` | |
| [START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct | |
| [STEP] step=1 action=comment reward=0.55 done=false error=null | |
| [STEP] step=2 action=suggest_fix reward=0.72 done=false error=null | |
| [STEP] step=3 action=final_decision reward=0.85 done=true error=null | |
| [END] success=true steps=3 score=0.850 rewards=0.55,0.72,0.85 | |
| ``` | |
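
Logs in this format are easy to post-process. The parser below is a sketch written against the sample above; the regular expression and the `parse_steps` name are assumptions, and real logs may contain fields or values this pattern does not cover.

```python
import re

STEP_RE = re.compile(
    r"\[STEP\] step=(\d+) action=(\w+) reward=([\d.]+) done=(true|false) error=(\S+)"
)


def parse_steps(log_text):
    """Extract one dict per [STEP] line."""
    steps = []
    for m in STEP_RE.finditer(log_text):
        steps.append({
            "step": int(m.group(1)),
            "action": m.group(2),
            "reward": float(m.group(3)),
            "done": m.group(4) == "true",
            "error": None if m.group(5) == "null" else m.group(5),
        })
    return steps


LOG = """\
[START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action=comment reward=0.55 done=false error=null
[STEP] step=2 action=suggest_fix reward=0.72 done=false error=null
[STEP] step=3 action=final_decision reward=0.85 done=true error=null
[END] success=true steps=3 score=0.850 rewards=0.55,0.72,0.85
"""

steps = parse_steps(LOG)
print(max(s["reward"] for s in steps))  # 0.85, matching score=0.850 on the [END] line
```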
| --- | |
| ## Conclusion | |
| The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps (issue detection, fix generation, and final judgment) and by scaling difficulty through dedicated grader tiers, it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms. |