---
title: Code Review Environment Server
emoji: 🎳
colorFrom: green
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Code Review Environment

A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks — spanning missing imports, logic errors, and security vulnerabilities.

---

## Motivation

Code review is a high-stakes, multi-step reasoning task that requires an agent to:

- **Detect bugs and security vulnerabilities** from raw code diffs
- **Generate corrective code** that resolves identified issues
- **Make a final judgment** (approve or reject) backed by technical reasoning

Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in real software development pipelines.

---

## Setup and Usage

### Install dependencies

```bash
pip install openenv-core
```

```bash
git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
uv pip install -e .
```

### Run the server locally (optional)

```bash
uv run server --host 0.0.0.0 --port 8000
```

### Run the agent

```bash
uv run python inference.py
```

### Environment variables

Set the following before running:

| Variable | Description |
|----------|-------------|
| `API_BASE_URL` | The API endpoint for the LLM (e.g. `https://router.huggingface.co/v1`) |
| `MODEL_NAME` | The model identifier to use for inference |
| `HF_TOKEN` | Your Hugging Face / API key |
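
As a rough illustration of how a script might consume these variables, the sketch below reads all three and fails early if any is unset. The `load_settings` helper is hypothetical and is not taken from `inference.py`; only the variable names come from the table above.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: collect the documented variables or fail early."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.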

### Key constants in `inference.py`

| Constant | Default | Description |
|----------|---------|-------------|
| `MAX_STEPS` | `3` | Steps per episode |
| `NUM_EPISODES` | `16` | Number of PRs to review |
| `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
| `MAX_TOKENS` | `512` | Max tokens per LLM response |
| `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum score to count as success |

---

## Environment Description

The agent receives a pull request observation at each step and must respond with a structured JSON action. Each episode runs for up to `MAX_STEPS = 3` steps following a fixed workflow:

| Step | Expected Action | Purpose |
|------|----------------|---------|
| 1 | `comment` | Identify all issues in the diff |
| 2 | `suggest_fix` | Provide corrected code |
| 3 | `final_decision` | Approve or reject the PR |

Each step is independently scored. The final episode score is the maximum score achieved across all steps.

The environment automatically selects a grader tier (`easy`, `medium`, or `hard`) based on the `task_type` field of each dataset sample. No manual configuration is needed — the grader switches per episode as `reset()` is called.

---

## Action Space

Actions must be returned as JSON with the following fields:

```json
{
  "action_type": "comment | suggest_fix | final_decision",
  "comment": "Detailed description of identified issues (>30 characters)",
  "suggested_code": "Corrected code snippet, or null if not applicable",
  "decision": "approve | reject | null"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
| `comment` | `str` | Recommended | Technical description of issues found |
| `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
| `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
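
A minimal sketch of client-side validation for this schema follows. The `validate_action` function is hypothetical and not part of the environment; the checks only mirror the field table above, and the environment itself may be stricter or more lenient.

```python
import json

VALID_ACTION_TYPES = {"comment", "suggest_fix", "final_decision"}
VALID_DECISIONS = {"approve", "reject", None}


def validate_action(raw: str) -> dict:
    """Parse a JSON action and check it against the documented fields."""
    action = json.loads(raw)
    action_type = action.get("action_type")
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    decision = action.get("decision")
    if decision not in VALID_DECISIONS:
        raise ValueError(f"decision must be approve, reject, or null: {decision!r}")
    # Step 3 needs an explicit decision; step 2 needs replacement code.
    if action_type == "final_decision" and decision is None:
        raise ValueError("final_decision requires a decision")
    if action_type == "suggest_fix" and not action.get("suggested_code"):
        raise ValueError("suggest_fix requires suggested_code")
    return action
```

Validating before sending lets an agent loop retry a malformed LLM response instead of spending a step on it.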

---

## Observation Space

Each step returns a `CodeReviewObservation` with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `pr` | `CodeReviewPullRequest` | The pull request under review |
| `pr.id` | `str` | Unique PR identifier |
| `pr.title` | `str` | Short title of the PR |
| `pr.description` | `str` | Brief description of intent |
| `pr.language` | `str` | Programming language (e.g. `python`) |
| `pr.diffs` | `List[CodeDiff]` | List of file diffs |
| `pr.diffs[].file_name` | `str` | Name of the changed file |
| `pr.diffs[].diff` | `str` | The actual code change |
| `previous_comments` | `List[str]` | Comments made in prior steps |
| `step_count` | `int` | Current step number |
| `max_steps` | `int` | Maximum steps per episode (default: 3) |
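
The field layout above can be mirrored with plain dataclasses, as in the sketch below. These are stand-ins written for illustration: the class and field names follow the table, but the real definitions in the repository may use a different base class or extra fields, and `render_prompt` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeDiff:
    file_name: str
    diff: str


@dataclass
class CodeReviewPullRequest:
    id: str
    title: str
    description: str
    language: str
    diffs: List[CodeDiff] = field(default_factory=list)


@dataclass
class CodeReviewObservation:
    pr: CodeReviewPullRequest
    previous_comments: List[str] = field(default_factory=list)
    step_count: int = 0
    max_steps: int = 3


def render_prompt(obs: CodeReviewObservation) -> str:
    """Hypothetical helper: flatten an observation into prompt text."""
    lines = [
        f"PR {obs.pr.id}: {obs.pr.title} ({obs.pr.language})",
        obs.pr.description,
        f"Step {obs.step_count} of {obs.max_steps}",
    ]
    for d in obs.pr.diffs:
        lines.append(f"--- {d.file_name}")
        lines.append(d.diff)
    for c in obs.previous_comments:
        lines.append(f"Previous comment: {c}")
    return "\n".join(lines)
```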

---

## Scoring

### Grader tiers

The dataset contains three difficulty levels, each backed by a dedicated grader class in `graders.py`. The grader is selected automatically from `task_type` in the dataset sample.

| Tier | Class | Issue matching | Wrong decision | Done scoring |
|------|-------|---------------|---------------|--------------|
| `easy` | `EasyGrader` | Substring match | 0.2 partial credit | Max over full history |
| `medium` | `MediumGrader` | Token overlap + substring fallback | 0.1 partial credit | Recency-weighted max |
| `hard` | `HardGrader` | Token overlap + sequence similarity (threshold 0.3) | No credit | Final step only |
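
The three matching strategies named in the table could look roughly like the sketch below. The tokenisation rule and the use of `difflib.SequenceMatcher` are assumptions made for illustration; `graders.py` may compute these measures differently.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> set:
    """Lowercased word tokens (assumed tokenisation rule)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def substring_match(issue: str, comment: str) -> bool:
    """Easy-tier style: the issue phrase appears verbatim in the comment."""
    return issue.lower() in comment.lower()


def token_overlap(issue: str, comment: str) -> float:
    """Fraction of issue tokens that also occur in the comment."""
    issue_tokens = tokens(issue)
    if not issue_tokens:
        return 0.0
    return len(issue_tokens & tokens(comment)) / len(issue_tokens)


def sequence_similarity(issue: str, comment: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, issue.lower(), comment.lower()).ratio()
```

Token overlap tolerates reordering and paraphrase better than substring matching, which is consistent with the medium tier rewarding paraphrased descriptions.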

### Score components per tier

| Component | Easy | Medium | Hard |
|-----------|------|--------|------|
| Issue detection | 40% | 42% | 45% |
| Fix quality | 30% | 30% | 28% |
| Decision accuracy | 30% | 28% | 27% |

**Fix quality** is computed as a weighted combination of token overlap, sequence similarity, and (for medium/hard) line-level exact matching. **Issue detection** checks how many ground-truth issues appear in the agent's comment. All scores are clamped to `[0.01, 0.99]`.
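
To make the weighting concrete, the sketch below combines three component scores with the per-tier percentages from the table and clamps the result. The function names are hypothetical, and adding bonuses or penalties before clamping is an assumption rather than documented grader behaviour.

```python
# (issue detection, fix quality, decision accuracy) weights per tier
WEIGHTS = {
    "easy": (0.40, 0.30, 0.30),
    "medium": (0.42, 0.30, 0.28),
    "hard": (0.45, 0.28, 0.27),
}


def clamp(value: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a score inside the documented [0.01, 0.99] range."""
    return max(lo, min(hi, value))


def combine(tier: str, issue: float, fix: float, decision: float,
            adjustment: float = 0.0) -> float:
    """Weighted sum of component scores plus an optional bonus or penalty."""
    w_issue, w_fix, w_decision = WEIGHTS[tier]
    raw = w_issue * issue + w_fix * fix + w_decision * decision
    return clamp(raw + adjustment)
```

Under these assumptions a perfect answer on any tier scores 0.99 rather than 1.0, and an empty answer scores 0.01 rather than 0.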

### Bonuses and penalties

| Condition | Easy | Medium | Hard |
|-----------|------|--------|------|
| Comment length > 30 chars | +0.15 | +0.10 | — |
| Correct decision at step 1 | +0.10 | +0.10 | +0.05 |
| Correct decision at step 2 | +0.10 | +0.05 | — |
| No comment on non-decision step | −0.05 | −0.08 | −0.12 |
| Step count > 3 | — | −0.04/step | −0.05 × (steps − 2) |

---

## Task Descriptions

### Easy

Straightforward single-file issues with an obvious fix. The `EasyGrader` uses simple substring matching — the agent gets full issue credit if the issue phrase appears anywhere in the comment.

| PR | Issue | Expected Decision |
|----|-------|------------------|
| Missing import | `datetime` used without import | reject |

**What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import line.

---

### Medium

Logical or performance issues that require understanding of Python semantics. The `MediumGrader` uses token overlap so paraphrased descriptions still score well.

| PR | Issue | Expected Decision |
|----|-------|------------------|
| Division function | No guard against division by zero | reject |
| Inefficient loop | `range(len(arr))` indexing where direct iteration (`for x in arr`) would do | approve |

**What the agent must do:** For the division task, add an `if b == 0: return None` guard. For the loop task, recognise the pattern as a style issue rather than a correctness bug, so the correct decision is **approve**.
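
The two medium fixes can be sketched as follows. The dataset's actual diffs are not reproduced here, so the function names and bodies are hypothetical; only the `if b == 0: return None` guard and the `range(len(arr))` pattern come from the task description.

```python
from typing import List, Optional


def divide(a: float, b: float) -> Optional[float]:
    """Division with the guard the reviewer is expected to suggest."""
    if b == 0:
        return None
    return a / b


def total_indexed(arr: List[float]) -> float:
    """Index-based loop: works correctly, but is not idiomatic."""
    total = 0.0
    for i in range(len(arr)):
        total += arr[i]
    return total


def total_direct(arr: List[float]) -> float:
    """Direct iteration: same result, clearer intent."""
    total = 0.0
    for value in arr:
        total += value
    return total
```

Both loop variants return the same value, which is why the second task is a style remark and not grounds for rejection.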

---

### Hard

Security vulnerabilities, injection attacks, and cross-file null-handling bugs. The `HardGrader` applies a minimum similarity threshold: vague or generic comments receive zero issue credit.

| PR | Issue | Expected Decision |
|----|-------|------------------|
| Authentication logic | Hardcoded plaintext password `admin123` | reject |
| SQL query | String concatenation exposes SQL injection | reject |
| Cross-file null bug | `get_user(None)` called without input validation | reject |

**What the agent must do:**
- **Auth:** Detect the hardcoded secret and propose `bcrypt`-based password comparison.
- **SQL:** Detect string concatenation and replace it with a parameterised query that passes values to `cursor.execute` through `%s` placeholders.
- **Null bug:** Validate `id is not None` before the `db[id]` lookup and fix the call site in `controller.py`.
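
The SQL and null-handling fixes above can be sketched as below. This example uses the standard-library `sqlite3` module so it runs without a database server; `sqlite3` uses `?` placeholders, whereas drivers such as psycopg2 use the `%s` style mentioned above. The table layout and function names are hypothetical.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str):
    """Parameterised lookup: the driver binds the value, no string concatenation."""
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cur.fetchone()


def get_user(db: dict, user_id):
    """Validate the identifier before the lookup instead of indexing with None."""
    if user_id is None:
        raise ValueError("user_id must not be None")
    return db[user_id]


# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))
```

Because the input is bound as a value, a payload such as `alice' OR '1'='1` is compared literally against `username` and matches no row.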

---

## Baseline Scores

Expected performance ranges by model capability:

| Score Range | Interpretation |
|-------------|---------------|
| 0.00 – 0.20 | Failing — agent cannot follow the JSON schema or identify basic issues |
| 0.20 – 0.50 | Partial — agent detects some issues but misses security vulnerabilities or gives wrong decisions |
| 0.50 – 0.75 | Competent — agent handles easy and medium tasks; struggles with hard security/null cases |
| 0.75 – 1.00 | Strong — agent reliably detects all issue types, generates correct fixes, and makes sound decisions |

### Step-level log format

```
[START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=comment        reward=0.55 done=false error=null
[STEP]  step=2 action=suggest_fix    reward=0.72 done=false error=null
[STEP]  step=3 action=final_decision reward=0.85 done=true  error=null
[END]   success=true steps=3 score=0.850 rewards=0.55,0.72,0.85
```
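
For post-processing runs, the `[STEP]` lines shown above can be parsed with a regular expression. The sketch below assumes the exact `key=value` layout of the sample log; `parse_steps` is a hypothetical helper, not part of the repository.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(\d+)\s+action=(\w+)\s+reward=([0-9.]+)"
    r"\s+done=(true|false)\s+error=(\S+)$"
)


def parse_steps(log: str) -> list:
    """Extract one dictionary per [STEP] line; other lines are ignored."""
    steps = []
    for line in log.splitlines():
        match = STEP_RE.match(line.strip())
        if match is None:
            continue
        error = match.group(5)
        steps.append({
            "step": int(match.group(1)),
            "action": match.group(2),
            "reward": float(match.group(3)),
            "done": match.group(4) == "true",
            "error": None if error == "null" else error,
        })
    return steps
```

Taking `max(s["reward"] for s in parse_steps(log))` on the sample log gives 0.85, matching the `score=0.850` reported on its `[END]` line.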

---

## Conclusion

The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps — issue detection, fix generation, and final judgment — and by scaling difficulty through dedicated grader tiers, it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.