
title: Code Review Environment Server
emoji: 🎳
colorFrom: green
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Code Review Environment

A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks — spanning missing imports, logic errors, and security vulnerabilities.

Motivation

Code review is a high-stakes, multi-step reasoning task that requires an agent to:

Detect bugs and security vulnerabilities from raw code diffs
Generate corrective code that resolves identified issues
Make a final judgment (approve or reject) backed by technical reasoning

Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in real software development pipelines.

Setup and Usage

Install dependencies

pip install openenv-core

git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
uv pip install -e .

Run the server locally (optional)

uv run server --host 0.0.0.0 --port 8000

Run the agent

uv run python inference.py

Environment variables

Set the following before running:

| Variable | Description |
| --- | --- |
| `API_BASE_URL` | The API endpoint for the LLM (e.g. `https://router.huggingface.co/v1`) |
| `MODEL_NAME` | The model identifier to use for inference |
| `HF_TOKEN` | Your Hugging Face / API key |
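
As a minimal sketch of how a script such as `inference.py` might read these variables: the helper name `load_llm_config` and the fail-fast behaviour are assumptions for illustration, not taken from the repository.

```python
import os


def load_llm_config(env=None):
    """Collect the three variables listed above.

    Hypothetical helper: the real inference.py may read them differently.
    """
    env = os.environ if env is None else env
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    # Fail early with a clear message instead of erroring mid-episode.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Calling `load_llm_config()` with no argument reads from the process environment; passing a plain dict makes the helper easy to test.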

Key constants in `inference.py`

| Constant | Default | Description |
| --- | --- | --- |
| `MAX_STEPS` | `3` | Steps per episode |
| `NUM_EPISODES` | `16` | Number of PRs to review |
| `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
| `MAX_TOKENS` | `512` | Max tokens per LLM response |
| `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum score to count as success |

Environment Description

The agent receives a pull request observation at each step and must respond with a structured JSON action. Each episode runs for up to MAX_STEPS = 3 steps following a fixed workflow:

| Step | Expected Action | Purpose |
| --- | --- | --- |
| 1 | `comment` | Identify all issues in the diff |
| 2 | `suggest_fix` | Provide corrected code |
| 3 | `final_decision` | Approve or reject the PR |

Each step is independently scored. The final episode score is the maximum score achieved across all steps.
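
The workflow and the episode score can be sketched as follows. The function names are hypothetical, and both the success rule (episode score at or above `SUCCESS_SCORE_THRESHOLD`) and the `0.0` returned for an empty episode are assumptions rather than confirmed behaviour.

```python
# Fixed workflow: step number -> expected action type.
EXPECTED_ACTIONS = {1: "comment", 2: "suggest_fix", 3: "final_decision"}

SUCCESS_SCORE_THRESHOLD = 0.1  # default listed under "Key constants"


def episode_score(step_rewards):
    """Final episode score: the maximum reward seen across all steps."""
    # Returning 0.0 for an empty episode is an assumption.
    return max(step_rewards) if step_rewards else 0.0


def is_success(step_rewards, threshold=SUCCESS_SCORE_THRESHOLD):
    """Assumed success rule: the episode score reaches the threshold."""
    return episode_score(step_rewards) >= threshold
```

For step rewards of `[0.55, 0.72, 0.85]`, `episode_score` returns `0.85`.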

The environment automatically selects a grader tier (easy, medium, or hard) based on the task_type field of each dataset sample. No manual configuration is needed — the grader switches per episode as reset() is called.
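
One plausible way to implement this per-episode selection is a lookup table keyed on `task_type`. The sketch below uses empty placeholder classes in place of the real ones, and `select_grader` is a hypothetical name; the actual dispatch in the environment may differ.

```python
class EasyGrader:
    """Placeholder standing in for the real class in graders.py."""


class MediumGrader:
    """Placeholder standing in for the real class in graders.py."""


class HardGrader:
    """Placeholder standing in for the real class in graders.py."""


GRADERS = {"easy": EasyGrader, "medium": MediumGrader, "hard": HardGrader}


def select_grader(sample):
    """Return a grader instance chosen from the sample's task_type field."""
    task_type = sample["task_type"]
    try:
        return GRADERS[task_type]()
    except KeyError:
        raise ValueError(f"Unknown task_type: {task_type!r}") from None
```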

Action Space

Actions must be returned as JSON with the following fields:

{
  "action_type": "comment | suggest_fix | final_decision",
  "comment": "Detailed description of identified issues (>30 characters)",
  "suggested_code": "Corrected code snippet, or null if not applicable",
  "decision": "approve | reject | null"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
| `comment` | `str` | Recommended | Technical description of issues found |
| `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
| `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
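
A minimal sketch of building and validating an action before sending it. `build_action` is a hypothetical helper, and the strictness of its checks is an assumption; the environment may accept or reject actions differently.

```python
import json

ACTION_TYPES = ("comment", "suggest_fix", "final_decision")
DECISIONS = ("approve", "reject")


def build_action(action_type, comment, suggested_code=None, decision=None):
    """Serialise one action, rejecting values the schema does not allow."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if action_type == "suggest_fix" and not suggested_code:
        raise ValueError("suggest_fix requires suggested_code")
    if action_type == "final_decision" and decision is None:
        raise ValueError("final_decision requires a decision")
    if decision is not None and decision not in DECISIONS:
        raise ValueError(f"decision must be one of {DECISIONS}")
    # None values serialise to JSON null, matching the schema above.
    return json.dumps({
        "action_type": action_type,
        "comment": comment,
        "suggested_code": suggested_code,
        "decision": decision,
    })
```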

Observation Space

Each step returns a CodeReviewObservation with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `pr` | `CodeReviewPullRequest` | The pull request under review |
| `pr.id` | `str` | Unique PR identifier |
| `pr.title` | `str` | Short title of the PR |
| `pr.description` | `str` | Brief description of intent |
| `pr.language` | `str` | Programming language (e.g. `python`) |
| `pr.diffs` | `List[CodeDiff]` | List of file diffs |
| `pr.diffs[].file_name` | `str` | Name of the changed file |
| `pr.diffs[].diff` | `str` | The actual code change |
| `previous_comments` | `List[str]` | Comments made in prior steps |
| `step_count` | `int` | Current step number |
| `max_steps` | `int` | Maximum steps per episode (default: 3) |
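
The field layout above can be mirrored with plain dataclasses, as in the sketch below. The real classes may be defined differently (for example as Pydantic models), and `render_prompt` is a hypothetical helper showing one way an agent could turn an observation into prompt text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeDiff:
    file_name: str
    diff: str


@dataclass
class CodeReviewPullRequest:
    id: str
    title: str
    description: str
    language: str
    diffs: List[CodeDiff] = field(default_factory=list)


@dataclass
class CodeReviewObservation:
    pr: CodeReviewPullRequest
    previous_comments: List[str] = field(default_factory=list)
    step_count: int = 0
    max_steps: int = 3


def render_prompt(obs):
    """Hypothetical helper: flatten an observation into prompt text."""
    lines = [f"PR {obs.pr.id}: {obs.pr.title} ({obs.pr.language})", obs.pr.description]
    for d in obs.pr.diffs:
        lines.append(f"--- {d.file_name}")
        lines.append(d.diff)
    lines.append(f"Step {obs.step_count} of {obs.max_steps}")
    return "\n".join(lines)
```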

Scoring

Grader tiers

The dataset contains three difficulty levels, each backed by a dedicated grader class in graders.py. The grader is selected automatically from task_type in the dataset sample.

| Tier | Class | Issue matching | Wrong decision | Done scoring |
| --- | --- | --- | --- | --- |
| `easy` | `EasyGrader` | Substring match | 0.2 partial credit | Max over full history |
| `medium` | `MediumGrader` | Token overlap + substring fallback | 0.1 partial credit | Recency-weighted max |
| `hard` | `HardGrader` | Token overlap + sequence similarity (threshold 0.3) | No credit | Final step only |
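
The three matching strategies named in the table can be illustrated with the standard library. This is a sketch under stated assumptions: the tokenisation (lowercased whitespace split), the use of `difflib.SequenceMatcher` for sequence similarity, and the way `hard_match` combines the two measures are all guesses, not the code in `graders.py`.

```python
from difflib import SequenceMatcher


def substring_match(issue, comment):
    """Easy-style check: the issue phrase appears anywhere in the comment."""
    return issue.lower() in comment.lower()


def token_overlap(issue, comment):
    """Fraction of issue tokens that also occur in the comment."""
    issue_tokens = set(issue.lower().split())
    comment_tokens = set(comment.lower().split())
    if not issue_tokens:
        return 0.0
    return len(issue_tokens & comment_tokens) / len(issue_tokens)


def sequence_similarity(issue, comment):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, issue.lower(), comment.lower()).ratio()


def hard_match(issue, comment, threshold=0.3):
    """Assumed combination: the better of the two measures must reach the threshold."""
    best = max(token_overlap(issue, comment), sequence_similarity(issue, comment))
    return best >= threshold
```

Token overlap is what lets a paraphrase such as "zero division is possible" still earn partial credit against the issue "division by zero", where a plain substring check would fail.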

Score components per tier

| Component | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Issue detection | 40% | 42% | 45% |
| Fix quality | 30% | 30% | 28% |
| Decision accuracy | 30% | 28% | 27% |

Fix quality is computed as a weighted combination of token overlap, sequence similarity, and (for medium/hard) line-level exact matching. Issue detection checks how many ground-truth issues appear in the agent's comment. All scores are clamped to [0.01, 0.99].
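
The per-tier weights and the final clamp can be sketched as a weighted sum. The weights come from the table above; the function names and the way an `adjustment` term (bonuses and penalties) is added before clamping are assumptions about how the graders combine these pieces.

```python
# Component weights per tier, taken from the table above.
WEIGHTS = {
    "easy": {"issue": 0.40, "fix": 0.30, "decision": 0.30},
    "medium": {"issue": 0.42, "fix": 0.30, "decision": 0.28},
    "hard": {"issue": 0.45, "fix": 0.28, "decision": 0.27},
}


def clamp(score, low=0.01, high=0.99):
    """Keep a score inside the documented [0.01, 0.99] range."""
    return max(low, min(high, score))


def combine(tier, issue, fix, decision, adjustment=0.0):
    """Weighted sum of component scores (each in [0, 1]) plus an adjustment."""
    w = WEIGHTS[tier]
    raw = w["issue"] * issue + w["fix"] * fix + w["decision"] * decision
    return clamp(raw + adjustment)
```

Because of the clamp, a perfect review scores `0.99` rather than `1.0`, and a review that earns nothing scores `0.01` rather than `0.0`.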

Bonuses and penalties

| Condition | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Comment length > 30 chars | +0.15 | +0.10 | none |
| Correct decision at step 1 | +0.10 | +0.10 | +0.05 |
| Correct decision at step 2 | +0.10 | +0.05 | none |
| No comment on non-decision step | −0.05 | −0.08 | −0.12 |
| Step count > 3 | none | −0.04/step | −0.05 × (steps − 2) |

Task Descriptions

Easy

Straightforward single-file issues with an obvious fix. The EasyGrader uses simple substring matching — the agent gets full issue credit if the issue phrase appears anywhere in the comment.

| PR | Issue | Expected Decision |
| --- | --- | --- |
| Missing import | `datetime` used without import | reject |

What the agent must do: Detect the missing `from datetime import datetime` statement and supply the corrected import line.
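
For illustration, a corrected version of such a file might look like the sketch below. `current_timestamp` is a hypothetical stand-in for the code under review, which the dataset sample defines.

```python
from datetime import datetime  # the import the buggy diff omits


def current_timestamp():
    """Hypothetical stand-in for the function in the PR under review."""
    # Without the import above, this line raises NameError at call time.
    return datetime.now().isoformat()
```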

Medium

Logical or performance issues that require understanding of Python semantics. The MediumGrader uses token overlap so paraphrased descriptions still score well.

| PR | Issue | Expected Decision |
| --- | --- | --- |
| Division function | No guard against division by zero | reject |
| Inefficient loop | `range(len(arr))` indexing pattern; the loop can iterate over the list directly with `in` | approve |

What the agent must do: For the division task, add an `if b == 0: return None` guard. For the loop task, recognise the pattern as a style issue rather than a correctness bug; the correct decision is approve.
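
Both medium tasks can be illustrated in a few lines. The function names and the summing loop body are hypothetical; only the guard and the `range(len(arr))` pattern come from the task descriptions.

```python
def divide(a, b):
    """Division with the guard the agent is expected to add."""
    if b == 0:
        return None
    return a / b


def total_indexed(arr):
    """The pattern flagged in the PR: works, but indexes needlessly."""
    total = 0
    for i in range(len(arr)):
        total += arr[i]
    return total


def total_direct(arr):
    """Equivalent loop that iterates over the list directly."""
    total = 0
    for value in arr:
        total += value
    return total
```

The two loops return the same result for any list, which is why the expected decision for that PR is approve: the change is stylistic, not a bug fix.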

Hard

Security vulnerabilities, injection attacks, and cross-file null-handling bugs. The HardGrader applies a minimum similarity threshold: vague or generic comments receive zero issue credit.

| PR | Issue | Expected Decision |
| --- | --- | --- |
| Authentication logic | Hardcoded plaintext password `admin123` | reject |
| SQL query | String concatenation exposes SQL injection | reject |
| Cross-file null bug | `get_user(None)` called without input validation | reject |

What the agent must do:

Auth: Detect the hardcoded secret and propose bcrypt-based password comparison.
SQL: Detect the string concatenation and replace it with a parameterised query: pass the value through a `%s` placeholder as an argument to `cursor.execute` instead of building the SQL string.
Null bug: Check that `id` is not `None` before the `db[id]` lookup, and fix the call site in `controller.py`.
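
Two of these fixes can be sketched with the standard library alone. The example uses `sqlite3`, whose placeholder is `?`; drivers such as psycopg2 use the `%s` style mentioned above. The table layout, function names, and the choice to raise `ValueError` are assumptions for illustration. The bcrypt fix is omitted because it needs a third-party package.

```python
import sqlite3


def find_user_by_name(conn, name):
    """Parameterised query: the driver binds `name`, so it is never parsed as SQL."""
    # sqlite3 uses "?" placeholders; "%s" applies to drivers like psycopg2.
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchone()


def get_user(db, user_id):
    """Validate the id before the dictionary lookup."""
    if user_id is None:
        raise ValueError("user_id must not be None")
    return db[user_id]
```

With a bound parameter, an input such as `' OR '1'='1` is compared as a literal name and matches no row, instead of rewriting the `WHERE` clause.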

Baseline Scores

Expected performance ranges by model capability:

| Score Range | Interpretation |
| --- | --- |
| 0.00 – 0.20 | Failing: agent cannot follow the JSON schema or identify basic issues |
| 0.20 – 0.50 | Partial: agent detects some issues but misses security vulnerabilities or gives wrong decisions |
| 0.50 – 0.75 | Competent: agent handles easy and medium tasks; struggles with hard security/null cases |
| 0.75 – 1.00 | Strong: agent reliably detects all issue types, generates correct fixes, and makes sound decisions |

Step-level log format

[START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=comment        reward=0.55 done=false error=null
[STEP]  step=2 action=suggest_fix    reward=0.72 done=false error=null
[STEP]  step=3 action=final_decision reward=0.85 done=true  error=null
[END]   success=true steps=3 score=0.850 rewards=0.55,0.72,0.85
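
Lines in this format are easy to parse for later analysis. The sketch below handles `[STEP]` lines only; the regular expression is inferred from the sample log above and may need adjusting if the real output differs. `parse_step` is a hypothetical helper.

```python
import re

# Pattern inferred from the sample [STEP] lines above.
STEP_PATTERN = re.compile(
    r"^\[STEP\]\s+step=(?P<step>\d+)\s+action=(?P<action>\w+)\s+"
    r"reward=(?P<reward>[\d.]+)\s+done=(?P<done>true|false)\s+error=(?P<error>.*)$"
)


def parse_step(line):
    """Return the fields of one [STEP] line as a dict, or None if it does not match."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        # The literal string "null" in the log means no error occurred.
        "error": None if match["error"] == "null" else match["error"],
    }
```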

Conclusion

The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps — issue detection, fix generation, and final judgment — and by scaling difficulty through dedicated grader tiers, it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.