title: Code Review Environment Server
emoji: 🎳
colorFrom: green
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Code Review Environment

A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks spanning missing imports, logic errors, and security vulnerabilities.


Motivation

Code review is a high-stakes, multi-step reasoning task that requires an agent to:

  • Detect bugs and security vulnerabilities from raw code diffs
  • Generate corrective code that resolves identified issues
  • Make a final judgment (approve or reject) backed by technical reasoning

Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop (detection, remediation, and decision-making) in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in real software development pipelines.


Setup and Usage

Install dependencies

pip install openenv-core
git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
uv pip install -e .

Run the server locally (optional)

uv run server --host 0.0.0.0 --port 8000

Run the agent

uv run python inference.py

Environment variables

Set the following before running:

| Variable | Description |
| --- | --- |
| API_BASE_URL | The API endpoint for the LLM (e.g. https://router.huggingface.co/v1) |
| MODEL_NAME | The model identifier to use for inference |
| HF_TOKEN | Your Hugging Face / API key |
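
As a minimal sketch of how a script might read these variables, assuming they are plain process environment variables: the helper name `load_llm_config` is hypothetical and is not taken from inference.py, which may read them differently.

```python
import os

# Hypothetical helper, not part of the repository.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

def load_llm_config(env=None):
    """Return the three required settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the list of missing names tends to be easier to debug than an authentication error from the first LLM call.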

Key constants in inference.py

| Constant | Default | Description |
| --- | --- | --- |
| MAX_STEPS | 3 | Steps per episode |
| NUM_EPISODES | 16 | Number of PRs to review |
| TEMPERATURE | 0.2 | Sampling temperature (lower = more deterministic) |
| MAX_TOKENS | 512 | Max tokens per LLM response |
| SUCCESS_SCORE_THRESHOLD | 0.1 | Minimum score to count as success |

Environment Description

The agent receives a pull request observation at each step and must respond with a structured JSON action. Each episode runs for up to MAX_STEPS = 3 steps following a fixed workflow:

| Step | Expected Action | Purpose |
| --- | --- | --- |
| 1 | comment | Identify all issues in the diff |
| 2 | suggest_fix | Provide corrected code |
| 3 | final_decision | Approve or reject the PR |

Each step is independently scored. The final episode score is the maximum score achieved across all steps.

The environment automatically selects a grader tier (easy, medium, or hard) based on the task_type field of each dataset sample. No manual configuration is needed: the grader switches per episode when reset() is called.
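
The fixed workflow above can be sketched as a lookup from step number to expected action type. The names `expected_action` and `episode_score` are hypothetical helpers for illustration and are not part of the environment's API; `episode_score` follows the "maximum across steps" rule stated above.

```python
# Hypothetical helpers illustrating the three-step workflow.
MAX_STEPS = 3
EXPECTED_ACTIONS = {1: "comment", 2: "suggest_fix", 3: "final_decision"}

def expected_action(step: int) -> str:
    """Return the action type the workflow expects at a 1-based step."""
    if not 1 <= step <= MAX_STEPS:
        raise ValueError(f"step must be between 1 and {MAX_STEPS}, got {step}")
    return EXPECTED_ACTIONS[step]

def episode_score(step_rewards):
    """Final episode score: the best reward seen over the episode's steps."""
    return max(step_rewards)
```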


Action Space

Actions must be returned as JSON with the following fields:

{
  "action_type": "comment | suggest_fix | final_decision",
  "comment": "Detailed description of identified issues (>30 characters)",
  "suggested_code": "Corrected code snippet, or null if not applicable",
  "decision": "approve | reject | null"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| action_type | str | Always | One of comment, suggest_fix, final_decision |
| comment | str | Recommended | Technical description of issues found |
| suggested_code | str \| null | Step 2 | Corrected code replacing the buggy diff |
| decision | str \| null | Step 3 | approve or reject; null otherwise |
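
Because the model replies in free text, it can help to validate the JSON before sending it to the environment. The following is a minimal sketch of such a check, assuming the schema above; `parse_action` is a hypothetical helper, and the environment may apply its own, different validation.

```python
import json

ACTION_TYPES = {"comment", "suggest_fix", "final_decision"}
DECISIONS = {"approve", "reject"}

def parse_action(raw: str) -> dict:
    """Parse a model reply and check it against the action schema."""
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    # suggested_code is required at the suggest_fix step.
    if action_type == "suggest_fix" and not action.get("suggested_code"):
        raise ValueError("suggest_fix requires suggested_code")
    # decision is required at the final_decision step.
    if action_type == "final_decision" and action.get("decision") not in DECISIONS:
        raise ValueError("final_decision requires decision 'approve' or 'reject'")
    return {
        "action_type": action_type,
        "comment": action.get("comment", ""),
        "suggested_code": action.get("suggested_code"),
        "decision": action.get("decision"),
    }
```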

Observation Space

Each step returns a CodeReviewObservation with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| pr | CodeReviewPullRequest | The pull request under review |
| pr.id | str | Unique PR identifier |
| pr.title | str | Short title of the PR |
| pr.description | str | Brief description of intent |
| pr.language | str | Programming language (e.g. python) |
| pr.diffs | List[CodeDiff] | List of file diffs |
| pr.diffs[].file_name | str | Name of the changed file |
| pr.diffs[].diff | str | The actual code change |
| previous_comments | List[str] | Comments made in prior steps |
| step_count | int | Current step number |
| max_steps | int | Maximum steps per episode (default: 3) |
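
To make the shape of an observation concrete, here is a sketch that mirrors the fields above with plain dataclasses. The class and field names come from the table, but the real classes in the repository may be defined differently (for example with a validation library); `render_prompt` is a hypothetical helper showing one way to turn an observation into prompt text.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CodeDiff:
    file_name: str
    diff: str

@dataclass
class CodeReviewPullRequest:
    id: str
    title: str
    description: str
    language: str
    diffs: List[CodeDiff] = field(default_factory=list)

@dataclass
class CodeReviewObservation:
    pr: CodeReviewPullRequest
    previous_comments: List[str] = field(default_factory=list)
    step_count: int = 0
    max_steps: int = 3

def render_prompt(obs: CodeReviewObservation) -> str:
    """Hypothetical helper: flatten an observation into text for the LLM."""
    lines = [
        f"PR {obs.pr.id}: {obs.pr.title} ({obs.pr.language})",
        obs.pr.description,
        f"Step {obs.step_count} of {obs.max_steps}",
    ]
    for d in obs.pr.diffs:
        lines.append(f"--- {d.file_name}")
        lines.append(d.diff)
    for comment in obs.previous_comments:
        lines.append(f"Previous comment: {comment}")
    return "\n".join(lines)
```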

Scoring

Grader tiers

The dataset contains three difficulty levels, each backed by a dedicated grader class in graders.py. The grader is selected automatically from task_type in the dataset sample.

| Tier | Class | Issue matching | Wrong decision | Done scoring |
| --- | --- | --- | --- | --- |
| easy | EasyGrader | Substring match | 0.2 partial credit | Max over full history |
| medium | MediumGrader | Token overlap + substring fallback | 0.1 partial credit | Recency-weighted max |
| hard | HardGrader | Token overlap + seq sim (threshold 0.3) | No credit | Final step only |
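
The three issue-matching strategies can be illustrated with the standard library. This is a sketch under stated assumptions, not the code in graders.py: the tokenizer, the token-level use of `difflib.SequenceMatcher`, and the equal-weight average in `passes_hard_threshold` are all choices made here for illustration. Only the 0.3 threshold comes from the table above.

```python
import re
from difflib import SequenceMatcher

def tokens(text: str):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def substring_match(issue: str, comment: str) -> bool:
    """Easy-style check: the issue phrase appears verbatim in the comment."""
    return issue.lower() in comment.lower()

def token_overlap(issue: str, comment: str) -> float:
    """Fraction of the issue's distinct tokens that also occur in the comment."""
    issue_tokens = set(tokens(issue))
    if not issue_tokens:
        return 0.0
    return len(issue_tokens & set(tokens(comment))) / len(issue_tokens)

def sequence_similarity(issue: str, comment: str) -> float:
    """Order-aware similarity computed over token lists."""
    return SequenceMatcher(None, tokens(issue), tokens(comment)).ratio()

def passes_hard_threshold(issue: str, comment: str, threshold: float = 0.3) -> bool:
    """Assumed combination: mean of overlap and sequence similarity."""
    score = 0.5 * token_overlap(issue, comment) + 0.5 * sequence_similarity(issue, comment)
    return score >= threshold
```

Comparing token lists rather than raw characters keeps unrelated English sentences from scoring well just because they share common letters.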

Score components per tier

| Component | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Issue detection | 40% | 42% | 45% |
| Fix quality | 30% | 30% | 28% |
| Decision accuracy | 30% | 28% | 27% |

Fix quality is computed as a weighted combination of token overlap, sequence similarity, and (for medium/hard) line-level exact matching. Issue detection checks how many ground-truth issues appear in the agent's comment. All scores are clamped to [0.01, 0.99].
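
As a worked illustration of the weights and clamping described above, the sketch below combines three component scores in [0, 1] into a step score. The weights and the [0.01, 0.99] clamp come from this section; the function names and the single `adjustment` argument for bonuses and penalties are assumptions, and the graders may combine terms differently.

```python
# Component weights per tier, taken from the table above.
WEIGHTS = {
    "easy":   {"issue": 0.40, "fix": 0.30, "decision": 0.30},
    "medium": {"issue": 0.42, "fix": 0.30, "decision": 0.28},
    "hard":   {"issue": 0.45, "fix": 0.28, "decision": 0.27},
}

def clamp(score: float, low: float = 0.01, high: float = 0.99) -> float:
    """Keep a score inside the documented [0.01, 0.99] range."""
    return max(low, min(high, score))

def combine(tier: str, issue: float, fix: float, decision: float,
            adjustment: float = 0.0) -> float:
    """Weighted sum of the components plus an optional bonus or penalty."""
    w = WEIGHTS[tier]
    raw = w["issue"] * issue + w["fix"] * fix + w["decision"] * decision
    return clamp(raw + adjustment)
```

For example, a medium-tier step with half credit on issues and fix and a correct decision gives 0.42 × 0.5 + 0.30 × 0.5 + 0.28 × 1.0 = 0.64.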

Bonuses and penalties

| Condition | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Comment length > 30 chars | +0.15 | +0.10 | n/a |
| Correct decision at step 1 | +0.10 | +0.10 | +0.05 |
| Correct decision at step 2 | +0.10 | +0.05 | n/a |
| No comment on non-decision step | −0.05 | −0.08 | −0.12 |
| Step count > 3 | n/a | −0.04/step | −0.05 × (steps − 2) |

Task Descriptions

Easy

Straightforward single-file issues with an obvious fix. The EasyGrader uses simple substring matching: the agent gets full issue credit if the issue phrase appears anywhere in the comment.

| PR | Issue | Expected Decision |
| --- | --- | --- |
| Missing import | datetime used without import | reject |

What the agent must do: Detect the missing from datetime import datetime statement and supply the corrected import line.


Medium

Logical or performance issues that require understanding of Python semantics. The MediumGrader uses token overlap so paraphrased descriptions still score well.

| PR | Issue | Expected Decision |
| --- | --- | --- |
| Division function | No guard against division by zero | reject |
| Inefficient loop | range(len(arr)) pattern; can iterate over the items with in directly | approve |

What the agent must do: For the division task, add an if b == 0: return None guard. For the loop task, recognise it as a style issue rather than a correctness bug, so the correct decision is approve.
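
The two medium tasks can be illustrated as follows. The function names are hypothetical and the actual diffs in the dataset may look different; the guard is the one described above, and the two loop variants show why the loop PR is a style matter only.

```python
def safe_divide(a, b):
    """Division with the zero guard the division task expects."""
    if b == 0:
        return None
    return a / b

def total_indexed(arr):
    """The range(len(arr)) pattern: correct, but not idiomatic."""
    total = 0
    for i in range(len(arr)):
        total += arr[i]
    return total

def total_direct(arr):
    """Same behaviour, iterating over the items directly with `in`."""
    total = 0
    for value in arr:
        total += value
    return total
```

Both loop versions return the same result for any list of numbers, which is why approving the PR is the expected decision.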


Hard

Security vulnerabilities, injection attacks, and cross-file null-handling bugs. The HardGrader applies a minimum similarity threshold: vague or generic comments receive zero issue credit.

| PR | Issue | Expected Decision |
| --- | --- | --- |
| Authentication logic | Hardcoded plaintext password admin123 | reject |
| SQL query | String concatenation exposes SQL injection | reject |
| Cross-file null bug | get_user(None) called without input validation | reject |

What the agent must do:

  • Auth: Detect the hardcoded secret and propose bcrypt-based password comparison.
  • SQL: Detect string concatenation and replace it with a parameterised query that passes the value through a %s placeholder in cursor.execute.
  • Null bug: Validate id is not None before the db[id] lookup and fix the call site in controller.py.
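
The SQL and null-handling fixes can be sketched with the standard library alone. This is an illustration under assumptions, not the dataset's reference fix: it uses sqlite3, whose placeholder is `?` rather than the `%s` used by some other database drivers, the table layout and the name `find_user_by_name` are hypothetical, and the parameter is called `user_id` here to avoid shadowing Python's built-in `id`. The bcrypt-based auth fix is left out because it needs a third-party package.

```python
import sqlite3

def find_user_by_name(conn, name):
    """Parameterised query: the driver binds `name`, so quotes in the
    input cannot change the structure of the SQL statement."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchone()

def get_user(db, user_id):
    """Validate the identifier before the lookup instead of failing inside it."""
    if user_id is None:
        raise ValueError("user_id must not be None")
    return db[user_id]

# Small in-memory database for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
```

With string concatenation, an input such as `x' OR '1'='1` would match every row; with the bound parameter it is treated as a literal name and matches nothing.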

Baseline Scores

Expected performance ranges by model capability:

| Score Range | Interpretation |
| --- | --- |
| 0.00 – 0.20 | Failing: agent cannot follow the JSON schema or identify basic issues |
| 0.20 – 0.50 | Partial: agent detects some issues but misses security vulnerabilities or gives wrong decisions |
| 0.50 – 0.75 | Competent: agent handles easy and medium tasks; struggles with hard security/null cases |
| 0.75 – 1.00 | Strong: agent reliably detects all issue types, generates correct fixes, and makes sound decisions |

Step-level log format

[START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=comment        reward=0.55 done=false error=null
[STEP]  step=2 action=suggest_fix    reward=0.72 done=false error=null
[STEP]  step=3 action=final_decision reward=0.85 done=true  error=null
[END]   success=true steps=3 score=0.850 rewards=0.55,0.72,0.85
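
For post-processing runs, the [STEP] lines above can be parsed with a regular expression. This is a sketch that assumes the exact key order shown in the sample log; `parse_step_line` is a hypothetical helper, not something shipped with the repository.

```python
import re

# Matches one [STEP] line in the format shown above.
STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(?P<step>\d+)\s+action=(?P<action>\S+)\s+"
    r"reward=(?P<reward>[\d.]+)\s+done=(?P<done>true|false)\s+error=(?P<error>.*)$"
)

def parse_step_line(line: str):
    """Return the fields of a [STEP] line as a dict, or None for other lines."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }
```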

Conclusion

The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps (issue detection, fix generation, and final judgment) and by scaling difficulty through dedicated grader tiers, it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.