Spaces:
Sleeping
title: PR Review OpenEnv
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
- code-review
- reinforcement-learning
PR Review OpenEnv
An OpenEnv-compatible environment for benchmarking AI agents on pull request code review. Agents read unified diffs, post review comments, and make approve/reject decisions. No real GitHub access β all 22 scenarios are self-contained JSON files.
Environment Description
The environment presents agents with realistic pull request diffs (Python code). The agent must identify bugs and security issues through comments, then issue a final approve or request_changes decision. Rewards are shaped to encourage precise, specific feedback β not hallucination.
Observation Space
| Field | Type | Description |
|---|---|---|
pr_title |
string | Title of the pull request |
pr_description |
string | Author's description of the change |
diff |
string | Unified diff of the PR |
file_tree |
list[string] | Files touched (reserved, currently empty) |
comments_so_far |
list[object] | Comments posted this episode |
step_count |
integer | Steps taken so far |
done |
boolean | True when the episode has ended |
Action Space
| Field | Type | Description |
|---|---|---|
action_type |
enum | comment Β· approve Β· request_changes |
file |
string (optional) | Filename the comment refers to |
line |
integer (optional) | Line number the comment refers to |
body |
string | Comment text (required for comment) |
Submitting approve or request_changes ends the episode and triggers scoring.
Tasks
| ID | Scenarios | Max Steps | Success Threshold | Description |
|---|---|---|---|---|
easy |
8 | 5 | 0.7 | Obvious bugs β off-by-one, null dereference, hardcoded secrets |
medium |
7 | 10 | 0.6 | Logic bugs and security issues across multiple files |
hard |
7 | 15 | 0.5 | Subtle race conditions, TOCTOU, cache invalidation, timing attacks |
Each task pool includes clean PRs (no bugs) to test false-positive behaviour.
Scoring
score = bug_detection_rate Γ 0.7 + decision_correct Γ 0.3
- Bug detection β fraction of ground-truth bugs whose keywords appear in comments.
- Decision correct β +0.3 if approve/reject matches ground truth, else -0.3.
- False rejection penalty β β0.2 if agent rejects a clean (bug-free) PR.
- Reward range β
[-0.5, 1.0](perfect score = 1.0).
Partial rewards are given per step: each comment that correctly identifies a new bug earns 0.7 / total_bugs. False-positive comments earn β0.05.
Setup
Docker (recommended)
docker build -t pr-review-env .
docker run -p 7860:7860 pr-review-env
API available at http://localhost:7860. Docs at http://localhost:7860/docs.
Python (no Docker)
pip install -r requirements.txt
uvicorn src.api:app --host 0.0.0.0 --port 7860
Running Inference
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_..."
export API_BASE_URL="https://router.huggingface.co/hf-inference/v1"
export ENV_URL="http://localhost:7860"
python inference.py
Structured log output:
[START] {"task": "easy", "env": "PRReviewEnv", "model": "meta-llama/Llama-3.1-8B-Instruct"}
[STEP] {"step": 1, "action": "off-by-one error in loop bound", "reward": 0.35, "done": false, "error": null}
[STEP] {"step": 2, "action": "request_changes", "reward": 0.3, "done": true, "error": null}
[END] {"success": true, "steps": 2, "score": 0.65, "rewards": [0.35, 0.3]}
API Reference
| Method | Path | Description |
|---|---|---|
GET |
/ |
Health check |
POST |
/reset?task=easy |
Start a new episode |
POST |
/step |
Submit an action |
GET |
/state |
Inspect current episode state |
Baseline Scores
Run
python inference.pywith your model to populate.
| Model | Easy | Medium | Hard | Average |
|---|---|---|---|---|
| (placeholder) | β | β | β | β |