title: Code Review Environment
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
- openenv
- reinforcement-learning
- code-review
Code Review OpenEnv Benchmark
Scaler March 2026 Hackathon Submission
Author: Dolphin-Syndrom
Type: OpenEnv Benchmark Environment
Focus: Evaluating LLM agents on security-aware code review tasks
⚡ TL;DR
A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.
- 5 tasks with progressive difficulty (extra_easy → easy → medium → hard → expert)
- 12-tag issue taxonomy covering security, logic, and robustness flaws
- Multi-dimensional grading: recall + quality bonus + severity bonus − precision penalty
- Iterative refinement: feedback-driven multi-step improvement within episodes
- 32 unit tests covering graders, environment lifecycle, and task coverage
- Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
- Fully OpenEnv compliant
The environment is designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.
It is suitable for benchmarking agent performance, reward-shaping strategies, and detection accuracy, including how well an agent avoids hallucinated false positives.
What Makes This Environment Unique
1. Iterative Refinement Mechanic
Unlike single-shot evaluation environments, this benchmark returns structured feedback after each step that tells the agent which categories of issues it missed, without revealing the exact tags. This creates a multi-step refinement loop:
Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
Step 3: Final attempt with all accumulated feedback
This models how real code review works: reviewers iterate based on discussion and feedback.
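The hint step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `TAG_CATEGORY` mapping mirrors the category grouping used in this README, while the function name `build_hints` and the exact hint wording are assumptions.

```python
# Hypothetical sketch: turn missed tags into category-level hints
# so the feedback never names the exact tag the agent overlooked.
TAG_CATEGORY = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness problem",
    "timing_attack": "robustness problem",
    "improper_error_handling": "robustness problem",
    "type_error": "input handling problem",
    "integer_overflow": "input handling problem",
    "missing_input_validation": "input handling problem",
}


def build_hints(expected: set[str], found: set[str]) -> list[str]:
    """Return one hint per category that still contains a missed issue."""
    missed = expected - found
    categories = sorted({TAG_CATEGORY[tag] for tag in missed})
    return [f"Hint: look for {category}" for category in categories]


hints = build_hints({"sql_injection", "hardcoded_secret"}, {"hardcoded_secret"})
print(hints)  # ['Hint: look for security vulnerability']
```

Deduplicating by category keeps the hint from leaking how many tags were missed within a category, which is one possible design choice among several.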
2. Multi-Dimensional Reward Function
The grading system combines four components in a single score:
| Component | Value | Signal |
|---|---|---|
| Recall reward | correct / expected | Fraction of planted issues found |
| Quality bonus | +0.05 per issue | Keyword-rich explanations |
| Severity bonus | +0.05 | Correct risk assessment |
| Precision penalty | −0.10 per FP | Anti-hallucination |
This forces agents to balance thoroughness against precision, a core tension in real code review.
3. Full 12-Tag Taxonomy Coverage
Every tag in the taxonomy is exercised across the 5 tasks:
| Category | Tags | Task Coverage |
|---|---|---|
| Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy |
| Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert |
| Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard |
| Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert |
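The coverage claim can be checked mechanically. The sketch below copies the planted issues from the Tasks table in this README into a dictionary and confirms that their union equals the 12-tag taxonomy; the names `TASK_ISSUES` and `uncovered_tags` are illustrative and may not match `server/tasks.py`.

```python
# Planted issues per task, copied from the Tasks table in this README.
TASK_ISSUES = {
    "task_extra_easy": {"index_out_of_bounds"},
    "task_easy": {"null_pointer", "missing_return"},
    "task_medium": {"sql_injection", "hardcoded_secret"},
    "task_hard": {"race_condition", "improper_error_handling", "timing_attack"},
    "task_expert": {
        "path_traversal",
        "integer_overflow",
        "missing_input_validation",
        "type_error",
    },
}

# The 12 tags listed in the taxonomy table.
ISSUE_TAXONOMY = {
    "null_pointer", "missing_return", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "race_condition", "timing_attack", "improper_error_handling",
    "type_error", "integer_overflow", "missing_input_validation",
}


def uncovered_tags(taxonomy: set[str], task_issues: dict[str, set[str]]) -> set[str]:
    """Return taxonomy tags that no task exercises."""
    covered = set().union(*task_issues.values())
    return taxonomy - covered


print(uncovered_tags(ISSUE_TAXONOMY, TASK_ISSUES))  # set()
```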
Architecture
graph TB
Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
Agent -->|POST /step| Server
Server --> Env[CodeReviewEnvironment]
Env --> Tasks[Task Registry - 5 tasks]
Env --> Grader[Deterministic Grader]
Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
Score -->|observation + reward + feedback| Agent
Server -->|GET /health| Health[Health Check]
Server -->|POST /grader| Grader
Server -->|POST /baseline| Baseline[Rule-Based Baseline]
Server -->|Gradio UI| Dashboard[Analytics Dashboard]
style Agent fill:#58a6ff,stroke:#333
style Server fill:#3fb950,stroke:#333
style Grader fill:#f0883e,stroke:#333
style Dashboard fill:#bc8cff,stroke:#333
Environment Specification
Objective
For each episode, the agent sees a Python code snippet containing planted issues and must:
- Identify issues using tags from a 12-item `ISSUE_TAXONOMY`
- Assess overall severity (`low`, `medium`, `high`, `critical`)
- Articulate findings in a human-readable `review_comment`
- Iteratively refine based on environment feedback across up to 3 steps
Observation Space
| Field | Type | Description |
|---|---|---|
| `task_id` | string | Current task identifier |
| `file_name` | string | File under review |
| `task_description` | string | Review instructions |
| `code_snippet` | string | Python code with planted issues |
| `feedback` | string | Previous step feedback with refinement hints |
| `step_number` | integer | Current step (0 after reset) |
| `available_issue_tags` | array | Allowed taxonomy tags |
Action Space
| Field | Type | Description |
|---|---|---|
| `issues_found` | list[str] | Tags from ISSUE_TAXONOMY |
| `severity` | enum | low / medium / high / critical |
| `review_comment` | string | Explanation of identified issues |
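The two tables above can be expressed as typed models. The repository's `models.py` is described as using Pydantic; the sketch below uses standard-library dataclasses instead so that it runs without extra dependencies. The class names and the `unknown_tags` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class ReviewObservation:
    """Fields from the Observation Space table."""
    task_id: str
    file_name: str
    task_description: str
    code_snippet: str
    feedback: str = ""
    step_number: int = 0
    available_issue_tags: list[str] = field(default_factory=list)


@dataclass
class ReviewAction:
    """Fields from the Action Space table."""
    issues_found: list[str]
    severity: Severity
    review_comment: str


def unknown_tags(action: ReviewAction, allowed: list[str]) -> list[str]:
    """Return submitted tags that are not in the allowed taxonomy."""
    return sorted(set(action.issues_found) - set(allowed))


action = ReviewAction(["sql_injection", "made_up_tag"], Severity("critical"), "Query is concatenated.")
print(unknown_tags(action, ["sql_injection", "hardcoded_secret"]))  # ['made_up_tag']
```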
Episode Flow
- `reset(task_id)` loads a task and returns the initial observation
- Agent receives code snippet and available tags
- Agent submits review via `step(action)`
- Environment returns observation with score, feedback, and refinement hints
- Agent can use feedback to improve on subsequent steps
- Episode ends when score ≥ 0.95 or step limit (3) reached
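The flow above can be sketched as a loop against a toy stand-in for the environment. `ToyReviewEnv` is hypothetical and scores by recall only; it exists purely to show the termination rule (score ≥ 0.95 or 3 steps) and does not reproduce the real `CodeReviewEnvironment` or its observation format.

```python
MAX_STEPS = 3
SUCCESS_THRESHOLD = 0.95


class ToyReviewEnv:
    """Hypothetical stand-in for the real environment; recall-only scoring."""

    def __init__(self, expected: set[str]) -> None:
        self.expected = set(expected)
        self.step_number = 0

    def reset(self) -> dict:
        self.step_number = 0
        return {"feedback": "", "step_number": 0}

    def step(self, issues_found: list[str]) -> tuple[dict, float, bool]:
        self.step_number += 1
        found = set(issues_found)
        score = len(self.expected & found) / len(self.expected)
        missed = len(self.expected - found)
        # Episode ends on success or when the step budget is exhausted.
        done = score >= SUCCESS_THRESHOLD or self.step_number >= MAX_STEPS
        obs = {
            "feedback": f"{missed} planted issue(s) still missed",
            "step_number": self.step_number,
        }
        return obs, score, done


def run_episode(env: ToyReviewEnv, policy) -> tuple[float, int]:
    """Run one episode and return (final score, steps used)."""
    obs = env.reset()
    score, done = 0.0, False
    while not done:
        obs, score, done = env.step(policy(obs))
    return score, obs["step_number"]


def widening_policy(obs: dict) -> list[str]:
    """Toy agent that reports one more tag on each step."""
    guesses = ["sql_injection", "hardcoded_secret"]
    return guesses[: obs["step_number"] + 1]


env = ToyReviewEnv({"sql_injection", "hardcoded_secret"})
print(run_episode(env, widening_policy))  # (1.0, 2)
```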
Tasks
| Task | Difficulty | Planted Issues | File |
|---|---|---|---|
| `task_extra_easy` | Extra Easy | `index_out_of_bounds` | `data_utils.py` |
| `task_easy` | Easy | `null_pointer`, `missing_return` | `user_service.py` |
| `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | `auth.py` |
| `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | `payments.py` |
| `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | `file_processor.py` |
Reward Design
Summary: A fully correct review scores close to 1.0, while random or over-reported tags are penalized, so the reward signal stays informative.
The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.
Core components:
- Recall reward: fractional points for correctly identified issues
- Quality bonus: +0.05 per correct issue with a matching keyword in the comment
- Severity bonus: +0.05 when severity matches expected level for task difficulty
- Precision penalty: −0.10 per hallucinated or false-positive issue
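The four components can be combined as in the sketch below. It is an approximation under stated assumptions rather than the code in `server/graders.py`: the keyword lists, the `expected_severity` argument, the clamping to the 0.0–1.0 range, and the rounding are all illustrative choices.

```python
def grade(
    expected: set[str],
    found: set[str],
    comment: str,
    severity: str,
    expected_severity: str,
    keywords: dict[str, tuple[str, ...]],
) -> float:
    """Hypothetical grader: recall + quality + severity - precision penalty."""
    correct = expected & found
    false_positives = found - expected
    text = comment.lower()

    recall = len(correct) / len(expected) if expected else 0.0
    # +0.05 for each correct tag whose keyword appears in the comment.
    quality = 0.05 * sum(
        1 for tag in correct if any(kw in text for kw in keywords.get(tag, ()))
    )
    severity_bonus = 0.05 if severity == expected_severity else 0.0
    penalty = 0.10 * len(false_positives)

    # Clamping to [0, 1] is an assumption based on the stated score range.
    score = recall + quality + severity_bonus - penalty
    return round(max(0.0, min(1.0, score)), 4)


KEYWORDS = {
    "sql_injection": ("sql", "query"),
    "hardcoded_secret": ("secret", "credential"),
}

score = grade(
    expected={"sql_injection", "hardcoded_secret"},
    found={"sql_injection", "null_pointer"},
    comment="The SQL query is built by string concatenation.",
    severity="high",
    expected_severity="critical",
    keywords=KEYWORDS,
)
print(score)  # 0.45 = 0.5 recall + 0.05 quality + 0.0 severity - 0.10 penalty
```

In this example one of two planted issues is found (recall 0.5), its explanation matches a keyword (+0.05), the severity is wrong (+0.0), and one false positive is reported (−0.10).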
Project Structure
.
├── __init__.py                          # Package exports
├── client.py                            # WebSocket client for agent interaction
├── models.py                            # Typed Pydantic models (Action, Observation, State)
├── inference.py                         # Baseline inference script with LLM + rule fallback
├── openenv.yaml                         # OpenEnv specification
├── pyproject.toml                       # Project config with pytest setup
├── requirements.txt                     # Pip dependencies
├── Dockerfile                           # Production container with health check
├── conftest.py                          # Pytest root configuration
├── README.md
├── scripts/
│   └── validate-submission.sh
├── server/
│   ├── __init__.py
│   ├── app.py                           # FastAPI + Gradio dashboard
│   ├── code_review_env_environment.py   # Environment with iterative refinement
│   ├── graders.py                       # Multi-dimensional deterministic grader
│   ├── tasks.py                         # 5 task definitions with planted issues
│   ├── requirements.txt
│   └── Dockerfile
└── tests/
    ├── conftest.py
    ├── __init__.py
    ├── test_graders.py                  # 19 grader tests
    └── test_environment.py              # 13 environment lifecycle tests
Setup
uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt
Running
Start the server
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
Run tests
uv run pytest tests/ -v
Run baseline inference
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py
Docker
docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List all tasks with schemas |
| `POST` | `/reset` | Reset environment for a task |
| `POST` | `/step` | Submit a review action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Score a review against a task |
| `POST` | `/baseline` | Run rule-based baseline |
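A client might prepare requests for `/reset` and `/step` roughly as shown below, using only the standard library. The JSON payload shapes are assumptions derived from the Action Space fields; the real server may expect a different envelope (for example, the action nested under another key), so check `GET /tasks` for the actual schemas. The helper names `build_post` and `send` are hypothetical.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # matches the port used in the Running section


def build_post(path: str, payload: dict) -> request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with request.urlopen(req) as resp:
        return json.load(resp)


# Assumed payload shapes; adjust to the schemas the server reports.
reset_req = build_post("/reset", {"task_id": "task_medium"})
step_req = build_post(
    "/step",
    {
        "issues_found": ["sql_injection", "hardcoded_secret"],
        "severity": "critical",
        "review_comment": "Query is built by string concatenation; API key is hardcoded.",
    },
)

# With the server running, the calls would be:
#   observation = send(reset_req)
#   result = send(step_req)
```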
Validation
openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .
Submission Status
- All 5 OpenEnv validation checks passing
- 32/32 unit tests passing
- Docker build and deployment verified
- End-to-end inference and grading pipeline tested
Links
- GitHub: https://github.com/Dolphin-Syndrom/code-review-env
- Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env
License
BSD-3-Clause