code-review-env / README.md
theaniketgiri's picture
Optimize for Phase 2: 5 tasks, severity scoring, iterative refinement, 32 tests
0bbb422
metadata
title: Code Review Environment
emoji: πŸ›‘οΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
  - openenv
  - reinforcement-learning
  - code-review

Code Review OpenEnv Benchmark

πŸš€ Scaler March 2026 Hackathon Submission

Author: Dolphin-Syndrom Type: OpenEnv Benchmark Environment Focus: Evaluating LLM agents on security-aware code review tasks


⚑ TL;DR

A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.

  • 5 tasks with progressive difficulty (extra_easy β†’ easy β†’ medium β†’ hard β†’ expert)
  • 12-tag issue taxonomy covering security, logic, and robustness flaws
  • Multi-dimensional grading: recall + quality bonus + severity bonus βˆ’ precision penalty
  • Iterative refinement: feedback-driven multi-step improvement within episodes
  • 32 unit tests covering graders, environment lifecycle, and task coverage
  • Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
  • Fully OpenEnv compliant

Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.

Suitable for benchmarking agent performance, reward shaping strategies, and detection accuracy without hallucinating false positives.

What Makes This Environment Unique

1. Iterative Refinement Mechanic

Unlike single-shot evaluation environments, this benchmark provides structured feedback after each step that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:

Step 1: Agent submits initial review β†’ receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint β†’ finds missed sql_injection β†’ score improves
Step 3: Final attempt with all accumulated feedback

This models how real code review works β€” reviewers iterate based on discussion and feedback.

2. Multi-Dimensional Reward Function

The grading system evaluates four orthogonal dimensions simultaneously:

Component Value Signal
Recall reward ` correct
Quality bonus +0.05 per issue Keyword-rich explanations
Severity bonus +0.05 Correct risk assessment
Precision penalty βˆ’0.10 per FP Anti-hallucination

This forces agents to balance thoroughness against precision β€” a core tension in real code review.

3. Full 12-Tag Taxonomy Coverage

Every tag in the taxonomy is exercised across the 5 tasks:

Category Tags Task Coverage
Logic errors null_pointer, missing_return, index_out_of_bounds extra_easy, easy
Security sql_injection, hardcoded_secret, path_traversal medium, expert
Robustness race_condition, timing_attack, improper_error_handling hard
Input handling type_error, integer_overflow, missing_input_validation expert

Architecture

graph TB
    Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
    Agent -->|POST /step| Server
    Server --> Env[CodeReviewEnvironment]
    Env --> Tasks[Task Registry - 5 tasks]
    Env --> Grader[Deterministic Grader]
    Grader -->|recall + quality + severity βˆ’ penalty| Score[Score 0.0-1.0]
    Score -->|observation + reward + feedback| Agent
    Server -->|GET /health| Health[Health Check]
    Server -->|POST /grader| Grader
    Server -->|POST /baseline| Baseline[Rule-Based Baseline]
    Server -->|Gradio UI| Dashboard[Analytics Dashboard]

    style Agent fill:#58a6ff,stroke:#333
    style Server fill:#3fb950,stroke:#333
    style Grader fill:#f0883e,stroke:#333
    style Dashboard fill:#bc8cff,stroke:#333

Environment Specification

Objective

For each episode, the agent sees a Python code snippet containing planted issues and must:

  1. Identify issues using tags from a 12-item ISSUE_TAXONOMY
  2. Assess overall severity (low, medium, high, critical)
  3. Articulate findings in a human-readable review_comment
  4. Iteratively refine based on environment feedback across up to 3 steps

Observation Space

Field Type Description
task_id string Current task identifier
file_name string File under review
task_description string Review instructions
code_snippet string Python code with planted issues
feedback string Previous step feedback with refinement hints
step_number integer Current step (0 after reset)
available_issue_tags array Allowed taxonomy tags

Action Space

Field Type Description
issues_found list[str] Tags from ISSUE_TAXONOMY
severity enum low / medium / high / critical
review_comment string Explanation of identified issues

Episode Flow

  1. reset(task_id) loads a task and returns the initial observation
  2. Agent receives code snippet and available tags
  3. Agent submits review via step(action)
  4. Environment returns observation with score, feedback, and refinement hints
  5. Agent can use feedback to improve on subsequent steps
  6. Episode ends when score β‰₯ 0.95 or step limit (3) reached

Tasks

Task Difficulty Planted Issues File
task_extra_easy Extra Easy index_out_of_bounds data_utils.py
task_easy Easy null_pointer, missing_return user_service.py
task_medium Medium sql_injection, hardcoded_secret auth.py
task_hard Hard race_condition, improper_error_handling, timing_attack payments.py
task_expert Expert path_traversal, integer_overflow, missing_input_validation, type_error file_processor.py

Reward Design

Summary: Correct behavior yields positive reward (~1.0), random strategies are penalized, ensuring meaningful learning signals.

The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.

Core components:

  • Recall reward: fractional points for correctly identified issues
  • Quality bonus: +0.05 per correct issue with a matching keyword in the comment
  • Severity bonus: +0.05 when severity matches expected level for task difficulty
  • Precision penalty: βˆ’0.10 for hallucinated or false-positive issues

Project Structure

.
β”œβ”€β”€ __init__.py              # Package exports
β”œβ”€β”€ client.py                # WebSocket client for agent interaction
β”œβ”€β”€ models.py                # Typed Pydantic models (Action, Observation, State)
β”œβ”€β”€ inference.py             # Baseline inference script with LLM + rule fallback
β”œβ”€β”€ openenv.yaml             # OpenEnv specification
β”œβ”€β”€ pyproject.toml           # Project config with pytest setup
β”œβ”€β”€ requirements.txt         # Pip dependencies
β”œβ”€β”€ Dockerfile               # Production container with health check
β”œβ”€β”€ conftest.py              # Pytest root configuration
β”œβ”€β”€ README.md
β”œβ”€β”€ scripts/
β”‚   └── validate-submission.sh
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ app.py               # FastAPI + Gradio dashboard
β”‚   β”œβ”€β”€ code_review_env_environment.py  # Environment with iterative refinement
β”‚   β”œβ”€β”€ graders.py            # Multi-dimensional deterministic grader
β”‚   β”œβ”€β”€ tasks.py              # 5 task definitions with planted issues
β”‚   β”œβ”€β”€ requirements.txt
β”‚   └── Dockerfile
└── tests/
    β”œβ”€β”€ conftest.py
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ test_graders.py       # 19 grader tests
    └── test_environment.py   # 13 environment lifecycle tests

Setup

uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt

Running

Start the server

uv run uvicorn server.app:app --host 0.0.0.0 --port 8000

Run tests

uv run pytest tests/ -v

Run baseline inference

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py

Docker

docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv

πŸ”Œ API Endpoints

Method Endpoint Description
GET /health Health check
GET /tasks List all tasks with schemas
POST /reset Reset environment for a task
POST /step Submit a review action
GET /state Get current episode state
POST /grader Score a review against a task
POST /baseline Run rule-based baseline

Validation

openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .

🏁 Submission Status

  • All 5 OpenEnv validation checks passing
  • 32/32 unit tests passing
  • Docker build and deployment verified
  • End-to-end inference and grading pipeline tested

πŸ”— Links

License

BSD-3-Clause