theaniketgiri

Optimize for Phase 2: 5 tasks, severity scoring, iterative refinement, 32 tests

0bbb422 about 2 months ago

metadata

title: Code Review Environment
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
  - openenv
  - reinforcement-learning
  - code-review

Code Review OpenEnv Benchmark

🚀 Scaler March 2026 Hackathon Submission

Author: Dolphin-Syndrom
Type: OpenEnv Benchmark Environment
Focus: Evaluating LLM agents on security-aware code review tasks

⚡ TL;DR

A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.

5 tasks with progressive difficulty (extra_easy → easy → medium → hard → expert)
12-tag issue taxonomy covering security, logic, and robustness flaws
Multi-dimensional grading: recall + quality bonus + severity bonus − precision penalty
Iterative refinement: feedback-driven multi-step improvement within episodes
32 unit tests covering graders, environment lifecycle, and task coverage
Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
Fully OpenEnv compliant

Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.

Suitable for benchmarking agent performance, comparing reward-shaping strategies, and measuring detection accuracy while penalizing hallucinated false positives.

What Makes This Environment Unique

1. Iterative Refinement Mechanic

Unlike single-shot evaluation environments, this benchmark provides structured feedback after each step that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:

Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
Step 3: Final attempt with all accumulated feedback

This models how real code review works — reviewers iterate based on discussion and feedback.
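
The hint mechanism described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `TAG_CATEGORY` mapping, the `build_hints` function, and the hint wording are assumptions modelled on the example hint shown in Step 1.

```python
# Hypothetical mapping from taxonomy tag to a coarse hint category.
# Grouping follows the taxonomy table in this README; the phrasing is assumed.
TAG_CATEGORY = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness problem",
    "timing_attack": "robustness problem",
    "improper_error_handling": "robustness problem",
    "type_error": "input handling problem",
    "integer_overflow": "input handling problem",
    "missing_input_validation": "input handling problem",
}


def build_hints(expected: set[str], found: set[str]) -> list[str]:
    """Return category-level hints for missed issues without naming the tags."""
    missed = expected - found
    # Deduplicate categories so two missed security tags yield one hint.
    categories = sorted({TAG_CATEGORY[tag] for tag in missed})
    return [f"Hint: look for {category}" for category in categories]


hints = build_hints(
    expected={"sql_injection", "hardcoded_secret"},
    found={"hardcoded_secret"},
)
print(hints)  # ['Hint: look for security vulnerability']
```

Reporting only the category keeps the exact tag hidden, so the agent still has to locate the flaw in the snippet on its next step.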

2. Multi-Dimensional Reward Function

The grading system combines four components into a single score:

Component	Value	Signal
Recall reward	fraction of planted issues found	Correct issue identification
Quality bonus	+0.05 per issue	Keyword-rich explanations
Severity bonus	+0.05	Correct risk assessment
Precision penalty	−0.10 per FP	Anti-hallucination

This forces agents to balance thoroughness against precision — a core tension in real code review.

3. Full 12-Tag Taxonomy Coverage

Every tag in the taxonomy is exercised across the 5 tasks:

Category	Tags	Task Coverage
Logic errors	`null_pointer`, `missing_return`, `index_out_of_bounds`	extra_easy, easy
Security	`sql_injection`, `hardcoded_secret`, `path_traversal`	medium, expert
Robustness	`race_condition`, `timing_attack`, `improper_error_handling`	hard
Input handling	`type_error`, `integer_overflow`, `missing_input_validation`	expert

Architecture

graph TB
    Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
    Agent -->|POST /step| Server
    Server --> Env[CodeReviewEnvironment]
    Env --> Tasks[Task Registry - 5 tasks]
    Env --> Grader[Deterministic Grader]
    Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
    Score -->|observation + reward + feedback| Agent
    Server -->|GET /health| Health[Health Check]
    Server -->|POST /grader| Grader
    Server -->|POST /baseline| Baseline[Rule-Based Baseline]
    Server -->|Gradio UI| Dashboard[Analytics Dashboard]

    style Agent fill:#58a6ff,stroke:#333
    style Server fill:#3fb950,stroke:#333
    style Grader fill:#f0883e,stroke:#333
    style Dashboard fill:#bc8cff,stroke:#333

Environment Specification

Objective

For each episode, the agent sees a Python code snippet containing planted issues and must:

Identify issues using tags from a 12-item ISSUE_TAXONOMY
Assess overall severity (low, medium, high, critical)
Articulate findings in a human-readable review_comment
Iteratively refine based on environment feedback across up to 3 steps

Observation Space

Field	Type	Description
`task_id`	string	Current task identifier
`file_name`	string	File under review
`task_description`	string	Review instructions
`code_snippet`	string	Python code with planted issues
`feedback`	string	Previous step feedback with refinement hints
`step_number`	integer	Current step (0 after reset)
`available_issue_tags`	array	Allowed taxonomy tags

Action Space

Field	Type	Description
`issues_found`	list[str]	Tags from ISSUE_TAXONOMY
`severity`	enum	`low` / `medium` / `high` / `critical`
`review_comment`	string	Explanation of identified issues
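
As an illustration of the action schema above, here is a dependency-free sketch using standard-library dataclasses. The repository's `models.py` is described as using Pydantic, so the class names `Severity` and `ReviewAction` and the `invalid_tags` and `to_payload` helpers are hypothetical stand-ins, not the project's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    """The four severity levels listed in the action space."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class ReviewAction:
    """Hypothetical stand-in for the review action model."""
    issues_found: list[str]
    severity: Severity
    review_comment: str

    def invalid_tags(self, available_issue_tags: list[str]) -> list[str]:
        """Return submitted tags that are not in the allowed taxonomy."""
        allowed = set(available_issue_tags)
        return [tag for tag in self.issues_found if tag not in allowed]

    def to_payload(self) -> dict:
        """Serialize to a plain dict suitable for a JSON request body."""
        return {
            "issues_found": list(self.issues_found),
            "severity": self.severity.value,
            "review_comment": self.review_comment,
        }


action = ReviewAction(
    issues_found=["sql_injection", "buffer_overflow"],
    severity=Severity("high"),
    review_comment="Query is built by string concatenation.",
)
# "buffer_overflow" is not one of the tags offered in the observation.
print(action.invalid_tags(["sql_injection", "hardcoded_secret"]))
```

Checking tags against `available_issue_tags` before submitting helps an agent avoid precision penalties for tags outside the taxonomy.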

Episode Flow

reset(task_id) loads a task and returns the initial observation
Agent receives code snippet and available tags
Agent submits review via step(action)
Environment returns observation with score, feedback, and refinement hints
Agent can use feedback to improve on subsequent steps
Episode ends when the score reaches 0.95 or higher, or when the 3-step limit is reached
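
The termination rule in the flow above can be sketched with a toy stand-in environment. `ToyEnv`, `run_episode`, and `widening_policy` are hypothetical names, and the toy score is plain recall rather than the benchmark's full reward; only the 0.95 threshold and the 3-step limit come from this README.

```python
MAX_STEPS = 3             # step limit stated in this README
SUCCESS_THRESHOLD = 0.95  # early-termination score stated in this README


class ToyEnv:
    """Stand-in environment: score is plain recall over the planted tags."""

    def __init__(self, planted):
        self.planted = set(planted)
        self.step_number = 0

    def reset(self):
        self.step_number = 0
        return {"step_number": 0, "feedback": ""}

    def step(self, issues_found):
        self.step_number += 1
        found = self.planted & set(issues_found)
        score = len(found) / len(self.planted)
        # The episode ends on success or when the step budget is exhausted.
        done = score >= SUCCESS_THRESHOLD or self.step_number >= MAX_STEPS
        missed = len(self.planted) - len(found)
        feedback = f"{missed} issue(s) still missed" if missed else ""
        return {
            "step_number": self.step_number,
            "feedback": feedback,
            "score": score,
            "done": done,
        }


def run_episode(env, policy):
    """Drive one episode and return the score observed at each step."""
    obs = env.reset()
    scores = []
    while True:
        obs = env.step(policy(obs))
        scores.append(obs["score"])
        if obs["done"]:
            return scores


def widening_policy(obs):
    """Toy policy: report one more tag on every step."""
    guesses = ["sql_injection", "hardcoded_secret"]
    return guesses[: obs["step_number"] + 1]


scores = run_episode(ToyEnv(["sql_injection", "hardcoded_secret"]), widening_policy)
print(scores)  # [0.5, 1.0]
```

The episode stops after two steps here because the second review clears the threshold; a policy that never improves would run for all three steps.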

Tasks

Task	Difficulty	Planted Issues	File
`task_extra_easy`	Extra Easy	`index_out_of_bounds`	data_utils.py
`task_easy`	Easy	`null_pointer`, `missing_return`	user_service.py
`task_medium`	Medium	`sql_injection`, `hardcoded_secret`	auth.py
`task_hard`	Hard	`race_condition`, `improper_error_handling`, `timing_attack`	payments.py
`task_expert`	Expert	`path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error`	file_processor.py
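
The table above can be checked against the claim that every taxonomy tag is exercised. The sketch below copies the task data from the table; the `TASKS` dictionary layout is an assumption and may not match how `server/tasks.py` stores its definitions.

```python
# Task data transcribed from the table above; the dict layout is hypothetical.
TASKS = {
    "task_extra_easy": {"file": "data_utils.py", "planted": ["index_out_of_bounds"]},
    "task_easy": {"file": "user_service.py", "planted": ["null_pointer", "missing_return"]},
    "task_medium": {"file": "auth.py", "planted": ["sql_injection", "hardcoded_secret"]},
    "task_hard": {
        "file": "payments.py",
        "planted": ["race_condition", "improper_error_handling", "timing_attack"],
    },
    "task_expert": {
        "file": "file_processor.py",
        "planted": [
            "path_traversal",
            "integer_overflow",
            "missing_input_validation",
            "type_error",
        ],
    },
}

# The 12 tags listed in the taxonomy table of this README.
ISSUE_TAXONOMY = {
    "null_pointer", "missing_return", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "race_condition", "timing_attack", "improper_error_handling",
    "type_error", "integer_overflow", "missing_input_validation",
}

# Union of all planted issues across the five tasks.
covered = set().union(*(task["planted"] for task in TASKS.values()))
uncovered = ISSUE_TAXONOMY - covered
unknown = covered - ISSUE_TAXONOMY

print(f"{len(covered)} tags covered, uncovered={sorted(uncovered)}, unknown={sorted(unknown)}")
```

With the data as listed, each tag appears in exactly one task, so the five tasks partition the taxonomy.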

Reward Design

Summary: A correct review earns a reward close to 1.0, while random or guess-heavy strategies are penalized, so the reward carries a meaningful learning signal.

The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.

Core components:

Recall reward: fractional points for correctly identified issues
Quality bonus: +0.05 per correct issue with a matching keyword in the comment
Severity bonus: +0.05 when severity matches expected level for task difficulty
Precision penalty: −0.10 per hallucinated (false-positive) issue
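
One way these four components could combine is sketched below. The bonus and penalty values come from this README; everything else is an assumption, including the function name `grade`, the `keywords` table, using recall as the base term with weight 1.0, and clamping the result to the 0.0–1.0 range. The actual logic in `server/graders.py` may differ.

```python
def grade(expected, found, comment, severity, expected_severity, keywords):
    """Illustrative score: recall + quality bonus + severity bonus - penalty."""
    expected, found = set(expected), set(found)
    correct = expected & found
    false_positives = found - expected

    # Recall: fraction of planted issues that were identified.
    recall = len(correct) / len(expected) if expected else 0.0

    # Quality: +0.05 per correct issue whose keyword appears in the comment.
    text = comment.lower()
    explained = sum(
        1 for tag in correct if any(word in text for word in keywords.get(tag, ()))
    )
    quality_bonus = 0.05 * explained

    # Severity: +0.05 when the assessed level matches the expected level.
    severity_bonus = 0.05 if severity == expected_severity else 0.0

    # Precision: -0.10 per false-positive tag.
    penalty = 0.10 * len(false_positives)

    raw = recall + quality_bonus + severity_bonus - penalty
    # Clamping to [0, 1] is an assumption based on the stated score range.
    return round(max(0.0, min(1.0, raw)), 4)


# Hypothetical keyword table for two tags.
KEYWORDS = {
    "sql_injection": ("sql", "injection"),
    "hardcoded_secret": ("secret", "hardcoded", "credential"),
}

score = grade(
    expected=["sql_injection", "hardcoded_secret"],
    found=["sql_injection", "race_condition"],
    comment="The query is built by string concatenation, allowing SQL injection.",
    severity="high",
    expected_severity="high",
    keywords=KEYWORDS,
)
print(score)  # 0.5 + 0.05 + 0.05 - 0.10 = 0.5
```

In this example the single false positive cancels both bonuses, which shows how guessing extra tags erodes the gain from a partially correct review.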

Project Structure

.
├── __init__.py              # Package exports
├── client.py                # WebSocket client for agent interaction
├── models.py                # Typed Pydantic models (Action, Observation, State)
├── inference.py             # Baseline inference script with LLM + rule fallback
├── openenv.yaml             # OpenEnv specification
├── pyproject.toml           # Project config with pytest setup
├── requirements.txt         # Pip dependencies
├── Dockerfile               # Production container with health check
├── conftest.py              # Pytest root configuration
├── README.md
├── scripts/
│   └── validate-submission.sh
├── server/
│   ├── __init__.py
│   ├── app.py               # FastAPI + Gradio dashboard
│   ├── code_review_env_environment.py  # Environment with iterative refinement
│   ├── graders.py            # Multi-dimensional deterministic grader
│   ├── tasks.py              # 5 task definitions with planted issues
│   ├── requirements.txt
│   └── Dockerfile
└── tests/
    ├── conftest.py
    ├── __init__.py
    ├── test_graders.py       # 19 grader tests
    └── test_environment.py   # 13 environment lifecycle tests

Setup

uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt

Running

Start the server

uv run uvicorn server.app:app --host 0.0.0.0 --port 8000

Run tests

uv run pytest tests/ -v

Run baseline inference

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py

Docker

docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv

🔌 API Endpoints

Method	Endpoint	Description
`GET`	`/health`	Health check
`GET`	`/tasks`	List all tasks with schemas
`POST`	`/reset`	Reset environment for a task
`POST`	`/step`	Submit a review action
`GET`	`/state`	Get current episode state
`POST`	`/grader`	Score a review against a task
`POST`	`/baseline`	Run rule-based baseline
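
A minimal standard-library client for these endpoints might look like the sketch below. The endpoint paths and port come from this README, but the request bodies are assumptions: the `task_id` field for `/reset` and the flat action object for `/step` are guesses at the schema, so check `GET /tasks` or the server code for the real shapes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port taken from the run instructions


def build_request(method, path, payload=None):
    """Create a JSON request object without sending it."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(method, path, payload=None):
    """Send the request and decode the JSON response."""
    request = build_request(method, path, payload)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Run one reset/step round trip; requires a running server."""
    print(call("GET", "/health"))
    # Assumed body shape: the task is selected by a "task_id" field.
    print(call("POST", "/reset", {"task_id": "task_medium"}))
    # Assumed body shape: the action fields are sent as a flat JSON object.
    print(call("POST", "/step", {
        "issues_found": ["sql_injection", "hardcoded_secret"],
        "severity": "high",
        "review_comment": "SQL built by concatenation; API key is hardcoded.",
    }))
```

Call `demo()` after starting the server with the command in the Running section.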

Validation

openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .

🏁 Submission Status

All 5 OpenEnv validation checks passing
32/32 unit tests passing
Docker build and deployment verified
End-to-end inference and grading pipeline tested

🔗 Links

GitHub: https://github.com/Dolphin-Syndrom/code-review-env
Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env

License

BSD-3-Clause