---
title: Code Review Environment
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
- openenv
- reinforcement-learning
- code-review
---
# Code Review OpenEnv Benchmark
## 🚀 Scaler March 2026 Hackathon Submission
**Author:** Dolphin-Syndrom
**Type:** OpenEnv Benchmark Environment
**Focus:** Evaluating LLM agents on security-aware code review tasks
---
## ⚡ TL;DR
A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.
- **5 tasks** with progressive difficulty (extra_easy → easy → medium → hard → expert)
- **12-tag issue taxonomy** covering security, logic, and robustness flaws
- **Multi-dimensional grading**: recall + quality bonus + severity bonus − precision penalty
- **Iterative refinement**: feedback-driven multi-step improvement within episodes
- **32 unit tests** covering graders, environment lifecycle, and task coverage
- Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
- Fully OpenEnv compliant
---
> Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.
>
> Suitable for benchmarking agent performance, reward-shaping strategies, and detection accuracy, including how well agents avoid hallucinated false positives.
## What Makes This Environment Unique
### 1. Iterative Refinement Mechanic
Unlike single-shot evaluation environments, this benchmark provides **structured feedback after each step** that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:
```
Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
Step 3: Final attempt with all accumulated feedback
```
This models how real code review works: reviewers iterate based on discussion and feedback.
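A minimal sketch of how such category-level hints could be produced is shown below. The category wording and the `build_hints` helper are assumptions made for illustration; the tag-to-category grouping follows the taxonomy table in this README, and the repository's actual feedback logic may differ.

```python
# Hypothetical sketch: hints name the category of a missed issue, never the tag.
TAG_CATEGORY = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness problem",
    "timing_attack": "robustness problem",
    "improper_error_handling": "robustness problem",
    "type_error": "input handling problem",
    "integer_overflow": "input handling problem",
    "missing_input_validation": "input handling problem",
}


def build_hints(planted, submitted):
    """Return one hint per category that still contains a missed issue."""
    missed = set(planted) - set(submitted)
    categories = sorted({TAG_CATEGORY[tag] for tag in missed})
    return [f"Hint: look for {category}" for category in categories]


hints = build_hints(
    planted=["sql_injection", "hardcoded_secret"],
    submitted=["hardcoded_secret"],
)
print(hints)  # ['Hint: look for security vulnerability']
```

Grouping by category keeps the hint useful without turning the second step into a lookup of the answer.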
### 2. Multi-Dimensional Reward Function
The grading system combines four components into a single score:
| Component | Value | Signal |
|---|---|---|
| **Recall reward** | `|correct| / |planted|` | Comprehensive detection |
| **Quality bonus** | +0.05 per issue | Keyword-rich explanations |
| **Severity bonus** | +0.05 | Correct risk assessment |
| **Precision penalty** | −0.10 per FP | Anti-hallucination |
This forces agents to balance thoroughness against precision, a core tension in real code review.
### 3. Full 12-Tag Taxonomy Coverage
Every tag in the taxonomy is exercised across the 5 tasks:
| Category | Tags | Task Coverage |
|---|---|---|
| Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy |
| Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert |
| Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard |
| Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert |
## Architecture
```mermaid
graph TB
Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
Agent -->|POST /step| Server
Server --> Env[CodeReviewEnvironment]
Env --> Tasks[Task Registry - 5 tasks]
Env --> Grader[Deterministic Grader]
Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
Score -->|observation + reward + feedback| Agent
Server -->|GET /health| Health[Health Check]
Server -->|POST /grader| Grader
Server -->|POST /baseline| Baseline[Rule-Based Baseline]
Server -->|Gradio UI| Dashboard[Analytics Dashboard]
style Agent fill:#58a6ff,stroke:#333
style Server fill:#3fb950,stroke:#333
style Grader fill:#f0883e,stroke:#333
style Dashboard fill:#bc8cff,stroke:#333
```
## Environment Specification
### Objective
For each episode, the agent sees a Python code snippet containing planted issues and must:
1. Identify issues using tags from a 12-item `ISSUE_TAXONOMY`
2. Assess overall severity (`low`, `medium`, `high`, `critical`)
3. Articulate findings in a human-readable `review_comment`
4. Iteratively refine based on environment feedback across up to 3 steps
### Observation Space
| Field | Type | Description |
|---|---|---|
| `task_id` | string | Current task identifier |
| `file_name` | string | File under review |
| `task_description` | string | Review instructions |
| `code_snippet` | string | Python code with planted issues |
| `feedback` | string | Previous step feedback with refinement hints |
| `step_number` | integer | Current step (0 after reset) |
| `available_issue_tags` | array | Allowed taxonomy tags |
### Action Space
| Field | Type | Description |
|---|---|---|
| `issues_found` | list[str] | Tags from ISSUE_TAXONOMY |
| `severity` | enum | `low` / `medium` / `high` / `critical` |
| `review_comment` | string | Explanation of identified issues |
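The action fields above can be sketched as a small typed container. The repository defines its models with Pydantic in `models.py`; the version below uses only the standard library so it runs without dependencies. The class name `ReviewAction` and the `validate` method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class ReviewAction:
    """Illustrative stand-in for the action model (name is an assumption)."""

    issues_found: List[str] = field(default_factory=list)
    severity: str = "low"
    review_comment: str = ""

    def validate(self, available_tags):
        """Return a list of problems; an empty list means the action is well-formed."""
        problems = []
        unknown = [tag for tag in self.issues_found if tag not in available_tags]
        if unknown:
            problems.append(f"unknown tags: {unknown}")
        if self.severity not in ALLOWED_SEVERITIES:
            problems.append(f"invalid severity: {self.severity!r}")
        return problems


action = ReviewAction(
    issues_found=["sql_injection"],
    severity="high",
    review_comment="User input is concatenated into the SQL query.",
)
print(action.validate(available_tags=["sql_injection", "hardcoded_secret"]))  # []
```

Checking tags against `available_issue_tags` from the observation before submitting avoids spending a step on a malformed action.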
### Episode Flow
1. `reset(task_id)` loads a task and returns the initial observation
2. Agent receives code snippet and available tags
3. Agent submits review via `step(action)`
4. Environment returns observation with score, feedback, and refinement hints
5. Agent can use feedback to improve on subsequent steps
6. The episode ends when the score is ≥ 0.95 or the step limit (3) is reached
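The loop and its termination rule can be illustrated with a toy, in-memory stand-in for the environment. `ToyEnv`, `run_episode`, and `growing_policy` are hypothetical names, and the toy scores by recall only; the real environment is reached over HTTP and applies the full grader.

```python
MAX_STEPS = 3             # step limit from the episode flow above
SUCCESS_THRESHOLD = 0.95  # early-stop score from the episode flow above


class ToyEnv:
    """Stand-in environment (assumption): scores by recall only."""

    def __init__(self, planted):
        self.planted = set(planted)
        self.step_number = 0

    def reset(self):
        self.step_number = 0
        return {"step_number": 0, "feedback": "", "score": 0.0, "done": False}

    def step(self, issues_found):
        self.step_number += 1
        score = len(self.planted & set(issues_found)) / len(self.planted)
        solved = score >= SUCCESS_THRESHOLD
        return {
            "step_number": self.step_number,
            "feedback": "" if solved else "Hint: some issues were missed",
            "score": score,
            "done": solved or self.step_number >= MAX_STEPS,
        }


def run_episode(env, policy):
    """Step until the environment reports done; return the score after each step."""
    obs = env.reset()
    scores = []
    while not obs["done"]:
        obs = env.step(policy(obs))
        scores.append(obs["score"])
    return scores


def growing_policy(obs):
    """Toy agent: reports one more tag on every step."""
    guesses = ["sql_injection", "hardcoded_secret"]
    return guesses[: obs["step_number"] + 1]


scores = run_episode(ToyEnv(["sql_injection", "hardcoded_secret"]), growing_policy)
print(scores)  # [0.5, 1.0]
```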
## Tasks
| Task | Difficulty | Planted Issues | File |
|---|---|---|---|
| `task_extra_easy` | Extra Easy | `index_out_of_bounds` | data_utils.py |
| `task_easy` | Easy | `null_pointer`, `missing_return` | user_service.py |
| `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | auth.py |
| `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | payments.py |
| `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | file_processor.py |
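The claim that the five tasks together exercise all 12 tags can be checked mechanically. The sketch below copies the planted issues from the table above into a plain dictionary; the names `TASKS` and `uncovered_tags` are illustrative and not taken from `server/tasks.py`.

```python
ISSUE_TAXONOMY = [
    "null_pointer", "missing_return", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "race_condition", "timing_attack", "improper_error_handling",
    "type_error", "integer_overflow", "missing_input_validation",
]

# Planted issues per task, transcribed from the task table.
TASKS = {
    "task_extra_easy": ["index_out_of_bounds"],
    "task_easy": ["null_pointer", "missing_return"],
    "task_medium": ["sql_injection", "hardcoded_secret"],
    "task_hard": ["race_condition", "improper_error_handling", "timing_attack"],
    "task_expert": [
        "path_traversal", "integer_overflow",
        "missing_input_validation", "type_error",
    ],
}


def uncovered_tags(tasks, taxonomy):
    """Return taxonomy tags that no task plants."""
    covered = {tag for planted in tasks.values() for tag in planted}
    return sorted(set(taxonomy) - covered)


print(uncovered_tags(TASKS, ISSUE_TAXONOMY))  # []
```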
## Reward Design
**Summary:** A fully correct review scores close to 1.0, while random or over-reporting strategies are penalized, so the reward carries a meaningful learning signal.
The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.
Core components:
- **Recall reward**: fractional points for correctly identified issues
- **Quality bonus**: +0.05 per correct issue with a matching keyword in the comment
- **Severity bonus**: +0.05 when severity matches expected level for task difficulty
- **Precision penalty**: −0.10 per hallucinated (false-positive) issue
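One way the four components could combine is sketched below. The `grade` function, the keyword lists, the expected severity value, and the clamping to the 0.0–1.0 range are assumptions based on the descriptions in this README, not the contents of `server/graders.py`.

```python
def grade(planted, submitted, comment, severity, expected_severity, keywords):
    """Combine recall, quality bonus, severity bonus, and precision penalty."""
    planted, submitted = set(planted), set(submitted)
    correct = planted & submitted
    false_positives = submitted - planted

    recall = len(correct) / len(planted) if planted else 0.0
    text = comment.lower()
    # +0.05 for each correct tag whose keywords appear in the comment
    quality = 0.05 * sum(
        1 for tag in correct if any(word in text for word in keywords.get(tag, ()))
    )
    severity_bonus = 0.05 if severity == expected_severity else 0.0
    penalty = 0.10 * len(false_positives)

    # Assumed: the raw sum is clamped into the documented 0.0-1.0 range.
    return max(0.0, min(1.0, recall + quality + severity_bonus - penalty))


# Hypothetical keyword lists for two tags.
KEYWORDS = {
    "sql_injection": ("sql", "injection"),
    "hardcoded_secret": ("secret", "hardcoded"),
}

score = grade(
    planted=["sql_injection", "hardcoded_secret"],
    submitted=["sql_injection", "null_pointer"],  # one hit, one false positive
    comment="The query is vulnerable to SQL injection.",
    severity="medium",
    expected_severity="high",  # assumed expected level for this example
    keywords=KEYWORDS,
)
print(round(score, 2))  # 0.45 = 0.5 recall + 0.05 quality + 0 severity - 0.10 penalty
```

Under these assumptions a complete, well-explained review exceeds 1.0 before clamping, so it scores exactly 1.0.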
## Project Structure
```text
.
├── __init__.py # Package exports
├── client.py # WebSocket client for agent interaction
├── models.py # Typed Pydantic models (Action, Observation, State)
├── inference.py # Baseline inference script with LLM + rule fallback
├── openenv.yaml # OpenEnv specification
├── pyproject.toml # Project config with pytest setup
├── requirements.txt # Pip dependencies
├── Dockerfile # Production container with health check
├── conftest.py # Pytest root configuration
├── README.md
├── scripts/
│   └── validate-submission.sh
├── server/
│   ├── __init__.py
│   ├── app.py # FastAPI + Gradio dashboard
│   ├── code_review_env_environment.py # Environment with iterative refinement
│   ├── graders.py # Multi-dimensional deterministic grader
│   ├── tasks.py # 5 task definitions with planted issues
│   ├── requirements.txt
│   └── Dockerfile
└── tests/
    ├── conftest.py
    ├── __init__.py
    ├── test_graders.py # 19 grader tests
    └── test_environment.py # 13 environment lifecycle tests
```
## Setup
```bash
uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt
```
## Running
### Start the server
```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Run tests
```bash
uv run pytest tests/ -v
```
### Run baseline inference
```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py
```
## Docker
```bash
docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv
```
## 🔌 API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List all tasks with schemas |
| `POST` | `/reset` | Reset environment for a task |
| `POST` | `/step` | Submit a review action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Score a review against a task |
| `POST` | `/baseline` | Run rule-based baseline |
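A client for these endpoints could look roughly like the sketch below, which uses only the standard library. The request bodies are assumptions: the README does not specify the JSON envelope, so check `GET /tasks` or the server's generated API docs for the real schemas. `build_request` and `call` are hypothetical helpers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port used by the server commands in this README


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a JSON request: POST when a payload is given, GET otherwise."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST" if payload is not None else "GET",
    )


def call(path, payload=None):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Assumed request bodies; the actual envelope may differ.
reset_body = {"task_id": "task_medium"}
step_body = {
    "issues_found": ["sql_injection", "hardcoded_secret"],
    "severity": "high",
    "review_comment": "SQL built by string concatenation; API key hardcoded.",
}

# With the server running:
#   call("/health")
#   call("/reset", reset_body)
#   call("/step", step_body)
```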
## Validation
```bash
openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .
```
## 🏁 Submission Status
- All 5 OpenEnv validation checks passing
- 32/32 unit tests passing
- Docker build and deployment verified
- End-to-end inference and grading pipeline tested
---
## 🔗 Links
- GitHub: https://github.com/Dolphin-Syndrom/code-review-env
- Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env
## License
BSD-3-Clause