---
title: Code Review Environment
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
  - openenv
  - reinforcement-learning
  - code-review
---

# Code Review OpenEnv Benchmark

## 🚀 Scaler March 2026 Hackathon Submission

**Author:** Dolphin-Syndrom
**Type:** OpenEnv Benchmark Environment
**Focus:** Evaluating LLM agents on security-aware code review tasks

---

## ⚡ TL;DR

A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.

- **5 tasks** with progressive difficulty (extra_easy → easy → medium → hard → expert)
- **12-tag issue taxonomy** covering security, logic, and robustness flaws
- **Multi-dimensional grading**: recall + quality bonus + severity bonus − precision penalty
- **Iterative refinement**: feedback-driven multi-step improvement within episodes
- **32 unit tests** covering graders, environment lifecycle, and task coverage
- Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
- Fully OpenEnv compliant

---

> Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.
>
> Suitable for benchmarking agent performance, comparing reward-shaping strategies, and measuring how accurately an agent detects issues without hallucinating false positives.

## What Makes This Environment Unique

### 1. Iterative Refinement Mechanic

Unlike single-shot evaluation environments, this benchmark provides **structured feedback after each step** that tells agents which categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:

```
Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
Step 3: Final attempt with all accumulated feedback
```

This models how real code review works — reviewers iterate based on discussion and feedback.
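
A minimal sketch of how category-level hints like the one above could be derived, assuming hints are grouped by the categories in this README's taxonomy table. The `TAG_CATEGORY` mapping, the `build_hints` helper, and the hint wording for categories other than "security vulnerability" are hypothetical and may differ from the actual environment code.

```python
# Hypothetical tag-to-category mapping; the grouping mirrors the taxonomy
# table in this README, but the exact hint wording is an assumption.
TAG_CATEGORY = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness problem",
    "timing_attack": "robustness problem",
    "improper_error_handling": "robustness problem",
    "type_error": "input handling problem",
    "integer_overflow": "input handling problem",
    "missing_input_validation": "input handling problem",
}


def build_hints(planted: set[str], found: set[str]) -> list[str]:
    """Return one hint per missed category, never naming the missed tag."""
    missed_categories = sorted({TAG_CATEGORY[tag] for tag in planted - found})
    return [f"Hint: look for {category}" for category in missed_categories]
```

Hinting at the category rather than the tag keeps the refinement step non-trivial: the agent still has to locate the specific flaw in the snippet.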

### 2. Multi-Dimensional Reward Function

The grading system evaluates four orthogonal dimensions simultaneously:

| Component | Value | Signal |
|---|---|---|
| **Recall reward** | `\|correct\| / \|planted\|` | Comprehensive detection |
| **Quality bonus** | +0.05 per issue | Keyword-rich explanations |
| **Severity bonus** | +0.05 | Correct risk assessment |
| **Precision penalty** | −0.10 per FP | Anti-hallucination |

This forces agents to balance thoroughness against precision — a core tension in real code review.

### 3. Full 12-Tag Taxonomy Coverage

Every tag in the taxonomy is exercised across the 5 tasks:

| Category | Tags | Task Coverage |
|---|---|---|
| Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy |
| Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert |
| Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard |
| Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert |

## Architecture

```mermaid
graph TB
    Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
    Agent -->|POST /step| Server
    Server --> Env[CodeReviewEnvironment]
    Env --> Tasks[Task Registry - 5 tasks]
    Env --> Grader[Deterministic Grader]
    Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
    Score -->|observation + reward + feedback| Agent
    Server -->|GET /health| Health[Health Check]
    Server -->|POST /grader| Grader
    Server -->|POST /baseline| Baseline[Rule-Based Baseline]
    Server -->|Gradio UI| Dashboard[Analytics Dashboard]

    style Agent fill:#58a6ff,stroke:#333
    style Server fill:#3fb950,stroke:#333
    style Grader fill:#f0883e,stroke:#333
    style Dashboard fill:#bc8cff,stroke:#333
```

## Environment Specification

### Objective

For each episode, the agent sees a Python code snippet containing planted issues and must:

1. Identify issues using tags from a 12-item `ISSUE_TAXONOMY`
2. Assess overall severity (`low`, `medium`, `high`, `critical`)
3. Articulate findings in a human-readable `review_comment`
4. Iteratively refine based on environment feedback across up to 3 steps

### Observation Space

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Current task identifier |
| `file_name` | string | File under review |
| `task_description` | string | Review instructions |
| `code_snippet` | string | Python code with planted issues |
| `feedback` | string | Previous step feedback with refinement hints |
| `step_number` | integer | Current step (0 after reset) |
| `available_issue_tags` | array | Allowed taxonomy tags |

### Action Space

| Field | Type | Description |
|---|---|---|
| `issues_found` | list[str] | Tags from ISSUE_TAXONOMY |
| `severity` | enum | `low` / `medium` / `high` / `critical` |
| `review_comment` | string | Explanation of identified issues |
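
Before sending an action, an agent can check it against the fields above. The helper below is a sketch under assumptions: `validate_action` is a hypothetical name, and the real environment presumably validates actions through the Pydantic models in `models.py` rather than with code like this.

```python
ALLOWED_SEVERITIES = ("low", "medium", "high", "critical")


def validate_action(action: dict, available_issue_tags: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the action is well-formed."""
    problems = []

    issues = action.get("issues_found")
    if not isinstance(issues, list) or not all(isinstance(t, str) for t in issues):
        problems.append("issues_found must be a list of strings")
    else:
        # Tags must come from the observation's available_issue_tags field.
        unknown = sorted(set(issues) - set(available_issue_tags))
        if unknown:
            problems.append(f"tags outside the taxonomy: {unknown}")

    if action.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("severity must be one of low, medium, high, critical")

    if not isinstance(action.get("review_comment"), str):
        problems.append("review_comment must be a string")

    return problems
```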

### Episode Flow

1. `reset(task_id)` loads a task and returns the initial observation
2. Agent receives code snippet and available tags
3. Agent submits review via `step(action)`
4. Environment returns observation with score, feedback, and refinement hints
5. Agent can use feedback to improve on subsequent steps
6. Episode ends when the score reaches 0.95 or higher, or when the step limit (3) is reached
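
The flow above can be sketched as a small driver loop. Everything here other than the 0.95 threshold and the 3-step limit is an assumption: `reset`, `step`, and `policy` are hypothetical callables standing in for the real client and agent, and `step` is assumed to return an `(observation, score)` pair.

```python
MAX_STEPS = 3            # step limit stated in the episode flow
SCORE_THRESHOLD = 0.95   # early-stop score stated in the episode flow


def episode_done(score: float, step_number: int) -> bool:
    """Termination rule: good enough score, or out of steps."""
    return score >= SCORE_THRESHOLD or step_number >= MAX_STEPS


def run_episode(reset, step, policy, task_id: str):
    """Run one episode and return (final_score, steps_used).

    reset(task_id) -> observation          (hypothetical signature)
    step(action)   -> (observation, score) (hypothetical signature)
    policy(observation) -> action          (the agent under test)
    """
    observation = reset(task_id)
    score = 0.0
    for step_number in range(1, MAX_STEPS + 1):
        # The observation carries feedback from the previous step, so the
        # policy can refine its review on each pass.
        action = policy(observation)
        observation, score = step(action)
        if episode_done(score, step_number):
            break
    return score, step_number
```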

## Tasks

| Task | Difficulty | Planted Issues | File |
|---|---|---|---|
| `task_extra_easy` | Extra Easy | `index_out_of_bounds` | data_utils.py |
| `task_easy` | Easy | `null_pointer`, `missing_return` | user_service.py |
| `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | auth.py |
| `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | payments.py |
| `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | file_processor.py |
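
The table above can be cross-checked against the claim that all 12 taxonomy tags are exercised. The snippet is an illustrative sketch: the `TASKS` dictionary is transcribed from the table, not imported from `server/tasks.py`, and `planted_tag_counts` is a hypothetical helper.

```python
from collections import Counter

# Transcribed from the Tasks table; not the actual contents of server/tasks.py.
TASKS = {
    "task_extra_easy": ["index_out_of_bounds"],
    "task_easy": ["null_pointer", "missing_return"],
    "task_medium": ["sql_injection", "hardcoded_secret"],
    "task_hard": ["race_condition", "improper_error_handling", "timing_attack"],
    "task_expert": [
        "path_traversal",
        "integer_overflow",
        "missing_input_validation",
        "type_error",
    ],
}


def planted_tag_counts(tasks: dict[str, list[str]]) -> Counter:
    """Count how many tasks plant each tag."""
    return Counter(tag for tags in tasks.values() for tag in tags)


counts = planted_tag_counts(TASKS)
# 12 distinct tags, each planted in exactly one task.
assert len(counts) == 12
assert set(counts.values()) == {1}
```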

## Reward Design

**Summary:** A fully correct review scores close to 1.0, while random or over-reporting strategies are penalized, so the reward remains a meaningful learning signal.

The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.

Core components:

- **Recall reward**: fractional points for correctly identified issues
- **Quality bonus**: +0.05 per correct issue with a matching keyword in the comment
- **Severity bonus**: +0.05 when severity matches expected level for task difficulty
- **Precision penalty**: −0.10 per hallucinated or false-positive issue
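
A minimal sketch of how these four components might combine, assuming they are summed and the result is clamped to the 0.0 to 1.0 range. The clamping, the `score_review` signature, the per-tag `keywords` mapping, and the `expected_severity` argument are assumptions for illustration; the actual logic lives in `server/graders.py` and may differ.

```python
def score_review(planted, found, comment, severity, expected_severity, keywords):
    """Combine recall, quality bonus, severity bonus, and precision penalty.

    keywords maps each tag to words that count as evidence in the comment
    (hypothetical; the real keyword lists are not shown in this README).
    """
    planted_set, found_set = set(planted), set(found)
    correct = planted_set & found_set
    false_positives = found_set - planted_set

    # Recall reward: |correct| / |planted|
    recall = len(correct) / len(planted_set) if planted_set else 0.0

    # Quality bonus: +0.05 per correct issue backed by a keyword in the comment
    comment_lower = comment.lower()
    quality = 0.05 * sum(
        1
        for tag in correct
        if any(kw.lower() in comment_lower for kw in keywords.get(tag, ()))
    )

    # Severity bonus: +0.05 when the severity matches the expected level
    severity_bonus = 0.05 if severity == expected_severity else 0.0

    # Precision penalty: 0.10 per false positive
    penalty = 0.10 * len(false_positives)

    # Assumption: the total is clamped so the score stays within 0.0-1.0
    return max(0.0, min(1.0, recall + quality + severity_bonus - penalty))
```

For example, on `task_medium` an agent that reports `sql_injection` plus a spurious `race_condition`, mentions a matching keyword, and picks the expected severity would get 0.5 recall + 0.05 quality + 0.05 severity − 0.10 penalty, which is roughly 0.50 under these assumptions.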

## Project Structure

```text
.
├── __init__.py              # Package exports
├── client.py                # WebSocket client for agent interaction
├── models.py                # Typed Pydantic models (Action, Observation, State)
├── inference.py             # Baseline inference script with LLM + rule fallback
├── openenv.yaml             # OpenEnv specification
├── pyproject.toml           # Project config with pytest setup
├── requirements.txt         # Pip dependencies
├── Dockerfile               # Production container with health check
├── conftest.py              # Pytest root configuration
├── README.md
├── scripts/
│   └── validate-submission.sh
├── server/
│   ├── __init__.py
│   ├── app.py               # FastAPI + Gradio dashboard
│   ├── code_review_env_environment.py  # Environment with iterative refinement
│   ├── graders.py            # Multi-dimensional deterministic grader
│   ├── tasks.py              # 5 task definitions with planted issues
│   ├── requirements.txt
│   └── Dockerfile
└── tests/
    ├── conftest.py
    ├── __init__.py
    ├── test_graders.py       # 19 grader tests
    └── test_environment.py   # 13 environment lifecycle tests
```

## Setup

```bash
uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt
```

## Running

### Start the server

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Run tests

```bash
uv run pytest tests/ -v
```

### Run baseline inference

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py
```

## Docker

```bash
docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv
```

## 🔌 API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List all tasks with schemas |
| `POST` | `/reset` | Reset environment for a task |
| `POST` | `/step` | Submit a review action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Score a review against a task |
| `POST` | `/baseline` | Run rule-based baseline |
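
The endpoints can be called with the Python standard library alone. The sketch below only shows how a request could be built and sent; the JSON body shapes (`{"task_id": ...}` for `/reset`, and the three action fields for `/step`) are assumptions, so check `GET /tasks` or `openenv.yaml` for the actual schemas. `build_post` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_post(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the endpoints listed above."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed payload shapes; the real schemas may wrap or rename these fields.
reset_request = build_post("http://localhost:8000", "/reset", {"task_id": "task_easy"})
step_request = build_post(
    "http://localhost:8000",
    "/step",
    {
        "issues_found": ["null_pointer"],
        "severity": "medium",
        "review_comment": "user may be None before attribute access",
    },
)
```

With the server from the Running section started locally, `send(reset_request)` followed by `send(step_request)` would exercise one step of an episode.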

## Validation

```bash
openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .
```

## 🏁 Submission Status

- All 5 OpenEnv validation checks passing
- 32/32 unit tests passing
- Docker build and deployment verified
- End-to-end inference and grading pipeline tested

---

## 🔗 Links

- GitHub: https://github.com/Dolphin-Syndrom/code-review-env
- Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env

## License

BSD-3-Clause