---
title: Code Review Environment
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
- openenv
- reinforcement-learning
- code-review
---
# Code Review OpenEnv Benchmark
## Scaler March 2026 Hackathon Submission
**Author:** Dolphin-Syndrom
**Type:** OpenEnv Benchmark Environment
**Focus:** Evaluating LLM agents on security-aware code review tasks
---
## ⚡ TL;DR
A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.
- **5 tasks** with progressive difficulty (extra_easy → easy → medium → hard → expert)
- **12-tag issue taxonomy** covering security, logic, and robustness flaws
- **Multi-dimensional grading**: recall + quality bonus + severity bonus − precision penalty
- **Iterative refinement**: feedback-driven multi-step improvement within episodes
- **32 unit tests** covering graders, environment lifecycle, and task coverage
- Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
- Fully OpenEnv compliant
---
> Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.
>
> Suitable for benchmarking agent performance, reward shaping strategies, and detection accuracy without hallucinating false positives.
## What Makes This Environment Unique
### 1. Iterative Refinement Mechanic
Unlike single-shot evaluation environments, this benchmark provides **structured feedback after each step** that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:
```
Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
Step 3: Final attempt with all accumulated feedback
```
This models how real code review works: reviewers iterate based on discussion and feedback.
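A minimal sketch of how such category-level hints could be produced is shown below. The tag-to-category mapping follows the taxonomy table in this README; the function name `refinement_hints` and the exact hint wording are hypothetical and are not taken from `server/code_review_env_environment.py`.

```python
# Hypothetical sketch: category-level hints that never reveal the exact missed tag.
TAG_CATEGORY = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness flaw",
    "timing_attack": "robustness flaw",
    "improper_error_handling": "robustness flaw",
    "type_error": "input handling flaw",
    "integer_overflow": "input handling flaw",
    "missing_input_validation": "input handling flaw",
}


def refinement_hints(planted, found):
    """Return one hint per category that still contains an undetected planted issue."""
    missed = set(planted) - set(found)
    categories = sorted({TAG_CATEGORY[tag] for tag in missed})
    return [f"Hint: look for {category}" for category in categories]


print(refinement_hints(["sql_injection", "hardcoded_secret"], ["hardcoded_secret"]))
# ['Hint: look for security vulnerability']
```

Grouping by category keeps the hint useful while still requiring the agent to locate the specific flaw itself.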
### 2. Multi-Dimensional Reward Function
The grading system evaluates four orthogonal dimensions simultaneously:
| Component | Value | Signal |
|---|---|---|
| **Recall reward** | `\|correct\| / \|planted\|` | Comprehensive detection |
| **Quality bonus** | +0.05 per correct issue | Keyword-rich explanations |
| **Severity bonus** | +0.05 | Correct risk assessment |
| **Precision penalty** | −0.10 per FP | Anti-hallucination |
This forces agents to balance thoroughness against precision, a core tension in real code review.
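The sketch below shows one way the four components could combine into a single score. It is an illustration, not the code in `server/graders.py`: the additive combination, the clamp to the 0.0–1.0 range, and the `keywords` mapping used for the quality bonus are all assumptions.

```python
def grade(planted, found, expected_severity, severity, comment, keywords):
    """Sketch of the four-component score; the clamp to [0, 1] is an assumption."""
    planted, found = set(planted), set(found)
    correct = planted & found
    false_positives = found - planted

    recall = len(correct) / len(planted) if planted else 0.0
    text = comment.lower()
    # Quality bonus: +0.05 per correct issue with a matching keyword in the comment.
    quality = 0.05 * sum(
        1 for tag in correct if any(k in text for k in keywords.get(tag, ()))
    )
    severity_bonus = 0.05 if severity == expected_severity else 0.0
    penalty = 0.10 * len(false_positives)

    total = recall + quality + severity_bonus - penalty
    return round(max(0.0, min(1.0, total)), 4)


# Hypothetical keyword lists, for illustration only.
keywords = {"sql_injection": ["sql", "query"], "hardcoded_secret": ["secret", "hardcoded"]}

# One of two issues found, plus one false positive: 0.5 + 0.05 + 0.05 - 0.10
print(grade(["sql_injection", "hardcoded_secret"], ["sql_injection", "null_pointer"],
            "high", "high", "SQL query is built by string concatenation.", keywords))
# 0.5
```

Under these assumptions a false positive cancels the combined bonus for one well-explained issue, so padding the review with extra tags gains nothing.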
### 3. Full 12-Tag Taxonomy Coverage
Every tag in the taxonomy is exercised across the 5 tasks:
| Category | Tags | Task Coverage |
|---|---|---|
| Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy |
| Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert |
| Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard |
| Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert |
## Architecture
```mermaid
graph TB
Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
Agent -->|POST /step| Server
Server --> Env[CodeReviewEnvironment]
Env --> Tasks[Task Registry - 5 tasks]
Env --> Grader[Deterministic Grader]
Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
Score -->|observation + reward + feedback| Agent
Server -->|GET /health| Health[Health Check]
Server -->|POST /grader| Grader
Server -->|POST /baseline| Baseline[Rule-Based Baseline]
Server -->|Gradio UI| Dashboard[Analytics Dashboard]
style Agent fill:#58a6ff,stroke:#333
style Server fill:#3fb950,stroke:#333
style Grader fill:#f0883e,stroke:#333
style Dashboard fill:#bc8cff,stroke:#333
```
## Environment Specification
### Objective
For each episode, the agent sees a Python code snippet containing planted issues and must:
1. Identify issues using tags from a 12-item `ISSUE_TAXONOMY`
2. Assess overall severity (`low`, `medium`, `high`, `critical`)
3. Articulate findings in a human-readable `review_comment`
4. Iteratively refine based on environment feedback across up to 3 steps
### Observation Space
| Field | Type | Description |
|---|---|---|
| `task_id` | string | Current task identifier |
| `file_name` | string | File under review |
| `task_description` | string | Review instructions |
| `code_snippet` | string | Python code with planted issues |
| `feedback` | string | Previous step feedback with refinement hints |
| `step_number` | integer | Current step (0 after reset) |
| `available_issue_tags` | array | Allowed taxonomy tags |
### Action Space
| Field | Type | Description |
|---|---|---|
| `issues_found` | list[str] | Tags from ISSUE_TAXONOMY |
| `severity` | enum | `low` / `medium` / `high` / `critical` |
| `review_comment` | string | Explanation of identified issues |
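As an illustration of the action shape, the sketch below uses a standard-library dataclass. The project itself defines typed Pydantic models in `models.py`; the class name `ReviewAction` and the `validate` helper here are hypothetical stand-ins, and only the three field names come from the table above.

```python
from dataclasses import dataclass, field
from typing import List

SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class ReviewAction:
    """Hypothetical stand-in for the action model; field names follow the table."""
    issues_found: List[str] = field(default_factory=list)
    severity: str = "low"
    review_comment: str = ""

    def validate(self, available_issue_tags):
        """Raise ValueError if the action uses an unknown tag or severity."""
        unknown = sorted(set(self.issues_found) - set(available_issue_tags))
        if unknown:
            raise ValueError(f"unknown issue tags: {unknown}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        return self


action = ReviewAction(
    issues_found=["sql_injection"],
    severity="high",
    review_comment="User input is concatenated into the SQL query.",
)
action.validate(["sql_injection", "hardcoded_secret"])
```

Checking tags against `available_issue_tags` from the observation before submitting avoids spending a step on an action the environment would reject or penalize.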
### Episode Flow
1. `reset(task_id)` loads a task and returns the initial observation
2. Agent receives code snippet and available tags
3. Agent submits review via `step(action)`
4. Environment returns observation with score, feedback, and refinement hints
5. Agent can use feedback to improve on subsequent steps
6. Episode ends when the score is ≥ 0.95 or the step limit (3) is reached
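The flow above can be sketched as a small driver loop. Everything here is illustrative: `StubEnv` is a toy replacement for the real environment, and the dictionary-shaped observation with a `score` key is an assumption about the return value rather than the actual `step` signature.

```python
def run_episode(env, policy, task_id, max_steps=3, target=0.95):
    """Reset, then step until the score target or the step limit is reached."""
    obs = env.reset(task_id)
    best = 0.0
    for _ in range(max_steps):
        action = policy(obs)  # the policy sees the previous step's feedback
        obs = env.step(action)
        best = max(best, obs["score"])
        if obs["score"] >= target:
            break
    return best, obs["step_number"]


class StubEnv:
    """Toy environment with one planted tag, used only to exercise the loop."""

    def reset(self, task_id):
        self.steps = 0
        return {"task_id": task_id, "feedback": "", "step_number": 0, "score": 0.0}

    def step(self, action):
        self.steps += 1
        hit = "index_out_of_bounds" in action["issues_found"]
        return {
            "feedback": "" if hit else "Hint: look for logic error",
            "step_number": self.steps,
            "score": 1.0 if hit else 0.0,
        }


def policy(obs):
    # Submit nothing at first, then react to the category hint.
    tags = ["index_out_of_bounds"] if "logic" in obs["feedback"] else []
    return {"issues_found": tags, "severity": "low", "review_comment": ""}


print(run_episode(StubEnv(), policy, "task_extra_easy"))
# (1.0, 2)
```

The toy policy misses on step 1, uses the hint on step 2, and the loop stops early because the score passes the 0.95 target.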
## Tasks
| Task | Difficulty | Planted Issues | File |
|---|---|---|---|
| `task_extra_easy` | Extra Easy | `index_out_of_bounds` | data_utils.py |
| `task_easy` | Easy | `null_pointer`, `missing_return` | user_service.py |
| `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | auth.py |
| `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | payments.py |
| `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | file_processor.py |
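The table can be restated as data to check two properties: the number of planted issues grows with difficulty, and the five tasks together cover all 12 tags. The `TASKS` dictionary below is transcribed from the table and is only a sketch; the real definitions in `server/tasks.py` may be structured differently.

```python
# Planted issues per task, transcribed from the table above.
TASKS = {
    "task_extra_easy": ["index_out_of_bounds"],
    "task_easy": ["null_pointer", "missing_return"],
    "task_medium": ["sql_injection", "hardcoded_secret"],
    "task_hard": ["race_condition", "improper_error_handling", "timing_attack"],
    "task_expert": [
        "path_traversal",
        "integer_overflow",
        "missing_input_validation",
        "type_error",
    ],
}

covered = {tag for tags in TASKS.values() for tag in tags}
counts = [len(tags) for tags in TASKS.values()]

print(len(covered))  # 12
print(counts)        # [1, 2, 2, 3, 4]
```

Because the 12 planted issues are all distinct, each taxonomy tag appears in exactly one task.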
## Reward Design
**Summary:** A correct review scores close to 1.0, while random tagging is pulled down by the precision penalty, so the reward gives a meaningful learning signal.
The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.
Core components:
- **Recall reward**: fractional points for correctly identified issues
- **Quality bonus**: +0.05 per correct issue with a matching keyword in the comment
- **Severity bonus**: +0.05 when severity matches expected level for task difficulty
- **Precision penalty**: −0.10 per hallucinated or false-positive issue
## Project Structure
```text
.
├── __init__.py                          # Package exports
├── client.py                            # WebSocket client for agent interaction
├── models.py                            # Typed Pydantic models (Action, Observation, State)
├── inference.py                         # Baseline inference script with LLM + rule fallback
├── openenv.yaml                         # OpenEnv specification
├── pyproject.toml                       # Project config with pytest setup
├── requirements.txt                     # Pip dependencies
├── Dockerfile                           # Production container with health check
├── conftest.py                          # Pytest root configuration
├── README.md
├── scripts/
│   └── validate-submission.sh
├── server/
│   ├── __init__.py
│   ├── app.py                           # FastAPI + Gradio dashboard
│   ├── code_review_env_environment.py   # Environment with iterative refinement
│   ├── graders.py                       # Multi-dimensional deterministic grader
│   ├── tasks.py                         # 5 task definitions with planted issues
│   ├── requirements.txt
│   └── Dockerfile
└── tests/
    ├── conftest.py
    ├── __init__.py
    ├── test_graders.py                  # 19 grader tests
    └── test_environment.py              # 13 environment lifecycle tests
```
## Setup
```bash
uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt
```
## Running
### Start the server
```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Run tests
```bash
uv run pytest tests/ -v
```
### Run baseline inference
```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py
```
## Docker
```bash
docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv
```
## API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List all tasks with schemas |
| `POST` | `/reset` | Reset environment for a task |
| `POST` | `/step` | Submit a review action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Score a review against a task |
| `POST` | `/baseline` | Run rule-based baseline |
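A minimal sketch of calling the POST endpoints with the standard library follows. The helper names are hypothetical, and the request body shown for `/reset` is an assumption; `GET /tasks` lists the actual schemas.

```python
import json
import urllib.request


def build_request(base_url, endpoint, payload):
    """Build a JSON POST request for one of the endpoints listed above."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(base_url, endpoint, payload, timeout=30):
    """Send the request and decode the JSON response body."""
    request = build_request(base_url, endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# The payload shape is an assumption; check GET /tasks for the real schema.
reset_request = build_request("http://localhost:8000", "/reset", {"task_id": "task_medium"})
print(reset_request.get_method(), reset_request.full_url)
# POST http://localhost:8000/reset
```

With the server running locally, `post_json("http://localhost:8000", "/reset", {"task_id": "task_medium"})` would send the same request and return the decoded response.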
## Validation
```bash
openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .
```
## Submission Status
- All 5 OpenEnv validation checks passing
- 32/32 unit tests passing
- Docker build and deployment verified
- End-to-end inference and grading pipeline tested
---
## Links
- GitHub: https://github.com/Dolphin-Syndrom/code-review-env
- Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env
## License
BSD-3-Clause