--- title: Code Review Environment emoji: 🛡️ colorFrom: blue colorTo: purple sdk: docker app_port: 8000 pinned: false license: bsd-3-clause short_description: AI agent code review environment benchmark tags: - openenv - reinforcement-learning - code-review --- # Code Review OpenEnv Benchmark ## 🚀 Scaler March 2026 Hackathon Submission **Author:** Dolphin-Syndrom **Type:** OpenEnv Benchmark Environment **Focus:** Evaluating LLM agents on security-aware code review tasks --- ## ⚡ TL;DR A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews. - **5 tasks** with progressive difficulty (extra_easy → easy → medium → hard → expert) - **12-tag issue taxonomy** covering security, logic, and robustness flaws - **Multi-dimensional grading**: recall + quality bonus + severity bonus − precision penalty - **Iterative refinement**: feedback-driven multi-step improvement within episodes - **32 unit tests** covering graders, environment lifecycle, and task coverage - Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces - Fully OpenEnv compliant --- > Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement. > > Suitable for benchmarking agent performance, reward shaping strategies, and detection accuracy without hallucinating false positives. ## What Makes This Environment Unique ### 1. Iterative Refinement Mechanic Unlike single-shot evaluation environments, this benchmark provides **structured feedback after each step** that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop: ``` Step 1: Agent submits initial review → receives "Hint: look for security vulnerability" Step 2: Agent refines review based on hint → finds missed sql_injection → score improves Step 3: Final attempt with all accumulated feedback ``` This models how real code review works — reviewers iterate based on discussion and feedback. ### 2. Multi-Dimensional Reward Function The grading system evaluates four orthogonal dimensions simultaneously: | Component | Value | Signal | |---|---|---| | **Recall reward** | `|correct| / |planted|` | Comprehensive detection | | **Quality bonus** | +0.05 per issue | Keyword-rich explanations | | **Severity bonus** | +0.05 | Correct risk assessment | | **Precision penalty** | −0.10 per FP | Anti-hallucination | This forces agents to balance thoroughness against precision — a core tension in real code review. ### 3. Full 12-Tag Taxonomy Coverage Every tag in the taxonomy is exercised across the 5 tasks: | Category | Tags | Task Coverage | |---|---|---| | Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy | | Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert | | Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard | | Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert | ## Architecture ```mermaid graph TB Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server] Agent -->|POST /step| Server Server --> Env[CodeReviewEnvironment] Env --> Tasks[Task Registry - 5 tasks] Env --> Grader[Deterministic Grader] Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0] Score -->|observation + reward + feedback| Agent Server -->|GET /health| Health[Health Check] Server -->|POST /grader| Grader Server -->|POST /baseline| Baseline[Rule-Based Baseline] Server -->|Gradio UI| Dashboard[Analytics Dashboard] style Agent fill:#58a6ff,stroke:#333 style Server fill:#3fb950,stroke:#333 style Grader fill:#f0883e,stroke:#333 style Dashboard fill:#bc8cff,stroke:#333 ``` ## Environment Specification ### Objective For each episode, the agent sees a Python code snippet containing planted issues and must: 1. Identify issues using tags from a 12-item `ISSUE_TAXONOMY` 2. Assess overall severity (`low`, `medium`, `high`, `critical`) 3. Articulate findings in a human-readable `review_comment` 4. Iteratively refine based on environment feedback across up to 3 steps ### Observation Space | Field | Type | Description | |---|---|---| | `task_id` | string | Current task identifier | | `file_name` | string | File under review | | `task_description` | string | Review instructions | | `code_snippet` | string | Python code with planted issues | | `feedback` | string | Previous step feedback with refinement hints | | `step_number` | integer | Current step (0 after reset) | | `available_issue_tags` | array | Allowed taxonomy tags | ### Action Space | Field | Type | Description | |---|---|---| | `issues_found` | list[str] | Tags from ISSUE_TAXONOMY | | `severity` | enum | `low` / `medium` / `high` / `critical` | | `review_comment` | string | Explanation of identified issues | ### Episode Flow 1. `reset(task_id)` loads a task and returns the initial observation 2. Agent receives code snippet and available tags 3. Agent submits review via `step(action)` 4. Environment returns observation with score, feedback, and refinement hints 5. Agent can use feedback to improve on subsequent steps 6. Episode ends when score ≥ 0.95 or step limit (3) reached ## Tasks | Task | Difficulty | Planted Issues | File | |---|---|---|---| | `task_extra_easy` | Extra Easy | `index_out_of_bounds` | data_utils.py | | `task_easy` | Easy | `null_pointer`, `missing_return` | user_service.py | | `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | auth.py | | `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | payments.py | | `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | file_processor.py | ## Reward Design **Summary:** Correct behavior yields positive reward (~1.0), random strategies are penalized, ensuring meaningful learning signals. The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end. Core components: - **Recall reward**: fractional points for correctly identified issues - **Quality bonus**: +0.05 per correct issue with a matching keyword in the comment - **Severity bonus**: +0.05 when severity matches expected level for task difficulty - **Precision penalty**: −0.10 for hallucinated or false-positive issues ## Project Structure ```text . ├── __init__.py # Package exports ├── client.py # WebSocket client for agent interaction ├── models.py # Typed Pydantic models (Action, Observation, State) ├── inference.py # Baseline inference script with LLM + rule fallback ├── openenv.yaml # OpenEnv specification ├── pyproject.toml # Project config with pytest setup ├── requirements.txt # Pip dependencies ├── Dockerfile # Production container with health check ├── conftest.py # Pytest root configuration ├── README.md ├── scripts/ │ └── validate-submission.sh ├── server/ │ ├── __init__.py │ ├── app.py # FastAPI + Gradio dashboard │ ├── code_review_env_environment.py # Environment with iterative refinement │ ├── graders.py # Multi-dimensional deterministic grader │ ├── tasks.py # 5 task definitions with planted issues │ ├── requirements.txt │ └── Dockerfile └── tests/ ├── conftest.py ├── __init__.py ├── test_graders.py # 19 grader tests └── test_environment.py # 13 environment lifecycle tests ``` ## Setup ```bash uv sync --frozen # OR: pip install -r requirements.txt pip install -r server/requirements.txt ``` ## Running ### Start the server ```bash uv run uvicorn server.app:app --host 0.0.0.0 --port 8000 ``` ### Run tests ```bash uv run pytest tests/ -v ``` ### Run baseline inference ```bash export API_BASE_URL=https://router.huggingface.co/v1 export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct export HF_TOKEN=your-token python inference.py ``` ## Docker ```bash docker build -t code-review-openenv -f Dockerfile . docker run -p 8000:8000 code-review-openenv ``` ## 🔌 API Endpoints | Method | Endpoint | Description | |---|---|---| | `GET` | `/health` | Health check | | `GET` | `/tasks` | List all tasks with schemas | | `POST` | `/reset` | Reset environment for a task | | `POST` | `/step` | Submit a review action | | `GET` | `/state` | Get current episode state | | `POST` | `/grader` | Score a review against a task | | `POST` | `/baseline` | Run rule-based baseline | ## Validation ```bash openenv validate . ./scripts/validate-submission.sh http://localhost:8000 . ``` ## 🏁 Submission Status - All 5 OpenEnv validation checks passing - 32/32 unit tests passing - Docker build and deployment verified - End-to-end inference and grading pipeline tested --- ## 🔗 Links - GitHub: https://github.com/Dolphin-Syndrom/code-review-env - Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env ## License BSD-3-Clause