---
title: Code Review Environment
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
short_description: AI agent code review environment benchmark
tags:
  - openenv
  - reinforcement-learning
  - code-review
---

# Code Review OpenEnv Benchmark

## Scaler March 2026 Hackathon Submission

**Author:** Dolphin-Syndrom
**Type:** OpenEnv Benchmark Environment
**Focus:** Evaluating LLM agents on security-aware code review tasks

---

## ⚡ TL;DR

A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.

- **5 tasks** with progressive difficulty (extra_easy → easy → medium → hard → expert)
- **12-tag issue taxonomy** covering security, logic, and robustness flaws
- **Multi-dimensional grading**: recall + quality bonus + severity bonus − precision penalty
- **Iterative refinement**: feedback-driven multi-step improvement within episodes
- **32 unit tests** covering graders, environment lifecycle, and task coverage
- Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
- Fully OpenEnv compliant

---

> Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.
>
> Suitable for benchmarking agent performance, reward shaping strategies, and the ability to detect issues accurately without hallucinating false positives.

## What Makes This Environment Unique

### 1. Iterative Refinement Mechanic

Unlike single-shot evaluation environments, this benchmark provides **structured feedback after each step** that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:

```
Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
Step 3: Final attempt with all accumulated feedback
```

This models how real code review works: reviewers iterate based on discussion and feedback.
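
The hint mechanic can be sketched as a lookup from missed tags to coarse categories. This is a minimal illustration, not the environment's actual code: the `TAG_CATEGORY` table, the `build_feedback` name, and the exact hint wording are assumptions based on the taxonomy table and the example hint shown in this README.

```python
# Hypothetical mapping from taxonomy tags to the coarse category used in hints.
TAG_CATEGORY = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness flaw",
    "timing_attack": "robustness flaw",
    "improper_error_handling": "robustness flaw",
    "type_error": "input handling issue",
    "integer_overflow": "input handling issue",
    "missing_input_validation": "input handling issue",
}


def build_feedback(planted, submitted):
    """Return a hint naming the categories of missed issues, never the tags."""
    missed = set(planted) - set(submitted)
    if not missed:
        return "All planted issues found."
    # Collapse missed tags into their categories so the exact tag stays hidden.
    categories = sorted({TAG_CATEGORY[tag] for tag in missed})
    return "Hint: look for " + ", ".join(categories)


hint = build_feedback(
    planted=["sql_injection", "hardcoded_secret"],
    submitted=["hardcoded_secret"],
)
print(hint)
```

Because the hint only names a category, the agent still has to locate the specific flaw in the snippet on its next step.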

### 2. Multi-Dimensional Reward Function

The grading system combines four components in a single score:

| Component | Value | Signal |
|---|---|---|
| **Recall reward** | `\|correct\| / \|planted\|` | Comprehensive detection |
| **Quality bonus** | +0.05 per correct issue | Keyword-rich explanations |
| **Severity bonus** | +0.05 | Correct risk assessment |
| **Precision penalty** | −0.10 per false positive | Anti-hallucination |

This forces agents to balance thoroughness against precision, a core tension in real code review.

### 3. Full 12-Tag Taxonomy Coverage

Every tag in the taxonomy is exercised across the 5 tasks:

| Category | Tags | Task Coverage |
|---|---|---|
| Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy |
| Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert |
| Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard |
| Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert |

## Architecture

```mermaid
graph TB
    Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
    Agent -->|POST /step| Server
    Server --> Env[CodeReviewEnvironment]
    Env --> Tasks[Task Registry - 5 tasks]
    Env --> Grader[Deterministic Grader]
    Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
    Score -->|observation + reward + feedback| Agent
    Server -->|GET /health| Health[Health Check]
    Server -->|POST /grader| Grader
    Server -->|POST /baseline| Baseline[Rule-Based Baseline]
    Server -->|Gradio UI| Dashboard[Analytics Dashboard]
    style Agent fill:#58a6ff,stroke:#333
    style Server fill:#3fb950,stroke:#333
    style Grader fill:#f0883e,stroke:#333
    style Dashboard fill:#bc8cff,stroke:#333
```

## Environment Specification

### Objective

For each episode, the agent sees a Python code snippet containing planted issues and must:

1. Identify issues using tags from a 12-item `ISSUE_TAXONOMY`
2. Assess overall severity (`low`, `medium`, `high`, `critical`)
3. Articulate findings in a human-readable `review_comment`
4. Iteratively refine based on environment feedback across up to 3 steps

### Observation Space

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Current task identifier |
| `file_name` | string | File under review |
| `task_description` | string | Review instructions |
| `code_snippet` | string | Python code with planted issues |
| `feedback` | string | Previous step feedback with refinement hints |
| `step_number` | integer | Current step (0 after reset) |
| `available_issue_tags` | array | Allowed taxonomy tags |

### Action Space

| Field | Type | Description |
|---|---|---|
| `issues_found` | list[str] | Tags from `ISSUE_TAXONOMY` |
| `severity` | enum | `low` / `medium` / `high` / `critical` |
| `review_comment` | string | Explanation of identified issues |
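
As a rough illustration of the action schema, the sketch below models the three fields with a standard-library dataclass. The project itself uses Pydantic models in `models.py`; the class names `ReviewAction` and `Severity` and the choice to reject unknown tags at construction time are assumptions made for this example (the real environment may instead score unknown tags as false positives).

```python
from dataclasses import dataclass
from enum import Enum

# The 12 tags listed in this README's taxonomy table.
ISSUE_TAXONOMY = frozenset({
    "null_pointer", "missing_return", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "race_condition", "timing_attack", "improper_error_handling",
    "type_error", "integer_overflow", "missing_input_validation",
})


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class ReviewAction:
    """Hypothetical stand-in for the action model described above."""
    issues_found: list[str]
    severity: Severity
    review_comment: str = ""

    def __post_init__(self):
        # Accept plain strings such as "high"; unknown values raise ValueError.
        self.severity = Severity(self.severity)
        unknown = [tag for tag in self.issues_found if tag not in ISSUE_TAXONOMY]
        if unknown:
            raise ValueError(f"unknown issue tags: {unknown}")


action = ReviewAction(
    issues_found=["sql_injection", "hardcoded_secret"],
    severity="critical",
    review_comment="Query is built by string concatenation; API key is hardcoded.",
)
```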

### Episode Flow

1. `reset(task_id)` loads a task and returns the initial observation
2. Agent receives code snippet and available tags
3. Agent submits review via `step(action)`
4. Environment returns observation with score, feedback, and refinement hints
5. Agent can use feedback to improve on subsequent steps
6. Episode ends when score ≥ 0.95 or the step limit (3) is reached
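
The flow above can be sketched as a short driver loop. Everything here is illustrative: `StubEnv` is a toy stand-in whose score is plain recall, and the `(observation, score, done)` return shape of `step` is an assumption for this example, not the real client API.

```python
class StubEnv:
    """Toy environment: scores recall only and stops at 0.95 or after 3 steps."""

    def __init__(self, planted, max_steps=3):
        self.planted = set(planted)
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, task_id):
        self.steps = 0
        return {"task_id": task_id, "feedback": "", "step_number": 0}

    def step(self, action):
        self.steps += 1
        found = self.planted & set(action["issues_found"])
        score = len(found) / len(self.planted)
        done = score >= 0.95 or self.steps >= self.max_steps
        feedback = "" if score >= 0.95 else "Hint: issues remain"
        obs = {"feedback": feedback, "step_number": self.steps}
        return obs, score, done


def run_episode(env, policy, task_id):
    """Reset, then step until the environment reports the episode is over."""
    obs = env.reset(task_id)
    best = 0.0
    done = False
    while not done:
        obs, score, done = env.step(policy(obs))
        best = max(best, score)
    return best


def policy(obs):
    # Start with one guess, then widen the review once feedback arrives.
    tags = ["sql_injection"]
    if obs["feedback"]:
        tags.append("hardcoded_secret")
    return {"issues_found": tags, "severity": "high", "review_comment": ""}


env = StubEnv(planted=["sql_injection", "hardcoded_secret"])
best_score = run_episode(env, policy, "task_medium")
```

In this toy run the first step finds one of two issues, the feedback triggers a second guess, and the episode ends on the second step once the score crosses the 0.95 threshold.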

## Tasks

| Task | Difficulty | Planted Issues | File |
|---|---|---|---|
| `task_extra_easy` | Extra Easy | `index_out_of_bounds` | data_utils.py |
| `task_easy` | Easy | `null_pointer`, `missing_return` | user_service.py |
| `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | auth.py |
| `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | payments.py |
| `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | file_processor.py |
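
The claim that every taxonomy tag is exercised can be checked mechanically from the two tables in this README. The sketch below transcribes them into plain dictionaries; the names `TAXONOMY_BY_CATEGORY`, `PLANTED_ISSUES`, and `uncovered_tags` are made up for this example and are not taken from `tasks.py`.

```python
# Transcribed from the taxonomy table.
TAXONOMY_BY_CATEGORY = {
    "Logic errors": ["null_pointer", "missing_return", "index_out_of_bounds"],
    "Security": ["sql_injection", "hardcoded_secret", "path_traversal"],
    "Robustness": ["race_condition", "timing_attack", "improper_error_handling"],
    "Input handling": ["type_error", "integer_overflow", "missing_input_validation"],
}

# Transcribed from the task table.
PLANTED_ISSUES = {
    "task_extra_easy": ["index_out_of_bounds"],
    "task_easy": ["null_pointer", "missing_return"],
    "task_medium": ["sql_injection", "hardcoded_secret"],
    "task_hard": ["race_condition", "improper_error_handling", "timing_attack"],
    "task_expert": [
        "path_traversal", "integer_overflow",
        "missing_input_validation", "type_error",
    ],
}


def uncovered_tags(taxonomy, tasks):
    """Return taxonomy tags that no task plants."""
    all_tags = {tag for tags in taxonomy.values() for tag in tags}
    planted = {tag for tags in tasks.values() for tag in tags}
    return all_tags - planted


missing = uncovered_tags(TAXONOMY_BY_CATEGORY, PLANTED_ISSUES)
print("uncovered tags:", sorted(missing))
```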

## Reward Design

**Summary:** A correct review earns a reward close to 1.0, while random tagging is penalized, so the reward carries a meaningful learning signal.

The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.

Core components:

- **Recall reward**: fractional points for correctly identified issues
- **Quality bonus**: +0.05 per correct issue with a matching keyword in the comment
- **Severity bonus**: +0.05 when the submitted severity matches the level expected for the task
- **Precision penalty**: −0.10 per hallucinated (false-positive) issue
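
To make the arithmetic concrete, here is a minimal sketch of how the four components could combine. It is not the code in `server/graders.py`: the `grade` signature, the per-tag keyword lists, and the clamp to the 0.0–1.0 range (inferred from the stated score range) are all assumptions.

```python
def grade(planted, submitted, comment, severity, expected_severity, keywords):
    """Combine recall, quality bonus, severity bonus, and precision penalty."""
    planted, submitted = set(planted), set(submitted)
    correct = planted & submitted
    false_positives = submitted - planted

    recall = len(correct) / len(planted)

    # +0.05 for each correct tag whose (hypothetical) keywords appear in the comment.
    text = comment.lower()
    quality = 0.05 * sum(
        1 for tag in correct if any(word in text for word in keywords.get(tag, ()))
    )

    severity_bonus = 0.05 if severity == expected_severity else 0.0
    penalty = 0.10 * len(false_positives)

    # Assumed clamp so the final score stays inside the documented range.
    return max(0.0, min(1.0, recall + quality + severity_bonus - penalty))


KEYWORDS = {"sql_injection": ["query"], "hardcoded_secret": ["secret", "api key"]}

partial = grade(
    planted=["sql_injection", "hardcoded_secret"],
    submitted=["sql_injection", "null_pointer"],
    comment="The query is built by string concatenation.",
    severity="high",
    expected_severity="high",
    keywords=KEYWORDS,
)
```

In the `partial` example, one of two planted issues is found (recall 0.5), the comment earns one quality bonus, the severity matches, and one false positive costs 0.10, for a total of roughly 0.5.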

## Project Structure

```text
.
├── __init__.py                         # Package exports
├── client.py                           # WebSocket client for agent interaction
├── models.py                           # Typed Pydantic models (Action, Observation, State)
├── inference.py                        # Baseline inference script with LLM + rule fallback
├── openenv.yaml                        # OpenEnv specification
├── pyproject.toml                      # Project config with pytest setup
├── requirements.txt                    # Pip dependencies
├── Dockerfile                          # Production container with health check
├── conftest.py                         # Pytest root configuration
├── README.md
├── scripts/
│   └── validate-submission.sh
├── server/
│   ├── __init__.py
│   ├── app.py                          # FastAPI + Gradio dashboard
│   ├── code_review_env_environment.py  # Environment with iterative refinement
│   ├── graders.py                      # Multi-dimensional deterministic grader
│   ├── tasks.py                        # 5 task definitions with planted issues
│   ├── requirements.txt
│   └── Dockerfile
└── tests/
    ├── conftest.py
    ├── __init__.py
    ├── test_graders.py                 # 19 grader tests
    └── test_environment.py             # 13 environment lifecycle tests
```

## Setup

```bash
uv sync --frozen
# OR:
pip install -r requirements.txt
pip install -r server/requirements.txt
```

## Running

### Start the server

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Run tests

```bash
uv run pytest tests/ -v
```

### Run baseline inference

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=your-token
python inference.py
```

## Docker

```bash
docker build -t code-review-openenv -f Dockerfile .
docker run -p 8000:8000 code-review-openenv
```

## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List all tasks with schemas |
| `POST` | `/reset` | Reset environment for a task |
| `POST` | `/step` | Submit a review action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Score a review against a task |
| `POST` | `/baseline` | Run rule-based baseline |
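
For a quick manual check of the endpoints, the requests can be built with the standard library alone. The JSON body shapes below are assumptions: this README lists the action fields but not the exact request schema, so the server may expect a different wrapper (for example a nested `action` object). The helper names `build_post` and `send` are also made up for this sketch.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the port used in the Running section


def build_post(base_url, endpoint, payload):
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed payload shapes; adjust to the schema reported by GET /tasks.
reset_request = build_post(BASE_URL, "/reset", {"task_id": "task_medium"})
step_request = build_post(
    BASE_URL,
    "/step",
    {
        "issues_found": ["sql_injection", "hardcoded_secret"],
        "severity": "high",
        "review_comment": "Query uses string concatenation; secret is hardcoded.",
    },
)

# With the server running, the calls would look like:
# observation = send(reset_request)
# result = send(step_request)
```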

## Validation

```bash
openenv validate .
./scripts/validate-submission.sh http://localhost:8000 .
```

## Submission Status

- All 5 OpenEnv validation checks passing
- 32/32 unit tests passing
- Docker build and deployment verified
- End-to-end inference and grading pipeline tested

---

## Links

- GitHub: https://github.com/Dolphin-Syndrom/code-review-env
- Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env

## License

BSD-3-Clause