	---
	title: Code Review Environment
	emoji: 🛡️
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 8000
	pinned: false
	license: bsd-3-clause
	short_description: AI agent code review environment benchmark
	tags:
	- openenv
	- reinforcement-learning
	- code-review
	---

	# Code Review OpenEnv Benchmark

	## 🚀 Scaler March 2026 Hackathon Submission

	- Author: Dolphin-Syndrom
	- Type: OpenEnv Benchmark Environment
	- Focus: Evaluating LLM agents on security-aware code review tasks

	---

	## ⚡ TL;DR

	A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.

	- 5 tasks with progressive difficulty (extra_easy → easy → medium → hard → expert)
	- 12-tag issue taxonomy covering security, logic, and robustness flaws
	- Multi-dimensional grading: recall + quality bonus + severity bonus − precision penalty
	- Iterative refinement: feedback-driven multi-step improvement within episodes
	- 32 unit tests covering graders, environment lifecycle, and task coverage
	- Deterministic scoring (0.0–1.0), deployable via Docker on Hugging Face Spaces
	- Fully OpenEnv compliant

	---

	> Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops with iterative refinement.
	>
	> Suitable for benchmarking agent performance, comparing reward-shaping strategies, and measuring detection accuracy while penalizing hallucinated false positives.

	## What Makes This Environment Unique

	### 1. Iterative Refinement Mechanic

	Unlike single-shot evaluation environments, this benchmark provides structured feedback after each step that tells agents what categories of issues they missed (without revealing exact tags). This creates a genuine multi-step learning loop:

	```
	Step 1: Agent submits initial review → receives "Hint: look for security vulnerability"
	Step 2: Agent refines review based on hint → finds missed sql_injection → score improves
	Step 3: Final attempt with all accumulated feedback
	```

	This models how real code review works — reviewers iterate based on discussion and feedback.
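
A minimal sketch of how category-level hints could be produced without revealing exact tags. It assumes the tag grouping from the taxonomy table in this README; the function name `build_hints`, the category labels, and the hint wording are hypothetical and are not taken from the environment's source.

```python
# Tag-to-category grouping copied from the taxonomy table; labels are illustrative.
TAG_CATEGORY = {
    "null_pointer": "logic error",
    "missing_return": "logic error",
    "index_out_of_bounds": "logic error",
    "sql_injection": "security vulnerability",
    "hardcoded_secret": "security vulnerability",
    "path_traversal": "security vulnerability",
    "race_condition": "robustness issue",
    "timing_attack": "robustness issue",
    "improper_error_handling": "robustness issue",
    "type_error": "input handling issue",
    "integer_overflow": "input handling issue",
    "missing_input_validation": "input handling issue",
}


def build_hints(planted, found):
    """Return category-level hints for missed tags without naming the tags."""
    missed = set(planted) - set(found)
    categories = sorted({TAG_CATEGORY[tag] for tag in missed})
    return [f"Hint: look for {category}" for category in categories]


print(build_hints(["sql_injection", "hardcoded_secret"], ["hardcoded_secret"]))
# → ['Hint: look for security vulnerability']
```

Deduplicating by category means two missed tags from the same group yield a single hint, so the agent learns where to look but not how many tags remain.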

	### 2. Multi-Dimensional Reward Function

	The grading system combines four components into a single score:

	| Component | Value | Signal |
	|---|---|---|
	| Recall reward | `\|correct\| / \|planted\|` | Comprehensive detection |
	| Quality bonus | +0.05 per correct issue with a matching keyword | Keyword-rich explanations |
	| Severity bonus | +0.05 | Correct risk assessment |
	| Precision penalty | −0.10 per false positive | Anti-hallucination |

	This forces agents to balance thoroughness against precision — a core tension in real code review.
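
Read together, the four components suggest a combined score of roughly the following form. This is a reading of the table, not the grader's source: the clamp to [0, 1] is an assumption based on the stated 0.0 to 1.0 score range, and the symbols are introduced here for illustration.

```latex
\text{score} = \min\!\Bigl(1,\ \max\!\Bigl(0,\ \frac{|C|}{|P|} + 0.05\,q + 0.05\,s - 0.10\,f\Bigr)\Bigr)
```

Here $P$ is the set of planted issues, $C \subseteq P$ the correctly reported ones, $q$ the number of correct issues whose keyword appears in the comment, $s \in \{0, 1\}$ indicates a matching severity, and $f$ is the number of false positives.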

	### 3. Full 12-Tag Taxonomy Coverage

	Every tag in the taxonomy is exercised across the 5 tasks:

	| Category | Tags | Task Coverage |
	|---|---|---|
	| Logic errors | `null_pointer`, `missing_return`, `index_out_of_bounds` | extra_easy, easy |
	| Security | `sql_injection`, `hardcoded_secret`, `path_traversal` | medium, expert |
	| Robustness | `race_condition`, `timing_attack`, `improper_error_handling` | hard |
	| Input handling | `type_error`, `integer_overflow`, `missing_input_validation` | expert |

	## Architecture

	```mermaid
	graph TB
	Agent[AI Agent / inference.py] -->|POST /reset| Server[FastAPI Server]
	Agent -->|POST /step| Server
	Server --> Env[CodeReviewEnvironment]
	Env --> Tasks[Task Registry - 5 tasks]
	Env --> Grader[Deterministic Grader]
	Grader -->|recall + quality + severity − penalty| Score[Score 0.0-1.0]
	Score -->|observation + reward + feedback| Agent
	Server -->|GET /health| Health[Health Check]
	Server -->|POST /grader| Grader
	Server -->|POST /baseline| Baseline[Rule-Based Baseline]
	Server -->|Gradio UI| Dashboard[Analytics Dashboard]

	style Agent fill:#58a6ff,stroke:#333
	style Server fill:#3fb950,stroke:#333
	style Grader fill:#f0883e,stroke:#333
	style Dashboard fill:#bc8cff,stroke:#333
	```

	## Environment Specification

	### Objective

	For each episode, the agent sees a Python code snippet containing planted issues and must:

	1. Identify issues using tags from a 12-item `ISSUE_TAXONOMY`
	2. Assess overall severity (`low`, `medium`, `high`, `critical`)
	3. Articulate findings in a human-readable `review_comment`
	4. Iteratively refine based on environment feedback across up to 3 steps

	### Observation Space

	| Field | Type | Description |
	|---|---|---|
	| `task_id` | string | Current task identifier |
	| `file_name` | string | File under review |
	| `task_description` | string | Review instructions |
	| `code_snippet` | string | Python code with planted issues |
	| `feedback` | string | Previous step feedback with refinement hints |
	| `step_number` | integer | Current step (0 after reset) |
	| `available_issue_tags` | array | Allowed taxonomy tags |

	### Action Space

	| Field | Type | Description |
	|---|---|---|
	| `issues_found` | list[str] | Tags from ISSUE_TAXONOMY |
	| `severity` | enum | `low` / `medium` / `high` / `critical` |
	| `review_comment` | string | Explanation of identified issues |
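
The action fields above can be sketched as a small validated type. The project structure says `models.py` uses Pydantic; this example uses a stdlib dataclass instead so it runs without dependencies. The class name `ReviewAction` is hypothetical, and rejecting unknown tags is an assumption: the real environment may instead accept them and score them as false positives.

```python
from dataclasses import dataclass

# The 12 tags listed in this README's taxonomy table.
ISSUE_TAXONOMY = frozenset({
    "null_pointer", "missing_return", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "race_condition", "timing_attack", "improper_error_handling",
    "type_error", "integer_overflow", "missing_input_validation",
})
SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class ReviewAction:
    """Hypothetical stand-in for the action model; field names follow the table."""

    issues_found: list
    severity: str
    review_comment: str = ""

    def __post_init__(self):
        # Assumed behavior: fail fast on tags that are not in the taxonomy.
        unknown = sorted(set(self.issues_found) - ISSUE_TAXONOMY)
        if unknown:
            raise ValueError(f"tags outside ISSUE_TAXONOMY: {unknown}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


action = ReviewAction(
    issues_found=["sql_injection"],
    severity="high",
    review_comment="The query is built by string concatenation.",
)
print(action.issues_found, action.severity)
```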

	### Episode Flow

	1. `reset(task_id)` loads a task and returns the initial observation
	2. Agent receives code snippet and available tags
	3. Agent submits review via `step(action)`
	4. Environment returns observation with score, feedback, and refinement hints
	5. Agent can use feedback to improve on subsequent steps
	6. The episode ends when the score is ≥ 0.95 or the step limit (3) is reached
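
The flow above can be sketched as a driver loop. Everything here is illustrative: `StubEnv` is a toy that scores by recall only, the `score` key on the returned observation is an assumed field name, and `cautious_policy` is a hypothetical agent. Only the 0.95 threshold and the 3-step limit come from this README.

```python
def run_episode(env, policy, task_id, max_steps=3, threshold=0.95):
    """Reset, then step until the score threshold or the step limit ends the episode."""
    observation = env.reset(task_id)
    scores = []
    for _ in range(max_steps):
        action = policy(observation)
        observation = env.step(action)
        scores.append(observation["score"])
        if observation["score"] >= threshold:
            break
    return scores


class StubEnv:
    """Toy environment for illustration only: recall-based score plus a generic hint."""

    def __init__(self, planted):
        self.planted = set(planted)

    def reset(self, task_id):
        return {"task_id": task_id, "feedback": "", "step_number": 0}

    def step(self, action):
        found = set(action["issues_found"])
        score = len(found & self.planted) / len(self.planted)
        feedback = "" if found >= self.planted else "Hint: something is still missing"
        return {"feedback": feedback, "score": score}


def cautious_policy(observation):
    """Report one tag first, then add a second once a hint arrives."""
    tags = ["sql_injection"]
    if observation["feedback"]:
        tags.append("hardcoded_secret")
    return {"issues_found": tags, "severity": "high", "review_comment": ""}


env = StubEnv(["sql_injection", "hardcoded_secret"])
print(run_episode(env, cautious_policy, "task_medium"))
# → [0.5, 1.0]
```

The episode stops after the second step because the score crosses the threshold, so the third step is never used.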

	## Tasks

	| Task | Difficulty | Planted Issues | File |
	|---|---|---|---|
	| `task_extra_easy` | Extra Easy | `index_out_of_bounds` | data_utils.py |
	| `task_easy` | Easy | `null_pointer`, `missing_return` | user_service.py |
	| `task_medium` | Medium | `sql_injection`, `hardcoded_secret` | auth.py |
	| `task_hard` | Hard | `race_condition`, `improper_error_handling`, `timing_attack` | payments.py |
	| `task_expert` | Expert | `path_traversal`, `integer_overflow`, `missing_input_validation`, `type_error` | file_processor.py |
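
The planted-issue sets in this table can be checked against the 12-tag taxonomy with a few set operations. The dictionary below is transcribed from the table, not loaded from `server/tasks.py`, so treat it as a sanity check on the README and not on the code.

```python
# Planted issues per task, transcribed from the table above.
TASKS = {
    "task_extra_easy": {"index_out_of_bounds"},
    "task_easy": {"null_pointer", "missing_return"},
    "task_medium": {"sql_injection", "hardcoded_secret"},
    "task_hard": {"race_condition", "improper_error_handling", "timing_attack"},
    "task_expert": {
        "path_traversal", "integer_overflow",
        "missing_input_validation", "type_error",
    },
}

ISSUE_TAXONOMY = {
    "null_pointer", "missing_return", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "race_condition", "timing_attack", "improper_error_handling",
    "type_error", "integer_overflow", "missing_input_validation",
}

covered = set().union(*TASKS.values())
uncovered = ISSUE_TAXONOMY - covered   # taxonomy tags no task exercises
unknown = covered - ISSUE_TAXONOMY     # planted tags missing from the taxonomy
print(len(covered), sorted(uncovered), sorted(unknown))
# → 12 [] []
```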

	## Reward Design

	Summary: a fully correct review scores close to 1.0, while random tagging loses points to the precision penalty, so the score remains a meaningful learning signal.

	The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.

	Core components:

	- Recall reward: fractional points for correctly identified issues
	- Quality bonus: +0.05 per correct issue with a matching keyword in the comment
	- Severity bonus: +0.05 when severity matches expected level for task difficulty
	- Precision penalty: −0.10 per hallucinated (false-positive) issue
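
A worked sketch of how these components might combine for one review. The weights follow the list above; the clamp to the 0.0 to 1.0 range, the function name `grade`, the per-false-positive penalty, and the keyword lists are assumptions made for illustration, not the contents of `server/graders.py`.

```python
def grade(planted, found, comment, severity, expected_severity, keywords):
    """Score one review; weights follow the README, clamping is assumed."""
    planted, found = set(planted), set(found)
    correct = found & planted
    false_positives = found - planted

    recall = len(correct) / len(planted)
    text = comment.lower()
    # +0.05 for each correct tag whose keyword list has a hit in the comment.
    quality = 0.05 * sum(
        any(word in text for word in keywords.get(tag, ())) for tag in correct
    )
    severity_bonus = 0.05 if severity == expected_severity else 0.0
    penalty = 0.10 * len(false_positives)

    raw = recall + quality + severity_bonus - penalty
    return round(min(1.0, max(0.0, raw)), 4)


# Hypothetical keyword lists; the real grader's keywords are not shown here.
KEYWORDS = {
    "sql_injection": ("sql", "parameterized"),
    "hardcoded_secret": ("secret", "credential"),
}

partial = grade(
    planted=["sql_injection", "hardcoded_secret"],
    found=["sql_injection", "race_condition"],  # one hit, one false positive
    comment="User input is concatenated into the SQL query.",
    severity="high",
    expected_severity="high",
    keywords=KEYWORDS,
)
print(partial)  # 0.5 + 0.05 + 0.05 - 0.10 → 0.5
```

In this example the false positive cancels both bonuses, which shows why guessing extra tags does not pay off unless the guess is likely to be right.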

	## Project Structure

	```text
	.
	├── __init__.py  # Package exports
	├── client.py  # WebSocket client for agent interaction
	├── models.py  # Typed Pydantic models (Action, Observation, State)
	├── inference.py  # Baseline inference script with LLM + rule fallback
	├── openenv.yaml  # OpenEnv specification
	├── pyproject.toml  # Project config with pytest setup
	├── requirements.txt  # Pip dependencies
	├── Dockerfile  # Production container with health check
	├── conftest.py  # Pytest root configuration
	├── README.md
	├── scripts/
	│   └── validate-submission.sh
	├── server/
	│   ├── __init__.py
	│   ├── app.py  # FastAPI + Gradio dashboard
	│   ├── code_review_env_environment.py  # Environment with iterative refinement
	│   ├── graders.py  # Multi-dimensional deterministic grader
	│   ├── tasks.py  # 5 task definitions with planted issues
	│   ├── requirements.txt
	│   └── Dockerfile
	└── tests/
	    ├── conftest.py
	    ├── __init__.py
	    ├── test_graders.py  # 19 grader tests
	    └── test_environment.py  # 13 environment lifecycle tests
	```

	## Setup

	```bash
	uv sync --frozen
	# OR:
	pip install -r requirements.txt
	pip install -r server/requirements.txt
	```

	## Running

	### Start the server

	```bash
	uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	### Run tests

	```bash
	uv run pytest tests/ -v
	```

	### Run baseline inference

	```bash
	export API_BASE_URL=https://router.huggingface.co/v1
	export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
	export HF_TOKEN=your-token
	python inference.py
	```

	## Docker

	```bash
	docker build -t code-review-openenv -f Dockerfile .
	docker run -p 8000:8000 code-review-openenv
	```

	## 🔌 API Endpoints

	| Method | Endpoint | Description |
	|---|---|---|
	| `GET` | `/health` | Health check |
	| `GET` | `/tasks` | List all tasks with schemas |
	| `POST` | `/reset` | Reset environment for a task |
	| `POST` | `/step` | Submit a review action |
	| `GET` | `/state` | Get current episode state |
	| `POST` | `/grader` | Score a review against a task |
	| `POST` | `/baseline` | Run rule-based baseline |
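
A minimal client sketch for the `/reset` and `/step` endpoints using only the standard library. The endpoint paths and port come from this README; the JSON payload shapes are assumptions (the server may, for example, expect the action wrapped under another key), so check `GET /tasks` for the real schemas. The helper names `build_post` and `send` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # matches the uvicorn command in this README


def build_post(path, payload):
    """Build a JSON POST request for one of the environment endpoints."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response (needs a running server)."""
    with urlopen(request, timeout=10) as response:
        return json.load(response)


# Assumed payload shapes, shown for illustration only.
reset_request = build_post("/reset", {"task_id": "task_medium"})
step_request = build_post("/step", {
    "issues_found": ["sql_injection", "hardcoded_secret"],
    "severity": "high",
    "review_comment": "Query built from raw input; API key committed in source.",
})
print(reset_request.get_method(), reset_request.full_url)
# → POST http://localhost:8000/reset
```

With the server running, `send(reset_request)` followed by `send(step_request)` would perform one reset and one review step.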

	## Validation

	```bash
	openenv validate .
	./scripts/validate-submission.sh http://localhost:8000 .
	```

	## 🏁 Submission Status

	- All 5 OpenEnv validation checks passing
	- 32/32 unit tests passing
	- Docker build and deployment verified
	- End-to-end inference and grading pipeline tested

	---

	## 🔗 Links

	- GitHub: https://github.com/Dolphin-Syndrom/code-review-env
	- Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env

	## License

	BSD-3-Clause