# πŸ” CodeReview OpenEnv
An **OpenEnv-compliant AI training environment** that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps β€” exactly as a senior engineer would in a real pull-request workflow.
---
## Why Code Review?
Code review is one of the highest-leverage tasks in software engineering. It is:
- **Real-world**: Every professional software team does it daily
- **Structured enough to grade**: Issues have objectively correct or incorrect assessments
- **Rich in partial signal**: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
- **Scalable in difficulty**: Easy (bugs only) β†’ Hard (all categories + written summary)
This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.
---
## Environment Description
```
CodeReviewEnv
β”œβ”€β”€ Task 1 – Easy : Bug detection + Code style (calculator.py, 31 lines)
β”œβ”€β”€ Task 2 – Medium : Security + Performance audit (user_service.py, 55 lines)
└── Task 3 – Hard : Full review, all 5 categories (data_pipeline.py, 49 lines)
```
Each task presents a Python snippet containing intentional flaws. The agent submits `ReviewComment` objects across one or more steps, then finalises with `submit=True`. A deterministic grader scores the review against ground-truth issues.
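The grading logic itself lives in `graders/graders.py`; as a rough sketch of what deterministic matching against ground truth could look like (the function name, matching rule, and line tolerance below are illustrative assumptions, not the actual grader):

```python
# Illustrative sketch only: match submitted comments against ground-truth
# issues by category and nearby line number, returning a coverage fraction.
def match_comments(submitted, ground_truth, line_tolerance=2):
    """Fraction of ground-truth issues covered by the submitted comments."""
    matched = 0
    remaining = list(ground_truth)
    for c in submitted:
        for issue in remaining:
            same_category = c["category"] == issue["category"]
            # File-level comments (line=None) match on category alone.
            close_line = (
                c["line"] is None
                or issue["line"] is None
                or abs(c["line"] - issue["line"]) <= line_tolerance
            )
            if same_category and close_line:
                matched += 1
                remaining.remove(issue)  # each issue can be matched once
                break
    return matched / len(ground_truth) if ground_truth else 0.0
```

The real graders also weight by severity and penalise false positives, as described under "Reward Function" below.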
---
## Observation Space
What the agent sees on each step:
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step (0-indexed) |
| `snippet.file_name` | `str` | Logical file name (e.g. `auth.py`) |
| `snippet.source` | `str` | Full Python source code |
| `instructions` | `str` | Review scope, difficulty, and guidance |
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
| `feedback` | `str \| None` | Env feedback on the last action |
| `done` | `bool` | Whether the episode has ended |
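The actual typed models live in `env/models.py` as Pydantic classes; the shape of the observation can be sketched with plain dataclasses:

```python
# Sketch of the observation schema above (illustrative; the env's real
# models are Pydantic classes in env/models.py).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Snippet:
    file_name: str  # logical file name, e.g. "calculator.py"
    source: str     # full Python source under review

@dataclass
class Observation:
    task_id: str
    step: int                       # 0-indexed step counter
    snippet: Snippet
    instructions: str               # review scope, difficulty, guidance
    previous_comments: list = field(default_factory=list)
    feedback: Optional[str] = None  # env feedback on the last action
    done: bool = False
```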
---
## Action Space
What the agent submits on each step:
```json
{
"comments": [
{
"line": 10,
"category": "security",
"severity": "critical",
"message": "SQL injection via string interpolation in query.",
"suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
}
],
"summary": "Overall review summary (required for task_3_hard)",
"submit": true
}
```
| Field | Type | Values |
|---|---|---|
| `comments[].line` | `int \| null` | 1-indexed line number; `null` for file-level |
| `comments[].category` | `enum` | `bug`, `security`, `performance`, `style`, `documentation` |
| `comments[].severity` | `enum` | `low`, `medium`, `high`, `critical` |
| `comments[].message` | `str` | 5–500 chars |
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
| `summary` | `str \| null` | Required for `task_3_hard`, optional otherwise |
| `submit` | `bool` | `true` finalises the review and triggers the grader |
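Validation of these constraints is handled by the Pydantic models in `env/models.py`; a hand-rolled sketch of the same rules (the helper name `make_action` is hypothetical):

```python
# Illustrative builder that enforces the enum and length constraints from
# the table above before assembling an action payload.
def make_action(comments, summary=None, submit=False):
    """Assemble an action dict, validating each comment's fields."""
    categories = {"bug", "security", "performance", "style", "documentation"}
    severities = {"low", "medium", "high", "critical"}
    for c in comments:
        if c["category"] not in categories:
            raise ValueError(f"unknown category: {c['category']}")
        if c["severity"] not in severities:
            raise ValueError(f"unknown severity: {c['severity']}")
        if not 5 <= len(c["message"]) <= 500:
            raise ValueError("message must be 5-500 characters")
    return {"comments": comments, "summary": summary, "submit": submit}
```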
---
## Reward Function
Rewards are shaped to provide signal over the **full trajectory**, not just on terminal submit.
### Per-step (incremental) rewards
| Event | Reward |
|---|---|
| New valid comment added | `+0.05` per comment (max `+0.15`) |
| Progress signal (grader score delta) | `+0.5 Γ— Ξ”score` |
| Empty step (no new comments) | `βˆ’0.05` |
| Spam (> 2.5Γ— expected comments) | `βˆ’0.10` |
### On `submit=True` (terminal)
```
submit_reward = score Γ— 0.8 + (0.2 if score β‰₯ threshold else βˆ’0.2)
```
### Per-category penalties (applied to terminal grader score)
| Event | Penalty |
|---|---|
| False positive (fabricated issue) | `βˆ’0.08–0.12` per comment |
| Missed CRITICAL security issue | `βˆ’0.15–0.20` |
| Missed HIGH issue | `βˆ’0.08–0.10` |
| No summary on task 3 | `βˆ’0.10` |
All rewards are clipped to `[βˆ’1.0, 1.0]`.
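Putting the terminal formula and the clip together (a sketch of the arithmetic, not the environment's actual implementation):

```python
# Terminal reward sketch: 0.8 * grader score plus a pass/fail bonus,
# clipped to [-1, 1] as stated above.
def submit_reward(score, threshold):
    """Reward on submit=True for a grader score in [0, 1]."""
    bonus = 0.2 if score >= threshold else -0.2
    return max(-1.0, min(1.0, score * 0.8 + bonus))
```

For example, a grader score of 0.75 on task 1 (threshold 0.55) yields 0.75 × 0.8 + 0.2 = 0.8.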
---
## Task Descriptions
### Task 1 – Easy: Bug Detection & Style Review
**File**: `calculator.py` (31 lines) | **Max steps**: 5 | **Pass threshold**: 0.55
Covers basic utility functions: `divide`, `average`, `celsius_to_fahrenheit`, `find_max`, `count_words`.
**Ground-truth issues (6)**:
- `divide()` β€” no zero-division guard (HIGH bug)
- `average()` β€” crashes on empty list (HIGH bug)
- `celsius_to_fahrenheit` β€” off-by-one (+31 vs +32) (MEDIUM bug)
- `find_max()` β€” crashes on empty list (MEDIUM bug)
- `for i in range(len(lst))` β€” unpythonic iteration (LOW style)
- Manual `Counter` reimplementation (LOW style)
---
### Task 2 – Medium: Security & Performance Audit
**File**: `user_service.py` (55 lines) | **Max steps**: 7 | **Pass threshold**: 0.60
A SQLite-backed user management service with authentication.
**Ground-truth issues (6)**:
- SQL injection in `get_user()` β€” f-string query (CRITICAL security)
- MD5 password hashing in `create_user()` (CRITICAL security)
- SQL injection in `delete_user()` (CRITICAL security)
- MD5 reuse in `authenticate()` (HIGH security)
- `fetchall()` on unbounded table (HIGH performance)
- New DB connection per query, no pooling (MEDIUM performance)
---
### Task 3 – Hard: Comprehensive Code Review
**File**: `data_pipeline.py` (49 lines) | **Max steps**: 10 | **Pass threshold**: 0.65
An analytics data pipeline with CSV loading, row transformation, caching, and stats.
**Ground-truth issues (13 across all 5 categories)**:
- `subprocess.run(shell=True)` with user input β€” OS command injection (CRITICAL security)
- `pickle.loads()` on arbitrary cache data β€” RCE risk (CRITICAL security)
- Pickling into module-level dict (HIGH security)
- `compute_stats()` ZeroDivisionError on empty data (HIGH bug)
- Missing `"value"` key β†’ unhandled KeyError (MEDIUM bug)
- `open()` without encoding (MEDIUM bug)
- Two-pass iteration in `compute_stats` (MEDIUM performance)
- Subprocess per row instead of batching (MEDIUM performance)
- `str(stats)` instead of JSON export (LOW style)
- Module-level mutable global cache (LOW style)
- `load_data()` missing docstring (LOW documentation)
- `process_row()` missing docstring (LOW documentation)
- Insufficient module-level docstring (LOW documentation)
A **written summary** is required (`summary` field) β€” absence incurs a `βˆ’0.10` score penalty.
---
## Expected Baseline Scores (gpt-4o)
| Task | Score | Pass? | Notes |
|---|---|---|---|
| `task_1_easy` | ~0.75 | βœ… | GPT-4o reliably spots ZeroDivisionError and off-by-one |
| `task_2_medium` | ~0.65 | βœ… | SQL injection found; MD5 usually flagged; perf issues partial |
| `task_3_hard` | ~0.55 | ❌ | Pickle RCE and shell injection found; docs often missed (below 0.65 threshold) |
---
## Setup & Usage
### Option A β€” Docker (recommended)
```bash
# Build
docker build -t code-review-env .
# Run (port 7860)
docker run -p 7860:7860 code-review-env
# Test it
curl http://localhost:7860/health
```
### Option B β€” Local Python
```bash
# Install dependencies
pip install -r requirements.txt
# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Open docs
open http://localhost:7860/docs
```
### Run the test suite
```bash
pytest tests/ -v
# Expected: 25 passed
```
### Run the baseline agent
```bash
export OPENAI_API_KEY=sk-...
# All tasks (direct mode β€” no server needed)
python baseline_agent.py
# Single task
python baseline_agent.py --task task_2_medium
# Against a running HTTP server
python baseline_agent.py --mode http --base-url http://localhost:7860
```
---
## API Reference
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | HTML landing page |
| `/health` | GET | Health check |
| `/tasks` | GET | List all task specs |
| `/reset` | POST | Start or restart an episode |
| `/step` | POST | Submit an action |
| `/state` | GET | Get full serialisable state |
| `/docs` | GET | Interactive Swagger UI |
### Example: Full episode via curl
```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
-H 'Content-Type: application/json' \
-d '{"task_id": "task_1_easy", "session_id": "demo"}'
# 2. Step
curl -X POST http://localhost:7860/step \
-H 'Content-Type: application/json' \
-d '{
"session_id": "demo",
"action": {
"comments": [
{
"line": 2,
"category": "bug",
"severity": "high",
"message": "divide() will raise ZeroDivisionError when b is 0.",
"suggestion": "Guard with: if b == 0: raise ValueError"
}
],
"submit": true
}
}'
# 3. Check state
curl "http://localhost:7860/state?session_id=demo"
```
---
## Project Structure
```
openenv-code-review/
β”œβ”€β”€ app.py # FastAPI HTTP server
β”œβ”€β”€ openenv.yaml # OpenEnv spec metadata
β”œβ”€β”€ Dockerfile # Container definition
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ baseline_agent.py # gpt-4o baseline inference script
β”‚
β”œβ”€β”€ env/
β”‚ β”œβ”€β”€ models.py # Pydantic typed models (Observation, Action, Reward, …)
β”‚ └── environment.py # CodeReviewEnv β€” step() / reset() / state()
β”‚
β”œβ”€β”€ corpus/
β”‚ └── snippets.py # Python snippets with ground-truth issues
β”‚
β”œβ”€β”€ graders/
β”‚ └── graders.py # Task1Grader, Task2Grader, Task3Grader
β”‚
└── tests/
└── test_env.py # 25-test pytest suite (all passing)
```
---
## Deploying to Hugging Face Spaces
1. Create a new Space with **Docker** SDK
2. Push this repository to the Space
3. Set `OPENAI_API_KEY` as a Space secret (only needed for baseline script)
4. The Space will auto-build and expose port 7860
```yaml
# README.md frontmatter for HF Spaces
---
title: CodeReview OpenEnv
emoji: πŸ”
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
- openenv
- code-review
- ai-agent
- evaluation
---
```
---
## License
MIT