theaniketgiri committed on
Commit
77e1c62
·
1 Parent(s): 14dc79c
Files changed (1)
  1. README.md +204 -418
README.md CHANGED
@@ -14,537 +14,323 @@ tags:
14
  - code-review
15
  ---
16
 
17
- <!-- Banner -->
18
- <div align="center">
19
 
20
- # Code Review Environment
21
 
22
- ### An OpenEnv Benchmark for AI-Driven Pull-Request Review
23
 
24
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue?style=for-the-badge&logo=meta)](https://github.com/meta-pytorch/OpenEnv)
25
- [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
26
- [![Docker](https://img.shields.io/badge/docker-ready-2496ED?style=for-the-badge&logo=docker&logoColor=white)](https://hub.docker.com)
27
- [![License](https://img.shields.io/badge/license-BSD--3-green?style=for-the-badge)](LICENSE)
28
-
29
- ---
30
-
31
- **🚀 Scaler March 2026 Hackathon Submission**
32
-
33
- **Author:** [Dolphin-Syndrom](https://github.com/Dolphin-Syndrom) &nbsp;|&nbsp; **Type:** OpenEnv Benchmark &nbsp;|&nbsp; **Focus:** Security-Aware Code Review
34
-
35
- </div>
36
 
37
  ---
38
 
39
  ## ⚡ TL;DR
40
 
41
- A benchmark environment that evaluates whether an LLM agent can review buggy Python code, identify security vulnerabilities and logic errors using a fixed taxonomy, and articulate its findings — just like a senior engineer doing a pull-request review.
42
-
43
- - **3 progressive tasks** — easy → medium → hard
44
- - **12-tag issue taxonomy** — from `null_pointer` to `timing_attack`
45
- - **Deterministic multi-dimensional grading** — recall + precision + articulation quality
46
- - **Dense reward shaping** — signal at every step, not just episode end
47
- - **Structured actions & observations** — typed Pydantic models with full schema
48
- - **Dual inference modes** — LLM-backed or rule-based fallback
49
- - **Deployable** via Docker on Hugging Face Spaces
50
- - **Fully OpenEnv compliant** — passes `openenv validate`
51
-
52
- ---
53
-
54
- > *Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops — a high-value, real-world software engineering workflow.*
55
-
56
- ---
57
-
58
- ## 📑 Table of Contents
59
 
60
- - [Environment Description](#-environment-description)
61
- - [Why This Domain?](#-why-this-domain)
62
- - [Action Space](#-action-space)
63
- - [Observation Space](#-observation-space)
64
- - [Reward Function](#-reward-function)
65
- - [Tasks](#-tasks)
66
- - [Setup & Usage](#-setup--usage)
67
- - [Running the Baseline](#-running-the-baseline)
68
- - [Baseline Scores](#-baseline-scores)
69
- - [Deployment (HF Spaces + Docker)](#-deployment)
70
 
71
  ---
72
 
73
- ## 🧠 Environment Description
74
-
75
- This OpenEnv environment simulates a real software engineering task: **reviewing buggy Python code and identifying security and logic issues** using a fixed taxonomy of tags.
76
-
77
- Each episode presents the agent with a Python code snippet containing **planted vulnerabilities**. The agent must:
78
-
79
- 1. **Identify** the issues using tags from a 12-item taxonomy
80
- 2. **Assess** overall severity (`low` / `medium` / `high` / `critical`)
81
- 3. **Articulate** its findings in a human-readable review comment
82
-
83
- Performance is measured by a **deterministic, multi-dimensional grader** that scores recall, penalizes false positives, and rewards articulation quality — producing a final score in `[0.0, 1.0]`.
84
-
85
- ### Episode Flow
86
-
87
- ```
88
- ┌─────────┐ ┌──────────────┐ ┌────────────┐ ┌────────────┐
89
- │ reset() │────▶│ Observation │────▶│ Agent Act │────▶│ Grading │
90
- │ (task_id)│ │ code_snippet │ │ issues_found│ │ score/done │
91
- └─────────┘ │ file_name │ │ comment │ └─────┬──────┘
92
- │ description │ │ severity │ │
93
- └──────────────┘ └────────────┘ ┌─────▼──────┐
94
- │ Feedback │
95
- │ + next obs │
96
- └────────────┘
97
- ```
98
-
99
- - `reset(task_id)` loads a task and returns the initial observation
100
- - `step(action)` grades the agent's review and returns `(observation, reward, done)`
101
- - Episode ends when score ≥ 0.95 **or** the step limit (3) is reached
102
 
103
- ### Internal State
104
 
105
- The environment tracks:
106
- - Current task ID, file name, and planted issues
107
- - Episode ID and step count
108
- - Maximum allowed steps (3 per episode)
109
 
110
- The full state is available through the OpenEnv `state()` API for debugging, but the agent **does not** observe the ground-truth issues during normal play.
111
 
112
- ---
113
-
114
- ## 🎯 Why This Domain?
 
 
 
 
115
 
116
- | Criteria | How Code Review Fits |
117
- |----------|---------------------|
118
- | **Real-world utility** | PR review is a daily, high-value engineering workflow |
119
- | **RL-friendly** | Structured action space with dense rewards and a deterministic grader |
120
- | **Progressive difficulty** | Easy → Medium → Hard with increasing issue complexity |
121
- | **Measurable precision** | False positives are explicitly penalized — no reward hacking |
122
- | **Articulation matters** | Bonus for explaining *why* an issue exists, not just tagging it |
123
- | **Security relevance** | Covers OWASP-style vulnerabilities (SQLi, hardcoded secrets, timing attacks) |
124
 
125
- ---
 
 
 
126
 
127
- ## 🕹️ Action Space
128
 
129
- The agent submits a `ReviewAction` (defined in `models.py`) with three fields:
130
 
131
- | Field | Type | Description |
132
- |-------|------|-------------|
133
- | `review_comment` | `str` | Human-readable explanation of identified issues and suggested fixes |
134
- | `issues_found` | `list[str]` | Issue tags selected from the 12-tag `ISSUE_TAXONOMY` |
135
- | `severity` | `"low"` \| `"medium"` \| `"high"` \| `"critical"` | Overall severity assessment |
136
 
137
- ### Issue Taxonomy (12 Tags)
 
 
138
 
139
- ```
140
- ┌──────────────────────┬──────────────────────────┬────────────────────────┐
141
- │ Logic Errors │ Security Vulns │ Robustness │
142
- ├──────────────────────┼──────────────────────────┼────────────────────────┤
143
- │ null_pointer │ sql_injection │ race_condition │
144
- │ missing_return │ hardcoded_secret │ timing_attack │
145
- │ type_error │ path_traversal │ improper_error_ │
146
- │ index_out_of_bounds │ missing_input_ │ handling │
147
- │ │ validation │ integer_overflow │
148
- └──────────────────────┴──────────────────────────┴────────────────────────┘
149
- ```
150
 
151
- ### Example Action
 
152
 
153
- ```json
154
- {
155
- "review_comment": "The function uses .get() for birthdate but doesn't guard against None before arithmetic. Also, the function builds a profile dict but never returns it.",
156
- "issues_found": ["null_pointer", "missing_return"],
157
- "severity": "high"
158
- }
159
- ```
160
-
161
- ---
162
 
163
- ## 👁️ Observation Space
164
 
165
- The agent receives a `ReviewObservation` (defined in `models.py`) after every `reset()` and `step()`:
 
 
166
 
167
- | Field | Type | Description |
168
- |-------|------|-------------|
169
- | `task_id` | `str` | Task identifier (`task_easy`, `task_medium`, `task_hard`) |
170
- | `file_name` | `str` | Simulated file under review (e.g., `auth.py`) |
171
- | `task_description` | `str` | Instructions for the agent |
172
- | `code_snippet` | `str` | Python source code containing planted bugs |
173
- | `feedback` | `str` | Grading breakdown after each step |
174
- | `step_number` | `int` | Current step in the episode (0-indexed) |
175
- | `available_issue_tags` | `list[str]` | Full taxonomy for reference |
176
- | `reward` | `float` | Score from the grader (0.0 – 1.0) |
177
- | `done` | `bool` | Whether the episode has ended |
178
- | `metadata` | `dict` | Diagnostics: correctly found, missed, false positives |
179
 
180
- ---
181
 
182
- ## 📊 Reward Function
183
 
184
- The environment uses a **deterministic, multi-dimensional** reward function implemented in `server/graders.py`. Agents receive dense signal at every step.
 
 
 
 
 
 
185
 
186
- ### Formula
187
 
188
- ```
189
- base_score = |correctly_found ∩ planted| / |planted|
190
- quality_bonus = +0.05 × (# correct issues with keyword match in comment)
191
- precision_penalty = −0.10 × (# false-positive issues)
192
 
193
- final_score = clamp(base_score + quality_bonus + precision_penalty, 0.0, 1.0)
194
- ```
 
195
 
196
- ### Components Explained
197
 
198
- | Component | Value | Purpose |
199
- |-----------|-------|---------|
200
- | **Recall Reward** | `|correct| / |planted|` | Primary signal — find what's actually broken |
201
- | **Quality Bonus** | `+0.05` per issue | Rewards articulation — mentioning *why* an issue matters |
202
- | **Precision Penalty** | `−0.10` per FP | Discourages hallucinated / over-aggressive flagging |
203
 
204
- ### Keyword Bonus Examples
 
 
 
 
205
 
206
- | Issue Tag | Triggering Keywords |
207
- |-----------|-------------------|
208
- | `sql_injection` | sql, injection, f-string, sanitize, parameterize |
209
- | `hardcoded_secret` | hardcoded, secret, credential, env var, plaintext |
210
- | `race_condition` | race, atomic, concurrent, lock, thread |
211
- | `timing_attack` | timing, constant time, hmac, compare_digest |
212
- | `improper_error_handling` | except, swallow, silent, bare except |
213
 
214
- > **Design rationale:** Random or naive strategies produce low scores (missed issues + penalties). Agents must demonstrate both *detection accuracy* and *communication quality* to score well.
215
 
216
- ---
217
 
218
- ## 📝 Tasks
 
219
 
220
- Three tasks with progressive difficulty. Each task presents a different Python file with distinct planted vulnerabilities.
221
 
222
- ### `task_easy` User Service (`user_service.py`)
223
 
224
- | Property | Value |
225
- |----------|-------|
226
- | **Planted Issues** | `null_pointer`, `missing_return` |
227
- | **Difficulty** | Easy |
228
- | **Description** | Review a function that uses `.get()` without null-guarding and never returns its result |
229
- | **Grader** | Deterministic — recall + quality bonus − FP penalty |
230
 
231
- <details>
232
- <summary>📄 Code Snippet</summary>
233
 
234
- ```python
- def get_user_age(user):
-     # Returns age in years from user profile dict
-     birthdate = user.get("birthdate")
-     if user.get("is_active"):
-         account_label = f"active:{user.get('id')}"
-     else:
-         account_label = "inactive"
-
-     age = (datetime.now() - birthdate).days // 365
-     profile = {"label": account_label, "age": age}
-     # TODO: return something
- ```
247
 
248
- </details>
249
 
250
- ---
251
 
252
- ### `task_medium` Auth Module (`auth.py`)
253
 
254
- | Property | Value |
255
- |----------|-------|
256
- | **Planted Issues** | `sql_injection`, `hardcoded_secret` |
257
- | **Difficulty** | Medium |
258
- | **Description** | Review an authentication function with f-string SQL interpolation and a plaintext secret key |
259
- | **Grader** | Deterministic — recall + quality bonus − FP penalty |
260
 
261
- <details>
262
- <summary>📄 Code Snippet</summary>
 
263
 
264
- ```python
- SECRET_KEY = "supersecret123"  # used for JWT signing
-
- def authenticate_user(db_conn, username, password):
-     query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
-     result = db_conn.execute(query)
-     user = result.fetchone()
-
-     if user:
-         audit_line = f"auth ok for {username}"
-         token = jwt.encode({"user_id": user.id}, SECRET_KEY)
-         return token
- ```
277
 
278
- </details>
 
 
279
 
280
- ---
281
 
282
- ### `task_hard` — Payment Processing (`payments.py`)
283
-
284
- | Property | Value |
285
- |----------|-------|
286
- | **Planted Issues** | `race_condition`, `improper_error_handling`, `timing_attack` |
287
- | **Difficulty** | Hard |
288
- | **Description** | Review a payment function for non-atomic balance ops, silently swallowed exceptions, and non-constant-time comparisons |
289
- | **Grader** | Deterministic — recall + quality bonus − FP penalty |
290
-
291
- <details>
292
- <summary>📄 Code Snippet</summary>
293
-
294
- ```python
295
- def process_payment(user_id, amount, card_token):
296
- user = db.get_user(user_id)
297
- if user.balance >= amount:
298
- user.balance -= amount # checked and modified non-atomically
299
- db.save_user(user)
300
-
301
- try:
302
- charge_result = payment_gateway.charge(card_token, amount)
303
- except:
304
- pass # silently swallow all payment errors
305
-
306
- expected = db.get_token_hash(card_token)
307
- actual = hash(card_token)
308
- if expected == actual: # non-constant-time comparison
309
- return {"status": "success", "charge": charge_result}
310
  ```
311
 
312
- </details>
313
-
314
- ---
315
-
316
- ## 🔧 Setup & Usage
317
-
318
- ### Prerequisites
319
-
320
- - Python ≥ 3.10
321
- - [uv](https://github.com/astral-sh/uv) (recommended) or pip
322
 
323
- ### Install Dependencies
324
 
325
  ```bash
326
- # Clone the repository
327
- git clone https://github.com/Dolphin-Syndrom/code-review-env.git
328
- cd code-review-env
329
-
330
- # Using uv (recommended)
331
- uv sync
332
-
333
- # Or using pip
334
- pip install -e .
335
  ```
336
 
337
- ### Start the Server
 
 
338
 
339
  ```bash
340
- # Using uv
341
  uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
342
-
343
- # Or using pip-installed packages
344
- uvicorn server.app:app --host 0.0.0.0 --port 8000
345
  ```
346
-
347
- ### Verify It's Running
348
-
349
  ```bash
350
- # Health check
351
- curl http://localhost:8000/health
352
-
353
- # List all tasks
354
- curl http://localhost:8000/tasks
355
-
356
- # Run the built-in baseline
357
- curl -X POST http://localhost:8000/baseline
358
-
359
- # Score a custom submission
360
- curl -X POST http://localhost:8000/grader \
361
- -H "Content-Type: application/json" \
362
- -d '{
363
- "task_id": "task_easy",
364
- "issues_found": ["null_pointer", "missing_return"],
365
- "review_comment": "Null dereference risk on birthdate and missing return statement"
366
- }'
367
  ```
368
 
369
- ---
370
-
371
- ## 🤖 Running the Baseline
372
 
373
- The baseline script (`inference.py`) supports two modes and follows the **mandatory OpenEnv stdout format**.
374
 
375
- ### Environment Variables
 
 
376
 
377
- | Variable | Required | Default | Description |
378
- |----------|----------|---------|-------------|
379
- | `API_BASE_URL` | No | `https://router.huggingface.co/v1` | LLM API endpoint |
380
- | `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
381
- | `HF_TOKEN` | For LLM mode | — | Hugging Face / API key |
382
- | `ENV_URL` | No | `http://localhost:8000` | Environment server URL |
383
- | `IMAGE_NAME` | No | — | Docker image name (if using `from_docker_image()`) |
384
-
385
- ### Run Locally
386
 
387
  ```bash
388
- # Rule-based fallback (no API key needed)
389
- python inference.py
390
-
391
- # LLM-backed mode
392
- API_BASE_URL=https://router.huggingface.co/v1 \
393
- MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
394
- HF_TOKEN=hf_your_token_here \
395
  python inference.py
396
-
397
- # Against a deployed HF Space
398
- ENV_URL=https://your-space.hf.space python inference.py
399
  ```
400
 
401
- ### Structured Stdout Output (Mandatory)
402
-
403
- The script emits exactly three line types for the OpenEnv validator:
404
-
405
- ```
406
- [START] task=task_easy env=code_review_env model=Qwen/Qwen2.5-72B-Instruct
407
- [STEP] step=1 action={"issues_found":["null_pointer","missing_return"],...} reward=1.00 done=true error=null
408
- [END] success=true steps=1 score=1.000 rewards=1.00
409
- ```
410
-
411
- **Rules:**
412
- - One `[START]` at episode begin
413
- - One `[STEP]` per step, immediately after `env.step()` returns
414
- - One `[END]` after episode completes (always emitted, even on exception)
415
- - `reward` and `rewards` formatted to 2 decimal places
416
- - `done` and `success` are lowercase booleans
417
- - Each task score must be in `(0.0, 1.0)` — strictly between, not exactly 0 or 1
418
-
419
- ---
420
-
421
- ## 📈 Baseline Scores
422
-
423
- Performance of the built-in rule-based baseline (no LLM required):
424
 
425
- | Task | Difficulty | Issues Detected | Score |
426
- |------|-----------|----------------|-------|
427
- | `task_easy` | 🟢 Easy | `null_pointer`, `missing_return` | **1.00** |
428
- | `task_medium` | 🟡 Medium | `sql_injection`, `hardcoded_secret` | **1.00** |
429
- | `task_hard` | 🔴 Hard | `race_condition`, `timing_attack`, `improper_error_handling` | **1.00** |
430
 
431
- > The rule-based baseline uses pattern-matching heuristics (e.g., detecting `.get(` for null pointers, `f"select` for SQL injection). LLM agents are expected to **match or exceed** these scores while providing richer, more actionable review comments.
432
 
433
- ---
 
 
 
434
 
435
- ## 🚀 Deployment
436
 
437
- ### Docker
438
 
439
- ```bash
440
- # Build
441
- docker build -t code-review-env:latest .
442
 
443
- # Run
444
- docker run -p 8000:8000 code-review-env:latest
 
 
 
445
  ```
446
 
447
- ### Hugging Face Spaces
448
 
449
- This repo is structured for Docker-based deployment to HF Spaces.
 
450
 
451
- ```bash
452
- # Using the OpenEnv CLI
453
- openenv push --repo-id Dolphin-Syndrom/code-review-env
 
 
 
454
  ```
455
 
456
- **Recommended Space Settings:**
457
- - **SDK:** Docker
458
- - **Hardware:** CPU Basic (sufficient)
459
- - **Secrets:** Set `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` if you want LLM baseline enabled
460
 
461
- ### Pre-Validation
462
 
463
- Run the validation script before submitting:
 
 
464
 
465
  ```bash
466
- ./scripts/validate-submission.sh https://Dolphin-Syndrom-code-review-env.hf.space .
 
467
  ```
468
 
469
- ---
470
 
471
- ## 🔌 API Reference
472
 
473
- All endpoints are OpenEnv-compatible and return structured JSON.
474
 
475
- | Method | Endpoint | Description |
476
- |--------|----------|-------------|
477
- | `GET` | `/health` | Health check |
478
- | `GET` | `/tasks` | List all tasks with schemas |
479
- | `POST` | `/reset` | Reset episode for a given task |
480
- | `POST` | `/step` | Submit a review action |
481
- | `GET` | `/state` | Get current episode state |
482
- | `POST` | `/grader` | Score an action against a task |
483
- | `POST` | `/baseline` | Run built-in rule-based baseline across all tasks |
484
- | `WS` | `/ws` | WebSocket for real-time interaction |
485
 
486
- ---
487
 
488
- ## 📁 Project Structure
489
 
490
- ```
491
- code-review-env/
492
- ├── __init__.py # Package exports
493
- ├── client.py # CodeReviewEnv WebSocket client (EnvClient subclass)
494
- ├── models.py # ReviewAction, ReviewObservation, ReviewState, ISSUE_TAXONOMY
495
- ├── inference.py # Baseline inference (LLM + rule-based fallback)
496
- ├── openenv.yaml # OpenEnv manifest with grader blocks
497
- ├── pyproject.toml # Project metadata & dependencies
498
- ├── Dockerfile # Production container
499
- ├── README.md
500
- ├── scripts/
501
- │ └── validate-submission.sh # Pre-submission validator
502
- └── server/
503
- ├── __init__.py
504
- ├── app.py # FastAPI server + Gradio dashboard
505
- ├── code_review_env_environment.py # Environment implementation (reset/step/state)
506
- ├── graders.py # Deterministic grading logic
507
- ├── tasks.py # Task definitions with planted issues
508
- ├── requirements.txt
509
- └── Dockerfile
510
- ```
511
 
512
- ---
513
 
514
- ## 🏁 Submission Checklist
515
-
516
- | # | Check | Status |
517
- |---|-------|--------|
518
- | 1 | Docker build succeeds | ✅ |
519
- | 2 | `POST /reset` returns 200 | ✅ |
520
- | 3 | 3 tasks with `grader:` blocks in `openenv.yaml` | ✅ |
521
- | 4 | `inference.py` exits with code 0 | ✅ |
522
- | 5 | `[START]`, `[STEP]`, `[END]` in stdout | ✅ |
523
- | 6 | LLM calls via `API_BASE_URL` proxy | ✅ |
524
- | 7 | All task scores strictly in `(0.0, 1.0)` | ✅ |
525
 
526
- ### Submission Links (Both Required)
527
 
528
- 1. **GitHub Repository:** [Dolphin-Syndrom/code-review-env](https://github.com/Dolphin-Syndrom/code-review-env)
529
- 2. **Hugging Face Space:** [Dolphin-Syndrom/code-review-env](https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env)
530
 
531
- ---
532
 
533
- ## 🔮 Extensibility
 
 
 
 
 
534
 
535
- Possible next steps for this benchmark:
536
 
537
- - **More languages** — Extend beyond Python to JavaScript, Go, Rust
538
- - **Multi-file reviews** — Cross-file dependency analysis
539
- - **Diff-based input** — Review git diffs instead of full files
540
- - **Severity grading** — Score severity accuracy, not just issue detection
541
- - **Exploit generation** — Ask agents to produce PoC exploits for found vulnerabilities
542
- - **Agentic tool use** — Let agents run linters, type checkers, or tests as sub-actions
543
-
544
- ---
545
 
546
- <div align="center">
 
 
 
 
547
 
548
- **Built with [OpenEnv](https://github.com/meta-pytorch/OpenEnv) by Meta** &nbsp;·&nbsp; **Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)**
549
 
550
- </div>
 
14
  - code-review
15
  ---
16
 
17
+ # Code Review OpenEnv Benchmark
 
18
 
19
+ ## 🚀 Scaler March 2026 Hackathon Submission
20
 
21
+ This project was built as part of the **Scaler March 2026 Hackathon**.
22
 
23
+ **Author:** Dolphin-Syndrom
24
+ **Type:** OpenEnv Benchmark Environment
25
+ **Focus:** Evaluating LLM agents on security-aware code review tasks
26
 
27
  ---
28
 
29
  ## ⚡ TL;DR
30
 
31
+ A benchmark environment for evaluating LLM agents on taxonomy-driven pull-request reviews.
32
 
33
+ - 3 tasks (easy → medium → hard)
34
+ - structured actions and observations
35
+ - reward-based learning signals
36
+ - deterministic grading (0.0–1.0)
37
+ - deployable via Docker on Hugging Face Spaces
38
+ - fully OpenEnv compliant
 
 
 
 
39
 
40
  ---
41
 
42
+ > Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops.
43
+ >
44
+ > Suitable for benchmarking agent performance, reward-shaping strategies, and detection accuracy, including how well agents avoid hallucinated false positives.
45
 
46
+ This environment models a real-world engineering task: pull-request review. It evaluates whether an agent can read code snippets, identify security vulnerabilities and logic errors using a fixed taxonomy, and articulate its findings clearly.
47
 
48
+ ## Overview
 
 
 
49
 
50
+ This project is designed for OpenEnv-style agent evaluation with:
51
 
52
+ - a real-world task instead of a toy problem
53
+ - typed `Observation`, `Action`, `Reward`, and `State` models
54
+ - `step()`, `reset()`, and `state()` APIs
55
+ - three tasks with deterministic graders
56
+ - dense reward shaping for partial progress
57
+ - a reproducible baseline `inference.py`
58
+ - a FastAPI server and Dockerfile for deployment
59
 
60
+ The environment models a practical workflow engineers actually do:
 
 
 
 
 
 
 
61
 
62
+ - static code analysis
63
+ - vulnerability detection
64
+ - taxonomy-driven bug tagging
65
+ - code review articulation
66
 
67
+ ## Environment Specification
68
 
69
+ ### Objective
70
 
71
+ For each episode, the agent sees a Python code snippet containing planted issues and must make structured decisions:
 
 
 
 
72
 
73
+ 1. identify issues using tags from a 12-item `ISSUE_TAXONOMY` (e.g., `null_pointer`, `sql_injection`, `race_condition`)
74
+ 2. assess overall severity (`low`, `medium`, `high`, `critical`)
75
+ 3. articulate its findings in a human-readable `review_comment`
76
 
77
+ Performance is measured in two ways:
78
 
79
+ - dense step rewards during interaction
80
+ - final deterministic grader scores between `0.0` and `1.0`
81
 
82
+ ### State
 
 
 
 
 
 
 
 
83
 
84
+ The internal environment state tracks:
85
 
86
+ - current task ID, file name, and planted issues
87
+ - episode ID and step count
88
+ - maximum allowed steps (3 per episode)
89
 
90
+ The full state is available through the OpenEnv `state()` API for debugging and validation, but the agent does not directly observe the ground-truth issues during normal play.
 
 
 
 
 
 
 
 
 
 
 
91
 
92
+ ### Observation Space
93
 
94
+ The agent receives:
95
 
96
+ - `task_id`
97
+ - `file_name`
98
+ - `task_description`
99
+ - `code_snippet`
100
+ - `feedback` from previous grading
101
+ - `step_number`
102
+ - `available_issue_tags`
103
 
104
+ ### Action Space
105
 
106
+ The environment accepts structured actions:
 
 
 
107
 
108
+ - `issues_found` (`list[str]`): tags selected from the 12-tag `ISSUE_TAXONOMY`
109
+ - `severity` (`str`): one of `low`, `medium`, `high`, `critical`
110
+ - `review_comment` (`str`): free-text explanation of the identified issues
111
 
112
+ Invalid or hallucinated tags are penalized as false positives.
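The action fields above can be sketched as a small typed container. This is a minimal illustration using the standard library only; the class name `ReviewActionSketch` and the helper `unknown_tags` are hypothetical, and the tag list is copied from the 12-tag table in the earlier README revision shown in this diff. The project's real model lives in `models.py` and may differ.

```python
from dataclasses import dataclass, field
from typing import List

# The 12 tags listed in the earlier README revision (assumed still current).
ISSUE_TAXONOMY = (
    "null_pointer", "missing_return", "type_error", "index_out_of_bounds",
    "sql_injection", "hardcoded_secret", "path_traversal",
    "missing_input_validation", "race_condition", "timing_attack",
    "improper_error_handling", "integer_overflow",
)
SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class ReviewActionSketch:
    """Hypothetical stand-in for the action model; not the project's class."""

    review_comment: str
    issues_found: List[str] = field(default_factory=list)
    severity: str = "low"

    def __post_init__(self) -> None:
        # Reject severities outside the four documented levels.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")

    def unknown_tags(self) -> List[str]:
        # Tags outside the taxonomy; the README says these count as false positives.
        return [tag for tag in self.issues_found if tag not in ISSUE_TAXONOMY]


action = ReviewActionSketch(
    review_comment="Null dereference risk on birthdate.",
    issues_found=["null_pointer", "buffer_overflow"],
    severity="medium",
)
print(action.unknown_tags())
```

Keeping validation on the client side is only a convenience here; the environment itself decides how invalid tags are scored.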
113
 
114
+ ### Episode Flow
 
 
 
 
115
 
116
+ 1. `reset()` loads a task and returns the initial observation
117
+ 2. the agent receives an observation with the code snippet
118
+ 3. the agent acts through `step(action)`
119
+ 4. the environment returns `(observation, reward, done, info)`
120
+ 5. the episode ends when the score ≥ 0.95 or the maximum step limit (3) is reached
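The termination rule in step 5 can be written down as a short sketch. The function names `episode_done` and `run_episode` are hypothetical and only illustrate the stated rule (score at or above 0.95, or three steps used); the real logic is in the environment implementation.

```python
from typing import Sequence


def episode_done(score: float, step_count: int,
                 max_steps: int = 3, threshold: float = 0.95) -> bool:
    """Return True once the score threshold or the step limit is reached."""
    return score >= threshold or step_count >= max_steps


def run_episode(step_scores: Sequence[float],
                max_steps: int = 3, threshold: float = 0.95) -> int:
    """Replay a list of per-step scores and return how many steps were used."""
    steps = 0
    for score in step_scores:
        steps += 1
        if episode_done(score, steps, max_steps, threshold):
            break
    return steps


# An agent that reaches 0.96 on its second attempt stops after two steps.
print(run_episode([0.50, 0.96, 1.00]))
```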
121
 
122
+ ## Tasks
 
 
 
 
 
 
123
 
124
+ ### Easy Task
125
 
126
+ Null Pointer & Missing Return.
127
 
128
+ - goal: evaluate `user_service.py` to catch a `.get()` missing a null check, and a missing return statement.
129
+ - grader: weighted string-matching and set intersection
130
 
131
+ ### Medium Task
132
 
133
+ SQL Injection & Hardcoded Secret.
134
 
135
+ - goal: evaluate `auth.py` to identify f-string SQL injection and a plaintext secret key.
136
+ - grader: weighted string-matching and set intersection
 
 
 
 
137
 
138
+ ### Hard Task
 
139
 
140
+ Race Condition, Error Handling & Timing Attack.
 
 
 
 
 
 
 
141
 
142
+ - goal: evaluate `payments.py` for non-atomic operations, bare excepts, and non-constant-time comparisons.
143
+ - grader: weighted string-matching and set intersection
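For context on the `timing_attack` tag, a reviewer would typically suggest replacing `==` on secret values with a constant-time comparison. The sketch below uses `hmac.compare_digest` from the Python standard library; the function name `tokens_match` is hypothetical and is not part of `payments.py`.

```python
import hmac


def tokens_match(expected: str, actual: str) -> bool:
    """Compare two secrets without leaking the mismatch position through timing."""
    # Encoding to bytes avoids the ASCII-only restriction on str arguments.
    return hmac.compare_digest(expected.encode("utf-8"), actual.encode("utf-8"))


print(tokens_match("token-123", "token-123"))
```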
 
 
144
 
145
+ ## Reward Design
146
 
147
+ **Summary:** Correct reviews score close to 1.0, while random or over-aggressive tagging loses 0.10 per false positive and ends up near 0.0, so the reward separates real detection from guessing.
148
 
149
+ The benchmark uses dense, shaped rewards so agents receive signal across the full trajectory instead of only at episode end.
150
 
151
+ Core components:
 
 
 
 
 
152
 
153
+ - recall reward (fractional points for correctly identified issues)
154
+ - quality bonus (+0.05 per correct issue with a matching keyword in the comment)
155
+ - precision penalty (-0.10 per hallucinated or false-positive issue)
156
 
157
+ This gives a better learning signal for agent training while the final graders still produce simple deterministic scores in the `0.0` to `1.0` range.
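The three components can be combined in a few lines. The following is a sketch under stated assumptions, not the code in `server/graders.py`: the function name `grade` is hypothetical, the keyword table is a small illustrative subset, and the result is clamped to `[0.0, 1.0]` to match the stated score range.

```python
from typing import Dict, Iterable, List

# Illustrative subset only; the real keyword lists may differ.
KEYWORDS: Dict[str, List[str]] = {
    "sql_injection": ["sql", "injection", "parameterize"],
    "hardcoded_secret": ["hardcoded", "secret", "credential"],
}


def grade(planted: Iterable[str], found: Iterable[str], comment: str) -> float:
    """Recall + 0.05 per explained hit - 0.10 per false positive, clamped to [0, 1]."""
    planted_set, found_set = set(planted), set(found)
    correct = planted_set & found_set
    false_positives = found_set - planted_set

    recall = len(correct) / len(planted_set) if planted_set else 0.0
    text = comment.lower()
    # A correct tag earns the bonus if any of its keywords appears in the comment.
    bonus = 0.05 * sum(
        1 for tag in correct if any(kw in text for kw in KEYWORDS.get(tag, []))
    )
    penalty = 0.10 * len(false_positives)
    return max(0.0, min(1.0, recall + bonus - penalty))


score = grade(
    planted=["sql_injection", "hardcoded_secret"],
    found=["sql_injection"],
    comment="SQL injection via f-string interpolation.",
)
print(round(score, 2))
```

In this example the agent finds one of two planted issues and explains it, so the recall term is 0.5 and the quality bonus adds 0.05.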
 
158
 
159
+ ## Dataset
 
 
 
160
 
161
+ The built-in dataset contains 3 distinct tasks covering a range of issues:
 
 
 
 
162
 
163
+ - `task_easy`: Logic errors (`null_pointer`, `missing_return`)
164
+ - `task_medium`: Security vulnerabilities (`sql_injection`, `hardcoded_secret`)
165
+ - `task_hard`: Robustness issues (`race_condition`, `improper_error_handling`, `timing_attack`)
166
 
167
+ ## Project Structure
168
 
169
+ ```text
170
+ .
171
+ ├── code_review_env/
172
+ │ ├── __init__.py
173
+ │ ├── client.py
174
+ │ ├── models.py
175
+ │ ├── inference.py
176
+ │ ├── openenv.yaml
177
+ │ ├── pyproject.toml
178
+ │ ├── requirements.txt
179
+ │ ├── Dockerfile
180
+ │ ├── README.md
181
+ │ ├── scripts/
182
+ │ │ └── validate-submission.sh
183
+ │ └── server/
184
+ │ ├── __init__.py
185
+ │ ├── app.py
186
+ │ ├── code_review_env_environment.py
187
+ │ ├── graders.py
188
+ │ ├── tasks.py
189
+ │ ├── requirements.txt
190
+ │ └── Dockerfile
 
 
 
 
 
 
191
  ```
192
 
193
+ ## Setup
 
 
 
 
 
 
 
 
 
194
 
195
+ From the repository root:
196
 
197
  ```bash
198
+ uv sync --frozen
199
+ # OR using pip:
200
+ pip install -r requirements.txt
201
+ pip install -r server/requirements.txt
 
 
 
 
 
202
  ```
203
 
204
+ ## Local Usage
205
+
206
+ ### Start the OpenEnv server
207
 
208
  ```bash
 
209
  uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
 
 
 
210
  ```
211
+ or without uv:
 
 
212
  ```bash
213
+ python -m code_review_env.server.app
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
214
  ```
215
 
216
+ ## Baseline Inference
 
 
217
 
218
+ The baseline script uses the OpenAI Python client and reads configuration from environment variables:
219
 
220
+ - `API_BASE_URL`
221
+ - `MODEL_NAME`
222
+ - `HF_TOKEN`
223
 
224
+ Example:
 
 
 
 
 
 
 
 
225
 
226
  ```bash
227
+ export API_BASE_URL=https://router.huggingface.co/v1
228
+ export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
229
+ export HF_TOKEN=your-token
 
 
 
 
230
  python inference.py
 
 
 
231
  ```
232
 
233
+ Behavior:
234
 
235
+ - tries an LLM-backed agent first
236
+ - falls back to deterministic heuristics when credentials or network access are unavailable
237
+ - emits structured `[START]`, `[STEP]`, and `[END]` logs
238
+ - safely handles rate-limit and transient LLM errors
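The structured log lines can be produced with plain string formatting. The field layout below follows the example output in the earlier README revision shown in this diff and is assumed to still apply; the helper names are hypothetical, and joining multiple rewards with commas is a guess rather than a documented rule.

```python
import json
from typing import Any, Dict, Optional, Sequence


def _flag(value: bool) -> str:
    # The log format uses lowercase booleans.
    return "true" if value else "false"


def start_line(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def step_line(step: int, action: Dict[str, Any], reward: float,
              done: bool, error: Optional[str] = None) -> str:
    action_json = json.dumps(action, separators=(",", ":"))
    err = error if error is not None else "null"
    return (f"[STEP] step={step} action={action_json} "
            f"reward={reward:.2f} done={_flag(done)} error={err}")


def end_line(success: bool, steps: int, score: float,
             rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)  # separator is an assumption
    return f"[END] success={_flag(success)} steps={steps} score={score:.3f} rewards={joined}"


print(start_line("task_easy", "code_review_env", "Qwen/Qwen2.5-72B-Instruct"))
print(step_line(1, {"issues_found": ["null_pointer"]}, 1.0, True))
print(end_line(True, 1, 1.0, [1.0]))
```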
 
239
 
240
+ ## Validation
241
 
242
+ ```bash
243
+ openenv validate .
244
+ ./scripts/validate-submission.sh http://localhost:8000 .
245
+ ```
246
 
247
+ ## 🔌 API Usage
248
 
249
+ All endpoints are OpenEnv-compatible and return structured JSON responses.
250
 
251
+ ### Health Check
252
+ GET /health
 
253
 
254
+ ### Reset Environment
255
+ POST /reset
256
+ Optional body:
257
+ ```json
258
+ {"task_id": "task_easy"}
259
  ```
260
 
261
+ ### Take Step
262
 
263
+ POST /step
264
+ Body:
265
 
266
+ ```json
267
+ {
268
+ "review_comment": "Null dereference risk on birthdate.",
269
+ "issues_found": ["null_pointer"],
270
+ "severity": "medium"
271
+ }
272
  ```
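The `/reset` and `/step` calls above can be prepared from Python with the standard library. This is a sketch that only builds the requests; the helper name `build_post` is hypothetical, and `http://localhost:8000` assumes the server was started locally as described under Local Usage.

```python
import json
from urllib import request


def build_post(base_url: str, path: str, payload: dict) -> request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


BASE_URL = "http://localhost:8000"  # assumption: local server on port 8000

reset_req = build_post(BASE_URL, "/reset", {"task_id": "task_easy"})
step_req = build_post(BASE_URL, "/step", {
    "review_comment": "Null dereference risk on birthdate.",
    "issues_found": ["null_pointer"],
    "severity": "medium",
})

# Sending requires a running server, so it is left commented out here:
# with request.urlopen(reset_req) as resp:
#     print(json.load(resp))
print(reset_req.get_method(), reset_req.full_url)
```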
273
 
274
+ ### Get State
 
 
 
275
 
276
+ GET /state
277
 
278
+ ## Docker
279
+
280
+ Build and run:
281
 
282
  ```bash
283
+ docker build -t code-review-openenv -f Dockerfile .
284
+ docker run -p 8000:8000 code-review-openenv
285
  ```
286
 
287
+ ## Hugging Face Spaces
288
 
289
+ This repo is structured for Docker-based deployment to Hugging Face Spaces.
290
 
291
+ Recommended setup:
292
 
293
+ - SDK: `Docker`
294
+ - hardware: CPU Basic is sufficient
295
+ - set `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` in Space secrets if you want the LLM baseline enabled
 
 
 
 
 
 
 
296
 
297
+ ## 🏁 Submission Status
298
 
299
+ This environment:
300
 
301
+ - passes OpenEnv validation
302
+ - successfully deploys via Docker on Hugging Face Spaces
303
+ - supports full agent interaction through API endpoints
304
+ - was tested end-to-end including inference and grading pipeline
305
 
306
+ Built, debugged, and deployed under hackathon constraints.
307
 
308
+ ---
 
 
 
 
 
 
 
 
 
 
309
 
310
+ ## 🔗 Links
311
 
312
+ - GitHub Repository: https://github.com/Dolphin-Syndrom/code-review-env
313
+ - Hugging Face Space: https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env
314
 
315
+ ## Why This Environment Fits The Problem Statement
316
 
317
+ - real-world utility: code review is a practical daily workflow
318
+ - three tasks with deterministic graders: easy, medium, hard
319
+ - meaningful reward shaping: partial progress, articulation bonuses, and hallucination penalties
320
+ - OpenEnv-compatible API and typed models
321
+ - baseline inference script included at repo root
322
+ - containerization included for deployment
323
 
324
+ ## Extensibility
325
 
326
+ Possible next steps:
 
 
 
 
 
 
 
327
 
328
+ - more languages such as JavaScript, Go, or Rust
329
+ - multi-file reviews and cross-file dependency analysis
330
+ - diff-based input instead of full files
331
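+ - severity grading: scoring severity accuracy, not just issue detection
332
+ - exploit generation: asking agents to produce proof-of-concept exploits for the vulnerabilities they find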
333
 
334
+ ## License
335
 
336
+ BSD-3-Clause