theaniketgiri commited on
Commit
319df19
·
1 Parent(s): 32cf861
Files changed (2) hide show
  1. README.md +458 -108
  2. openenv.yaml +155 -14
README.md CHANGED
@@ -1,184 +1,534 @@
1
  ---
2
- title: Code Review Environment
3
- emoji: 🧠
4
- colorFrom: blue
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- app_port: 8000
9
- tags:
10
- - openenv
11
- - reinforcement-learning
12
- - code-review
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
13
  ---
14
 
15
- # Code Review Environment (`code-review-env`)
16
 
17
- This OpenEnv environment simulates a real software engineering task: reviewing buggy Python code and identifying security and logic issues with fixed taxonomy tags.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
18
 
19
- ## Why this environment
20
 
21
- - Real-world utility: PR/code review is a common and valuable engineering workflow.
22
- - RL-friendly: structured action space with dense rewards and deterministic grader.
23
- - Progressive tasks: easy medium hard.
 
 
 
 
24
 
25
- ## Action space
26
 
27
- `ReviewAction` (`models.py`):
28
 
29
- - `review_comment` (`str`): human-readable explanation
30
- - `issues_found` (`list[str]`): tags from `ISSUE_TAXONOMY`
31
- - `severity` (`low|medium|high|critical`)
32
 
33
- ## Observation space
 
 
 
 
 
 
 
 
 
 
 
34
 
35
- `ReviewObservation` (`models.py`):
36
 
37
- - `task_id`, `file_name`, `task_description`
38
- - `code_snippet` to review
39
- - `feedback` after each step
40
- - `step_number`, `reward`, `done`, `metadata`
41
 
42
- ## Tasks
43
 
44
- - `task_easy`: `null_pointer`, `missing_return`
45
- - `task_medium`: `sql_injection`, `hardcoded_secret`
46
- - `task_hard`: `race_condition`, `improper_error_handling`, `timing_attack`
47
 
48
- Defined in `server/tasks.py`.
 
 
 
49
 
50
- ## Reward and grading
 
51
 
52
- Implemented in `server/graders.py`:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
53
 
54
- - `base_score = |correctly_found| / |planted_issues|`
55
- - `quality_bonus = +0.05` per correctly found issue with comment keywords
56
- - `precision_penalty = -0.1` per false-positive issue
57
- - final score clamped to `[0.0, 1.0]`
58
 
59
- ## Required endpoints
 
 
 
 
 
60
 
61
- - `GET /tasks`
62
- - `POST /grader`
63
- - `POST /baseline`
64
- - plus OpenEnv core endpoints (`/reset`, `/step`, `/state`, `/health`, `/ws`)
65
 
66
- ## Local setup
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
67
 
68
  ```bash
69
- cd /home/manvith/OpenEnv/code_review_env
 
 
 
 
 
 
 
70
  pip install -e .
71
- pip install -r server/requirements.txt
 
 
 
 
 
 
 
 
72
  uvicorn server.app:app --host 0.0.0.0 --port 8000
73
  ```
74
 
75
- ## Manual API smoke test
76
 
77
  ```bash
 
78
  curl http://localhost:8000/health
 
 
79
  curl http://localhost:8000/tasks
 
 
80
  curl -X POST http://localhost:8000/baseline
 
 
81
  curl -X POST http://localhost:8000/grader \
82
  -H "Content-Type: application/json" \
83
- -d '{"task_id":"task_easy","issues_found":["null_pointer"],"review_comment":"missing null check"}'
 
 
 
 
84
  ```
85
 
86
- ## Baseline inference script
87
-
88
- `inference.py` supports two modes:
89
-
90
- 1. LLM mode (if `OPENAI_API_KEY` is set)
91
- 2. Rule-based fallback (no key required)
92
-
93
- ### Mandatory Additional Instructions
94
 
95
- Before submitting, ensure the following variables are defined in your environment configuration:
96
 
97
- - `API_BASE_URL` The API endpoint for the LLM.
98
- - `MODEL_NAME` — The model identifier to use for inference.
99
- - `HF_TOKEN` — Your Hugging Face / API key.
100
 
101
- Additional mandatory requirements:
102
 
103
- - The inference script must be named `inference.py` and placed in the root directory of the project.
104
- - Participants must use OpenAI Client for all LLM calls using the variables above.
105
- - Participants must emit structured stdout logs following `[START]`, `[STEP]`, and `[END]` format from the sample guidance.
 
 
 
 
106
 
107
- This repository implements those requirements in `inference.py` while still supporting fallback behavior when no API key is provided.
108
 
109
  ```bash
110
- # Local fallback mode
111
  python inference.py
112
 
113
- # Local LLM mode
114
- API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-4o-mini HF_TOKEN=hf_xxx python inference.py
 
 
 
115
 
116
- # Against HF Space
117
- ENV_URL=https://<your-space>.hf.space python inference.py
118
  ```
119
 
120
- Structured logs example:
 
 
121
 
122
- ```text
123
- [START] {"env_url":"http://localhost:8000","api_base_url":"https://api.openai.com/v1","model_name":"gpt-4o-mini","mode":"rule_based","tasks":["task_easy","task_medium","task_hard"]}
124
- [STEP] {"task_id":"task_easy","mode":"rule_based","issues_found":["null_pointer","missing_return"],"score":1.0}
125
- [END] {"status":"success","results":{...}}
126
  ```
127
 
128
- Example output shape:
 
 
 
 
 
 
129
 
130
- ```json
131
- {
132
- "task_easy": {"score": 0.85, "issues_found": ["null_pointer", "missing_return"]},
133
- "task_medium": {"score": 0.7, "issues_found": ["sql_injection", "hardcoded_secret"]},
134
- "task_hard": {"score": 0.5, "issues_found": ["race_condition", "improper_error_handling"]}
135
- }
136
- ```
 
 
 
 
 
 
 
 
 
 
137
 
138
- ## Docker
139
 
140
  ```bash
141
- docker build -t code-review-env:latest -f server/Dockerfile .
 
 
 
142
  docker run -p 8000:8000 code-review-env:latest
143
  ```
144
 
145
- ## Deploy to Hugging Face Space
 
 
146
 
147
  ```bash
148
- openenv push --repo-id YOUR_USERNAME/code-review-env
 
149
  ```
150
 
151
- ## Pre-validation script
 
 
 
152
 
153
- Run the provided pre-validation script before final submission:
 
 
154
 
155
  ```bash
156
- cd /home/manvith/OpenEnv/code_review_env
157
- ./scripts/validate-submission.sh https://YOUR_USERNAME-code-review-env.hf.space .
158
  ```
159
 
160
- ## Submission links (both required)
 
 
 
 
161
 
162
- 1. GitHub repository URL
163
- 2. Hugging Face Space URL
 
 
 
 
 
 
 
 
 
 
164
 
165
- ## Folder layout
166
 
167
- ```text
168
- code_review_env/
169
- ├── __init__.py
170
- ├── client.py
171
- ├── inference.py
172
- ├── models.py
173
- ├── openenv.yaml
174
- ├── pyproject.toml
 
175
  ├── README.md
 
 
176
  └── server/
177
  ├── __init__.py
178
- ├── app.py
179
- ├── code_review_env_environment.py
180
- ├── graders.py
181
- ├── tasks.py
182
  ├── requirements.txt
183
  └── Dockerfile
184
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ <div align="center">
3
+
4
+ # Code Review Environment
5
+
6
+ ### An OpenEnv Benchmark for AI-Driven Pull-Request Review
7
+
8
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue?style=for-the-badge&logo=meta)](https://github.com/meta-pytorch/OpenEnv)
9
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
10
+ [![Docker](https://img.shields.io/badge/docker-ready-2496ED?style=for-the-badge&logo=docker&logoColor=white)](https://hub.docker.com)
11
+ [![License](https://img.shields.io/badge/license-BSD--3-green?style=for-the-badge)](LICENSE)
12
+
13
+ ---
14
+
15
+ **🚀 Scaler March 2026 Hackathon Submission**
16
+
17
+ **Author:** [Dolphin-Syndrom](https://github.com/Dolphin-Syndrom) &nbsp;|&nbsp; **Type:** OpenEnv Benchmark &nbsp;|&nbsp; **Focus:** Security-Aware Code Review
18
+
19
+ </div>
20
+
21
+ ---
22
+
23
+ ## ⚡ TL;DR
24
+
25
+ A benchmark environment that evaluates whether an LLM agent can review buggy Python code, identify security vulnerabilities and logic errors using a fixed taxonomy, and articulate its findings — just like a senior engineer doing a pull-request review.
26
+
27
+ - **3 progressive tasks** — easy → medium → hard
28
+ - **12-tag issue taxonomy** — from `null_pointer` to `timing_attack`
29
+ - **Deterministic multi-dimensional grading** — recall + precision + articulation quality
30
+ - **Dense reward shaping** — signal at every step, not just episode end
31
+ - **Structured actions & observations** — typed Pydantic models with full schema
32
+ - **Dual inference modes** — LLM-backed or rule-based fallback
33
+ - **Deployable** via Docker on Hugging Face Spaces
34
+ - **Fully OpenEnv compliant** — passes `openenv validate`
35
+
36
+ ---
37
+
38
+ > *Designed to evaluate whether AI agents can perform structured, taxonomy-driven code review under constrained interaction loops — a high-value, real-world software engineering workflow.*
39
+
40
+ ---
41
+
42
+ ## 📑 Table of Contents
43
+
44
+ - [Environment Description](#-environment-description)
45
+ - [Why This Domain?](#-why-this-domain)
46
+ - [Action Space](#-action-space)
47
+ - [Observation Space](#-observation-space)
48
+ - [Reward Function](#-reward-function)
49
+ - [Tasks](#-tasks)
50
+ - [Setup & Usage](#-setup--usage)
51
+ - [Running the Baseline](#-running-the-baseline)
52
+ - [Baseline Scores](#-baseline-scores)
53
+ - [Deployment (HF Spaces + Docker)](#-deployment)
54
+
55
+ ---
56
+
57
+ ## 🧠 Environment Description
58
+
59
+ This OpenEnv environment simulates a real software engineering task: **reviewing buggy Python code and identifying security and logic issues** using a fixed taxonomy of tags.
60
+
61
+ Each episode presents the agent with a Python code snippet containing **planted vulnerabilities**. The agent must:
62
+
63
+ 1. **Identify** the issues using tags from a 12-item taxonomy
64
+ 2. **Assess** overall severity (`low` / `medium` / `high` / `critical`)
65
+ 3. **Articulate** its findings in a human-readable review comment
66
+
67
+ Performance is measured by a **deterministic, multi-dimensional grader** that scores recall, penalizes false positives, and rewards articulation quality — producing a final score in `[0.0, 1.0]`.
68
+
69
+ ### Episode Flow
70
+
71
+ ```
72
+ ┌─────────┐ ┌──────────────┐ ┌────────────┐ ┌────────────┐
73
+ │ reset() │────▶│ Observation │────▶│ Agent Act │────▶│ Grading │
74
+ │ (task_id)│ │ code_snippet │ │ issues_found│ │ score/done │
75
+ └─────────┘ │ file_name │ │ comment │ └─────┬──────┘
76
+ │ description │ │ severity │ │
77
+ └──────────────┘ └────────────┘ ┌─────▼──────┐
78
+ │ Feedback │
79
+ │ + next obs │
80
+ └────────────┘
81
+ ```
82
+
83
+ - `reset(task_id)` loads a task and returns the initial observation
84
+ - `step(action)` grades the agent's review and returns `(observation, reward, done)`
85
+ - Episode ends when score ≥ 0.95 **or** the step limit (3) is reached
86
+
87
+ ### Internal State
88
+
89
+ The environment tracks:
90
+ - Current task ID, file name, and planted issues
91
+ - Episode ID and step count
92
+ - Maximum allowed steps (3 per episode)
93
+
94
+ The full state is available through the OpenEnv `state()` API for debugging, but the agent **does not** observe the ground-truth issues during normal play.
95
+
96
+ ---
97
+
98
+ ## 🎯 Why This Domain?
99
+
100
+ | Criteria | How Code Review Fits |
101
+ |----------|---------------------|
102
+ | **Real-world utility** | PR review is a daily, high-value engineering workflow |
103
+ | **RL-friendly** | Structured action space with dense rewards and a deterministic grader |
104
+ | **Progressive difficulty** | Easy → Medium → Hard with increasing issue complexity |
105
+ | **Measurable precision** | False positives are explicitly penalized — no reward hacking |
106
+ | **Articulation matters** | Bonus for explaining *why* an issue exists, not just tagging it |
107
+ | **Security relevance** | Covers OWASP-style vulnerabilities (SQLi, hardcoded secrets, timing attacks) |
108
+
109
  ---
110
 
111
+ ## 🕹️ Action Space
112
 
113
+ The agent submits a `ReviewAction` (defined in `models.py`) with three fields:
114
+
115
+ | Field | Type | Description |
116
+ |-------|------|-------------|
117
+ | `review_comment` | `str` | Human-readable explanation of identified issues and suggested fixes |
118
+ | `issues_found` | `list[str]` | Issue tags selected from the 12-tag `ISSUE_TAXONOMY` |
119
+ | `severity` | `"low"` \| `"medium"` \| `"high"` \| `"critical"` | Overall severity assessment |
120
+
121
+ ### Issue Taxonomy (12 Tags)
122
+
123
+ ```
124
+ ┌──────────────────────┬──────────────────────────┬────────────────────────┐
125
+ │ Logic Errors │ Security Vulns │ Robustness │
126
+ ├──────────────────────┼──────────────────────────┼────────────────────────┤
127
+ │ null_pointer │ sql_injection │ race_condition │
128
+ │ missing_return │ hardcoded_secret │ timing_attack │
129
+ │ type_error │ path_traversal │ improper_error_ │
130
+ │ index_out_of_bounds │ missing_input_ │ handling │
131
+ │ │ validation │ integer_overflow │
132
+ └──────────────────────┴──────────────────────────┴────────────────────────┘
133
+ ```
134
 
135
+ ### Example Action
136
 
137
+ ```json
138
+ {
139
+ "review_comment": "The function uses .get() for birthdate but doesn't guard against None before arithmetic. Also, the function builds a profile dict but never returns it.",
140
+ "issues_found": ["null_pointer", "missing_return"],
141
+ "severity": "high"
142
+ }
143
+ ```
144
 
145
+ ---
146
 
147
+ ## 👁️ Observation Space
148
 
149
+ The agent receives a `ReviewObservation` (defined in `models.py`) after every `reset()` and `step()`:
 
 
150
 
151
+ | Field | Type | Description |
152
+ |-------|------|-------------|
153
+ | `task_id` | `str` | Task identifier (`task_easy`, `task_medium`, `task_hard`) |
154
+ | `file_name` | `str` | Simulated file under review (e.g., `auth.py`) |
155
+ | `task_description` | `str` | Instructions for the agent |
156
+ | `code_snippet` | `str` | Python source code containing planted bugs |
157
+ | `feedback` | `str` | Grading breakdown after each step |
158
+ | `step_number` | `int` | Current step in the episode (0-indexed) |
159
+ | `available_issue_tags` | `list[str]` | Full taxonomy for reference |
160
+ | `reward` | `float` | Score from the grader (0.0 – 1.0) |
161
+ | `done` | `bool` | Whether the episode has ended |
162
+ | `metadata` | `dict` | Diagnostics: correctly found, missed, false positives |
163
 
164
+ ---
165
 
166
+ ## 📊 Reward Function
 
 
 
167
 
168
+ The environment uses a **deterministic, multi-dimensional** reward function implemented in `server/graders.py`. Agents receive dense signal at every step.
169
 
170
+ ### Formula
 
 
171
 
172
+ ```
173
+ base_score = |correctly_found ∩ planted| / |planted|
174
+ quality_bonus = +0.05 × (# correct issues with keyword match in comment)
175
+ precision_penalty = −0.10 × (# false-positive issues)
176
 
177
+ final_score = clamp(base_score + quality_bonus − precision_penalty, 0.0, 1.0)
178
+ ```
179
 
180
+ ### Components Explained
181
+
182
+ | Component | Value | Purpose |
183
+ |-----------|-------|---------|
184
+ | **Recall Reward** | `|correct| / |planted|` | Primary signal — find what's actually broken |
185
+ | **Quality Bonus** | `+0.05` per issue | Rewards articulation — mentioning *why* an issue matters |
186
+ | **Precision Penalty** | `−0.10` per FP | Discourages hallucinated / over-aggressive flagging |
187
+
188
+ ### Keyword Bonus Examples
189
+
190
+ | Issue Tag | Triggering Keywords |
191
+ |-----------|-------------------|
192
+ | `sql_injection` | sql, injection, f-string, sanitize, parameterize |
193
+ | `hardcoded_secret` | hardcoded, secret, credential, env var, plaintext |
194
+ | `race_condition` | race, atomic, concurrent, lock, thread |
195
+ | `timing_attack` | timing, constant time, hmac, compare_digest |
196
+ | `improper_error_handling` | except, swallow, silent, bare except |
197
+
198
+ > **Design rationale:** Random or naive strategies produce low scores (missed issues + penalties). Agents must demonstrate both *detection accuracy* and *communication quality* to score well.
199
+
200
+ ---
201
+
202
+ ## 📝 Tasks
203
+
204
+ Three tasks with progressive difficulty. Each task presents a different Python file with distinct planted vulnerabilities.
205
+
206
+ ### `task_easy` — User Service (`user_service.py`)
207
+
208
+ | Property | Value |
209
+ |----------|-------|
210
+ | **Planted Issues** | `null_pointer`, `missing_return` |
211
+ | **Difficulty** | Easy |
212
+ | **Description** | Review a function that uses `.get()` without null-guarding and never returns its result |
213
+ | **Grader** | Deterministic — recall + quality bonus − FP penalty |
214
+
215
+ <details>
216
+ <summary>📄 Code Snippet</summary>
217
+
218
+ ```python
219
+ def get_user_age(user):
220
+ # Returns age in years from user profile dict
221
+ birthdate = user.get("birthdate")
222
+ if user.get("is_active"):
223
+ account_label = f"active:{user.get('id')}"
224
+ else:
225
+ account_label = "inactive"
226
+
227
+ age = (datetime.now() - birthdate).days // 365
228
+ profile = {"label": account_label, "age": age}
229
+ # TODO: return something
230
+ ```
231
+
232
+ </details>
233
+
234
+ ---
235
 
236
+ ### `task_medium` Auth Module (`auth.py`)
 
 
 
237
 
238
+ | Property | Value |
239
+ |----------|-------|
240
+ | **Planted Issues** | `sql_injection`, `hardcoded_secret` |
241
+ | **Difficulty** | Medium |
242
+ | **Description** | Review an authentication function with f-string SQL interpolation and a plaintext secret key |
243
+ | **Grader** | Deterministic — recall + quality bonus − FP penalty |
244
 
245
+ <details>
246
+ <summary>📄 Code Snippet</summary>
 
 
247
 
248
+ ```python
249
+ SECRET_KEY = "supersecret123" # used for JWT signing
250
+
251
+ def authenticate_user(db_conn, username, password):
252
+ query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
253
+ result = db_conn.execute(query)
254
+ user = result.fetchone()
255
+
256
+ if user:
257
+ audit_line = f"auth ok for {username}"
258
+ token = jwt.encode({"user_id": user.id}, SECRET_KEY)
259
+ return token
260
+ ```
261
+
262
+ </details>
263
+
264
+ ---
265
+
266
+ ### `task_hard` — Payment Processing (`payments.py`)
267
+
268
+ | Property | Value |
269
+ |----------|-------|
270
+ | **Planted Issues** | `race_condition`, `improper_error_handling`, `timing_attack` |
271
+ | **Difficulty** | Hard |
272
+ | **Description** | Review a payment function for non-atomic balance ops, silently swallowed exceptions, and non-constant-time comparisons |
273
+ | **Grader** | Deterministic — recall + quality bonus − FP penalty |
274
+
275
+ <details>
276
+ <summary>📄 Code Snippet</summary>
277
+
278
+ ```python
279
+ def process_payment(user_id, amount, card_token):
280
+ user = db.get_user(user_id)
281
+ if user.balance >= amount:
282
+ user.balance -= amount # checked and modified non-atomically
283
+ db.save_user(user)
284
+
285
+ try:
286
+ charge_result = payment_gateway.charge(card_token, amount)
287
+ except:
288
+ pass # silently swallow all payment errors
289
+
290
+ expected = db.get_token_hash(card_token)
291
+ actual = hash(card_token)
292
+ if expected == actual: # non-constant-time comparison
293
+ return {"status": "success", "charge": charge_result}
294
+ ```
295
+
296
+ </details>
297
+
298
+ ---
299
+
300
+ ## 🔧 Setup & Usage
301
+
302
+ ### Prerequisites
303
+
304
+ - Python ≥ 3.10
305
+ - [uv](https://github.com/astral-sh/uv) (recommended) or pip
306
+
307
+ ### Install Dependencies
308
 
309
  ```bash
310
+ # Clone the repository
311
+ git clone https://github.com/Dolphin-Syndrom/code-review-env.git
312
+ cd code-review-env
313
+
314
+ # Using uv (recommended)
315
+ uv sync
316
+
317
+ # Or using pip
318
  pip install -e .
319
+ ```
320
+
321
+ ### Start the Server
322
+
323
+ ```bash
324
+ # Using uv
325
+ uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
326
+
327
+ # Or using pip-installed packages
328
  uvicorn server.app:app --host 0.0.0.0 --port 8000
329
  ```
330
 
331
+ ### Verify It's Running
332
 
333
  ```bash
334
+ # Health check
335
  curl http://localhost:8000/health
336
+
337
+ # List all tasks
338
  curl http://localhost:8000/tasks
339
+
340
+ # Run the built-in baseline
341
  curl -X POST http://localhost:8000/baseline
342
+
343
+ # Score a custom submission
344
  curl -X POST http://localhost:8000/grader \
345
  -H "Content-Type: application/json" \
346
+ -d '{
347
+ "task_id": "task_easy",
348
+ "issues_found": ["null_pointer", "missing_return"],
349
+ "review_comment": "Null dereference risk on birthdate and missing return statement"
350
+ }'
351
  ```
352
 
353
+ ---
 
 
 
 
 
 
 
354
 
355
+ ## 🤖 Running the Baseline
356
 
357
+ The baseline script (`inference.py`) supports two modes and follows the **mandatory OpenEnv stdout format**.
 
 
358
 
359
+ ### Environment Variables
360
 
361
+ | Variable | Required | Default | Description |
362
+ |----------|----------|---------|-------------|
363
+ | `API_BASE_URL` | No | `https://router.huggingface.co/v1` | LLM API endpoint |
364
+ | `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
365
+ | `HF_TOKEN` | For LLM mode | — | Hugging Face / API key |
366
+ | `ENV_URL` | No | `http://localhost:8000` | Environment server URL |
367
+ | `IMAGE_NAME` | No | — | Docker image name (if using `from_docker_image()`) |
368
 
369
+ ### Run Locally
370
 
371
  ```bash
372
+ # Rule-based fallback (no API key needed)
373
  python inference.py
374
 
375
+ # LLM-backed mode
376
+ API_BASE_URL=https://router.huggingface.co/v1 \
377
+ MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
378
+ HF_TOKEN=hf_your_token_here \
379
+ python inference.py
380
 
381
+ # Against a deployed HF Space
382
+ ENV_URL=https://your-space.hf.space python inference.py
383
  ```
384
 
385
+ ### Structured Stdout Output (Mandatory)
386
+
387
+ The script emits exactly three line types for the OpenEnv validator:
388
 
389
+ ```
390
+ [START] task=task_easy env=code_review_env model=Qwen/Qwen2.5-72B-Instruct
391
+ [STEP] step=1 action={"issues_found":["null_pointer","missing_return"],...} reward=1.00 done=true error=null
392
+ [END] success=true steps=1 score=1.000 rewards=1.00
393
  ```
394
 
395
+ **Rules:**
396
+ - One `[START]` at episode begin
397
+ - One `[STEP]` per step, immediately after `env.step()` returns
398
+ - One `[END]` after episode completes (always emitted, even on exception)
399
+ - `reward` and `rewards` formatted to 2 decimal places
400
+ - `done` and `success` are lowercase booleans
401
+ - Each task score must be in `(0.0, 1.0)` — strictly between, not exactly 0 or 1
402
 
403
+ ---
404
+
405
+ ## 📈 Baseline Scores
406
+
407
+ Performance of the built-in rule-based baseline (no LLM required):
408
+
409
+ | Task | Difficulty | Issues Detected | Score |
410
+ |------|-----------|----------------|-------|
411
+ | `task_easy` | 🟢 Easy | `null_pointer`, `missing_return` | **1.00** |
412
+ | `task_medium` | 🟡 Medium | `sql_injection`, `hardcoded_secret` | **1.00** |
413
+ | `task_hard` | 🔴 Hard | `race_condition`, `timing_attack`, `improper_error_handling` | **1.00** |
414
+
415
+ > The rule-based baseline uses pattern-matching heuristics (e.g., detecting `.get(` for null pointers, `f"select` for SQL injection). LLM agents are expected to **match or exceed** these scores while providing richer, more actionable review comments.
416
+
417
+ ---
418
+
419
+ ## 🚀 Deployment
420
 
421
+ ### Docker
422
 
423
  ```bash
424
+ # Build
425
+ docker build -t code-review-env:latest .
426
+
427
+ # Run
428
  docker run -p 8000:8000 code-review-env:latest
429
  ```
430
 
431
+ ### Hugging Face Spaces
432
+
433
+ This repo is structured for Docker-based deployment to HF Spaces.
434
 
435
  ```bash
436
+ # Using the OpenEnv CLI
437
+ openenv push --repo-id Dolphin-Syndrom/code-review-env
438
  ```
439
 
440
+ **Recommended Space Settings:**
441
+ - **SDK:** Docker
442
+ - **Hardware:** CPU Basic (sufficient)
443
+ - **Secrets:** Set `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` if you want LLM baseline enabled
444
 
445
+ ### Pre-Validation
446
+
447
+ Run the validation script before submitting:
448
 
449
  ```bash
450
+ ./scripts/validate-submission.sh https://Dolphin-Syndrom-code-review-env.hf.space .
 
451
  ```
452
 
453
+ ---
454
+
455
+ ## 🔌 API Reference
456
+
457
+ All endpoints are OpenEnv-compatible and return structured JSON.
458
 
459
+ | Method | Endpoint | Description |
460
+ |--------|----------|-------------|
461
+ | `GET` | `/health` | Health check |
462
+ | `GET` | `/tasks` | List all tasks with schemas |
463
+ | `POST` | `/reset` | Reset episode for a given task |
464
+ | `POST` | `/step` | Submit a review action |
465
+ | `GET` | `/state` | Get current episode state |
466
+ | `POST` | `/grader` | Score an action against a task |
467
+ | `POST` | `/baseline` | Run built-in rule-based baseline across all tasks |
468
+ | `WS` | `/ws` | WebSocket for real-time interaction |
469
+
470
+ ---
471
 
472
+ ## 📁 Project Structure
473
 
474
+ ```
475
+ code-review-env/
476
+ ├── __init__.py # Package exports
477
+ ├── client.py # CodeReviewEnv — WebSocket client (EnvClient subclass)
478
+ ├── models.py # ReviewAction, ReviewObservation, ReviewState, ISSUE_TAXONOMY
479
+ ├── inference.py # Baseline inference (LLM + rule-based fallback)
480
+ ├── openenv.yaml # OpenEnv manifest with grader blocks
481
+ ├── pyproject.toml # Project metadata & dependencies
482
+ ├── Dockerfile # Production container
483
  ├── README.md
484
+ ├── scripts/
485
+ │ └── validate-submission.sh # Pre-submission validator
486
  └── server/
487
  ├── __init__.py
488
+ ├── app.py # FastAPI server + Gradio dashboard
489
+ ├── code_review_env_environment.py # Environment implementation (reset/step/state)
490
+ ├── graders.py # Deterministic grading logic
491
+ ├── tasks.py # Task definitions with planted issues
492
  ├── requirements.txt
493
  └── Dockerfile
494
  ```
495
+
496
+ ---
497
+
498
+ ## 🏁 Submission Checklist
499
+
500
+ | # | Check | Status |
501
+ |---|-------|--------|
502
+ | 1 | Docker build succeeds | ✅ |
503
+ | 2 | `POST /reset` returns 200 | ✅ |
504
+ | 3 | 3 tasks with `grader:` blocks in `openenv.yaml` | ✅ |
505
+ | 4 | `inference.py` exits with code 0 | ✅ |
506
+ | 5 | `[START]`, `[STEP]`, `[END]` in stdout | ✅ |
507
+ | 6 | LLM calls via `API_BASE_URL` proxy | ✅ |
508
+ | 7 | All task scores strictly in `(0.0, 1.0)` | ✅ |
509
+
510
+ ### Submission Links (Both Required)
511
+
512
+ 1. **GitHub Repository:** [Dolphin-Syndrom/code-review-env](https://github.com/Dolphin-Syndrom/code-review-env)
513
+ 2. **Hugging Face Space:** [Dolphin-Syndrom/code-review-env](https://huggingface.co/spaces/Dolphin-Syndrom/code-review-env)
514
+
515
+ ---
516
+
517
+ ## 🔮 Extensibility
518
+
519
+ Possible next steps for this benchmark:
520
+
521
+ - **More languages** — Extend beyond Python to JavaScript, Go, Rust
522
+ - **Multi-file reviews** — Cross-file dependency analysis
523
+ - **Diff-based input** — Review git diffs instead of full files
524
+ - **Severity grading** — Score severity accuracy, not just issue detection
525
+ - **Exploit generation** — Ask agents to produce PoC exploits for found vulnerabilities
526
+ - **Agentic tool use** — Let agents run linters, type checkers, or tests as sub-actions
527
+
528
+ ---
529
+
530
+ <div align="center">
531
+
532
+ **Built with [OpenEnv](https://github.com/meta-pytorch/OpenEnv) by Meta** &nbsp;·&nbsp; **Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)**
533
+
534
+ </div>
openenv.yaml CHANGED
@@ -1,19 +1,160 @@
1
- spec_version: 1
2
  name: code-review-env
3
- version: "1.0.0"
4
  description: >
5
- A code review agent environment where an AI agent reads buggy Python code
6
- and learns to identify security vulnerabilities, logic errors, and code smells.
7
- Simulates the real-world software engineering task of pull request review.
 
8
  author: Dolphin-Syndrom
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9
  tasks:
10
- - task_easy
11
- - task_medium
12
- - task_hard
13
- sdk: gradio
14
- sdk_version: "4.0"
15
- type: space
16
- runtime: fastapi
17
- app: server.app:app
18
- port: 8000
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
 
 
 
 
 
 
 
1
  name: code-review-env
2
+ version: 1.0.0
3
  description: >
4
+ An OpenEnv benchmark where an AI agent reviews buggy Python code and learns
5
+ to identify security vulnerabilities, logic errors, and code smells using a
6
+ fixed taxonomy of issue tags. Simulates the real-world software engineering
7
+ task of pull-request review with deterministic, multi-dimensional grading.
8
  author: Dolphin-Syndrom
9
+ license: BSD-3-Clause
10
+
11
+ spec:
12
+ observation_space:
13
+ task_id:
14
+ type: string
15
+ description: Current task identifier (task_easy, task_medium, task_hard)
16
+ file_name:
17
+ type: string
18
+ description: File name associated with the code snippet under review
19
+ task_description:
20
+ type: string
21
+ description: Instructions describing what the agent should review and return
22
+ code_snippet:
23
+ type: string
24
+ description: Python code snippet containing planted issues for review
25
+ feedback:
26
+ type: string
27
+ description: Grading feedback for the most recent action or startup guidance after reset
28
+ step_number:
29
+ type: integer
30
+ description: Current step number within the episode (starts at 0 after reset)
31
+ available_issue_tags:
32
+ type: array
33
+ description: >
34
+ Allowed issue tags the agent can use in issues_found —
35
+ null_pointer, missing_return, type_error, index_out_of_bounds,
36
+ sql_injection, hardcoded_secret, missing_input_validation,
37
+ race_condition, timing_attack, improper_error_handling,
38
+ integer_overflow, path_traversal
39
+
40
+ action_space:
41
+ review:
42
+ type: object
43
+ properties:
44
+ review_comment:
45
+ type: string
46
+ description: Human-readable review explaining identified issues and suggested fixes
47
+ issues_found:
48
+ type: array
49
+ items:
50
+ type: string
51
+ description: List of issue tags found by the agent, chosen from ISSUE_TAXONOMY
52
+ severity:
53
+ type: string
54
+ enum: [low, medium, high, critical]
55
+ description: Overall severity level assessed by the agent
56
+ required: [review_comment, issues_found, severity]
57
+
58
+ reward_range: [0.0, 1.0]
59
+ max_steps: 3
60
+
61
  tasks:
62
+ task_easy:
63
+ name: Easy — Null Pointer & Missing Return
64
+ description: >
65
+ Review a simple user-service function for a null-pointer dereference
66
+ and a missing return statement.
67
+ difficulty: easy
68
+ planted_issues: [null_pointer, missing_return]
69
+ grader:
70
+ type: deterministic
71
+ scoring: >
72
+ base = |correct ∩ planted| / |planted|;
73
+ bonus = +0.05 per correct issue with keyword match in comment;
74
+ penalty = −0.1 per false-positive;
75
+ score = clamp(base + bonus − penalty, 0.0, 1.0)
76
+
77
+ task_medium:
78
+ name: Medium — SQL Injection & Hardcoded Secret
79
+ description: >
80
+ Review an authentication module for SQL injection via f-string
81
+ interpolation and a hardcoded secret key.
82
+ difficulty: medium
83
+ planted_issues: [sql_injection, hardcoded_secret]
84
+ grader:
85
+ type: deterministic
86
+ scoring: >
87
+ base = |correct ∩ planted| / |planted|;
88
+ bonus = +0.05 per correct issue with keyword match in comment;
89
+ penalty = −0.1 per false-positive;
90
+ score = clamp(base + bonus − penalty, 0.0, 1.0)
91
+
92
+ task_hard:
93
+ name: Hard — Race Condition, Error Handling & Timing Attack
94
+ description: >
95
+ Review a payment-processing function for a non-atomic
96
+ balance check-and-decrement (race condition), a bare except that
97
+ silently swallows payment errors, and a non-constant-time
98
+ token comparison (timing attack).
99
+ difficulty: hard
100
+ planted_issues: [race_condition, improper_error_handling, timing_attack]
101
+ grader:
102
+ type: deterministic
103
+ scoring: >
104
+ base = |correct ∩ planted| / |planted|;
105
+ bonus = +0.05 per correct issue with keyword match in comment;
106
+ penalty = −0.1 per false-positive;
107
+ score = clamp(base + bonus − penalty, 0.0, 1.0)
108
+
109
+ reward_function:
110
+ summary:
111
+ - Dense rewards are provided per step so agents receive signal across the full trajectory.
112
+ - Final task scores are deterministic and normalized to 0.0–1.0 by the graders.
113
+ components:
114
+ recall_reward:
115
+ description: >
116
+ Fractional reward proportional to |correctly found issues| / |planted issues|.
117
+ This is the primary learning signal encouraging comprehensive detection.
118
+ quality_bonus:
119
+ value: +0.05
120
+ description: >
121
+ Per correctly-found issue whose associated keywords appear in the
122
+ agent's free-text review_comment (e.g. "sql" for sql_injection).
123
+ precision_penalty:
124
+ value: -0.10
125
+ description: >
126
+ Per false-positive issue tag submitted. Discourages hallucinated
127
+ or overly aggressive flagging.
128
+
129
+ server:
130
+ host: 0.0.0.0
131
+ port: 8000
132
+ entrypoint: server.app:app
133
+ endpoints:
134
+ - GET /health
135
+ - GET /tasks
136
+ - POST /reset
137
+ - POST /step
138
+ - GET /state
139
+ - POST /grader
140
+ - POST /baseline
141
+ - GET /ws
142
+
143
+ dependencies:
144
+ python: ">=3.10"
145
+ packages:
146
+ - openenv-core[core]>=0.2.2
147
+ - openai>=1.0
148
+ - httpx>=0.24.0
149
+ - plotly>=6.6.0
150
+ - pandas>=2.3.3
151
+ - gradio>=4.0
152
+ - pydantic>=2.0.0
153
+ - uvicorn>=0.24.0
154
+ - fastapi>=0.104.0
155
 
156
+ validation:
157
+ openenv_spec: true
158
+ docker_build: true
159
+ baseline_reproducible: true
160
+ tasks_count: 3