Nitish committed on
Commit
561b3cf
·
1 Parent(s): 9b6b258

feat: multi-step env, pickle deserialization hard task, rebalanced difficulty


- Convert to 2-step episode: Phase 1=request_file (+0.20), Phase 2=bug review
- Replace python-sql-injection (hard) with python-pickle-deserialization (RCE)
to properly challenge LLMs below 0.80 baseline
- Add per-task keyword_target_override to grader for fair js-auth scoring
- Add conversation history to inference.py LLM calls for multi-turn context
- Fix parse_json_from_llm to scan for last valid JSON object (ignores code blocks)
- Clamp episode score to [0.0, 1.0] in END log
- Update openenv.yaml: max_steps=2, two-phase action space documented
- Rewrite README: multi-step walkthrough, updated baseline scores, reward table

Files changed (8)
  1. README.md +79 -33
  2. inference.py +54 -27
  3. openenv.yaml +15 -17
  4. output.txt +13 -0
  5. server/environment.py +21 -2
  6. server/grader.py +2 -1
  7. server/models.py +8 -7
  8. server/tasks.py +22 -21
README.md CHANGED
@@ -5,13 +5,15 @@ colorFrom: gray
5
  colorTo: purple
6
  sdk: docker
7
  pinned: false
 
 
8
  ---
9
 
10
  # Code Security Review – OpenEnv Environment
11
 
12
  An RL environment for training AI agents to perform real-world code security review.
13
- Agents analyze code snippets from production pull requests and identify bugs,
14
- vulnerabilities, and security issues.
15
 
16
  Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
17
 
@@ -23,9 +25,9 @@ Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
23
  |---|---|
24
  | Tasks | 3 (easy → medium → hard) |
25
  | Languages | Python, JavaScript |
26
- | Action space | Structured JSON (6 fields) |
27
- | Reward range | 0.0 – 1.0 |
28
- | Steps per episode | 1 |
29
 
30
  ---
31
 
@@ -35,65 +37,109 @@ Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
35
  |---|---|---|---|
36
  | `python-off-by-one` | Python | Off-by-one index error | Easy |
37
  | `js-auth-privilege` | JavaScript | Logic flaw – privilege escalation | Medium |
38
- | `python-sql-injection` | Python | SQL injection via f-string | Hard |
39
 
40
  ---
41
 
42
- ## Action Space
43
 
44
- The agent submits a JSON action with these fields:
45
 
 
 
 
 
 
 
 
 
 
46
  ```json
47
  {
48
  "bug_identified": true,
49
  "bug_location": "line 3 – range(len(transactions) + 1)",
50
- "bug_type": "logic-error",
51
  "bug_description": "Off-by-one error causes IndexError on last iteration...",
52
  "severity": "medium",
53
  "suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
54
  }
55
  ```
56
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
57
  ## Observation Space
58
 
59
  ```json
60
  {
61
- "task_id": "python-sql-injection",
62
  "language": "Python",
63
  "difficulty": "hard",
64
- "code_snippet": "def search_users(db, search_term):\n ...",
65
- "context": "REST API endpoint that searches users by name",
66
- "pr_title": "Add user search endpoint to REST API",
67
- "file_path": "api/users.py"
68
  }
69
  ```
 
70
 
71
  ---
72
 
73
  ## Reward Breakdown
74
 
75
- | Component | Max Score |
76
- |---|---|
77
- | Bug identified | 0.20 |
78
- | Bug type correct | 0.20 |
79
- | Bug location correct | 0.10 |
80
- | Description quality | 0.25 |
81
- | Fix quality | 0.15 |
82
- | Severity correct | 0.10 |
83
- | **Total** | **1.00** |
84
-
85
- The grader penalises keyword stuffing – incoherent keyword dumps score ≤ 0.20.
 
 
86
 
87
  **Example Calculation:**
88
- If the agent correctly identifies a bug (+0.20), misidentifies the type (+0.0), finds 50% of the location keywords (+0.05), writes a detailed and coherent description matching most keywords (+0.25), suggests a partially correct fix (+0.08), and gets the severity correct (+0.10), the total reward for that step would be `0.20 + 0.0 + 0.05 + 0.25 + 0.08 + 0.10 = 0.68`.
 
 
89
 
90
  ---
91
 
92
  ## Edge Cases
93
 
94
- - **At step 0:** `reset()` must be called to initialize the state. If `step()` is called before `reset()`, the environment automatically calls `reset()` internally and evaluates the action on a random task.
95
- - **Max step limit:** The maximum step limit is 1. Calling `step()` evaluates the action and immediately sets `done=True`.
96
- - **At done=True:** Calling `step()` returns `reward=0.0`, `done=True`, and a clean error message in the `info` dict `("Episode already completed. Call /reset...")` indicating the episode is complete without auto-resetting.
 
 
 
 
 
 
 
 
 
 
 
97
 
98
  ---
99
 
@@ -103,7 +149,7 @@ If the agent correctly identifies a bug (+0.20), misidentifies the type (+0.0),
103
  |---|---|---|
104
  | GET | `/` | Health check |
105
  | POST | `/reset?task_id=<id>` | Reset environment, returns observation |
106
- | POST | `/step` | Submit action, returns reward |
107
  | GET | `/state` | Current episode state |
108
  | GET | `/tasks` | List all tasks |
109
 
@@ -130,9 +176,9 @@ uvicorn server.app:app --host 0.0.0.0 --port 8000
130
  ## Running Inference
131
 
132
  ```bash
133
- export API_BASE_URL="https://api.openai.com/v1"
134
- export MODEL_NAME="gpt-4o-mini"
135
- export HF_TOKEN="your-api-key"
136
  export ENV_URL="http://localhost:8000"
137
 
138
  python inference.py
 
5
  colorTo: purple
6
  sdk: docker
7
  pinned: false
8
+ tags:
9
+ - openenv
10
  ---
11
 
12
  # Code Security Review – OpenEnv Environment
13
 
14
  An RL environment for training AI agents to perform real-world code security review.
15
+ Agents analyze code from production pull requests across a **two-phase** multi-step
16
+ workflow: first discovering the hidden file, then identifying the vulnerability.
17
 
18
  Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
19
 
 
25
  |---|---|
26
  | Tasks | 3 (easy → medium → hard) |
27
  | Languages | Python, JavaScript |
28
+ | Action space | Phase 1: `{"request_file": true}` / Phase 2: Structured JSON (6 fields) |
29
+ | Reward range | 0.0 – 1.0 (clamped) |
30
+ | Steps per episode | 2 (max) |
31
 
32
  ---
33
 
 
37
  |---|---|---|---|
38
  | `python-off-by-one` | Python | Off-by-one index error | Easy |
39
  | `js-auth-privilege` | JavaScript | Logic flaw – privilege escalation | Medium |
40
+ | `python-pickle-deserialization` | Python | Insecure deserialization (RCE) | Hard |
41
 
42
  ---
43
 
44
+ ## Two-Phase Episode Walkthrough
45
 
46
+ The agent operates in a **2-step sequential workflow** that mirrors a real AppSec triage process:
47
 
48
+ **Step 1 – File Discovery** (`+0.20`)
49
+ The agent receives only the PR title and file path. The code is hidden. The agent must request access:
50
+ ```json
51
+ {"request_file": true}
52
+ ```
53
+ The environment unlocks the code snippet and returns it in the observation.
54
+
55
+ **Step 2 – Security Review** (up to `+0.80`)
56
+ The agent analyses the code and submits a structured JSON finding:
57
  ```json
58
  {
59
  "bug_identified": true,
60
  "bug_location": "line 3 – range(len(transactions) + 1)",
61
+ "bug_type": "off-by-one",
62
  "bug_description": "Off-by-one error causes IndexError on last iteration...",
63
  "severity": "medium",
64
  "suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
65
  }
66
  ```
67
 
68
+ ---
69
+
70
+ ## Action Space
71
+
72
+ ### Phase 1 – File Request
73
+ ```json
74
+ {"request_file": true}
75
+ ```
76
+
77
+ ### Phase 2 – Bug Review
78
+ | Field | Type | Values |
79
+ |---|---|---|
80
+ | `bug_identified` | bool | `true` / `false` |
81
+ | `bug_location` | string | location description |
82
+ | `bug_type` | string | `off-by-one` \| `logic-error` \| `security-vulnerability` \| `none` |
83
+ | `bug_description` | string | detailed vulnerability explanation |
84
+ | `severity` | string | `none` \| `low` \| `medium` \| `high` \| `critical` |
85
+ | `suggested_fix` | string | how to fix the bug |
86
+
87
  ## Observation Space
88
 
89
  ```json
90
  {
91
+ "task_id": "python-pickle-deserialization",
92
  "language": "Python",
93
  "difficulty": "hard",
94
+ "code_snippet": "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>",
95
+ "context": "Background worker loading serialized state via network payload",
96
+ "pr_title": "Add state persistence layer for distributed workers",
97
+ "file_path": "worker/state.py"
98
  }
99
  ```
100
+ After `request_file`, `code_snippet` contains the actual source code.
101
 
102
  ---
103
 
104
  ## Reward Breakdown
105
 
106
+ | Step | Component | Max Score |
107
+ |---|---|---|
108
+ | 1 | File request granted | 0.20 |
109
+ | 2 | Bug identified | 0.20 |
110
+ | 2 | Bug type correct | 0.20 |
111
+ | 2 | Bug location correct | 0.10 |
112
+ | 2 | Description quality | 0.25 |
113
+ | 2 | Fix quality | 0.15 |
114
+ | 2 | Severity correct | 0.10 |
115
+ | **Total** (after clamping) | | **1.00** |
116
+
117
+ The grader penalises keyword stuffing – incoherent keyword dumps score ≤ 0.20 on the description component.
118
+ Episode total reward is **clamped to [0.0, 1.0]**.
119
 
120
  **Example Calculation:**
121
+ Agent requests file (+0.20), correctly identifies bug (+0.20), correct type (+0.20),
122
+ finds 50% location keywords (+0.05), writes good description (+0.20),
123
+ suggests partial fix (+0.08), correct severity (+0.10) = total `0.20+0.20+0.20+0.05+0.20+0.08+0.10 = 1.03` → clamped to `1.00`.
124
 
125
  ---
126
 
127
  ## Edge Cases
128
 
129
+ - **At step 0:** `reset()` must be called first. Calling `step()` without a reset triggers auto-reset.
130
+ - **Phase 1 skip:** If the agent skips `request_file` and submits a review directly on step 1, it receives no intermediate reward and the code snippet used for grading may be hidden.
131
+ - **Max step limit:** Episode ends at `done=True` when a bug review is submitted or `max_steps=2` is reached.
132
+ - **At done=True:** Calling `step()` returns `reward=0.0`, `done=True`, and `info["error"]` indicating the episode is complete.
133
+
134
+ ---
135
+
136
+ ## Baseline Scores
137
+
138
+ | Task | Difficulty | Model | Score | Steps | Notes |
139
+ |------|-----------|-------|-------|-------|-------|
140
+ | python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | File request + review |
141
+ | js-auth-privilege | medium | Llama-3.3-70B-Instruct | 0.900 | 2 | File request + review |
142
+ | python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | TBD | 2 | Requires RCE/deserialization knowledge |
143
 
144
  ---
145
 
 
149
  |---|---|---|
150
  | GET | `/` | Health check |
151
  | POST | `/reset?task_id=<id>` | Reset environment, returns observation |
152
+ | POST | `/step` | Submit action (Phase 1 or Phase 2), returns reward |
153
  | GET | `/state` | Current episode state |
154
  | GET | `/tasks` | List all tasks |
155
 
 
176
  ## Running Inference
177
 
178
  ```bash
179
+ export API_BASE_URL="https://router.huggingface.co/v1"
180
+ export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
181
+ export HF_TOKEN="hf_your_token_here"
182
  export ENV_URL="http://localhost:8000"
183
 
184
  python inference.py
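
Reviewer note: the component maxima in the README reward table add up to 1.20 before clamping (0.20 for the file request plus 1.00 across the six review components), so the 1.00 total holds only once the clamp is applied. The sketch below works through that arithmetic, assuming per-step rewards are simply summed the way `inference.py` accumulates `cumulative_reward`. `clamp_episode_score` is a hypothetical helper name, not a function from the repository.

```python
# Component maxima copied from the README reward table in this commit.
STEP1_FILE_REQUEST = 0.20
STEP2_MAXIMA = {
    "bug_identified": 0.20,
    "bug_type": 0.20,
    "bug_location": 0.10,
    "description_quality": 0.25,
    "fix_quality": 0.15,
    "severity": 0.10,
}


def clamp_episode_score(step_rewards):
    """Sum per-step rewards, clamp to [0.0, 1.0], round to 3 places.

    Mirrors the END-log clamp added to inference.py; the function name
    itself is an assumption made for this illustration.
    """
    return round(min(1.0, max(0.0, sum(step_rewards))), 3)


# Raw ceiling if every component is maxed out (before clamping).
raw_maximum = STEP1_FILE_REQUEST + sum(STEP2_MAXIMA.values())

# The README's worked example: step 1 reward, then the step 2 components.
example_rewards = [0.20, 0.20 + 0.20 + 0.05 + 0.20 + 0.08 + 0.10]

print(round(raw_maximum, 2))
print(round(sum(example_rewards), 2), clamp_episode_score(example_rewards))
```

Whether the server-side grader rescales Step 2 to a 0.80 ceiling is not visible in this diff; the sketch only covers the client-side clamp.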
inference.py CHANGED
@@ -30,19 +30,22 @@ BENCHMARK = "code-security-review"
30
 
31
  SYSTEM_PROMPT = """You are a senior security-focused code reviewer.
32
 
33
- When given a code snippet, carefully analyse it for bugs and security issues.
 
 
34
 
35
- Respond with ONLY a valid JSON object – no markdown, no explanation outside the JSON.
36
-
37
- Schema:
38
  {
39
  "bug_identified": true or false,
40
  "bug_location": "exact location (function name, line description, variable, expression)",
41
  "bug_type": "off-by-one | logic-error | security-vulnerability | none",
42
  "bug_description": "detailed explanation of why this is a bug and the impact",
43
  "severity": "none | low | medium | high | critical",
44
- "suggested_fix": "the corrected code snippet or a precise description of the fix"
45
- }"""
 
 
46
 
47
  # ── Logging Helpers ───────────────────────────────────────────────────────────
48
 
@@ -73,14 +76,26 @@ def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = No
73
 
74
 
75
  def parse_json_from_llm(text: str) -> dict:
76
- """Robustly extract JSON from LLM output."""
 
 
 
 
77
  text = text.strip()
78
- text = re.sub(r"^```(?:json)?\s*", "", text)
79
- text = re.sub(r"\s*```$", "", text)
80
- # If the LLM still included text around the JSON, try to find the first { and last }
81
- match = re.search(r"({.*})", text, re.DOTALL)
82
- if match:
83
- text = match.group(1)
 
 
 
 
 
 
 
 
84
  try:
85
  return json.loads(text)
86
  except Exception:
@@ -115,8 +130,10 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
115
  reset_resp = env_post("/reset", params={"task_id": task_id})
116
  obs = reset_resp["observation"]
117
 
118
- max_steps = 1
119
  error = None
 
 
120
 
121
  while not done and step_num < max_steps:
122
  step_num += 1
@@ -126,7 +143,11 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
126
  # ── LLM call ──────────────────────────────────────────────────────────
127
  try:
128
  if client is None:
129
- if task_id == "python-off-by-one":
 
 
 
 
130
  action_dict = {
131
  "bug_identified": True,
132
  "bug_location": "line 3",
@@ -142,31 +163,36 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
142
  "bug_type": "logic-error",
143
  "bug_description": "logic operator || bypass escalation authorization bypass access",
144
  "severity": "critical",
145
- "suggested_fix": "user.role === \"admin\" && user.isActive",
146
  }
147
  else:
148
  action_dict = {
149
  "bug_identified": True,
150
- "bug_location": "line 2",
151
  "bug_type": "security-vulnerability",
152
- "bug_description": "f-string SQLi injection-flaw raw-sql SQL-interpolation",
153
  "severity": "critical",
154
- "suggested_fix": "parameterized query bind variables",
155
  }
156
  action_str = json.dumps(action_dict)
157
  error = None
158
  else:
 
 
 
 
 
159
  response = client.chat.completions.create(
160
  model=MODEL_NAME,
161
- messages=[
162
- {"role": "system", "content": SYSTEM_PROMPT},
163
- {"role": "user", "content": prompt},
164
- ],
165
  temperature=0.1,
166
  max_tokens=600,
167
  stream=False,
168
  )
169
  raw = response.choices[0].message.content
 
 
 
170
  action_dict = parse_json_from_llm(raw)
171
  action_str = json.dumps(action_dict)
172
  error = None
@@ -187,17 +213,18 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
187
  reward = step_resp["reward"]
188
  done = step_resp["done"]
189
  obs = step_resp.get("observation")
190
-
191
  all_rewards.append(reward)
192
  cumulative_reward += reward
193
-
194
  log_step(step=step_num, action=action_str, reward=reward, done=done, error=error)
195
 
196
  success = cumulative_reward >= 0.8
197
  except Exception as exc:
198
  print(f"[ERROR] Exception during run_task: {exc}", flush=True)
199
  finally:
200
- log_end(success=success, steps=step_num, score=cumulative_reward, rewards=all_rewards)
 
201
 
202
  return {
203
  "task_num": task_num,
@@ -225,7 +252,7 @@ def main():
225
  all_tasks = [
226
  ("python-off-by-one", 1, "easy"),
227
  ("js-auth-privilege", 2, "medium"),
228
- ("python-sql-injection", 3, "hard"),
229
  ]
230
 
231
  if TASK_FILTER:
 
30
 
31
  SYSTEM_PROMPT = """You are a senior security-focused code reviewer.
32
 
33
+ You are interacting with a multi-step environment. At first, the code snippet will be HIDDEN.
34
+ To request the file contents, you must output EXACTLY this JSON (no other text):
35
+ {"request_file": true}
36
 
37
+ Once you have requested the file and read the code snippet, carefully analyse it for bugs and security issues.
38
+ To submit your final review, respond with ONLY a valid JSON object matching this schema (no code blocks, no prose):
 
39
  {
40
  "bug_identified": true or false,
41
  "bug_location": "exact location (function name, line description, variable, expression)",
42
  "bug_type": "off-by-one | logic-error | security-vulnerability | none",
43
  "bug_description": "detailed explanation of why this is a bug and the impact",
44
  "severity": "none | low | medium | high | critical",
45
+ "suggested_fix": "description of fix (do NOT include code blocks inside this string)"
46
+ }
47
+
48
+ IMPORTANT: Your entire response must be parseable JSON. Do not wrap in markdown fences. Do not add any text outside the JSON object."""
49
 
50
  # ── Logging Helpers ───────────────────────────────────────────────────────────
51
 
 
76
 
77
 
78
  def parse_json_from_llm(text: str) -> dict:
79
+ """Robustly extract JSON from LLM output.
80
+
81
+ Strategy: strip markdown fences, then try to find the LAST top-level
82
+ JSON object in the text (after the LLM has potentially emitted code examples).
83
+ """
84
  text = text.strip()
85
+ # Strip ```json ... ``` and ``` ... ``` fences
86
+ text = re.sub(r"```(?:json)?\s*", "", text)
87
+ text = re.sub(r"```", "", text)
88
+ # Find all top-level {...} objects in the text
89
+ candidates = re.findall(r"(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})", text, re.DOTALL)
90
+ # Prefer the LAST candidate that is valid JSON (the review JSON, not a code example)
91
+ for candidate in reversed(candidates):
92
+ try:
93
+ parsed = json.loads(candidate)
94
+ if isinstance(parsed, dict):
95
+ return parsed
96
+ except Exception:
97
+ continue
98
+ # Final fallback: try the whole stripped text
99
  try:
100
  return json.loads(text)
101
  except Exception:
 
130
  reset_resp = env_post("/reset", params={"task_id": task_id})
131
  obs = reset_resp["observation"]
132
 
133
+ max_steps = 2
134
  error = None
135
+ file_requested = False
136
+ messages = [] # conversation history for LLM
137
 
138
  while not done and step_num < max_steps:
139
  step_num += 1
 
143
  # ── LLM call ──────────────────────────────────────────────────────────
144
  try:
145
  if client is None:
146
+ # Deterministic fallback: first request the file, then review
147
+ if not file_requested:
148
+ action_dict = {"request_file": True}
149
+ file_requested = True
150
+ elif task_id == "python-off-by-one":
151
  action_dict = {
152
  "bug_identified": True,
153
  "bug_location": "line 3",
 
163
  "bug_type": "logic-error",
164
  "bug_description": "logic operator || bypass escalation authorization bypass access",
165
  "severity": "critical",
166
+ "suggested_fix": 'user.role === "admin" && user.isActive',
167
  }
168
  else:
169
  action_dict = {
170
  "bug_identified": True,
171
+ "bug_location": "line 4",
172
  "bug_type": "security-vulnerability",
173
+ "bug_description": "deserialization pickle rce arbitrary code execution loads magic exploit un-serialize cve untrusted payload",
174
  "severity": "critical",
175
+ "suggested_fix": "json.loads or safe_load",
176
  }
177
  action_str = json.dumps(action_dict)
178
  error = None
179
  else:
180
+ # Multi-turn: build conversation history
181
+ if not messages:
182
+ messages = [{"role": "system", "content": SYSTEM_PROMPT}]
183
+ messages.append({"role": "user", "content": prompt})
184
+
185
  response = client.chat.completions.create(
186
  model=MODEL_NAME,
187
+ messages=messages,
 
 
 
188
  temperature=0.1,
189
  max_tokens=600,
190
  stream=False,
191
  )
192
  raw = response.choices[0].message.content
193
+ # Add assistant reply to history for next turn
194
+ messages.append({"role": "assistant", "content": raw})
195
+
196
  action_dict = parse_json_from_llm(raw)
197
  action_str = json.dumps(action_dict)
198
  error = None
 
213
  reward = step_resp["reward"]
214
  done = step_resp["done"]
215
  obs = step_resp.get("observation")
216
+
217
  all_rewards.append(reward)
218
  cumulative_reward += reward
219
+
220
  log_step(step=step_num, action=action_str, reward=reward, done=done, error=error)
221
 
222
  success = cumulative_reward >= 0.8
223
  except Exception as exc:
224
  print(f"[ERROR] Exception during run_task: {exc}", flush=True)
225
  finally:
226
+ clamped_score = round(min(1.0, max(0.0, cumulative_reward)), 3)
227
+ log_end(success=success, steps=step_num, score=clamped_score, rewards=all_rewards)
228
 
229
  return {
230
  "task_num": task_num,
 
252
  all_tasks = [
253
  ("python-off-by-one", 1, "easy"),
254
  ("js-auth-privilege", 2, "medium"),
255
+ ("python-pickle-deserialization", 3, "hard"),
256
  ]
257
 
258
  if TASK_FILTER:
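
Reviewer note: the regex in the new `parse_json_from_llm` matches braces nested at most one level deep, so a more deeply nested object is not matched as a whole. One possible alternative is to let the standard library's `json.JSONDecoder.raw_decode` find object boundaries. The sketch below is only that, an alternative under stated assumptions: `last_json_object` is a hypothetical name, and returning `{}` when nothing parses is an assumption, since the original's final fallback branch is cut off in this hunk.

```python
import json


def last_json_object(text: str) -> dict:
    """Return the last top-level JSON object found in text, or {} if none.

    Scans left to right for '{', tries to decode a full JSON value from
    that position, and keeps the most recent dict that decodes cleanly.
    """
    decoder = json.JSONDecoder()
    found = {}
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode returns the parsed value and the index where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON from this brace (e.g. code or prose); move on.
            pos = start + 1
            continue
        if isinstance(obj, dict):
            found = obj
        pos = end
    return found


reply = (
    'Example payload: {"request_file": true}\n'
    'Final answer: {"bug_identified": true, "meta": {"nested": {"deep": 1}}}'
)
review = last_json_object(reply)
```

Because decoding resumes after the end of each successfully parsed object, inner objects of a valid outer object are not reported separately.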
openenv.yaml CHANGED
@@ -17,41 +17,38 @@ tasks:
17
  name: "Python Off-by-One Error"
18
  description: "Identify an off-by-one index error in a Python finance batch processor"
19
  difficulty: easy
20
- max_steps: 1
21
  reward_range: [0.0, 1.0]
22
 
23
  - id: js-auth-privilege
24
  name: "JavaScript Auth Logic Flaw"
25
  description: "Identify a privilege escalation vulnerability in Node.js auth middleware"
26
  difficulty: medium
27
- max_steps: 1
28
  reward_range: [0.0, 1.0]
29
 
30
- - id: python-sql-injection
31
- name: "Python SQL Injection"
32
- description: "Identify an SQL injection vulnerability via f-string in a REST API"
33
  difficulty: hard
34
- max_steps: 1
35
  reward_range: [0.0, 1.0]
36
 
37
  # The Action space defines the format of the agent's response.
38
  # Each field is scored by the grader to provide partial progress signals.
39
  action_space:
40
  type: object
 
 
 
41
  properties:
 
42
  bug_identified: { type: boolean, description: "Boolean: true if a bug exists" }
43
  bug_location: { type: string, description: "String: Pinpoint the bug's location in code" }
44
  bug_type: { type: string, description: "String: off-by-one | logic-error | security-vulnerability | none" }
45
  bug_description: { type: string, description: "String: Detailed analysis of the vulnerability" }
46
  severity: { type: string, enum: [none, low, medium, high, critical], description: "String: none | low | medium | high | critical" }
47
  suggested_fix: { type: string, description: "String: How to fix the identified bug" }
48
- required:
49
- - bug_identified
50
- - bug_location
51
- - bug_type
52
- - bug_description
53
- - severity
54
- - suggested_fix
55
 
56
  # The Observation space defines what the agent sees at each step.
57
  # It uses a structured context to help the agent understand the code's purpose.
@@ -71,10 +68,11 @@ reward:
71
  min: 0.0
72
  max: 1.0
73
  description: >
74
- Partial rewards for: bug identification (0.20), correct bug type (0.20),
75
- precise location (0.10), description quality (0.25, keyword density),
76
- fix quality (0.15, keyword density), correct severity (0.10).
77
- Grader penalizes keyword stuffing.
 
78
 
79
  endpoints:
80
  health: GET /
 
17
  name: "Python Off-by-One Error"
18
  description: "Identify an off-by-one index error in a Python finance batch processor"
19
  difficulty: easy
20
+ max_steps: 2
21
  reward_range: [0.0, 1.0]
22
 
23
  - id: js-auth-privilege
24
  name: "JavaScript Auth Logic Flaw"
25
  description: "Identify a privilege escalation vulnerability in Node.js auth middleware"
26
  difficulty: medium
27
+ max_steps: 2
28
  reward_range: [0.0, 1.0]
29
 
30
+ - id: python-pickle-deserialization
31
+ name: "Python Pickle Deserialization"
32
+ description: "Identify an insecure deserialization vulnerability using pickle in a background worker"
33
  difficulty: hard
34
+ max_steps: 2
35
  reward_range: [0.0, 1.0]
36
 
37
  # The Action space defines the format of the agent's response.
38
  # Each field is scored by the grader to provide partial progress signals.
39
  action_space:
40
  type: object
41
+ description: >
42
+ Two-phase action space. Phase 1: submit {"request_file": true} to unlock
43
+ the code snippet (+0.20 reward). Phase 2: submit a full review JSON.
44
  properties:
45
+ request_file: { type: boolean, description: "Phase 1: Request the hidden file contents" }
46
  bug_identified: { type: boolean, description: "Boolean: true if a bug exists" }
47
  bug_location: { type: string, description: "String: Pinpoint the bug's location in code" }
48
  bug_type: { type: string, description: "String: off-by-one | logic-error | security-vulnerability | none" }
49
  bug_description: { type: string, description: "String: Detailed analysis of the vulnerability" }
50
  severity: { type: string, enum: [none, low, medium, high, critical], description: "String: none | low | medium | high | critical" }
51
  suggested_fix: { type: string, description: "String: How to fix the identified bug" }
 
 
 
 
 
 
 
52
 
53
  # The Observation space defines what the agent sees at each step.
54
  # It uses a structured context to help the agent understand the code's purpose.
 
68
  min: 0.0
69
  max: 1.0
70
  description: >
71
+ Step 1 – File request: +0.20 (flat, always granted).
72
+ Step 2 – Bug review: partial rewards for bug identification (0.20),
73
+ correct bug type (0.20), precise location (0.10), description quality (0.25,
74
+ keyword density), fix quality (0.15), correct severity (0.10).
75
+ Episode total is clamped to [0.0, 1.0]. Grader penalizes keyword stuffing.
76
 
77
  endpoints:
78
  health: GET /
output.txt ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [INFO] Initializing inference on code-security-review using meta-llama/Llama-3.3-70B-Instruct
2
+ [WARN] Client init failed: HF_TOKEN or API_KEY must be set.. Using deterministic fallback.
3
+ [START] task=python-off-by-one env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
4
+ [STEP] step=1 action={"bug_identified": true, "bug_location": "line 3", "bug_type": "off-by-one", "bug_description": "loop range(len(transactions) + 1) index error off-by-one out of bounds error", "severity": "medium", "suggested_fix": "range(len(transactions))"} reward=0.92 done=true error=null
5
+ [END] success=true steps=1 score=0.917 rewards=0.92
6
+ [START] task=js-auth-privilege env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
7
+ [STEP] step=1 action={"bug_identified": true, "bug_location": "line 3", "bug_type": "logic-error", "bug_description": "logic operator || bypass escalation authorization bypass access", "severity": "critical", "suggested_fix": "user.role === \"admin\" && user.isActive"} reward=0.91 done=true error=null
8
+ [END] success=true steps=1 score=0.912 rewards=0.91
9
+ [START] task=python-sql-injection env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
10
+ [STEP] step=1 action={"bug_identified": true, "bug_location": "line 2", "bug_type": "security-vulnerability", "bug_description": "f-string SQLi injection-flaw raw-sql SQL-interpolation", "severity": "critical", "suggested_fix": "parameterized query bind variables"} reward=0.92 done=true error=null
11
+ [END] success=true steps=1 score=0.920 rewards=0.92
12
+
13
+ [SUMMARY] avg_reward=0.916 tasks_passed=3/3
server/environment.py CHANGED
@@ -70,6 +70,22 @@ class CodeSecurityEnv:
70
  info={"error": ERROR_EPISODE_COMPLETED},
71
  )
72
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
73
  try:
74
  reward, breakdown = grade_action(action.model_dump(), self.current_task)
75
  except Exception as e:
@@ -77,7 +93,7 @@ class CodeSecurityEnv:
77
 
78
  self.step_count += 1
79
  self.total_reward += reward
80
- self.done = True # single-step environment
81
 
82
  return StepResult(
83
  observation=self._make_observation(),
@@ -106,11 +122,14 @@ class CodeSecurityEnv:
106
  if not t:
107
  raise KeyError("Attempted observation render without an initialized active task")
108
 
 
 
 
109
  return Observation(
110
  task_id=t["id"],
111
  language=t["language"],
112
  difficulty=t["difficulty"],
113
- code_snippet=t["code_snippet"],
114
  context=t["context"],
115
  pr_title=t["pr_title"],
116
  file_path=t["file_path"],
 
70
  info={"error": ERROR_EPISODE_COMPLETED},
71
  )
72
 
73
+ # Intermediate Step: Request file
74
+ if getattr(action, "request_file", False):
75
+ self.step_count += 1
76
+ reward = 0.20
77
+ self.total_reward += reward
78
+ self.done = False
79
+ return StepResult(
80
+ observation=self._make_observation(),
81
+ reward=reward,
82
+ done=self.done,
83
+ info={
84
+ "task_name": self.current_task.get("name", "Unknown Task") if self.current_task else "Unknown Task",
85
+ "step_count": self.step_count
86
+ },
87
+ )
88
+
89
  try:
90
  reward, breakdown = grade_action(action.model_dump(), self.current_task)
91
  except Exception as e:
 
93
 
94
  self.step_count += 1
95
  self.total_reward += reward
96
+ self.done = True # a submitted review always ends the episode (max 2 steps)
97
 
98
  return StepResult(
99
  observation=self._make_observation(),
 
122
  if not t:
123
  raise KeyError("Attempted observation render without an initialized active task")
124
 
125
+ # Hide the snippet before Step 1
126
+ snippet = t["code_snippet"] if self.step_count > 0 else "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>"
127
+
128
  return Observation(
129
  task_id=t["id"],
130
  language=t["language"],
131
  difficulty=t["difficulty"],
132
+ code_snippet=snippet,
133
  context=t["context"],
134
  pr_title=t["pr_title"],
135
  file_path=t["file_path"],
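
Reviewer note: as far as this hunk shows, a file request sets `done = False` without consulting `max_steps`, while the README says the episode also ends when `max_steps=2` is reached. The sketch below models the two-phase flow with that limit enforced, which is an assumption about the intended behaviour. `TwoPhaseEpisode` and its `review_reward` parameter (a stand-in for the value `grade_action` would return) are hypothetical and not part of the repository.

```python
from dataclasses import dataclass

HIDDEN = '<FILE CONTENTS HIDDEN - Submit {"request_file": true} to view>'


@dataclass
class TwoPhaseEpisode:
    """Toy model of the two-phase episode; not the real CodeSecurityEnv."""

    code_snippet: str
    max_steps: int = 2
    step_count: int = 0
    total_reward: float = 0.0
    done: bool = False

    def visible_snippet(self) -> str:
        # Same rule as the diff: the snippet is hidden until a step is taken.
        return self.code_snippet if self.step_count > 0 else HIDDEN

    def step(self, action: dict, review_reward: float = 0.0):
        if self.done:
            # Finished episodes return zero reward and stay finished.
            return self.visible_snippet(), 0.0, True
        self.step_count += 1
        if action.get("request_file"):
            reward = 0.20
            # Assumption: enforce max_steps here as well.
            self.done = self.step_count >= self.max_steps
        else:
            reward = review_reward
            self.done = True
        self.total_reward += reward
        return self.visible_snippet(), reward, self.done


episode = TwoPhaseEpisode("state = pickle.loads(payload_bytes)")
first = episode.step({"request_file": True})
second = episode.step({"bug_identified": True}, review_reward=0.7)
```

Note that the visibility rule keys off `step_count`, so an agent that skips the file request still sees the snippet in the observation returned after its review.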
server/grader.py CHANGED
@@ -68,8 +68,9 @@ def grade_action(action: Dict[str, Any], task: Dict[str, Any]) -> Tuple[float, D
68
  desc_score = 0.0
69
  if len(description) >= 20:
70
  task_keywords = task["keywords"]
 
71
  matched_kw = [kw for kw in task_keywords if kw in description]
72
- desc_score = round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched_kw) / KEYWORD_HIT_TARGET)), 4)
73
 
74
  breakdown["description_quality"] = desc_score
75
  reward += desc_score
 
68
  desc_score = 0.0
69
  if len(description) >= 20:
70
  task_keywords = task["keywords"]
71
+ target = task.get("keyword_target_override", KEYWORD_HIT_TARGET)
72
  matched_kw = [kw for kw in task_keywords if kw in description]
73
+ desc_score = round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched_kw) / target)), 4)
74
 
75
  breakdown["description_quality"] = desc_score
76
  reward += desc_score
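
Reviewer note: the effect of `keyword_target_override` is easiest to see with numbers. The sketch below reproduces the description-quality formula from this hunk in isolation. `SCORE_DESC_QUALITY = 0.25` is taken from the README reward table; the value of `KEYWORD_HIT_TARGET` is not shown in this diff, so the `3` used here is purely an assumption, as is the helper name `description_score`.

```python
SCORE_DESC_QUALITY = 0.25  # max description score per the README table
KEYWORD_HIT_TARGET = 3     # ASSUMPTION: the real default is not in this diff


def description_score(description: str, task: dict) -> float:
    """Keyword-density score with an optional per-task target override."""
    if len(description) < 20:
        return 0.0
    target = task.get("keyword_target_override", KEYWORD_HIT_TARGET)
    # Plain substring matching, exactly as in the hunk above.
    matched = [kw for kw in task["keywords"] if kw in description]
    return round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched) / target)), 4)


keywords = ["pickle", "rce", "untrusted"]
text = "pickle.loads on untrusted input"

default_score = description_score(text, {"keywords": keywords})
override_score = description_score(
    text, {"keywords": keywords, "keyword_target_override": 1.0}
)
```

With an override of `1.0`, a single keyword hit already earns the full description score. Since matching is by substring, a short keyword such as `rce` would also match inside words like "source", which may be worth keeping in mind when choosing keywords.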
server/models.py CHANGED
@@ -6,14 +6,15 @@ from pydantic import BaseModel, Field
6
  # ── Agent Action ──────────────────────────────────────────────────────────────
7
 
8
  class CodeReviewAction(BaseModel):
9
- """Action taken by the agent: a structured code review."""
10
 
11
- bug_identified: bool = Field(..., description="Whether a bug was found")
12
- bug_location: str = Field(..., description="Location of the bug (function, line, variable)")
13
- bug_type: str = Field(..., description="Type: off-by-one | logic-error | security-vulnerability | none")
14
- bug_description: str = Field(..., description="Detailed explanation of why this is a bug")
15
- severity: str = Field(..., description="Severity: none | low | medium | high | critical")
16
- suggested_fix: str = Field(..., description="The corrected code or a description of how to fix it")
 
17
 
18
  # ── Observation ───────────────────────────────────────────────────────────────
19
 
 
6
  # ── Agent Action ──────────────────────────────────────────────────────────────
7
 
8
  class CodeReviewAction(BaseModel):
9
+ """Action taken by the agent: a structured code review or a file request."""
10
 
11
+ request_file: Optional[bool] = Field(None, description="Request the file contents")
12
+ bug_identified: Optional[bool] = Field(None, description="Whether a bug was found")
13
+ bug_location: Optional[str] = Field(None, description="Location of the bug (function, line, variable)")
14
+ bug_type: Optional[str] = Field(None, description="Type: off-by-one | logic-error | security-vulnerability | none")
15
+ bug_description: Optional[str] = Field(None, description="Detailed explanation of why this is a bug")
16
+ severity: Optional[str] = Field(None, description="Severity: none | low | medium | high | critical")
17
+ suggested_fix: Optional[str] = Field(None, description="The corrected code or a description of how to fix it")
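
Reviewer note: with every field now `Optional` and defaulting to `None`, the model no longer rejects a Phase 2 action that omits review fields. If stricter checking is wanted, one option is a small phase check before grading. The sketch below uses plain dictionaries rather than pydantic so it stays self-contained; `classify_action` and `REVIEW_FIELDS` are hypothetical names, and whether the grader already tolerates `None` values is not visible in this diff.

```python
REVIEW_FIELDS = (
    "bug_identified",
    "bug_location",
    "bug_type",
    "bug_description",
    "severity",
    "suggested_fix",
)


def classify_action(action: dict) -> str:
    """Return 'file_request' or 'review'; raise if a review is incomplete."""
    if action.get("request_file") is True:
        return "file_request"
    missing = [name for name in REVIEW_FIELDS if action.get(name) is None]
    if missing:
        raise ValueError("review action is missing fields: " + ", ".join(missing))
    return "review"


full_review = {
    "bug_identified": True,
    "bug_location": "line 4",
    "bug_type": "security-vulnerability",
    "bug_description": "pickle.loads on an untrusted payload",
    "severity": "critical",
    "suggested_fix": "use json.loads",
}
kind = classify_action(full_review)
```

A check of this kind keeps the single permissive model for both phases while restoring the required-field guarantee for reviews.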
18
 
19
  # ── Observation ───────────────────────────────────────────────────────────────
20
 
server/tasks.py CHANGED
@@ -69,39 +69,40 @@ TASKS: Dict[str, Any] = {
69
  "fix_patterns": [
70
  "user.role === \"admin\" && user.isActive",
71
  "&& user.isActive",
72
- "throw new Error(\"Unauthorized\")"
 
73
  ],
 
74
  },
75
 
76
- "python-sql-injection": {
77
- "id": "python-sql-injection",
78
- "name": "Python SQL Injection",
79
  "language": "Python",
80
  "difficulty": "hard",
81
- "bug_class": "SQL injection via f-string",
82
- "pr_title": "Add user search endpoint to REST API",
83
- "file_path": "api/users.py",
84
- "context": "REST API endpoint that searches users by name in a PostgreSQL database",
85
  "code_snippet": (
86
- "def search_users(db, search_term):\n"
87
- " query = f\"SELECT * FROM users WHERE name LIKE '%{search_term}%'\"\n"
88
- " results = db.execute(query)\n"
89
- " return results.fetchall()"
90
  ),
91
  "bug_type": "security-vulnerability",
92
- "bug_location": "line 2 – f-string interpolation directly in SQL query",
93
  "severity": "critical",
94
  "keywords": [
95
- "interpolated", "f-string", "SQLi", "vector", "injection-flaw", "binding-hazard",
96
- "sanitization-gap", "DBAPI-compliance", "concatenation-pattern", "raw-sql",
97
- "prepared-statement-fix", "parameterized-query-binding", "placeholder-syntax",
98
- "SQL-interpolation", "driver-protocol", "malicious-input-flow", "exfiltration-risk",
99
- "second-order-injection", "blind-sql-injection", "union-based-attack"
100
  ],
101
  "fix_patterns": [
102
- "execute(query, (search_term,))",
103
- "bind variables",
104
- "parameterized query"
 
105
  ],
106
  },
107
  }
 
69
  "fix_patterns": [
70
  "user.role === \"admin\" && user.isActive",
71
  "&& user.isActive",
72
+ "throw new Error(\"Unauthorized\")",
73
+ "return next"
74
  ],
75
+ "keyword_target_override": 1.0,
76
  },
77
 
78
+ "python-pickle-deserialization": {
79
+ "id": "python-pickle-deserialization",
80
+ "name": "Python Pickle Deserialization",
81
  "language": "Python",
82
  "difficulty": "hard",
83
+ "bug_class": "Insecure Deserialization",
84
+ "pr_title": "Add state persistence layer for distributed workers",
85
+ "file_path": "worker/state.py",
86
+ "context": "Background worker loading serialized state via network payload",
87
  "code_snippet": (
88
+ "import pickle\n\n"
89
+ "def load_worker_state(payload_bytes):\n"
90
+ " state = pickle.loads(payload_bytes)\n"
91
+ " return state['config']"
92
  ),
93
  "bug_type": "security-vulnerability",
94
+ "bug_location": "line 4 – pickle.loads() executes arbitrary code during object recreation",
95
  "severity": "critical",
96
  "keywords": [
97
+ "deserialization", "pickle", "loads", "arbitrary", "code execution", "rce",
98
+ "injection", "untrusted", "payload", "cve", "insecure", "un-serialize",
99
+ "malicious", "exploit", "magic methods", "reduce"
 
 
100
  ],
101
  "fix_patterns": [
102
+ "json.loads",
103
+ "hmac",
104
+ "signatures",
105
+ "safe_load"
106
  ],
107
  },
108
  }