Nitish committed on
Commit · 561b3cf
1 Parent(s): 9b6b258
feat: multi-step env, pickle deserialization hard task, rebalanced difficulty
- Convert to 2-step episode: Phase 1=request_file (+0.20), Phase 2=bug review
- Replace python-sql-injection (hard) with python-pickle-deserialization (RCE)
to properly challenge LLMs below 0.80 baseline
- Add per-task keyword_target_override to grader for fair js-auth scoring
- Add conversation history to inference.py LLM calls for multi-turn context
- Fix parse_json_from_llm to scan for last valid JSON object (ignores code blocks)
- Clamp episode score to [0.0, 1.0] in END log
- Update openenv.yaml: max_steps=2, two-phase action space documented
- Rewrite README: multi-step walkthrough, updated baseline scores, reward table
- README.md +79 -33
- inference.py +54 -27
- openenv.yaml +15 -17
- output.txt +13 -0
- server/environment.py +21 -2
- server/grader.py +2 -1
- server/models.py +8 -7
- server/tasks.py +22 -21
README.md
CHANGED
|
@@ -5,13 +5,15 @@ colorFrom: gray
|
|
| 5 |
colorTo: purple
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
# Code Security Review - OpenEnv Environment
|
| 11 |
|
| 12 |
An RL environment for training AI agents to perform real-world code security review.
|
| 13 |
-
Agents analyze code
|
| 14 |
-
|
| 15 |
|
| 16 |
Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
|
| 17 |
|
|
@@ -23,9 +25,9 @@ Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
|
|
| 23 |
|---|---|
|
| 24 |
| Tasks | 3 (easy → medium → hard) |
|
| 25 |
| Languages | Python, JavaScript |
|
| 26 |
-
| Action space | Structured JSON (6 fields) |
|
| 27 |
-
| Reward range | 0.0 - 1.0 |
|
| 28 |
-
| Steps per episode |
|
| 29 |
|
| 30 |
---
|
| 31 |
|
|
@@ -35,65 +37,109 @@ Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
|
|
| 35 |
|---|---|---|---|
|
| 36 |
| `python-off-by-one` | Python | Off-by-one index error | Easy |
|
| 37 |
| `js-auth-privilege` | JavaScript | Logic flaw - privilege escalation | Medium |
|
| 38 |
-
| `python-
|
| 39 |
|
| 40 |
---
|
| 41 |
|
| 42 |
-
##
|
| 43 |
|
| 44 |
-
The agent
|
| 45 |
|
| 46 |
```json
|
| 47 |
{
|
| 48 |
"bug_identified": true,
|
| 49 |
"bug_location": "line 3 - range(len(transactions) + 1)",
|
| 50 |
-
"bug_type": "
|
| 51 |
"bug_description": "Off-by-one error causes IndexError on last iteration...",
|
| 52 |
"severity": "medium",
|
| 53 |
"suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
|
| 54 |
}
|
| 55 |
```
|
| 56 |
|
| 57 |
## Observation Space
|
| 58 |
|
| 59 |
```json
|
| 60 |
{
|
| 61 |
-
"task_id": "python-
|
| 62 |
"language": "Python",
|
| 63 |
"difficulty": "hard",
|
| 64 |
-
"code_snippet": "
|
| 65 |
-
"context": "
|
| 66 |
-
"pr_title": "Add
|
| 67 |
-
"file_path": "
|
| 68 |
}
|
| 69 |
```
|
|
|
|
| 70 |
|
| 71 |
---
|
| 72 |
|
| 73 |
## Reward Breakdown
|
| 74 |
|
| 75 |
-
| Component | Max Score |
|
| 76 |
-
|---|---|
|
| 77 |
-
|
|
| 78 |
-
|
|
| 79 |
-
| Bug
|
| 80 |
-
|
|
| 81 |
-
|
|
| 82 |
-
|
|
| 83 |
-
|
|
| 84 |
-
|
| 85 |
-
|
|
|
|
|
|
|
| 86 |
|
| 87 |
**Example Calculation:**
|
| 88 |
-
|
|
|
|
|
|
|
| 89 |
|
| 90 |
---
|
| 91 |
|
| 92 |
## Edge Cases
|
| 93 |
|
| 94 |
-
- **At step 0:** `reset()` must be called
|
| 95 |
-
- **
|
| 96 |
-
- **
|
| 97 |
|
| 98 |
---
|
| 99 |
|
|
@@ -103,7 +149,7 @@ If the agent correctly identifies a bug (+0.20), misidentifies the type (+0.0),
|
|
| 103 |
|---|---|---|
|
| 104 |
| GET | `/` | Health check |
|
| 105 |
| POST | `/reset?task_id=<id>` | Reset environment, returns observation |
|
| 106 |
-
| POST | `/step` | Submit action, returns reward |
|
| 107 |
| GET | `/state` | Current episode state |
|
| 108 |
| GET | `/tasks` | List all tasks |
|
| 109 |
|
|
@@ -130,9 +176,9 @@ uvicorn server.app:app --host 0.0.0.0 --port 8000
|
|
| 130 |
## Running Inference
|
| 131 |
|
| 132 |
```bash
|
| 133 |
-
export API_BASE_URL="https://
|
| 134 |
-
export MODEL_NAME="
|
| 135 |
-
export HF_TOKEN="
|
| 136 |
export ENV_URL="http://localhost:8000"
|
| 137 |
|
| 138 |
python inference.py
|
|
|
|
| 5 |
colorTo: purple
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
| 8 |
+
tags:
|
| 9 |
+
- openenv
|
| 10 |
---
|
| 11 |
|
| 12 |
# Code Security Review - OpenEnv Environment
|
| 13 |
|
| 14 |
An RL environment for training AI agents to perform real-world code security review.
|
| 15 |
+
Agents analyze code from production pull requests across a **two-phase** multi-step
|
| 16 |
+
workflow: first discovering the hidden file, then identifying the vulnerability.
|
| 17 |
|
| 18 |
Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
|
| 19 |
|
|
|
|
| 25 |
|---|---|
|
| 26 |
| Tasks | 3 (easy → medium → hard) |
|
| 27 |
| Languages | Python, JavaScript |
|
| 28 |
+
| Action space | Phase 1: `{"request_file": true}` / Phase 2: Structured JSON (6 fields) |
|
| 29 |
+
| Reward range | 0.0 - 1.0 (clamped) |
|
| 30 |
+
| Steps per episode | 2 (max) |
|
| 31 |
|
| 32 |
---
|
| 33 |
|
|
|
|
| 37 |
|---|---|---|---|
|
| 38 |
| `python-off-by-one` | Python | Off-by-one index error | Easy |
|
| 39 |
| `js-auth-privilege` | JavaScript | Logic flaw - privilege escalation | Medium |
|
| 40 |
+
| `python-pickle-deserialization` | Python | Insecure deserialization (RCE) | Hard |
|
| 41 |
|
| 42 |
---
|
| 43 |
|
| 44 |
+
## Two-Phase Episode Walkthrough
|
| 45 |
|
| 46 |
+
The agent operates in a **2-step sequential workflow** that mirrors a real AppSec triage process:
|
| 47 |
|
| 48 |
+
**Step 1 - File Discovery** (`+0.20`)
|
| 49 |
+
The agent receives only the PR title and file path. The code is hidden. The agent must request access:
|
| 50 |
+
```json
|
| 51 |
+
{"request_file": true}
|
| 52 |
+
```
|
| 53 |
+
The environment unlocks the code snippet and returns it in the observation.
|
| 54 |
+
|
| 55 |
+
**Step 2 - Security Review** (up to `+0.80`)
|
| 56 |
+
The agent analyses the code and submits a structured JSON finding:
|
| 57 |
```json
|
| 58 |
{
|
| 59 |
"bug_identified": true,
|
| 60 |
"bug_location": "line 3 - range(len(transactions) + 1)",
|
| 61 |
+
"bug_type": "off-by-one",
|
| 62 |
"bug_description": "Off-by-one error causes IndexError on last iteration...",
|
| 63 |
"severity": "medium",
|
| 64 |
"suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
|
| 65 |
}
|
| 66 |
```
|
| 67 |
|
| 68 |
+
---
|
| 69 |
+
|
| 70 |
+
## Action Space
|
| 71 |
+
|
| 72 |
+
### Phase 1 - File Request
|
| 73 |
+
```json
|
| 74 |
+
{"request_file": true}
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
### Phase 2 - Bug Review
|
| 78 |
+
| Field | Type | Values |
|
| 79 |
+
|---|---|---|
|
| 80 |
+
| `bug_identified` | bool | `true` / `false` |
|
| 81 |
+
| `bug_location` | string | location description |
|
| 82 |
+
| `bug_type` | string | `off-by-one` \| `logic-error` \| `security-vulnerability` \| `none` |
|
| 83 |
+
| `bug_description` | string | detailed vulnerability explanation |
|
| 84 |
+
| `severity` | string | `none` \| `low` \| `medium` \| `high` \| `critical` |
|
| 85 |
+
| `suggested_fix` | string | how to fix the bug |
|
| 86 |
+
|
| 87 |
## Observation Space
|
| 88 |
|
| 89 |
```json
|
| 90 |
{
|
| 91 |
+
"task_id": "python-pickle-deserialization",
|
| 92 |
"language": "Python",
|
| 93 |
"difficulty": "hard",
|
| 94 |
+
"code_snippet": "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>",
|
| 95 |
+
"context": "Background worker loading serialized state via network payload",
|
| 96 |
+
"pr_title": "Add state persistence layer for distributed workers",
|
| 97 |
+
"file_path": "worker/state.py"
|
| 98 |
}
|
| 99 |
```
|
| 100 |
+
After `request_file`, `code_snippet` contains the actual source code.
|
| 101 |
|
| 102 |
---
|
| 103 |
|
| 104 |
## Reward Breakdown
|
| 105 |
|
| 106 |
+
| Step | Component | Max Score |
|
| 107 |
+
|---|---|---|
|
| 108 |
+
| 1 | File request granted | 0.20 |
|
| 109 |
+
| 2 | Bug identified | 0.20 |
|
| 110 |
+
| 2 | Bug type correct | 0.20 |
|
| 111 |
+
| 2 | Bug location correct | 0.10 |
|
| 112 |
+
| 2 | Description quality | 0.25 |
|
| 113 |
+
| 2 | Fix quality | 0.15 |
|
| 114 |
+
| 2 | Severity correct | 0.10 |
|
| 115 |
+
| **Total** | | **1.00** (clamped; the listed components sum to 1.20) |
|
| 116 |
+
|
| 117 |
+
The grader penalises keyword stuffing: incoherent keyword dumps score ≤ 0.20 on the description component.
|
| 118 |
+
Episode total reward is **clamped to [0.0, 1.0]**.
|
| 119 |
|
| 120 |
**Example Calculation:**
|
| 121 |
+
Agent requests file (+0.20), correctly identifies bug (+0.20), correct type (+0.20),
|
| 122 |
+
finds 50% location keywords (+0.05), writes good description (+0.20),
|
| 123 |
+
suggests partial fix (+0.08), correct severity (+0.10) = total `0.20+0.20+0.20+0.05+0.20+0.08+0.10 = 1.03` → clamped to `1.00`.
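The arithmetic in this example can be checked with a short sketch. The component values are copied from the sentence above; the helper name `clamp_episode_score` is hypothetical and simply mirrors the clamping expression this commit adds to `inference.py`.

```python
# Component rewards from the worked example (values copied from the README text).
components = {
    "file_request": 0.20,
    "bug_identified": 0.20,
    "bug_type": 0.20,
    "bug_location": 0.05,
    "description": 0.20,
    "fix": 0.08,
    "severity": 0.10,
}


def clamp_episode_score(total: float) -> float:
    """Clamp an episode total to [0.0, 1.0], rounded to 3 decimals."""
    return round(min(1.0, max(0.0, total)), 3)


raw_total = round(sum(components.values()), 2)  # sum before clamping
final_score = clamp_episode_score(raw_total)    # value reported for the episode
print(raw_total, final_score)
```

The raw sum exceeds 1.0 because the Step 1 bonus is added on top of the Step 2 components, which is why the clamp matters.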
|
| 124 |
|
| 125 |
---
|
| 126 |
|
| 127 |
## Edge Cases
|
| 128 |
|
| 129 |
+
- **At step 0:** Call `reset()` first. Calling `step()` without a prior reset triggers an auto-reset.
|
| 130 |
+
- **Phase 1 skip:** If the agent skips `request_file` and submits a review directly on step 1, it receives no intermediate reward and the code snippet used for grading may be hidden.
|
| 131 |
+
- **Max step limit:** The episode ends (`done=True`) when a bug review is submitted or `max_steps=2` is reached.
|
| 132 |
+
- **At done=True:** Calling `step()` returns `reward=0.0`, `done=True`, and `info["error"]` indicating the episode is complete.
|
| 133 |
+
|
| 134 |
+
---
|
| 135 |
+
|
| 136 |
+
## Baseline Scores
|
| 137 |
+
|
| 138 |
+
| Task | Difficulty | Model | Score | Steps | Notes |
|
| 139 |
+
|------|-----------|-------|-------|-------|-------|
|
| 140 |
+
| python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | File request + review |
|
| 141 |
+
| js-auth-privilege | medium | Llama-3.3-70B-Instruct | 0.900 | 2 | File request + review |
|
| 142 |
+
| python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | TBD | 2 | Requires RCE/deserialization knowledge |
|
| 143 |
|
| 144 |
---
|
| 145 |
|
|
|
|
| 149 |
|---|---|---|
|
| 150 |
| GET | `/` | Health check |
|
| 151 |
| POST | `/reset?task_id=<id>` | Reset environment, returns observation |
|
| 152 |
+
| POST | `/step` | Submit action (Phase 1 or Phase 2), returns reward |
|
| 153 |
| GET | `/state` | Current episode state |
|
| 154 |
| GET | `/tasks` | List all tasks |
|
| 155 |
|
|
|
|
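As a rough illustration of the endpoint table above, the sketch below builds (but does not send) the two POST requests an agent would issue. The function names are hypothetical, and sending the action as a raw JSON body to `/step` is an assumption; the real server may expect a different envelope.

```python
import json
import urllib.parse
import urllib.request


def build_reset_request(env_url: str, task_id: str) -> urllib.request.Request:
    """POST /reset?task_id=<id> with an empty body."""
    query = urllib.parse.urlencode({"task_id": task_id})
    return urllib.request.Request(f"{env_url}/reset?{query}", data=b"", method="POST")


def build_step_request(env_url: str, action: dict) -> urllib.request.Request:
    """POST /step with the action as a JSON body (assumed payload shape)."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{env_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_req = build_reset_request("http://localhost:8000", "python-off-by-one")
step_req = build_step_request("http://localhost:8000", {"request_file": True})
# Sending is left to the caller, e.g. urllib.request.urlopen(reset_req).
```

Keeping request construction separate from sending makes the payloads easy to inspect without a running server.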
| 176 |
## Running Inference
|
| 177 |
|
| 178 |
```bash
|
| 179 |
+
export API_BASE_URL="https://router.huggingface.co/v1"
|
| 180 |
+
export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
|
| 181 |
+
export HF_TOKEN="hf_your_token_here"
|
| 182 |
export ENV_URL="http://localhost:8000"
|
| 183 |
|
| 184 |
python inference.py
|
inference.py
CHANGED
|
@@ -30,19 +30,22 @@ BENCHMARK = "code-security-review"
|
|
| 30 |
|
| 31 |
SYSTEM_PROMPT = """You are a senior security-focused code reviewer.
|
| 32 |
|
| 33 |
-
|
|
|
|
|
|
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
Schema:
|
| 38 |
{
|
| 39 |
"bug_identified": true or false,
|
| 40 |
"bug_location": "exact location (function name, line description, variable, expression)",
|
| 41 |
"bug_type": "off-by-one | logic-error | security-vulnerability | none",
|
| 42 |
"bug_description": "detailed explanation of why this is a bug and the impact",
|
| 43 |
"severity": "none | low | medium | high | critical",
|
| 44 |
-
"suggested_fix": "
|
| 45 |
-
}
|
|
|
|
|
|
|
| 46 |
|
| 47 |
# ── Logging Helpers ───────────────────────────────────────────────────────────
|
| 48 |
|
|
@@ -73,14 +76,26 @@ def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = No
|
|
| 73 |
|
| 74 |
|
| 75 |
def parse_json_from_llm(text: str) -> dict:
|
| 76 |
-
"""Robustly extract JSON from LLM output.
|
| 77 |
text = text.strip()
|
| 78 |
-
|
| 79 |
-
text = re.sub(r"
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 84 |
try:
|
| 85 |
return json.loads(text)
|
| 86 |
except Exception:
|
|
@@ -115,8 +130,10 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
|
|
| 115 |
reset_resp = env_post("/reset", params={"task_id": task_id})
|
| 116 |
obs = reset_resp["observation"]
|
| 117 |
|
| 118 |
-
max_steps =
|
| 119 |
error = None
|
|
|
|
|
|
|
| 120 |
|
| 121 |
while not done and step_num < max_steps:
|
| 122 |
step_num += 1
|
|
@@ -126,7 +143,11 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
|
|
| 126 |
# ── LLM call ──────────────────────────────────────────────────────────
|
| 127 |
try:
|
| 128 |
if client is None:
|
| 129 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 130 |
action_dict = {
|
| 131 |
"bug_identified": True,
|
| 132 |
"bug_location": "line 3",
|
|
@@ -142,31 +163,36 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
|
|
| 142 |
"bug_type": "logic-error",
|
| 143 |
"bug_description": "logic operator || bypass escalation authorization bypass access",
|
| 144 |
"severity": "critical",
|
| 145 |
-
"suggested_fix":
|
| 146 |
}
|
| 147 |
else:
|
| 148 |
action_dict = {
|
| 149 |
"bug_identified": True,
|
| 150 |
-
"bug_location": "line
|
| 151 |
"bug_type": "security-vulnerability",
|
| 152 |
-
"bug_description": "
|
| 153 |
"severity": "critical",
|
| 154 |
-
"suggested_fix": "
|
| 155 |
}
|
| 156 |
action_str = json.dumps(action_dict)
|
| 157 |
error = None
|
| 158 |
else:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 159 |
response = client.chat.completions.create(
|
| 160 |
model=MODEL_NAME,
|
| 161 |
-
messages=
|
| 162 |
-
{"role": "system", "content": SYSTEM_PROMPT},
|
| 163 |
-
{"role": "user", "content": prompt},
|
| 164 |
-
],
|
| 165 |
temperature=0.1,
|
| 166 |
max_tokens=600,
|
| 167 |
stream=False,
|
| 168 |
)
|
| 169 |
raw = response.choices[0].message.content
|
|
|
|
|
|
|
|
|
|
| 170 |
action_dict = parse_json_from_llm(raw)
|
| 171 |
action_str = json.dumps(action_dict)
|
| 172 |
error = None
|
|
@@ -187,17 +213,18 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
|
|
| 187 |
reward = step_resp["reward"]
|
| 188 |
done = step_resp["done"]
|
| 189 |
obs = step_resp.get("observation")
|
| 190 |
-
|
| 191 |
all_rewards.append(reward)
|
| 192 |
cumulative_reward += reward
|
| 193 |
-
|
| 194 |
log_step(step=step_num, action=action_str, reward=reward, done=done, error=error)
|
| 195 |
|
| 196 |
success = cumulative_reward >= 0.8
|
| 197 |
except Exception as exc:
|
| 198 |
print(f"[ERROR] Exception during run_task: {exc}", flush=True)
|
| 199 |
finally:
|
| 200 |
-
|
|
|
|
| 201 |
|
| 202 |
return {
|
| 203 |
"task_num": task_num,
|
|
@@ -225,7 +252,7 @@ def main():
|
|
| 225 |
all_tasks = [
|
| 226 |
("python-off-by-one", 1, "easy"),
|
| 227 |
("js-auth-privilege", 2, "medium"),
|
| 228 |
-
("python-
|
| 229 |
]
|
| 230 |
|
| 231 |
if TASK_FILTER:
|
|
|
|
| 30 |
|
| 31 |
SYSTEM_PROMPT = """You are a senior security-focused code reviewer.
|
| 32 |
|
| 33 |
+
You are interacting with a multi-step environment. At first, the code snippet will be HIDDEN.
|
| 34 |
+
To request the file contents, you must output EXACTLY this JSON (no other text):
|
| 35 |
+
{"request_file": true}
|
| 36 |
|
| 37 |
+
Once you have requested the file and read the code snippet, carefully analyse it for bugs and security issues.
|
| 38 |
+
To submit your final review, respond with ONLY a valid JSON object matching this schema (no code blocks, no prose):
|
|
|
|
| 39 |
{
|
| 40 |
"bug_identified": true or false,
|
| 41 |
"bug_location": "exact location (function name, line description, variable, expression)",
|
| 42 |
"bug_type": "off-by-one | logic-error | security-vulnerability | none",
|
| 43 |
"bug_description": "detailed explanation of why this is a bug and the impact",
|
| 44 |
"severity": "none | low | medium | high | critical",
|
| 45 |
+
"suggested_fix": "description of fix (do NOT include code blocks inside this string)"
|
| 46 |
+
}
|
| 47 |
+
|
| 48 |
+
IMPORTANT: Your entire response must be parseable JSON. Do not wrap in markdown fences. Do not add any text outside the JSON object."""
|
| 49 |
|
| 50 |
# ── Logging Helpers ───────────────────────────────────────────────────────────
|
| 51 |
|
|
|
|
| 76 |
|
| 77 |
|
| 78 |
def parse_json_from_llm(text: str) -> dict:
|
| 79 |
+
"""Robustly extract JSON from LLM output.
|
| 80 |
+
|
| 81 |
+
Strategy: strip markdown fences, then try to find the LAST top-level
|
| 82 |
+
JSON object in the text (after the LLM has potentially emitted code examples).
|
| 83 |
+
"""
|
| 84 |
text = text.strip()
|
| 85 |
+
# Strip ```json ... ``` and ``` ... ``` fences
|
| 86 |
+
text = re.sub(r"```(?:json)?\s*", "", text)
|
| 87 |
+
text = re.sub(r"```", "", text)
|
| 88 |
+
# Find all top-level {...} objects in the text
|
| 89 |
+
candidates = re.findall(r"(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})", text, re.DOTALL)
|
| 90 |
+
# Prefer the LAST candidate that is valid JSON (the review JSON, not a code example)
|
| 91 |
+
for candidate in reversed(candidates):
|
| 92 |
+
try:
|
| 93 |
+
parsed = json.loads(candidate)
|
| 94 |
+
if isinstance(parsed, dict):
|
| 95 |
+
return parsed
|
| 96 |
+
except Exception:
|
| 97 |
+
continue
|
| 98 |
+
# Final fallback: try the whole stripped text
|
| 99 |
try:
|
| 100 |
return json.loads(text)
|
| 101 |
except Exception:
|
|
|
|
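The "last valid JSON object" strategy in the `parse_json_from_llm` hunk above can be tried in isolation. This is a minimal sketch that reuses the same regular expression; the function name `extract_last_json_object` and the `ValueError` on failure are assumptions, since the real function's final fallback is not visible in this diff.

```python
import json
import re


def extract_last_json_object(text: str) -> dict:
    """Return the last parseable {...} object in text (one nesting level at most)."""
    # Drop markdown fence markers, with or without a "json" language tag.
    text = re.sub(r"```(?:json)?\s*", "", text.strip())
    # Same pattern as the diff: braces containing at most one nested brace level.
    candidates = re.findall(r"(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})", text, re.DOTALL)
    for candidate in reversed(candidates):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found")  # assumed failure behaviour


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
reply = (
    'Example action: {"request_file": true}\n'
    "Final answer:\n"
    f'{fence}json\n{{"bug_identified": true, "severity": "high"}}\n{fence}'
)
print(extract_last_json_object(reply))
```

Scanning from the end favours the model's final answer over any JSON it quoted earlier, at the cost of ignoring objects nested more than one level deep.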
| 130 |
reset_resp = env_post("/reset", params={"task_id": task_id})
|
| 131 |
obs = reset_resp["observation"]
|
| 132 |
|
| 133 |
+
max_steps = 2
|
| 134 |
error = None
|
| 135 |
+
file_requested = False
|
| 136 |
+
messages = [] # conversation history for LLM
|
| 137 |
|
| 138 |
while not done and step_num < max_steps:
|
| 139 |
step_num += 1
|
|
|
|
| 143 |
# ── LLM call ──────────────────────────────────────────────────────────
|
| 144 |
try:
|
| 145 |
if client is None:
|
| 146 |
+
# Deterministic fallback: first request the file, then review
|
| 147 |
+
if not file_requested:
|
| 148 |
+
action_dict = {"request_file": True}
|
| 149 |
+
file_requested = True
|
| 150 |
+
elif task_id == "python-off-by-one":
|
| 151 |
action_dict = {
|
| 152 |
"bug_identified": True,
|
| 153 |
"bug_location": "line 3",
|
|
|
|
| 163 |
"bug_type": "logic-error",
|
| 164 |
"bug_description": "logic operator || bypass escalation authorization bypass access",
|
| 165 |
"severity": "critical",
|
| 166 |
+
"suggested_fix": 'user.role === "admin" && user.isActive',
|
| 167 |
}
|
| 168 |
else:
|
| 169 |
action_dict = {
|
| 170 |
"bug_identified": True,
|
| 171 |
+
"bug_location": "line 4",
|
| 172 |
"bug_type": "security-vulnerability",
|
| 173 |
+
"bug_description": "deserialization pickle rce arbitrary code execution loads magic exploit un-serialize cve untrusted payload",
|
| 174 |
"severity": "critical",
|
| 175 |
+
"suggested_fix": "json.loads or safe_load",
|
| 176 |
}
|
| 177 |
action_str = json.dumps(action_dict)
|
| 178 |
error = None
|
| 179 |
else:
|
| 180 |
+
# Multi-turn: build conversation history
|
| 181 |
+
if not messages:
|
| 182 |
+
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
|
| 183 |
+
messages.append({"role": "user", "content": prompt})
|
| 184 |
+
|
| 185 |
response = client.chat.completions.create(
|
| 186 |
model=MODEL_NAME,
|
| 187 |
+
messages=messages,
|
|
|
|
|
|
|
|
|
|
| 188 |
temperature=0.1,
|
| 189 |
max_tokens=600,
|
| 190 |
stream=False,
|
| 191 |
)
|
| 192 |
raw = response.choices[0].message.content
|
| 193 |
+
# Add assistant reply to history for next turn
|
| 194 |
+
messages.append({"role": "assistant", "content": raw})
|
| 195 |
+
|
| 196 |
action_dict = parse_json_from_llm(raw)
|
| 197 |
action_str = json.dumps(action_dict)
|
| 198 |
error = None
|
|
|
|
| 213 |
reward = step_resp["reward"]
|
| 214 |
done = step_resp["done"]
|
| 215 |
obs = step_resp.get("observation")
|
| 216 |
+
|
| 217 |
all_rewards.append(reward)
|
| 218 |
cumulative_reward += reward
|
| 219 |
+
|
| 220 |
log_step(step=step_num, action=action_str, reward=reward, done=done, error=error)
|
| 221 |
|
| 222 |
success = cumulative_reward >= 0.8
|
| 223 |
except Exception as exc:
|
| 224 |
print(f"[ERROR] Exception during run_task: {exc}", flush=True)
|
| 225 |
finally:
|
| 226 |
+
clamped_score = round(min(1.0, max(0.0, cumulative_reward)), 3)
|
| 227 |
+
log_end(success=success, steps=step_num, score=clamped_score, rewards=all_rewards)
|
| 228 |
|
| 229 |
return {
|
| 230 |
"task_num": task_num,
|
|
|
|
| 252 |
all_tasks = [
|
| 253 |
("python-off-by-one", 1, "easy"),
|
| 254 |
("js-auth-privilege", 2, "medium"),
|
| 255 |
+
("python-pickle-deserialization", 3, "hard"),
|
| 256 |
]
|
| 257 |
|
| 258 |
if TASK_FILTER:
|
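The conversation-history change in `inference.py` can be illustrated without a real model. In this sketch `run_turn` and `fake_complete` are hypothetical names; `fake_complete` stands in for `client.chat.completions.create` and only imitates the two-phase behaviour.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_turn(
    messages: List[Message],
    system_prompt: str,
    user_prompt: str,
    complete: Callable[[List[Message]], str],
) -> str:
    """Append the user prompt, call the model, and record its reply."""
    if not messages:
        # The system prompt is added once, on the first turn only.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    reply = complete(messages)
    # Keeping the assistant reply lets the next turn see the earlier request.
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_complete(messages: List[Message]) -> str:
    """Stand-in model: request the file on turn 1, submit a review afterwards."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return '{"request_file": true}' if user_turns == 1 else '{"bug_identified": true}'


history: List[Message] = []
first = run_turn(history, "You are a reviewer.", "Code is hidden.", fake_complete)
second = run_turn(history, "You are a reviewer.", "Here is the code.", fake_complete)
```

With a shared `history` list, the second call sees both the original prompt and the model's own file request, which is the multi-turn context the commit message describes.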
openenv.yaml
CHANGED
|
@@ -17,41 +17,38 @@ tasks:
|
|
| 17 |
name: "Python Off-by-One Error"
|
| 18 |
description: "Identify an off-by-one index error in a Python finance batch processor"
|
| 19 |
difficulty: easy
|
| 20 |
-
max_steps:
|
| 21 |
reward_range: [0.0, 1.0]
|
| 22 |
|
| 23 |
- id: js-auth-privilege
|
| 24 |
name: "JavaScript Auth Logic Flaw"
|
| 25 |
description: "Identify a privilege escalation vulnerability in Node.js auth middleware"
|
| 26 |
difficulty: medium
|
| 27 |
-
max_steps:
|
| 28 |
reward_range: [0.0, 1.0]
|
| 29 |
|
| 30 |
-
- id: python-
|
| 31 |
-
name: "Python
|
| 32 |
-
description: "Identify an
|
| 33 |
difficulty: hard
|
| 34 |
-
max_steps:
|
| 35 |
reward_range: [0.0, 1.0]
|
| 36 |
|
| 37 |
# The Action space defines the format of the agent's response.
|
| 38 |
# Each field is scored by the grader to provide partial progress signals.
|
| 39 |
action_space:
|
| 40 |
type: object
|
|
|
|
|
|
|
|
|
|
| 41 |
properties:
|
|
|
|
| 42 |
bug_identified: { type: boolean, description: "Boolean: true if a bug exists" }
|
| 43 |
bug_location: { type: string, description: "String: Pinpoint the bug's location in code" }
|
| 44 |
bug_type: { type: string, description: "String: off-by-one | logic-error | security-vulnerability | none" }
|
| 45 |
bug_description: { type: string, description: "String: Detailed analysis of the vulnerability" }
|
| 46 |
severity: { type: string, enum: [none, low, medium, high, critical], description: "String: none | low | medium | high | critical" }
|
| 47 |
suggested_fix: { type: string, description: "String: How to fix the identified bug" }
|
| 48 |
-
required:
|
| 49 |
-
- bug_identified
|
| 50 |
-
- bug_location
|
| 51 |
-
- bug_type
|
| 52 |
-
- bug_description
|
| 53 |
-
- severity
|
| 54 |
-
- suggested_fix
|
| 55 |
|
| 56 |
# The Observation space defines what the agent sees at each step.
|
| 57 |
# It uses a structured context to help the agent understand the code's purpose.
|
|
@@ -71,10 +68,11 @@ reward:
|
|
| 71 |
min: 0.0
|
| 72 |
max: 1.0
|
| 73 |
description: >
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
|
|
|
| 78 |
|
| 79 |
endpoints:
|
| 80 |
health: GET /
|
|
|
|
| 17 |
name: "Python Off-by-One Error"
|
| 18 |
description: "Identify an off-by-one index error in a Python finance batch processor"
|
| 19 |
difficulty: easy
|
| 20 |
+
max_steps: 2
|
| 21 |
reward_range: [0.0, 1.0]
|
| 22 |
|
| 23 |
- id: js-auth-privilege
|
| 24 |
name: "JavaScript Auth Logic Flaw"
|
| 25 |
description: "Identify a privilege escalation vulnerability in Node.js auth middleware"
|
| 26 |
difficulty: medium
|
| 27 |
+
max_steps: 2
|
| 28 |
reward_range: [0.0, 1.0]
|
| 29 |
|
| 30 |
+
- id: python-pickle-deserialization
|
| 31 |
+
name: "Python Pickle Deserialization"
|
| 32 |
+
description: "Identify an insecure deserialization vulnerability using pickle in a background worker"
|
| 33 |
difficulty: hard
|
| 34 |
+
max_steps: 2
|
| 35 |
reward_range: [0.0, 1.0]
|
| 36 |
|
| 37 |
# The Action space defines the format of the agent's response.
|
| 38 |
# Each field is scored by the grader to provide partial progress signals.
|
| 39 |
action_space:
|
| 40 |
type: object
|
| 41 |
+
description: >
|
| 42 |
+
Two-phase action space. Phase 1: submit {"request_file": true} to unlock
|
| 43 |
+
the code snippet (+0.20 reward). Phase 2: submit a full review JSON.
|
| 44 |
properties:
|
| 45 |
+
request_file: { type: boolean, description: "Phase 1: Request the hidden file contents" }
|
| 46 |
bug_identified: { type: boolean, description: "Boolean: true if a bug exists" }
|
| 47 |
bug_location: { type: string, description: "String: Pinpoint the bug's location in code" }
|
| 48 |
bug_type: { type: string, description: "String: off-by-one | logic-error | security-vulnerability | none" }
|
| 49 |
bug_description: { type: string, description: "String: Detailed analysis of the vulnerability" }
|
| 50 |
severity: { type: string, enum: [none, low, medium, high, critical], description: "String: none | low | medium | high | critical" }
|
| 51 |
suggested_fix: { type: string, description: "String: How to fix the identified bug" }
|
|
| 52 |
|
| 53 |
# The Observation space defines what the agent sees at each step.
|
| 54 |
# It uses a structured context to help the agent understand the code's purpose.
|
|
|
|
| 68 |
min: 0.0
|
| 69 |
max: 1.0
|
| 70 |
description: >
|
| 71 |
+
Step 1 - File request: +0.20 (flat, always granted).
|
| 72 |
+
Step 2 - Bug review: partial rewards for bug identification (0.20),
|
| 73 |
+
correct bug type (0.20), precise location (0.10), description quality (0.25,
|
| 74 |
+
keyword density), fix quality (0.15), correct severity (0.10).
|
| 75 |
+
Episode total is clamped to [0.0, 1.0]. Grader penalizes keyword stuffing.
|
| 76 |
|
| 77 |
endpoints:
|
| 78 |
health: GET /
|
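Because this commit drops the `required:` list from `action_space`, a client may want to tell the two action shapes apart before calling `/step`. The sketch below is one way to do that; `classify_action` and the `"invalid"` outcome are assumptions for illustration and are not part of the environment shown in this diff.

```python
REVIEW_FIELDS = (
    "bug_identified",
    "bug_location",
    "bug_type",
    "bug_description",
    "severity",
    "suggested_fix",
)


def classify_action(action: dict) -> str:
    """Return 'file_request', 'review', or 'invalid' for an action payload."""
    # Phase 1: the file request takes priority, as in the environment's step().
    if action.get("request_file") is True:
        return "file_request"
    # Phase 2: treat the action as a review only if all six fields are present.
    missing = [field for field in REVIEW_FIELDS if action.get(field) is None]
    return "review" if not missing else "invalid"


full_review = {
    "bug_identified": True,
    "bug_location": "line 4",
    "bug_type": "security-vulnerability",
    "bug_description": "pickle.loads on untrusted bytes",
    "severity": "critical",
    "suggested_fix": "use json.loads",
}
```

A check like this catches partially filled reviews on the client side, before they reach the grader.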
output.txt
ADDED
|
@@ -0,0 +1,13 @@
|
|
| 1 |
+
[INFO] Initializing inference on code-security-review using meta-llama/Llama-3.3-70B-Instruct
|
| 2 |
+
[WARN] Client init failed: HF_TOKEN or API_KEY must be set.. Using deterministic fallback.
|
| 3 |
+
[START] task=python-off-by-one env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
|
| 4 |
+
[STEP] step=1 action={"bug_identified": true, "bug_location": "line 3", "bug_type": "off-by-one", "bug_description": "loop range(len(transactions) + 1) index error off-by-one out of bounds error", "severity": "medium", "suggested_fix": "range(len(transactions))"} reward=0.92 done=true error=null
|
| 5 |
+
[END] success=true steps=1 score=0.917 rewards=0.92
|
| 6 |
+
[START] task=js-auth-privilege env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
|
| 7 |
+
[STEP] step=1 action={"bug_identified": true, "bug_location": "line 3", "bug_type": "logic-error", "bug_description": "logic operator || bypass escalation authorization bypass access", "severity": "critical", "suggested_fix": "user.role === \"admin\" && user.isActive"} reward=0.91 done=true error=null
|
| 8 |
+
[END] success=true steps=1 score=0.912 rewards=0.91
|
| 9 |
+
[START] task=python-sql-injection env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
|
| 10 |
+
[STEP] step=1 action={"bug_identified": true, "bug_location": "line 2", "bug_type": "security-vulnerability", "bug_description": "f-string SQLi injection-flaw raw-sql SQL-interpolation", "severity": "critical", "suggested_fix": "parameterized query bind variables"} reward=0.92 done=true error=null
|
| 11 |
+
[END] success=true steps=1 score=0.920 rewards=0.92
|
| 12 |
+
|
| 13 |
+
[SUMMARY] avg_reward=0.916 tasks_passed=3/3
|
server/environment.py
CHANGED
|
@@ -70,6 +70,22 @@ class CodeSecurityEnv:
|
|
| 70 |
info={"error": ERROR_EPISODE_COMPLETED},
|
| 71 |
)
|
| 72 |
|
|
| 73 |
try:
|
| 74 |
reward, breakdown = grade_action(action.model_dump(), self.current_task)
|
| 75 |
except Exception as e:
|
|
@@ -77,7 +93,7 @@ class CodeSecurityEnv:
|
|
| 77 |
|
| 78 |
self.step_count += 1
|
| 79 |
self.total_reward += reward
|
| 80 |
-
self.done = True # single-step environment
|
| 81 |
|
| 82 |
return StepResult(
|
| 83 |
observation=self._make_observation(),
|
|
@@ -106,11 +122,14 @@ class CodeSecurityEnv:
|
|
| 106 |
if not t:
|
| 107 |
raise KeyError("Attempted observation render without an initialized active task")
|
| 108 |
|
|
|
|
|
|
|
|
|
|
| 109 |
return Observation(
|
| 110 |
task_id=t["id"],
|
| 111 |
language=t["language"],
|
| 112 |
difficulty=t["difficulty"],
|
| 113 |
-
code_snippet=
|
| 114 |
context=t["context"],
|
| 115 |
pr_title=t["pr_title"],
|
| 116 |
file_path=t["file_path"],
|
|
|
|
| 70 |
info={"error": ERROR_EPISODE_COMPLETED},
|
| 71 |
)
|
| 72 |
|
| 73 |
+
# Intermediate Step: Request file
|
| 74 |
+
if getattr(action, "request_file", False):
|
| 75 |
+
self.step_count += 1
|
| 76 |
+
reward = 0.20
|
| 77 |
+
self.total_reward += reward
|
| 78 |
+
self.done = False
|
| 79 |
+
return StepResult(
|
| 80 |
+
observation=self._make_observation(),
|
| 81 |
+
reward=reward,
|
| 82 |
+
done=self.done,
|
| 83 |
+
info={
|
| 84 |
+
"task_name": self.current_task.get("name", "Unknown Task") if self.current_task else "Unknown Task",
|
| 85 |
+
"step_count": self.step_count
|
| 86 |
+
},
|
| 87 |
+
)
|
| 88 |
+
|
| 89 |
try:
|
| 90 |
reward, breakdown = grade_action(action.model_dump(), self.current_task)
|
| 91 |
except Exception as e:
|
|
|
|
| 93 |
|
| 94 |
self.step_count += 1
|
| 95 |
self.total_reward += reward
|
| 96 |
+
self.done = True  # a submitted review always ends the episode (max 2 steps)
|
| 97 |
|
| 98 |
return StepResult(
|
| 99 |
observation=self._make_observation(),
|
|
|
|
| 122 |
if not t:
|
| 123 |
raise KeyError("Attempted observation render without an initialized active task")
|
| 124 |
|
| 125 |
+
# Hide the snippet before Step 1
|
| 126 |
+
snippet = t["code_snippet"] if self.step_count > 0 else "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>"
|
| 127 |
+
|
| 128 |
return Observation(
|
| 129 |
task_id=t["id"],
|
| 130 |
language=t["language"],
|
| 131 |
difficulty=t["difficulty"],
|
| 132 |
+
code_snippet=snippet,
|
| 133 |
context=t["context"],
|
| 134 |
pr_title=t["pr_title"],
|
| 135 |
file_path=t["file_path"],
|
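The two-phase flow in `server/environment.py` can be summarised with an in-memory toy. This is a sketch only: `ToyTwoPhaseEnv` is a hypothetical class, the grader is replaced by a fixed `review_reward`, and the real server's `max_steps` handling and auto-reset are not modelled.

```python
HIDDEN = "<FILE CONTENTS HIDDEN>"


class ToyTwoPhaseEnv:
    """In-memory stand-in for the two-phase episode flow (not the real server)."""

    def __init__(self, code_snippet: str, review_reward: float):
        self.code_snippet = code_snippet
        self.review_reward = review_reward  # placeholder for the grader's output
        self.step_count = 0
        self.total_reward = 0.0
        self.done = False

    def observe(self) -> str:
        # The snippet stays hidden until at least one step has been taken.
        return self.code_snippet if self.step_count > 0 else HIDDEN

    def step(self, action: dict):
        if self.done:
            return self.observe(), 0.0, True  # episode already complete
        if action.get("request_file"):
            reward = 0.20  # flat Phase 1 reward, episode continues
        else:
            reward, self.done = self.review_reward, True  # review ends the episode
        self.step_count += 1
        self.total_reward += reward
        return self.observe(), reward, self.done


env = ToyTwoPhaseEnv("import pickle", review_reward=0.75)
obs0 = env.observe()
obs1, r1, d1 = env.step({"request_file": True})
obs2, r2, d2 = env.step({"bug_identified": True})
```

Note that, as in the diff, the snippet is revealed once `step_count > 0`, so a review submitted as the very first action is graded before the agent has seen the code.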
server/grader.py
CHANGED
|
@@ -68,8 +68,9 @@ def grade_action(action: Dict[str, Any], task: Dict[str, Any]) -> Tuple[float, D
|
|
| 68 |
desc_score = 0.0
|
| 69 |
if len(description) >= 20:
|
| 70 |
task_keywords = task["keywords"]
|
|
|
|
| 71 |
matched_kw = [kw for kw in task_keywords if kw in description]
|
| 72 |
-
desc_score = round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched_kw) /
|
| 73 |
|
| 74 |
breakdown["description_quality"] = desc_score
|
| 75 |
reward += desc_score
|
|
|
|
| 68 |
desc_score = 0.0
|
| 69 |
if len(description) >= 20:
|
| 70 |
task_keywords = task["keywords"]
|
| 71 |
+
target = task.get("keyword_target_override", KEYWORD_HIT_TARGET)
|
| 72 |
matched_kw = [kw for kw in task_keywords if kw in description]
|
| 73 |
+
desc_score = round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched_kw) / target)), 4)
|
| 74 |
|
| 75 |
breakdown["description_quality"] = desc_score
|
| 76 |
reward += desc_score
|
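The effect of `keyword_target_override` on the description score can be seen in a small standalone sketch. `SCORE_DESC_QUALITY = 0.25` follows the README reward table, while the value of `KEYWORD_HIT_TARGET` is an assumption because the real constant does not appear in this diff; `description_score` is a hypothetical name.

```python
SCORE_DESC_QUALITY = 0.25  # max description score, per the README reward table
KEYWORD_HIT_TARGET = 3.0   # assumed default; the real constant is not shown here


def description_score(description: str, task: dict) -> float:
    """Score a description by keyword hits relative to a per-task target."""
    if len(description) < 20:
        return 0.0  # descriptions that are too short score nothing
    target = task.get("keyword_target_override", KEYWORD_HIT_TARGET)
    matched = [kw for kw in task["keywords"] if kw in description]
    return round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched) / target)), 4)


keywords = ["pickle", "rce", "untrusted", "payload"]
task_default = {"keywords": keywords}
task_override = {"keywords": keywords, "keyword_target_override": 1.0}
text = "calls pickle on data from the wire"  # matches exactly one keyword

default_score = description_score(text, task_default)
override_score = description_score(text, task_override)
```

Lowering the target to 1.0 means a single keyword hit already earns the full description score, which is how the override makes scoring more lenient for a task.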
server/models.py
CHANGED
|
@@ -6,14 +6,15 @@ from pydantic import BaseModel, Field
|
|
| 6 |
# ── Agent Action ──────────────────────────────────────────────────────────────
|
| 7 |
|
| 8 |
class CodeReviewAction(BaseModel):
|
| 9 |
-
"""Action taken by the agent: a structured code review."""
|
| 10 |
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
|
|
|
| 17 |
|
| 18 |
# ── Observation ───────────────────────────────────────────────────────────────
|
| 19 |
|
|
|
|
| 6 |
# ── Agent Action ──────────────────────────────────────────────────────────────
|
| 7 |
|
| 8 |
class CodeReviewAction(BaseModel):
|
| 9 |
+
"""Action taken by the agent: a structured code review or a file request."""
|
| 10 |
|
| 11 |
+
request_file: Optional[bool] = Field(None, description="Request the file contents")
|
| 12 |
+
bug_identified: Optional[bool] = Field(None, description="Whether a bug was found")
|
| 13 |
+
bug_location: Optional[str] = Field(None, description="Location of the bug (function, line, variable)")
|
| 14 |
+
bug_type: Optional[str] = Field(None, description="Type: off-by-one | logic-error | security-vulnerability | none")
|
| 15 |
+
bug_description: Optional[str] = Field(None, description="Detailed explanation of why this is a bug")
|
| 16 |
+
severity: Optional[str] = Field(None, description="Severity: none | low | medium | high | critical")
|
| 17 |
+
suggested_fix: Optional[str] = Field(None, description="The corrected code or a description of how to fix it")
|
| 18 |
|
| 19 |
# ββ Observation βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 20 |
|
server/tasks.py
CHANGED
|
@@ -69,39 +69,40 @@ TASKS: Dict[str, Any] = {
|
|
| 69 |
"fix_patterns": [
|
| 70 |
"user.role === \"admin\" && user.isActive",
|
| 71 |
"&& user.isActive",
|
| 72 |
-
"throw new Error(\"Unauthorized\")"
|
|
|
|
| 73 |
],
|
|
|
|
| 74 |
},
|
| 75 |
|
| 76 |
-
"python-
|
| 77 |
-
"id": "python-
|
| 78 |
-
"name": "Python
|
| 79 |
"language": "Python",
|
| 80 |
"difficulty": "hard",
|
| 81 |
-
"bug_class": "
|
| 82 |
-
"pr_title": "Add
|
| 83 |
-
"file_path": "
|
| 84 |
-
"context": "
|
| 85 |
"code_snippet": (
|
| 86 |
-
"
|
| 87 |
-
"
|
| 88 |
-
"
|
| 89 |
-
" return
|
| 90 |
),
|
| 91 |
"bug_type": "security-vulnerability",
|
| 92 |
-
"bug_location": "line
|
| 93 |
"severity": "critical",
|
| 94 |
"keywords": [
|
| 95 |
-
"
|
| 96 |
-
"
|
| 97 |
-
"
|
| 98 |
-
"SQL-interpolation", "driver-protocol", "malicious-input-flow", "exfiltration-risk",
|
| 99 |
-
"second-order-injection", "blind-sql-injection", "union-based-attack"
|
| 100 |
],
|
| 101 |
"fix_patterns": [
|
| 102 |
-
"
|
| 103 |
-
"
|
| 104 |
-
"
|
|
|
|
| 105 |
],
|
| 106 |
},
|
| 107 |
}
|
|
|
|
| 69 |
"fix_patterns": [
|
| 70 |
"user.role === \"admin\" && user.isActive",
|
| 71 |
"&& user.isActive",
|
| 72 |
+
"throw new Error(\"Unauthorized\")",
|
| 73 |
+
"return next"
|
| 74 |
],
|
| 75 |
+
"keyword_target_override": 1.0,
|
| 76 |
},
|
| 77 |
|
| 78 |
+
"python-pickle-deserialization": {
|
| 79 |
+
"id": "python-pickle-deserialization",
|
| 80 |
+
"name": "Python Pickle Deserialization",
|
| 81 |
"language": "Python",
|
| 82 |
"difficulty": "hard",
|
| 83 |
+
"bug_class": "Insecure Deserialization",
|
| 84 |
+
"pr_title": "Add state persistence layer for distributed workers",
|
| 85 |
+
"file_path": "worker/state.py",
|
| 86 |
+
"context": "Background worker loading serialized state via network payload",
|
| 87 |
"code_snippet": (
|
| 88 |
+
"import pickle\n\n"
|
| 89 |
+
"def load_worker_state(payload_bytes):\n"
|
| 90 |
+
" state = pickle.loads(payload_bytes)\n"
|
| 91 |
+
" return state['config']"
|
| 92 |
),
|
| 93 |
"bug_type": "security-vulnerability",
|
| 94 |
+
"bug_location": "line 4 - pickle.loads() executes arbitrary code during object recreation",
|
| 95 |
"severity": "critical",
|
| 96 |
"keywords": [
|
| 97 |
+
"deserialization", "pickle", "loads", "arbitrary", "code execution", "rce",
|
| 98 |
+
"injection", "untrusted", "payload", "cve", "insecure", "un-serialize",
|
| 99 |
+
"malicious", "exploit", "magic methods", "reduce"
|
|
|
|
|
|
|
| 100 |
],
|
| 101 |
"fix_patterns": [
|
| 102 |
+
"json.loads",
|
| 103 |
+
"hmac",
|
| 104 |
+
"signatures",
|
| 105 |
+
"safe_load"
|
| 106 |
],
|
| 107 |
},
|
| 108 |
}
|
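The `fix_patterns` for the pickle task point toward `json.loads` and HMAC signatures. The sketch below shows one way those two ideas could combine into a safer `load_worker_state`; the wire format (hex tag, a dot, then JSON) and the `dump_worker_state` helper are assumptions made for illustration, not the task's reference fix.

```python
import hashlib
import hmac
import json


def dump_worker_state(state: dict, key: bytes) -> bytes:
    """Serialize state as JSON and prefix it with an HMAC-SHA256 hex tag."""
    body = json.dumps(state, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    return tag + b"." + body


def load_worker_state(payload_bytes: bytes, key: bytes) -> dict:
    """Verify the tag, then parse with json.loads (data only, no code execution)."""
    tag, sep, body = payload_bytes.partition(b".")
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking the match position through timing.
    if not sep or not hmac.compare_digest(tag, expected):
        raise ValueError("payload signature mismatch")
    return json.loads(body)["config"]


key = b"demo-key-not-for-production"
payload = dump_worker_state({"config": {"retries": 3}}, key)
config = load_worker_state(payload, key)
```

JSON parsing only ever produces plain data structures, and the signature check rejects payloads that were not produced by a holder of the key.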