Spaces: inmodel/code-review-env

Nitish committed on Apr 8

Commit

561b3cf

1 Parent(s): 9b6b258

feat: multi-step env, pickle deserialization hard task, rebalanced difficulty

- Convert to 2-step episode: Phase 1=request_file (+0.20), Phase 2=bug review
- Replace python-sql-injection (hard) with python-pickle-deserialization (RCE)
to properly challenge LLMs below 0.80 baseline
- Add per-task keyword_target_override to grader for fair js-auth scoring
- Add conversation history to inference.py LLM calls for multi-turn context
- Fix parse_json_from_llm to scan for last valid JSON object (ignores code blocks)
- Clamp episode score to [0.0, 1.0] in END log
- Update openenv.yaml: max_steps=2, two-phase action space documented
- Rewrite README: multi-step walkthrough, updated baseline scores, reward table

Files changed (8)

README.md +79 -33
inference.py +54 -27
openenv.yaml +15 -17
output.txt +13 -0
server/environment.py +21 -2
server/grader.py +2 -1
server/models.py +8 -7
server/tasks.py +22 -21

README.md CHANGED

@@ -5,13 +5,15 @@ colorFrom: gray
 colorTo: purple
 sdk: docker
 pinned: false
 ---
 # Code Security Review — OpenEnv Environment
 An RL environment for training AI agents to perform real-world code security review.
-Agents analyze code snippets from production pull requests and identify bugs,
-vulnerabilities, and security issues.
 Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
@@ -23,9 +25,9 @@ Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
 |---|---|
 | Tasks | 3 (easy → medium → hard) |
 | Languages | Python, JavaScript |
-| Action space | Structured JSON (6 fields) |
-| Reward range | 0.0 – 1.0 |
-| Steps per episode | 1 |
 ---
@@ -35,65 +37,109 @@ Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
 |---|---|---|---|
 | `python-off-by-one` | Python | Off-by-one index error | Easy |
 | `js-auth-privilege` | JavaScript | Logic flaw — privilege escalation | Medium |
-| `python-sql-injection` | Python | SQL injection via f-string | Hard |
 ---
-## Action Space
-The agent submits a JSON action with these fields:
 ```json
 {
   "bug_identified": true,
   "bug_location": "line 3 — range(len(transactions) + 1)",
-  "bug_type": "logic-error",
   "bug_description": "Off-by-one error causes IndexError on last iteration...",
   "severity": "medium",
   "suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
 }
 ```
 ## Observation Space
 ```json
 {
-  "task_id": "python-sql-injection",
   "language": "Python",
   "difficulty": "hard",
-  "code_snippet": "def search_users(db, search_term):\n    ...",
-  "context": "REST API endpoint that searches users by name",
-  "pr_title": "Add user search endpoint to REST API",
-  "file_path": "api/users.py"
 }
 ```
 ---
 ## Reward Breakdown
-| Component | Max Score |
-|---|---|
-| Bug identified | 0.20 |
-| Bug type correct | 0.20 |
-| Bug location correct | 0.10 |
-| Description quality | 0.25 |
-| Fix quality | 0.15 |
-| Severity correct | 0.10 |
-| **Total** | **1.00** |
-The grader penalises keyword stuffing — incoherent keyword dumps score ≤ 0.20.
 **Example Calculation:**
-If the agent correctly identifies a bug (+0.20), misidentifies the type (+0.0), finds 50% of the location keywords (+0.05), writes a detailed and coherent description matching most keywords (+0.25), suggests a partially correct fix (+0.08), and gets the severity correct (+0.10), the total reward for that step would be `0.20 + 0.0 + 0.05 + 0.25 + 0.08 + 0.10 = 0.68`.
 ---
 ## Edge Cases
-- **At step 0:** `reset()` must be called to initialize the state. If `step()` is called before `reset()`, the environment automatically calls `reset()` internally and evaluates the action on a random task.
-- **Max step limit:** The maximum step limit is 1. Calling `step()` evaluates the action and immediately sets `done=True`.
-- **At done=True:** Calling `step()` returns `reward=0.0`, `done=True`, and a clean error message in the `info` dict `("Episode already completed. Call /reset...")` indicating the episode is complete without auto-resetting.
 ---
@@ -103,7 +149,7 @@ If the agent correctly identifies a bug (+0.20), misidentifies the type (+0.0),
 |---|---|---|
 | GET | `/` | Health check |
 | POST | `/reset?task_id=<id>` | Reset environment, returns observation |
-| POST | `/step` | Submit action, returns reward |
 | GET | `/state` | Current episode state |
 | GET | `/tasks` | List all tasks |
@@ -130,9 +176,9 @@ uvicorn server.app:app --host 0.0.0.0 --port 8000
 ## Running Inference
 ```bash
-export API_BASE_URL="https://api.openai.com/v1"
-export MODEL_NAME="gpt-4o-mini"
-export HF_TOKEN="your-api-key"
 export ENV_URL="http://localhost:8000"
 python inference.py

 colorTo: purple
 sdk: docker
 pinned: false
+tags:
+  - openenv
 ---
 # Code Security Review — OpenEnv Environment
 An RL environment for training AI agents to perform real-world code security review.
+Agents analyze code from production pull requests across a **two-phase** multi-step
+workflow: first discovering the hidden file, then identifying the vulnerability.
 Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
 |---|---|
 | Tasks | 3 (easy → medium → hard) |
 | Languages | Python, JavaScript |
+| Action space | Phase 1: `{"request_file": true}` / Phase 2: Structured JSON (6 fields) |
+| Reward range | 0.0 – 1.0 (clamped) |
+| Steps per episode | 2 (max) |
 ---
 |---|---|---|---|
 | `python-off-by-one` | Python | Off-by-one index error | Easy |
 | `js-auth-privilege` | JavaScript | Logic flaw — privilege escalation | Medium |
+| `python-pickle-deserialization` | Python | Insecure deserialization (RCE) | Hard |
 ---
+## Two-Phase Episode Walkthrough
+The agent operates in a **2-step sequential workflow** that mirrors a real AppSec triage process:
+**Step 1 — File Discovery** (`+0.20`)
+The agent receives only the PR title and file path. The code is hidden. The agent must request access:
+```json
+{"request_file": true}
+```
+The environment unlocks the code snippet and returns it in the observation.
+**Step 2 — Security Review** (up to `+0.80`)
+The agent analyses the code and submits a structured JSON finding:
 ```json
 {
   "bug_identified": true,
   "bug_location": "line 3 — range(len(transactions) + 1)",
+  "bug_type": "off-by-one",
   "bug_description": "Off-by-one error causes IndexError on last iteration...",
   "severity": "medium",
   "suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
 }
 ```
+---
+## Action Space
+### Phase 1 — File Request
+```json
+{"request_file": true}
+```
+### Phase 2 — Bug Review
+| Field | Type | Values |
+|---|---|---|
+| `bug_identified` | bool | `true` / `false` |
+| `bug_location` | string | location description |
+| `bug_type` | string | `off-by-one` \| `logic-error` \| `security-vulnerability` \| `none` |
+| `bug_description` | string | detailed vulnerability explanation |
+| `severity` | string | `none` \| `low` \| `medium` \| `high` \| `critical` |
+| `suggested_fix` | string | how to fix the bug |
 ## Observation Space
 ```json
 {
+  "task_id": "python-pickle-deserialization",
   "language": "Python",
   "difficulty": "hard",
+  "code_snippet": "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>",
+  "context": "Background worker loading serialized state via network payload",
+  "pr_title": "Add state persistence layer for distributed workers",
+  "file_path": "worker/state.py"
 }
 ```
+After `request_file`, `code_snippet` contains the actual source code.
 ---
 ## Reward Breakdown
+| Step | Component | Max Score |
+|---|---|---|
+| 1 | File request granted | 0.20 |
+| 2 | Bug identified | 0.20 |
+| 2 | Bug type correct | 0.20 |
+| 2 | Bug location correct | 0.10 |
+| 2 | Description quality | 0.25 |
+| 2 | Fix quality | 0.15 |
+| 2 | Severity correct | 0.10 |
+| **Total** | | **1.00** (after clamping; the components above sum to 1.20) |
+The grader penalises keyword stuffing — incoherent keyword dumps score ≤ 0.20 on the description component.
+Episode total reward is **clamped to [0.0, 1.0]**.
 **Example Calculation:**
+Agent requests file (+0.20), correctly identifies bug (+0.20), correct type (+0.20),
+finds 50% location keywords (+0.05), writes good description (+0.20),
+suggests partial fix (+0.08), correct severity (+0.10): raw total `0.20+0.20+0.20+0.05+0.20+0.08+0.10 = 1.03` → clamped to `1.00`.
 ---
 ## Edge Cases
+- **At step 0:** `reset()` must be called first. Calling `step()` without a reset triggers auto-reset.
+- **Phase 1 skip:** If the agent skips `request_file` and submits a review directly on step 1, it receives no intermediate reward and the code snippet used for grading may be hidden.
+- **Max step limit:** Episode ends at `done=True` when a bug review is submitted or `max_steps=2` is reached.
+- **At done=True:** Calling `step()` returns `reward=0.0`, `done=True`, and `info["error"]` indicating the episode is complete.
+---
+## Baseline Scores
+| Task | Difficulty | Model | Score | Steps | Notes |
+|------|-----------|-------|-------|-------|-------|
+| python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | File request + review |
+| js-auth-privilege | medium | Llama-3.3-70B-Instruct | 0.900 | 2 | File request + review |
+| python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | TBD | 2 | Requires RCE/deserialization knowledge |
 ---
 |---|---|---|
 | GET | `/` | Health check |
 | POST | `/reset?task_id=<id>` | Reset environment, returns observation |
+| POST | `/step` | Submit action (Phase 1 or Phase 2), returns reward |
 | GET | `/state` | Current episode state |
 | GET | `/tasks` | List all tasks |
 ## Running Inference
 ```bash
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
+export HF_TOKEN="hf_your_token_here"
 export ENV_URL="http://localhost:8000"
 python inference.py
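
The reward arithmetic in the README's example calculation can be checked with a short script. This is a minimal sketch, not code from the repository: the dictionary names and the `clamp_episode_score` helper are hypothetical, and only the component values and the clamp-and-round expression come from the diff.

```python
# Maximum score per component, as listed in the README reward table.
max_rewards = {
    "file_request": 0.20,
    "bug_identified": 0.20,
    "bug_type": 0.20,
    "bug_location": 0.10,
    "description_quality": 0.25,
    "fix_quality": 0.15,
    "severity": 0.10,
}

# Component scores from the README's example calculation.
example_rewards = {
    "file_request": 0.20,
    "bug_identified": 0.20,
    "bug_type": 0.20,
    "bug_location": 0.05,
    "description_quality": 0.20,
    "fix_quality": 0.08,
    "severity": 0.10,
}


def clamp_episode_score(raw: float) -> float:
    """Clamp to [0.0, 1.0] and round to 3 places, as the END log line does."""
    return round(min(1.0, max(0.0, raw)), 3)


raw_max = round(sum(max_rewards.values()), 2)        # unclamped ceiling
raw_total = round(sum(example_rewards.values()), 2)  # unclamped example total
final_score = clamp_episode_score(raw_total)
print(raw_max, raw_total, final_score)
```

The unclamped sums exceed 1.0, which is why the episode score is clamped before it is logged.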

inference.py CHANGED

@@ -30,19 +30,22 @@ BENCHMARK    = "code-security-review"
 SYSTEM_PROMPT = """You are a senior security-focused code reviewer.
-When given a code snippet, carefully analyse it for bugs and security issues.
-Respond with ONLY a valid JSON object — no markdown, no explanation outside the JSON.
-Schema:
 {
   "bug_identified": true or false,
   "bug_location": "exact location (function name, line description, variable, expression)",
   "bug_type": "off-by-one | logic-error | security-vulnerability | none",
   "bug_description": "detailed explanation of why this is a bug and the impact",
   "severity": "none | low | medium | high | critical",
-  "suggested_fix": "the corrected code snippet or a precise description of the fix"
-}"""
 # ── Logging Helpers ───────────────────────────────────────────────────────────
@@ -73,14 +76,26 @@ def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = No
 def parse_json_from_llm(text: str) -> dict:
-    """Robustly extract JSON from LLM output."""
     text = text.strip()
-    text = re.sub(r"^```(?:json)?\s*", "", text)
-    text = re.sub(r"\s*```$", "", text)
-    # If the LLM still included text around the JSON, try to find the first { and last }
-    match = re.search(r"({.*})", text, re.DOTALL)
-    if match:
-        text = match.group(1)
     try:
         return json.loads(text)
     except Exception:
@@ -115,8 +130,10 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
         reset_resp = env_post("/reset", params={"task_id": task_id})
         obs = reset_resp["observation"]
-        max_steps = 1
         error = None
         while not done and step_num < max_steps:
             step_num += 1
@@ -126,7 +143,11 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
             # ── LLM call ──────────────────────────────────────────────────────────
             try:
                 if client is None:
-                    if task_id == "python-off-by-one":
                         action_dict = {
                             "bug_identified": True,
                             "bug_location": "line 3",
@@ -142,31 +163,36 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
                             "bug_type": "logic-error",
                             "bug_description": "logic operator || bypass escalation authorization bypass access",
                             "severity": "critical",
-                            "suggested_fix": "user.role === \"admin\" && user.isActive",
                         }
                     else:
                         action_dict = {
                             "bug_identified": True,
-                            "bug_location": "line 2",
                             "bug_type": "security-vulnerability",
-                            "bug_description": "f-string SQLi injection-flaw raw-sql SQL-interpolation",
                             "severity": "critical",
-                            "suggested_fix": "parameterized query bind variables",
                         }
                     action_str = json.dumps(action_dict)
                     error = None
                 else:
                     response = client.chat.completions.create(
                         model=MODEL_NAME,
-                        messages=[
-                            {"role": "system", "content": SYSTEM_PROMPT},
-                            {"role": "user",   "content": prompt},
-                        ],
                         temperature=0.1,
                         max_tokens=600,
                         stream=False,
                     )
                     raw = response.choices[0].message.content
                     action_dict = parse_json_from_llm(raw)
                     action_str = json.dumps(action_dict)
                     error = None
@@ -187,17 +213,18 @@ def run_task(task_id: str, task_num: int, client=None) -> dict:
             reward = step_resp["reward"]
             done   = step_resp["done"]
             obs    = step_resp.get("observation")
             all_rewards.append(reward)
             cumulative_reward += reward
             log_step(step=step_num, action=action_str, reward=reward, done=done, error=error)
         success = cumulative_reward >= 0.8
     except Exception as exc:
         print(f"[ERROR] Exception during run_task: {exc}", flush=True)
     finally:
-        log_end(success=success, steps=step_num, score=cumulative_reward, rewards=all_rewards)
     return {
         "task_num":        task_num,
@@ -225,7 +252,7 @@ def main():
     all_tasks = [
         ("python-off-by-one", 1, "easy"),
         ("js-auth-privilege", 2, "medium"),
-        ("python-sql-injection", 3, "hard"),
     ]
     if TASK_FILTER:

 SYSTEM_PROMPT = """You are a senior security-focused code reviewer.
+You are interacting with a multi-step environment. At first, the code snippet will be HIDDEN.
+To request the file contents, you must output EXACTLY this JSON (no other text):
+{"request_file": true}
+Once you have requested the file and read the code snippet, carefully analyse it for bugs and security issues.
+To submit your final review, respond with ONLY a valid JSON object matching this schema (no code blocks, no prose):
 {
   "bug_identified": true or false,
   "bug_location": "exact location (function name, line description, variable, expression)",
   "bug_type": "off-by-one | logic-error | security-vulnerability | none",
   "bug_description": "detailed explanation of why this is a bug and the impact",
   "severity": "none | low | medium | high | critical",
+  "suggested_fix": "description of fix (do NOT include code blocks inside this string)"
+}
+IMPORTANT: Your entire response must be parseable JSON. Do not wrap in markdown fences. Do not add any text outside the JSON object."""
 # ── Logging Helpers ───────────────────────────────────────────────────────────
 def parse_json_from_llm(text: str) -> dict:
+    """Robustly extract JSON from LLM output.
+    Strategy: strip markdown fences, then try to find the LAST top-level
+    JSON object in the text (after the LLM has potentially emitted code examples).
+    """
     text = text.strip()
+    # Strip ```json ... ``` and ``` ... ``` fences
+    text = re.sub(r"```(?:json)?\s*", "", text)
+    text = re.sub(r"```", "", text)
+    # Find all top-level {...} objects in the text
+    candidates = re.findall(r"(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})", text, re.DOTALL)
+    # Prefer the LAST candidate that is valid JSON (the review JSON, not a code example)
+    for candidate in reversed(candidates):
+        try:
+            parsed = json.loads(candidate)
+            if isinstance(parsed, dict):
+                return parsed
+        except Exception:
+            continue
+    # Final fallback: try the whole stripped text
     try:
         return json.loads(text)
     except Exception:
         reset_resp = env_post("/reset", params={"task_id": task_id})
         obs = reset_resp["observation"]
+        max_steps = 2
         error = None
+        file_requested = False
+        messages = []  # conversation history for LLM
         while not done and step_num < max_steps:
             step_num += 1
             # ── LLM call ──────────────────────────────────────────────────────────
             try:
                 if client is None:
+                    # Deterministic fallback: first request the file, then review
+                    if not file_requested:
+                        action_dict = {"request_file": True}
+                        file_requested = True
+                    elif task_id == "python-off-by-one":
                         action_dict = {
                             "bug_identified": True,
                             "bug_location": "line 3",
                             "bug_type": "logic-error",
                             "bug_description": "logic operator || bypass escalation authorization bypass access",
                             "severity": "critical",
+                            "suggested_fix": 'user.role === "admin" && user.isActive',
                         }
                     else:
                         action_dict = {
                             "bug_identified": True,
+                            "bug_location": "line 4",
                             "bug_type": "security-vulnerability",
+                            "bug_description": "deserialization pickle rce arbitrary code execution loads magic exploit un-serialize cve untrusted payload",
                             "severity": "critical",
+                            "suggested_fix": "json.loads or safe_load",
                         }
                     action_str = json.dumps(action_dict)
                     error = None
                 else:
+                    # Multi-turn: build conversation history
+                    if not messages:
+                        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+                    messages.append({"role": "user", "content": prompt})
                     response = client.chat.completions.create(
                         model=MODEL_NAME,
+                        messages=messages,
                         temperature=0.1,
                         max_tokens=600,
                         stream=False,
                     )
                     raw = response.choices[0].message.content
+                    # Add assistant reply to history for next turn
+                    messages.append({"role": "assistant", "content": raw})
                     action_dict = parse_json_from_llm(raw)
                     action_str = json.dumps(action_dict)
                     error = None
             reward = step_resp["reward"]
             done   = step_resp["done"]
             obs    = step_resp.get("observation")
             all_rewards.append(reward)
             cumulative_reward += reward
             log_step(step=step_num, action=action_str, reward=reward, done=done, error=error)
         success = cumulative_reward >= 0.8
     except Exception as exc:
         print(f"[ERROR] Exception during run_task: {exc}", flush=True)
     finally:
+        clamped_score = round(min(1.0, max(0.0, cumulative_reward)), 3)
+        log_end(success=success, steps=step_num, score=clamped_score, rewards=all_rewards)
     return {
         "task_num":        task_num,
     all_tasks = [
         ("python-off-by-one", 1, "easy"),
         ("js-auth-privilege", 2, "medium"),
+        ("python-pickle-deserialization", 3, "hard"),
     ]
     if TASK_FILTER:
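
The "last valid JSON object" strategy in `parse_json_from_llm` can be exercised on its own. The sketch below reuses the regexes shown in the diff; the final `return {}` is an assumption, because the diff does not show what the real function returns when every parse attempt fails.

```python
import json
import re


def parse_json_from_llm(text: str) -> dict:
    """Return the last parseable top-level JSON object found in the text."""
    text = text.strip()
    # Drop markdown fence markers but keep whatever was inside them.
    text = re.sub(r"```(?:json)?\s*", "", text)
    text = re.sub(r"```", "", text)
    # Match {...} spans, allowing one level of nested braces.
    candidates = re.findall(r"(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})", text, re.DOTALL)
    # Scan from the end so a trailing review object wins over earlier examples.
    for candidate in reversed(candidates):
        try:
            parsed = json.loads(candidate)
            if isinstance(parsed, dict):
                return parsed
        except Exception:
            continue
    try:
        return json.loads(text)
    except Exception:
        return {}  # assumption: the real fallback value is not shown in the diff


raw_reply = (
    "Here is an example:\n```python\nconfig = {\"debug\": True}\n```\n"
    "Final answer:\n{\"bug_identified\": true, \"severity\": \"high\"}"
)
print(parse_json_from_llm(raw_reply))
```

One limitation to keep in mind: the pattern counts braces without regard to JSON string boundaries, so a string value containing `{` or `}` may split or hide a candidate.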

openenv.yaml CHANGED

@@ -17,41 +17,38 @@ tasks:
     name: "Python Off-by-One Error"
     description: "Identify an off-by-one index error in a Python finance batch processor"
     difficulty: easy
-    max_steps: 1
     reward_range: [0.0, 1.0]
   - id: js-auth-privilege
     name: "JavaScript Auth Logic Flaw"
     description: "Identify a privilege escalation vulnerability in Node.js auth middleware"
     difficulty: medium
-    max_steps: 1
     reward_range: [0.0, 1.0]
-  - id: python-sql-injection
-    name: "Python SQL Injection"
-    description: "Identify an SQL injection vulnerability via f-string in a REST API"
     difficulty: hard
-    max_steps: 1
     reward_range: [0.0, 1.0]
 # The Action space defines the format of the agent's response.
 # Each field is scored by the grader to provide partial progress signals.
 action_space:
   type: object
   properties:
     bug_identified:   { type: boolean, description: "Boolean: true if a bug exists" }
     bug_location:     { type: string, description: "String: Pinpoint the bug's location in code" }
     bug_type:         { type: string, description: "String: off-by-one | logic-error | security-vulnerability | none" }
     bug_description:  { type: string, description: "String: Detailed analysis of the vulnerability" }
     severity:         { type: string, enum: [none, low, medium, high, critical], description: "String: none | low | medium | high | critical" }
     suggested_fix:    { type: string, description: "String: How to fix the identified bug" }
-  required:
-    - bug_identified
-    - bug_location
-    - bug_type
-    - bug_description
-    - severity
-    - suggested_fix
 # The Observation space defines what the agent sees at each step.
 # It uses a structured context to help the agent understand the code's purpose.
@@ -71,10 +68,11 @@ reward:
   min: 0.0
   max: 1.0
   description: >
-    Partial rewards for: bug identification (0.20), correct bug type (0.20),
-    precise location (0.10), description quality (0.25, keyword density),
-    fix quality (0.15, keyword density), correct severity (0.10).
-    Grader penalizes keyword stuffing.
 endpoints:
   health: GET /

     name: "Python Off-by-One Error"
     description: "Identify an off-by-one index error in a Python finance batch processor"
     difficulty: easy
+    max_steps: 2
     reward_range: [0.0, 1.0]
   - id: js-auth-privilege
     name: "JavaScript Auth Logic Flaw"
     description: "Identify a privilege escalation vulnerability in Node.js auth middleware"
     difficulty: medium
+    max_steps: 2
     reward_range: [0.0, 1.0]
+  - id: python-pickle-deserialization
+    name: "Python Pickle Deserialization"
+    description: "Identify an insecure deserialization vulnerability using pickle in a background worker"
     difficulty: hard
+    max_steps: 2
     reward_range: [0.0, 1.0]
 # The Action space defines the format of the agent's response.
 # Each field is scored by the grader to provide partial progress signals.
 action_space:
   type: object
+  description: >
+    Two-phase action space. Phase 1: submit {"request_file": true} to unlock
+    the code snippet (+0.20 reward). Phase 2: submit a full review JSON.
   properties:
+    request_file:     { type: boolean, description: "Phase 1: Request the hidden file contents" }
     bug_identified:   { type: boolean, description: "Boolean: true if a bug exists" }
     bug_location:     { type: string, description: "String: Pinpoint the bug's location in code" }
     bug_type:         { type: string, description: "String: off-by-one | logic-error | security-vulnerability | none" }
     bug_description:  { type: string, description: "String: Detailed analysis of the vulnerability" }
     severity:         { type: string, enum: [none, low, medium, high, critical], description: "String: none | low | medium | high | critical" }
     suggested_fix:    { type: string, description: "String: How to fix the identified bug" }
 # The Observation space defines what the agent sees at each step.
 # It uses a structured context to help the agent understand the code's purpose.
   min: 0.0
   max: 1.0
   description: >
+    Step 1 — File request: +0.20 (flat, always granted).
+    Step 2 — Bug review: partial rewards for bug identification (0.20),
+    correct bug type (0.20), precise location (0.10), description quality (0.25,
+    keyword density), fix quality (0.15), correct severity (0.10).
+    Episode total is clamped to [0.0, 1.0]. Grader penalizes keyword stuffing.
 endpoints:
   health: GET /

output.txt ADDED

	@@ -0,0 +1,13 @@

+[INFO] Initializing inference on code-security-review using meta-llama/Llama-3.3-70B-Instruct
+[WARN] Client init failed: HF_TOKEN or API_KEY must be set.. Using deterministic fallback.
+[START] task=python-off-by-one env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
+[STEP] step=1 action={"bug_identified": true, "bug_location": "line 3", "bug_type": "off-by-one", "bug_description": "loop range(len(transactions) + 1) index error off-by-one out of bounds error", "severity": "medium", "suggested_fix": "range(len(transactions))"} reward=0.92 done=true error=null
+[END] success=true steps=1 score=0.917 rewards=0.92
+[START] task=js-auth-privilege env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
+[STEP] step=1 action={"bug_identified": true, "bug_location": "line 3", "bug_type": "logic-error", "bug_description": "logic operator || bypass escalation authorization bypass access", "severity": "critical", "suggested_fix": "user.role === \"admin\" && user.isActive"} reward=0.91 done=true error=null
+[END] success=true steps=1 score=0.912 rewards=0.91
+[START] task=python-sql-injection env=code-security-review model=meta-llama/Llama-3.3-70B-Instruct
+[STEP] step=1 action={"bug_identified": true, "bug_location": "line 2", "bug_type": "security-vulnerability", "bug_description": "f-string SQLi injection-flaw raw-sql SQL-interpolation", "severity": "critical", "suggested_fix": "parameterized query bind variables"} reward=0.92 done=true error=null
+[END] success=true steps=1 score=0.920 rewards=0.92
+[SUMMARY] avg_reward=0.916 tasks_passed=3/3

server/environment.py CHANGED

@@ -70,6 +70,22 @@ class CodeSecurityEnv:
                 info={"error": ERROR_EPISODE_COMPLETED},
             )
         try:
             reward, breakdown = grade_action(action.model_dump(), self.current_task)
         except Exception as e:
@@ -77,7 +93,7 @@ class CodeSecurityEnv:
         self.step_count += 1
         self.total_reward += reward
-        self.done = True  # single-step environment
         return StepResult(
             observation=self._make_observation(),
@@ -106,11 +122,14 @@ class CodeSecurityEnv:
         if not t:
             raise KeyError("Attempted observation render without an initialized active task")
         return Observation(
             task_id=t["id"],
             language=t["language"],
             difficulty=t["difficulty"],
-            code_snippet=t["code_snippet"],
             context=t["context"],
             pr_title=t["pr_title"],
             file_path=t["file_path"],

                 info={"error": ERROR_EPISODE_COMPLETED},
             )
+        # Intermediate Step: Request file
+        if getattr(action, "request_file", False):
+            self.step_count += 1
+            reward = 0.20
+            self.total_reward += reward
+            self.done = False
+            return StepResult(
+                observation=self._make_observation(),
+                reward=reward,
+                done=self.done,
+                info={
+                    "task_name": self.current_task.get("name", "Unknown Task") if self.current_task else "Unknown Task",
+                    "step_count": self.step_count
+                },
+            )
         try:
             reward, breakdown = grade_action(action.model_dump(), self.current_task)
         except Exception as e:
         self.step_count += 1
         self.total_reward += reward
+        self.done = True  # a submitted review ends the episode (at most 2 steps)
         return StepResult(
             observation=self._make_observation(),
         if not t:
             raise KeyError("Attempted observation render without an initialized active task")
+        # Hide the snippet before Step 1
+        snippet = t["code_snippet"] if self.step_count > 0 else "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>"
         return Observation(
             task_id=t["id"],
             language=t["language"],
             difficulty=t["difficulty"],
+            code_snippet=snippet,
             context=t["context"],
             pr_title=t["pr_title"],
             file_path=t["file_path"],
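
The two-phase control flow added to `CodeSecurityEnv.step()` can be summarized with a stripped-down model. This is a sketch under stated assumptions: `SketchEnv` and its `review_reward` parameter are hypothetical stand-ins (the real environment calls `grade_action` and returns `StepResult` objects), while the `+0.20` file-request reward, the `step_count > 0` visibility rule, and the zero reward after `done` come from the diff.

```python
from dataclasses import dataclass
from typing import Tuple

HIDDEN = '<FILE CONTENTS HIDDEN - Submit {"request_file": true} to view>'
FILE_REQUEST_REWARD = 0.20


@dataclass
class SketchEnv:
    """Hypothetical, simplified model of the two-phase episode."""
    code_snippet: str
    step_count: int = 0
    total_reward: float = 0.0
    done: bool = False

    def visible_snippet(self) -> str:
        # The snippet stays hidden until at least one step has been taken.
        return self.code_snippet if self.step_count > 0 else HIDDEN

    def step(self, action: dict, review_reward: float = 0.0) -> Tuple[float, bool]:
        if self.done:
            # Finished episodes return zero reward and stay done.
            return 0.0, True
        self.step_count += 1
        if action.get("request_file"):
            reward = FILE_REQUEST_REWARD  # Phase 1: episode continues
        else:
            reward = review_reward        # Phase 2: stand-in for grade_action()
            self.done = True
        self.total_reward += reward
        return reward, self.done


env = SketchEnv(code_snippet="import pickle")
print(env.visible_snippet())
print(env.step({"request_file": True}))
print(env.step({"bug_identified": True}, review_reward=0.7))
```

In this model, as in the diff, a review submitted on the first step ends the episode immediately without the file-request reward.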

server/grader.py CHANGED

@@ -68,8 +68,9 @@ def grade_action(action: Dict[str, Any], task: Dict[str, Any]) -> Tuple[float, D
         desc_score = 0.0
         if len(description) >= 20:
             task_keywords = task["keywords"]
             matched_kw = [kw for kw in task_keywords if kw in description]
-            desc_score = round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched_kw) / KEYWORD_HIT_TARGET)), 4)
         breakdown["description_quality"] = desc_score
         reward += desc_score

         desc_score = 0.0
         if len(description) >= 20:
             task_keywords = task["keywords"]
+            target = task.get("keyword_target_override", KEYWORD_HIT_TARGET)
             matched_kw = [kw for kw in task_keywords if kw in description]
+            desc_score = round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched_kw) / target)), 4)
         breakdown["description_quality"] = desc_score
         reward += desc_score
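
The effect of `keyword_target_override` on the description score is easiest to see in isolation. In the sketch below, `SCORE_DESC_QUALITY = 0.25` matches the README reward table, but `KEYWORD_HIT_TARGET = 3` is a hypothetical value chosen for illustration, since the diff does not show the real constant. The `description_score` wrapper is also hypothetical; only its scoring expression is copied from the hunk.

```python
SCORE_DESC_QUALITY = 0.25  # maximum for this component, per the README table
KEYWORD_HIT_TARGET = 3     # hypothetical default; the real value is not in the diff


def description_score(description: str, task: dict) -> float:
    """Score a description by keyword hits relative to a per-task target."""
    if len(description) < 20:
        return 0.0
    # A task may lower (or raise) the number of hits needed for full marks.
    target = task.get("keyword_target_override", KEYWORD_HIT_TARGET)
    matched = [kw for kw in task["keywords"] if kw in description]
    return round(min(SCORE_DESC_QUALITY, SCORE_DESC_QUALITY * (len(matched) / target)), 4)


task_default = {"keywords": ["pickle", "rce", "untrusted"]}
task_override = {**task_default, "keyword_target_override": 1.0}
desc = "pickle.loads on untrusted input"

print(description_score(desc, task_default))   # two of three target hits
print(description_score(desc, task_override))  # capped at the component maximum
```

With an override of `1.0`, a single matched keyword already earns the full component score, which is how the commit relaxes scoring for `js-auth-privilege`.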

server/models.py CHANGED

@@ -6,14 +6,15 @@ from pydantic import BaseModel, Field
 # ── Agent Action ──────────────────────────────────────────────────────────────
 class CodeReviewAction(BaseModel):
-    """Action taken by the agent: a structured code review."""
-    bug_identified: bool = Field(..., description="Whether a bug was found")
-    bug_location: str = Field(..., description="Location of the bug (function, line, variable)")
-    bug_type: str = Field(..., description="Type: off-by-one | logic-error | security-vulnerability | none")
-    bug_description: str = Field(..., description="Detailed explanation of why this is a bug")
-    severity: str = Field(..., description="Severity: none | low | medium | high | critical")
-    suggested_fix: str = Field(..., description="The corrected code or a description of how to fix it")
 # ── Observation ───────────────────────────────────────────────────────────────

 # ── Agent Action ──────────────────────────────────────────────────────────────
 class CodeReviewAction(BaseModel):
+    """Action taken by the agent: a structured code review or a file request."""
+    request_file: Optional[bool] = Field(None, description="Request the file contents")
+    bug_identified: Optional[bool] = Field(None, description="Whether a bug was found")
+    bug_location: Optional[str] = Field(None, description="Location of the bug (function, line, variable)")
+    bug_type: Optional[str] = Field(None, description="Type: off-by-one | logic-error | security-vulnerability | none")
+    bug_description: Optional[str] = Field(None, description="Detailed explanation of why this is a bug")
+    severity: Optional[str] = Field(None, description="Severity: none | low | medium | high | critical")
+    suggested_fix: Optional[str] = Field(None, description="The corrected code or a description of how to fix it")
 # ── Observation ───────────────────────────────────────────────────────────────

server/tasks.py CHANGED

@@ -69,39 +69,40 @@ TASKS: Dict[str, Any] = {
         "fix_patterns": [
             "user.role === \"admin\" && user.isActive",
             "&& user.isActive",
-            "throw new Error(\"Unauthorized\")"
         ],
     },
-    "python-sql-injection": {
-        "id": "python-sql-injection",
-        "name": "Python SQL Injection",
         "language": "Python",
         "difficulty": "hard",
-        "bug_class": "SQL injection via f-string",
-        "pr_title": "Add user search endpoint to REST API",
-        "file_path": "api/users.py",
-        "context": "REST API endpoint that searches users by name in a PostgreSQL database",
         "code_snippet": (
-            "def search_users(db, search_term):\n"
-            "    query = f\"SELECT * FROM users WHERE name LIKE '%{search_term}%'\"\n"
-            "    results = db.execute(query)\n"
-            "    return results.fetchall()"
         ),
         "bug_type": "security-vulnerability",
-        "bug_location": "line 2 — f-string interpolation directly in SQL query",
         "severity": "critical",
         "keywords": [
-            "interpolated", "f-string", "SQLi", "vector", "injection-flaw", "binding-hazard",
-            "sanitization-gap", "DBAPI-compliance", "concatenation-pattern", "raw-sql",
-            "prepared-statement-fix", "parameterized-query-binding", "placeholder-syntax",
-            "SQL-interpolation", "driver-protocol", "malicious-input-flow", "exfiltration-risk",
-            "second-order-injection", "blind-sql-injection", "union-based-attack"
         ],
         "fix_patterns": [
-            "execute(query, (search_term,))",
-            "bind variables",
-            "parameterized query"
         ],
     },
 }

         "fix_patterns": [
             "user.role === \"admin\" && user.isActive",
             "&& user.isActive",
+            "throw new Error(\"Unauthorized\")",
+            "return next"
         ],
+        "keyword_target_override": 1.0,
     },
+    "python-pickle-deserialization": {
+        "id": "python-pickle-deserialization",
+        "name": "Python Pickle Deserialization",
         "language": "Python",
         "difficulty": "hard",
+        "bug_class": "Insecure Deserialization",
+        "pr_title": "Add state persistence layer for distributed workers",
+        "file_path": "worker/state.py",
+        "context": "Background worker loading serialized state via network payload",
         "code_snippet": (
+            "import pickle\n\n"
+            "def load_worker_state(payload_bytes):\n"
+            "    state = pickle.loads(payload_bytes)\n"
+            "    return state['config']"
         ),
         "bug_type": "security-vulnerability",
+        "bug_location": "line 4 — pickle.loads() executes arbitrary code during object recreation",
         "severity": "critical",
         "keywords": [
+            "deserialization", "pickle", "loads", "arbitrary", "code execution", "rce",
+            "injection", "untrusted", "payload", "cve", "insecure", "un-serialize",
+            "malicious", "exploit", "magic methods", "reduce"
         ],
         "fix_patterns": [
+            "json.loads",
+            "hmac",
+            "signatures",
+            "safe_load"
         ],
     },
 }
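
For reference, the task's `fix_patterns` (`json.loads`, `hmac`, `signatures`) point at a remediation along the following lines. This is one possible sketch, not the repository's reference fix: the `signature_hex` and `secret_key` parameters are hypothetical additions to the `load_worker_state` signature, and it assumes the worker state can be represented as JSON.

```python
import hashlib
import hmac
import json


def load_worker_state(payload_bytes: bytes, signature_hex: str, secret_key: bytes) -> dict:
    """Verify an HMAC-SHA256 signature, then parse the payload as JSON."""
    expected = hmac.new(secret_key, payload_bytes, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("payload signature mismatch")
    # json.loads only builds plain data types; it never runs code from the payload.
    state = json.loads(payload_bytes)
    return state["config"]


key = b"example-shared-secret"  # hypothetical key, for illustration only
payload = json.dumps({"config": {"workers": 4}}).encode()
signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
print(load_worker_state(payload, signature, key))
```

Unlike `pickle.loads`, which can invoke arbitrary callables while rebuilding objects, this version rejects unsigned payloads and only ever constructs dicts, lists, strings, numbers, booleans, and `None`.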