Nitish committed on
Commit
474eafa
·
1 Parent(s): 742e175

feat: finalize OpenEnv alignment and calibrate rewards for QA

Files changed (11)
  1. Dockerfile +3 -4
  2. README.md +87 -113
  3. inference.py +75 -52
  4. openenv.yaml +50 -43
  5. qa_test.py +237 -0
  6. server/app.py +39 -27
  7. server/environment.py +73 -436
  8. server/grader.py +80 -0
  9. server/models.py +39 -20
  10. server/tasks.py +110 -0
  11. validate.sh +103 -0
Dockerfile CHANGED
@@ -7,11 +7,10 @@ COPY requirements.txt .
7
  RUN pip install --no-cache-dir --upgrade pip && \
8
  pip install --no-cache-dir -r requirements.txt
9
 
10
- # Copy application code
11
- COPY server/ ./server/
12
- COPY static/ ./static/
13
 
14
- # Environment defaults
15
  ENV PORT=7860
16
  ENV PYTHONPATH=/app
17
  ENV ENABLE_WEB_INTERFACE=false
 
7
  RUN pip install --no-cache-dir --upgrade pip && \
8
  pip install --no-cache-dir -r requirements.txt
9
 
10
+ # Copy all project files (needed for `openenv validate` to run inside the container)
11
+ COPY . .
 
12
 
13
+ # Environment defaults (Hugging Face Spaces use 7860)
14
  ENV PORT=7860
15
  ENV PYTHONPATH=/app
16
  ENV ENABLE_WEB_INTERFACE=false
README.md CHANGED
@@ -1,156 +1,130 @@
1
  ---
2
- title: Code Review Env
3
- emoji: 🏃
4
- colorFrom: red
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- ---
9
- # Code Security Review — OpenEnv
10
 
11
- > An RL environment for training AI agents to detect bugs and security
12
- > vulnerabilities in Python code.
13
 
14
- ## Motivation
 
 
 
 
 
 
15
 
16
- Code review is one of the highest-leverage tasks in software engineering, yet it
17
- remains bottlenecked on human attention. This environment trains agents to catch
18
- real bug categories — from simple off-by-one errors to critical SQL injection
19
- vulnerabilities — using structured, deterministic reward signals.
 
 
 
 
 
20
 
21
  ---
22
 
23
  ## Action Space
24
 
25
- | Field | Type | Description |
26
- |---|---|---|
27
- | `bug_identified` | bool | Whether a bug was found |
28
- | `bug_location` | string | Exact location (function, expression) |
29
- | `bug_type` | string | `off-by-one`, `logic-error`, `security-vulnerability`, `none` |
30
- | `bug_description` | string | Explanation of the bug and its impact |
31
- | `severity` | string | `none` / `low` / `medium` / `high` / `critical` |
32
- | `suggested_fix` | string | Corrected code or fix description |
 
 
 
 
33
 
34
  ## Observation Space
35
 
36
- | Field | Type | Description |
37
- |---|---|---|
38
- | `code_snippet` | string | The code to review |
39
- | `language` | string | Programming language |
40
- | `task_description` | string | What the code is supposed to do |
41
- | `task_id` | string | Unique task identifier |
42
- | `difficulty` | string | `easy` / `medium` / `hard` |
43
- | `step_number` | int | Current step within the episode |
44
- | `max_steps` | int | Maximum steps allowed (3) |
45
- | `previous_feedback` | string? | Feedback from prior step |
 
46
 
47
  ---
48
 
49
- ## Tasks
50
 
51
- ### Easy — Off-by-one in array traversal
52
- - **Code:** `sum_elements(arr)` iterates `range(1, len(arr)+1)` causing `IndexError`
53
- - **Expected bug type:** `off-by-one`
54
- - **Expected severity:** `high`
55
- - **Baseline score:** ~0.72
 
 
 
 
56
 
57
- ### Medium — Authentication logic flaw
58
- - **Code:** `authenticate_user()` uses `or` instead of `and` for admin check
59
- - **Expected bug type:** `logic-error`
60
- - **Expected severity:** `critical`
61
- - **Baseline score:** ~0.60
62
 
63
- ### Hard — SQL injection via f-string
64
- - **Code:** `fetch_records()` interpolates `user_id` and `sort_column` directly into SQL
65
- - **Expected bug type:** `security-vulnerability`
66
- - **Expected severity:** `critical`
67
- - **Baseline score:** ~0.55
68
 
69
  ---
70
 
71
- ## Reward Function
 
 
 
 
 
 
72
 
73
- Rewards are deterministic and provide partial progress signal:
74
 
75
- | Component | Max Score | Description |
76
  |---|---|---|
77
- | Bug identified | 0.20 | Correctly flags presence/absence of bug |
78
- | Bug type | 0.20 | Correct category of bug |
79
- | Bug location | 0.10 | Precise location identified |
80
- | Description quality | 0.25 | Keyword density in explanation |
81
- | Fix quality | 0.15 | Correct fix keywords present |
82
- | Severity | 0.10 | Correct severity level |
83
- | **Total** | **1.00** | |
84
 
85
  ---
86
 
87
  ## Setup
88
 
89
- ### 1. Build and run Docker
90
 
91
  ```bash
92
- docker build -t code-review-env .
93
- docker run -p 7860:7860 code-review-env
94
  ```
95
 
96
- ### 2. Run inference baseline
97
 
98
  ```bash
99
- # Set your environment variables
100
- export HF_TOKEN=hf_your_token_here
101
- export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
102
- export API_BASE_URL=https://router.huggingface.co/v1
103
- export ENV_BASE_URL=http://localhost:7860
104
-
105
- # Install dependencies
106
  pip install -r requirements.txt
107
-
108
- # Run
109
- python inference.py
110
- ```
111
-
112
- ### 3. Validate (OpenEnv CLI)
113
-
114
- ```bash
115
- openenv validate
116
  ```
117
 
118
  ---
119
 
120
- ## API Endpoints
121
-
122
- | Method | Path | Description |
123
- |---|---|---|
124
- | GET | `/health` | Health check |
125
- | POST | `/reset?difficulty=easy` | Reset environment |
126
- | POST | `/step` | Submit a review action |
127
- | GET | `/state` | Current episode state |
128
-
129
- ---
130
-
131
- ## Baseline Scores
132
-
133
- | Task | Difficulty | Reward |
134
- |---|---|---|
135
- | Off-by-one detection | Easy | ~0.72 |
136
- | Auth logic flaw | Medium | ~0.60 |
137
- | SQL injection | Hard | ~0.55 |
138
- | **Average** | | **~0.62** |
139
-
140
- ---
141
 
142
- ## Project Structure
 
 
 
 
143
 
144
- ```
145
- code-review-env/
146
- ├── Dockerfile
147
- ├── openenv.yaml
148
- ├── requirements.txt
149
- ├── inference.py
150
- ├── README.md
151
- └── server/
152
- ├── __init__.py
153
- ├── app.py # FastAPI endpoints
154
- ├── environment.py # Tasks + grader logic
155
- └── models.py # Pydantic action/observation/state
156
  ```
 
1
+ # Code Security Review — OpenEnv Environment
2
+
3
+ An RL environment for training AI agents to perform real-world code security review.
4
+ Agents analyze code snippets from production pull requests and identify bugs,
5
+ vulnerabilities, and security issues.
6
+
7
+ Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
8
+
9
  ---
10
 
11
+ ## Environment Overview
 
12
 
13
+ | Field | Value |
14
+ |---|---|
15
+ | Tasks | 3 (easy → medium → hard) |
16
+ | Languages | Python, JavaScript |
17
+ | Action space | Structured JSON (6 fields) |
18
+ | Reward range | 0.0 – 1.0 |
19
+ | Steps per episode | 1 |
20
 
21
+ ---
22
+
23
+ ## Tasks
24
+
25
+ | ID | Language | Bug Class | Difficulty |
26
+ |---|---|---|---|
27
+ | `python-off-by-one` | Python | Off-by-one index error | Easy |
28
+ | `js-auth-privilege` | JavaScript | Logic flaw — privilege escalation | Medium |
29
+ | `python-sql-injection` | Python | SQL injection via f-string | Hard |
30
 
31
  ---
32
 
33
  ## Action Space
34
 
35
+ The agent submits a JSON action with these fields:
36
+
37
+ ```json
38
+ {
39
+ "bug_identified": true,
40
+ "bug_location": "line 3 — range(len(transactions) + 1)",
41
+ "bug_type": "logic-error",
42
+ "bug_description": "Off-by-one error causes IndexError on last iteration...",
43
+ "severity": "medium",
44
+ "suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
45
+ }
46
+ ```
47
 
48
  ## Observation Space
49
 
50
+ ```json
51
+ {
52
+ "task_id": "python-sql-injection",
53
+ "language": "Python",
54
+ "difficulty": "hard",
55
+ "code_snippet": "def search_users(db, search_term):\n ...",
56
+ "context": "REST API endpoint that searches users by name",
57
+ "pr_title": "Add user search endpoint to REST API",
58
+ "file_path": "api/users.py"
59
+ }
60
+ ```
61
 
62
  ---
63
 
64
+ ## Reward Breakdown
65
 
66
+ | Component | Max Score |
67
+ |---|---|
68
+ | Bug identified | 0.20 |
69
+ | Bug type correct | 0.20 |
70
+ | Bug location correct | 0.10 |
71
+ | Description quality | 0.25 |
72
+ | Fix quality | 0.15 |
73
+ | Severity correct | 0.10 |
74
+ | **Total** | **1.00** |
75
 
76
+ The grader penalizes keyword stuffing — incoherent keyword dumps score ≤ 0.20.
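One way to score "description quality" by keyword coverage while punishing stuffing (a hypothetical sketch only; the real logic lives in `server/grader.py` and may differ):

```python
def description_score(text: str, keywords: list[str], cap: float = 0.25) -> float:
    """Hypothetical sketch: reward keyword coverage, zero out keyword dumps.

    Coverage is the fraction of expected keywords present in the text.
    A crude stuffing check: if most tokens are keyword tokens, the text
    looks like an incoherent dump and earns nothing for this component.
    """
    words = text.lower().split()
    if not words or not keywords:
        return 0.0
    lowered = text.lower()
    coverage = sum(1 for kw in keywords if kw.lower() in lowered) / len(keywords)
    keyword_tokens = sum(1 for w in words if any(kw.lower() in w for kw in keywords))
    if keyword_tokens / len(words) > 0.5:  # mostly keywords -> stuffing
        return 0.0
    return round(cap * coverage, 3)
```

Under this sketch, a coherent sentence covering all keywords earns the full 0.25 cap, while a bare keyword dump earns 0.0 for the description component (leaving at most the 0.20 "bug identified" credit overall).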
 
 
 
 
77
 
78
+ **Example Calculation:**
79
+ If the agent correctly identifies a bug (+0.20), misidentifies the type (+0.0), finds 50% of the location keywords (+0.05), writes a detailed and coherent description matching most keywords (+0.25), suggests a partially correct fix (+0.08), and gets the severity correct (+0.10), the total reward for that step would be `0.20 + 0.0 + 0.05 + 0.25 + 0.08 + 0.10 = 0.68`.
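The arithmetic above can be sketched as a capped weighted sum. This is an illustrative stand-in that mirrors the Reward Breakdown table, not the actual grader code in `server/grader.py`:

```python
# Component caps taken from the Reward Breakdown table above.
COMPONENT_CAPS = {
    "bug_identified": 0.20,
    "bug_type": 0.20,
    "bug_location": 0.10,
    "description": 0.25,
    "fix": 0.15,
    "severity": 0.10,
}

def total_reward(scores: dict) -> float:
    """Clamp each component to [0, cap] and sum; result lies in [0, 1]."""
    total = 0.0
    for component, cap in COMPONENT_CAPS.items():
        total += min(max(scores.get(component, 0.0), 0.0), cap)
    return round(total, 2)

# The worked example from above:
example = {
    "bug_identified": 0.20,  # bug correctly flagged
    "bug_type": 0.0,         # wrong category
    "bug_location": 0.05,    # half the location keywords found
    "description": 0.25,     # coherent, keyword-rich description
    "fix": 0.08,             # partially correct fix
    "severity": 0.10,        # correct severity
}
print(total_reward(example))  # 0.68
```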
 
 
 
80
 
81
  ---
82
 
83
+ ## Edge Cases
84
+
85
+ - **Before `reset()`:** `reset()` initializes the state. If `step()` is called first, the environment resets itself internally and evaluates the action against a randomly selected task.
86
+ - **Max steps:** Episodes are single-step (`max_steps = 1`): calling `step()` evaluates the action and immediately sets `done=True`.
87
+ - **After `done=True`:** A further `step()` call returns `reward=0.0`, `done=True`, and an error message in the `info` dict (e.g. `"Episode already completed. Call /reset..."`), without auto-resetting.
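The lifecycle rules above can be mimicked with a minimal state machine (a standalone sketch of the contract, not the actual `server/environment.py` implementation; the fixed 0.5 reward stands in for the grader):

```python
import random

class EpisodeSketch:
    """Toy model of the reset/step contract described above."""

    TASKS = ["python-off-by-one", "js-auth-privilege", "python-sql-injection"]

    def __init__(self):
        self.task_id = None
        self.done = False

    def reset(self, task_id=None):
        self.task_id = task_id or random.choice(self.TASKS)
        self.done = False
        return {"task_id": self.task_id}

    def step(self, action):
        if self.task_id is None:
            # step() before reset(): auto-reset onto a random task
            self.reset()
        if self.done:
            # Episode over: zero reward, no auto-reset
            return {"reward": 0.0, "done": True,
                    "info": {"error": "Episode already completed. Call /reset..."}}
        self.done = True  # single-step episodes (max_steps = 1)
        return {"reward": 0.5, "done": True, "info": {}}  # 0.5 = grader stand-in
```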
88
+
89
+ ---
90
 
91
+ ## API Endpoints
92
 
93
+ | Method | Path | Description |
94
  |---|---|---|
95
+ | GET | `/` | Health check |
96
+ | POST | `/reset?task_id=<id>` | Reset environment, returns observation |
97
+ | POST | `/step` | Submit action, returns reward |
98
+ | GET | `/state` | Current episode state |
99
+ | GET | `/tasks` | List all tasks |
 
 
100
 
101
  ---
102
 
103
  ## Setup
104
 
105
+ ### Docker
106
 
107
  ```bash
108
+ docker build -t code-security-review .
109
+ docker run -p 8000:8000 code-security-review
110
  ```
111
 
112
+ ### Local
113
 
114
  ```bash
 
 
 
 
 
 
 
115
  pip install -r requirements.txt
116
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
117
  ```
118
 
119
  ---
120
 
121
+ ## Running Inference
122
 
123
+ ```bash
124
+ export API_BASE_URL="https://api.openai.com/v1"
125
+ export MODEL_NAME="gpt-4o-mini"
126
+ export HF_TOKEN="your-api-key"
127
+ export ENV_URL="http://localhost:8000"
128
 
129
+ python inference.py
130
  ```
inference.py CHANGED
@@ -6,28 +6,30 @@ Required environment variables:
6
  API_BASE_URL — LLM API endpoint
7
  MODEL_NAME — Model identifier
8
  HF_TOKEN — Hugging Face / API key
9
- ENV_BASE_URL — Running environment URL (default: http://localhost:7860)
10
  """
11
 
12
  import os
13
  import json
14
  import time
15
  import re
 
16
  from typing import List, Optional
17
  from dotenv import load_dotenv
 
18
 
19
  # Load .env variables
20
  load_dotenv()
21
 
22
- import requests
23
- from openai import OpenAI
24
-
25
  # ── Config ────────────────────────────────────────────────────────────────────
26
- API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
27
- MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
28
- HF_TOKEN = os.environ.get("HF_TOKEN", "")
29
- ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
30
- BENCHMARK = "code-review-env"
 
 
 
31
 
32
  client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
33
 
@@ -41,7 +43,7 @@ Schema:
41
  {
42
  "bug_identified": true or false,
43
  "bug_location": "exact location (function name, line description, variable, expression)",
44
- "bug_type": "off-by-one | logic-error | security-vulnerability | null-dereference | none",
45
  "bug_description": "detailed explanation of why this is a bug and the impact",
46
  "severity": "none | low | medium | high | critical",
47
  "suggested_fix": "the corrected code snippet or a precise description of the fix"
@@ -69,7 +71,7 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
69
  # ── Helpers ───────────────────────────────────────────────────────────────────
70
 
71
  def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = None) -> dict:
72
- url = f"{ENV_BASE_URL}{path}"
73
  resp = requests.post(url, json=data or {}, params=params or {}, timeout=30)
74
  resp.raise_for_status()
75
  return resp.json()
@@ -80,41 +82,49 @@ def parse_json_from_llm(text: str) -> dict:
80
  text = text.strip()
81
  text = re.sub(r"^```(?:json)?\s*", "", text)
82
  text = re.sub(r"\s*```$", "", text)
83
- return json.loads(text)
 
 
 
 
 
 
 
84
 
85
 
86
  def build_prompt(obs: dict) -> str:
87
  lines = [
88
  f"Language: {obs['language']}",
89
- f"Task: {obs['task_description']}",
 
 
90
  "",
91
  f"```{obs['language']}",
92
  obs["code_snippet"],
93
  "```",
94
  ]
95
- if obs.get("previous_feedback"):
96
- lines += ["", f"Previous feedback: {obs['previous_feedback']}",
97
- "Revise your analysis accordingly."]
98
  return "\n".join(lines)
99
 
100
 
101
  # ── Task runner ───────────────────────────────────────────────────────────────
102
 
103
- def run_task(difficulty: str) -> dict:
104
- reset_resp = env_post("/reset", params={"difficulty": difficulty})
105
  obs = reset_resp["observation"]
106
- task_id = obs['task_id']
107
-
108
  log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
109
 
110
- rewards = []
111
- steps_taken = 0
 
112
  done = False
113
- last_error = None
 
114
 
115
- while not done and steps_taken < obs["max_steps"]:
116
- steps_taken += 1
117
  prompt = build_prompt(obs)
 
118
 
119
  # ── LLM call ──────────────────────────────────────────────────────────
120
  try:
@@ -126,20 +136,21 @@ def run_task(difficulty: str) -> dict:
126
  ],
127
  temperature=0.1,
128
  max_tokens=600,
 
129
  )
130
  raw = response.choices[0].message.content
131
  action_dict = parse_json_from_llm(raw)
132
  action_str = json.dumps(action_dict)
133
- last_error = None
134
  except Exception as exc:
135
- last_error = str(exc)
136
  action_dict = {
137
  "bug_identified": False,
138
- "bug_location": "error",
139
  "bug_type": "none",
140
- "bug_description": last_error,
141
  "severity": "none",
142
- "suggested_fix": "",
143
  }
144
  action_str = "{}"
145
 
@@ -147,44 +158,56 @@ def run_task(difficulty: str) -> dict:
147
  step_resp = env_post("/step", data=action_dict)
148
  reward = step_resp["reward"]
149
  done = step_resp["done"]
150
- obs = step_resp["observation"]
151
 
152
- rewards.append(reward)
153
- log_step(step=steps_taken, action=action_str, reward=reward, done=done, error=last_error)
 
 
154
 
155
- # Calculate final score (normalized to [0, 1])
156
- # Total reward is cumulative in this env, but we cap it at 1.0 for the score
157
- total_reward = sum(rewards)
158
- score = min(max(total_reward, 0.0), 1.0)
159
- success = score >= 0.8
160
 
161
- log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
162
-
163
  return {
164
- "task_id": task_id,
165
- "score": score,
166
- "success": success
 
167
  }
168
 
169
 
170
  # ── Main ──────────────────────────────────────────────────────────────────────
171
 
172
  def main():
173
- tasks = ["easy", "medium", "hard"]
174
  results = []
175
 
176
- for difficulty in tasks:
177
  try:
178
- r = run_task(difficulty)
179
- results.append(r)
180
  except Exception as exc:
181
- # print(f"DEBUG: Task failed: {exc}", flush=True)
182
- log_end(success=False, steps=0, score=0.0, rewards=[])
 
183
 
184
  if results:
185
- avg = sum(r["score"] for r in results) / len(results)
186
- # Optional: summary for human review (will not interfere with [END] parsers)
187
- # print(f"\n[SUMMARY] avg_score={avg:.3f}")
188
 
189
  if __name__ == "__main__":
190
  main()
 
6
  API_BASE_URL — LLM API endpoint
7
  MODEL_NAME — Model identifier
8
  HF_TOKEN — Hugging Face / API key
9
+ ENV_URL — Running environment URL (default: http://localhost:7860)
10
  """
11
 
12
  import os
13
  import json
14
  import time
15
  import re
16
+ import requests
17
  from typing import List, Optional
18
  from dotenv import load_dotenv
19
+ from openai import OpenAI
20
 
21
  # Load .env variables
22
  load_dotenv()
23
 
 
 
 
24
  # ── Config ────────────────────────────────────────────────────────────────────
25
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
26
+ MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
27
+ HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
28
+ ENV_URL = os.environ.get("ENV_URL", "http://localhost:7860")
29
+ BENCHMARK = "code-security-review"
30
+
31
+ if not HF_TOKEN:
32
+ raise ValueError("HF_TOKEN or API_KEY must be set.")
33
 
34
  client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
35
 
 
43
  {
44
  "bug_identified": true or false,
45
  "bug_location": "exact location (function name, line description, variable, expression)",
46
+ "bug_type": "off-by-one | logic-error | security-vulnerability | none",
47
  "bug_description": "detailed explanation of why this is a bug and the impact",
48
  "severity": "none | low | medium | high | critical",
49
  "suggested_fix": "the corrected code snippet or a precise description of the fix"
 
71
  # ── Helpers ───────────────────────────────────────────────────────────────────
72
 
73
  def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = None) -> dict:
74
+ url = f"{ENV_URL}{path}"
75
  resp = requests.post(url, json=data or {}, params=params or {}, timeout=30)
76
  resp.raise_for_status()
77
  return resp.json()
 
82
  text = text.strip()
83
  text = re.sub(r"^```(?:json)?\s*", "", text)
84
  text = re.sub(r"\s*```$", "", text)
85
+ # If the LLM still included text around the JSON, try to find the first { and last }
86
+ match = re.search(r"({.*})", text, re.DOTALL)
87
+ if match:
88
+ text = match.group(1)
89
+ try:
90
+ return json.loads(text)
91
+ except Exception:
92
+ return {}
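A standalone sketch of the same tolerant-parsing idea (it mirrors `parse_json_from_llm` above as a self-contained function; it is not the module itself):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Strip markdown fences, then grab the outermost {...} and parse it.

    Returns {} when no valid JSON object can be recovered, mirroring the
    fallback behaviour of parse_json_from_llm.
    """
    text = text.strip()
    text = re.sub(r"^```(?:json)?\s*", "", text)   # leading code fence
    text = re.sub(r"\s*```$", "", text)            # trailing code fence
    match = re.search(r"({.*})", text, re.DOTALL)  # outermost JSON object
    if match:
        text = match.group(1)
    try:
        return json.loads(text)
    except Exception:
        return {}

# Typical LLM outputs this survives:
print(extract_json('```json\n{"bug_identified": true}\n```'))
print(extract_json('Sure! Here is the review: {"severity": "high"} Hope that helps.'))
print(extract_json('no json at all'))  # {}
```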
93
 
94
 
95
  def build_prompt(obs: dict) -> str:
96
  lines = [
97
  f"Language: {obs['language']}",
98
+ f"Context: {obs.get('context', 'No context provided')}",
99
+ f"PR Title: {obs.get('pr_title', 'No PR title')}",
100
+ f"File Path: {obs.get('file_path', 'unknown')}",
101
  "",
102
  f"```{obs['language']}",
103
  obs["code_snippet"],
104
  "```",
105
  ]
 
 
 
106
  return "\n".join(lines)
107
 
108
 
109
  # ── Task runner ───────────────────────────────────────────────────────────────
110
 
111
+ def run_task(task_id: str, task_num: int) -> dict:
112
+ reset_resp = env_post("/reset", params={"task_id": task_id})
113
  obs = reset_resp["observation"]
114
+
 
115
  log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
116
 
117
+ cumulative_reward = 0.0
118
+ step_num = 0
119
+ max_steps = 1
120
  done = False
121
+ all_rewards = []
122
+ error = None
123
 
124
+ while not done and step_num < max_steps:
125
+ step_num += 1
126
  prompt = build_prompt(obs)
127
+ action_dict = {}
128
 
129
  # ── LLM call ──────────────────────────────────────────────────────────
130
  try:
 
136
  ],
137
  temperature=0.1,
138
  max_tokens=600,
139
+ stream=False,
140
  )
141
  raw = response.choices[0].message.content
142
  action_dict = parse_json_from_llm(raw)
143
  action_str = json.dumps(action_dict)
144
+ error = None
145
  except Exception as exc:
146
+ error = str(exc).replace("\n", " ")
147
  action_dict = {
148
  "bug_identified": False,
149
+ "bug_location": "none",
150
  "bug_type": "none",
151
+ "bug_description": f"Error: {error}",
152
  "severity": "none",
153
+ "suggested_fix": "none",
154
  }
155
  action_str = "{}"
156
 
 
158
  step_resp = env_post("/step", data=action_dict)
159
  reward = step_resp["reward"]
160
  done = step_resp["done"]
161
+ obs = step_resp.get("observation")
162
 
163
+ all_rewards.append(reward)
164
+ cumulative_reward += reward
165
+
166
+ log_step(step=step_num, action=action_str, reward=reward, done=done, error=error)
167
 
168
+ success = cumulative_reward >= 0.8
169
+ log_end(success=success, steps=step_num, score=cumulative_reward, rewards=all_rewards)
 
 
 
170
 
 
 
171
  return {
172
+ "task_num": task_num,
173
+ "task_id": task_id,
174
+ "score": cumulative_reward,
175
+ "success": success,
176
  }
177
 
178
 
179
  # ── Main ──────────────────────────────────────────────────────────────────────
180
 
181
  def main():
182
+ print(f"[INFO] Initializing inference on {BENCHMARK} using {MODEL_NAME}", flush=True)
183
+
184
+ TASK_FILTER = os.environ.get("TASK")
185
+
186
+ all_tasks = [
187
+ ("python-off-by-one", 1, "easy"),
188
+ ("js-auth-privilege", 2, "medium"),
189
+ ("python-sql-injection", 3, "hard"),
190
+ ]
191
+
192
+ if TASK_FILTER:
193
+ tasks = [t for t in all_tasks if t[2] == TASK_FILTER]
194
+ else:
195
+ tasks = all_tasks
196
+
197
  results = []
198
 
199
+ for task_id, task_num, _ in tasks:
200
  try:
201
+ r = run_task(task_id, task_num)
 
202
  except Exception as exc:
203
+ print(f"[ERROR] task_id={task_id} error={exc}", flush=True)
204
+ r = {"task_num": task_num, "task_id": task_id, "score": 0.0, "success": False}
205
+ results.append(r)
206
 
207
  if results:
208
+ avg = round(sum(r["score"] for r in results) / len(results), 3)
209
+ successes = sum(1 for r in results if r.get("success"))
210
+ print(f"\n[SUMMARY] avg_reward={avg} tasks_passed={successes}/{len(results)}", flush=True)
211
 
212
  if __name__ == "__main__":
213
  main()
openenv.yaml CHANGED
@@ -1,47 +1,50 @@
1
- name: code-review-env
2
- version: 1.0.0
3
- description: >
4
- RL environment for training AI agents to detect bugs and security
5
- vulnerabilities in real Python code. Covers off-by-one errors,
6
- authentication logic flaws, and SQL injection β€” with deterministic
7
- programmatic graders and partial-progress reward signals.
8
 
 
 
 
 
 
 
 
9
  author: Inmodel Labs
10
- tags:
11
- - code-review
12
- - security
13
- - software-engineering
14
- - real-world
15
- - python
16
 
 
 
17
  tasks:
18
- - id: task_easy_001
 
 
19
  difficulty: easy
20
- description: "Detect off-by-one error in array traversal loop"
21
- reset_params:
22
- difficulty: easy
23
 
24
- - id: task_medium_001
 
 
25
  difficulty: medium
26
- description: "Detect authentication logic flaw enabling privilege escalation"
27
- reset_params:
28
- difficulty: medium
29
 
30
- - id: task_hard_001
 
 
31
  difficulty: hard
32
- description: "Detect SQL injection via unsanitised f-string database query"
33
- reset_params:
34
- difficulty: hard
35
 
 
 
36
  action_space:
37
  type: object
38
  properties:
39
- bug_identified: { type: boolean }
40
- bug_location: { type: string }
41
- bug_type: { type: string }
42
- bug_description: { type: string }
43
- severity: { type: string, enum: [none, low, medium, high, critical] }
44
- suggested_fix: { type: string }
45
  required:
46
  - bug_identified
47
  - bug_location
@@ -50,18 +53,20 @@ action_space:
50
  - severity
51
  - suggested_fix
52
 
 
 
53
  observation_space:
54
  type: object
55
  properties:
56
- code_snippet: { type: string }
57
- language: { type: string }
58
- task_description: { type: string }
59
- task_id: { type: string }
60
- difficulty: { type: string, enum: [easy, medium, hard] }
61
- step_number: { type: integer }
62
- max_steps: { type: integer }
63
- previous_feedback: { type: string, nullable: true }
64
 
 
65
  reward:
66
  min: 0.0
67
  max: 1.0
@@ -69,9 +74,11 @@ reward:
69
  Partial rewards for: bug identification (0.20), correct bug type (0.20),
70
  precise location (0.10), description quality (0.25, keyword density),
71
  fix quality (0.15, keyword density), correct severity (0.10).
 
72
 
73
  endpoints:
74
- health: GET /health
75
- reset: POST /reset
76
- step: POST /step
77
- state: GET /state
 
 
1
+ # OpenEnv Environment Specification
2
+ # This file describes the Code Security Review environment for the Meta PyTorch OpenEnv Hackathon.
 
 
 
 
 
3
 
4
+ # Metadata section details the environment's identity.
5
+ name: code-security-review
6
+ version: "1.0.0"
7
+ description: >
8
+ An RL environment for training AI agents to perform code security review.
9
+ Agents analyze code snippets from production pull requests and identify bugs,
10
+ vulnerabilities, and security issues.
11
  author: Inmodel Labs
 
 
 
 
 
 
12
 
13
+ # Tasks section defines the core challenges in the environment.
14
+ # Each task has a unique ID, name, description, and difficulty level.
15
  tasks:
16
+ - id: python-off-by-one
17
+ name: "Python Off-by-One Error"
18
+ description: "Identify an off-by-one index error in a Python finance batch processor"
19
  difficulty: easy
20
+ max_steps: 1
21
+ reward_range: [0.0, 1.0]
 
22
 
23
+ - id: js-auth-privilege
24
+ name: "JavaScript Auth Logic Flaw"
25
+ description: "Identify a privilege escalation vulnerability in Node.js auth middleware"
26
  difficulty: medium
27
+ max_steps: 1
28
+ reward_range: [0.0, 1.0]
 
29
 
30
+ - id: python-sql-injection
31
+ name: "Python SQL Injection"
32
+ description: "Identify an SQL injection vulnerability via f-string in a REST API"
33
  difficulty: hard
34
+ max_steps: 1
35
+ reward_range: [0.0, 1.0]
 
36
 
37
+ # The Action space defines the format of the agent's response.
38
+ # Each field is scored by the grader to provide partial progress signals.
39
  action_space:
40
  type: object
41
  properties:
42
+ bug_identified: { type: boolean, description: "True if a bug exists" }
43
+ bug_location: { type: string, description: "Pinpoint the bug's location in code" }
44
+ bug_type: { type: string, description: "off-by-one | logic-error | security-vulnerability | none" }
45
+ bug_description: { type: string, description: "Detailed analysis of the vulnerability" }
46
+ severity: { type: string, enum: [none, low, medium, high, critical], description: "Severity level" }
47
+ suggested_fix: { type: string, description: "How to fix the identified bug" }
48
  required:
49
  - bug_identified
50
  - bug_location
 
53
  - severity
54
  - suggested_fix
55
 
56
+ # The Observation space defines what the agent sees at each step.
57
+ # It uses a structured context to help the agent understand the code's purpose.
58
  observation_space:
59
  type: object
60
  properties:
61
+ task_id: { type: string, description: "Unique task identifier" }
62
+ language: { type: string, description: "Source code language" }
63
+ difficulty: { type: string, enum: [easy, medium, hard], description: "Task complexity (easy/medium/hard)" }
64
+ code_snippet: { type: string, description: "The source code to be reviewed" }
65
+ context: { type: string, description: "Real-world context (e.g., API description)" }
66
+ pr_title: { type: string, description: "Pull Request title for additional intent context" }
67
+ file_path: { type: string, description: "Relative path to the file in the repository" }
 
68
 
69
+ # Reward structure for evaluating agent performance.
70
  reward:
71
  min: 0.0
72
  max: 1.0
 
74
  Partial rewards for: bug identification (0.20), correct bug type (0.20),
75
  precise location (0.10), description quality (0.25, keyword density),
76
  fix quality (0.15, keyword density), correct severity (0.10).
77
+ Grader penalizes keyword stuffing.
78
 
79
  endpoints:
80
+ health: GET /
81
+ reset: POST /reset
82
+ step: POST /step
83
+ state: GET /state
84
+ tasks: GET /tasks
qa_test.py ADDED
@@ -0,0 +1,237 @@
1
+ import requests
2
+ import json
3
+
4
+ BASE_URL = "http://localhost:7860"
5
+
6
+ def run_tests():
7
+ checks = []
8
+
9
+ # 1. GET /
10
+ try:
11
+ r = requests.get(f"{BASE_URL}/")
12
+ passed = r.status_code == 200 and r.json().get("status") == "ok"
13
+ checks.append({
14
+ "id": 1, "name": "GET / health check", "passed": passed,
15
+ "expected": 'HTTP 200 and {"status": "ok"}', "got": f"HTTP {r.status_code} {r.text}"
16
+ })
17
+ except Exception as e:
18
+ checks.append({"id": 1, "name": "GET / health check", "passed": False, "expected": "200 OK", "got": str(e)})
19
+
20
+ # 15. GET /state before reset (Edge case)
21
+ try:
22
+ r = requests.get(f"{BASE_URL}/state")
23
+ # Should not crash
24
+ checks.append({
25
+ "id": 15, "name": "GET /state before any reset", "passed": r.status_code == 200,
26
+ "expected": "HTTP 200 (No crash)", "got": f"HTTP {r.status_code} {r.text}"
27
+ })
28
+ except Exception as e:
29
+ checks.append({"id": 15, "name": "GET /state before any reset", "passed": False, "expected": "200 OK", "got": str(e)})
30
+
31
+ # 2. POST /reset
32
+ try:
33
+ r = requests.post(f"{BASE_URL}/reset")
34
+ data = r.json().get("observation", {})
35
+ required = ["task_id", "language", "difficulty", "code_snippet", "context", "pr_title", "file_path"]
36
+ passed = all(k in data for k in required)
37
+ checks.append({
38
+ "id": 2, "name": "POST /reset fields check", "passed": passed,
39
+ "expected": f"JSON with {required}", "got": list(data.keys())
40
+ })
41
+ except Exception as e:
42
+ checks.append({"id": 2, "name": "POST /reset fields check", "passed": False, "expected": "Fields", "got": str(e)})
43
+
44
+ # 16. POST /reset no task_id
45
+ try:
46
+ r = requests.post(f"{BASE_URL}/reset")
47
+ checks.append({
48
+ "id": 16, "name": "POST /reset no task_id (Random)", "passed": r.status_code == 200,
49
+ "expected": "HTTP 200", "got": f"HTTP {r.status_code}"
50
+ })
51
+ except Exception as e:
52
+ checks.append({"id": 16, "name": "POST /reset no task_id (Random)", "passed": False, "expected": "200 OK", "got": str(e)})
53
+
54
+ # 3-5. POST /reset?task_id=...
55
+ for tid in ["python-off-by-one", "js-auth-privilege", "python-sql-injection"]:
56
+ try:
57
+ num = {"python-off-by-one": 3, "js-auth-privilege": 4, "python-sql-injection": 5}[tid]
58
+ r = requests.post(f"{BASE_URL}/reset?task_id={tid}")
59
+ passed = r.status_code == 200 and r.json()["observation"]["task_id"] == tid
60
+ checks.append({
61
+ "id": num, "name": f"POST /reset for {tid}", "passed": passed,
62
+ "expected": f"HTTP 200 with task_id={tid}", "got": f"HTTP {r.status_code} {r.json()['observation']['task_id'] if passed else r.text}"
63
+ })
64
+ except Exception as e:
65
+ checks.append({"id": num, "name": f"POST /reset for {tid}", "passed": False, "expected": "200 OK", "got": str(e)})
66
+
67
+ # 6. GET /state
68
+ try:
69
+ r = requests.get(f"{BASE_URL}/state")
70
+ data = r.json()
71
+ required = ["task_id", "step", "done", "total_reward"]
72
+ passed = all(k in data for k in required)
73
+ checks.append({
74
+ "id": 6, "name": "GET /state fields check", "passed": passed,
75
+ "expected": f"JSON with {required}", "got": list(data.keys())
76
+ })
77
+ except Exception as e:
78
+ checks.append({"id": 6, "name": "GET /state fields check", "passed": False, "expected": "Fields", "got": str(e)})
79
+
80
+ # 7. POST /step with PROVIDED action
81
+ try:
82
+ requests.post(f"{BASE_URL}/reset?task_id=python-sql-injection")
83
+ action = {
84
+ "bug_identified": True,
85
+ "bug_location": "line 2 f-string",
86
+ "bug_type": "security-vulnerability",
87
+ "bug_description": "SQL injection via f-string",
88
+ "severity": "critical",
89
+ "suggested_fix": "use parameterized query"
90
+ }
91
+ r = requests.post(f"{BASE_URL}/step", json=action)
92
+ res = r.json()
93
+ reward = res.get("reward", -1.0)
94
+ done = res.get("done", False)
95
+ passed = 0.0 <= reward <= 1.0 and done is True
96
+ checks.append({
97
+ "id": 7, "name": "POST /step valid action", "passed": passed,
98
+ "expected": "Reward [0,1] and done=true", "got": f"reward={reward}, done={done}"
99
+ })
100
+ except Exception as e:
101
+ checks.append({"id": 7, "name": "POST /step valid action", "passed": False, "expected": "Result", "got": str(e)})
102
+
103
+ # 14. Call POST /step twice (Edge Case)
104
+ try:
105
+ # Step already called in task 7
106
+ action = {"bug_identified": False, "bug_location": "", "bug_type": "none", "bug_description": "", "severity": "none", "suggested_fix": ""}
107
+ r = requests.post(f"{BASE_URL}/step", json=action)
108
+ res = r.json()
109
+ passed = r.status_code == 200 and "error" in res.get("info", {})
110
+ checks.append({
111
+ "id": 14, "name": "POST /step twice in same episode", "passed": passed,
112
+ "expected": "HTTP 200 and error in info", "got": f"HTTP {r.status_code}, info={res.get('info')}"
113
+ })
114
+ except Exception as e:
115
+ checks.append({"id": 14, "name": "POST /step twice in same episode", "passed": False, "expected": "Handled error", "got": str(e)})
116
+
117
+ # 8. Perfect action for SQL
118
+ try:
119
+ requests.post(f"{BASE_URL}/reset?task_id=python-sql-injection")
120
+ perfect_action = {
121
+ "bug_identified": True,
122
+ "bug_location": "line 2 f-string interpolation in SQL query construction",
123
+ "bug_type": "security-vulnerability",
124
+ "bug_description": "SQL injection vulnerability where user-supplied search_term is directly interpolated into the SQL query via f-string. An attacker can inject malicious SQL to bypass authentication, exfiltrate all user data, or drop tables. The fix is to use parameterized queries which sanitize user input automatically.",
125
+ "severity": "critical",
126
+ "suggested_fix": "Use db.execute('SELECT * FROM users WHERE name LIKE %s', ('%'+search_term+'%',)) instead of f-string interpolation"
127
+ }
128
+ r = requests.post(f"{BASE_URL}/step", json=perfect_action)
129
+ reward = r.json().get("reward", 0.0)
130
+ checks.append({
131
+ "id": 8, "name": "PERFECT action SQL", "passed": reward >= 0.85,
132
+ "expected": "Reward >= 0.85", "got": f"reward={reward}"
133
+ })
134
+ except Exception as e:
135
+ checks.append({"id": 8, "name": "PERFECT action SQL", "passed": False, "expected": ">=0.85", "got": str(e)})
136
+
137
+ # 9. Keyword stuffed
138
+ try:
139
+ requests.post(f"{BASE_URL}/reset?task_id=python-sql-injection")
140
+ stuffed_action = {
141
+ "bug_identified": True,
142
+ "bug_location": "sql",
143
+ "bug_type": "security-vulnerability",
144
+ "bug_description": "sql injection sql injection sql injection parameterized f-string sanitize escape malicious attack tautology union drop sql injection sql injection",
145
+ "severity": "critical",
146
+ "suggested_fix": "fix"
147
+ }
148
+ r = requests.post(f"{BASE_URL}/step", json=stuffed_action)
149
+ reward = r.json().get("reward", 1.0)
150
+ checks.append({
151
+ "id": 9, "name": "KEYWORD STUFFED action", "passed": reward <= 0.20,
152
+ "expected": "Reward <= 0.20", "got": f"reward={reward}"
153
+ })
154
+ except Exception as e:
155
+ checks.append({"id": 9, "name": "KEYWORD STUFFED action", "passed": False, "expected": "<=0.20", "got": str(e)})
156
+
157
+ # 10. Bug identified false
158
+ try:
159
+ requests.post(f"{BASE_URL}/reset")
160
+ action = {"bug_identified": False, "bug_location": "", "bug_type": "none", "bug_description": "", "severity": "none", "suggested_fix": ""}
161
+ r = requests.post(f"{BASE_URL}/step", json=action)
162
+ reward = r.json().get("reward", 1.0)
163
+ checks.append({
164
+ "id": 10, "name": "Identify=False empty fields", "passed": reward == 0.0,
165
+ "expected": "Reward exactly 0.0", "got": f"reward={reward}"
166
+ })
167
+ except Exception as e:
168
+ checks.append({"id": 10, "name": "Identify=False empty fields", "passed": False, "expected": "0.0", "got": str(e)})
169
+
170
+ # 11. Partial credit severity
171
+ try:
172
+ # The off-by-one task's ground-truth severity is 'critical'.
173
+ # Submit 'low' to check that the other components still earn partial credit.
174
+ requests.post(f"{BASE_URL}/reset?task_id=python-off-by-one")
175
+ action = {
176
+ "bug_identified": True, "bug_location": "range", "bug_type": "off-by-one",
177
+ "bug_description": "off-by-one error in range function call",
178
+ "severity": "low", # Wrong severity
179
+ "suggested_fix": "range(len(x))"
180
+ }
181
+ r = requests.post(f"{BASE_URL}/step", json=action)
182
+ info = r.json().get("info", {})
183
+ breakdown = info.get("reward_breakdown", {})
184
+ sev_score = breakdown.get("severity", -1.0)
185
+ # It should be 0.0 (wrong) but the total should still have partial credit from other components
186
+ reward = r.json().get("reward", 0.0)
187
+ checks.append({
188
+ "id": 11, "name": "Partial credit (wrong severity)", "passed": 0.0 < reward < 1.0,
189
+ "expected": "Reward between 0 and 1 (partial credit)", "got": f"reward={reward}, severity_component={sev_score}"
190
+ })
191
+ except Exception as e:
192
+ checks.append({"id": 11, "name": "Partial credit (wrong severity)", "passed": False, "expected": "Partial credit", "got": str(e)})
193
+
194
+ # 12-13. Breakdown keys and components
195
+ try:
196
+ requests.post(f"{BASE_URL}/reset")
197
+ action = {"bug_identified": True, "bug_location": "test", "bug_type": "test", "bug_description": "test test test test test test test test test test test test test test test test test test test test", "severity": "none", "suggested_fix": "test test test"}
198
+ r = requests.post(f"{BASE_URL}/step", json=action)
199
+ info = r.json().get("info", {})
200
+ breakdown = info.get("reward_breakdown", {})
201
+ required = ["bug_identified", "bug_type", "bug_location", "description_quality", "fix_quality", "severity"]
202
+ checks.append({
203
+ "id": 12, "name": "Reward breakdown keys", "passed": all(k in breakdown for k in required),
204
+ "expected": f"Breakdown with {required}", "got": list(breakdown.keys())
205
+ })
206
+
207
+ max_vals = {
208
+ "bug_identified": 0.20, "bug_type": 0.20, "bug_location": 0.10,
209
+ "description_quality": 0.25, "fix_quality": 0.15, "severity": 0.10
210
+ }
211
+ passed_range = all(0.0 <= breakdown.get(k, -1) <= max_vals[k] for k in max_vals)
212
+ checks.append({
213
+ "id": 13, "name": "Component score ranges", "passed": passed_range,
214
+ "expected": "All components <= max", "got": breakdown
215
+ })
216
+ except Exception as e:
217
+ checks.append({"id": 12, "name": "Breakdown checks", "passed": False, "expected": "Breakdown", "got": str(e)})
218
+
219
+ # Sort and print
220
+ checks.sort(key=lambda x: x["id"])
221
+ for c in checks:
222
+ status = "PASS" if c["passed"] else "FAIL"
223
+ print(f"[{c['id']}] {c['name']} - {status}")
224
+ print(f" Expected: {c['expected']}")
225
+ print(f" Got: {c['got']}")
226
+ print("")
227
+
228
+ passed_count = sum(1 for c in checks if c["passed"])
229
+ disqual = "YES" if passed_count < 7 else "NO"  # at risk if fewer than 7 of 16 checks pass
230
+ print(f"TOTAL: {passed_count}/16 passed")
231
+ print(f"DISQUALIFICATION RISK: {disqual}")
232
+ # Estimate score based on points
233
+ score = (passed_count / 16) * 100
234
+ print(f"ESTIMATED SCORE: {round(score)}/100")
235
+
236
+ if __name__ == "__main__":
237
+ run_tests()
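For reference, the roll-up math at the end of `run_tests()` can be exercised standalone; the 16-check total and 7-check threshold below mirror the constants hard-coded in the script above:

```python
def summarize(checks, total=16, min_pass=7):
    """Roll up QA check results the way run_tests() prints its summary.

    `total` and `min_pass` mirror the thresholds hard-coded in the script.
    """
    passed = sum(1 for c in checks if c["passed"])
    return {
        "passed": passed,
        "estimated_score": round(passed / total * 100),
        "disqualification_risk": passed < min_pass,
    }

results = [{"passed": True}] * 12 + [{"passed": False}] * 4
print(summarize(results))
# {'passed': 12, 'estimated_score': 75, 'disqualification_risk': False}
```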
server/app.py CHANGED
@@ -1,19 +1,16 @@
1
  import os
2
  import uvicorn
 
3
  from fastapi import FastAPI, HTTPException, Query
4
  from fastapi.middleware.cors import CORSMiddleware
5
- from fastapi.staticfiles import StaticFiles
6
- from fastapi.responses import FileResponse
7
 
8
- from .models import CodeReviewAction, CodeReviewState, StepResponse, ResetResponse
9
- from .environment import CodeReviewEnvironment
 
10
 
11
  app = FastAPI(
12
  title="Code Security Review β€” OpenEnv",
13
- description=(
14
- "RL environment for training AI agents to detect bugs and security "
15
- "vulnerabilities in code. Compatible with the OpenEnv spec."
16
- ),
17
  version="1.0.0",
18
  )
19
 
@@ -24,46 +21,61 @@ app.add_middleware(
24
  allow_headers=["*"],
25
  )
26
 
27
- app.mount("/static", StaticFiles(directory="static"), name="static")
28
 
29
- env = CodeReviewEnvironment()
30
 
31
  @app.get("/")
32
- def read_index():
33
- return FileResponse("static/index.html")
34
-
35
-
36
- @app.get("/health")
37
  def health():
38
- return {"status": "ok", "env": "code-review-env", "version": "1.0.0"}
39
 
40
 
41
  @app.post("/reset", response_model=ResetResponse)
42
- def reset(difficulty: str = Query(default="easy", description="easy | medium | hard")):
 
 
 
43
  """Reset the environment and return the first observation."""
44
- obs = env.reset(difficulty=difficulty)
 
 
45
  return ResetResponse(observation=obs)
46
 
47
 
48
- @app.post("/step", response_model=StepResponse)
49
  def step(action: CodeReviewAction):
50
  """Submit a code review action and receive a reward signal."""
51
- try:
52
- obs, reward, done, info = env.step(action)
53
- return StepResponse(observation=obs, reward=reward, done=done, info=info)
54
- except ValueError as exc:
55
- raise HTTPException(status_code=400, detail=str(exc))
56
 
57
 
58
- @app.get("/state", response_model=CodeReviewState)
59
  def state():
60
  """Return the current environment state."""
61
  return env.state()
62
 
63
 
64
  if __name__ == "__main__":
65
- port = int(os.environ.get("PORT", 7860))
66
- enable_web = os.environ.get("ENABLE_WEB_INTERFACE", "false").lower() == "true"
67
  uvicorn.run(
68
  "server.app:app",
69
  host="0.0.0.0",
 
1
  import os
2
  import uvicorn
3
+ from typing import List, Optional
4
  from fastapi import FastAPI, HTTPException, Query
5
  from fastapi.middleware.cors import CORSMiddleware
 
 
6
 
7
+ from server.models import CodeReviewAction, StepResult, ResetResponse, StateResponse, TaskInfo
8
+ from server.tasks import TASKS
9
+ from server.environment import CodeSecurityEnv
10
 
11
  app = FastAPI(
12
  title="Code Security Review β€” OpenEnv",
13
+ description="An RL environment for training AI agents to perform code security review.",
 
 
 
14
  version="1.0.0",
15
  )
16
 
 
21
  allow_headers=["*"],
22
  )
23
 
24
+ env = CodeSecurityEnv()
25
 
 
26
 
27
  @app.get("/")
 
 
 
 
 
28
  def health():
29
+ """Health check endpoint."""
30
+ return {
31
+ "status": "ok",
32
+ "project": "Code Security Review - OpenEnv",
33
+ "version": "1.0.0",
34
+ "organization": "Inmodel Labs",
35
+ }
36
+
37
+
38
+ @app.get("/tasks", response_model=List[TaskInfo])
39
+ def list_tasks():
40
+ """List all available tasks."""
41
+ return [
42
+ TaskInfo(
43
+ id=t["id"],
44
+ language=t["language"],
45
+ bug_class=t["bug_class"],
46
+ difficulty=t["difficulty"],
47
+ )
48
+ for t in TASKS.values()
49
+ ]
50
 
51
 
52
  @app.post("/reset", response_model=ResetResponse)
53
+ def reset(
54
+ task_id: str = Query(default="python-off-by-one", description="Task ID to reset to"),
55
+ seed: Optional[int] = Query(default=None, description="Optional seed for reproducibility")
56
+ ):
57
  """Reset the environment and return the first observation."""
58
+ if task_id not in TASKS:
59
+ raise HTTPException(status_code=404, detail=f"Task '{task_id}' not found.")
60
+ obs = env.reset(task_id=task_id, seed=seed)
61
  return ResetResponse(observation=obs)
62
 
63
 
64
+ @app.post("/step", response_model=StepResult)
65
  def step(action: CodeReviewAction):
66
  """Submit a code review action and receive a reward signal."""
67
+ result = env.step(action)
68
+ return result
 
 
 
69
 
70
 
71
+ @app.get("/state", response_model=StateResponse)
72
  def state():
73
  """Return the current environment state."""
74
  return env.state()
75
 
76
 
77
  if __name__ == "__main__":
78
+ port = int(os.environ.get("PORT", 8000))
 
79
  uvicorn.run(
80
  "server.app:app",
81
  host="0.0.0.0",
server/environment.py CHANGED
@@ -1,447 +1,84 @@
1
- from typing import Dict, Any, Tuple, Optional
2
- from .models import CodeReviewAction, CodeReviewObservation, CodeReviewState
3
 
4
- MAX_STEPS = 3
 
 
5
 
6
- # TASK DEFINITIONS
7
-
8
-
9
- TASKS: Dict[str, dict] = {
10
-
11
- # EASY
12
- "easy": {
13
- "id": "task_easy_001",
14
- "difficulty": "easy",
15
- "language": "python",
16
- "description": (
17
- "This function is supposed to sum all elements in a list. "
18
- "Find any bugs and suggest a fix."
19
- ),
20
- "code": (
21
- "def sum_elements(arr):\n"
22
- ' """Return the sum of all elements."""\n'
23
- " total = 0\n"
24
- " for i in range(1, len(arr) + 1): # iterates over indices\n"
25
- " total += arr[i]\n"
26
- " return total"
27
- ),
28
- "ground_truth": {
29
- "bug_identified": True,
30
- "bug_type_keywords": [
31
- "off-by-one", "off by one", "index error", "indexerror",
32
- "out of bounds", "out of range", "index out",
33
- ],
34
- "location_keywords": [
35
- "range(1, len(arr) + 1)", "len(arr) + 1", "len(arr)+1",
36
- "range", "loop", "index", "arr[i]",
37
- ],
38
- "description_keywords": [
39
- "index", "range", "len", "off-by-one", "off by one",
40
- "IndexError", "out of bounds", "+1", "exceed", "arr[i]",
41
- "zero", "start",
42
- ],
43
- "fix_keywords": [
44
- "range(len(arr))", "range(0, len(arr))",
45
- "for i in range(len", "for element in arr",
46
- "arr[i]" , "len(arr))",
47
- ],
48
- "severity_valid": ["high", "medium"],
49
- },
50
- },
51
-
52
- #MEDIUM
53
- "medium": {
54
- "id": "task_medium_001",
55
- "difficulty": "medium",
56
- "language": "python",
57
- "description": (
58
- "This authentication function controls admin access. "
59
- "Find the logical security bug."
60
- ),
61
- "code": (
62
- "def authenticate_user(username, password, request_admin=False):\n"
63
- ' """Authenticate user and return access level."""\n'
64
- " user = db.find_user(username)\n"
65
- " if not user or user.password_hash != hash_password(password):\n"
66
- ' return {"authenticated": False, "level": "none"}\n'
67
- "\n"
68
- " # Elevate to admin if caller requests it OR user has admin role\n"
69
- " if request_admin or user.role == 'admin': # <-- review this\n"
70
- ' return {"authenticated": True, "level": "admin"}\n'
71
- "\n"
72
- ' return {"authenticated": True, "level": "user"}'
73
- ),
74
- "ground_truth": {
75
- "bug_identified": True,
76
- "bug_type_keywords": [
77
- "logic", "logic error", "logical", "privilege escalation",
78
- "authorization", "authentication bypass", "access control",
79
- ],
80
- "location_keywords": [
81
- "request_admin or", "or user.role", "or", "condition",
82
- "if request_admin", "or user.role == 'admin'",
83
- ],
84
- "description_keywords": [
85
- "or", "and", "privilege", "escalation", "bypass", "admin",
86
- "role", "caller", "request_admin", "logic", "elevation",
87
- "any caller", "arbitrary",
88
- ],
89
- "fix_keywords": [
90
- "and", "request_admin and user.role", "and user.role == 'admin'",
91
- "and user.role", "both",
92
- ],
93
- "severity_valid": ["critical", "high"],
94
- },
95
- },
96
-
97
- # ── HARD ──────────────────────────────────
98
- "hard": {
99
- "id": "task_hard_001",
100
- "difficulty": "hard",
101
- "language": "python",
102
- "description": (
103
- "This function fetches records from a database using user-supplied input. "
104
- "Identify the security vulnerability."
105
- ),
106
- "code": (
107
- "def fetch_records(user_id: str, sort_column: str):\n"
108
- ' """Fetch user records sorted by a given column."""\n'
109
- " conn = get_db_connection()\n"
110
- " cursor = conn.cursor()\n"
111
- "\n"
112
- " query = (\n"
113
- ' f"SELECT id, name, email FROM users "\n'
114
- ' f"WHERE user_id = {user_id} "\n'
115
- ' f"ORDER BY {sort_column}"\n'
116
- " )\n"
117
- " cursor.execute(query)\n"
118
- " rows = cursor.fetchall()\n"
119
- " conn.close()\n"
120
- " return rows"
121
- ),
122
- "ground_truth": {
123
- "bug_identified": True,
124
- "bug_type_keywords": [
125
- "sql injection", "injection", "sqli", "sql",
126
- "security vulnerability", "security", "second-order",
127
- ],
128
- "location_keywords": [
129
- "f\"", "f-string", "format", "user_id", "sort_column",
130
- "query", "ORDER BY", "WHERE user_id",
131
- ],
132
- "description_keywords": [
133
- "sql injection", "injection", "parameterized", "f-string",
134
- "format string", "user input", "sanitize", "escape",
135
- "malicious", "attack", "tautology", "union", "drop",
136
- "ORDER BY", "sort_column", "arbitrary",
137
- ],
138
- "fix_keywords": [
139
- "parameterized", "?", "%s", "cursor.execute(query, (",
140
- "cursor.execute(query, [", "prepared statement",
141
- "whitelist", "allowlist", "ALLOWED_COLUMNS",
142
- "sanitize", "if sort_column not in",
143
- ],
144
- "severity_valid": ["critical"],
145
- },
146
- },
147
-
148
- # ── EXPERT ────────────────────────────────
149
- "expert": {
150
- "id": "task_expert_001",
151
- "difficulty": "expert",
152
- "language": "java",
153
- "description": (
154
- "This Java class implements a token bucket rate limiter. "
155
- "Identify the logic bug that could allow users to bypass the rate limit."
156
- ),
157
- "code": (
158
- "import java.util.concurrent.atomic.AtomicLong;\n\n"
159
- "public class TokenBucketRateLimiter {\n"
160
- " private final long maxTokens;\n"
161
- " private final long refillRatePerSecond;\n"
162
- " private AtomicLong currentTokens;\n"
163
- " private AtomicLong lastRefillTimestamp;\n\n"
164
- " public TokenBucketRateLimiter(long maxTokens, long refillRatePerSecond) {\n"
165
- " this.maxTokens = maxTokens;\n"
166
- " this.refillRatePerSecond = refillRatePerSecond;\n"
167
- " this.currentTokens = new AtomicLong(maxTokens);\n"
168
- " this.lastRefillTimestamp = new AtomicLong(System.currentTimeMillis());\n"
169
- " }\n\n"
170
- " /**\n"
171
- " * Checks if the requested number of tokens is available.\n"
172
- " * Decrements the bucket if allowed.\n"
173
- " */\n"
174
- " public synchronized boolean allowRequest(int tokensNeeded) {\n"
175
- " refill();\n"
176
- " if (currentTokens.get() >= tokensNeeded) {\n"
177
- " currentTokens.addAndGet(-tokensNeeded);\n"
178
- " return true;\n"
179
- " }\n"
180
- " return false;\n"
181
- " }\n\n"
182
- " private void refill() {\n"
183
- " long now = System.currentTimeMillis();\n"
184
- " long timeElapsedMs = now - lastRefillTimestamp.get();\n"
185
- " \n"
186
- " // Calculate how many tokens to add based on time elapsed\n"
187
- " long tokensToAdd = (timeElapsedMs / 1000) * refillRatePerSecond;\n\n"
188
- " if (tokensToAdd > 0) {\n"
189
- " // Hint: Look closely at how the tokens are updated here.\n"
190
- " // Consider what happens if a user stops making requests for a long time.\n"
191
- " currentTokens.addAndGet(tokensToAdd);\n"
192
- " lastRefillTimestamp.set(now);\n"
193
- " }\n"
194
- " }\n"
195
- "}"
196
- ),
197
- "ground_truth": {
198
- "bug_identified": True,
199
- "bug_type_keywords": [
200
- "logic", "limit", "overflow", "cap", "bound", "maximum", "exceed",
201
- "logic error", "capacity",
202
- ],
203
- "location_keywords": [
204
- "currentTokens.addAndGet", "refill()", "tokensToAdd",
205
- "currentTokens.get()", "addAndGet(tokensToAdd)",
206
- ],
207
- "description_keywords": [
208
- "exceed", "maxTokens", "cap", "limit", "bound",
209
- "overflow", "infinite", "burst", "accumulate",
210
- ],
211
- "fix_keywords": [
212
- "Math.min", "min(", "set(", "if (currentTokens.get() > maxTokens)",
213
- "compareAndSet", "cap",
214
- ],
215
- "severity_valid": ["high", "medium"],
216
- },
217
- },
218
-
219
- # ── EXPERT 2 (C++) ────────────────────────
220
- "expert2": {
221
- "id": "task_expert_002",
222
- "difficulty": "expert2",
223
- "language": "cpp",
224
- "description": (
225
- "This C++ class implements an event dispatcher. "
226
- "Identify the concurrency bug that can occur when an event is dispatched."
227
- ),
228
- "code": (
229
- "#include <iostream>\n"
230
- "#include <vector>\n"
231
- "#include <functional>\n"
232
- "#include <mutex>\n"
233
- "#include <algorithm>\n"
234
- "#include <string>\n\n"
235
- "class EventDispatcher {\n"
236
- "public:\n"
237
- " using Callback = std::function<void(const std::string&)>;\n\n"
238
- " void subscribe(int listener_id, Callback cb) {\n"
239
- " std::lock_guard<std::mutex> lock(mut_);\n"
240
- " listeners_.push_back({listener_id, cb});\n"
241
- " }\n\n"
242
- " void unsubscribe(int listener_id) {\n"
243
- " std::lock_guard<std::mutex> lock(mut_);\n"
244
- " listeners_.erase(\n"
245
- " std::remove_if(listeners_.begin(), listeners_.end(),\n"
246
- " [listener_id](const Listener& l) { return l.id == listener_id; }),\n"
247
- " listeners_.end()\n"
248
- " );\n"
249
- " }\n\n"
250
- " void dispatch(const std::string& event_data) {\n"
251
- " std::lock_guard<std::mutex> lock(mut_);\n"
252
- " for (const auto& listener : listeners_) {\n"
253
- " // Hint: What happens if a listener decides to call unsubscribe() \n"
254
- " // from inside their own callback function when an event fires?\n"
255
- " listener.cb(event_data);\n"
256
- " }\n"
257
- " }\n\n"
258
- "private:\n"
259
- " struct Listener {\n"
260
- " int id;\n"
261
- " Callback cb;\n"
262
- " };\n \n"
263
- " std::vector<Listener> listeners_;\n"
264
- " std::mutex mut_;\n"
265
- "};"
266
- ),
267
- "ground_truth": {
268
- "bug_identified": True,
269
- "bug_type_keywords": [
270
- "deadlock", "concurrency", "lock", "recursive", "reentrant", "hang",
271
- "iterator validation", "undefined behavior"
272
- ],
273
- "location_keywords": [
274
- "listener.cb", "unsubscribe", "dispatch", "mut_", "std::lock_guard",
275
- "lock(mut_)"
276
- ],
277
- "description_keywords": [
278
- "deadlock", "already locked", "same thread", "recursive_mutex",
279
- "reentrant", "hangs", "blocks", "invalidate", "iterator"
280
- ],
281
- "fix_keywords": [
282
- "std::recursive_mutex", "copy", "local copy", "copy the vector",
283
- "unlock before", "queue", "deferred"
284
- ],
285
- "severity_valid": ["high", "critical"],
286
- },
287
- },
288
- }
289
-
290
-
291
-
292
- # GRADER
293
-
294
-
295
- def grade_action(action: CodeReviewAction, task: dict) -> Tuple[float, Dict]:
296
- """
297
- Score the agent's review on a 0.0–1.0 scale.
298
-
299
- Breakdown:
300
- bug_identified 0.20
301
- bug_type 0.20
302
- bug_location 0.10
303
- bug_description 0.25 (keyword density, capped)
304
- suggested_fix 0.15 (keyword density, capped)
305
- severity 0.10
306
- ─────────────────────
307
- Total 1.00
308
- """
309
- gt = task["ground_truth"]
310
- score = 0.0
311
- breakdown: Dict[str, float] = {}
312
-
313
- # 1. Bug identification
314
- if action.bug_identified == gt["bug_identified"]:
315
- score += 0.20
316
- breakdown["bug_identified"] = 0.20
317
- else:
318
- breakdown["bug_identified"] = 0.00
319
- if not action.bug_identified:
320
- return 0.0, {
321
- "breakdown": breakdown,
322
- "total_score": 0.0,
323
- "feedback": "No bug identified β€” one definitely exists. Look more carefully.",
324
- }
325
-
326
- # 2. Bug type
327
- bug_type_lower = action.bug_type.lower()
328
- type_match = any(kw in bug_type_lower for kw in gt["bug_type_keywords"])
329
- if type_match:
330
- score += 0.20
331
- breakdown["bug_type"] = 0.20
332
- else:
333
- breakdown["bug_type"] = 0.00
334
-
335
- # 3. Bug location
336
- loc_lower = action.bug_location.lower()
337
- loc_match = any(kw.lower() in loc_lower for kw in gt["location_keywords"])
338
- if loc_match:
339
- score += 0.10
340
- breakdown["bug_location"] = 0.10
341
- else:
342
- breakdown["bug_location"] = 0.00
343
-
344
- # 4. Description quality (keyword density, capped at 0.25)
345
- desc_lower = action.bug_description.lower()
346
- desc_hits = sum(1 for kw in gt["description_keywords"] if kw.lower() in desc_lower)
347
- desc_score = round(min(0.25, desc_hits * 0.07), 3)
348
- score += desc_score
349
- breakdown["bug_description"] = desc_score
350
-
351
- # 5. Fix quality (keyword density, capped at 0.15)
352
- fix_lower = action.suggested_fix.lower()
353
- fix_hits = sum(1 for kw in gt["fix_keywords"] if kw.lower() in fix_lower)
354
- fix_score = round(min(0.15, fix_hits * 0.08), 3)
355
- score += fix_score
356
- breakdown["suggested_fix"] = fix_score
357
-
358
- # 6. Severity
359
- if action.severity.lower() in gt["severity_valid"]:
360
- score += 0.10
361
- breakdown["severity"] = 0.10
362
- else:
363
- breakdown["severity"] = 0.00
364
-
365
- total = round(min(1.0, score), 3)
366
-
367
- # Build human-readable feedback
368
- hints = []
369
- if breakdown["bug_type"] == 0:
370
- hints.append("Reconsider the bug category β€” be more specific.")
371
- if breakdown["bug_location"] == 0:
372
- hints.append("Pinpoint the exact line or expression that contains the bug.")
373
- if breakdown["suggested_fix"] < 0.08:
374
- hints.append("Your fix does not address the root cause β€” revise it.")
375
- if breakdown["severity"] == 0:
376
- hints.append("Re-evaluate the severity level.")
377
-
378
- feedback = " ".join(hints) if hints else "Strong analysis β€” refine the fix if needed."
379
-
380
- return total, {"breakdown": breakdown, "total_score": total, "feedback": feedback}
381
-
382
- # ENVIRONMENT
383
-
384
-
385
- class CodeReviewEnvironment:
386
  def __init__(self):
387
- self._state: Optional[CodeReviewState] = None
388
- self._current_task: Optional[dict] = None
389
 
390
- def reset(self, difficulty: str = "easy") -> CodeReviewObservation:
391
- if difficulty not in TASKS:
392
- difficulty = "easy"
393
- task = TASKS[difficulty]
394
- self._current_task = task
395
- self._state = CodeReviewState(
396
- task_id=task["id"],
397
- difficulty=difficulty,
398
- step_count=0,
399
- done=False,
400
- total_reward=0.0,
401
- task_complete=False,
 
 
 
 
 
402
  )
403
- return self._build_obs(step_number=0, previous_feedback=None)
404
 
405
- def step(self, action: CodeReviewAction) -> Tuple[CodeReviewObservation, float, bool, Dict]:
406
- if self._state is None or self._state.done:
407
- raise ValueError("Call reset() before step().")
408
-
409
- self._state.step_count += 1
410
- reward, info = grade_action(action, self._current_task)
411
- self._state.total_reward = round(self._state.total_reward + reward, 3)
412
-
413
- # Done if agent nailed it or max steps reached
414
- done = reward >= 0.80 or self._state.step_count >= MAX_STEPS
415
- self._state.done = done
416
- self._state.task_complete = reward >= 0.80
417
-
418
- feedback = info.get("feedback") if not done else None
419
- obs = self._build_obs(
420
- step_number=self._state.step_count,
421
- previous_feedback=feedback,
422
  )
423
- return obs, reward, done, info
424
-
425
- def state(self) -> CodeReviewState:
426
- if self._state is None:
427
- return CodeReviewState(
428
- task_id="", difficulty="easy",
429
- step_count=0, done=False,
430
- total_reward=0.0, task_complete=False,
431
- )
432
- return self._state
433
-
434
- # helpers
435
 
436
- def _build_obs(self, step_number: int, previous_feedback: Optional[str]) -> CodeReviewObservation:
437
- t = self._current_task
438
- return CodeReviewObservation(
439
- code_snippet=t["code"],
440
- language=t["language"],
441
- task_description=t["description"],
442
  task_id=t["id"],
 
443
  difficulty=t["difficulty"],
444
- step_number=step_number,
445
- max_steps=MAX_STEPS,
446
- previous_feedback=previous_feedback,
 
447
  )
 
1
+ import random
2
+ from typing import Optional, Dict, Tuple
3
 
4
+ from server.tasks import TASKS
5
+ from server.grader import grade_action
6
+ from server.models import CodeObservation, StepResult, StateResponse, Action, Observation
7
 
8
+ class CodeSecurityEnv:
9
  def __init__(self):
10
+ self.current_task: Optional[dict] = None
11
+ self.step_count: int = 0
12
+ self.done: bool = False
13
+ self.total_reward: float = 0.0
14
+ self._task_ids = list(TASKS.keys())
15
+
16
+ def reset(self, task_id: Optional[str] = None, seed: Optional[int] = None) -> Observation:
17
+ if seed is not None:
18
+ random.seed(seed)
19
+
20
+ if task_id and task_id in TASKS:
21
+ self.current_task = TASKS[task_id]
22
+ else:
23
+ # No valid task_id given: pick a random task
24
+ chosen_id = random.choice(self._task_ids)
25
+ self.current_task = TASKS[chosen_id]
26
+
27
+ self.step_count = 0
28
+ self.done = False
29
+ self.total_reward = 0.0
30
+
31
+ return self._make_observation()
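The seeding contract in `reset()` above (re-seed first, then choose) can be illustrated in isolation; the task IDs below are only examples:

```python
import random

# Illustrative task IDs; the real environment draws from TASKS.keys().
TASK_IDS = ["python-off-by-one", "js-auth-privilege", "python-sql-injection"]

def pick_task(seed=None):
    # Mirrors reset(): re-seed the module RNG, then sample a task.
    if seed is not None:
        random.seed(seed)
    return random.choice(TASK_IDS)

assert pick_task(seed=42) == pick_task(seed=42)  # same seed, same task
```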
32
+
33
+ def step(self, action: Action) -> StepResult:
34
+ if self.current_task is None:
35
+ # Auto-reset if called before reset()
36
+ self.reset()
37
+
38
+ if self.done:
39
+ return StepResult(
40
+ observation=self._make_observation(),
41
+ reward=0.0,
42
+ done=True,
43
+ info={"error": "Episode already completed. Call /reset to start a new episode."},
44
+ )
45
 
46
+ # The action arrives as a Pydantic model, but the grader reads fields
47
+ # via dict .get(), so convert it first (model_dump() on Pydantic v2).
48
+ reward, breakdown = grade_action(action.model_dump(), self.current_task)
49
+
50
+ self.step_count += 1
51
+ self.total_reward += reward
52
+ self.done = True # single-step environment β€” one action per episode
53
+
54
+ return StepResult(
55
+ observation=self._make_observation(),
56
+ reward=reward,
57
+ done=self.done,
58
+ info={
59
+ "reward_breakdown": breakdown,
60
+ "task_name": self.current_task.get("name", "Unknown Task"),
61
+ "step_count": self.step_count
62
+ },
63
  )
 
64
 
65
+ def state(self) -> StateResponse:
66
+ current_id = self.current_task["id"] if self.current_task else ""
67
+ return StateResponse(
68
+ task_id=current_id,
69
+ step=self.step_count,
70
+ done=self.done,
71
+ total_reward=self.total_reward,
72
  )
73
 
74
+ def _make_observation(self) -> Observation:
75
+ t = self.current_task
76
+ return Observation(
 
 
 
77
  task_id=t["id"],
78
+ language=t["language"],
79
  difficulty=t["difficulty"],
80
+ code_snippet=t["code_snippet"],
81
+ context=t["context"],
82
+ pr_title=t["pr_title"],
83
+ file_path=t["file_path"],
84
  )
server/grader.py ADDED
@@ -0,0 +1,80 @@
1
+ from typing import Tuple, Dict
2
+
3
+
4
+ def grade_action(action: dict, task: dict) -> Tuple[float, Dict[str, float]]:
5
+ reward = 0.0
6
+ breakdown: Dict[str, float] = {}
7
+
8
+ # ── Component 1: Bug identified (0.20) ──────────────────────────────────
9
+ if action.get("bug_identified"):
10
+ reward += 0.20
11
+ breakdown["bug_identified"] = 0.20
12
+ else:
13
+ breakdown["bug_identified"] = 0.00
14
+ # No bug found β†’ no partial credit for anything else
15
+ return max(0.0, min(1.0, reward)), breakdown
16
+
17
+ # ── Component 2: Bug type match (0.20) ──────────────────────────────────
18
+ action_type = action.get("bug_type", "").lower().replace("-", " ").replace("_", " ")
19
+ task_type = task["bug_type"].lower().replace("-", " ").replace("_", " ")
20
+ if task_type in action_type or action_type in task_type:
21
+ reward += 0.20
22
+ breakdown["bug_type"] = 0.20
23
+ else:
24
+ breakdown["bug_type"] = 0.00
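The separator-insensitive matching in Component 2 can be sketched on its own:

```python
def types_match(action_type: str, task_type: str) -> bool:
    """Match bug types with '-', '_' and ' ' treated as equivalent separators.

    As in Component 2 above, a substring match in either direction counts.
    """
    normalize = lambda s: s.lower().replace("-", " ").replace("_", " ")
    a, t = normalize(action_type), normalize(task_type)
    return t in a or a in t

assert types_match("security-vulnerability", "security_vulnerability")
assert not types_match("off-by-one", "sql-injection")
```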
25
+
26
+ # ── Component 3: Bug location (0.10) ────────────────────────────────────
27
+ action_location = action.get("bug_location", "").lower()
28
+ location_keywords = [w for w in task["bug_location"].lower().split() if len(w) > 3]
29
+ if location_keywords:
30
+ matched = sum(1 for kw in location_keywords if kw in action_location)
31
+ loc_score = round(0.10 * (matched / len(location_keywords)), 4)
32
+ else:
33
+ loc_score = 0.0
34
+ reward += loc_score
35
+ breakdown["bug_location"] = loc_score
36
+
37
+ # ── Component 4: Description quality (0.25) ──────────────────────────────
38
+ description = action.get("bug_description", "").lower()
39
+ desc_score = 0.0
40
+ if len(description) >= 20:
41
+ task_keywords = task["keywords"]
42
+ matched_kw = [kw for kw in task_keywords if kw in description]
43
+ desc_score = round(min(0.25, 0.25 * (len(matched_kw) / max(len(task_keywords), 1))), 4)
44
+ breakdown["description_quality"] = desc_score
45
+ reward += desc_score
46
+
47
+ # ── Component 5: Fix quality (0.15) ──────────────────────────────────────
48
+ fix = action.get("suggested_fix", "").lower()
49
+ fix_score = 0.0
50
+ if len(fix) >= 10:
51
+ fix_patterns = task["fix_patterns"]
52
+ matched_fix = [p for p in fix_patterns if p.lower() in fix]
53
+ fix_score = round(min(0.15, 0.15 * (len(matched_fix) / max(len(fix_patterns), 1)) * 2), 4)
54
+ breakdown["fix_quality"] = fix_score
55
+ reward += fix_score
56
+
57
+ # ── Component 6: Severity (0.10) ─────────────────────────────────────────
58
+ action_sev = action.get("severity", "").lower()
59
+ task_sev = task["severity"].lower()
60
+ if action_sev == task_sev:
61
+ sev_score = 0.10
62
+ elif action_sev in ("high", "critical") and task_sev in ("high", "critical"):
63
+ sev_score = 0.05
64
+ else:
65
+ sev_score = 0.00
66
+ breakdown["severity"] = sev_score
67
+ reward += sev_score
68
+
69
+ # ── Global Penalty: Keyword Stuffing ────────────────────────────────────
70
+ description = action.get("bug_description", "").lower()
71
+ words = description.split()
72
+ unique_ratio = len(set(words)) / len(words) if words else 1.0
73
+ if unique_ratio < 0.7:
74
+ reward *= 0.2 # Heavy global penalty
75
+ breakdown["stuffing_penalty_multiplier"] = 0.2
76
+ for k in list(breakdown.keys()):
77
+ if k != "stuffing_penalty_multiplier":
78
+ breakdown[k] = round(breakdown[k] * 0.2, 4)
79
+
80
+ return max(0.0, min(1.0, round(reward, 4))), breakdown
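
The keyword-stuffing penalty above can be sketched in isolation (a standalone snippet, not importing `server.grader`): if fewer than 70% of the description's words are unique, the final reward is multiplied by 0.2.

```python
# Standalone sketch of grade_action's keyword-stuffing check:
# low word-uniqueness in the description crushes the reward.
def stuffing_multiplier(description: str) -> float:
    words = description.lower().split()
    unique_ratio = len(set(words)) / len(words) if words else 1.0
    return 0.2 if unique_ratio < 0.7 else 1.0

# A varied description passes untouched; a repetitive one is penalized.
print(stuffing_multiplier("the range iterates one element past the end of the list"))  # → 1.0
print(stuffing_multiplier("bug bug bug bug fix fix fix fix fix fix"))  # → 0.2
```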
server/models.py CHANGED
@@ -2,44 +2,63 @@ from pydantic import BaseModel, Field
 from typing import Optional, Any, Dict
 
 
+# ── Agent Action ──────────────────────────────────────────────────────────────
+
 class CodeReviewAction(BaseModel):
     """Action taken by the agent: a structured code review."""
     bug_identified: bool = Field(..., description="Whether a bug was found")
     bug_location: str = Field(..., description="Location of the bug (function, line, variable)")
-    bug_type: str = Field(..., description="Type: off-by-one | logic-error | security-vulnerability | null-dereference | none")
+    bug_type: str = Field(..., description="Type: off-by-one | logic-error | security-vulnerability | none")
     bug_description: str = Field(..., description="Detailed explanation of why this is a bug")
     severity: str = Field(..., description="Severity: none | low | medium | high | critical")
     suggested_fix: str = Field(..., description="The corrected code or a description of how to fix it")
 
 
-class CodeReviewObservation(BaseModel):
+# ── Observation ───────────────────────────────────────────────────────────────
+
+class CodeObservation(BaseModel):
     """What the agent sees at each step."""
-    code_snippet: str = Field(..., description="The code to review")
-    language: str = Field(..., description="Programming language")
-    task_description: str = Field(..., description="What the code is supposed to do")
     task_id: str = Field(..., description="Unique task identifier")
+    language: str = Field(..., description="Programming language")
     difficulty: str = Field(..., description="Level: easy | medium | hard")
-    step_number: int = Field(..., description="Current step number within this episode")
-    max_steps: int = Field(..., description="Maximum steps allowed per episode")
-    previous_feedback: Optional[str] = Field(None, description="Feedback from previous step if any")
-
-
-class CodeReviewState(BaseModel):
-    """Internal environment state."""
-    task_id: str
-    difficulty: str
-    step_count: int
-    done: bool
-    total_reward: float
-    task_complete: bool
+    code_snippet: str = Field(..., description="The code to review")
+    context: str = Field(..., description="Production context describing what the code does")
+    pr_title: str = Field(..., description="Pull request title submitted by developer")
+    file_path: str = Field(..., description="File path of the code in the repository")
 
 
-class StepResponse(BaseModel):
-    observation: CodeReviewObservation
+# ── Step Result ───────────────────────────────────────────────────────────────
+
+class StepResult(BaseModel):
+    """Result returned from env.step()."""
+    observation: Optional[CodeObservation] = None
     reward: float
     done: bool
     info: Dict[str, Any]
 
 
+# ── State ─────────────────────────────────────────────────────────────────────
+
+class StateResponse(BaseModel):
+    """Internal environment state exposed via /state."""
+    task_id: str
+    step: int
+    done: bool
+    total_reward: float
+
+
+# ── API Helpers ───────────────────────────────────────────────────────────────
+
 class ResetResponse(BaseModel):
-    observation: CodeReviewObservation
+    observation: CodeObservation
+
+
+class TaskInfo(BaseModel):
+    id: str
+    language: str
+    bug_class: str
+    difficulty: str
+
+
+Action = CodeReviewAction
+Observation = CodeObservation
+Reward = float
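
For reference, this is the plain-dict payload shape a `CodeReviewAction` serializes to (field names taken from the models.py diff above; pydantic itself is not needed for this sketch):

```python
# Plain-dict sketch of a serialized CodeReviewAction; the example values
# are illustrative, matching the SQL-injection task's expected answer.
action_payload = {
    "bug_identified": True,
    "bug_location": "line 2: f-string interpolation in the SQL query",
    "bug_type": "security-vulnerability",
    "bug_description": "search_term is interpolated directly into the query string",
    "severity": "critical",
    "suggested_fix": "Use a parameterized query with %s placeholders",
}

expected_fields = {"bug_identified", "bug_location", "bug_type",
                   "bug_description", "severity", "suggested_fix"}
assert set(action_payload) == expected_fields
```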
server/tasks.py ADDED
@@ -0,0 +1,110 @@
+TASKS = {
+    "python-off-by-one": {
+        "id": "python-off-by-one",
+        "name": "Python Off-by-One Error",
+        "language": "Python",
+        "difficulty": "easy",
+        "bug_class": "Off-by-one index error",
+        "pr_title": "Add batch processor for financial transactions",
+        "file_path": "finance/batch_processor.py",
+        "context": "Finance batch processor that sums transaction amounts for end-of-day reconciliation",
+        "code_snippet": (
+            "def process_transactions(transactions):\n"
+            "    total = 0\n"
+            "    for i in range(len(transactions) + 1):  # iterates one past end\n"
+            "        total += transactions[i][\"amount\"]\n"
+            "    return total"
+        ),
+        "bug_type": "off-by-one",
+        "bug_location": "line 3 — range(len(transactions) + 1)",
+        "severity": "critical",
+        "keywords": [
+            "off-by-one", "index", "range", "indexerror", "out of bounds",
+            "boundary", "overflow", "iteration", "list length", "plus one",
+            "extra step", "fencepost error", "array access", "iterator",
+            "fix", "bug", "identify", "code", "crash", "out-of-range",
+            "python", "finance", "batch", "amount", "total", "transactions",
+            "iterate", "sum", "loop", "account", "process"
+        ],
+        "fix_patterns": [
+            "range(len(transactions))",
+            "len(transactions))",
+            "for transaction in transactions",
+            "in transactions:",
+            "pop()",
+            "enumerate(transactions)",
+            "transactions[:len(transactions)]",
+            "total += transactions[i]"
+        ],
+    },
+
+    "js-auth-privilege": {
+        "id": "js-auth-privilege",
+        "name": "JavaScript Auth Logic Flaw",
+        "language": "JavaScript",
+        "difficulty": "medium",
+        "bug_class": "Logic flaw — privilege escalation",
+        "pr_title": "Refactor auth middleware for API routes",
+        "file_path": "middleware/auth.js",
+        "context": "Node.js authentication middleware that restricts admin-only API routes",
+        "code_snippet": (
+            "function checkAdmin(req, res, next) {\n"
+            "  const user = req.user;\n"
+            "  if (user.role !== \"admin\" || user.isActive) {\n"
+            "    return next();\n"
+            "  }\n"
+            "  return res.status(403).json({ error: \"Forbidden\" });\n"
+            "}"
+        ),
+        "bug_type": "logic-error",
+        "bug_location": "line 3 — incorrect boolean operator || instead of &&",
+        "severity": "critical",
+        "keywords": [
+            "short-circuit disjunction hazard", "logical disjunction vulnerability",
+            "excessive authorization scope", "privilege escalation vector",
+            "boolean logic flaw pattern", "operator precedence violation",
+            "authorization bypass disjunction logic", "improper validation layer check",
+            "role check disjunction pattern match", "permission leak evaluation flow",
+            "evaluation shortcut logic flaw", "middleware logic hazard state",
+            "security constraint bypass", "access control logic inversion"
+        ],
+        "fix_patterns": [
+            "user.role === \"admin\" && user.isActive",
+            "&& user.isActive",
+            "throw new Error(\"Unauthorized\")",
+            "user.role === 'admin' && user.isActive",
+            "middleware logic fix"
+        ],
+    },
+
+    "python-sql-injection": {
+        "id": "python-sql-injection",
+        "name": "Python SQL Injection",
+        "language": "Python",
+        "difficulty": "hard",
+        "bug_class": "SQL injection via f-string",
+        "pr_title": "Add user search endpoint to REST API",
+        "file_path": "api/users.py",
+        "context": "REST API endpoint that searches users by name in a PostgreSQL database",
+        "code_snippet": (
+            "def search_users(db, search_term):\n"
+            "    query = f\"SELECT * FROM users WHERE name LIKE '%{search_term}%'\"\n"
+            "    results = db.execute(query)\n"
+            "    return results.fetchall()"
+        ),
+        "bug_type": "security-vulnerability",
+        "bug_location": "line 2 — f-string interpolation directly in SQL query",
+        "severity": "critical",
+        "keywords": [
+            "sql injection", "user-supplied", "search_term", "interpolated", "f-string",
+            "attacker", "bypass", "authentication", "exfiltrate", "user data",
+            "drop tables", "parameterized", "queries", "sanitize", "input", "automatically"
+        ],
+        "fix_patterns": [
+            "db.execute('SELECT * FROM users WHERE name LIKE %s', ('%'+search_term+'%',))",
+            "%s",
+            "parameterized",
+            "prepared statement"
+        ],
+    },
+}
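
Each task entry mixes agent-visible fields with a grader-only answer key. As a sketch (the field split is inferred from this diff, not imported from `server.tasks`), the observation projection looks like:

```python
# Which task fields reach the agent (observation) versus stay grader-only.
# OBS_FIELDS is an assumption inferred from the tasks.py schema above.
OBS_FIELDS = {"id", "language", "difficulty", "code_snippet",
              "context", "pr_title", "file_path"}

task = {
    "id": "python-sql-injection", "language": "Python", "difficulty": "hard",
    "code_snippet": "...", "context": "...", "pr_title": "...", "file_path": "...",
    # answer key: must never leak into the observation
    "bug_type": "security-vulnerability", "bug_location": "line 2",
    "severity": "critical", "keywords": [], "fix_patterns": [],
}

observation = {k: v for k, v in task.items() if k in OBS_FIELDS}
hidden = set(task) - OBS_FIELDS
```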
validate.sh ADDED
@@ -0,0 +1,103 @@
+#!/bin/bash
+
+# OpenEnv Submission Validation Script
+
+set -e
+echo "═══════════════════════════════════════"
+echo " OpenEnv Pre-Submission Validation"
+echo "═══════════════════════════════════════"
+echo ""
+
+# 1. Check for required root files
+echo "── 1. Required Files ──"
+FILES=("openenv.yaml" "inference.py" "README.md" "Dockerfile" "requirements.txt")
+for file in "${FILES[@]}"; do
+    if [ -f "$file" ]; then
+        echo "  ✅ $file"
+    else
+        echo "  ❌ Missing $file"
+        exit 1
+    fi
+done
+echo ""
+
+# 2. Check server/ module structure
+echo "── 2. Server Module Structure ──"
+SERVER_FILES=("server/__init__.py" "server/app.py" "server/models.py" "server/environment.py" "server/tasks.py" "server/grader.py")
+for file in "${SERVER_FILES[@]}"; do
+    if [ -f "$file" ]; then
+        echo "  ✅ $file"
+    else
+        echo "  ❌ Missing $file"
+        exit 1
+    fi
+done
+echo ""
+
+# 3. Activate venv (if present) & validate Python imports
+echo "── 3. Python Import Validation ──"
+if [ -d venv ]; then
+    source venv/bin/activate
+fi
+python3 -c "
+from server.tasks import TASKS
+from server.grader import grade_action
+from server.environment import CodeSecurityEnv
+from server.models import CodeReviewAction, CodeObservation, StepResult, StateResponse, ResetResponse, TaskInfo
+
+assert len(TASKS) >= 3, f'Expected 3+ tasks, got {len(TASKS)}'
+print('  ✅ All imports resolve correctly')
+print(f'  Tasks: {list(TASKS.keys())}')
+" || { echo "  ❌ Python import validation failed"; exit 1; }
+echo ""
+
+# 4. Quick grader smoke test
+echo "── 4. Grader Smoke Test ──"
+python3 -c "
+from server.environment import CodeSecurityEnv
+from server.models import Action
+
+env = CodeSecurityEnv()
+obs = env.reset('python-off-by-one')
+result = env.step(Action(**{
+    'bug_identified': True,
+    'bug_location': 'range(len(transactions) + 1)',
+    'bug_type': 'logic-error',
+    'bug_description': 'Off-by-one index error — the range goes one past the end causing an out of bounds IndexError',
+    'severity': 'medium',
+    'suggested_fix': 'Use range(len(transactions)) to fix the boundary',
+}))
+assert 0.0 <= result.reward <= 1.0, f'Reward out of range: {result.reward}'
+assert result.done is True
+print(f'  ✅ Grader returned reward={result.reward:.4f}, done={result.done}')
+
+# Verify zero-reward path
+env2 = CodeSecurityEnv()
+env2.reset('python-off-by-one')
+r2 = env2.step(Action(**{
+    'bug_identified': False,
+    'bug_location': '',
+    'bug_type': 'none',
+    'bug_description': 'No bug found',
+    'severity': 'none',
+    'suggested_fix': '',
+}))
+assert r2.reward == 0.0, f'Expected 0.0 for no-bug, got {r2.reward}'
+print('  ✅ No-bug path returns reward=0.0')
+" || { echo "  ❌ Grader smoke test failed"; exit 1; }
+echo ""
+
+# 5. Validate openenv.yaml
+echo "── 5. openenv.yaml Validation ──"
+python3 -c "
+import yaml
+with open('openenv.yaml', 'r') as f:
+    data = yaml.safe_load(f)
+assert 'name' in data, 'Missing name field'
+assert 'tasks' in data, 'Missing tasks field'
+assert len(data['tasks']) >= 3, f'Need 3+ tasks, got {len(data[\"tasks\"])}'
+print(f'  ✅ Valid YAML with {len(data[\"tasks\"])} tasks')
+" || { echo "  ❌ openenv.yaml validation failed"; exit 1; }
+echo ""
+
+echo "═══════════════════════════════════════"
+echo " ✅ All checks passed!"
+echo "═══════════════════════════════════════"