Spaces: mahithakur/PRobe

Thakur, Mahipal committed on Apr 23

Commit ab287c4 · 1 parent: 62f5d41

feat: add dynamic world modeling — mutation engine, GET_CONTEXT action, causal chain task

server/mutator.py: variable rename + line shift + constant variance per episode

server/tasks.py: Task 6 causal chain with progressive context unlock

server/CodeReviewAgent_environment.py: wire mutation, GET_CONTEXT, unlock logic

models.py: add GET_CONTEXT action type + context_hints observation field

tests/test_dynamic_world.py: 26 tests covering all new features

refactor: rename project from CodeReviewAgent to PRobe

All class names: ProbeAction, ProbeObservation, ProbeEnv, ProbeEnvironment

pyproject.toml, openenv.yaml, README.md, __init__.py fully updated

50/50 tests passing

Files changed (34)

README.md +98 -34
__init__.py +6 -6
__pycache__/__init__.cpython-314.pyc +0 -0
__pycache__/client.cpython-314.pyc +0 -0
__pycache__/models.cpython-314.pyc +0 -0
client.py +31 -44
models.py +72 -35
openenv.yaml +35 -14
openenv_CodeReviewAgent.egg-info/SOURCES.txt +5 -1
openenv_PRobe.egg-info/PKG-INFO +11 -0
openenv_PRobe.egg-info/SOURCES.txt +19 -0
openenv_PRobe.egg-info/dependency_links.txt +1 -0
openenv_PRobe.egg-info/entry_points.txt +2 -0
openenv_PRobe.egg-info/requires.txt +7 -0
openenv_PRobe.egg-info/top_level.txt +1 -0
pyproject.toml +11 -6
server/CodeReviewAgent_environment.py +127 -32
server/__init__.py +3 -3
server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc +0 -0
server/__pycache__/__init__.cpython-314.pyc +0 -0
server/__pycache__/grader.cpython-314.pyc +0 -0
server/__pycache__/mutator.cpython-314.pyc +0 -0
server/__pycache__/tasks.cpython-314.pyc +0 -0
server/app.py +28 -22
server/grader.py +66 -39
server/mutator.py +123 -0
server/tasks.py +224 -0
tests/__init__.py +0 -0
tests/__pycache__/__init__.cpython-314.pyc +0 -0
tests/__pycache__/test_dynamic_world.cpython-314-pytest-9.0.3.pyc +0 -0
tests/__pycache__/test_grader.cpython-314-pytest-9.0.3.pyc +0 -0
tests/test_dynamic_world.py +344 -0
tests/test_grader.py +397 -0
uv.lock +39 -26

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: CodeReviewAgent Environment
 emoji: 🔍
 colorFrom: blue
 colorTo: green
@@ -12,13 +12,22 @@ tags:
   - code-review
   - rl-training
   - grpo
 ---
-# CodeReviewAgent — OpenEnv Environment
 > **OpenEnv Hackathon 2026 · Theme #3.1 — World Modeling (Professional Tasks)**
-An RL training environment where an LLM learns to perform structured **pull-request code reviews** on real Python source files.  The agent must identify bugs, security vulnerabilities, performance bottlenecks, and design issues — and submit a structured review with line-level comments.
 ---
@@ -31,15 +40,17 @@ This environment provides a **reward signal** that directly measures review qual
 ## Environment Design
-### Tasks (5 total)
 | ID | Difficulty | File | Issues | Domain |
 |----|-----------|------|--------|--------|
-| 0  | Easy      | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
-| 1  | Medium    | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
-| 2  | Hard      | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
-| 3  | Medium    | `async_worker.py` | 5 | Race condition, missing await, resource leak |
-| 4  | Hard      | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
 Tasks cycle automatically on each `reset()` call.
@@ -47,18 +58,19 @@ Tasks cycle automatically on each `reset()` call.
 ```python
 {
-  "code_snippet":     str,   # Python source to review
-  "task_description": str,   # What to look for
-  "file_name":        str,
-  "task_id":          int,   # 0–4
-  "task_difficulty":  str,   # easy / medium / hard
-  "review_history":   list,  # actions taken so far this episode
-  "step_count":       int,
-  "max_steps":        int,
   "issues_found_count": int,
-  "total_issues":     int,
-  "done":             bool,
-  "reward":           float,
 }
 ```
@@ -66,7 +78,8 @@ Tasks cycle automatically on each `reset()` call.
 | action_type | Required fields | Effect |
 |-------------|----------------|--------|
-| `add_comment` | `line_number`, `comment`, `severity`, `category` | Annotate a line; partial reward if it matches a ground-truth issue |
 | `request_changes` | `comment` | Signal PR needs work |
 | `approve` | — | Approve PR (penalised if issues remain) |
 | `submit_review` | — | Finalise review; terminal reward |
@@ -86,7 +99,54 @@ Terminal (SUBMIT_REVIEW):
 Maximum achievable: ~1.0
 ```
-Grading uses **keyword + line-range matching** (±3 lines tolerance) against hand-labelled ground-truth issues — no LLM judge needed, fully deterministic.
 ---
@@ -141,11 +201,11 @@ All install, training, evaluation, and plotting cells are included.
 *(Fill in after training run)*
-| Model | Avg Reward | Task-0 | Task-1 | Task-2 | Task-3 | Task-4 |
-|-------|-----------|--------|--------|--------|--------|--------|
-| GPT-4o-mini (baseline) | — | — | — | — | — | — |
-| Qwen2.5-1.5B (untrained) | — | — | — | — | — | — |
-| Qwen2.5-1.5B (GRPO 3 epochs) | — | — | — | — | — | — |
 Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png`
@@ -154,17 +214,21 @@ Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png
 ## Project Structure
 ```
-CodeReviewAgent/
 ├── openenv.yaml                    # OpenEnv manifest
 ├── pyproject.toml
 ├── models.py                       # Action + Observation types
 ├── client.py                       # OpenEnv client
-└── server/
-    ├── app.py                      # FastAPI server
-    ├── CodeReviewAgent_environment.py
-    ├── grader.py                   # Deterministic reward grader
-    ├── tasks.py                    # 5 ground-truth tasks
-    └── Dockerfile
 train_grpo.py                       # GRPO training script
 train_grpo_colab.ipynb              # Colab notebook
 baseline.py                         # GPT-4o-mini baseline

 ---
+title: PRobe Environment
 emoji: 🔍
 colorFrom: blue
 colorTo: green
   - code-review
   - rl-training
   - grpo
+  - world-modeling
+  - probe
 ---
+# PRobe — Pull Request Investigation Environment
 > **OpenEnv Hackathon 2026 · Theme #3.1 — World Modeling (Professional Tasks)**
+> *An RL environment where agents learn to investigate code like a security researcher, not scan it like a linter.*
+PRobe is an RL training environment where an LLM learns to perform structured **pull-request code reviews** on real Python source files. The agent must identify bugs, security vulnerabilities, performance bottlenecks, and design issues — and submit a structured review with line-level comments.
+The name has three meanings that map directly to the environment's design:
+- **PR** — the domain: pull-request review
+- **Probe** — the `get_context` action where the agent literally probes lines for deeper context
+- **World Modeling** — an agent that *investigates* a partially observable system, updating its beliefs as new evidence is revealed
 ---
 ## Environment Design
+### Tasks (7 total)
 | ID | Difficulty | File | Issues | Domain |
 |----|-----------|------|--------|--------|
+| 0  | Ultra-easy | `bootstrap.py` | 2 | Off-by-one, hardcoded credential (hinted in comments) |
+| 1  | Easy       | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
+| 2  | Medium     | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
+| 3  | Hard       | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
+| 4  | Medium     | `async_worker.py` | 5 | Race condition, missing await, resource leak |
+| 5  | Hard       | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
+| 6  | Hard       | `auth_service.py` | 6 | **Causal chain** — JWT forgery → privilege escalation |
 Tasks cycle automatically on each `reset()` call.
 ```python
 {
+  "code_snippet":      str,    # Python source to review (mutated each episode)
+  "task_description":  str,    # What to look for
+  "file_name":         str,
+  "task_id":           int,    # 0–6
+  "task_difficulty":   str,    # ultra-easy / easy / medium / hard
+  "review_history":    list,   # actions taken so far this episode
+  "step_count":        int,
+  "max_steps":         int,
   "issues_found_count": int,
+  "total_issues":      int,
+  "context_hints":     list,   # causal hints unlocked so far (Task 6)
+  "done":              bool,
+  "reward":            float,
 }
 ```
 | action_type | Required fields | Effect |
 |-------------|----------------|--------|
+| `add_comment` | `line_number`, `comment`, `severity`, `category` | Annotate a line; reward if it matches a ground-truth issue |
+| `get_context` | `line_number` | Reveal ±5 lines of context around a line (free near issues, −0.01 elsewhere) |
 | `request_changes` | `comment` | Signal PR needs work |
 | `approve` | — | Approve PR (penalised if issues remain) |
 | `submit_review` | — | Finalise review; terminal reward |
 Maximum achievable: ~1.0
 ```
+Grading uses **keyword + line-range matching** (±2 lines tolerance) against hand-labelled ground-truth issues — no LLM judge needed, fully deterministic.
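
The keyword plus line-range rule described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `server/grader.py`: the `Issue` dataclass, its field names, and `matches_issue` are hypothetical, and only the ±2 line tolerance comes from the README text.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 2  # the README states a tolerance of ±2 lines


@dataclass(frozen=True)
class Issue:
    """Hypothetical ground-truth record; the real schema may differ."""
    line_range: tuple[int, int]   # inclusive (start, end), 1-based
    keywords: tuple[str, ...]     # lowercase terms a valid comment should mention


def matches_issue(issue: Issue, line_number: int, comment: str) -> bool:
    """Return True when the comment lands near the issue and names it."""
    start, end = issue.line_range
    in_range = start - LINE_TOLERANCE <= line_number <= end + LINE_TOLERANCE
    text = comment.lower()
    has_keyword = any(keyword in text for keyword in issue.keywords)
    return in_range and has_keyword


issue = Issue(line_range=(10, 12), keywords=("sql injection", "parameterized"))
print(matches_issue(issue, 14, "Possible SQL injection via string formatting"))
print(matches_issue(issue, 15, "Possible SQL injection via string formatting"))
```

Because the check is plain string and integer comparison, the same comment on the same line always grades the same way, which is what makes the reward deterministic without an LLM judge.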
+---
+## Dynamic World Features (v3)
+### Code Mutation
+Every `reset()` applies three surface-level mutations so the agent must *read* code each episode rather than memorise tokens:
+| Mutation | Effect |
+|---|---|
+| Variable rename | One identifier swapped for a synonym (e.g. `total` → `acc`) |
+| Line shift | One blank line inserted above the first issue, shifting all `line_range` values by +1 |
+| Constant variance | One numeric literal nudged ±1 (e.g. `range(1000)` → `range(999)`) |
+Mutations are fully **deterministic** given the episode seed — reproducible but always fresh.
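
A seeded mutation pass along these lines might look like the sketch below. It is an assumption-laden illustration rather than the code in `server/mutator.py`: the `SYNONYMS` table, the `mutate` signature, and the regex-based rename (which would also touch strings and comments) are all hypothetical.

```python
import random
import re

SYNONYMS = {"total": "acc", "result": "output"}  # hypothetical rename table


def mutate(source: str, seed: int, first_issue_line: int) -> tuple[str, int]:
    """Apply rename, constant nudge, and line shift, seeded for reproducibility.

    Returns the mutated source and the offset to add to ground-truth
    line numbers at or after ``first_issue_line``.
    """
    rng = random.Random(seed)  # private RNG: same seed, same mutations

    # Variable rename: swap one known identifier for its synonym.
    present = sorted(n for n in SYNONYMS if re.search(rf"\b{re.escape(n)}\b", source))
    if present:
        old = rng.choice(present)
        source = re.sub(rf"\b{re.escape(old)}\b", SYNONYMS[old], source)

    lines = source.split("\n")

    # Constant variance: nudge one multi-digit integer literal by ±1.
    candidates = [i for i, line in enumerate(lines) if re.search(r"\b\d{2,}\b", line)]
    if candidates:
        idx = rng.choice(candidates)
        delta = rng.choice((-1, 1))
        lines[idx] = re.sub(
            r"\b\d{2,}\b",
            lambda m: str(int(m.group()) + delta),
            lines[idx],
            count=1,
        )

    # Line shift: insert one blank line above the first issue (1-based).
    lines.insert(first_issue_line - 1, "")
    return "\n".join(lines), 1
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the mutation independent of any other randomness in the process, so replaying an episode seed reproduces the same code.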
+### GET_CONTEXT Action
+The agent can spend a step probing any line to receive ±5 lines of surrounding context:
+```python
+action = ProbeAction(
+    action_type="get_context",
+    line_number=37,
+)
+# Observation will contain a context snippet around line 37
+# Cost: -0.01 if line is far from any real issue, 0.00 if near one
+```
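
On the server side, the probe could be handled roughly as in the sketch below. The function name, the line-numbered snippet format, and the `near` threshold are assumptions; the README only says that probes are free "near" an issue and cost 0.01 elsewhere, without defining how near.

```python
def get_context(
    source: str,
    line_number: int,
    issue_lines: list[int],
    radius: int = 5,   # ±5 lines, as stated in the README
    near: int = 2,     # assumed definition of "near an issue"
) -> tuple[str, float]:
    """Return a numbered snippet around ``line_number`` and the probe cost."""
    lines = source.split("\n")
    if not 1 <= line_number <= len(lines):
        raise ValueError(f"line_number {line_number} is outside 1..{len(lines)}")

    # Clamp the window to the file boundaries.
    start = max(1, line_number - radius)
    end = min(len(lines), line_number + radius)
    snippet = "\n".join(f"{n:>4}: {lines[n - 1]}" for n in range(start, end + 1))

    # Probing close to a real issue is free; anything else costs a little.
    is_near = any(abs(line_number - issue) <= near for issue in issue_lines)
    cost = 0.0 if is_near else -0.01
    return snippet, cost
```

The small penalty for distant probes discourages an agent from probing every line, while leaving targeted investigation free.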
+### Causal Unlock Chain (Task 6)
+Task 6 implements a **progressive world model**: finding certain issues unlocks new context hints that reveal deeper parts of the system:
+```
+Find hardcoded JWT secret
+        │
+        ▼
+  DB schema revealed ──► agent sees plaintext passwords + role table
+        │
+        ▼
+  Can now reason: leaked secret → forge admin token → privilege escalation
+Find missing rate-limit
+        │
+        ▼
+  nginx config revealed ──► confirms /auth fully exposed, no IP filtering
+```
+This rewards genuine *causal reasoning* — the agent must update its world model as new evidence arrives.
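
One way to model the unlock chain is a mapping from issue identifiers to the hint each one reveals. The sketch below is hypothetical: the issue IDs, the hint wording, and `unlock_hints` are illustrative stand-ins for whatever `server/tasks.py` actually defines.

```python
# Hypothetical issue IDs mapped to the context each one unlocks.
UNLOCKS = {
    "hardcoded_jwt_secret": "DB schema: plaintext passwords and a role table",
    "missing_rate_limit": "nginx config: /auth fully exposed, no IP filtering",
}


def unlock_hints(found_issue_ids: set[str], current_hints: list[str]) -> list[str]:
    """Return the hint list after revealing hints for newly found issues."""
    hints = list(current_hints)  # copy so the caller's list is not mutated
    for issue_id, hint in UNLOCKS.items():
        if issue_id in found_issue_ids and hint not in hints:
            hints.append(hint)
    return hints


hints = unlock_hints({"hardcoded_jwt_secret"}, [])
hints = unlock_hints({"hardcoded_jwt_secret", "missing_rate_limit"}, hints)
print(hints)
```

Hints are only ever appended, so the `context_hints` field in the observation grows monotonically as the agent finds key issues.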
 ---
 *(Fill in after training run)*
+| Model | Avg Reward | Task-0 | Task-1 | Task-2 | Task-3 | Task-4 | Task-5 | Task-6 |
+|-------|-----------|--------|--------|--------|--------|--------|--------|--------|
+| GPT-4o-mini (baseline) | — | — | — | — | — | — | — | — |
+| Qwen2.5-1.5B (untrained) | — | — | — | — | — | — | — | — |
+| Qwen2.5-1.5B (GRPO 3 epochs) | — | — | — | — | — | — | — | — |
 Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png`
 ## Project Structure
 ```
+PRobe/
 ├── openenv.yaml                    # OpenEnv manifest
 ├── pyproject.toml
 ├── models.py                       # Action + Observation types
 ├── client.py                       # OpenEnv client
+├── server/
+│   ├── app.py                      # FastAPI server
+│   ├── PRobe_environment.py        # Environment core
+│   ├── grader.py                   # Deterministic reward grader
+│   ├── mutator.py                  # Code mutation engine (dynamic world)
+│   ├── tasks.py                    # 7 ground-truth tasks
+│   └── Dockerfile
+├── tests/
+│   ├── test_grader.py              # 24 grader tests (all 5 RL attacks)
+│   └── test_dynamic_world.py       # 26 dynamic world tests
 train_grpo.py                       # GRPO training script
 train_grpo_colab.ipynb              # Colab notebook
 baseline.py                         # GPT-4o-mini baseline

__init__.py CHANGED

@@ -4,13 +4,13 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""Codereviewagent Environment."""
-from .client import CodereviewagentEnv
-from .models import CodereviewagentAction, CodereviewagentObservation
 __all__ = [
-    "CodereviewagentAction",
-    "CodereviewagentObservation",
-    "CodereviewagentEnv",
 ]

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""PRobe \u2014 Pull Request Investigation Environment."""
+from .client import ProbeEnv
+from .models import ProbeAction, ProbeObservation
 __all__ = [
+    "ProbeAction",
+    "ProbeObservation",
+    "ProbeEnv",
 ]

__pycache__/__init__.cpython-314.pyc CHANGED

Binary files a/__pycache__/__init__.cpython-314.pyc and b/__pycache__/__init__.cpython-314.pyc differ

__pycache__/client.cpython-314.pyc CHANGED

Binary files a/__pycache__/client.cpython-314.pyc and b/__pycache__/client.cpython-314.pyc differ

__pycache__/models.cpython-314.pyc CHANGED

Binary files a/__pycache__/models.cpython-314.pyc and b/__pycache__/models.cpython-314.pyc differ

client.py CHANGED

@@ -1,40 +1,39 @@
-"""CodeReviewAgent Environment Client."""
-from typing import Dict
 from openenv.core import EnvClient
 from openenv.core.client_types import StepResult
 from openenv.core.env_server.types import State
-from .models import CodereviewagentAction, CodereviewagentObservation
-class CodereviewagentEnv(
-    EnvClient[CodereviewagentAction, CodereviewagentObservation, State]
-):
     """
-    Client for the CodeReviewAgent environment.
     Maintains a persistent WebSocket connection to the server.
-    Example:
-        >>> with CodereviewagentEnv(base_url="http://localhost:8000") as env:
-        ...     result = env.reset()
-        ...     print(result.observation.task_description)
-        ...
-        ...     action = CodereviewagentAction(
-        ...         action_type="add_comment",
-        ...         line_number=4,
-        ...         comment="Off-by-one: range(len+1) causes IndexError",
-        ...         severity="error",
-        ...         category="bug",
-        ...     )
-        ...     result = env.step(action)
-        ...     print(result.reward)
     """
-    def _step_payload(self, action: CodereviewagentAction) -> Dict:
-        payload = {"action_type": action.action_type.value}
         if action.line_number is not None:
             payload["line_number"] = action.line_number
         if action.comment is not None:
@@ -46,31 +45,19 @@ class CodereviewagentEnv(
         return payload
     def _parse_result(
-        self, payload: Dict
-    ) -> StepResult[CodereviewagentObservation]:
-        obs_data = payload.get("observation", {})
-        observation = CodereviewagentObservation(
-            code_snippet=obs_data.get("code_snippet", ""),
-            task_description=obs_data.get("task_description", ""),
-            file_name=obs_data.get("file_name", ""),
-            task_id=obs_data.get("task_id", 0),
-            task_difficulty=obs_data.get("task_difficulty", "easy"),
-            review_history=obs_data.get("review_history", []),
-            step_count=obs_data.get("step_count", 0),
-            max_steps=obs_data.get("max_steps", 20),
-            issues_found_count=obs_data.get("issues_found_count", 0),
-            total_issues=obs_data.get("total_issues", 0),
-            done=payload.get("done", False),
-            reward=payload.get("reward"),
-            metadata=obs_data.get("metadata", {}),
-        )
         return StepResult(
             observation=observation,
-            reward=payload.get("reward"),
-            done=payload.get("done", False),
         )
-    def _parse_state(self, payload: Dict) -> State:
         return State(
             episode_id=payload.get("episode_id"),
             step_count=payload.get("step_count", 0),

+"""PRobe Environment Client."""
+from __future__ import annotations
 from openenv.core import EnvClient
 from openenv.core.client_types import StepResult
 from openenv.core.env_server.types import State
+from .models import ProbeAction, ProbeObservation
+class ProbeEnv(EnvClient[ProbeAction, ProbeObservation, State]):
     """
+    Client for the PRobe environment.
     Maintains a persistent WebSocket connection to the server.
+    Example::
+        with ProbeEnv(base_url="http://localhost:8000") as env:
+            result = env.reset()
+            print(result.observation.task_description)
+            action = ProbeAction(
+                action_type="add_comment",
+                line_number=4,
+                comment="Off-by-one: range(len+1) causes IndexError",
+                severity="error",
+                category="bug",
+            )
+            result = env.step(action)
+            print(result.reward)
     """
+    def _step_payload(self, action: ProbeAction) -> dict:
+        payload: dict = {"action_type": action.action_type.value}
         if action.line_number is not None:
             payload["line_number"] = action.line_number
         if action.comment is not None:
         return payload
     def _parse_result(
+        self, payload: dict
+    ) -> StepResult[ProbeObservation]:
+        obs_data: dict = payload.get("observation", {})
+        # Use model_validate so new fields added to ProbeObservation
+        # are picked up automatically without changing this method.
+        observation = ProbeObservation.model_validate(obs_data)
         return StepResult(
             observation=observation,
+            reward=float(payload.get("reward") or 0.0),
+            done=bool(payload.get("done", False)),
         )
+    def _parse_state(self, payload: dict) -> State:
         return State(
             episode_id=payload.get("episode_id"),
             step_count=payload.get("step_count", 0),
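
The new `_parse_result` coerces `reward` and `done` instead of passing them through unchanged. A standalone sketch of that coercion, using a plain dict and a hypothetical helper name in place of the real `StepResult`, shows how a missing or `null` reward is handled:

```python
def parse_reward_and_done(payload: dict) -> tuple[float, bool]:
    """Mirror the coercion used in the updated client (helper name is illustrative)."""
    # ``or 0.0`` maps both a missing key and an explicit None to 0.0.
    reward = float(payload.get("reward") or 0.0)
    done = bool(payload.get("done", False))
    return reward, done


print(parse_reward_and_done({"reward": None}))
print(parse_reward_and_done({"reward": 0.25, "done": True}))
```

One consequence of this pattern is that callers can no longer distinguish "no reward reported" from a reward of exactly zero.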

models.py CHANGED

@@ -1,10 +1,12 @@
 """
-Data models for the CodeReviewAgent Environment.
 An agent reviews Python source files, identifies bugs, security issues,
 and design problems, then submits a structured review.
 """
 from enum import Enum
 from typing import Any
@@ -13,13 +15,18 @@ from pydantic import BaseModel, ConfigDict, Field
 class ActionType(str, Enum):
     ADD_COMMENT = "add_comment"
     REQUEST_CHANGES = "request_changes"
     APPROVE = "approve"
     SUBMIT_REVIEW = "submit_review"
 class Severity(str, Enum):
     INFO = "info"
     WARNING = "warning"
     ERROR = "error"
@@ -27,6 +34,8 @@ class Severity(str, Enum):
 class IssueCategory(str, Enum):
     BUG = "bug"
     SECURITY = "security"
     PERFORMANCE = "performance"
@@ -36,62 +45,90 @@ class IssueCategory(str, Enum):
 class RewardType(BaseModel):
     """
-    Structured reward returned by step().
-    total       : final clamped score in [-1.0, 1.0]
-    components  : named sub-scores before clamping (may sum outside [-1, 1])
-    passed      : True when the action was a clear positive signal
-    explanation : human-readable breakdown for logging / debugging
-    step        : environment step this reward was issued at
-    terminal    : True only on the SUBMIT_REVIEW step
     """
     model_config = ConfigDict(frozen=True)
     total: float = Field(..., ge=-1.0, le=1.0)
     components: dict[str, float] = Field(default_factory=dict)
-    passed: bool = Field(False)
-    explanation: str = Field("")
-    step: int = Field(0)
-    terminal: bool = Field(False)
-class CodereviewagentAction(Action):
     """
-    - ADD_COMMENT    : annotate a specific line with a review comment
-    - REQUEST_CHANGES: mark the PR as needing changes
-    - APPROVE        : approve the PR (only when no significant issues remain)
-    - SUBMIT_REVIEW  : finalize and submit the review (ends the episode)
     """
     action_type: ActionType = Field(..., description="Type of review action")
-    line_number: int | None = Field(None, description="Source line being commented on")
-    comment: str | None = Field(None, description="Review comment text")
-    severity: Severity | None = Field(None, description="Issue severity level")
-    category: IssueCategory | None = Field(None, description="Issue category")
-class CodereviewagentObservation(Observation):
     """
-    Contains the code to review, task instructions, and the running
-    review history so the agent can track what it has already flagged.
-    The `reward` field mirrors the most recent step reward for convenience;
-    the authoritative reward is the RewardType returned by step().
     """
-    code_snippet: str = Field(default="", description="Python source code to review")
     task_description: str = Field(default="", description="Review instructions and goals")
     file_name: str = Field(default="", description="Name of the file being reviewed")
-    task_id: int = Field(default=0, description="Current task index")
     task_difficulty: str = Field(default="ultra-easy", description="Task difficulty label")
     review_history: list[dict[str, Any]] = Field(
         default_factory=list,
-        description="Ordered list of actions taken so far this episode",
     )
-    step_count: int = Field(default=0, description="Steps taken in current episode")
-    max_steps: int = Field(default=6, description="Step budget for this task")
-    issues_found_count: int = Field(default=0, description="Number of issues identified so far")
-    total_issues: int = Field(default=0, description="Total issues in this task")
     done: bool = Field(default=False, description="Whether the episode has ended")
-    reward: float = Field(default=0.0, description="Most recent step reward (mirror of RewardType.total)")
     metadata: dict[str, Any] = Field(default_factory=dict, description="Extra episode metadata")

 """
+Data models for the PRobe Environment.
 An agent reviews Python source files, identifies bugs, security issues,
 and design problems, then submits a structured review.
 """
+from __future__ import annotations
 from enum import Enum
 from typing import Any
 class ActionType(str, Enum):
+    """All actions the agent may take during a review episode."""
     ADD_COMMENT = "add_comment"
+    GET_CONTEXT = "get_context"       # probe a line for deeper causal context
     REQUEST_CHANGES = "request_changes"
     APPROVE = "approve"
     SUBMIT_REVIEW = "submit_review"
 class Severity(str, Enum):
+    """Severity levels for review comments."""
     INFO = "info"
     WARNING = "warning"
     ERROR = "error"
 class IssueCategory(str, Enum):
+    """Issue category taxonomy used in review comments."""
     BUG = "bug"
     SECURITY = "security"
     PERFORMANCE = "performance"
 class RewardType(BaseModel):
     """
+    Structured reward returned by ``step()``.
+    Attributes:
+        total:       Final clamped score in ``[-1.0, 1.0]``.
+        components:  Named sub-scores before clamping (may sum outside ``[-1, 1]``).
+        passed:      ``True`` when the action produced a clear positive signal.
+        explanation: Human-readable breakdown for logging / debugging.
+        step:        Environment step at which this reward was issued.
+        terminal:    ``True`` only on the ``SUBMIT_REVIEW`` step.
     """
     model_config = ConfigDict(frozen=True)
     total: float = Field(..., ge=-1.0, le=1.0)
     components: dict[str, float] = Field(default_factory=dict)
+    passed: bool = Field(default=False)
+    explanation: str = Field(default="")
+    step: int = Field(default=0, ge=0)
+    terminal: bool = Field(default=False)
+class ProbeAction(Action):
     """
+    An action the agent submits during a review episode.
+    Action types:
+        ADD_COMMENT     — annotate a specific line with a review comment.
+        GET_CONTEXT     — reveal ±5 lines of context around a line number.
+        REQUEST_CHANGES — mark the PR as requiring changes before merge.
+        APPROVE         — approve the PR (penalised if issues remain).
+        SUBMIT_REVIEW   — finalise and submit the review (ends the episode).
     """
     action_type: ActionType = Field(..., description="Type of review action")
+    line_number: int | None = Field(
+        default=None,
+        ge=1,
+        description="1-based source line being commented on or probed",
+    )
+    comment: str | None = Field(default=None, description="Review comment text")
+    severity: Severity | None = Field(default=None, description="Issue severity level")
+    category: IssueCategory | None = Field(default=None, description="Issue category")
+class ProbeObservation(Observation):
     """
+    The observation returned to the agent after every ``reset()`` / ``step()``.
+    The ``reward`` field mirrors ``RewardType.total`` for the most recent step
+    as a convenience; the authoritative reward object is returned by ``step()``.
     """
+    code_snippet: str = Field(default="", description="Python source code to review (mutated each episode)")
     task_description: str = Field(default="", description="Review instructions and goals")
     file_name: str = Field(default="", description="Name of the file being reviewed")
+    task_id: int = Field(default=0, ge=0, description="Current task index (0–6)")
     task_difficulty: str = Field(default="ultra-easy", description="Task difficulty label")
     review_history: list[dict[str, Any]] = Field(
         default_factory=list,
+        description="Ordered list of all actions taken so far this episode",
+    )
+    step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
+    max_steps: int = Field(default=6, ge=1, description="Step budget for this task")
+    issues_found_count: int = Field(default=0, ge=0, description="Distinct issues identified so far")
+    total_issues: int = Field(default=0, ge=0, description="Total ground-truth issues in this task")
+    context_hints: list[str] = Field(
+        default_factory=list,
+        description="Causal context unlocked by finding key issues — read these before continuing",
     )
     done: bool = Field(default=False, description="Whether the episode has ended")
+    reward: float = Field(
+        default=0.0,
+        ge=-1.0,
+        le=1.0,
+        description="Most recent step reward (mirrors RewardType.total)",
+    )
     metadata: dict[str, Any] = Field(default_factory=dict, description="Extra episode metadata")
+__all__ = [
+    "ActionType",
+    "IssueCategory",
+    "ProbeAction",
+    "ProbeObservation",
+    "RewardType",
+    "Severity",
+]
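
Every optional field on `ProbeAction` defaults to `None`, so the per-action requirements listed in the README table are not enforced by the model as shown. A minimal sketch of such a check follows, assuming the README table is authoritative; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names and the enum is re-declared only to keep the example self-contained.

```python
from enum import Enum


class ActionType(str, Enum):
    ADD_COMMENT = "add_comment"
    GET_CONTEXT = "get_context"
    REQUEST_CHANGES = "request_changes"
    APPROVE = "approve"
    SUBMIT_REVIEW = "submit_review"


# Required fields per action, taken from the README action table.
REQUIRED_FIELDS: dict[ActionType, tuple[str, ...]] = {
    ActionType.ADD_COMMENT: ("line_number", "comment", "severity", "category"),
    ActionType.GET_CONTEXT: ("line_number",),
    ActionType.REQUEST_CHANGES: ("comment",),
    ActionType.APPROVE: (),
    ActionType.SUBMIT_REVIEW: (),
}


def missing_fields(action: dict) -> list[str]:
    """List required fields that are absent or None for this action."""
    action_type = ActionType(action["action_type"])  # raises ValueError if unknown
    return [name for name in REQUIRED_FIELDS[action_type] if action.get(name) is None]


print(missing_fields({"action_type": "get_context"}))
```

In a pydantic model the same rule would more naturally live in a model-level validator; the dict-based form here only avoids a third-party dependency in the example.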

openenv.yaml CHANGED

@@ -1,32 +1,40 @@
 spec_version: 1
-name: CodeReviewAgent
 type: space
 runtime: fastapi
 app: server.app:app
 port: 8000
 description: >
-  Code review environment where an agent reviews Python source files,
-  identifies bugs, security vulnerabilities, performance bottlenecks,
-  and design issues, then submits a structured review with comments
-  and a final decision (request_changes or approve).
 tasks:
   - id: 0
     name: Basic Bug Detection
     difficulty: easy
     description: Identify logical bugs in a simple Python utility module
     max_steps: 15
     issues: 3
-  - id: 1
     name: Security Vulnerability Review
     difficulty: medium
     description: Find security vulnerabilities in an authentication module
     max_steps: 20
     issues: 5
-  - id: 2
     name: Full Architecture and Performance Review
     difficulty: hard
     description: >
@@ -35,14 +43,14 @@ tasks:
     max_steps: 30
     issues: 7
-  - id: 3
     name: Async Worker Review
     difficulty: medium
     description: Find concurrency bugs and resource leaks in an async worker
     max_steps: 20
     issues: 5
-  - id: 4
     name: Flask API Security Review
     difficulty: hard
     description: >
@@ -51,19 +59,30 @@ tasks:
     max_steps: 30
     issues: 6
 observation:
   type: object
   fields:
-    code_snippet: {type: string, description: "Python source to review"}
     task_description: {type: string, description: "Review instructions"}
     file_name: {type: string}
-    task_id: {type: integer, range: [0, 4]}
-    task_difficulty: {type: string, values: [easy, medium, hard]}
     review_history: {type: array, description: "Actions taken so far"}
     step_count: {type: integer}
     max_steps: {type: integer}
     issues_found_count: {type: integer}
     total_issues: {type: integer}
     done: {type: boolean}
     reward: {type: number}
@@ -72,7 +91,7 @@ action:
   fields:
     action_type:
       type: enum
-      values: [add_comment, request_changes, approve, submit_review]
     line_number: {type: integer, required: false}
     comment: {type: string, required: false}
     severity:
@@ -88,9 +107,11 @@ reward_design:
   range: [-1.0, 1.0]
   per_step:
     issue_found: "up to 0.60 total (weight/total_weight × 0.60 per issue)"
-    false_positive: -0.02
     correct_request_changes: +0.05
     bad_approval: -0.15
   terminal:
     coverage_bonus: "coverage × 0.20  (max +0.20)"
     decision_correct: +0.10

 spec_version: 1
+name: PRobe
 type: space
 runtime: fastapi
 app: server.app:app
 port: 8000
 description: >
+  PRobe (Pull Request Investigation Environment) — an RL training environment
+  where an agent reviews Python source files, identifies bugs, security
+  vulnerabilities, performance bottlenecks, and design issues, then submits a
+  structured review. Features dynamic code mutation, a GET_CONTEXT probe action,
+  and a causal unlock chain for genuine world-model reasoning.
 tasks:
   - id: 0
+    name: Bootstrap Obvious Issues
+    difficulty: ultra-easy
+    description: Off-by-one and hardcoded credential, both hinted in comments
+    max_steps: 6
+    issues: 2
+  - id: 1
     name: Basic Bug Detection
     difficulty: easy
     description: Identify logical bugs in a simple Python utility module
     max_steps: 15
     issues: 3
+  - id: 2
     name: Security Vulnerability Review
     difficulty: medium
     description: Find security vulnerabilities in an authentication module
     max_steps: 20
     issues: 5
+  - id: 3
     name: Full Architecture and Performance Review
     difficulty: hard
     description: >
     max_steps: 30
     issues: 7
+  - id: 4
     name: Async Worker Review
     difficulty: medium
     description: Find concurrency bugs and resource leaks in an async worker
     max_steps: 20
     issues: 5
+  - id: 5
     name: Flask API Security Review
     difficulty: hard
     description: >
     max_steps: 30
     issues: 6
+  - id: 6
+    name: Causal Secrets Leak Investigation
+    difficulty: hard
+    description: >
+      JWT auth service review with causal unlock chain — finding key issues
+      reveals DB schema and nginx config, enabling deeper attack-path reasoning
+    max_steps: 35
+    issues: 6
+    causal_unlocks: true
 observation:
   type: object
   fields:
+    code_snippet: {type: string, description: "Python source to review (mutated each episode)"}
     task_description: {type: string, description: "Review instructions"}
     file_name: {type: string}
+    task_id: {type: integer, range: [0, 6]}
+    task_difficulty: {type: string, values: [ultra-easy, easy, medium, hard]}
     review_history: {type: array, description: "Actions taken so far"}
     step_count: {type: integer}
     max_steps: {type: integer}
     issues_found_count: {type: integer}
     total_issues: {type: integer}
+    context_hints: {type: array, description: "Causal hints unlocked by finding key issues"}
     done: {type: boolean}
     reward: {type: number}
   fields:
     action_type:
       type: enum
+      values: [add_comment, get_context, request_changes, approve, submit_review]
     line_number: {type: integer, required: false}
     comment: {type: string, required: false}
     severity:
   range: [-1.0, 1.0]
   per_step:
     issue_found: "up to 0.60 total (weight/total_weight × 0.60 per issue)"
+    false_positive: -0.05
     correct_request_changes: +0.05
     bad_approval: -0.15
+    context_probe_near_issue: 0.00
+    context_probe_far: -0.01
   terminal:
     coverage_bonus: "coverage × 0.20  (max +0.20)"
     decision_correct: +0.10
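
The reward block above is easier to check with numbers. The following is a minimal sketch of the arithmetic it describes, not the grader's actual code: the constants are copied from `openenv.yaml`, while the issue ids and weights, the function names, and the choice to score an incorrect final decision as 0 are assumptions made only for illustration.

```python
# Constants taken from the reward section of openenv.yaml.
ISSUE_POOL = 0.60          # shared pool for issue_found
COVERAGE_POOL = 0.20       # terminal coverage_bonus (coverage x 0.20)
DECISION_CORRECT = 0.10    # terminal decision_correct
FALSE_POSITIVE = -0.05     # per-step false_positive
CONTEXT_PROBE_FAR = -0.01  # per-step context_probe_far


def issue_reward(weight: float, total_weight: float) -> float:
    """Per-issue share of the pool: weight / total_weight x 0.60."""
    return weight / total_weight * ISSUE_POOL


def episode_total(
    weights: dict[str, float],
    found: set[str],
    false_positives: int,
    far_probes: int,
    decision_correct: bool,
) -> float:
    """Sum per-step and terminal components, clamped to [-1.0, 1.0]."""
    total_weight = sum(weights.values())
    per_step = sum(issue_reward(weights[i], total_weight) for i in found)
    per_step += false_positives * FALSE_POSITIVE
    per_step += far_probes * CONTEXT_PROBE_FAR

    coverage = sum(weights[i] for i in found) / total_weight
    terminal = coverage * COVERAGE_POOL
    # Assumption: an incorrect decision adds nothing; the YAML only lists
    # the +0.10 case.
    terminal += DECISION_CORRECT if decision_correct else 0.0
    return max(-1.0, min(1.0, per_step + terminal))


# Hypothetical task with three weighted issues.
weights = {"a": 3.0, "b": 2.0, "c": 1.0}
partial = episode_total(weights, {"a", "c"}, false_positives=1,
                        far_probes=2, decision_correct=True)
perfect = episode_total(weights, {"a", "b", "c"}, false_positives=0,
                        far_probes=0, decision_correct=True)
```

Under these assumptions a perfect episode totals 0.60 + 0.20 + 0.10 = 0.90, so the documented range of [-1.0, 1.0] is never reached by these three components alone.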

openenv_CodeReviewAgent.egg-info/SOURCES.txt CHANGED

@@ -1,4 +1,7 @@
 README.md
 pyproject.toml
 ./__init__.py
 ./client.py
@@ -13,4 +16,5 @@ server/CodeReviewAgent_environment.py
 server/__init__.py
 server/app.py
 server/grader.py
-server/tasks.py

 README.md
+__init__.py
+client.py
+models.py
 pyproject.toml
 ./__init__.py
 ./client.py
 server/__init__.py
 server/app.py
 server/grader.py
+server/tasks.py
+tests/test_grader.py

openenv_PRobe.egg-info/PKG-INFO ADDED

@@ -0,0 +1,11 @@

+Metadata-Version: 2.4
+Name: openenv-PRobe
+Version: 0.1.0
+Summary: PRobe — Pull Request Investigation Environment for OpenEnv
+Requires-Python: >=3.10
+Requires-Dist: openenv-core[core]>=0.2.2
+Requires-Dist: openai>=1.0.0
+Requires-Dist: python-dotenv>=1.2.2
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"

openenv_PRobe.egg-info/SOURCES.txt ADDED

@@ -0,0 +1,19 @@

+README.md
+pyproject.toml
+./__init__.py
+./client.py
+./models.py
+openenv_PRobe.egg-info/PKG-INFO
+openenv_PRobe.egg-info/SOURCES.txt
+openenv_PRobe.egg-info/dependency_links.txt
+openenv_PRobe.egg-info/entry_points.txt
+openenv_PRobe.egg-info/requires.txt
+openenv_PRobe.egg-info/top_level.txt
+server/CodeReviewAgent_environment.py
+server/__init__.py
+server/app.py
+server/grader.py
+server/mutator.py
+server/tasks.py
+tests/test_dynamic_world.py
+tests/test_grader.py

openenv_PRobe.egg-info/dependency_links.txt ADDED

@@ -0,0 +1 @@

+

openenv_PRobe.egg-info/entry_points.txt ADDED

@@ -0,0 +1,2 @@

+[console_scripts]
+server = PRobe.server.app:main

openenv_PRobe.egg-info/requires.txt ADDED

@@ -0,0 +1,7 @@

+openenv-core[core]>=0.2.2
+openai>=1.0.0
+python-dotenv>=1.2.2
+[dev]
+pytest>=8.0.0
+pytest-cov>=4.0.0

openenv_PRobe.egg-info/top_level.txt ADDED

@@ -0,0 +1 @@

+PRobe

pyproject.toml CHANGED

@@ -9,9 +9,9 @@ requires = ["setuptools>=45", "wheel"]
 build-backend = "setuptools.build_meta"
 [project]
-name = "openenv-CodeReviewAgent"
 version = "0.1.0"
-description = "Codereviewagent environment for OpenEnv"
 requires-python = ">=3.10"
 dependencies = [
     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
@@ -31,10 +31,15 @@ dev = [
 [project.scripts]
 # Server entry point - enables running via: uv run --project . server
-# or: python -m CodeReviewAgent.server.app
-server = "CodeReviewAgent.server.app:main"
 [tool.setuptools]
 include-package-data = true
-packages = ["CodeReviewAgent", "CodeReviewAgent.server"]
-package-dir = { "CodeReviewAgent" = ".", "CodeReviewAgent.server" = "server" }

 build-backend = "setuptools.build_meta"
 [project]
+name = "openenv-PRobe"
 version = "0.1.0"
+description = "PRobe — Pull Request Investigation Environment for OpenEnv"
 requires-python = ">=3.10"
 dependencies = [
     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
 [project.scripts]
 # Server entry point - enables running via: uv run --project . server
+server = "PRobe.server.app:main"
 [tool.setuptools]
 include-package-data = true
+packages = ["PRobe", "PRobe.server"]
+package-dir = { "PRobe" = ".", "PRobe.server" = "server" }
+[dependency-groups]
+dev = [
+    "pytest>=9.0.3",
+    "pytest-cov>=7.1.0",
+]

server/CodeReviewAgent_environment.py CHANGED

@@ -6,7 +6,18 @@ Episode lifecycle:
   2. step(a)  → (Obs, RewardType, done, info) (execute one action)
   3. state()  → dict                          (full internal snapshot)
-Tasks cycle automatically: 0 (ultra-easy) → 1 (easy) → … → 5 (hard flask) → 0 …
 Thread / task safety: each Environment instance owns its own state.
 For concurrent GRPO rollouts spin up one instance per worker.
@@ -15,6 +26,8 @@ For concurrent GRPO rollouts spin up one instance per worker.
 from __future__ import annotations
 import asyncio
 from typing import Any
 from uuid import uuid4
@@ -24,30 +37,30 @@ from openenv.core.env_server.types import State
 try:
     from ..models import (
         ActionType,
-        CodereviewagentAction,
-        CodereviewagentObservation,
         RewardType,
     )
-    from .grader import CodeReviewGrader
     from .tasks import TASKS
 except ImportError:
     from models import (  # type: ignore[no-redef]
         ActionType,
-        CodereviewagentAction,
-        CodereviewagentObservation,
         RewardType,
     )
-    from server.grader import CodeReviewGrader  # type: ignore[no-redef]
-    from server.tasks import TASKS  # type: ignore[no-redef]
-# Sentinel reward returned on non-terminal steps that produce no signal
-_ZERO_REWARD = RewardType(total=0.0, components={}, passed=False,
-                           explanation="No signal this step.", step=0, terminal=False)
-class CodereviewagentEnvironment(Environment):
     """
-    OpenEnv-compliant code-review environment.
     Public interface is fully async.  The sync wrappers (reset / step / state)
     required by openenv's create_app are also provided; they delegate to the
@@ -76,23 +89,28 @@ class CodereviewagentEnvironment(Environment):
             "review_decision": None,
             "review_submitted": False,
             "cumulative_reward": 0.0,
         }
     # ── Async-native interface (primary) ──────────────────────────────────
-    async def async_reset(self) -> CodereviewagentObservation:
         task_id = self._reset_count % len(TASKS)
         self._reset_count += 1
         self._episode_id = str(uuid4())
         self._step_count = 0
-        task = TASKS[task_id]
         self._grader = CodeReviewGrader(task)
         self._ep = self._fresh_episode(task)
         return self._make_obs(reward=0.0, done=False)
     async def async_step(
-        self, action: CodereviewagentAction
-    ) -> tuple[CodereviewagentObservation, RewardType, bool, dict[str, Any]]:
         self._step_count += 1
         task = self._ep["task"]
         done = False
@@ -101,6 +119,9 @@ class CodereviewagentEnvironment(Environment):
         if action.action_type == ActionType.ADD_COMMENT:
             reward_obj = self._handle_add_comment(action)
         elif action.action_type == ActionType.REQUEST_CHANGES:
             reward_obj = self._handle_request_changes(action)
@@ -165,32 +186,29 @@ class CodereviewagentEnvironment(Environment):
     # ── Sync wrappers (openenv / create_app compatibility) ────────────────
-    def reset(self) -> CodereviewagentObservation:  # type: ignore[override]
         try:
-            loop = asyncio.get_running_loop()
         except RuntimeError:
             return asyncio.run(self.async_reset())
-        # Called from inside a running loop (e.g. pytest-asyncio) — run directly
-        import concurrent.futures
         with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
-            fut = pool.submit(asyncio.run, self.async_reset())
-            return fut.result()
-    def step(self, action: CodereviewagentAction) -> CodereviewagentObservation:  # type: ignore[override]
         """
         Sync step for openenv compatibility.
         Returns only the Observation (reward is embedded in obs.reward).
         Use async_step() for the full (obs, reward, done, info) tuple.
         """
         try:
-            loop = asyncio.get_running_loop()
         except RuntimeError:
             obs, _, _, _ = asyncio.run(self.async_step(action))
             return obs
-        import concurrent.futures
         with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
-            fut = pool.submit(asyncio.run, self.async_step(action))
-            obs, _, _, _ = fut.result()
             return obs
     @property
@@ -199,7 +217,7 @@ class CodereviewagentEnvironment(Environment):
     # ── Action handlers ───────────────────────────────────────────────────
-    def _handle_add_comment(self, action: CodereviewagentAction) -> RewardType:
         entry = {
             "type": "comment",
             "line": action.line_number,
@@ -224,6 +242,9 @@ class CodereviewagentEnvironment(Environment):
         else:
             explanation = "Comment recorded; no new issue matched."
         return RewardType(
             total=clamped,
             components=breakdown,
@@ -233,7 +254,79 @@ class CodereviewagentEnvironment(Environment):
             terminal=False,
         )
-    def _handle_request_changes(self, action: CodereviewagentAction) -> RewardType:
         self._ep["review_decision"] = "request_changes"
         self._ep["review_comments"].append(
             {"type": "request_changes", "text": action.comment}
@@ -304,9 +397,9 @@ class CodereviewagentEnvironment(Environment):
     # ── Observation builder ───────────────────────────────────────────────
-    def _make_obs(self, reward: float, done: bool) -> CodereviewagentObservation:
         task = self._ep["task"]
-        return CodereviewagentObservation(
             code_snippet=task["code"],
             task_description=task["description"],
             file_name=task["file_name"],
@@ -319,9 +412,11 @@ class CodereviewagentEnvironment(Environment):
             total_issues=len(task["issues"]),
             done=done,
             reward=round(max(-1.0, min(1.0, reward)), 4),
             metadata={
                 "cumulative_reward": self._ep.get("cumulative_reward", 0.0),
                 "review_decision": self._ep.get("review_decision"),
                 "episode_id": self._episode_id,
             },
         )

   2. step(a)  → (Obs, RewardType, done, info) (execute one action)
   3. state()  → dict                          (full internal snapshot)
+Tasks cycle automatically: 0 (ultra-easy) → 1 (easy) → … → 6 (causal chain) → 0 …
+Dynamic world features (v3)
+───────────────────────────
+• Code mutation   — each episode applies surface-level variable renames,
+                    a line shift, and a constant nudge so the agent must
+                    read the code rather than memorise tokens.
+• GET_CONTEXT     — the agent can spend a step probing a specific line to
+                    receive the surrounding ±5 lines of context.
+• Causal unlocks  — finding certain issues appends a new context hint to
+                    the observation, modelling real-world situations where
+                    one discovery leads to deeper investigation.
 Thread / task safety: each Environment instance owns its own state.
 For concurrent GRPO rollouts spin up one instance per worker.
 from __future__ import annotations
 import asyncio
+import concurrent.futures
+import logging
 from typing import Any
 from uuid import uuid4
 try:
     from ..models import (
         ActionType,
+        ProbeAction,
+        ProbeObservation,
         RewardType,
     )
+    from .grader import CodeReviewGrader, LINE_TOLERANCE
+    from .mutator import mutate_task
     from .tasks import TASKS
 except ImportError:
     from models import (  # type: ignore[no-redef]
         ActionType,
+        ProbeAction,
+        ProbeObservation,
         RewardType,
     )
+    from server.grader import CodeReviewGrader, LINE_TOLERANCE  # type: ignore[no-redef]
+    from server.mutator import mutate_task       # type: ignore[no-redef]
+    from server.tasks import TASKS              # type: ignore[no-redef]
+log = logging.getLogger(__name__)
+class ProbeEnvironment(Environment):
     """
+    PRobe — Pull Request Investigation Environment.
     Public interface is fully async.  The sync wrappers (reset / step / state)
     required by openenv's create_app are also provided; they delegate to the
             "review_decision": None,
             "review_submitted": False,
             "cumulative_reward": 0.0,
+            # causal world-modeling state
+            "context_hints": [],          # list[str] of unlocked hint texts
+            "hints_unlocked": set(),      # set[str] of hint keys already fired
         }
     # ── Async-native interface (primary) ──────────────────────────────────
+    async def async_reset(self) -> ProbeObservation:
         task_id = self._reset_count % len(TASKS)
+        seed = self._reset_count          # unique seed per episode
         self._reset_count += 1
         self._episode_id = str(uuid4())
         self._step_count = 0
+        # Apply surface mutation so the agent cannot memorise tokens
+        task = mutate_task(TASKS[task_id], seed=seed)
         self._grader = CodeReviewGrader(task)
         self._ep = self._fresh_episode(task)
         return self._make_obs(reward=0.0, done=False)
     async def async_step(
+        self, action: ProbeAction
+    ) -> tuple[ProbeObservation, RewardType, bool, dict[str, Any]]:
         self._step_count += 1
         task = self._ep["task"]
         done = False
         if action.action_type == ActionType.ADD_COMMENT:
             reward_obj = self._handle_add_comment(action)
+        elif action.action_type == ActionType.GET_CONTEXT:
+            reward_obj = self._handle_get_context(action)
         elif action.action_type == ActionType.REQUEST_CHANGES:
             reward_obj = self._handle_request_changes(action)
     # ── Sync wrappers (openenv / create_app compatibility) ────────────────
+    def reset(self) -> ProbeObservation:  # type: ignore[override]
         try:
+            asyncio.get_running_loop()
         except RuntimeError:
             return asyncio.run(self.async_reset())
+        # Called from inside a running loop (e.g. pytest-asyncio) -- run in a
+        # fresh thread that has its own event loop.
         with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+            return pool.submit(asyncio.run, self.async_reset()).result()
+    def step(self, action: ProbeAction) -> ProbeObservation:  # type: ignore[override]
         """
         Sync step for openenv compatibility.
         Returns only the Observation (reward is embedded in obs.reward).
         Use async_step() for the full (obs, reward, done, info) tuple.
         """
         try:
+            asyncio.get_running_loop()
         except RuntimeError:
             obs, _, _, _ = asyncio.run(self.async_step(action))
             return obs
         with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+            obs, _, _, _ = pool.submit(asyncio.run, self.async_step(action)).result()
             return obs
     @property
     # ── Action handlers ───────────────────────────────────────────────────
+    def _handle_add_comment(self, action: ProbeAction) -> RewardType:
         entry = {
             "type": "comment",
             "line": action.line_number,
         else:
             explanation = "Comment recorded; no new issue matched."
+        # ── Causal unlock: check whether any newly found issue reveals context
+        self._unlock_causal_hints(new_finds)
         return RewardType(
             total=clamped,
             components=breakdown,
             terminal=False,
         )
+    def _unlock_causal_hints(self, newly_found: list[str]) -> None:
+        """Append context hint text for any issue that has an 'unlocks' key."""
+        task = self._ep["task"]
+        hint_map: dict[str, str] = task.get("context_hints", {})
+        for issue in task["issues"]:
+            unlock_key = issue.get("unlocks")
+            if (
+                unlock_key
+                and issue["id"] in newly_found
+                and unlock_key not in self._ep["hints_unlocked"]
+                and unlock_key in hint_map
+            ):
+                self._ep["hints_unlocked"].add(unlock_key)
+                self._ep["context_hints"].append(hint_map[unlock_key])
+    def _handle_get_context(
+        self, action: ProbeAction
+    ) -> RewardType:
+        """
+        GET_CONTEXT — reveal ±5 lines around the requested line number.
+        Costs a small step penalty (-0.01) to discourage random probing,
+        but rewards focused investigation (line near an actual issue: 0.0
+        net cost — penalty waived).
+        """
+        line_number = action.line_number
+        task = self._ep["task"]
+        code_lines = task["code"].split("\n")
+        if line_number is None:
+            return RewardType(
+                total=-0.02,
+                components={"invalid_context_probe": -0.02},
+                passed=False,
+                explanation="GET_CONTEXT requires a line_number.",
+                step=self._step_count,
+                terminal=False,
+            )
+        # Build snippet
+        start = max(0, line_number - 6)
+        end = min(len(code_lines), line_number + 5)
+        snippet_lines = [
+            f"{i + 1:3}: {code_lines[i]}" for i in range(start, end)
+        ]
+        snippet = "\n".join(snippet_lines)
+        # Check if probed line is near a real issue (within LINE_TOLERANCE).
+        near_issue = any(
+            (iss["line_range"][0] - LINE_TOLERANCE) <= line_number <= (iss["line_range"][1] + LINE_TOLERANCE)
+            for iss in task["issues"]
+        )
+        penalty = 0.0 if near_issue else -0.01
+        # Store the context result in review history so the agent can see it
+        self._ep["review_comments"].append({
+            "type": "context_probe",
+            "line": line_number,
+            "context": snippet,
+        })
+        return RewardType(
+            total=penalty,
+            components={"context_probe_penalty": penalty},
+            passed=near_issue,
+            explanation=(
+                f"Context around line {line_number}:\n{snippet}"
+            ),
+            step=self._step_count,
+            terminal=False,
+        )
+    def _handle_request_changes(self, action: ProbeAction) -> RewardType:
         self._ep["review_decision"] = "request_changes"
         self._ep["review_comments"].append(
             {"type": "request_changes", "text": action.comment}
     # ── Observation builder ───────────────────────────────────────────────
+    def _make_obs(self, reward: float, done: bool) -> ProbeObservation:
         task = self._ep["task"]
+        return ProbeObservation(
             code_snippet=task["code"],
             task_description=task["description"],
             file_name=task["file_name"],
             total_issues=len(task["issues"]),
             done=done,
             reward=round(max(-1.0, min(1.0, reward)), 4),
+            context_hints=list(self._ep.get("context_hints", [])),
             metadata={
                 "cumulative_reward": self._ep.get("cumulative_reward", 0.0),
                 "review_decision": self._ep.get("review_decision"),
                 "episode_id": self._episode_id,
+                "mutation_seed": self._ep["task"].get("_mutation_seed"),
             },
         )
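
Two pieces of the new-side logic above, the GET_CONTEXT window and the causal unlock, can be sketched in isolation. This is a standalone approximation for reading along, not the environment's code: the function names, the sample issue id and hint key are hypothetical, and `LINE_TOLERANCE = 3` is the value shown on the removed side of `server/grader.py` (the new value is not visible in this diff).

```python
LINE_TOLERANCE = 3  # assumed; taken from the removed side of grader.py


def context_window(code: str, line_number: int, radius: int = 5) -> str:
    """Return the lines within +/-radius of a 1-indexed line, numbered."""
    lines = code.split("\n")
    start = max(0, line_number - 1 - radius)      # 0-indexed, inclusive
    end = min(len(lines), line_number + radius)   # 0-indexed, exclusive
    return "\n".join(f"{i + 1:3}: {lines[i]}" for i in range(start, end))


def probe_penalty(line_number: int, issues: list[dict]) -> float:
    """0.0 if the probe lands near any issue's line range, else -0.01."""
    near = any(
        iss["line_range"][0] - LINE_TOLERANCE
        <= line_number
        <= iss["line_range"][1] + LINE_TOLERANCE
        for iss in issues
    )
    return 0.0 if near else -0.01


def unlock_hints(
    issues: list[dict],
    hint_map: dict[str, str],
    newly_found: list[str],
    unlocked: set[str],
    hints: list[str],
) -> None:
    """Append each hint at most once, when its gating issue is found."""
    for issue in issues:
        key = issue.get("unlocks")
        if (
            key
            and issue["id"] in newly_found
            and key not in unlocked
            and key in hint_map
        ):
            unlocked.add(key)
            hints.append(hint_map[key])


# Hypothetical data: a 20-line file and one issue that gates one hint.
code = "\n".join(f"line{i}" for i in range(1, 21))
issues = [{"id": "hardcoded_secret", "line_range": (8, 9), "unlocks": "db_schema"}]
hint_map = {"db_schema": "users table stores tokens in plain text"}

unlocked: set[str] = set()
hints: list[str] = []
unlock_hints(issues, hint_map, ["hardcoded_secret"], unlocked, hints)
unlock_hints(issues, hint_map, ["hardcoded_secret"], unlocked, hints)  # no-op
```

Tracking fired keys in a separate set, as the diff does with `hints_unlocked`, is what keeps a repeated find from appending the same hint twice.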

server/__init__.py CHANGED

@@ -4,8 +4,8 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
-"""Codereviewagent environment server components."""
-from .CodeReviewAgent_environment import CodereviewagentEnvironment
-__all__ = ["CodereviewagentEnvironment"]

 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
+"""PRobe environment server components."""
+from .CodeReviewAgent_environment import ProbeEnvironment
+__all__ = ["ProbeEnvironment"]

server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc CHANGED

Binary files a/server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc and b/server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc differ

server/__pycache__/__init__.cpython-314.pyc CHANGED

Binary files a/server/__pycache__/__init__.cpython-314.pyc and b/server/__pycache__/__init__.cpython-314.pyc differ

server/__pycache__/grader.cpython-314.pyc CHANGED

Binary files a/server/__pycache__/grader.cpython-314.pyc and b/server/__pycache__/grader.cpython-314.pyc differ

server/__pycache__/mutator.cpython-314.pyc ADDED

Binary file (5.86 kB).

server/__pycache__/tasks.cpython-314.pyc CHANGED

Binary files a/server/__pycache__/tasks.cpython-314.pyc and b/server/__pycache__/tasks.cpython-314.pyc differ

server/app.py CHANGED

@@ -1,5 +1,5 @@
 """
-Async FastAPI server for the CodeReviewAgent environment.
 Endpoints:
   POST /reset              — start a new episode (HTTP session)
@@ -20,9 +20,11 @@ falls back to a minimal HTML redirect page.
 from __future__ import annotations
 import json
 from contextlib import asynccontextmanager
 from typing import Any
 from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
 from fastapi.responses import HTMLResponse
@@ -33,22 +35,24 @@ except Exception:  # pragma: no cover
     _OPENENV_AVAILABLE = False
 try:
-    from ..models import CodereviewagentAction, CodereviewagentObservation, RewardType
-    from .CodeReviewAgent_environment import CodereviewagentEnvironment
 except ModuleNotFoundError:
-    from models import CodereviewagentAction, CodereviewagentObservation, RewardType  # type: ignore
-    from server.CodeReviewAgent_environment import CodereviewagentEnvironment  # type: ignore
 # ── Shared HTTP session env ───────────────────────────────────────────────────
-_http_env: CodereviewagentEnvironment | None = None
 @asynccontextmanager
 async def lifespan(application: FastAPI):
     global _http_env
-    _http_env = CodereviewagentEnvironment()
     yield
     _http_env = None
@@ -58,7 +62,7 @@ async def lifespan(application: FastAPI):
 class StepResponse:
     def __init__(
         self,
-        obs: CodereviewagentObservation,
         reward: RewardType,
         done: bool,
         info: dict[str, Any],
@@ -81,7 +85,7 @@ class StepResponse:
 def _build_app() -> FastAPI:
     application = FastAPI(
-        title="CodeReviewAgent",
         description="OpenEnv code-review environment — async FastAPI server.",
         version="2.0.0",
         lifespan=lifespan,
@@ -91,19 +95,22 @@ def _build_app() -> FastAPI:
     @application.post("/reset", summary="Start a new episode")
     async def reset_endpoint() -> dict[str, Any]:
-        assert _http_env is not None
         obs = await _http_env.async_reset()
         return {"observation": obs.model_dump(), "reward": None, "done": False, "info": {}}
     @application.post("/step", summary="Execute one action")
-    async def step_endpoint(action: CodereviewagentAction) -> dict[str, Any]:
-        assert _http_env is not None
         obs, reward, done, info = await _http_env.async_step(action)
         return StepResponse(obs, reward, done, info).to_dict()
     @application.get("/state", summary="Current episode state snapshot")
     async def state_endpoint() -> dict[str, Any]:
-        assert _http_env is not None
         return await _http_env.async_state()
     @application.get("/health", summary="Liveness probe")
@@ -113,8 +120,8 @@ def _build_app() -> FastAPI:
     @application.get("/schema", summary="Action and observation JSON schemas")
     async def schema() -> dict[str, Any]:
         return {
-            "action": CodereviewagentAction.model_json_schema(),
-            "observation": CodereviewagentObservation.model_json_schema(),
             "reward": RewardType.model_json_schema(),
         }
@@ -123,7 +130,7 @@ def _build_app() -> FastAPI:
     @application.websocket("/ws")
     async def ws_endpoint(websocket: WebSocket) -> None:
         await websocket.accept()
-        env = CodereviewagentEnvironment()
         try:
             while True:
                 raw = await websocket.receive_text()
@@ -138,7 +145,7 @@ def _build_app() -> FastAPI:
                 elif cmd == "step":
                     try:
-                        action = CodereviewagentAction(**msg["action"])
                     except Exception as exc:
                         await websocket.send_json({"type": "error", "detail": str(exc)})
                         continue
@@ -170,9 +177,9 @@ def _build_app() -> FastAPI:
     @application.get("/web", response_class=HTMLResponse, include_in_schema=False)
     async def web_ui() -> str:
         return """
-        <!doctype html><html><head><title>CodeReviewAgent</title></head>
-        <body>
-        <h2>CodeReviewAgent Environment</h2>
         <p>API docs: <a href="/docs">/docs</a></p>
         <p>Health: <a href="/health">/health</a></p>
         <p>Schema: <a href="/schema">/schema</a></p>
@@ -185,8 +192,7 @@ def _build_app() -> FastAPI:
 app = _build_app()
-def main(host: str = "0.0.0.0", port: int = 8000) -> None:
-    import uvicorn
     uvicorn.run(app, host=host, port=port)

 """
+Async FastAPI server for the PRobe environment.
 Endpoints:
   POST /reset              — start a new episode (HTTP session)
 from __future__ import annotations
 import json
+import logging
 from contextlib import asynccontextmanager
 from typing import Any
+import uvicorn
 from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
 from fastapi.responses import HTMLResponse
     _OPENENV_AVAILABLE = False
 try:
+    from ..models import ProbeAction, ProbeObservation, RewardType
+    from .CodeReviewAgent_environment import ProbeEnvironment
 except ModuleNotFoundError:
+    from models import ProbeAction, ProbeObservation, RewardType  # type: ignore
+    from server.CodeReviewAgent_environment import ProbeEnvironment  # type: ignore
+log = logging.getLogger(__name__)
 # ── Shared HTTP session env ───────────────────────────────────────────────────
+_http_env: ProbeEnvironment | None = None
 @asynccontextmanager
 async def lifespan(application: FastAPI):
     global _http_env
+    _http_env = ProbeEnvironment()
     yield
     _http_env = None
 class StepResponse:
     def __init__(
         self,
+        obs: ProbeObservation,
         reward: RewardType,
         done: bool,
         info: dict[str, Any],
 def _build_app() -> FastAPI:
     application = FastAPI(
+        title="PRobe",
         description="OpenEnv code-review environment — async FastAPI server.",
         version="2.0.0",
         lifespan=lifespan,
     @application.post("/reset", summary="Start a new episode")
     async def reset_endpoint() -> dict[str, Any]:
+        if _http_env is None:
+            raise HTTPException(status_code=503, detail="Environment not initialised")
         obs = await _http_env.async_reset()
         return {"observation": obs.model_dump(), "reward": None, "done": False, "info": {}}
     @application.post("/step", summary="Execute one action")
+    async def step_endpoint(action: ProbeAction) -> dict[str, Any]:
+        if _http_env is None:
+            raise HTTPException(status_code=503, detail="Environment not initialised")
         obs, reward, done, info = await _http_env.async_step(action)
         return StepResponse(obs, reward, done, info).to_dict()
     @application.get("/state", summary="Current episode state snapshot")
     async def state_endpoint() -> dict[str, Any]:
+        if _http_env is None:
+            raise HTTPException(status_code=503, detail="Environment not initialised")
         return await _http_env.async_state()
     @application.get("/health", summary="Liveness probe")
     @application.get("/schema", summary="Action and observation JSON schemas")
     async def schema() -> dict[str, Any]:
         return {
+            "action": ProbeAction.model_json_schema(),
+            "observation": ProbeObservation.model_json_schema(),
             "reward": RewardType.model_json_schema(),
         }
     @application.websocket("/ws")
     async def ws_endpoint(websocket: WebSocket) -> None:
         await websocket.accept()
+        env = ProbeEnvironment()
         try:
             while True:
                 raw = await websocket.receive_text()
                 elif cmd == "step":
                     try:
+                        action = ProbeAction(**msg["action"])
                     except Exception as exc:
                         await websocket.send_json({"type": "error", "detail": str(exc)})
                         continue
     @application.get("/web", response_class=HTMLResponse, include_in_schema=False)
     async def web_ui() -> str:
         return """
+        <!doctype html><html><head><title>PRobe</title></head>
+        <body style="font-family:sans-serif;padding:2rem">
+        <h2>PRobe Environment</h2>
         <p>API docs: <a href="/docs">/docs</a></p>
         <p>Health: <a href="/health">/health</a></p>
         <p>Schema: <a href="/schema">/schema</a></p>
 app = _build_app()
+def main(host: str = "0.0.0.0", port: int = 8000) -> None:  # noqa: S104
     uvicorn.run(app, host=host, port=port)

server/grader.py CHANGED

@@ -1,27 +1,29 @@
 """
-Deterministic grader for CodeReviewAgent tasks.
 Scoring design
 --------------
 During the episode (ADD_COMMENT actions):
-  +weight/total_weight * 0.60   per newly found issue (max 0.60 cumulative)
-  -0.02                         per false-positive (substantive comment, no match)
-Final (SUBMIT_REVIEW):
-  +coverage * 0.20              weighted coverage bonus   (max  0.20)
-  +/-0.10                       correct / incorrect final decision
-  +efficiency * 0.10            step-efficiency bonus when coverage >= 60%
-Maximum achievable total: ~1.0   Minimum: −1.0
-Anti-exploit rule (enforced since v2):
-  A comment MUST satisfy BOTH:
-    1. keyword_hit  — at least one issue keyword appears in the comment text
-    2. line_hit     — comment line_number is within ±LINE_TOLERANCE of the issue
-  `category` match is NOT sufficient on its own.  This closes the keyword-spam
-  exploit where a model dumps all known keywords on a single line.
 """
 from typing import Any
 try:
@@ -29,15 +31,26 @@ try:
 except ImportError:
     from models import RewardType  # type: ignore[no-redef]
-LINE_TOLERANCE: int = 3  # lines either side of an issue's declared range
 class CodeReviewGrader:
     def __init__(self, task: dict[str, Any]) -> None:
         self.task = task
         self.total_weight: float = sum(iss["weight"] for iss in task["issues"])
-    # ── Per-comment scoring ───────────────────────────────────────────────
     def score_comment(
         self,
@@ -51,16 +64,19 @@ class CodeReviewGrader:
         Returns:
             (reward_delta, newly_found_issue_ids, component_breakdown)
-        Match condition (BOTH required — no shortcut):
-            keyword_hit  AND  line_hit
         """
         if not comment:
             return 0.0, [], {}
         comment_lower = comment.lower()
         newly_found: list[str] = []
         issue_credit: float = 0.0
-        false_positive_penalty: float = 0.0
         for issue in self.task["issues"]:
             if issue["id"] in already_found:
@@ -69,15 +85,15 @@ class CodeReviewGrader:
             keyword_hit = any(kw.lower() in comment_lower for kw in issue["keywords"])
             line_hit = self._line_in_range(line_number, issue["line_range"])
-            # BOTH conditions required — no cat_hit shortcut
-            if keyword_hit and line_hit:
-                credit = (issue["weight"] / self.total_weight) * 0.60
                 newly_found.append(issue["id"])
                 issue_credit += credit
-        # Penalise substantive comments that matched nothing
-        if not newly_found and comment and len(comment.strip()) > 15:
-            false_positive_penalty = -0.02
         total = round(issue_credit + false_positive_penalty, 4)
         breakdown = {
@@ -86,7 +102,7 @@ class CodeReviewGrader:
         }
         return total, newly_found, breakdown
-    # ── Terminal scoring ──────────────────────────────────────────────────
     def final_score(
         self,
@@ -98,9 +114,13 @@ class CodeReviewGrader:
     ) -> RewardType:
         """
         Compute the terminal reward on SUBMIT_REVIEW.
-        Returns a fully typed RewardType with component breakdown.
         """
-        unique_found = list(set(issues_found))
         found_weight = sum(
             iss["weight"]
             for iss in self.task["issues"]
@@ -108,12 +128,18 @@ class CodeReviewGrader:
         )
         coverage = found_weight / self.total_weight if self.total_weight > 0 else 0.0
-        correct_decision = self.task.get("correct_decision", "request_changes")
-        decision_score = 0.10 if review_decision == correct_decision else -0.10
         efficiency = max(0.0, 1.0 - step_count / max_steps)
-        efficiency_bonus = round(0.10 * efficiency, 4) if coverage >= 0.60 else 0.0
-        coverage_bonus = round(coverage * 0.20, 4)
         raw_total = coverage_bonus + decision_score + efficiency_bonus
         clamped = round(max(-1.0, min(1.0, raw_total)), 4)
@@ -123,23 +149,24 @@ class CodeReviewGrader:
             "decision_score": round(decision_score, 4),
             "efficiency_bonus": efficiency_bonus,
         }
         explanation = (
-            f"Found {len(unique_found)}/{len(self.task['issues'])} issues "
             f"(weighted coverage {coverage:.0%}). "
-            f"Decision '{review_decision}' was "
             f"{'correct' if review_decision == correct_decision else 'incorrect'}. "
             f"Used {step_count}/{max_steps} steps."
         )
         return RewardType(
             total=clamped,
             components=components,
-            passed=review_decision == correct_decision and coverage >= 0.60,
             explanation=explanation,
             step=current_step,
             terminal=True,
         )
-    # ── Helper ────────────────────────────────────────────────────────────
     @staticmethod
     def _line_in_range(

 """
+Deterministic reward grader for PRobe tasks.
 Scoring design
 --------------
 During the episode (ADD_COMMENT actions):
+  + weight/total_weight * ISSUE_REWARD_POOL   per newly found issue
+  + FALSE_POSITIVE_PENALTY (a negative value)  per substantive unmatched comment
+Terminal (SUBMIT_REVIEW):
+  + coverage * COVERAGE_POOL      weighted coverage bonus  (max COVERAGE_POOL)
+  +/- DECISION_REWARD             correct / incorrect final decision
+  + efficiency * EFFICIENCY_POOL  step-efficiency bonus when coverage >= COVERAGE_THRESHOLD
+Maximum achievable total: ~1.0   Minimum: -1.0
+Anti-exploit rules (v3):
+  A comment MUST satisfy ALL of:
+    1. keyword_hit  -- at least one issue keyword appears in the comment text
+    2. line_hit     -- comment line_number is within +/-LINE_TOLERANCE of the issue's line_range
+    3. substantive  -- comment is longer than MIN_COMMENT_LENGTH characters
+  This prevents keyword-spam, wide-net line fishing, and trivial one-word matches.
 """
+from __future__ import annotations
 from typing import Any
 try:
 except ImportError:
     from models import RewardType  # type: ignore[no-redef]
+# -- Grading hyper-parameters ------------------------------------------------
+LINE_TOLERANCE: int = 2         # lines either side of an issue's declared range
+MIN_COMMENT_LENGTH: int = 15    # chars -- comments of this length or shorter earn no credit
+ISSUE_REWARD_POOL: float = 0.60     # max cumulative credit from ADD_COMMENT
+COVERAGE_POOL: float = 0.20         # terminal coverage bonus ceiling
+DECISION_REWARD: float = 0.10       # +/- for correct/incorrect final decision
+EFFICIENCY_POOL: float = 0.10       # max terminal efficiency bonus
+COVERAGE_THRESHOLD: float = 0.60    # min coverage to unlock efficiency bonus
+FALSE_POSITIVE_PENALTY: float = -0.05  # per substantive unmatched comment
 class CodeReviewGrader:
+    """Scores agent actions against a task's ground-truth issue list."""
     def __init__(self, task: dict[str, Any]) -> None:
         self.task = task
         self.total_weight: float = sum(iss["weight"] for iss in task["issues"])
+    # -- Per-comment scoring -------------------------------------------------
     def score_comment(
         self,
         Returns:
             (reward_delta, newly_found_issue_ids, component_breakdown)
+        Match condition (ALL required -- no shortcut)::
+            keyword_hit AND line_hit AND substantive
         """
         if not comment:
             return 0.0, [], {}
         comment_lower = comment.lower()
+        # Compute once -- used for both the credit path and the penalty path.
+        substantive: bool = len(comment.strip()) > MIN_COMMENT_LENGTH
         newly_found: list[str] = []
         issue_credit: float = 0.0
         for issue in self.task["issues"]:
             if issue["id"] in already_found:
             keyword_hit = any(kw.lower() in comment_lower for kw in issue["keywords"])
             line_hit = self._line_in_range(line_number, issue["line_range"])
+            if keyword_hit and line_hit and substantive:
+                credit = (issue["weight"] / self.total_weight) * ISSUE_REWARD_POOL
                 newly_found.append(issue["id"])
                 issue_credit += credit
+        # Penalise substantive comments that matched nothing.
+        false_positive_penalty: float = (
+            FALSE_POSITIVE_PENALTY if (not newly_found and substantive) else 0.0
+        )
         total = round(issue_credit + false_positive_penalty, 4)
         breakdown = {
         }
         return total, newly_found, breakdown
+    # -- Terminal scoring ----------------------------------------------------
     def final_score(
         self,
     ) -> RewardType:
         """
         Compute the terminal reward on SUBMIT_REVIEW.
+        Returns a fully-typed RewardType with a per-component breakdown.
+        De-duplicates issues_found with stable ordering so results are
+        deterministic regardless of insertion order.
         """
+        # sorted() gives stable ordering so results are reproducible.
+        unique_found: list[str] = sorted(set(issues_found))
         found_weight = sum(
             iss["weight"]
             for iss in self.task["issues"]
         )
         coverage = found_weight / self.total_weight if self.total_weight > 0 else 0.0
+        correct_decision: str = self.task.get("correct_decision", "request_changes")
+        decision_score = (
+            DECISION_REWARD if review_decision == correct_decision else -DECISION_REWARD
+        )
         efficiency = max(0.0, 1.0 - step_count / max_steps)
+        efficiency_bonus = (
+            round(EFFICIENCY_POOL * efficiency, 4)
+            if coverage >= COVERAGE_THRESHOLD
+            else 0.0
+        )
+        coverage_bonus = round(coverage * COVERAGE_POOL, 4)
         raw_total = coverage_bonus + decision_score + efficiency_bonus
         clamped = round(max(-1.0, min(1.0, raw_total)), 4)
             "decision_score": round(decision_score, 4),
             "efficiency_bonus": efficiency_bonus,
         }
+        total_issues = len(self.task["issues"])
         explanation = (
+            f"Found {len(unique_found)}/{total_issues} issues "
             f"(weighted coverage {coverage:.0%}). "
+            f"Decision {review_decision!r} was "
             f"{'correct' if review_decision == correct_decision else 'incorrect'}. "
             f"Used {step_count}/{max_steps} steps."
         )
         return RewardType(
             total=clamped,
             components=components,
+            passed=review_decision == correct_decision and coverage >= COVERAGE_THRESHOLD,
             explanation=explanation,
             step=current_step,
             terminal=True,
         )
+    # -- Helper --------------------------------------------------------------
     @staticmethod
     def _line_in_range(
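
As a worked example of the scoring design in the module docstring, the sketch below recomputes the per-comment credit and the terminal reward from the constants in this diff. The issue weights, the step counts, and the helper names `comment_credit` and `terminal_total` are hypothetical, chosen only for illustration; they are not part of server/grader.py.

```python
# Constants copied from the new side of server/grader.py.
ISSUE_REWARD_POOL = 0.60
COVERAGE_POOL = 0.20
DECISION_REWARD = 0.10
EFFICIENCY_POOL = 0.10
COVERAGE_THRESHOLD = 0.60


def comment_credit(weight: float, total_weight: float) -> float:
    """Credit for one newly found issue (hypothetical helper)."""
    return (weight / total_weight) * ISSUE_REWARD_POOL


def terminal_total(coverage: float, correct: bool, step_count: int, max_steps: int) -> float:
    """Terminal reward, mirroring the arithmetic in final_score (hypothetical helper)."""
    coverage_bonus = round(coverage * COVERAGE_POOL, 4)
    decision_score = DECISION_REWARD if correct else -DECISION_REWARD
    efficiency = max(0.0, 1.0 - step_count / max_steps)
    # The efficiency bonus is only unlocked once coverage reaches the threshold.
    efficiency_bonus = (
        round(EFFICIENCY_POOL * efficiency, 4) if coverage >= COVERAGE_THRESHOLD else 0.0
    )
    raw = coverage_bonus + decision_score + efficiency_bonus
    return round(max(-1.0, min(1.0, raw)), 4)


# Hypothetical task: weights 1.0, 1.0, 0.5 (total 2.5); the agent finds the first two.
weights = [1.0, 1.0, 0.5]
total_weight = sum(weights)
episode_credit = sum(comment_credit(w, total_weight) for w in weights[:2])
coverage = sum(weights[:2]) / total_weight
final = terminal_total(coverage, correct=True, step_count=10, max_steps=20)
```

With these made-up numbers each matching comment earns 0.24, coverage is 0.8, and the terminal reward is 0.16 + 0.10 + 0.05 = 0.31.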

server/mutator.py ADDED

	@@ -0,0 +1,123 @@

+"""
+Code Mutation Engine -- makes the world dynamic.
+Each call to ``mutate_task()`` returns a deep copy of a task with:
+  1. Variable renaming  -- one identifier swapped for a synonym so the agent
+                           cannot memorise exact token strings between episodes.
+  2. Line shifting      -- an inert blank line inserted above the first issue,
+                           shifting all issue line_ranges down by 1.  The agent
+                           must *read* the code each episode.
+  3. Constant variance  -- one numeric literal (e.g. a range limit or sleep
+                           duration) is nudged by +/-1, never below 2, so the
+                           agent sees a fresh surface without changing the
+                           underlying bug.
+Mutation is fully deterministic given a seed, so training runs are
+reproducible while still being different across episodes.
+Design principle
+----------------
+Mutations must NEVER change *whether* a bug exists or *which line category*
+it falls in.  They only change surface tokens and line positions so the agent
+cannot exploit memorisation.
+"""
+from __future__ import annotations
+import copy
+import random
+import re
+from typing import Any
+# -- Variable synonym table --------------------------------------------------
+# Maps original identifiers -> list of drop-in synonyms.
+# Only single-token renames that do not affect semantics are listed.
+_SYNONYMS: dict[str, list[str]] = {
+    "total":        ["acc", "running_total", "summed"],
+    "numbers":      ["values", "nums", "items"],
+    "result":       ["output", "response", "ret"],
+    "data":         ["payload", "records", "entries"],
+    "item":         ["record", "entry", "obj"],
+    "items":        ["records", "entries", "objects"],
+    "user":         ["account", "principal", "member"],
+    "users":        ["accounts", "principals", "members"],
+    "password":     ["passwd", "secret", "credential"],
+    "username":     ["user_name", "login", "uname"],
+    "command":      ["cmd", "instruction", "directive"],
+    "filename":     ["file_name", "fname", "path_name"],
+    "url":          ["endpoint", "uri", "address"],
+    "attempt":      ["try_num", "iteration", "retry_idx"],
+    "counter":      ["count", "tally", "n"],
+    "session":      ["conn", "http_session", "client"],
+    "results":      ["findings", "collected", "gathered"],
+    "cache":        ["store", "lookup", "memo"],
+    "transformed":  ["processed", "mapped", "converted"],
+}
+def mutate_task(base_task: dict[str, Any], seed: int) -> dict[str, Any]:
+    """
+    Return a mutated deep-copy of *base_task* using *seed* for reproducibility.
+    The returned task is structurally identical to the original -- same keys,
+    same issue ids, same categories -- but with surface-level code changes and
+    adjusted line_ranges.
+    """
+    rng = random.Random(seed)
+    task: dict[str, Any] = copy.deepcopy(base_task)
+    code: str = task["code"]
+    issues: list[dict[str, Any]] = task["issues"]
+    # -- 1. Variable rename --------------------------------------------------
+    candidates = [orig for orig in _SYNONYMS if re.search(rf"\b{orig}\b", code)]
+    if candidates:
+        original = rng.choice(candidates)
+        replacement = rng.choice(_SYNONYMS[original])
+        # Whole-word replace to avoid partial matches.
+        code = re.sub(rf"\b{original}\b", replacement, code)
+        # Keep the keyword list in sync so the grader still matches.
+        for issue in issues:
+            issue["keywords"] = [
+                replacement if kw == original else kw
+                for kw in issue["keywords"]
+            ]
+    # -- 2. Line shift -- insert one blank line before the first issue --------
+    if issues:
+        first_line = min(iss["line_range"][0] for iss in issues)
+        # first_line is 1-based, so list index first_line - 1 is the issue line;
+        # inserting at first_line - 2 puts the blank line one line further up.
+        insert_before = max(0, first_line - 2)
+        lines = code.split("\n")
+        lines.insert(insert_before, "")
+        code = "\n".join(lines)
+        # Shift every issue line_range down by 1 to match the new positions.
+        for issue in issues:
+            start, end = issue["line_range"]
+            issue["line_range"] = (start + 1, end + 1)
+    # -- 3. Constant variance -- nudge one numeric literal -------------------
+    # Exclude numbers that sit after a '#' on their own line (i.e. inside a
+    # comment), to avoid corrupting annotated line references.
+    numeric_matches = [
+        m
+        for m in re.finditer(r"\b([2-9]|[1-9]\d+)\b", code)
+        if "#" not in code[code.rfind("\n", 0, m.start()) + 1 : m.start()]
+    ]
+    if numeric_matches:
+        chosen = rng.choice(numeric_matches)
+        original_val = int(chosen.group())
+        delta = rng.choice([-1, 1])
+        new_val = max(2, original_val + delta)  # never go below 2
+        code = code[: chosen.start()] + str(new_val) + code[chosen.end() :]
+    task["code"] = code
+    task["issues"] = issues
+    # Tag the task so the environment can record mutation metadata.
+    task["_mutation_seed"] = seed
+    return task
+__all__ = ["mutate_task"]
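
The whole-word rename in step 1 of `mutate_task` can be tried in isolation. The sketch below is an illustration under stated assumptions: `rename_one` and the one-entry `SYNONYMS` table are hypothetical stand-ins, not the code in server/mutator.py. It shows that `\b` leaves longer identifiers such as `subtotal` and `total_weight` untouched, and that a fixed seed yields the same output on every call.

```python
from __future__ import annotations

import random
import re

# Hypothetical one-entry table, shaped like _SYNONYMS in the diff.
SYNONYMS: dict[str, list[str]] = {"total": ["acc", "summed"]}


def rename_one(code: str, seed: int) -> str:
    """Rename one whole-word identifier, chosen with a seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)
    candidates = [name for name in SYNONYMS if re.search(rf"\b{re.escape(name)}\b", code)]
    if not candidates:
        return code
    original = rng.choice(candidates)
    replacement = rng.choice(SYNONYMS[original])
    # \b treats '_' and letters as word characters, so 'subtotal' and
    # 'total_weight' do not match the pattern for 'total'.
    return re.sub(rf"\b{re.escape(original)}\b", replacement, code)


sample = "total = 0\nsubtotal = total + total_weight\n"
first = rename_one(sample, seed=3)
second = rename_one(sample, seed=3)  # same seed, same result
```

Because the RNG is constructed from the seed inside the function, repeated calls with the same seed cannot drift, which is the property `test_same_seed_is_deterministic` relies on.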

server/tasks.py CHANGED

@@ -716,4 +716,228 @@ def admin_panel():
         ],
         "correct_decision": "request_changes",
     },
 ]

         ],
         "correct_decision": "request_changes",
     },
+    # ── Task 6: Causal Chain — Secrets Leak Investigation ────────────────────
+    #
+    # WORLD-MODELING DESIGN
+    # ─────────────────────
+    # This task implements a *causal observation chain*:
+    #
+    #   Phase 1 (lines visible from the start)
+    #     The agent sees a Flask service with two obvious surface issues.
+    #     Finding issue A (hardcoded JWT secret) *unlocks* Phase 2 context.
+    #
+    #   Phase 2 (revealed after issue A is found)
+    #     A hidden DB schema snippet is appended to the observation, exposing
+    #     a privilege-escalation path that only makes sense once the secret
+    #     leak is understood.  This rewards genuine causal reasoning:
+    #       "the leaked secret lets an attacker forge admin tokens → they can
+    #        reach the unguarded /admin/promote endpoint → full privilege
+    #        escalation."
+    #
+    #   Phase 3 (revealed after issue B is found)
+    #     After the agent flags the missing rate-limit, the server's nginx
+    #     config fragment is revealed, showing that /auth is also missing
+    #     the global IP-allowlist — confirming the attack surface is wider
+    #     than the code alone suggests.
+    #
+    # The optional `"unlocks"` field in an issue entry names the key in
+    # `context_hints` whose text the environment reveals when that issue is found.
+    # The environment layer reads it and appends the hint to the observation's
+    # `context_hints` list.
+    {
+        "id": 6,
+        "name": "Causal Secrets Leak Investigation",
+        "difficulty": "hard",
+        "file_name": "auth_service.py",
+        "description": (
+            "Review this authentication service carefully. "
+            "Some issues unlock additional context about the wider system — "
+            "read every new hint you receive before continuing. "
+            "Use get_context on any suspicious line to reveal surrounding detail. "
+            "Identify all issues, then submit your review."
+        ),
+        "max_steps": 35,
+        "code": """\
+import jwt
+import sqlite3
+import time
+from flask import Flask, request, jsonify
+app = Flask(__name__)
+# ---- configuration ----------------------------------------------------------
+JWT_SECRET = "super-secret-jwt-key-do-not-share"   # line 9: hardcoded secret
+JWT_ALGORITHM = "HS256"
+# ---- helpers ----------------------------------------------------------------
+def create_token(user_id: int, role: str) -> str:
+    payload = {
+        "sub": user_id,
+        "role": role,
+        "exp": time.time() + 3600,
+    }
+    return jwt.encode(payload, JWT_SECRET, algorithm=JWT_ALGORITHM)
+def verify_token(token: str) -> dict:
+    # line 23: algorithm not pinned — accepts ["none"] attack if lib < 2.0
+    return jwt.decode(token, JWT_SECRET, algorithms=["HS256", "none"])
+# ---- routes -----------------------------------------------------------------
+@app.route("/auth", methods=["POST"])
+def authenticate():
+    \"\"\"Issue a JWT for valid credentials.\"\"\"
+    body  = request.get_json(force=True)
+    uname = body.get("username", "")
+    pwd   = body.get("password", "")
+    # line 33: no rate limiting → brute-force possible
+    conn   = sqlite3.connect("users.db")
+    cursor = conn.cursor()
+    # line 37: f-string SQL → injection
+    cursor.execute(f"SELECT id, role FROM users WHERE username='{uname}' AND password='{pwd}'")
+    row = cursor.fetchone()
+    conn.close()
+    if row:
+        return jsonify({"token": create_token(row[0], row[1])})
+    return jsonify({"error": "invalid credentials"}), 401
+@app.route("/admin/promote", methods=["POST"])
+def promote_user():
+    \"\"\"Promote a user to admin — JWT required.\"\"\"
+    token = request.headers.get("Authorization", "").replace("Bearer ", "")
+    try:
+        claims = verify_token(token)
+    except Exception:
+        return jsonify({"error": "unauthorized"}), 401
+    # line 51: role taken directly from token — no DB re-validation
+    if claims.get("role") == "admin":
+        target = request.json.get("user_id")
+        conn = sqlite3.connect("users.db")
+        conn.execute(f"UPDATE users SET role='admin' WHERE id={target}")   # line 55: injection
+        conn.commit()
+        conn.close()
+        return jsonify({"promoted": target})
+    return jsonify({"error": "forbidden"}), 403
+""",
+        # ── Ground-truth issues ───────────────────────────────────────────
+        "issues": [
+            {
+                "id": "hardcoded_jwt_secret",
+                "description": "JWT_SECRET is hard-coded; anyone with source access can forge tokens",
+                "line_range": (9, 9),
+                "keywords": [
+                    "hardcoded", "hard-coded", "jwt_secret", "secret", "jwt",
+                    "environment variable", "env var", "os.environ", "forge",
+                    "hardcode", "token secret",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+                # Finding this issue unlocks the DB schema context hint
+                "unlocks": "db_schema_hint",
+            },
+            {
+                "id": "jwt_none_algorithm",
+                "description": (
+                    "jwt.decode accepts 'none' algorithm — attacker can craft an "
+                    "unsigned token and bypass signature verification"
+                ),
+                "line_range": (23, 24),
+                "keywords": [
+                    "none", "algorithm", "alg", "unsigned", "bypass",
+                    "jwt", "signature", "verify", "none algorithm",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "no_rate_limit",
+                "description": "/auth endpoint has no rate limiting — susceptible to brute-force",
+                "line_range": (33, 34),
+                "keywords": [
+                    "rate limit", "rate-limit", "brute force", "brute-force",
+                    "throttle", "throttling", "flood", "limit", "attempts",
+                ],
+                "category": "security",
+                "severity": "error",
+                "weight": 0.75,
+                # Finding this unlocks the nginx config hint
+                "unlocks": "nginx_config_hint",
+            },
+            {
+                "id": "sql_injection_auth",
+                "description": "f-string interpolation in SQL query on /auth → injection",
+                "line_range": (37, 38),
+                "keywords": [
+                    "sql injection", "sql", "injection", "f-string", "parameterized",
+                    "sanitize", "escape", "prepared statement", "placeholder",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "role_from_token_only",
+                "description": (
+                    "Role is read directly from the JWT payload without re-checking the DB — "
+                    "a forged or stale token grants permanent privilege"
+                ),
+                "line_range": (51, 52),
+                "keywords": [
+                    "role", "token", "db", "database", "re-check", "revalidat",
+                    "stale", "privilege", "escalation", "claims", "payload",
+                    "not verified", "trust",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "sql_injection_promote",
+                "description": "f-string SQL in /admin/promote UPDATE query → injection via user_id",
+                "line_range": (55, 55),
+                "keywords": [
+                    "sql injection", "sql", "injection", "f-string", "parameterized",
+                    "prepared statement", "placeholder", "update", "second order",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+        ],
+        "correct_decision": "request_changes",
+        # ── Causal context hints — revealed progressively ─────────────────
+        # Each value is injected into the observation once the triggering
+        # issue is found.  The agent must incorporate this new information
+        # into its ongoing world model.
+        "context_hints": {
+            "db_schema_hint": (
+                "=== UNLOCKED: Database Schema (users.db) ===\n"
+                "  CREATE TABLE users (\n"
+                "    id       INTEGER PRIMARY KEY,\n"
+                "    username TEXT UNIQUE NOT NULL,\n"
+                "    password TEXT NOT NULL,         -- stored as plaintext!\n"
+                "    role     TEXT DEFAULT 'viewer'  -- 'viewer' | 'editor' | 'admin'\n"
+                "  );\n"
+                "NOTE: The /admin/promote endpoint can elevate any user to 'admin'. "
+                "Combined with a forged JWT (from the leaked secret), an attacker "
+                "can reach this endpoint with admin claims and promote themselves."
+            ),
+            "nginx_config_hint": (
+                "=== UNLOCKED: nginx reverse-proxy config (nginx.conf excerpt) ===\n"
+                "  location /auth {\n"
+                "      proxy_pass http://auth_service:5000;\n"
+                "      # no ip_allowlist, no limit_req_zone\n"
+                "  }\n"
+                "NOTE: The nginx layer adds no rate-limiting or IP filtering "
+                "in front of /auth, confirming the brute-force surface is "
+                "fully exposed to the internet."
+            ),
+        },
+    },
 ]
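
The unlock mechanic described in the Task 6 comments can be sketched as follows. This is an illustration only: `unlock_hints` and the `episode` dict layout are assumptions modelled on the field names used in tests/test_dynamic_world.py (`context_hints`, `hints_unlocked`), not the code in server/CodeReviewAgent_environment.py.

```python
from __future__ import annotations

from typing import Any


def unlock_hints(
    task: dict[str, Any], episode: dict[str, Any], newly_found: list[str]
) -> list[str]:
    """Append the hint for each newly found issue that has an 'unlocks' key.

    Hypothetical helper. Returns the hint keys unlocked by this call.
    """
    hints = task.get("context_hints", {})
    unlocked_now: list[str] = []
    for issue in task["issues"]:
        key = issue.get("unlocks")
        if issue["id"] not in newly_found or key is None:
            continue
        if key in episode["hints_unlocked"] or key not in hints:
            continue  # already revealed, or no text registered for this key
        episode["hints_unlocked"].add(key)
        episode["context_hints"].append(hints[key])
        unlocked_now.append(key)
    return unlocked_now


# Toy task with one chained issue (all values are made up for the example).
toy_task = {
    "issues": [
        {"id": "issue_a", "unlocks": "hint_a"},
        {"id": "issue_b"},
    ],
    "context_hints": {"hint_a": "=== UNLOCKED: example hint ==="},
}
episode = {"context_hints": [], "hints_unlocked": set()}
first = unlock_hints(toy_task, episode, ["issue_a", "issue_b"])
second = unlock_hints(toy_task, episode, ["issue_a"])  # repeated find: no duplicate hint
```

Tracking unlocked keys in a set, separately from the ordered list of hint texts, is one way to keep the reveal idempotent while preserving the order in which hints appeared.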

tests/__init__.py ADDED

File without changes

tests/__pycache__/__init__.cpython-314.pyc ADDED

Binary file (162 Bytes).

tests/__pycache__/test_dynamic_world.cpython-314-pytest-9.0.3.pyc ADDED

Binary file (48.8 kB).

tests/__pycache__/test_grader.cpython-314-pytest-9.0.3.pyc ADDED

Binary file (47.6 kB).

tests/test_dynamic_world.py ADDED

	@@ -0,0 +1,344 @@

+"""
+Tests for the dynamic world features:
+  - server/mutator.py   (code mutation engine)
+  - Task 6              (causal chain / progressive observation)
+  - GET_CONTEXT action  (line-context probing)
+  - Causal unlock chain (context_hints injected into observation)
+"""
+import sys
+import os
+import copy
+import pytest
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+from server.mutator import mutate_task
+from server.tasks import TASKS
+from server.grader import CodeReviewGrader
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+TASK6 = TASKS[6]   # causal chain task
+def _grader(task):
+    return CodeReviewGrader(task)
+# ===========================================================================
+# MUTATOR TESTS
+# ===========================================================================
+class TestMutator:
+    def test_returns_deep_copy(self):
+        """mutate_task must not modify the original TASKS entry."""
+        original_code = TASKS[1]["code"]
+        _ = mutate_task(TASKS[1], seed=0)
+        assert TASKS[1]["code"] == original_code
+    def test_mutation_seed_tag(self):
+        """Mutated task carries _mutation_seed matching the supplied seed."""
+        t = mutate_task(TASKS[1], seed=42)
+        assert t["_mutation_seed"] == 42
+    def test_different_seeds_differ(self):
+        """Mutating with two different seeds must change the code relative to the original."""
+        t1 = mutate_task(TASKS[1], seed=0)
+        t2 = mutate_task(TASKS[1], seed=1)
+        # The blank-line insert alone changes the code, so this holds for any seed.
+        # This check does not assert that t1 and t2 differ from each other.
+        assert t1["code"] != TASKS[1]["code"] and t2["code"] != TASKS[1]["code"]
+    def test_same_seed_is_deterministic(self):
+        """Same seed must always produce identical output."""
+        t1 = mutate_task(TASKS[2], seed=99)
+        t2 = mutate_task(TASKS[2], seed=99)
+        assert t1["code"] == t2["code"]
+        assert t1["issues"] == t2["issues"]
+    def test_line_shift_applied(self):
+        """Line shift must move every issue line_range down by exactly 1."""
+        original = copy.deepcopy(TASKS[1])
+        mutated = mutate_task(TASKS[1], seed=7)
+        orig_ranges = [iss["line_range"] for iss in original["issues"]]
+        mut_ranges = [iss["line_range"] for iss in mutated["issues"]]
+        for orig_r, mut_r in zip(orig_ranges, mut_ranges):
+            assert mut_r[0] == orig_r[0] + 1
+            assert mut_r[1] == orig_r[1] + 1
+    def test_issue_count_preserved(self):
+        """Mutation must not add or remove issues."""
+        for task in TASKS[:6]:   # skip task 6 here, tested separately
+            mutated = mutate_task(task, seed=5)
+            assert len(mutated["issues"]) == len(task["issues"])
+    def test_issue_ids_preserved(self):
+        """Issue ids must be unchanged after mutation."""
+        original_ids = [i["id"] for i in TASKS[2]["issues"]]
+        mutated_ids = [i["id"] for i in mutate_task(TASKS[2], seed=3)["issues"]]
+        assert original_ids == mutated_ids
+    def test_grader_still_matches_after_mutation(self):
+        """
+        The grader must still award credit after mutation.
+        Use the off-by-one issue in task 1 — keyword 'range' is always present
+        and line_range shifts by exactly 1.
+        """
+        mutated = mutate_task(TASKS[1], seed=10)
+        g = _grader(mutated)
+        off_by_one = next(i for i in mutated["issues"] if i["id"] == "off_by_one")
+        target_line = off_by_one["line_range"][0]
+        score, found, _ = g.score_comment(
+            line_number=target_line,
+            comment="off-by-one error: range(len + 1) causes IndexError on the last iteration",
+            already_found=[],
+        )
+        assert "off_by_one" in found
+        assert score > 0.0
+    def test_correct_decision_preserved(self):
+        """correct_decision must be unchanged by mutation."""
+        for task in TASKS:
+            mutated = mutate_task(task, seed=1)
+            assert mutated["correct_decision"] == task["correct_decision"]
+# ===========================================================================
+# TASK 6 STRUCTURE TESTS
+# ===========================================================================
+class TestTask6Structure:
+    def test_task6_exists(self):
+        assert len(TASKS) >= 7, "Task 6 (causal chain) must exist in TASKS"
+    def test_task6_has_context_hints(self):
+        assert "context_hints" in TASK6
+        assert len(TASK6["context_hints"]) >= 2
+    def test_task6_unlock_keys_present(self):
+        """Every 'unlocks' key in an issue must exist in context_hints dict."""
+        hints = TASK6["context_hints"]
+        for issue in TASK6["issues"]:
+            key = issue.get("unlocks")
+            if key:
+                assert key in hints, f"Issue {issue['id']} unlocks '{key}' but key not in context_hints"
+    def test_task6_total_weight_positive(self):
+        g = _grader(TASK6)
+        assert g.total_weight > 0.0
+    def test_task6_has_chained_issues(self):
+        """At least two issues must have an 'unlocks' field."""
+        unlocking = [i for i in TASK6["issues"] if i.get("unlocks")]
+        assert len(unlocking) >= 2
+    def test_task6_correct_decision(self):
+        assert TASK6["correct_decision"] == "request_changes"
+# ===========================================================================
+# CAUSAL UNLOCK CHAIN TESTS (environment layer)
+# ===========================================================================
+class TestCausalUnlock:
+    """
+    Test the unlock mechanic via the environment's _unlock_causal_hints helper
+    and _handle_add_comment pipeline.
+    """
+    def _make_env(self):
+        """Return a fresh environment instance fast-forwarded to task 6."""
+        try:
+            from server.CodeReviewAgent_environment import ProbeEnvironment
+        except ImportError:
+            from CodeReviewAgent_environment import ProbeEnvironment  # type: ignore
+        env = ProbeEnvironment()
+        # force-set episode to task 6 (bypass cycling for test speed)
+        from server.mutator import mutate_task as _mt
+        task = _mt(TASK6, seed=0)
+        from server.grader import CodeReviewGrader as _G
+        env._grader = _G(task)
+        env._ep = env._fresh_episode(task)
+        return env
+    def test_no_hints_at_start(self):
+        env = self._make_env()
+        assert env._ep["context_hints"] == []
+    def test_unlock_fires_after_finding_trigger_issue(self):
+        """Finding hardcoded_jwt_secret must append db_schema_hint."""
+        env = self._make_env()
+        jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
+        target_line = jwt_issue["line_range"][0]
+        env._step_count = 1
+        reward = env._handle_add_comment(
+            type("A", (), {
+                "line_number": target_line,
+                "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable to prevent token forgery",
+                "severity": type("S", (), {"value": "critical"})(),
+                "category": type("C", (), {"value": "security"})(),
+            })()
+        )
+        assert "hardcoded_jwt_secret" in env._ep["issues_found"]
+        assert len(env._ep["context_hints"]) == 1
+        assert "db_schema_hint" in env._ep["hints_unlocked"]
+        assert "Database Schema" in env._ep["context_hints"][0]
+    def test_unlock_fires_only_once(self):
+        """The same hint must not be appended twice even if issue found again."""
+        env = self._make_env()
+        jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
+        target_line = jwt_issue["line_range"][0]
+        for _ in range(3):
+            env._step_count += 1
+            env._handle_add_comment(
+                type("A", (), {
+                    "line_number": target_line,
+                    "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable",
+                    "severity": type("S", (), {"value": "critical"})(),
+                    "category": type("C", (), {"value": "security"})(),
+                })()
+            )
+        assert len(env._ep["context_hints"]) == 1
+    def test_second_unlock_fires_independently(self):
+        """Finding no_rate_limit must append nginx_config_hint independently."""
+        env = self._make_env()
+        rate_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "no_rate_limit")
+        target_line = rate_issue["line_range"][0]
+        env._step_count = 1
+        env._handle_add_comment(
+            type("A", (), {
+                "line_number": target_line,
+                "comment": "No rate limiting on /auth endpoint — susceptible to brute-force attacks",
+                "severity": type("S", (), {"value": "error"})(),
+                "category": type("C", (), {"value": "security"})(),
+            })()
+        )
+        assert "nginx_config_hint" in env._ep["hints_unlocked"]
+        assert any("nginx" in h.lower() for h in env._ep["context_hints"])
+    def test_both_unlocks_can_fire_in_same_episode(self):
+        """Both hints can be unlocked within one episode."""
+        env = self._make_env()
+        task = env._ep["task"]
+        jwt_issue = next(i for i in task["issues"] if i["id"] == "hardcoded_jwt_secret")
+        rate_issue = next(i for i in task["issues"] if i["id"] == "no_rate_limit")
+        for step, (issue, kw) in enumerate([
+            (jwt_issue, "JWT_SECRET is hardcoded — must be loaded from environment variable to prevent forgery"),
+            (rate_issue, "No rate limiting on /auth endpoint — susceptible to brute-force attacks"),
+        ], start=1):
+            env._step_count = step
+            env._handle_add_comment(
+                type("A", (), {
+                    "line_number": issue["line_range"][0],
+                    "comment": kw,
+                    "severity": type("S", (), {"value": "critical"})(),
+                    "category": type("C", (), {"value": "security"})(),
+                })()
+            )
+        assert len(env._ep["context_hints"]) == 2
+        assert env._ep["hints_unlocked"] == {"db_schema_hint", "nginx_config_hint"}
+    def test_context_hints_appear_in_observation(self):
+        """context_hints list must be non-empty in the observation after an unlock."""
+        env = self._make_env()
+        jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
+        env._step_count = 1
+        env._handle_add_comment(
+            type("A", (), {
+                "line_number": jwt_issue["line_range"][0],
+                "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable",
+                "severity": type("S", (), {"value": "critical"})(),
+                "category": type("C", (), {"value": "security"})(),
+            })()
+        )
+        obs = env._make_obs(reward=0.0, done=False)
+        assert len(obs.context_hints) == 1
+        assert "Database Schema" in obs.context_hints[0]
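Review note: the unlock tests above pin down a one-shot contract, where each trigger issue appends its hint at most once and the two hints fire independently. A minimal sketch of logic that would satisfy those assertions follows. The `UNLOCK_RULES` table, the `apply_unlocks` helper, and the hint texts are hypothetical placeholders, not the code in `server/CodeReviewAgent_environment.py`.

```python
# Hypothetical trigger table: issue id -> (hint id, hint text).
# The hint texts are placeholders; the real ones live in server/tasks.py.
UNLOCK_RULES = {
    "hardcoded_jwt_secret": ("db_schema_hint", "Database Schema (placeholder text)"),
    "no_rate_limit": ("nginx_config_hint", "nginx config (placeholder text)"),
}


def apply_unlocks(ep, newly_found):
    """Append each triggered hint once, tracking fired hints in a set."""
    for issue_id in newly_found:
        rule = UNLOCK_RULES.get(issue_id)
        if rule is None:
            continue  # this issue does not unlock anything
        hint_id, text = rule
        if hint_id in ep["hints_unlocked"]:
            continue  # already fired earlier in this episode
        ep["hints_unlocked"].add(hint_id)
        ep["context_hints"].append(text)


# Mirrors test_unlock_fires_only_once and test_both_unlocks_can_fire_in_same_episode.
ep = {"context_hints": [], "hints_unlocked": set()}
for _ in range(3):
    apply_unlocks(ep, ["hardcoded_jwt_secret"])
apply_unlocks(ep, ["no_rate_limit"])
```

Keeping the fired hint ids in a set separate from the ordered hint list is what makes a repeated find harmless while preserving the order in which hints were revealed.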
+# ===========================================================================
+# GET_CONTEXT ACTION TESTS
+# ===========================================================================
+class TestGetContext:
+    def _make_env(self):
+        try:
+            from server.CodeReviewAgent_environment import ProbeEnvironment
+        except ImportError:
+            from CodeReviewAgent_environment import ProbeEnvironment  # type: ignore
+        from server.mutator import mutate_task as _mt
+        from server.grader import CodeReviewGrader as _G
+        env = ProbeEnvironment()
+        task = _mt(TASKS[1], seed=0)
+        env._grader = _G(task)
+        env._ep = env._fresh_episode(task)
+        return env
+    def test_get_context_near_issue_no_penalty(self):
+        """Probing a line near a real issue must cost 0.0."""
+        env = self._make_env()
+        issue_line = env._ep["task"]["issues"][0]["line_range"][0]
+        env._step_count = 1
+        reward = env._handle_get_context(
+            type("A", (), {"line_number": issue_line})()
+        )
+        assert reward.total == 0.0
+        assert reward.passed is True
+    def test_get_context_far_from_issue_costs_penalty(self):
+        """Probing a line far from any issue must cost -0.01."""
+        env = self._make_env()
+        env._step_count = 1
+        reward = env._handle_get_context(
+            type("A", (), {"line_number": 999})()
+        )
+        assert reward.total == pytest.approx(-0.01, abs=0.001)
+        assert reward.passed is False
+    def test_get_context_no_line_number_penalised(self):
+        """GET_CONTEXT with no line_number must return -0.02."""
+        env = self._make_env()
+        env._step_count = 1
+        reward = env._handle_get_context(
+            type("A", (), {"line_number": None})()
+        )
+        assert reward.total == pytest.approx(-0.02, abs=0.001)
+    def test_get_context_snippet_stored_in_history(self):
+        """The context probe must be recorded in review_comments."""
+        env = self._make_env()
+        env._step_count = 1
+        env._handle_get_context(
+            type("A", (), {"line_number": 4})()
+        )
+        probes = [c for c in env._ep["review_comments"] if c.get("type") == "context_probe"]
+        assert len(probes) == 1
+        assert probes[0]["line"] == 4
+        assert "context" in probes[0]
+    def test_get_context_snippet_contains_requested_line(self):
+        """The returned snippet must reference the requested line number."""
+        env = self._make_env()
+        env._step_count = 1
+        reward = env._handle_get_context(
+            type("A", (), {"line_number": 4})()
+        )
+        # explanation contains the formatted snippet with line numbers
+        assert "4:" in reward.explanation or "4 :" in reward.explanation
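Review note: the GET_CONTEXT tests above imply a three-tier cost schedule: 0.0 for a probe near a real issue, -0.01 for a probe far from every issue, and -0.02 when no line number is given. A minimal sketch of that schedule follows. The function name `context_probe_reward` and the `tolerance` window are assumptions, since the diff does not show how "near" is defined in `_handle_get_context`.

```python
def context_probe_reward(line_number, issue_ranges, tolerance=2):
    """Return the probe cost implied by the tests.

    issue_ranges is a list of (start, end) line ranges. The tolerance
    window is an assumed definition of "near an issue".
    """
    if line_number is None:
        return -0.02  # malformed probe: no line requested
    for start, end in issue_ranges:
        if start - tolerance <= line_number <= end + tolerance:
            return 0.0  # useful probe: free
    return -0.01  # probe far from every issue: small cost


near = context_probe_reward(4, [(4, 4), (11, 11)])
far = context_probe_reward(999, [(4, 4), (11, 11)])
missing = context_probe_reward(None, [(4, 4), (11, 11)])
```

The ordering of the three values matters more than their magnitudes: a malformed probe costs more than a merely unhelpful one, and a helpful probe is never penalised.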

tests/test_grader.py ADDED

	@@ -0,0 +1,397 @@

+"""
+Tests for CodeReviewGrader — validates all 5 RL attack scenarios plus
+edge cases for the three anti-exploit fixes made in grader.py.
+Attack targets (from the task spec):
+  Lazy / vague output   → 0.00 – 0.15
+  Average output        → 0.30 – 0.50
+  Good output           → 0.60 – 0.80
+  Perfect output        → 0.85 – 1.00
+  Wrong bug reported    → penalty / 0.00
+Coverage:
+  1. Lazy attack
+  2. Vague attack
+  3. Wrong-bug / hallucination attack
+  4. Perfect output
+  5. Base-model (average) output
+  6. LINE_TOLERANCE boundary (fix 1)
+  7. Minimum comment length guard (fix 2)
+  8. False-positive penalty value (fix 3)
+  9. final_score — full coverage + correct decision
+  10. final_score — zero coverage + wrong decision
+  11. final_score — partial coverage
+  12. already_found deduplication
+  13. None / empty comment guard
+  14. Task total weights are positive
+"""
+import sys
+import os
+import pytest
+# Ensure the project root (containing the `server` package) is on the path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+from server.grader import CodeReviewGrader, LINE_TOLERANCE
+from server.tasks import TASKS
+# ── Fixtures ──────────────────────────────────────────────────────────────────
+@pytest.fixture
+def task0():
+    """Ultra-easy bootstrap task (2 issues, equal weight 1.0 each)."""
+    return TASKS[0]
+@pytest.fixture
+def task1():
+    """Easy task (3 issues)."""
+    return TASKS[1]
+@pytest.fixture
+def grader0(task0):
+    return CodeReviewGrader(task0)
+@pytest.fixture
+def grader1(task1):
+    return CodeReviewGrader(task1)
+# ── Sanity ────────────────────────────────────────────────────────────────────
+def test_line_tolerance_value():
+    """LINE_TOLERANCE must be 2 after the anti-exploit fix."""
+    assert LINE_TOLERANCE == 2
+# ── 1. Lazy attack ────────────────────────────────────────────────────────────
+def test_lazy_attack_no_credit(grader0):
+    """Generic comment with no matching keyword earns only false-positive penalty."""
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        # deliberately avoids all task-0 keywords (off-by-one, index, range,
+        # bug, security, password, credential, hardcoded, env, secret, etc.)
+        comment="This function could probably be improved with some refactoring.",
+        already_found=[],
+    )
+    assert found == []
+    assert score <= 0.0  # pure false-positive penalty, no credit
+def test_lazy_attack_wrong_line(grader0):
+    """Keyword present but line number far from issue — no credit awarded."""
+    score, found, _ = grader0.score_comment(
+        line_number=99,  # far from issue at line 4
+        comment="off-by-one indexerror range",
+        already_found=[],
+    )
+    assert found == []
+    assert score < 0.0  # false-positive penalty applied
+# ── 2. Vague attack ───────────────────────────────────────────────────────────
+def test_vague_attack_category_only(grader0):
+    """Vague, category-level comment on the correct line with no specific keyword: no credit."""
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="This code has a logical issue.",
+        already_found=[],
+    )
+    assert found == []
+    assert score <= 0.0
+# ── 3. Wrong-bug / hallucination attack ──────────────────────────────────────
+def test_wrong_bug_on_correct_line_wrong_keyword(grader0):
+    """Hallucinated keyword on the correct line must not earn credit."""
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="This has a performance bottleneck and memory leak issue here.",
+        already_found=[],
+    )
+    # 'performance' / 'memory' are not in bootstrap_off_by_one keywords
+    assert found == []
+    assert score <= 0.0
+def test_wrong_bug_wrong_line_right_keyword(grader0):
+    """Right keyword, wrong line — line_hit must block the credit."""
+    score, found, _ = grader0.score_comment(
+        line_number=50,  # nowhere near line 4 or 11
+        comment="off-by-one indexerror range len + 1",
+        already_found=[],
+    )
+    assert found == []
+    assert score <= 0.0
+# ── 4. Perfect output ─────────────────────────────────────────────────────────
+def test_perfect_comment_task0_issue1(grader0):
+    """Exact keyword + exact line → full credit for issue 1."""
+    score, found, breakdown = grader0.score_comment(
+        line_number=4,
+        comment="Off-by-one error: range(len(data) + 1) causes IndexError on the last iteration.",
+        already_found=[],
+    )
+    assert "bootstrap_off_by_one" in found
+    assert breakdown["issue_credit"] == pytest.approx(0.30, abs=0.01)
+    assert score > 0.0
+def test_perfect_comment_task0_issue2(grader0):
+    """Exact keyword + exact line → full credit for issue 2."""
+    score, found, _ = grader0.score_comment(
+        line_number=11,
+        comment="Hardcoded password / credential in source — move to environment variable.",
+        already_found=[],
+    )
+    assert "bootstrap_hardcoded_cred" in found
+    assert score > 0.0
+def test_perfect_final_score_task0(grader0):
+    """Full coverage + correct decision gives max terminal reward.
+    final_score() is the TERMINAL component only (coverage 0.20 + decision 0.10
+    + efficiency 0.10 = max 0.40).  The per-comment 0.60 accumulates separately
+    during the episode via score_comment().  Assert the realistic terminal range.
+    """
+    reward = grader0.final_score(
+        issues_found=["bootstrap_off_by_one", "bootstrap_hardcoded_cred"],
+        review_decision="request_changes",
+        step_count=4,
+        max_steps=6,
+        current_step=4,
+    )
+    # coverage_bonus=0.20 + decision_score=0.10 + efficiency_bonus>0 → ~0.33-0.40
+    assert reward.total >= 0.30
+    assert reward.components["coverage_bonus"] == pytest.approx(0.20, abs=0.01)
+    assert reward.components["decision_score"] == pytest.approx(0.10, abs=0.001)
+    assert reward.passed is True
+# ── 5. Base-model (average) output ───────────────────────────────────────────
+def test_base_model_finds_one_of_two(grader0):
+    """Agent that finds 1/2 issues correctly should score in the average range."""
+    # Step 1: correct comment finding issue 1
+    score1, found1, _ = grader0.score_comment(
+        line_number=4,
+        comment="range(len(data) + 1) has an off-by-one bug causing IndexError.",
+        already_found=[],
+    )
+    # Step 2: vague comment on issue 2 line — no keyword match
+    score2, found2, _ = grader0.score_comment(
+        line_number=11,
+        comment="This line looks like it might have an issue with the connection string.",
+        already_found=found1,
+    )
+    reward = grader0.final_score(
+        issues_found=found1 + found2,
+        review_decision="request_changes",
+        step_count=4,
+        max_steps=6,
+        current_step=4,
+    )
+    # 50 % coverage → coverage_bonus=0.10, correct_decision=+0.10 → 0.20 total
+    # Well below the 0.85 perfect ceiling, above 0.10 lazy floor
+    assert 0.15 <= reward.total <= 0.55
+# ── 6. LINE_TOLERANCE boundary ────────────────────────────────────────────────
+def test_line_just_inside_tolerance(grader0):
+    """line_number at start - LINE_TOLERANCE must still match."""
+    issue_start = TASKS[0]["issues"][0]["line_range"][0]  # 4
+    score, found, _ = grader0.score_comment(
+        line_number=issue_start - LINE_TOLERANCE,  # exactly at boundary
+        comment="off-by-one indexerror range(len + 1) causes crash here",
+        already_found=[],
+    )
+    assert "bootstrap_off_by_one" in found
+def test_line_just_outside_tolerance(grader0):
+    """line_number at start - LINE_TOLERANCE - 1 must NOT match."""
+    issue_start = TASKS[0]["issues"][0]["line_range"][0]  # 4
+    score, found, _ = grader0.score_comment(
+        line_number=issue_start - LINE_TOLERANCE - 1,  # one beyond boundary
+        comment="off-by-one indexerror range(len + 1) causes crash here",
+        already_found=[],
+    )
+    assert found == []
+    assert score <= 0.0
+# ── 7. Minimum comment length guard ──────────────────────────────────────────
+def test_short_keyword_comment_no_credit(grader0):
+    """A comment ≤ 15 chars containing a matching keyword must NOT earn credit."""
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="indexerror",  # 10 chars — below 15-char threshold
+        already_found=[],
+    )
+    assert found == []
+    # short comment → neither credit nor false-positive penalty
+    assert score == 0.0
+def test_short_comment_no_false_positive_penalty(grader0):
+    """A short comment that matches nothing must NOT be penalised (too trivial)."""
+    score, found, _ = grader0.score_comment(
+        line_number=99,
+        comment="hmm",  # 3 chars
+        already_found=[],
+    )
+    assert found == []
+    assert score == 0.0
+def test_borderline_length_comment(grader0):
+    """A 17-char comment (above the 15-char threshold) with keyword + correct line earns credit."""
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="off-by-one range!",  # 17 chars, > 15
+        already_found=[],
+    )
+    assert "bootstrap_off_by_one" in found
+    assert score > 0.0
+# ── 8. False-positive penalty value ──────────────────────────────────────────
+def test_false_positive_penalty_magnitude(grader0):
+    """Each wrong substantive comment must cost exactly -0.05."""
+    score, found, breakdown = grader0.score_comment(
+        line_number=99,
+        comment="This line has a performance issue with the loop structure.",
+        already_found=[],
+    )
+    assert found == []
+    assert breakdown["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
+def test_multiple_false_positives_accumulate(grader0):
+    """Two wrong comments should each attract -0.05 independently."""
+    s1, _, bd1 = grader0.score_comment(
+        line_number=99,
+        comment="This line has a performance issue with the loop structure.",
+        already_found=[],
+    )
+    s2, _, bd2 = grader0.score_comment(
+        line_number=88,
+        comment="There is a design problem with this database call here.",
+        already_found=[],
+    )
+    assert bd1["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
+    assert bd2["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
+    # Combined penalty is -0.10 — within the -0.1 to -0.2 spec for 2 wrong claims
+    assert s1 + s2 == pytest.approx(-0.10, abs=0.001)
+# ── 9. final_score — full coverage + correct decision ─────────────────────────
+def test_final_score_full_coverage_correct_decision(grader1):
+    """100% coverage + correct decision → max terminal reward ~0.37-0.40."""
+    all_ids = [iss["id"] for iss in TASKS[1]["issues"]]
+    reward = grader1.final_score(
+        issues_found=all_ids,
+        review_decision="request_changes",
+        step_count=5,
+        max_steps=15,
+        current_step=5,
+    )
+    assert reward.total >= 0.30
+    assert reward.passed is True
+    assert reward.terminal is True
+    assert reward.components["coverage_bonus"] == pytest.approx(0.20, abs=0.01)
+    assert reward.components["decision_score"] == pytest.approx(0.10, abs=0.001)
+# ── 10. final_score — zero coverage + wrong decision ─────────────────────────
+def test_final_score_zero_coverage_wrong_decision(grader1):
+    reward = grader1.final_score(
+        issues_found=[],
+        review_decision="approve",  # wrong — should be request_changes
+        step_count=15,
+        max_steps=15,
+        current_step=15,
+    )
+    assert reward.total <= 0.0
+    assert reward.passed is False
+    assert reward.components["decision_score"] == pytest.approx(-0.10, abs=0.001)
+    assert reward.components["coverage_bonus"] == pytest.approx(0.0, abs=0.001)
+# ── 11. final_score — partial coverage ───────────────────────────────────────
+def test_final_score_partial_coverage(grader1):
+    """Finding 1 out of 3 issues (weight 1.0 / 2.5 total) with correct decision."""
+    reward = grader1.final_score(
+        issues_found=["off_by_one"],  # weight 1.0 out of 2.5 total
+        review_decision="request_changes",
+        step_count=10,
+        max_steps=15,
+        current_step=10,
+    )
+    # coverage = 1.0/2.5 = 0.40 → coverage_bonus = 0.08
+    # decision_score = +0.10
+    # efficiency_bonus = 0.0 (coverage < 0.60)
+    # total = 0.18
+    assert 0.10 <= reward.total <= 0.30
+    assert reward.passed is False  # coverage < 60 %
+# ── 12. Already-found deduplication ──────────────────────────────────────────
+def test_already_found_not_double_credited(grader0):
+    """An issue already in already_found must not be credited again."""
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="off-by-one indexerror range(len + 1) causes crash on last item",
+        already_found=["bootstrap_off_by_one"],  # pre-marked as found
+    )
+    assert "bootstrap_off_by_one" not in found
+    assert score <= 0.0  # false-positive penalty since nothing was matched
+# ── 13. None / empty comment guard ───────────────────────────────────────────
+def test_none_comment_returns_zero(grader0):
+    score, found, breakdown = grader0.score_comment(
+        line_number=4,
+        comment=None,
+        already_found=[],
+    )
+    assert score == 0.0
+    assert found == []
+    assert breakdown == {}
+def test_empty_comment_returns_zero(grader0):
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="",
+        already_found=[],
+    )
+    assert score == 0.0
+    assert found == []
+# ── 14. Task weight totals are non-zero (guards __init__) ────────────────────
+def test_all_task_total_weights_positive():
+    for task in TASKS:
+        grader = CodeReviewGrader(task)
+        assert grader.total_weight > 0.0, f"Task {task['id']} has zero total weight"
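Review note: the `final_score` tests above describe the terminal reward as coverage (up to 0.20) plus decision (+0.10 or -0.10) plus efficiency (up to 0.10, withheld below 60 % coverage). The arithmetic in the test comments can be sketched as below. The function name `terminal_reward` and the linear efficiency formula are assumptions inferred from the commented ranges, not the code in `server/grader.py`.

```python
def terminal_reward(found_weight, total_weight, decision_correct, step, max_steps):
    """Sketch of the terminal reward described in the test comments."""
    coverage = found_weight / total_weight
    coverage_bonus = 0.20 * coverage
    decision_score = 0.10 if decision_correct else -0.10
    # Assumption: efficiency scales linearly with unused steps and is
    # withheld when weighted coverage is below 60 %.
    efficiency_bonus = 0.10 * (1 - step / max_steps) if coverage >= 0.60 else 0.0
    return coverage_bonus + decision_score + efficiency_bonus


# Partial coverage (1.0 of 2.5), correct decision: 0.08 + 0.10 + 0.0
partial = terminal_reward(1.0, 2.5, True, 10, 15)
# Full coverage, correct decision, step 4 of 6: 0.20 + 0.10 + about 0.033
full = terminal_reward(2.0, 2.0, True, 4, 6)
# Zero coverage, wrong decision: 0.0 - 0.10 + 0.0
worst = terminal_reward(0.0, 2.5, False, 15, 15)
```

Under these assumptions the three cases land inside the bounds the tests assert: 0.10 to 0.30 for partial coverage, at least 0.30 for full coverage, and at most 0.0 for the worst case.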

uv.lock CHANGED

@@ -882,6 +882,7 @@ dependencies = [
     { name = "gradio-client" },
     { name = "typer" },
 ]
+sdist = { url = "https://files.pythonhosted.org/packages/ce/86/c9694b7cfada5780e75769e60dc161a161f4dd7fc91b61db5e3a3338bef9/hf_gradio-0.4.1.tar.gz", hash = "sha256:a017d942618f0d495a58ee4563047fa04bef614c00e0cb789a9a6d0633cffa7b", size = 6560, upload-time = "2026-04-22T14:01:32.334Z" }
 wheels = [
     { url = "https://files.pythonhosted.org/packages/30/2d/afff2ee87e75d8eb85c92bb8cf0e15b05c23c2ebd8fd8dec781d8601ed7f/hf_gradio-0.4.1-py3-none-any.whl", hash = "sha256:76b8cb8be6abe62d74c1ad2d35b42f0629db89aa9e1a8d033cecfe7c856eeab3", size = 4482, upload-time = "2026-04-17T19:53:31.827Z" },
 ]
@@ -1571,32 +1572,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/12/cf/03675d8bd8ecbf4445504d8071adab19f5f993676795708e36402ab38263/openapi_pydantic-0.5.1-py3-none-any.whl", hash = "sha256:a3a09ef4586f5bd760a8df7f43028b60cafb6d9f61de2acba9574766255ab146", size = 96381, upload-time = "2025-01-08T19:29:25.275Z" },
 ]
-[[package]]
-name = "openenv-codereviewagent"
-version = "0.1.0"
-source = { editable = "." }
-dependencies = [
-    { name = "openai" },
-    { name = "openenv-core", extra = ["core"] },
-    { name = "python-dotenv" },
-]
-[package.optional-dependencies]
-dev = [
-    { name = "pytest" },
-    { name = "pytest-cov" },
-]
-[package.metadata]
-requires-dist = [
-    { name = "openai", specifier = ">=1.0.0" },
-    { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
-    { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
-    { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
-    { name = "python-dotenv", specifier = ">=1.2.2" },
-]
-provides-extras = ["dev"]
 [[package]]
 name = "openenv-core"
 version = "0.2.3"
@@ -1632,6 +1607,44 @@ core = [
     { name = "websockets" },
 ]
+[[package]]
+name = "openenv-probe"
+version = "0.1.0"
+source = { editable = "." }
+dependencies = [
+    { name = "openai" },
+    { name = "openenv-core", extra = ["core"] },
+    { name = "python-dotenv" },
+]
+[package.optional-dependencies]
+dev = [
+    { name = "pytest" },
+    { name = "pytest-cov" },
+]
+[package.dev-dependencies]
+dev = [
+    { name = "pytest" },
+    { name = "pytest-cov" },
+]
+[package.metadata]
+requires-dist = [
+    { name = "openai", specifier = ">=1.0.0" },
+    { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
+    { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
+    { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
+    { name = "python-dotenv", specifier = ">=1.2.2" },
+]
+provides-extras = ["dev"]
+[package.metadata.requires-dev]
+dev = [
+    { name = "pytest", specifier = ">=9.0.3" },
+    { name = "pytest-cov", specifier = ">=7.1.0" },
+]
 [[package]]
 name = "opentelemetry-api"
 version = "1.41.0"