Thakur, Mahipal committed on
Commit
ab287c4
·
1 Parent(s): 62f5d41

feat: add dynamic world modeling — mutation engine, GET_CONTEXT action, causal chain task

server/mutator.py: variable rename + line shift + constant variance per episode

server/tasks.py: Task 6 causal chain with progressive context unlock

server/CodeReviewAgent_environment.py: wire mutation, GET_CONTEXT, unlock logic

models.py: add GET_CONTEXT action type + context_hints observation field

tests/test_dynamic_world.py: 26 tests covering all new features

refactor: rename project from CodeReviewAgent to PRobe

All class names: ProbeAction, ProbeObservation, ProbeEnv, ProbeEnvironment

pyproject.toml, openenv.yaml, README.md, __init__.py fully updated

50/50 tests passing

README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: CodeReviewAgent Environment
3
  emoji: 🔍
4
  colorFrom: blue
5
  colorTo: green
@@ -12,13 +12,22 @@ tags:
12
  - code-review
13
  - rl-training
14
  - grpo
 
 
15
  ---
16
 
17
- # CodeReviewAgentOpenEnv Environment
18
 
19
  > **OpenEnv Hackathon 2026 · Theme #3.1 — World Modeling (Professional Tasks)**
20
 
21
- An RL training environment where an LLM learns to perform structured **pull-request code reviews** on real Python source files. The agent must identify bugs, security vulnerabilities, performance bottlenecks, and design issues — and submit a structured review with line-level comments.
22
 
23
  ---
24
 
@@ -31,15 +40,17 @@ This environment provides a **reward signal** that directly measures review qual
31
 
32
  ## Environment Design
33
 
34
- ### Tasks (5 total)
35
 
36
  | ID | Difficulty | File | Issues | Domain |
37
  |----|-----------|------|--------|--------|
38
- | 0 | Easy | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
39
- | 1 | Medium | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
40
- | 2 | Hard | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
41
- | 3 | Medium | `async_worker.py` | 5 | Race condition, missing await, resource leak |
42
- | 4 | Hard | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
 
 
43
 
44
  Tasks cycle automatically on each `reset()` call.
45
 
@@ -47,18 +58,19 @@ Tasks cycle automatically on each `reset()` call.
47
 
48
  ```python
49
  {
50
- "code_snippet": str, # Python source to review
51
- "task_description": str, # What to look for
52
- "file_name": str,
53
- "task_id": int, # 0–4
54
- "task_difficulty": str, # easy / medium / hard
55
- "review_history": list, # actions taken so far this episode
56
- "step_count": int,
57
- "max_steps": int,
58
  "issues_found_count": int,
59
- "total_issues": int,
60
- "done": bool,
61
- "reward": float,
 
62
  }
63
  ```
64
 
@@ -66,7 +78,8 @@ Tasks cycle automatically on each `reset()` call.
66
 
67
  | action_type | Required fields | Effect |
68
  |-------------|----------------|--------|
69
- | `add_comment` | `line_number`, `comment`, `severity`, `category` | Annotate a line; partial reward if it matches a ground-truth issue |
 
70
  | `request_changes` | `comment` | Signal PR needs work |
71
  | `approve` | — | Approve PR (penalised if issues remain) |
72
  | `submit_review` | — | Finalise review; terminal reward |
@@ -86,7 +99,54 @@ Terminal (SUBMIT_REVIEW):
86
  Maximum achievable: ~1.0
87
  ```
88
 
89
- Grading uses **keyword + line-range matching** (±3 lines tolerance) against hand-labelled ground-truth issues — no LLM judge needed, fully deterministic.
90
 
91
  ---
92
 
@@ -141,11 +201,11 @@ All install, training, evaluation, and plotting cells are included.
141
 
142
  *(Fill in after training run)*
143
 
144
- | Model | Avg Reward | Task-0 | Task-1 | Task-2 | Task-3 | Task-4 |
145
- |-------|-----------|--------|--------|--------|--------|--------|
146
- | GPT-4o-mini (baseline) | — | — | — | — | — | — |
147
- | Qwen2.5-1.5B (untrained) | — | — | — | — | — | — |
148
- | Qwen2.5-1.5B (GRPO 3 epochs) | — | — | — | — | — | — |
149
 
150
  Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png`
151
 
@@ -154,17 +214,21 @@ Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png
154
  ## Project Structure
155
 
156
  ```
157
- CodeReviewAgent/
158
  ├── openenv.yaml # OpenEnv manifest
159
  ├── pyproject.toml
160
  ├── models.py # Action + Observation types
161
  ├── client.py # OpenEnv client
162
- ── server/
163
- ├── app.py # FastAPI server
164
- ├── CodeReviewAgent_environment.py
165
- ├── grader.py # Deterministic reward grader
166
- ├── tasks.py # 5 ground-truth tasks
167
- ── Dockerfile
 
 
 
 
168
  train_grpo.py # GRPO training script
169
  train_grpo_colab.ipynb # Colab notebook
170
  baseline.py # GPT-4o-mini baseline
 
1
  ---
2
+ title: PRobe Environment
3
  emoji: 🔍
4
  colorFrom: blue
5
  colorTo: green
 
12
  - code-review
13
  - rl-training
14
  - grpo
15
+ - world-modeling
16
+ - probe
17
  ---
18
 
19
+ # PRobe: Pull Request Investigation Environment
20
 
21
  > **OpenEnv Hackathon 2026 · Theme #3.1 — World Modeling (Professional Tasks)**
22
 
23
+ > *An RL environment where agents learn to investigate code like a security researcher, not scan it like a linter.*
24
+
25
+ PRobe is an RL training environment where an LLM learns to perform structured **pull-request code reviews** on real Python source files. The agent must identify bugs, security vulnerabilities, performance bottlenecks, and design issues — and submit a structured review with line-level comments.
26
+
27
+ The name has three meanings that map directly to the environment's design:
28
+ - **PR** — the domain: pull-request review
29
+ - **Probe** — the `get_context` action where the agent literally probes lines for deeper context
30
+ - **World Modeling** — an agent that *investigates* a partially observable system, updating its beliefs as new evidence is revealed
31
 
32
  ---
33
 
 
40
 
41
  ## Environment Design
42
 
43
+ ### Tasks (7 total)
44
 
45
  | ID | Difficulty | File | Issues | Domain |
46
  |----|-----------|------|--------|--------|
47
+ | 0 | Ultra-easy | `bootstrap.py` | 2 | Off-by-one, hardcoded credential (hinted in comments) |
48
+ | 1 | Easy | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
49
+ | 2 | Medium | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
50
+ | 3 | Hard | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
51
+ | 4 | Medium | `async_worker.py` | 5 | Race condition, missing await, resource leak |
52
+ | 5 | Hard | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
53
+ | 6 | Hard | `auth_service.py` | 6 | **Causal chain** — JWT forgery → privilege escalation |
54
 
55
  Tasks cycle automatically on each `reset()` call.
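
The diff does not show how the cycling is implemented. A minimal sketch, assuming a plain round-robin counter that starts at task 0 (`TaskCycler` and `NUM_TASKS` are hypothetical names, not taken from the repository):

```python
from dataclasses import dataclass

NUM_TASKS = 7  # task IDs 0-6, per the table above


@dataclass
class TaskCycler:
    """Hypothetical helper: hands out task IDs round-robin, one per reset()."""

    episode: int = 0

    def next_task_id(self) -> int:
        # Wrap around after the last task so the cycle repeats.
        task_id = self.episode % NUM_TASKS
        self.episode += 1
        return task_id


cycler = TaskCycler()
first_eight = [cycler.next_task_id() for _ in range(8)]
```

Under this assumption the eighth `reset()` returns to Task 0, so a training run sees every difficulty level at the same frequency.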
56
 
 
58
 
59
  ```python
60
  {
61
+ "code_snippet": str, # Python source to review (mutated each episode)
62
+ "task_description": str, # What to look for
63
+ "file_name": str,
64
+ "task_id": int, # 0–6
65
+ "task_difficulty": str, # ultra-easy / easy / medium / hard
66
+ "review_history": list, # actions taken so far this episode
67
+ "step_count": int,
68
+ "max_steps": int,
69
  "issues_found_count": int,
70
+ "total_issues": int,
71
+ "context_hints": list, # causal hints unlocked so far (Task 6)
72
+ "done": bool,
73
+ "reward": float,
74
  }
75
  ```
76
 
 
78
 
79
  | action_type | Required fields | Effect |
80
  |-------------|----------------|--------|
81
+ | `add_comment` | `line_number`, `comment`, `severity`, `category` | Annotate a line; reward if it matches a ground-truth issue |
82
+ | `get_context` | `line_number` | Reveal ±5 lines of context around a line (free near issues, −0.01 elsewhere) |
83
  | `request_changes` | `comment` | Signal PR needs work |
84
  | `approve` | — | Approve PR (penalised if issues remain) |
85
  | `submit_review` | — | Finalise review; terminal reward |
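
The "Required fields" column can be read as a lookup table. The sketch below checks an action dict against it before sending; `REQUIRED_FIELDS` and `missing_fields` are hypothetical helpers for illustration, and the real `ProbeAction` model may enforce this differently or not at all.

```python
# Required fields per action type, copied from the table above.
REQUIRED_FIELDS = {
    "add_comment": ("line_number", "comment", "severity", "category"),
    "get_context": ("line_number",),
    "request_changes": ("comment",),
    "approve": (),
    "submit_review": (),
}


def missing_fields(action: dict) -> list[str]:
    """Return the required fields that are absent or None for this action."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if action.get(name) is None]
```

An agent wrapper could call `missing_fields` on each generated action and skip the step when the list is non-empty, rather than spending a step on a malformed request.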
 
99
  Maximum achievable: ~1.0
100
  ```
101
 
102
+ Grading uses **keyword + line-range matching** (±2 lines tolerance) against hand-labelled ground-truth issues — no LLM judge needed, fully deterministic.
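
A minimal sketch of keyword plus line-range matching, assuming a comment counts as a hit when its line falls within the tolerance of the issue's range and its text contains at least one keyword. `GroundTruthIssue`, `matches` and the "any keyword" rule are assumptions; `server/grader.py` is not shown in this diff.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 2  # the ±2 lines stated above


@dataclass(frozen=True)
class GroundTruthIssue:
    """Hypothetical issue record: an inclusive line range plus trigger keywords."""

    line_range: tuple[int, int]
    keywords: tuple[str, ...]


def matches(issue: GroundTruthIssue, line_number: int, comment: str) -> bool:
    """True when the comment is close enough in position and mentions a keyword."""
    start, end = issue.line_range
    in_range = start - LINE_TOLERANCE <= line_number <= end + LINE_TOLERANCE
    text = comment.lower()
    has_keyword = any(keyword.lower() in text for keyword in issue.keywords)
    return in_range and has_keyword


issue = GroundTruthIssue(line_range=(10, 12), keywords=("sql injection", "parameterized"))
```

Because the check is pure string and integer comparison, the same comment on the same line always receives the same verdict, which is what makes the grader deterministic.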
103
+
104
+ ---
105
+
106
+ ## Dynamic World Features (v3)
107
+
108
+ ### Code Mutation
109
+ Every `reset()` applies three surface-level mutations so the agent must *read* code each episode rather than memorise tokens:
110
+
111
+ | Mutation | Effect |
112
+ |---|---|
113
+ | Variable rename | One identifier swapped for a synonym (e.g. `total` → `acc`) |
114
+ | Line shift | One blank line inserted above the first issue, shifting all `line_range` values by +1 |
115
+ | Constant variance | One numeric literal nudged ±1 (e.g. `range(1000)` → `range(999)`) |
116
+
117
+ Mutations are fully **deterministic** given the episode seed: the same seed reproduces the same mutated file, while different seeds yield different surface forms.
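
The three mutations can be sketched with a seeded `random.Random`. This is an illustration under stated assumptions, not the code in `server/mutator.py`: the `SYNONYMS` table, the `mutate` signature and the regex-based matching are all hypothetical.

```python
import random
import re

SYNONYMS = {"total": "acc", "result": "out"}  # hypothetical rename table


def mutate(source: str, first_issue_line: int, seed: int) -> tuple[str, int]:
    """Apply rename, constant nudge and line shift; return (code, line offset)."""
    rng = random.Random(seed)  # same seed -> same mutations

    # 1. Variable rename: swap one known identifier (whole-word match only).
    candidates = [n for n in sorted(SYNONYMS) if re.search(rf"\b{re.escape(n)}\b", source)]
    if candidates:
        old = rng.choice(candidates)
        source = re.sub(rf"\b{re.escape(old)}\b", SYNONYMS[old], source)

    # 2. Constant variance: nudge one integer literal by +1 or -1.
    literals = list(re.finditer(r"\b\d+\b", source))
    if literals:
        m = rng.choice(literals)
        nudged = str(int(m.group()) + rng.choice((-1, 1)))
        source = source[: m.start()] + nudged + source[m.end():]

    # 3. Line shift: insert a blank line above the first issue (1-based line).
    lines = source.split("\n")
    lines.insert(first_issue_line - 1, "")
    return "\n".join(lines), 1


src = "total = 0\nfor i in range(1000):\n    total += i\n"
code, offset = mutate(src, first_issue_line=2, seed=42)
```

The returned offset is what a grader would add to every ground-truth `line_range` so that labels stay aligned with the shifted code.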
118
+
119
+ ### GET_CONTEXT Action
120
+ The agent can spend a step probing any line to receive ±5 lines of surrounding context:
121
+
122
+ ```python
123
+ action = ProbeAction(
124
+ action_type="get_context",
125
+ line_number=37,
126
+ )
127
+ # Observation will contain a context snippet around line 37
128
+ # Cost: -0.01 if line is far from any real issue, 0.00 if near one
129
+ ```
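
On the server side, the window and the cost might be computed as below. This is a sketch only: the README does not define how close "near" is, so `NEAR_DISTANCE` is an assumed value, and `get_context` and `probe_cost` are hypothetical function names.

```python
CONTEXT_RADIUS = 5     # ±5 lines, as stated above
PROBE_PENALTY = -0.01  # cost of probing far from any issue
NEAR_DISTANCE = 5      # ASSUMPTION: "near" means within 5 lines of an issue range


def get_context(lines: list[str], line_number: int) -> list[str]:
    """Return up to ±CONTEXT_RADIUS lines around a 1-based line number."""
    start = max(0, line_number - 1 - CONTEXT_RADIUS)
    end = min(len(lines), line_number + CONTEXT_RADIUS)
    return lines[start:end]


def probe_cost(line_number: int, issue_ranges: list[tuple[int, int]]) -> float:
    """0.0 when the probe lands near an issue range, PROBE_PENALTY otherwise."""
    for start, end in issue_ranges:
        if start - NEAR_DISTANCE <= line_number <= end + NEAR_DISTANCE:
            return 0.0
    return PROBE_PENALTY


lines = [f"line {i}" for i in range(1, 101)]
window = get_context(lines, 37)
```

The window is clipped at the file boundaries, so probing line 2 returns fewer than 11 lines instead of raising an error.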
130
+
131
+ ### Causal Unlock Chain (Task 6)
132
+ Task 6 implements a **progressive world model**: finding certain issues unlocks new context hints that reveal deeper parts of the system:
133
+
134
+ ```
135
+ Find hardcoded JWT secret
136
+
137
+
138
+ DB schema revealed ──► agent sees plaintext passwords + role table
139
+
140
+
141
+ Can now reason: leaked secret → forge admin token → privilege escalation
142
+
143
+ Find missing rate-limit
144
+
145
+
146
+ nginx config revealed ──► confirms /auth fully exposed, no IP filtering
147
+ ```
148
+
149
+ This rewards genuine *causal reasoning* — the agent must update its world model as new evidence arrives.
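
One way to model the unlock chain is a mapping from key issues to the hint each one reveals. The issue IDs, hint texts and `update_hints` helper below are hypothetical and paraphrase the diagram; the actual logic in `server/tasks.py` is not shown in this diff.

```python
# Hypothetical issue IDs and hint texts, paraphrasing the chain above.
UNLOCKS = {
    "hardcoded_jwt_secret": "DB schema: plaintext passwords and a role table",
    "missing_rate_limit": "nginx config: /auth fully exposed, no IP filtering",
}


def update_hints(found_issues: list[str], context_hints: list[str]) -> list[str]:
    """Return context_hints extended with newly unlocked hints, in find order."""
    hints = list(context_hints)
    for issue_id in found_issues:
        hint = UNLOCKS.get(issue_id)
        # Only key issues unlock a hint, and each hint is added once.
        if hint is not None and hint not in hints:
            hints.append(hint)
    return hints
```

Calling `update_hints` after every `add_comment` keeps the `context_hints` observation field append-only, so information revealed earlier in an episode never disappears.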
150
 
151
  ---
152
 
 
201
 
202
  *(Fill in after training run)*
203
 
204
+ | Model | Avg Reward | Task-0 | Task-1 | Task-2 | Task-3 | Task-4 | Task-5 | Task-6 |
205
+ |-------|-----------|--------|--------|--------|--------|--------|--------|--------|
206
+ | GPT-4o-mini (baseline) | — | — | — | — | — | — | — | — |
207
+ | Qwen2.5-1.5B (untrained) | — | — | — | — | — | — | — | — |
208
+ | Qwen2.5-1.5B (GRPO 3 epochs) | — | — | — | — | — | — | — | — |
209
 
210
  Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png`
211
 
 
214
  ## Project Structure
215
 
216
  ```
217
+ PRobe/
218
  ├── openenv.yaml # OpenEnv manifest
219
  ├── pyproject.toml
220
  ├── models.py # Action + Observation types
221
  ├── client.py # OpenEnv client
222
+ ├── server/
223
+ │ ├── app.py # FastAPI server
224
+ │ ├── PRobe_environment.py # Environment core
225
+ │ ├── grader.py # Deterministic reward grader
226
+ │ ├── mutator.py # Code mutation engine (dynamic world)
227
+ │ ├── tasks.py # 7 ground-truth tasks
228
+ │ └── Dockerfile
229
+ ├── tests/
230
+ │ ├── test_grader.py # 24 grader tests (all 5 RL attacks)
231
+ │ └── test_dynamic_world.py # 26 dynamic world tests
232
  train_grpo.py # GRPO training script
233
  train_grpo_colab.ipynb # Colab notebook
234
  baseline.py # GPT-4o-mini baseline
__init__.py CHANGED
@@ -4,13 +4,13 @@
4
  # This source code is licensed under the BSD-style license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
7
- """Codereviewagent Environment."""
8
 
9
- from .client import CodereviewagentEnv
10
- from .models import CodereviewagentAction, CodereviewagentObservation
11
 
12
  __all__ = [
13
- "CodereviewagentAction",
14
- "CodereviewagentObservation",
15
- "CodereviewagentEnv",
16
  ]
 
4
  # This source code is licensed under the BSD-style license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
7
+ """PRobe \u2014 Pull Request Investigation Environment."""
8
 
9
+ from .client import ProbeEnv
10
+ from .models import ProbeAction, ProbeObservation
11
 
12
  __all__ = [
13
+ "ProbeAction",
14
+ "ProbeObservation",
15
+ "ProbeEnv",
16
  ]
__pycache__/__init__.cpython-314.pyc CHANGED
Binary files a/__pycache__/__init__.cpython-314.pyc and b/__pycache__/__init__.cpython-314.pyc differ
 
__pycache__/client.cpython-314.pyc CHANGED
Binary files a/__pycache__/client.cpython-314.pyc and b/__pycache__/client.cpython-314.pyc differ
 
__pycache__/models.cpython-314.pyc CHANGED
Binary files a/__pycache__/models.cpython-314.pyc and b/__pycache__/models.cpython-314.pyc differ
 
client.py CHANGED
@@ -1,40 +1,39 @@
1
- """CodeReviewAgent Environment Client."""
2
 
3
- from typing import Dict
4
 
5
  from openenv.core import EnvClient
6
  from openenv.core.client_types import StepResult
7
  from openenv.core.env_server.types import State
8
 
9
- from .models import CodereviewagentAction, CodereviewagentObservation
10
 
11
 
12
- class CodereviewagentEnv(
13
- EnvClient[CodereviewagentAction, CodereviewagentObservation, State]
14
- ):
15
  """
16
- Client for the CodeReviewAgent environment.
17
 
18
  Maintains a persistent WebSocket connection to the server.
19
 
20
- Example:
21
- >>> with CodereviewagentEnv(base_url="http://localhost:8000") as env:
22
- ... result = env.reset()
23
- ... print(result.observation.task_description)
24
- ...
25
- ... action = CodereviewagentAction(
26
- ... action_type="add_comment",
27
- ... line_number=4,
28
- ... comment="Off-by-one: range(len+1) causes IndexError",
29
- ... severity="error",
30
- ... category="bug",
31
- ... )
32
- ... result = env.step(action)
33
- ... print(result.reward)
 
34
  """
35
 
36
- def _step_payload(self, action: CodereviewagentAction) -> Dict:
37
- payload = {"action_type": action.action_type.value}
38
  if action.line_number is not None:
39
  payload["line_number"] = action.line_number
40
  if action.comment is not None:
@@ -46,31 +45,19 @@ class CodereviewagentEnv(
46
  return payload
47
 
48
  def _parse_result(
49
- self, payload: Dict
50
- ) -> StepResult[CodereviewagentObservation]:
51
- obs_data = payload.get("observation", {})
52
- observation = CodereviewagentObservation(
53
- code_snippet=obs_data.get("code_snippet", ""),
54
- task_description=obs_data.get("task_description", ""),
55
- file_name=obs_data.get("file_name", ""),
56
- task_id=obs_data.get("task_id", 0),
57
- task_difficulty=obs_data.get("task_difficulty", "easy"),
58
- review_history=obs_data.get("review_history", []),
59
- step_count=obs_data.get("step_count", 0),
60
- max_steps=obs_data.get("max_steps", 20),
61
- issues_found_count=obs_data.get("issues_found_count", 0),
62
- total_issues=obs_data.get("total_issues", 0),
63
- done=payload.get("done", False),
64
- reward=payload.get("reward"),
65
- metadata=obs_data.get("metadata", {}),
66
- )
67
  return StepResult(
68
  observation=observation,
69
- reward=payload.get("reward"),
70
- done=payload.get("done", False),
71
  )
72
 
73
- def _parse_state(self, payload: Dict) -> State:
74
  return State(
75
  episode_id=payload.get("episode_id"),
76
  step_count=payload.get("step_count", 0),
 
1
+ """PRobe Environment Client."""
2
 
3
+ from __future__ import annotations
4
 
5
  from openenv.core import EnvClient
6
  from openenv.core.client_types import StepResult
7
  from openenv.core.env_server.types import State
8
 
9
+ from .models import ProbeAction, ProbeObservation
10
 
11
 
12
+ class ProbeEnv(EnvClient[ProbeAction, ProbeObservation, State]):
 
 
13
  """
14
+ Client for the PRobe environment.
15
 
16
  Maintains a persistent WebSocket connection to the server.
17
 
18
+ Example::
19
+
20
+ with ProbeEnv(base_url="http://localhost:8000") as env:
21
+ result = env.reset()
22
+ print(result.observation.task_description)
23
+
24
+ action = ProbeAction(
25
+ action_type="add_comment",
26
+ line_number=4,
27
+ comment="Off-by-one: range(len+1) causes IndexError",
28
+ severity="error",
29
+ category="bug",
30
+ )
31
+ result = env.step(action)
32
+ print(result.reward)
33
  """
34
 
35
+ def _step_payload(self, action: ProbeAction) -> dict:
36
+ payload: dict = {"action_type": action.action_type.value}
37
  if action.line_number is not None:
38
  payload["line_number"] = action.line_number
39
  if action.comment is not None:
 
45
  return payload
46
 
47
  def _parse_result(
48
+ self, payload: dict
49
+ ) -> StepResult[ProbeObservation]:
50
+ obs_data: dict = payload.get("observation", {})
51
+ # Use model_validate so new fields added to ProbeObservation
52
+ # are picked up automatically without changing this method.
53
+ observation = ProbeObservation.model_validate(obs_data)
54
  return StepResult(
55
  observation=observation,
56
+ reward=float(payload.get("reward") or 0.0),
57
+ done=bool(payload.get("done", False)),
58
  )
59
 
60
+ def _parse_state(self, payload: dict) -> State:
61
  return State(
62
  episode_id=payload.get("episode_id"),
63
  step_count=payload.get("step_count", 0),
models.py CHANGED
@@ -1,10 +1,12 @@
1
  """
2
- Data models for the CodeReviewAgent Environment.
3
 
4
  An agent reviews Python source files, identifies bugs, security issues,
5
  and design problems, then submits a structured review.
6
  """
7
 
 
 
8
  from enum import Enum
9
  from typing import Any
10
 
@@ -13,13 +15,18 @@ from pydantic import BaseModel, ConfigDict, Field
13
 
14
 
15
  class ActionType(str, Enum):
 
 
16
  ADD_COMMENT = "add_comment"
 
17
  REQUEST_CHANGES = "request_changes"
18
  APPROVE = "approve"
19
  SUBMIT_REVIEW = "submit_review"
20
 
21
 
22
  class Severity(str, Enum):
 
 
23
  INFO = "info"
24
  WARNING = "warning"
25
  ERROR = "error"
@@ -27,6 +34,8 @@ class Severity(str, Enum):
27
 
28
 
29
  class IssueCategory(str, Enum):
 
 
30
  BUG = "bug"
31
  SECURITY = "security"
32
  PERFORMANCE = "performance"
@@ -36,62 +45,90 @@ class IssueCategory(str, Enum):
36
 
37
  class RewardType(BaseModel):
38
  """
39
- Structured reward returned by step().
40
-
41
- total : final clamped score in [-1.0, 1.0]
42
- components : named sub-scores before clamping (may sum outside [-1, 1])
43
- passed : True when the action was a clear positive signal
44
- explanation : human-readable breakdown for logging / debugging
45
- step : environment step this reward was issued at
46
- terminal : True only on the SUBMIT_REVIEW step
 
47
  """
48
 
49
  model_config = ConfigDict(frozen=True)
50
 
51
  total: float = Field(..., ge=-1.0, le=1.0)
52
  components: dict[str, float] = Field(default_factory=dict)
53
- passed: bool = Field(False)
54
- explanation: str = Field("")
55
- step: int = Field(0)
56
- terminal: bool = Field(False)
57
 
58
 
59
- class CodereviewagentAction(Action):
60
  """
61
- - ADD_COMMENT : annotate a specific line with a review comment
62
- - REQUEST_CHANGES: mark the PR as needing changes
63
- - APPROVE : approve the PR (only when no significant issues remain)
64
- - SUBMIT_REVIEW : finalize and submit the review (ends the episode)
 
 
 
 
65
  """
66
 
67
  action_type: ActionType = Field(..., description="Type of review action")
68
- line_number: int | None = Field(None, description="Source line being commented on")
69
- comment: str | None = Field(None, description="Review comment text")
70
- severity: Severity | None = Field(None, description="Issue severity level")
71
- category: IssueCategory | None = Field(None, description="Issue category")
 
 
 
 
72
 
73
 
74
- class CodereviewagentObservation(Observation):
75
  """
76
- Contains the code to review, task instructions, and the running
77
- review history so the agent can track what it has already flagged.
78
- The `reward` field mirrors the most recent step reward for convenience;
79
- the authoritative reward is the RewardType returned by step().
80
  """
81
 
82
- code_snippet: str = Field(default="", description="Python source code to review")
83
  task_description: str = Field(default="", description="Review instructions and goals")
84
  file_name: str = Field(default="", description="Name of the file being reviewed")
85
- task_id: int = Field(default=0, description="Current task index")
86
  task_difficulty: str = Field(default="ultra-easy", description="Task difficulty label")
87
  review_history: list[dict[str, Any]] = Field(
88
  default_factory=list,
89
- description="Ordered list of actions taken so far this episode",
90
  )
91
- step_count: int = Field(default=0, description="Steps taken in current episode")
92
- max_steps: int = Field(default=6, description="Step budget for this task")
93
- issues_found_count: int = Field(default=0, description="Number of issues identified so far")
94
- total_issues: int = Field(default=0, description="Total issues in this task")
95
  done: bool = Field(default=False, description="Whether the episode has ended")
96
- reward: float = Field(default=0.0, description="Most recent step reward (mirror of RewardType.total)")
 
 
 
 
 
97
  metadata: dict[str, Any] = Field(default_factory=dict, description="Extra episode metadata")
 
1
  """
2
+ Data models for the PRobe Environment.
3
 
4
  An agent reviews Python source files, identifies bugs, security issues,
5
  and design problems, then submits a structured review.
6
  """
7
 
8
+ from __future__ import annotations
9
+
10
  from enum import Enum
11
  from typing import Any
12
 
 
15
 
16
 
17
  class ActionType(str, Enum):
18
+ """All actions the agent may take during a review episode."""
19
+
20
  ADD_COMMENT = "add_comment"
21
+ GET_CONTEXT = "get_context" # probe a line for deeper causal context
22
  REQUEST_CHANGES = "request_changes"
23
  APPROVE = "approve"
24
  SUBMIT_REVIEW = "submit_review"
25
 
26
 
27
  class Severity(str, Enum):
28
+ """Severity levels for review comments."""
29
+
30
  INFO = "info"
31
  WARNING = "warning"
32
  ERROR = "error"
 
34
 
35
 
36
  class IssueCategory(str, Enum):
37
+ """Issue category taxonomy used in review comments."""
38
+
39
  BUG = "bug"
40
  SECURITY = "security"
41
  PERFORMANCE = "performance"
 
45
 
46
  class RewardType(BaseModel):
47
  """
48
+ Structured reward returned by ``step()``.
49
+
50
+ Attributes:
51
+ total: Final clamped score in ``[-1.0, 1.0]``.
52
+ components: Named sub-scores before clamping (may sum outside ``[-1, 1]``).
53
+ passed: ``True`` when the action produced a clear positive signal.
54
+ explanation: Human-readable breakdown for logging / debugging.
55
+ step: Environment step at which this reward was issued.
56
+ terminal: ``True`` only on the ``SUBMIT_REVIEW`` step.
57
  """
58
 
59
  model_config = ConfigDict(frozen=True)
60
 
61
  total: float = Field(..., ge=-1.0, le=1.0)
62
  components: dict[str, float] = Field(default_factory=dict)
63
+ passed: bool = Field(default=False)
64
+ explanation: str = Field(default="")
65
+ step: int = Field(default=0, ge=0)
66
+ terminal: bool = Field(default=False)
67
 
68
 
69
+ class ProbeAction(Action):
70
  """
71
+ An action the agent submits during a review episode.
72
+
73
+ Action types:
74
+ ADD_COMMENT — annotate a specific line with a review comment.
75
+ GET_CONTEXT — reveal ±5 lines of context around a line number.
76
+ REQUEST_CHANGES — mark the PR as requiring changes before merge.
77
+ APPROVE — approve the PR (penalised if issues remain).
78
+ SUBMIT_REVIEW — finalise and submit the review (ends the episode).
79
  """
80
 
81
  action_type: ActionType = Field(..., description="Type of review action")
82
+ line_number: int | None = Field(
83
+ default=None,
84
+ ge=1,
85
+ description="1-based source line being commented on or probed",
86
+ )
87
+ comment: str | None = Field(default=None, description="Review comment text")
88
+ severity: Severity | None = Field(default=None, description="Issue severity level")
89
+ category: IssueCategory | None = Field(default=None, description="Issue category")
90
 
91
 
92
+ class ProbeObservation(Observation):
93
  """
94
+ The observation returned to the agent after every ``reset()`` / ``step()``.
95
+
96
+ The ``reward`` field mirrors ``RewardType.total`` for the most recent step
97
+ as a convenience; the authoritative reward object is returned by ``step()``.
98
  """
99
 
100
+ code_snippet: str = Field(default="", description="Python source code to review (mutated each episode)")
101
  task_description: str = Field(default="", description="Review instructions and goals")
102
  file_name: str = Field(default="", description="Name of the file being reviewed")
103
+ task_id: int = Field(default=0, ge=0, description="Current task index (0–6)")
104
  task_difficulty: str = Field(default="ultra-easy", description="Task difficulty label")
105
  review_history: list[dict[str, Any]] = Field(
106
  default_factory=list,
107
+ description="Ordered list of all actions taken so far this episode",
108
+ )
109
+ step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
110
+ max_steps: int = Field(default=6, ge=1, description="Step budget for this task")
111
+ issues_found_count: int = Field(default=0, ge=0, description="Distinct issues identified so far")
112
+ total_issues: int = Field(default=0, ge=0, description="Total ground-truth issues in this task")
113
+ context_hints: list[str] = Field(
114
+ default_factory=list,
115
+ description="Causal context unlocked by finding key issues — read these before continuing",
116
  )
 
 
 
 
117
  done: bool = Field(default=False, description="Whether the episode has ended")
118
+ reward: float = Field(
119
+ default=0.0,
120
+ ge=-1.0,
121
+ le=1.0,
122
+ description="Most recent step reward (mirrors RewardType.total)",
123
+ )
124
  metadata: dict[str, Any] = Field(default_factory=dict, description="Extra episode metadata")
125
+
126
+
127
+ __all__ = [
128
+ "ActionType",
129
+ "IssueCategory",
130
+ "ProbeAction",
131
+ "ProbeObservation",
132
+ "RewardType",
133
+ "Severity",
134
+ ]
openenv.yaml CHANGED
@@ -1,32 +1,40 @@
1
  spec_version: 1
2
- name: CodeReviewAgent
3
  type: space
4
  runtime: fastapi
5
  app: server.app:app
6
  port: 8000
7
 
8
  description: >
9
- Code review environment where an agent reviews Python source files,
10
- identifies bugs, security vulnerabilities, performance bottlenecks,
11
- and design issues, then submits a structured review with comments
12
- and a final decision (request_changes or approve).
 
13
 
14
  tasks:
15
  - id: 0
16
  name: Basic Bug Detection
17
  difficulty: easy
18
  description: Identify logical bugs in a simple Python utility module
19
  max_steps: 15
20
  issues: 3
21
 
22
- - id: 1
23
  name: Security Vulnerability Review
24
  difficulty: medium
25
  description: Find security vulnerabilities in an authentication module
26
  max_steps: 20
27
  issues: 5
28
 
29
- - id: 2
30
  name: Full Architecture and Performance Review
31
  difficulty: hard
32
  description: >
@@ -35,14 +43,14 @@ tasks:
35
  max_steps: 30
36
  issues: 7
37
 
38
- - id: 3
39
  name: Async Worker Review
40
  difficulty: medium
41
  description: Find concurrency bugs and resource leaks in an async worker
42
  max_steps: 20
43
  issues: 5
44
 
45
- - id: 4
46
  name: Flask API Security Review
47
  difficulty: hard
48
  description: >
@@ -51,19 +59,30 @@ tasks:
51
  max_steps: 30
52
  issues: 6
53
 
54
  observation:
55
  type: object
56
  fields:
57
- code_snippet: {type: string, description: "Python source to review"}
58
  task_description: {type: string, description: "Review instructions"}
59
  file_name: {type: string}
60
- task_id: {type: integer, range: [0, 4]}
61
- task_difficulty: {type: string, values: [easy, medium, hard]}
62
  review_history: {type: array, description: "Actions taken so far"}
63
  step_count: {type: integer}
64
  max_steps: {type: integer}
65
  issues_found_count: {type: integer}
66
  total_issues: {type: integer}
 
67
  done: {type: boolean}
68
  reward: {type: number}
69
 
@@ -72,7 +91,7 @@ action:
72
  fields:
73
  action_type:
74
  type: enum
75
- values: [add_comment, request_changes, approve, submit_review]
76
  line_number: {type: integer, required: false}
77
  comment: {type: string, required: false}
78
  severity:
@@ -88,9 +107,11 @@ reward_design:
88
  range: [-1.0, 1.0]
89
  per_step:
90
  issue_found: "up to 0.60 total (weight/total_weight × 0.60 per issue)"
91
- false_positive: -0.02
92
  correct_request_changes: +0.05
93
  bad_approval: -0.15
 
 
94
  terminal:
95
  coverage_bonus: "coverage × 0.20 (max +0.20)"
96
  decision_correct: +0.10
 
1
  spec_version: 1
2
+ name: PRobe
3
  type: space
4
  runtime: fastapi
5
  app: server.app:app
6
  port: 8000
7
 
8
  description: >
9
+ PRobe (Pull Request Investigation Environment) is an RL training environment
10
+ where an agent reviews Python source files, identifies bugs, security
11
+ vulnerabilities, performance bottlenecks, and design issues, then submits a
12
+ structured review. Features dynamic code mutation, a GET_CONTEXT probe action,
13
+ and a causal unlock chain for genuine world-model reasoning.
14
 
15
  tasks:
16
  - id: 0
17
+ name: Bootstrap Obvious Issues
18
+ difficulty: ultra-easy
19
+ description: Off-by-one and hardcoded credential, both hinted in comments
20
+ max_steps: 6
21
+ issues: 2
22
+
23
+ - id: 1
24
  name: Basic Bug Detection
25
  difficulty: easy
26
  description: Identify logical bugs in a simple Python utility module
27
  max_steps: 15
28
  issues: 3
29
 
30
+ - id: 2
31
  name: Security Vulnerability Review
32
  difficulty: medium
33
  description: Find security vulnerabilities in an authentication module
34
  max_steps: 20
35
  issues: 5
36
 
37
+ - id: 3
38
  name: Full Architecture and Performance Review
39
  difficulty: hard
40
  description: >
 
43
  max_steps: 30
44
  issues: 7
45
 
46
+ - id: 4
47
  name: Async Worker Review
48
  difficulty: medium
49
  description: Find concurrency bugs and resource leaks in an async worker
50
  max_steps: 20
51
  issues: 5
52
 
53
+ - id: 5
54
  name: Flask API Security Review
55
  difficulty: hard
56
  description: >
 
59
  max_steps: 30
60
  issues: 6
61
 
62
+ - id: 6
63
+ name: Causal Secrets Leak Investigation
64
+ difficulty: hard
65
+ description: >
66
+ JWT auth service review with causal unlock chain — finding key issues
67
+ reveals DB schema and nginx config, enabling deeper attack-path reasoning
68
+ max_steps: 35
69
+ issues: 6
70
+ causal_unlocks: true
71
+
72
  observation:
73
  type: object
74
  fields:
75
+ code_snippet: {type: string, description: "Python source to review (mutated each episode)"}
76
  task_description: {type: string, description: "Review instructions"}
77
  file_name: {type: string}
78
+ task_id: {type: integer, range: [0, 6]}
79
+ task_difficulty: {type: string, values: [ultra-easy, easy, medium, hard]}
80
  review_history: {type: array, description: "Actions taken so far"}
81
  step_count: {type: integer}
82
  max_steps: {type: integer}
83
  issues_found_count: {type: integer}
84
  total_issues: {type: integer}
85
+ context_hints: {type: array, description: "Causal hints unlocked by finding key issues"}
86
  done: {type: boolean}
87
  reward: {type: number}
88
 
 
91
  fields:
92
  action_type:
93
  type: enum
94
+ values: [add_comment, get_context, request_changes, approve, submit_review]
95
  line_number: {type: integer, required: false}
96
  comment: {type: string, required: false}
97
  severity:
 
107
  range: [-1.0, 1.0]
108
  per_step:
109
  issue_found: "up to 0.60 total (weight/total_weight × 0.60 per issue)"
110
+ false_positive: -0.05
111
  correct_request_changes: +0.05
112
  bad_approval: -0.15
113
+ context_probe_near_issue: 0.00
114
+ context_probe_far: -0.01
115
  terminal:
116
  coverage_bonus: "coverage × 0.20 (max +0.20)"
117
  decision_correct: +0.10
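
The arithmetic behind the visible `reward_design` entries can be sketched as follows. Only the components shown in this hunk are covered, the issue weights in the example are made up, and the function names are hypothetical rather than taken from `server/grader.py`.

```python
def issue_reward(weight: float, total_weight: float) -> float:
    """Per-issue share of the 0.60 discovery budget."""
    return weight / total_weight * 0.60


def coverage_bonus(found: int, total: int) -> float:
    """Terminal bonus: coverage x 0.20, capped at +0.20."""
    coverage = found / total if total else 0.0
    return min(coverage, 1.0) * 0.20


def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Keep a summed reward inside the declared [-1.0, 1.0] range."""
    return max(low, min(high, value))


# Worked example with three issues of hypothetical weights 1, 2 and 3.
weights = [1.0, 2.0, 3.0]
per_issue = [issue_reward(w, sum(weights)) for w in weights]
all_found = sum(per_issue)  # the full discovery budget when every issue is found
```

With these weights the heaviest issue is worth half of the discovery budget, and finding all three adds up to the stated ceiling of 0.60 before any terminal bonus.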
openenv_CodeReviewAgent.egg-info/SOURCES.txt CHANGED
@@ -1,4 +1,7 @@
1
  README.md
 
 
 
2
  pyproject.toml
3
  ./__init__.py
4
  ./client.py
@@ -13,4 +16,5 @@ server/CodeReviewAgent_environment.py
13
  server/__init__.py
14
  server/app.py
15
  server/grader.py
16
- server/tasks.py
 
 
1
  README.md
2
+ __init__.py
3
+ client.py
4
+ models.py
5
  pyproject.toml
6
  ./__init__.py
7
  ./client.py
 
16
  server/__init__.py
17
  server/app.py
18
  server/grader.py
19
+ server/tasks.py
20
+ tests/test_grader.py
openenv_PRobe.egg-info/PKG-INFO ADDED
@@ -0,0 +1,11 @@
1
+ Metadata-Version: 2.4
2
+ Name: openenv-PRobe
3
+ Version: 0.1.0
4
+ Summary: PRobe — Pull Request Investigation Environment for OpenEnv
5
+ Requires-Python: >=3.10
6
+ Requires-Dist: openenv-core[core]>=0.2.2
7
+ Requires-Dist: openai>=1.0.0
8
+ Requires-Dist: python-dotenv>=1.2.2
9
+ Provides-Extra: dev
10
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
11
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_PRobe.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,19 @@
1
+ README.md
2
+ pyproject.toml
3
+ ./__init__.py
4
+ ./client.py
5
+ ./models.py
6
+ openenv_PRobe.egg-info/PKG-INFO
7
+ openenv_PRobe.egg-info/SOURCES.txt
8
+ openenv_PRobe.egg-info/dependency_links.txt
9
+ openenv_PRobe.egg-info/entry_points.txt
10
+ openenv_PRobe.egg-info/requires.txt
11
+ openenv_PRobe.egg-info/top_level.txt
12
+ server/CodeReviewAgent_environment.py
13
+ server/__init__.py
14
+ server/app.py
15
+ server/grader.py
16
+ server/mutator.py
17
+ server/tasks.py
18
+ tests/test_dynamic_world.py
19
+ tests/test_grader.py
openenv_PRobe.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
 
 
1
+
openenv_PRobe.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ [console_scripts]
2
+ server = PRobe.server.app:main
openenv_PRobe.egg-info/requires.txt ADDED
@@ -0,0 +1,7 @@
1
+ openenv-core[core]>=0.2.2
2
+ openai>=1.0.0
3
+ python-dotenv>=1.2.2
4
+
5
+ [dev]
6
+ pytest>=8.0.0
7
+ pytest-cov>=4.0.0
openenv_PRobe.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ PRobe
pyproject.toml CHANGED
@@ -9,9 +9,9 @@ requires = ["setuptools>=45", "wheel"]
9
  build-backend = "setuptools.build_meta"
10
 
11
  [project]
12
- name = "openenv-CodeReviewAgent"
13
  version = "0.1.0"
14
- description = "Codereviewagent environment for OpenEnv"
15
  requires-python = ">=3.10"
16
  dependencies = [
17
  # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
@@ -31,10 +31,15 @@ dev = [
31
 
32
  [project.scripts]
33
  # Server entry point - enables running via: uv run --project . server
34
- # or: python -m CodeReviewAgent.server.app
35
- server = "CodeReviewAgent.server.app:main"
36
 
37
  [tool.setuptools]
38
  include-package-data = true
39
- packages = ["CodeReviewAgent", "CodeReviewAgent.server"]
40
- package-dir = { "CodeReviewAgent" = ".", "CodeReviewAgent.server" = "server" }
9
  build-backend = "setuptools.build_meta"
10
 
11
  [project]
12
+ name = "openenv-PRobe"
13
  version = "0.1.0"
14
+ description = "PRobe Pull Request Investigation Environment for OpenEnv"
15
  requires-python = ">=3.10"
16
  dependencies = [
17
  # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
 
31
 
32
  [project.scripts]
33
  # Server entry point - enables running via: uv run --project . server
34
+ server = "PRobe.server.app:main"
 
35
 
36
  [tool.setuptools]
37
  include-package-data = true
38
+ packages = ["PRobe", "PRobe.server"]
39
+ package-dir = { "PRobe" = ".", "PRobe.server" = "server" }
40
+
41
+ [dependency-groups]
42
+ dev = [
43
+ "pytest>=9.0.3",
44
+ "pytest-cov>=7.1.0",
45
+ ]
server/CodeReviewAgent_environment.py CHANGED
@@ -6,7 +6,18 @@ Episode lifecycle:
6
  2. step(a) → (Obs, RewardType, done, info) (execute one action)
7
  3. state() → dict (full internal snapshot)
8
 
9
- Tasks cycle automatically: 0 (ultra-easy) → 1 (easy) → … → 5 (hard flask) → 0 …
10
 
11
  Thread / task safety: each Environment instance owns its own state.
12
  For concurrent GRPO rollouts spin up one instance per worker.
@@ -15,6 +26,8 @@ For concurrent GRPO rollouts spin up one instance per worker.
15
  from __future__ import annotations
16
 
17
  import asyncio
 
 
18
  from typing import Any
19
  from uuid import uuid4
20
 
@@ -24,30 +37,30 @@ from openenv.core.env_server.types import State
24
  try:
25
  from ..models import (
26
  ActionType,
27
- CodereviewagentAction,
28
- CodereviewagentObservation,
29
  RewardType,
30
  )
31
- from .grader import CodeReviewGrader
 
32
  from .tasks import TASKS
33
  except ImportError:
34
  from models import ( # type: ignore[no-redef]
35
  ActionType,
36
- CodereviewagentAction,
37
- CodereviewagentObservation,
38
  RewardType,
39
  )
40
- from server.grader import CodeReviewGrader # type: ignore[no-redef]
41
- from server.tasks import TASKS # type: ignore[no-redef]
 
42
 
43
- # Sentinel reward returned on non-terminal steps that produce no signal
44
- _ZERO_REWARD = RewardType(total=0.0, components={}, passed=False,
45
- explanation="No signal this step.", step=0, terminal=False)
46
 
47
 
48
- class CodereviewagentEnvironment(Environment):
49
  """
50
- OpenEnv-compliant code-review environment.
51
 
52
  Public interface is fully async. The sync wrappers (reset / step / state)
53
  required by openenv's create_app are also provided; they delegate to the
@@ -76,23 +89,28 @@ class CodereviewagentEnvironment(Environment):
76
  "review_decision": None,
77
  "review_submitted": False,
78
  "cumulative_reward": 0.0,
 
 
 
79
  }
80
 
81
  # ── Async-native interface (primary) ──────────────────────────────────
82
 
83
- async def async_reset(self) -> CodereviewagentObservation:
84
  task_id = self._reset_count % len(TASKS)
 
85
  self._reset_count += 1
86
  self._episode_id = str(uuid4())
87
  self._step_count = 0
88
- task = TASKS[task_id]
 
89
  self._grader = CodeReviewGrader(task)
90
  self._ep = self._fresh_episode(task)
91
  return self._make_obs(reward=0.0, done=False)
92
 
93
  async def async_step(
94
- self, action: CodereviewagentAction
95
- ) -> tuple[CodereviewagentObservation, RewardType, bool, dict[str, Any]]:
96
  self._step_count += 1
97
  task = self._ep["task"]
98
  done = False
@@ -101,6 +119,9 @@ class CodereviewagentEnvironment(Environment):
101
  if action.action_type == ActionType.ADD_COMMENT:
102
  reward_obj = self._handle_add_comment(action)
103
 
 
 
 
104
  elif action.action_type == ActionType.REQUEST_CHANGES:
105
  reward_obj = self._handle_request_changes(action)
106
 
@@ -165,32 +186,29 @@ class CodereviewagentEnvironment(Environment):
165
 
166
  # ── Sync wrappers (openenv / create_app compatibility) ────────────────
167
 
168
- def reset(self) -> CodereviewagentObservation: # type: ignore[override]
169
  try:
170
- loop = asyncio.get_running_loop()
171
  except RuntimeError:
172
  return asyncio.run(self.async_reset())
173
- # Called from inside a running loop (e.g. pytest-asyncio) -- run directly
174
- import concurrent.futures
175
  with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
176
- fut = pool.submit(asyncio.run, self.async_reset())
177
- return fut.result()
178
 
179
- def step(self, action: CodereviewagentAction) -> CodereviewagentObservation: # type: ignore[override]
180
  """
181
  Sync step for openenv compatibility.
182
  Returns only the Observation (reward is embedded in obs.reward).
183
  Use async_step() for the full (obs, reward, done, info) tuple.
184
  """
185
  try:
186
- loop = asyncio.get_running_loop()
187
  except RuntimeError:
188
  obs, _, _, _ = asyncio.run(self.async_step(action))
189
  return obs
190
- import concurrent.futures
191
  with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
192
- fut = pool.submit(asyncio.run, self.async_step(action))
193
- obs, _, _, _ = fut.result()
194
  return obs
195
 
196
  @property
@@ -199,7 +217,7 @@ class CodereviewagentEnvironment(Environment):
199
 
200
  # ── Action handlers ───────────────────────────────────────────────────
201
 
202
- def _handle_add_comment(self, action: CodereviewagentAction) -> RewardType:
203
  entry = {
204
  "type": "comment",
205
  "line": action.line_number,
@@ -224,6 +242,9 @@ class CodereviewagentEnvironment(Environment):
224
  else:
225
  explanation = "Comment recorded; no new issue matched."
226
 
 
 
 
227
  return RewardType(
228
  total=clamped,
229
  components=breakdown,
@@ -233,7 +254,79 @@ class CodereviewagentEnvironment(Environment):
233
  terminal=False,
234
  )
235
 
236
- def _handle_request_changes(self, action: CodereviewagentAction) -> RewardType:
237
  self._ep["review_decision"] = "request_changes"
238
  self._ep["review_comments"].append(
239
  {"type": "request_changes", "text": action.comment}
@@ -304,9 +397,9 @@ class CodereviewagentEnvironment(Environment):
304
 
305
  # ── Observation builder ───────────────────────────────────────────────
306
 
307
- def _make_obs(self, reward: float, done: bool) -> CodereviewagentObservation:
308
  task = self._ep["task"]
309
- return CodereviewagentObservation(
310
  code_snippet=task["code"],
311
  task_description=task["description"],
312
  file_name=task["file_name"],
@@ -319,9 +412,11 @@ class CodereviewagentEnvironment(Environment):
319
  total_issues=len(task["issues"]),
320
  done=done,
321
  reward=round(max(-1.0, min(1.0, reward)), 4),
 
322
  metadata={
323
  "cumulative_reward": self._ep.get("cumulative_reward", 0.0),
324
  "review_decision": self._ep.get("review_decision"),
325
  "episode_id": self._episode_id,
 
326
  },
327
  )
 
6
  2. step(a) → (Obs, RewardType, done, info) (execute one action)
7
  3. state() → dict (full internal snapshot)
8
 
9
+ Tasks cycle automatically: 0 (ultra-easy) → 1 (easy) → … → 6 (causal chain) → 0 …
10
+
11
+ Dynamic world features (v3)
12
+ ───────────────────────────
13
+ • Code mutation — each episode applies surface-level variable renames,
14
+ a line shift, and a constant nudge so the agent must
15
+ read the code rather than memorise tokens.
16
+ • GET_CONTEXT — the agent can spend a step probing a specific line to
17
+ receive the surrounding ±5 lines of context.
18
+ • Causal unlocks — finding certain issues appends a new context hint to
19
+ the observation, modelling real-world situations where
20
+ one discovery leads to deeper investigation.
21
 
22
  Thread / task safety: each Environment instance owns its own state.
23
  For concurrent GRPO rollouts spin up one instance per worker.
 
26
  from __future__ import annotations
27
 
28
  import asyncio
29
+ import concurrent.futures
30
+ import logging
31
  from typing import Any
32
  from uuid import uuid4
33
 
 
37
  try:
38
  from ..models import (
39
  ActionType,
40
+ ProbeAction,
41
+ ProbeObservation,
42
  RewardType,
43
  )
44
+ from .grader import CodeReviewGrader, LINE_TOLERANCE
45
+ from .mutator import mutate_task
46
  from .tasks import TASKS
47
  except ImportError:
48
  from models import ( # type: ignore[no-redef]
49
  ActionType,
50
+ ProbeAction,
51
+ ProbeObservation,
52
  RewardType,
53
  )
54
+ from server.grader import CodeReviewGrader, LINE_TOLERANCE # type: ignore[no-redef]
55
+ from server.mutator import mutate_task # type: ignore[no-redef]
56
+ from server.tasks import TASKS # type: ignore[no-redef]
57
 
58
+ log = logging.getLogger(__name__)
 
 
59
 
60
 
61
+ class ProbeEnvironment(Environment):
62
  """
63
+ PRobe Pull Request Investigation Environment.
64
 
65
  Public interface is fully async. The sync wrappers (reset / step / state)
66
  required by openenv's create_app are also provided; they delegate to the
 
89
  "review_decision": None,
90
  "review_submitted": False,
91
  "cumulative_reward": 0.0,
92
+ # causal world-modeling state
93
+ "context_hints": [], # list[str] of unlocked hint texts
94
+ "hints_unlocked": set(), # set[str] of hint keys already fired
95
  }
96
 
97
  # ── Async-native interface (primary) ──────────────────────────────────
98
 
99
+ async def async_reset(self) -> ProbeObservation:
100
  task_id = self._reset_count % len(TASKS)
101
+ seed = self._reset_count # unique seed per episode
102
  self._reset_count += 1
103
  self._episode_id = str(uuid4())
104
  self._step_count = 0
105
+ # Apply surface mutation so the agent cannot memorise tokens
106
+ task = mutate_task(TASKS[task_id], seed=seed)
107
  self._grader = CodeReviewGrader(task)
108
  self._ep = self._fresh_episode(task)
109
  return self._make_obs(reward=0.0, done=False)
110
 
111
  async def async_step(
112
+ self, action: ProbeAction
113
+ ) -> tuple[ProbeObservation, RewardType, bool, dict[str, Any]]:
114
  self._step_count += 1
115
  task = self._ep["task"]
116
  done = False
 
119
  if action.action_type == ActionType.ADD_COMMENT:
120
  reward_obj = self._handle_add_comment(action)
121
 
122
+ elif action.action_type == ActionType.GET_CONTEXT:
123
+ reward_obj = self._handle_get_context(action)
124
+
125
  elif action.action_type == ActionType.REQUEST_CHANGES:
126
  reward_obj = self._handle_request_changes(action)
127
 
 
186
 
187
  # ── Sync wrappers (openenv / create_app compatibility) ────────────────
188
 
189
+ def reset(self) -> ProbeObservation: # type: ignore[override]
190
  try:
191
+ asyncio.get_running_loop()
192
  except RuntimeError:
193
  return asyncio.run(self.async_reset())
194
+ # Called from inside a running loop (e.g. pytest-asyncio) -- run in a
195
+ # fresh thread that has its own event loop.
196
  with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
197
+ return pool.submit(asyncio.run, self.async_reset()).result()
 
198
 
199
+ def step(self, action: ProbeAction) -> ProbeObservation: # type: ignore[override]
200
  """
201
  Sync step for openenv compatibility.
202
  Returns only the Observation (reward is embedded in obs.reward).
203
  Use async_step() for the full (obs, reward, done, info) tuple.
204
  """
205
  try:
206
+ asyncio.get_running_loop()
207
  except RuntimeError:
208
  obs, _, _, _ = asyncio.run(self.async_step(action))
209
  return obs
 
210
  with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
211
+ obs, _, _, _ = pool.submit(asyncio.run, self.async_step(action)).result()
 
212
  return obs
213
 
214
  @property
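
The sync wrappers in this hunk follow a common pattern: call `asyncio.run` directly when no event loop is running, otherwise hand the coroutine to a worker thread that owns its own loop. A standalone sketch of that pattern follows. `async_work` and `run_sync` are hypothetical names, and the sketch passes a coroutine factory rather than a ready-made coroutine object, which differs slightly from the diff:

```python
import asyncio
import concurrent.futures


async def async_work() -> int:
    """Stand-in coroutine; the environment's async_reset/async_step would go here."""
    await asyncio.sleep(0)
    return 42


def run_sync(coro_factory):
    """Run a coroutine to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: asyncio.run can create and own one directly.
        return asyncio.run(coro_factory())
    # A loop is already running in this thread, so asyncio.run would raise here.
    # Hand the work to a worker thread that gets its own event loop.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(coro_factory())).result()


async def caller_inside_loop() -> int:
    # Note: .result() blocks the running loop until the worker thread finishes.
    return run_sync(async_work)


plain = run_sync(async_work)                # called with no loop running
nested = asyncio.run(caller_inside_loop())  # called from inside a running loop
```

One trade-off is that the threaded branch blocks the calling event loop while it waits, so the async interface is likely the better fit for callers that are already asynchronous.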
 
217
 
218
  # ── Action handlers ───────────────────────────────────────────────────
219
 
220
+ def _handle_add_comment(self, action: ProbeAction) -> RewardType:
221
  entry = {
222
  "type": "comment",
223
  "line": action.line_number,
 
242
  else:
243
  explanation = "Comment recorded; no new issue matched."
244
 
245
+ # ── Causal unlock: check whether any newly found issue reveals context
246
+ self._unlock_causal_hints(new_finds)
247
+
248
  return RewardType(
249
  total=clamped,
250
  components=breakdown,
 
254
  terminal=False,
255
  )
256
 
257
+ def _unlock_causal_hints(self, newly_found: list[str]) -> None:
258
+ """Append context hint text for any issue that has an 'unlocks' key."""
259
+ task = self._ep["task"]
260
+ hint_map: dict[str, str] = task.get("context_hints", {})
261
+ for issue in task["issues"]:
262
+ unlock_key = issue.get("unlocks")
263
+ if (
264
+ unlock_key
265
+ and issue["id"] in newly_found
266
+ and unlock_key not in self._ep["hints_unlocked"]
267
+ and unlock_key in hint_map
268
+ ):
269
+ self._ep["hints_unlocked"].add(unlock_key)
270
+ self._ep["context_hints"].append(hint_map[unlock_key])
271
+
272
+ def _handle_get_context(
273
+ self, action: ProbeAction
274
+ ) -> RewardType:
275
+ """
276
+ GET_CONTEXT — reveal ±5 lines around the requested line number.
277
+
278
+ Costs a small step penalty (-0.01) to discourage random probing,
279
+ but rewards focused investigation (line near an actual issue: 0.0
280
+ net cost — penalty waived).
281
+ """
282
+ line_number = action.line_number
283
+ task = self._ep["task"]
284
+ code_lines = task["code"].split("\n")
285
+
286
+ if line_number is None:
287
+ return RewardType(
288
+ total=-0.02,
289
+ components={"invalid_context_probe": -0.02},
290
+ passed=False,
291
+ explanation="GET_CONTEXT requires a line_number.",
292
+ step=self._step_count,
293
+ terminal=False,
294
+ )
295
+
296
+ # Build snippet
297
+ start = max(0, line_number - 6)
298
+ end = min(len(code_lines), line_number + 5)
299
+ snippet_lines = [
300
+ f"{i + 1:3}: {code_lines[i]}" for i in range(start, end)
301
+ ]
302
+ snippet = "\n".join(snippet_lines)
303
+
304
+ # Check if probed line is near a real issue (within LINE_TOLERANCE).
305
+ near_issue = any(
306
+ (iss["line_range"][0] - LINE_TOLERANCE) <= line_number <= (iss["line_range"][1] + LINE_TOLERANCE)
307
+ for iss in task["issues"]
308
+ )
309
+ penalty = 0.0 if near_issue else -0.01
310
+
311
+ # Store the context result in review history so the agent can see it
312
+ self._ep["review_comments"].append({
313
+ "type": "context_probe",
314
+ "line": line_number,
315
+ "context": snippet,
316
+ })
317
+
318
+ return RewardType(
319
+ total=penalty,
320
+ components={"context_probe_penalty": penalty},
321
+ passed=near_issue,
322
+ explanation=(
323
+ f"Context around line {line_number}:\n{snippet}"
324
+ ),
325
+ step=self._step_count,
326
+ terminal=False,
327
+ )
328
+
329
+ def _handle_request_changes(self, action: ProbeAction) -> RewardType:
330
  self._ep["review_decision"] = "request_changes"
331
  self._ep["review_comments"].append(
332
  {"type": "request_changes", "text": action.comment}
 
397
 
398
  # ── Observation builder ───────────────────────────────────────────────
399
 
400
+ def _make_obs(self, reward: float, done: bool) -> ProbeObservation:
401
  task = self._ep["task"]
402
+ return ProbeObservation(
403
  code_snippet=task["code"],
404
  task_description=task["description"],
405
  file_name=task["file_name"],
 
412
  total_issues=len(task["issues"]),
413
  done=done,
414
  reward=round(max(-1.0, min(1.0, reward)), 4),
415
+ context_hints=list(self._ep.get("context_hints", [])),
416
  metadata={
417
  "cumulative_reward": self._ep.get("cumulative_reward", 0.0),
418
  "review_decision": self._ep.get("review_decision"),
419
  "episode_id": self._episode_id,
420
+ "mutation_seed": self._ep["task"].get("_mutation_seed"),
421
  },
422
  )
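
The window arithmetic in `_handle_get_context` can be checked in isolation. The sketch below mirrors the `start`/`end` slicing and the near-issue test from the diff. `context_window` and `probe_penalty` are hypothetical helper names, and `LINE_TOLERANCE = 2` is assumed from the value shown in `server/grader.py`:

```python
LINE_TOLERANCE = 2  # assumed to match server/grader.py


def context_window(code: str, line_number: int, radius: int = 5) -> str:
    """Return the 1-indexed line plus `radius` lines on each side, numbered."""
    lines = code.split("\n")
    start = max(0, line_number - 1 - radius)     # 0-based index of the first line shown
    end = min(len(lines), line_number + radius)  # exclusive upper bound
    return "\n".join(f"{i + 1:3}: {lines[i]}" for i in range(start, end))


def probe_penalty(line_number: int, issues: list[dict]) -> float:
    """0.0 when the probe lands within tolerance of any issue range, else -0.01."""
    near = any(
        iss["line_range"][0] - LINE_TOLERANCE <= line_number <= iss["line_range"][1] + LINE_TOLERANCE
        for iss in issues
    )
    return 0.0 if near else -0.01


code = "\n".join(f"line{n}" for n in range(1, 21))  # 20 dummy lines
snippet = context_window(code, 10)                  # covers lines 5 through 15
issues = [{"line_range": (12, 13)}]                 # hypothetical issue location
```

Values of `line_number` below 1 or past the end of the file are clamped by `max`/`min` rather than rejected; the diff does not say whether that is intended.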
server/__init__.py CHANGED
@@ -4,8 +4,8 @@
4
  # This source code is licensed under the BSD-style license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
7
- """Codereviewagent environment server components."""
8
 
9
- from .CodeReviewAgent_environment import CodereviewagentEnvironment
10
 
11
- __all__ = ["CodereviewagentEnvironment"]
 
4
  # This source code is licensed under the BSD-style license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
7
+ """PRobe environment server components."""
8
 
9
+ from .CodeReviewAgent_environment import ProbeEnvironment
10
 
11
+ __all__ = ["ProbeEnvironment"]
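
For reference, the causal-unlock rule wired into `ProbeEnvironment` can be exercised on its own. This is a sketch under assumptions: the task shape (`issues` with an `id` and an optional `unlocks` key, plus a `context_hints` map) follows `_unlock_causal_hints` in the environment diff, while the issue ids and hint text are made up for illustration:

```python
def unlock_hints(task: dict, newly_found: list[str], unlocked: set[str], hints: list[str]) -> None:
    """Append hint text once for each newly found issue that carries an 'unlocks' key."""
    hint_map = task.get("context_hints", {})
    for issue in task["issues"]:
        key = issue.get("unlocks")
        if key and issue["id"] in newly_found and key not in unlocked and key in hint_map:
            unlocked.add(key)
            hints.append(hint_map[key])


# Hypothetical task: finding the hardcoded secret reveals the DB schema.
task = {
    "issues": [
        {"id": "hardcoded_secret", "unlocks": "db_schema"},
        {"id": "weak_hash"},
    ],
    "context_hints": {"db_schema": "users(id, email, password_md5)"},
}
unlocked: set[str] = set()
hints: list[str] = []
unlock_hints(task, ["weak_hash"], unlocked, hints)         # no 'unlocks' key, nothing happens
unlock_hints(task, ["hardcoded_secret"], unlocked, hints)  # fires once
unlock_hints(task, ["hardcoded_secret"], unlocked, hints)  # already fired, ignored
```

Tracking fired keys in a set keeps each hint from being appended twice, even if the same issue id is reported again.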
server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc CHANGED
Binary files a/server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc and b/server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc differ
 
server/__pycache__/__init__.cpython-314.pyc CHANGED
Binary files a/server/__pycache__/__init__.cpython-314.pyc and b/server/__pycache__/__init__.cpython-314.pyc differ
 
server/__pycache__/grader.cpython-314.pyc CHANGED
Binary files a/server/__pycache__/grader.cpython-314.pyc and b/server/__pycache__/grader.cpython-314.pyc differ
 
server/__pycache__/mutator.cpython-314.pyc ADDED
Binary file (5.86 kB).
 
server/__pycache__/tasks.cpython-314.pyc CHANGED
Binary files a/server/__pycache__/tasks.cpython-314.pyc and b/server/__pycache__/tasks.cpython-314.pyc differ
 
server/app.py CHANGED
@@ -1,5 +1,5 @@
1
  """
2
- Async FastAPI server for the CodeReviewAgent environment.
3
 
4
  Endpoints:
5
  POST /reset — start a new episode (HTTP session)
@@ -20,9 +20,11 @@ falls back to a minimal HTML redirect page.
20
  from __future__ import annotations
21
 
22
  import json
 
23
  from contextlib import asynccontextmanager
24
  from typing import Any
25
 
 
26
  from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
27
  from fastapi.responses import HTMLResponse
28
 
@@ -33,22 +35,24 @@ except Exception: # pragma: no cover
33
  _OPENENV_AVAILABLE = False
34
 
35
  try:
36
- from ..models import CodereviewagentAction, CodereviewagentObservation, RewardType
37
- from .CodeReviewAgent_environment import CodereviewagentEnvironment
38
  except ModuleNotFoundError:
39
- from models import CodereviewagentAction, CodereviewagentObservation, RewardType # type: ignore
40
- from server.CodeReviewAgent_environment import CodereviewagentEnvironment # type: ignore
 
 
41
 
42
 
43
  # ── Shared HTTP session env ───────────────────────────────────────────────────
44
 
45
- _http_env: CodereviewagentEnvironment | None = None
46
 
47
 
48
  @asynccontextmanager
49
  async def lifespan(application: FastAPI):
50
  global _http_env
51
- _http_env = CodereviewagentEnvironment()
52
  yield
53
  _http_env = None
54
 
@@ -58,7 +62,7 @@ async def lifespan(application: FastAPI):
58
  class StepResponse:
59
  def __init__(
60
  self,
61
- obs: CodereviewagentObservation,
62
  reward: RewardType,
63
  done: bool,
64
  info: dict[str, Any],
@@ -81,7 +85,7 @@ class StepResponse:
81
 
82
  def _build_app() -> FastAPI:
83
  application = FastAPI(
84
- title="CodeReviewAgent",
85
  description="OpenEnv code-review environment — async FastAPI server.",
86
  version="2.0.0",
87
  lifespan=lifespan,
@@ -91,19 +95,22 @@ def _build_app() -> FastAPI:
91
 
92
  @application.post("/reset", summary="Start a new episode")
93
  async def reset_endpoint() -> dict[str, Any]:
94
- assert _http_env is not None
 
95
  obs = await _http_env.async_reset()
96
  return {"observation": obs.model_dump(), "reward": None, "done": False, "info": {}}
97
 
98
  @application.post("/step", summary="Execute one action")
99
- async def step_endpoint(action: CodereviewagentAction) -> dict[str, Any]:
100
- assert _http_env is not None
 
101
  obs, reward, done, info = await _http_env.async_step(action)
102
  return StepResponse(obs, reward, done, info).to_dict()
103
 
104
  @application.get("/state", summary="Current episode state snapshot")
105
  async def state_endpoint() -> dict[str, Any]:
106
- assert _http_env is not None
 
107
  return await _http_env.async_state()
108
 
109
  @application.get("/health", summary="Liveness probe")
@@ -113,8 +120,8 @@ def _build_app() -> FastAPI:
113
  @application.get("/schema", summary="Action and observation JSON schemas")
114
  async def schema() -> dict[str, Any]:
115
  return {
116
- "action": CodereviewagentAction.model_json_schema(),
117
- "observation": CodereviewagentObservation.model_json_schema(),
118
  "reward": RewardType.model_json_schema(),
119
  }
120
 
@@ -123,7 +130,7 @@ def _build_app() -> FastAPI:
123
  @application.websocket("/ws")
124
  async def ws_endpoint(websocket: WebSocket) -> None:
125
  await websocket.accept()
126
- env = CodereviewagentEnvironment()
127
  try:
128
  while True:
129
  raw = await websocket.receive_text()
@@ -138,7 +145,7 @@ def _build_app() -> FastAPI:
138
 
139
  elif cmd == "step":
140
  try:
141
- action = CodereviewagentAction(**msg["action"])
142
  except Exception as exc:
143
  await websocket.send_json({"type": "error", "detail": str(exc)})
144
  continue
@@ -170,9 +177,9 @@ def _build_app() -> FastAPI:
170
  @application.get("/web", response_class=HTMLResponse, include_in_schema=False)
171
  async def web_ui() -> str:
172
  return """
173
- <!doctype html><html><head><title>CodeReviewAgent</title></head>
174
- <body>
175
- <h2>CodeReviewAgent Environment</h2>
176
  <p>API docs: <a href="/docs">/docs</a></p>
177
  <p>Health: <a href="/health">/health</a></p>
178
  <p>Schema: <a href="/schema">/schema</a></p>
@@ -185,8 +192,7 @@ def _build_app() -> FastAPI:
185
  app = _build_app()
186
 
187
 
188
- def main(host: str = "0.0.0.0", port: int = 8000) -> None:
189
- import uvicorn
190
  uvicorn.run(app, host=host, port=port)
191
 
192
 
 
1
  """
2
+ Async FastAPI server for the PRobe environment.
3
 
4
  Endpoints:
5
  POST /reset — start a new episode (HTTP session)
 
20
  from __future__ import annotations
21
 
22
  import json
23
+ import logging
24
  from contextlib import asynccontextmanager
25
  from typing import Any
26
 
27
+ import uvicorn
28
  from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
29
  from fastapi.responses import HTMLResponse
30
 
 
35
  _OPENENV_AVAILABLE = False
36
 
37
  try:
38
+ from ..models import ProbeAction, ProbeObservation, RewardType
39
+ from .CodeReviewAgent_environment import ProbeEnvironment
40
  except ModuleNotFoundError:
41
+ from models import ProbeAction, ProbeObservation, RewardType # type: ignore
42
+ from server.CodeReviewAgent_environment import ProbeEnvironment # type: ignore
43
+
44
+ log = logging.getLogger(__name__)
45
 
46
 
47
  # ── Shared HTTP session env ───────────────────────────────────────────────────
48
 
49
+ _http_env: ProbeEnvironment | None = None
50
 
51
 
52
  @asynccontextmanager
53
  async def lifespan(application: FastAPI):
54
  global _http_env
55
+ _http_env = ProbeEnvironment()
56
  yield
57
  _http_env = None
58
 
 
62
  class StepResponse:
63
  def __init__(
64
  self,
65
+ obs: ProbeObservation,
66
  reward: RewardType,
67
  done: bool,
68
  info: dict[str, Any],
 
85
 
86
  def _build_app() -> FastAPI:
87
  application = FastAPI(
88
+ title="PRobe",
89
  description="OpenEnv code-review environment — async FastAPI server.",
90
  version="2.0.0",
91
  lifespan=lifespan,
 
95
 
96
  @application.post("/reset", summary="Start a new episode")
97
  async def reset_endpoint() -> dict[str, Any]:
98
+ if _http_env is None:
99
+ raise HTTPException(status_code=503, detail="Environment not initialised")
100
  obs = await _http_env.async_reset()
101
  return {"observation": obs.model_dump(), "reward": None, "done": False, "info": {}}
102
 
103
  @application.post("/step", summary="Execute one action")
104
+ async def step_endpoint(action: ProbeAction) -> dict[str, Any]:
105
+ if _http_env is None:
106
+ raise HTTPException(status_code=503, detail="Environment not initialised")
107
  obs, reward, done, info = await _http_env.async_step(action)
108
  return StepResponse(obs, reward, done, info).to_dict()
109
 
110
  @application.get("/state", summary="Current episode state snapshot")
111
  async def state_endpoint() -> dict[str, Any]:
112
+ if _http_env is None:
113
+ raise HTTPException(status_code=503, detail="Environment not initialised")
114
  return await _http_env.async_state()
115
 
116
  @application.get("/health", summary="Liveness probe")
 
120
  @application.get("/schema", summary="Action and observation JSON schemas")
121
  async def schema() -> dict[str, Any]:
122
  return {
123
+ "action": ProbeAction.model_json_schema(),
124
+ "observation": ProbeObservation.model_json_schema(),
125
  "reward": RewardType.model_json_schema(),
126
  }
127
 
 
130
  @application.websocket("/ws")
131
  async def ws_endpoint(websocket: WebSocket) -> None:
132
  await websocket.accept()
133
+ env = ProbeEnvironment()
134
  try:
135
  while True:
136
  raw = await websocket.receive_text()
 
145
 
146
  elif cmd == "step":
147
  try:
148
+ action = ProbeAction(**msg["action"])
149
  except Exception as exc:
150
  await websocket.send_json({"type": "error", "detail": str(exc)})
151
  continue
 
177
  @application.get("/web", response_class=HTMLResponse, include_in_schema=False)
178
  async def web_ui() -> str:
179
  return """
180
+ <!doctype html><html><head><title>PRobe</title></head>
181
+ <body style="font-family:sans-serif;padding:2rem">
182
+ <h2>PRobe Environment</h2>
183
  <p>API docs: <a href="/docs">/docs</a></p>
184
  <p>Health: <a href="/health">/health</a></p>
185
  <p>Schema: <a href="/schema">/schema</a></p>
 
192
  app = _build_app()
193
 
194
 
195
+ def main(host: str = "0.0.0.0", port: int = 8000) -> None: # noqa: S104
 
196
  uvicorn.run(app, host=host, port=port)
197
 
198
 
server/grader.py CHANGED
@@ -1,27 +1,29 @@
1
  """
2
- Deterministic grader for CodeReviewAgent tasks.
3
 
4
  Scoring design
5
  --------------
6
  During the episode (ADD_COMMENT actions):
7
- +weight/total_weight * 0.60 per newly found issue (max 0.60 cumulative)
8
- -0.02 per false-positive (substantive comment, no match)
9
-
10
- Final (SUBMIT_REVIEW):
11
- +coverage * 0.20 weighted coverage bonus (max 0.20)
12
- +/-0.10 correct / incorrect final decision
13
- +efficiency * 0.10 step-efficiency bonus when coverage >= 60%
14
-
15
- Maximum achievable total: ~1.0 Minimum: 1.0
16
-
17
- Anti-exploit rule (enforced since v2):
18
- A comment MUST satisfy BOTH:
19
- 1. keyword_hit at least one issue keyword appears in the comment text
20
- 2. line_hit comment line_number is within ±LINE_TOLERANCE of the issue
21
- `category` match is NOT sufficient on its own. This closes the keyword-spam
22
- exploit where a model dumps all known keywords on a single line.
23
  """
24
 
 
 
25
  from typing import Any
26
 
27
  try:
@@ -29,15 +31,26 @@ try:
29
  except ImportError:
30
  from models import RewardType # type: ignore[no-redef]
31
 
32
- LINE_TOLERANCE: int = 3 # lines either side of an issue's declared range
 
 
 
 
 
 
 
 
 
33
 
34
 
35
  class CodeReviewGrader:
 
 
36
  def __init__(self, task: dict[str, Any]) -> None:
37
  self.task = task
38
  self.total_weight: float = sum(iss["weight"] for iss in task["issues"])
39
 
40
- # ── Per-comment scoring ───────────────────────────────────────────────
41
 
42
  def score_comment(
43
  self,
@@ -51,16 +64,19 @@ class CodeReviewGrader:
51
  Returns:
52
  (reward_delta, newly_found_issue_ids, component_breakdown)
53
 
54
- Match condition (BOTH required no shortcut):
55
- keyword_hit AND line_hit
 
56
  """
57
  if not comment:
58
  return 0.0, [], {}
59
 
60
  comment_lower = comment.lower()
 
 
 
61
  newly_found: list[str] = []
62
  issue_credit: float = 0.0
63
- false_positive_penalty: float = 0.0
64
 
65
  for issue in self.task["issues"]:
66
  if issue["id"] in already_found:
@@ -69,15 +85,15 @@ class CodeReviewGrader:
69
  keyword_hit = any(kw.lower() in comment_lower for kw in issue["keywords"])
70
  line_hit = self._line_in_range(line_number, issue["line_range"])
71
 
72
- # BOTH conditions required no cat_hit shortcut
73
- if keyword_hit and line_hit:
74
- credit = (issue["weight"] / self.total_weight) * 0.60
75
  newly_found.append(issue["id"])
76
  issue_credit += credit
77
 
78
- # Penalise substantive comments that matched nothing
79
- if not newly_found and comment and len(comment.strip()) > 15:
80
- false_positive_penalty = -0.02
 
81
 
82
  total = round(issue_credit + false_positive_penalty, 4)
83
  breakdown = {
@@ -86,7 +102,7 @@ class CodeReviewGrader:
86
  }
87
  return total, newly_found, breakdown
88
 
89
- # ── Terminal scoring ──────────────────────────────────────────────────
90
 
91
  def final_score(
92
  self,
@@ -98,9 +114,13 @@ class CodeReviewGrader:
98
  ) -> RewardType:
99
  """
100
  Compute the terminal reward on SUBMIT_REVIEW.
101
- Returns a fully typed RewardType with component breakdown.
 
 
 
102
  """
103
- unique_found = list(set(issues_found))
 
104
  found_weight = sum(
105
  iss["weight"]
106
  for iss in self.task["issues"]
@@ -108,12 +128,18 @@ class CodeReviewGrader:
108
  )
109
  coverage = found_weight / self.total_weight if self.total_weight > 0 else 0.0
110
 
111
- correct_decision = self.task.get("correct_decision", "request_changes")
112
- decision_score = 0.10 if review_decision == correct_decision else -0.10
 
 
113
 
114
  efficiency = max(0.0, 1.0 - step_count / max_steps)
115
- efficiency_bonus = round(0.10 * efficiency, 4) if coverage >= 0.60 else 0.0
116
- coverage_bonus = round(coverage * 0.20, 4)
 
 
 
 
117
 
118
  raw_total = coverage_bonus + decision_score + efficiency_bonus
119
  clamped = round(max(-1.0, min(1.0, raw_total)), 4)
@@ -123,23 +149,24 @@ class CodeReviewGrader:
123
  "decision_score": round(decision_score, 4),
124
  "efficiency_bonus": efficiency_bonus,
125
  }
 
126
  explanation = (
127
- f"Found {len(unique_found)}/{len(self.task['issues'])} issues "
128
  f"(weighted coverage {coverage:.0%}). "
129
- f"Decision '{review_decision}' was "
130
  f"{'correct' if review_decision == correct_decision else 'incorrect'}. "
131
  f"Used {step_count}/{max_steps} steps."
132
  )
133
  return RewardType(
134
  total=clamped,
135
  components=components,
136
- passed=review_decision == correct_decision and coverage >= 0.60,
137
  explanation=explanation,
138
  step=current_step,
139
  terminal=True,
140
  )
141
 
142
- # ── Helper ────────────────────────────────────────────────────────────
143
 
144
  @staticmethod
145
  def _line_in_range(
 
1
  """
2
+ Deterministic reward grader for PRobe tasks.
3
 
4
  Scoring design
5
  --------------
6
  During the episode (ADD_COMMENT actions):
7
+ + weight/total_weight * ISSUE_REWARD_POOL per newly found issue
8
+ - FALSE_POSITIVE_PENALTY per substantive unmatched comment
9
+
10
+ Terminal (SUBMIT_REVIEW):
11
+ + coverage * COVERAGE_POOL weighted coverage bonus (max COVERAGE_POOL)
12
+ +/- DECISION_REWARD correct / incorrect final decision
13
+ + efficiency * EFFICIENCY_POOL step-efficiency bonus when coverage >= COVERAGE_THRESHOLD
14
+
15
+ Maximum achievable total: ~1.0 Minimum: -1.0
16
+
17
+ Anti-exploit rules (v3):
18
+ A comment MUST satisfy ALL of:
19
+ 1. keyword_hit -- at least one issue keyword appears in the comment text
20
+ 2. line_hit -- comment line_number is within +/-LINE_TOLERANCE of the issue
21
+ 3. substantive -- comment is longer than MIN_COMMENT_LENGTH characters
22
+ This prevents keyword-spam, wide-net line fishing, and trivial one-word matches.
23
  """
24
 
25
+ from __future__ import annotations
26
+
27
  from typing import Any
28
 
29
  try:
 
31
  except ImportError:
32
  from models import RewardType # type: ignore[no-redef]
33
 
34
+ # -- Grading hyper-parameters ------------------------------------------------
35
+ LINE_TOLERANCE: int = 2 # lines either side of an issue's declared range
36
+ MIN_COMMENT_LENGTH: int = 15 # chars -- comments shorter than this earn no credit
37
+
38
+ ISSUE_REWARD_POOL: float = 0.60 # max cumulative credit from ADD_COMMENT
39
+ COVERAGE_POOL: float = 0.20 # terminal coverage bonus ceiling
40
+ DECISION_REWARD: float = 0.10 # +/- for correct/incorrect final decision
41
+ EFFICIENCY_POOL: float = 0.10 # max terminal efficiency bonus
42
+ COVERAGE_THRESHOLD: float = 0.60 # min coverage to unlock efficiency bonus
43
+ FALSE_POSITIVE_PENALTY: float = -0.05 # per substantive unmatched comment
44
 
45
 
46
  class CodeReviewGrader:
47
+ """Scores agent actions against a task's ground-truth issue list."""
48
+
49
  def __init__(self, task: dict[str, Any]) -> None:
50
  self.task = task
51
  self.total_weight: float = sum(iss["weight"] for iss in task["issues"])
52
 
53
+ # -- Per-comment scoring -------------------------------------------------
54
 
55
  def score_comment(
56
  self,
 
64
  Returns:
65
  (reward_delta, newly_found_issue_ids, component_breakdown)
66
 
67
+ Match condition (ALL required -- no shortcut)::
68
+
69
+ keyword_hit AND line_hit AND substantive
70
  """
71
  if not comment:
72
  return 0.0, [], {}
73
 
74
  comment_lower = comment.lower()
75
+ # Compute once -- used for both the credit path and the penalty path.
76
+ substantive: bool = len(comment.strip()) > MIN_COMMENT_LENGTH
77
+
78
  newly_found: list[str] = []
79
  issue_credit: float = 0.0
 
80
 
81
  for issue in self.task["issues"]:
82
  if issue["id"] in already_found:
 
85
  keyword_hit = any(kw.lower() in comment_lower for kw in issue["keywords"])
86
  line_hit = self._line_in_range(line_number, issue["line_range"])
87
 
88
+ if keyword_hit and line_hit and substantive:
89
+ credit = (issue["weight"] / self.total_weight) * ISSUE_REWARD_POOL
 
90
  newly_found.append(issue["id"])
91
  issue_credit += credit
92
 
93
+ # Penalise substantive comments that matched nothing.
94
+ false_positive_penalty: float = (
95
+ FALSE_POSITIVE_PENALTY if (not newly_found and substantive) else 0.0
96
+ )
97
 
98
  total = round(issue_credit + false_positive_penalty, 4)
99
  breakdown = {
 
102
  }
103
  return total, newly_found, breakdown
104
 
105
+ # -- Terminal scoring ----------------------------------------------------
106
 
107
  def final_score(
108
  self,
 
114
  ) -> RewardType:
115
  """
116
  Compute the terminal reward on SUBMIT_REVIEW.
117
+
118
+ Returns a fully-typed RewardType with a per-component breakdown.
119
+ De-duplicates issues_found with stable ordering so results are
120
+ deterministic regardless of insertion order.
121
  """
122
+ # sorted() gives stable ordering so results are reproducible.
123
+ unique_found: list[str] = sorted(set(issues_found))
124
  found_weight = sum(
125
  iss["weight"]
126
  for iss in self.task["issues"]
 
128
  )
129
  coverage = found_weight / self.total_weight if self.total_weight > 0 else 0.0
130
 
131
+ correct_decision: str = self.task.get("correct_decision", "request_changes")
132
+ decision_score = (
133
+ DECISION_REWARD if review_decision == correct_decision else -DECISION_REWARD
134
+ )
135
 
136
  efficiency = max(0.0, 1.0 - step_count / max_steps)
137
+ efficiency_bonus = (
138
+ round(EFFICIENCY_POOL * efficiency, 4)
139
+ if coverage >= COVERAGE_THRESHOLD
140
+ else 0.0
141
+ )
142
+ coverage_bonus = round(coverage * COVERAGE_POOL, 4)
143
 
144
  raw_total = coverage_bonus + decision_score + efficiency_bonus
145
  clamped = round(max(-1.0, min(1.0, raw_total)), 4)
 
149
  "decision_score": round(decision_score, 4),
150
  "efficiency_bonus": efficiency_bonus,
151
  }
152
+ total_issues = len(self.task["issues"])
153
  explanation = (
154
+ f"Found {len(unique_found)}/{total_issues} issues "
155
  f"(weighted coverage {coverage:.0%}). "
156
+ f"Decision {review_decision!r} was "
157
  f"{'correct' if review_decision == correct_decision else 'incorrect'}. "
158
  f"Used {step_count}/{max_steps} steps."
159
  )
160
  return RewardType(
161
  total=clamped,
162
  components=components,
163
+ passed=review_decision == correct_decision and coverage >= COVERAGE_THRESHOLD,
164
  explanation=explanation,
165
  step=current_step,
166
  terminal=True,
167
  )
168
 
169
+ # -- Helper --------------------------------------------------------------
170
 
171
  @staticmethod
172
  def _line_in_range(
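The reward constants introduced in this file can be checked by hand. The block below is a minimal standalone sketch, not the committed grader: `comment_credit` and `terminal_reward` are hypothetical helper names, and the arithmetic is copied from `score_comment` and `final_score` as they appear in this diff.

```python
# Constants copied from the new server/grader.py shown in this diff.
ISSUE_REWARD_POOL = 0.60
COVERAGE_POOL = 0.20
DECISION_REWARD = 0.10
EFFICIENCY_POOL = 0.10
COVERAGE_THRESHOLD = 0.60


def comment_credit(weight: float, total_weight: float) -> float:
    """Credit for a comment that matches exactly one new issue (hypothetical helper)."""
    return round(weight / total_weight * ISSUE_REWARD_POOL, 4)


def terminal_reward(
    coverage: float, decision_correct: bool, step_count: int, max_steps: int
) -> float:
    """Terminal reward on SUBMIT_REVIEW, mirroring final_score (hypothetical helper)."""
    coverage_bonus = round(coverage * COVERAGE_POOL, 4)
    decision_score = DECISION_REWARD if decision_correct else -DECISION_REWARD
    efficiency = max(0.0, 1.0 - step_count / max_steps)
    # The efficiency bonus is only unlocked once coverage reaches the threshold.
    efficiency_bonus = (
        round(EFFICIENCY_POOL * efficiency, 4)
        if coverage >= COVERAGE_THRESHOLD
        else 0.0
    )
    raw_total = coverage_bonus + decision_score + efficiency_bonus
    return round(max(-1.0, min(1.0, raw_total)), 4)


if __name__ == "__main__":
    # One issue of weight 1.0 in a task whose weights sum to 5.75.
    print(comment_credit(1.0, 5.75))
    # Full coverage, correct decision, half of the step budget used.
    print(terminal_reward(1.0, True, 10, 20))
```

The in-episode pool (0.60) plus the three terminal pools (0.20 + 0.10 + 0.10) is where the "~1.0" maximum in the module docstring comes from; the efficiency term only approaches its ceiling as the step count approaches zero.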
server/mutator.py ADDED
@@ -0,0 +1,123 @@
1
+ """
2
+ Code Mutation Engine -- makes the world dynamic.
3
+
4
+ Each call to ``mutate_task()`` returns a deep copy of a task with:
5
+
6
+ 1. Variable renaming -- one identifier swapped for a synonym so the agent
7
+ cannot memorise exact token strings between episodes.
8
+ 2. Line shifting -- an inert blank line inserted above the first issue,
9
+ shifting all issue line_ranges down by 1. The agent
10
+ must *read* the code each episode.
11
+ 3. Constant variance -- numeric literals (e.g. range limits, sleep durations)
12
+ are nudged +/-1 so the agent sees a fresh surface
13
+ without changing the underlying bug.
14
+
15
+ Mutation is fully deterministic given a seed, so training runs are
16
+ reproducible while still being different across episodes.
17
+
18
+ Design principle
19
+ ----------------
20
+ Mutations must NEVER change *whether* a bug exists or *which line category*
21
+ it falls in. They only change surface tokens and line positions so the agent
22
+ cannot exploit memorisation.
23
+ """
24
+
25
+ from __future__ import annotations
26
+
27
+ import copy
28
+ import random
29
+ import re
30
+ from typing import Any
31
+
32
+
33
+ # -- Variable synonym table --------------------------------------------------
34
+ # Maps original identifiers -> list of drop-in synonyms.
35
+ # Only single-token renames that do not affect semantics are listed.
36
+
37
+ _SYNONYMS: dict[str, list[str]] = {
38
+ "total": ["acc", "running_total", "summed"],
39
+ "numbers": ["values", "nums", "items"],
40
+ "result": ["output", "response", "ret"],
41
+ "data": ["payload", "records", "entries"],
42
+ "item": ["record", "entry", "obj"],
43
+ "items": ["records", "entries", "objects"],
44
+ "user": ["account", "principal", "member"],
45
+ "users": ["accounts", "principals", "members"],
46
+ "password": ["passwd", "secret", "credential"],
47
+ "username": ["user_name", "login", "uname"],
48
+ "command": ["cmd", "instruction", "directive"],
49
+ "filename": ["file_name", "fname", "path_name"],
50
+ "url": ["endpoint", "uri", "address"],
51
+ "attempt": ["try_num", "iteration", "retry_idx"],
52
+ "counter": ["count", "tally", "n"],
53
+ "session": ["conn", "http_session", "client"],
54
+ "results": ["findings", "collected", "gathered"],
55
+ "cache": ["store", "lookup", "memo"],
56
+ "transformed": ["processed", "mapped", "converted"],
57
+ }
58
+
59
+
60
+ def mutate_task(base_task: dict[str, Any], seed: int) -> dict[str, Any]:
61
+ """
62
+ Return a mutated deep-copy of *base_task* using *seed* for reproducibility.
63
+
64
+ The returned task is structurally identical to the original -- same keys,
65
+ same issue ids, same categories -- but with surface-level code changes and
66
+ adjusted line_ranges.
67
+ """
68
+ rng = random.Random(seed)
69
+ task: dict[str, Any] = copy.deepcopy(base_task)
70
+
71
+ code: str = task["code"]
72
+ issues: list[dict[str, Any]] = task["issues"]
73
+
74
+ # -- 1. Variable rename --------------------------------------------------
75
+ candidates = [orig for orig in _SYNONYMS if re.search(rf"\b{orig}\b", code)]
76
+ if candidates:
77
+ original = rng.choice(candidates)
78
+ replacement = rng.choice(_SYNONYMS[original])
79
+ # Whole-word replace to avoid partial matches.
80
+ code = re.sub(rf"\b{original}\b", replacement, code)
81
+ # Keep the keyword list in sync so the grader still matches.
82
+ for issue in issues:
83
+ issue["keywords"] = [
84
+ replacement if kw == original else kw
85
+ for kw in issue["keywords"]
86
+ ]
87
+
88
+ # -- 2. Line shift -- insert one blank line before the first issue --------
89
+ if issues:
90
+ first_line = min(iss["line_range"][0] for iss in issues)
91
+ # first_line is 1-based, so index first_line - 2 is the line directly above the first issue.
92
+ insert_before = max(0, first_line - 2)
93
+ lines = code.split("\n")
94
+ lines.insert(insert_before, "")
95
+ code = "\n".join(lines)
96
+ # Shift every issue line_range down by 1 to match the new positions.
97
+ for issue in issues:
98
+ start, end = issue["line_range"]
99
+ issue["line_range"] = (start + 1, end + 1)
100
+
101
+ # -- 3. Constant variance -- nudge one numeric literal -------------------
102
+ # Skip any literal whose digits also appear after a '#' on any line up to and
103
+ # including the match, so annotated line references in comments stay intact.
104
+ numeric_matches = [
105
+ m
106
+ for m in re.finditer(r"\b([2-9]|[1-9]\d+)\b", code)
107
+ if not re.search(r"#[^\n]*" + re.escape(m.group()), code[: m.end()])
108
+ ]
109
+ if numeric_matches:
110
+ chosen = rng.choice(numeric_matches)
111
+ original_val = int(chosen.group())
112
+ delta = rng.choice([-1, 1])
113
+ new_val = max(2, original_val + delta) # never go below 2
114
+ code = code[: chosen.start()] + str(new_val) + code[chosen.end() :]
115
+
116
+ task["code"] = code
117
+ task["issues"] = issues
118
+ # Tag the task so the environment can record mutation metadata.
119
+ task["_mutation_seed"] = seed
120
+ return task
121
+
122
+
123
+ __all__ = ["mutate_task"]
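The line-shift step is the part of the mutator that the grader depends on most directly, so its invariant is worth spelling out. The following is a simplified standalone sketch of step 2 only; `shift_issue_lines` is a hypothetical name, and it takes bare `(start, end)` tuples rather than the full issue dicts used by `mutate_task`.

```python
def shift_issue_lines(
    code: str, line_ranges: list[tuple[int, int]]
) -> tuple[str, list[tuple[int, int]]]:
    """Insert one blank line above the first issue and shift every range by 1.

    Simplified re-statement of step 2 of mutate_task (hypothetical helper).
    """
    first_line = min(start for start, _ in line_ranges)
    # first_line is 1-based; clamp so an issue on line 1 still gets a line above it.
    insert_before = max(0, first_line - 2)
    lines = code.split("\n")
    lines.insert(insert_before, "")
    shifted = [(start + 1, end + 1) for start, end in line_ranges]
    return "\n".join(lines), shifted


if __name__ == "__main__":
    source = "a = 1\nb = 2\nc = a / 0\nd = 4"
    mutated, ranges = shift_issue_lines(source, [(3, 3)])
    start = ranges[0][0]
    # The shifted range still points at the same source text.
    print(mutated.split("\n")[start - 1])
```

Because the blank line always lands above the earliest issue, every declared range moves by exactly one, which is what `test_line_shift_applied` asserts.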
server/tasks.py CHANGED
@@ -716,4 +716,228 @@ def admin_panel():
716
  ],
717
  "correct_decision": "request_changes",
718
  },
719
  ]
 
716
  ],
717
  "correct_decision": "request_changes",
718
  },
719
+
720
+ # ── Task 6: Causal Chain — Secrets Leak Investigation ────────────────────
721
+ #
722
+ # WORLD-MODELING DESIGN
723
+ # ─────────────────────
724
+ # This task implements a *causal observation chain*:
725
+ #
726
+ # Phase 1 (lines visible from the start)
727
+ # The agent sees a Flask service with two obvious surface issues.
728
+ # Finding issue A (hardcoded JWT secret) *unlocks* Phase 2 context.
729
+ #
730
+ # Phase 2 (revealed after issue A is found)
731
+ # A hidden DB schema snippet is appended to the observation, exposing
732
+ # a privilege-escalation path that only makes sense once the secret
733
+ # leak is understood. This rewards genuine causal reasoning:
734
+ # "the leaked secret lets an attacker forge admin tokens → they can
735
+ # reach the unguarded /admin/promote endpoint → full privilege
736
+ # escalation."
737
+ #
738
+ # Phase 3 (revealed after issue B is found)
739
+ # After the agent flags the missing rate-limit, the server's nginx
740
+ # config fragment is revealed, showing that /auth is also missing
741
+ # the global IP-allowlist — confirming the attack surface is wider
742
+ # than the code alone suggests.
743
+ #
744
+ # The chained field `"unlocks"` in each issue entry names the context_key
745
+ # that the environment injects into the observation when that issue is found.
746
+ # The environment layer reads this and appends the hint to `context_hints`.
747
+ {
748
+ "id": 6,
749
+ "name": "Causal Secrets Leak Investigation",
750
+ "difficulty": "hard",
751
+ "file_name": "auth_service.py",
752
+ "description": (
753
+ "Review this authentication service carefully. "
754
+ "Some issues unlock additional context about the wider system — "
755
+ "read every new hint you receive before continuing. "
756
+ "Use get_context on any suspicious line to reveal surrounding detail. "
757
+ "Identify all issues, then submit your review."
758
+ ),
759
+ "max_steps": 35,
760
+ "code": """\
761
+ import jwt
762
+ import sqlite3
763
+ import time
764
+ from flask import Flask, request, jsonify
765
+
766
+ app = Flask(__name__)
767
+
768
+ # ---- configuration ----------------------------------------------------------
769
+ JWT_SECRET = "super-secret-jwt-key-do-not-share" # line 9: hardcoded secret
770
+ JWT_ALGORITHM = "HS256"
771
+
772
+ # ---- helpers ----------------------------------------------------------------
773
+
774
+ def create_token(user_id: int, role: str) -> str:
775
+ payload = {
776
+ "sub": user_id,
777
+ "role": role,
778
+ "exp": time.time() + 3600,
779
+ }
780
+ return jwt.encode(payload, JWT_SECRET, algorithm=JWT_ALGORITHM)
781
+
782
+
783
+ def verify_token(token: str) -> dict:
784
+ # line 24: "none" is in the accepted algorithms list; enables the alg=none attack if lib < 2.0
785
+ return jwt.decode(token, JWT_SECRET, algorithms=["HS256", "none"])
786
+
787
+
788
+ # ---- routes -----------------------------------------------------------------
789
+
790
+ @app.route("/auth", methods=["POST"])
791
+ def authenticate():
792
+ \"\"\"Issue a JWT for valid credentials.\"\"\"
793
+ body = request.get_json(force=True)
794
+ uname = body.get("username", "")
795
+ pwd = body.get("password", "")
796
+ # line 36: no rate limiting → brute-force possible
797
+ conn = sqlite3.connect("users.db")
798
+ cursor = conn.cursor()
799
+ # line 39: f-string SQL → injection
800
+ cursor.execute(f"SELECT id, role FROM users WHERE username='{uname}' AND password='{pwd}'")
801
+ row = cursor.fetchone()
802
+ conn.close()
803
+ if row:
804
+ return jsonify({"token": create_token(row[0], row[1])})
805
+ return jsonify({"error": "invalid credentials"}), 401
806
+
807
+
808
+ @app.route("/admin/promote", methods=["POST"])
809
+ def promote_user():
810
+ \"\"\"Promote a user to admin — JWT required.\"\"\"
811
+ token = request.headers.get("Authorization", "").replace("Bearer ", "")
812
+ try:
813
+ claims = verify_token(token)
814
+ except Exception:
815
+ return jsonify({"error": "unauthorized"}), 401
816
+ # line 56: role taken directly from token; no DB re-validation
817
+ if claims.get("role") == "admin":
818
+ target = request.json.get("user_id")
819
+ conn = sqlite3.connect("users.db")
820
+ conn.execute(f"UPDATE users SET role='admin' WHERE id={target}") # line 60: injection
821
+ conn.commit()
822
+ conn.close()
823
+ return jsonify({"promoted": target})
824
+ return jsonify({"error": "forbidden"}), 403
825
+ """,
826
+ # ── Ground-truth issues ───────────────────────────────────────────
827
+ "issues": [
828
+ {
829
+ "id": "hardcoded_jwt_secret",
830
+ "description": "JWT_SECRET is hard-coded; anyone with source access can forge tokens",
831
+ "line_range": (9, 9),
832
+ "keywords": [
833
+ "hardcoded", "hard-coded", "jwt_secret", "secret", "jwt",
834
+ "environment variable", "env var", "os.environ", "forge",
835
+ "hardcode", "token secret",
836
+ ],
837
+ "category": "security",
838
+ "severity": "critical",
839
+ "weight": 1.0,
840
+ # Finding this issue unlocks the DB schema context hint
841
+ "unlocks": "db_schema_hint",
842
+ },
843
+ {
844
+ "id": "jwt_none_algorithm",
845
+ "description": (
846
+ "jwt.decode accepts 'none' algorithm — attacker can craft an "
847
+ "unsigned token and bypass signature verification"
848
+ ),
849
+ "line_range": (24, 25),
850
+ "keywords": [
851
+ "none", "algorithm", "alg", "unsigned", "bypass",
852
+ "jwt", "signature", "verify", "none algorithm",
853
+ ],
854
+ "category": "security",
855
+ "severity": "critical",
856
+ "weight": 1.0,
857
+ },
858
+ {
859
+ "id": "no_rate_limit",
860
+ "description": "/auth endpoint has no rate limiting — susceptible to brute-force",
861
+ "line_range": (33, 34),
862
+ "keywords": [
863
+ "rate limit", "rate-limit", "brute force", "brute-force",
864
+ "throttle", "throttling", "flood", "limit", "attempts",
865
+ ],
866
+ "category": "security",
867
+ "severity": "error",
868
+ "weight": 0.75,
869
+ # Finding this unlocks the nginx config hint
870
+ "unlocks": "nginx_config_hint",
871
+ },
872
+ {
873
+ "id": "sql_injection_auth",
874
+ "description": "f-string interpolation in SQL query on /auth → injection",
875
+ "line_range": (39, 40),
876
+ "keywords": [
877
+ "sql injection", "sql", "injection", "f-string", "parameterized",
878
+ "sanitize", "escape", "prepared statement", "placeholder",
879
+ ],
880
+ "category": "security",
881
+ "severity": "critical",
882
+ "weight": 1.0,
883
+ },
884
+ {
885
+ "id": "role_from_token_only",
886
+ "description": (
887
+ "Role is read directly from the JWT payload without re-checking the DB — "
888
+ "a forged or stale token grants permanent privilege"
889
+ ),
890
+ "line_range": (56, 57),
891
+ "keywords": [
892
+ "role", "token", "db", "database", "re-check", "revalidat",
893
+ "stale", "privilege", "escalation", "claims", "payload",
894
+ "not verified", "trust",
895
+ ],
896
+ "category": "security",
897
+ "severity": "critical",
898
+ "weight": 1.0,
899
+ },
900
+ {
901
+ "id": "sql_injection_promote",
902
+ "description": "f-string SQL in /admin/promote UPDATE query → injection via user_id",
903
+ "line_range": (60, 60),
904
+ "keywords": [
905
+ "sql injection", "sql", "injection", "f-string", "parameterized",
906
+ "prepared statement", "placeholder", "update", "second order",
907
+ ],
908
+ "category": "security",
909
+ "severity": "critical",
910
+ "weight": 1.0,
911
+ },
912
+ ],
913
+ "correct_decision": "request_changes",
914
+ # ── Causal context hints — revealed progressively ─────────────────
915
+ # Each value is injected into the observation once the triggering
916
+ # issue is found. The agent must incorporate this new information
917
+ # into its ongoing world model.
918
+ "context_hints": {
919
+ "db_schema_hint": (
920
+ "=== UNLOCKED: Database Schema (users.db) ===\n"
921
+ " CREATE TABLE users (\n"
922
+ " id INTEGER PRIMARY KEY,\n"
923
+ " username TEXT UNIQUE NOT NULL,\n"
924
+ " password TEXT NOT NULL, -- stored as plaintext!\n"
925
+ " role TEXT DEFAULT 'viewer' -- 'viewer' | 'editor' | 'admin'\n"
926
+ " );\n"
927
+ "NOTE: The /admin/promote endpoint can elevate any user to 'admin'. "
928
+ "Combined with a forged JWT (from the leaked secret), an attacker "
929
+ "can reach this endpoint with admin claims and promote themselves."
930
+ ),
931
+ "nginx_config_hint": (
932
+ "=== UNLOCKED: nginx reverse-proxy config (nginx.conf excerpt) ===\n"
933
+ " location /auth {\n"
934
+ " proxy_pass http://auth_service:5000;\n"
935
+ " # no ip_allowlist, no limit_req_zone\n"
936
+ " }\n"
937
+ "NOTE: The nginx layer adds no rate-limiting or IP filtering "
938
+ "in front of /auth, confirming the brute-force surface is "
939
+ "fully exposed to the internet."
940
+ ),
941
+ },
942
+ },
943
  ]
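The `unlocks` / `context_hints` pairing in Task 6 implies a small piece of bookkeeping in the environment layer. That code is not part of this excerpt, so the block below is only a sketch under assumptions: `unlock_hints` is a hypothetical function, and the `hints_unlocked` set and `context_hints` list mirror the episode fields that the tests in this commit read.

```python
from typing import Any


def unlock_hints(
    task: dict[str, Any],
    newly_found: list[str],
    hints_unlocked: set[str],
    context_hints: list[str],
) -> None:
    """Append the hint for each newly found issue that declares an 'unlocks' key.

    Hypothetical sketch; the real environment helper may differ.
    """
    available = task.get("context_hints", {})
    for issue in task["issues"]:
        key = issue.get("unlocks")
        if issue["id"] not in newly_found or not key:
            continue
        # Each hint is revealed at most once per episode.
        if key in available and key not in hints_unlocked:
            hints_unlocked.add(key)
            context_hints.append(available[key])


if __name__ == "__main__":
    # Toy task with the same shape as Task 6 (contents are illustrative only).
    toy_task = {
        "issues": [
            {"id": "hardcoded_jwt_secret", "unlocks": "db_schema_hint"},
            {"id": "jwt_none_algorithm"},
        ],
        "context_hints": {"db_schema_hint": "=== UNLOCKED: Database Schema ==="},
    }
    unlocked: set[str] = set()
    hints: list[str] = []
    unlock_hints(toy_task, ["hardcoded_jwt_secret"], unlocked, hints)
    unlock_hints(toy_task, ["hardcoded_jwt_secret"], unlocked, hints)
    print(hints)
```

Tracking unlocked keys in a set keeps the operation idempotent, which matches the behaviour `test_unlock_fires_only_once` expects.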
tests/__init__.py ADDED
File without changes
tests/__pycache__/__init__.cpython-314.pyc ADDED
Binary file (162 Bytes)
 
tests/__pycache__/test_dynamic_world.cpython-314-pytest-9.0.3.pyc ADDED
Binary file (48.8 kB)
 
tests/__pycache__/test_grader.cpython-314-pytest-9.0.3.pyc ADDED
Binary file (47.6 kB)
 
tests/test_dynamic_world.py ADDED
@@ -0,0 +1,344 @@
1
+ """
2
+ Tests for the dynamic world features:
3
+ - server/mutator.py (code mutation engine)
4
+ - Task 6 (causal chain / progressive observation)
5
+ - GET_CONTEXT action (line-context probing)
6
+ - Causal unlock chain (context_hints injected into observation)
7
+ """
8
+
9
+ import sys
10
+ import os
11
+ import copy
12
+
13
+ import pytest
14
+
15
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
16
+
17
+ from server.mutator import mutate_task
18
+ from server.tasks import TASKS
19
+ from server.grader import CodeReviewGrader
20
+
21
+ # ---------------------------------------------------------------------------
22
+ # Helpers
23
+ # ---------------------------------------------------------------------------
24
+
25
+ TASK6 = TASKS[6] # causal chain task
26
+
27
+
28
+ def _grader(task):
29
+ return CodeReviewGrader(task)
30
+
31
+
32
+ # ===========================================================================
33
+ # MUTATOR TESTS
34
+ # ===========================================================================
35
+
36
+ class TestMutator:
37
+
38
+ def test_returns_deep_copy(self):
39
+ """mutate_task must not modify the original TASKS entry."""
40
+ original_code = TASKS[1]["code"]
41
+ _ = mutate_task(TASKS[1], seed=0)
42
+ assert TASKS[1]["code"] == original_code
43
+
44
+ def test_mutation_seed_tag(self):
45
+ """Mutated task carries _mutation_seed matching the supplied seed."""
46
+ t = mutate_task(TASKS[1], seed=42)
47
+ assert t["_mutation_seed"] == 42
48
+
49
+ def test_different_seeds_differ(self):
50
+ """Mutation must change the code for at least one of two different seeds."""
51
+ t1 = mutate_task(TASKS[1], seed=0)
52
+ t2 = mutate_task(TASKS[1], seed=1)
53
+ # The blank-line insert alone guarantees each mutant differs from the base task.
54
+ assert t1["code"] != TASKS[1]["code"] or t2["code"] != TASKS[1]["code"]
55
+
56
+ def test_same_seed_is_deterministic(self):
57
+ """Same seed must always produce identical output."""
58
+ t1 = mutate_task(TASKS[2], seed=99)
59
+ t2 = mutate_task(TASKS[2], seed=99)
60
+ assert t1["code"] == t2["code"]
61
+ assert t1["issues"] == t2["issues"]
62
+
63
+ def test_line_shift_applied(self):
64
+ """Line shift must move every issue line_range down by exactly 1."""
65
+ original = copy.deepcopy(TASKS[1])
66
+ mutated = mutate_task(TASKS[1], seed=7)
67
+ orig_ranges = [iss["line_range"] for iss in original["issues"]]
68
+ mut_ranges = [iss["line_range"] for iss in mutated["issues"]]
69
+ for orig_r, mut_r in zip(orig_ranges, mut_ranges):
70
+ assert mut_r[0] == orig_r[0] + 1
71
+ assert mut_r[1] == orig_r[1] + 1
72
+
73
+ def test_issue_count_preserved(self):
74
+ """Mutation must not add or remove issues."""
75
+ for task in TASKS[:6]: # skip task 6 here, tested separately
76
+ mutated = mutate_task(task, seed=5)
77
+ assert len(mutated["issues"]) == len(task["issues"])
78
+
79
+ def test_issue_ids_preserved(self):
80
+ """Issue ids must be unchanged after mutation."""
81
+ original_ids = [i["id"] for i in TASKS[2]["issues"]]
82
+ mutated_ids = [i["id"] for i in mutate_task(TASKS[2], seed=3)["issues"]]
83
+ assert original_ids == mutated_ids
84
+
85
+ def test_grader_still_matches_after_mutation(self):
86
+ """
87
+ The grader must still award credit after mutation.
88
+ Use the off-by-one issue in task 1 — keyword 'range' is always present
89
+ and line_range shifts by exactly 1.
90
+ """
91
+ mutated = mutate_task(TASKS[1], seed=10)
92
+ g = _grader(mutated)
93
+ off_by_one = next(i for i in mutated["issues"] if i["id"] == "off_by_one")
94
+ target_line = off_by_one["line_range"][0]
95
+
96
+ score, found, _ = g.score_comment(
97
+ line_number=target_line,
98
+ comment="off-by-one error: range(len + 1) causes IndexError on the last iteration",
99
+ already_found=[],
100
+ )
101
+ assert "off_by_one" in found
102
+ assert score > 0.0
103
+
104
+ def test_correct_decision_preserved(self):
105
+ """correct_decision must be unchanged by mutation."""
106
+ for task in TASKS:
107
+ mutated = mutate_task(task, seed=1)
108
+ assert mutated["correct_decision"] == task["correct_decision"]
109
+
110
+
111
+ # ===========================================================================
112
+ # TASK 6 STRUCTURE TESTS
113
+ # ===========================================================================
114
+
115
+ class TestTask6Structure:
116
+
117
+ def test_task6_exists(self):
118
+ assert len(TASKS) >= 7, "Task 6 (causal chain) must exist in TASKS"
119
+
120
+ def test_task6_has_context_hints(self):
121
+ assert "context_hints" in TASK6
122
+ assert len(TASK6["context_hints"]) >= 2
123
+
124
+ def test_task6_unlock_keys_present(self):
125
+ """Every 'unlocks' key in an issue must exist in context_hints dict."""
126
+ hints = TASK6["context_hints"]
127
+ for issue in TASK6["issues"]:
128
+ key = issue.get("unlocks")
129
+ if key:
130
+ assert key in hints, f"Issue {issue['id']} unlocks '{key}' but key not in context_hints"
131
+
132
+ def test_task6_total_weight_positive(self):
133
+ g = _grader(TASK6)
134
+ assert g.total_weight > 0.0
135
+
136
+ def test_task6_has_chained_issues(self):
137
+ """At least two issues must have an 'unlocks' field."""
138
+ unlocking = [i for i in TASK6["issues"] if i.get("unlocks")]
139
+ assert len(unlocking) >= 2
140
+
141
+ def test_task6_correct_decision(self):
142
+ assert TASK6["correct_decision"] == "request_changes"
143
+
144
+
145
+ # ===========================================================================
146
+ # CAUSAL UNLOCK CHAIN TESTS (environment layer)
147
+ # ===========================================================================
148
+
149
+ class TestCausalUnlock:
150
+ """
151
+ Test the unlock mechanic via the environment's _unlock_causal_hints helper
152
+ and _handle_add_comment pipeline.
153
+ """
154
+
155
+ def _make_env(self):
156
+ """Return a fresh environment instance fast-forwarded to task 6."""
157
+ import asyncio
158
+ try:
159
+ from server.CodeReviewAgent_environment import ProbeEnvironment
160
+ except ImportError:
161
+ from CodeReviewAgent_environment import ProbeEnvironment # type: ignore
162
+
163
+ env = ProbeEnvironment()
164
+ # force-set episode to task 6 (bypass cycling for test speed)
165
+ from server.mutator import mutate_task as _mt
166
+ task = _mt(TASK6, seed=0)
167
+ from server.grader import CodeReviewGrader as _G
168
+ env._grader = _G(task)
169
+ env._ep = env._fresh_episode(task)
170
+ return env
171
+
172
+ def test_no_hints_at_start(self):
173
+ env = self._make_env()
174
+ assert env._ep["context_hints"] == []
175
+
176
+ def test_unlock_fires_after_finding_trigger_issue(self):
177
+ """Finding hardcoded_jwt_secret must append db_schema_hint."""
178
+ env = self._make_env()
179
+ jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
180
+ target_line = jwt_issue["line_range"][0]
181
+
182
+ env._step_count = 1
183
+ reward = env._handle_add_comment(
184
+ type("A", (), {
185
+ "line_number": target_line,
186
+ "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable to prevent token forgery",
187
+ "severity": type("S", (), {"value": "critical"})(),
188
+ "category": type("C", (), {"value": "security"})(),
189
+ })()
190
+ )
191
+ assert "hardcoded_jwt_secret" in env._ep["issues_found"]
192
+ assert len(env._ep["context_hints"]) == 1
193
+ assert "db_schema_hint" in env._ep["hints_unlocked"]
194
+ assert "Database Schema" in env._ep["context_hints"][0]
195
+
196
+ def test_unlock_fires_only_once(self):
197
+ """The same hint must not be appended twice even if issue found again."""
198
+ env = self._make_env()
199
+ jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
200
+ target_line = jwt_issue["line_range"][0]
201
+
202
+ for _ in range(3):
203
+ env._step_count += 1
204
+ env._handle_add_comment(
205
+ type("A", (), {
206
+ "line_number": target_line,
207
+ "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable",
208
+ "severity": type("S", (), {"value": "critical"})(),
209
+ "category": type("C", (), {"value": "security"})(),
210
+ })()
211
+ )
212
+ assert len(env._ep["context_hints"]) == 1
213
+
214
+ def test_second_unlock_fires_independently(self):
215
+ """Finding no_rate_limit must append nginx_config_hint independently."""
216
+ env = self._make_env()
217
+ rate_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "no_rate_limit")
218
+ target_line = rate_issue["line_range"][0]
219
+
220
+ env._step_count = 1
221
+ env._handle_add_comment(
222
+ type("A", (), {
223
+ "line_number": target_line,
224
+ "comment": "No rate limiting on /auth endpoint — susceptible to brute-force attacks",
225
+ "severity": type("S", (), {"value": "error"})(),
226
+ "category": type("C", (), {"value": "security"})(),
227
+ })()
228
+ )
229
+ assert "nginx_config_hint" in env._ep["hints_unlocked"]
230
+ assert any("nginx" in h.lower() for h in env._ep["context_hints"])
231
+
232
+ def test_both_unlocks_can_fire_in_same_episode(self):
233
+ """Both hints can be unlocked within one episode."""
234
+ env = self._make_env()
235
+ task = env._ep["task"]
236
+
237
+ jwt_issue = next(i for i in task["issues"] if i["id"] == "hardcoded_jwt_secret")
238
+ rate_issue = next(i for i in task["issues"] if i["id"] == "no_rate_limit")
239
+
240
+ for step, (issue, kw) in enumerate([
241
+ (jwt_issue, "JWT_SECRET is hardcoded — must be loaded from environment variable to prevent forgery"),
242
+ (rate_issue, "No rate limiting on /auth endpoint — susceptible to brute-force attacks"),
243
+ ], start=1):
244
+ env._step_count = step
245
+ env._handle_add_comment(
246
+ type("A", (), {
247
+ "line_number": issue["line_range"][0],
248
+ "comment": kw,
249
+ "severity": type("S", (), {"value": "critical"})(),
250
+ "category": type("C", (), {"value": "security"})(),
251
+ })()
252
+ )
253
+
254
+ assert len(env._ep["context_hints"]) == 2
255
+ assert env._ep["hints_unlocked"] == {"db_schema_hint", "nginx_config_hint"}
256
+
257
+ def test_context_hints_appear_in_observation(self):
258
+ """context_hints list must be non-empty in the observation after an unlock."""
259
+ env = self._make_env()
260
+ jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
261
+ env._step_count = 1
262
+ env._handle_add_comment(
263
+ type("A", (), {
264
+ "line_number": jwt_issue["line_range"][0],
265
+ "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable",
266
+ "severity": type("S", (), {"value": "critical"})(),
267
+ "category": type("C", (), {"value": "security"})(),
268
+ })()
269
+ )
270
+ obs = env._make_obs(reward=0.0, done=False)
271
+ assert len(obs.context_hints) == 1
272
+ assert "Database Schema" in obs.context_hints[0]
273
+
274
+
275
+ # ===========================================================================
276
+ # GET_CONTEXT ACTION TESTS
277
+ # ===========================================================================
278
+
279
+ class TestGetContext:
280
+
281
+ def _make_env(self):
282
+ try:
283
+ from server.CodeReviewAgent_environment import ProbeEnvironment
284
+ except ImportError:
285
+ from CodeReviewAgent_environment import ProbeEnvironment # type: ignore
286
+ from server.mutator import mutate_task as _mt
287
+ from server.grader import CodeReviewGrader as _G
288
+ env = ProbeEnvironment()
289
+ task = _mt(TASKS[1], seed=0)
290
+ env._grader = _G(task)
291
+ env._ep = env._fresh_episode(task)
292
+ return env
293
+
294
+ def test_get_context_near_issue_no_penalty(self):
295
+ """Probing a line near a real issue must cost 0.0."""
296
+ env = self._make_env()
297
+ issue_line = env._ep["task"]["issues"][0]["line_range"][0]
298
+ env._step_count = 1
299
+ reward = env._handle_get_context(
300
+ type("A", (), {"line_number": issue_line})()
301
+ )
302
+ assert reward.total == 0.0
303
+ assert reward.passed is True
304
+
305
+ def test_get_context_far_from_issue_costs_penalty(self):
306
+ """Probing a line far from any issue must cost -0.01."""
307
+ env = self._make_env()
308
+ env._step_count = 1
309
+ reward = env._handle_get_context(
310
+ type("A", (), {"line_number": 999})()
311
+ )
312
+ assert reward.total == pytest.approx(-0.01, abs=0.001)
313
+ assert reward.passed is False
314
+
315
+ def test_get_context_no_line_number_penalised(self):
316
+ """GET_CONTEXT with no line_number must return -0.02."""
317
+ env = self._make_env()
318
+ env._step_count = 1
319
+ reward = env._handle_get_context(
320
+ type("A", (), {"line_number": None})()
321
+ )
322
+ assert reward.total == pytest.approx(-0.02, abs=0.001)
323
+
324
+ def test_get_context_snippet_stored_in_history(self):
325
+ """The context probe must be recorded in review_comments."""
326
+ env = self._make_env()
327
+ env._step_count = 1
328
+ env._handle_get_context(
329
+ type("A", (), {"line_number": 4})()
330
+ )
331
+ probes = [c for c in env._ep["review_comments"] if c.get("type") == "context_probe"]
332
+ assert len(probes) == 1
333
+ assert probes[0]["line"] == 4
334
+ assert "context" in probes[0]
335
+
336
+ def test_get_context_snippet_contains_requested_line(self):
337
+ """The returned snippet must reference the requested line number."""
338
+ env = self._make_env()
339
+ env._step_count = 1
340
+ reward = env._handle_get_context(
341
+ type("A", (), {"line_number": 4})()
342
+ )
343
+ # explanation contains the formatted snippet with line numbers
344
+ assert "4:" in reward.explanation or "4 :" in reward.explanation
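The three GET_CONTEXT tests above pin down a small reward rule: 0.0 for a probe near a real issue, -0.01 for a probe far from every issue, and -0.02 when no line number is given. A minimal sketch of that rule follows. The function name `probe_reward`, the `issue_ranges` argument, and the `tolerance` window of 2 lines are assumptions for illustration; the actual `_handle_get_context` implementation is not shown in this part of the diff.

```python
from typing import Optional, Sequence, Tuple


def probe_reward(
    line_number: Optional[int],
    issue_ranges: Sequence[Tuple[int, int]],
    tolerance: int = 2,  # assumed window; the real value is not shown here
) -> float:
    """Hypothetical reward for a GET_CONTEXT probe, mirroring the tests."""
    if line_number is None:
        # Malformed probe: no line number supplied.
        return -0.02
    for start, end in issue_ranges:
        # A probe counts as "near" when it falls inside the widened range.
        if start - tolerance <= line_number <= end + tolerance:
            return 0.0
    # Probe landed far from every known issue.
    return -0.01
```

Keeping the off-target cost smaller than the malformed-probe cost means exploration stays cheap while clearly invalid actions are still discouraged.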
tests/test_grader.py ADDED
@@ -0,0 +1,397 @@
1
+ """
2
+ Tests for CodeReviewGrader — validates all 5 RL attack scenarios plus
3
+ edge cases for the three anti-exploit fixes made in grader.py.
4
+
5
+ Attack targets (from the task spec):
6
+ Lazy / vague output → 0.00 – 0.15
7
+ Average output → 0.30 – 0.50
8
+ Good output → 0.60 – 0.80
9
+ Perfect output → 0.85 – 1.00
10
+ Wrong bug reported → penalty / 0.00
11
+
12
+ Coverage:
13
+ 1. Lazy attack
14
+ 2. Vague attack
15
+ 3. Wrong-bug / hallucination attack
16
+ 4. Perfect output
17
+ 5. Base-model (average) output
18
+ 6. LINE_TOLERANCE boundary (fix 1)
19
+ 7. Minimum comment length guard (fix 2)
20
+ 8. False-positive penalty value (fix 3)
21
+ 9. final_score — full coverage + correct decision
22
+ 10. final_score — zero coverage + wrong decision
23
+ 11. final_score — partial coverage
24
+ 12. already_found deduplication
25
+ 13. None / empty comment guard
26
+ 14. Task total weights are positive
27
+ """
28
+
29
+ import sys
30
+ import os
31
+
32
+ import pytest
33
+
34
+ # Ensure the project root (containing the `server` package) is on the path
35
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
36
+
37
+ from server.grader import CodeReviewGrader, LINE_TOLERANCE
38
+ from server.tasks import TASKS
39
+
40
+
41
+ # ── Fixtures ──────────────────────────────────────────────────────────────────
42
+
43
+ @pytest.fixture
44
+ def task0():
45
+ """Ultra-easy bootstrap task (2 issues, equal weight 1.0 each)."""
46
+ return TASKS[0]
47
+
48
+
49
+ @pytest.fixture
50
+ def task1():
51
+ """Easy task (3 issues)."""
52
+ return TASKS[1]
53
+
54
+
55
+ @pytest.fixture
56
+ def grader0(task0):
57
+ return CodeReviewGrader(task0)
58
+
59
+
60
+ @pytest.fixture
61
+ def grader1(task1):
62
+ return CodeReviewGrader(task1)
63
+
64
+
65
+ # ── Sanity ────────────────────────────────────────────────────────────────────
66
+
67
+ def test_line_tolerance_value():
68
+ """LINE_TOLERANCE must be 2 after the anti-exploit fix."""
69
+ assert LINE_TOLERANCE == 2
70
+
71
+
72
+ # ── 1. Lazy attack ────────────────────────────────────────────────────────────
73
+
74
+ def test_lazy_attack_no_credit(grader0):
75
+ """Generic comment with no matching keyword earns only false-positive penalty."""
76
+ score, found, _ = grader0.score_comment(
77
+ line_number=4,
78
+ # deliberately avoids all task-0 keywords (off-by-one, index, range,
79
+ # bug, security, password, credential, hardcoded, env, secret, etc.)
80
+ comment="This function could probably be improved with some refactoring.",
81
+ already_found=[],
82
+ )
83
+ assert found == []
84
+ assert score <= 0.0 # pure false-positive penalty, no credit
85
+
86
+
87
+ def test_lazy_attack_wrong_line(grader0):
88
+ """Keyword present but line number far from issue — no credit awarded."""
89
+ score, found, _ = grader0.score_comment(
90
+ line_number=99, # far from issue at line 4
91
+ comment="off-by-one indexerror range",
92
+ already_found=[],
93
+ )
94
+ assert found == []
95
+ assert score < 0.0 # false-positive penalty applied
96
+
97
+
98
+ # ── 2. Vague attack ───────────────────────────────────────────────────────────
99
+
100
+ def test_vague_attack_category_only(grader0):
101
+ """Alluding to the category (a 'logical issue', i.e. a bug) on the correct line without any specific keyword earns no credit."""
102
+ score, found, _ = grader0.score_comment(
103
+ line_number=4,
104
+ comment="This code has a logical issue.",
105
+ already_found=[],
106
+ )
107
+ assert found == []
108
+ assert score <= 0.0
109
+
110
+
111
+ # ── 3. Wrong-bug / hallucination attack ──────────────────────────────────────
112
+
113
+ def test_wrong_bug_on_correct_line_wrong_keyword(grader0):
114
+ """Hallucinated keyword on the correct line must not earn credit."""
115
+ score, found, _ = grader0.score_comment(
116
+ line_number=4,
117
+ comment="This has a performance bottleneck and memory leak issue here.",
118
+ already_found=[],
119
+ )
120
+ # 'performance' / 'memory' are not in bootstrap_off_by_one keywords
121
+ assert found == []
122
+ assert score <= 0.0
123
+
124
+
125
+ def test_wrong_bug_wrong_line_right_keyword(grader0):
126
+ """Right keyword, wrong line — line_hit must block the credit."""
127
+ score, found, _ = grader0.score_comment(
128
+ line_number=50, # nowhere near line 4 or 11
129
+ comment="off-by-one indexerror range len + 1",
130
+ already_found=[],
131
+ )
132
+ assert found == []
133
+ assert score <= 0.0
134
+
135
+
136
+ # ── 4. Perfect output ─────────────────────────────────────────────────────────
137
+
138
+ def test_perfect_comment_task0_issue1(grader0):
139
+ """Exact keyword + exact line → full credit for issue 1."""
140
+ score, found, breakdown = grader0.score_comment(
141
+ line_number=4,
142
+ comment="Off-by-one error: range(len(data) + 1) causes IndexError on the last iteration.",
143
+ already_found=[],
144
+ )
145
+ assert "bootstrap_off_by_one" in found
146
+ assert breakdown["issue_credit"] == pytest.approx(0.30, abs=0.01)
147
+ assert score > 0.0
148
+
149
+
150
+ def test_perfect_comment_task0_issue2(grader0):
151
+ """Exact keyword + exact line → full credit for issue 2."""
152
+ score, found, _ = grader0.score_comment(
153
+ line_number=11,
154
+ comment="Hardcoded password / credential in source — move to environment variable.",
155
+ already_found=[],
156
+ )
157
+ assert "bootstrap_hardcoded_cred" in found
158
+ assert score > 0.0
159
+
160
+
161
+ def test_perfect_final_score_task0(grader0):
162
+ """Full coverage + correct decision gives max terminal reward.
163
+
164
+ final_score() is the TERMINAL component only (coverage 0.20 + decision 0.10
165
+ + efficiency 0.10 = max 0.40). The per-comment 0.60 accumulates separately
166
+ during the episode via score_comment(). Assert the realistic terminal range.
167
+ """
168
+ reward = grader0.final_score(
169
+ issues_found=["bootstrap_off_by_one", "bootstrap_hardcoded_cred"],
170
+ review_decision="request_changes",
171
+ step_count=4,
172
+ max_steps=6,
173
+ current_step=4,
174
+ )
175
+ # coverage_bonus=0.20 + decision_score=0.10 + efficiency_bonus>0 → ~0.33-0.40
176
+ assert reward.total >= 0.30
177
+ assert reward.components["coverage_bonus"] == pytest.approx(0.20, abs=0.01)
178
+ assert reward.components["decision_score"] == pytest.approx(0.10, abs=0.001)
179
+ assert reward.passed is True
180
+
181
+
182
+ # ── 5. Base-model (average) output ───────────────────────────────────────────
183
+
184
+ def test_base_model_finds_one_of_two(grader0):
185
+ """Agent that finds 1/2 issues correctly should score in the average range."""
186
+ # Step 1: correct comment finding issue 1
187
+ score1, found1, _ = grader0.score_comment(
188
+ line_number=4,
189
+ comment="range(len(data) + 1) has an off-by-one bug causing IndexError.",
190
+ already_found=[],
191
+ )
192
+ # Step 2: vague comment on issue 2 line — no keyword match
193
+ score2, found2, _ = grader0.score_comment(
194
+ line_number=11,
195
+ comment="This line looks like it might have an issue with the connection string.",
196
+ already_found=found1,
197
+ )
198
+ reward = grader0.final_score(
199
+ issues_found=found1 + found2,
200
+ review_decision="request_changes",
201
+ step_count=4,
202
+ max_steps=6,
203
+ current_step=4,
204
+ )
205
+ # 50 % coverage → coverage_bonus=0.10, correct_decision=+0.10 → 0.20 total
206
+ # Well below the 0.85 perfect ceiling, above 0.10 lazy floor
207
+ assert 0.15 <= reward.total <= 0.55
208
+
209
+
210
+ # ── 6. LINE_TOLERANCE boundary ────────────────────────────────────────────────
211
+
212
+ def test_line_just_inside_tolerance(grader0):
213
+ """line_number at start - LINE_TOLERANCE must still match."""
214
+ issue_start = TASKS[0]["issues"][0]["line_range"][0] # 4
215
+ score, found, _ = grader0.score_comment(
216
+ line_number=issue_start - LINE_TOLERANCE, # exactly at boundary
217
+ comment="off-by-one indexerror range(len + 1) causes crash here",
218
+ already_found=[],
219
+ )
220
+ assert "bootstrap_off_by_one" in found
221
+
222
+
223
+ def test_line_just_outside_tolerance(grader0):
224
+ """line_number at start - LINE_TOLERANCE - 1 must NOT match."""
225
+ issue_start = TASKS[0]["issues"][0]["line_range"][0] # 4
226
+ score, found, _ = grader0.score_comment(
227
+ line_number=issue_start - LINE_TOLERANCE - 1, # one beyond boundary
228
+ comment="off-by-one indexerror range(len + 1) causes crash here",
229
+ already_found=[],
230
+ )
231
+ assert found == []
232
+ assert score <= 0.0
233
+
234
+
235
+ # ── 7. Minimum comment length guard ──────────────────────────────────────────
236
+
237
+ def test_short_keyword_comment_no_credit(grader0):
238
+ """A comment ≤ 15 chars containing a matching keyword must NOT earn credit."""
239
+ score, found, _ = grader0.score_comment(
240
+ line_number=4,
241
+ comment="indexerror", # 10 chars — below 15-char threshold
242
+ already_found=[],
243
+ )
244
+ assert found == []
245
+ # short comment → neither credit nor false-positive penalty
246
+ assert score == 0.0
247
+
248
+
249
+ def test_short_comment_no_false_positive_penalty(grader0):
250
+ """A short comment that matches nothing must NOT be penalised (too trivial)."""
251
+ score, found, _ = grader0.score_comment(
252
+ line_number=99,
253
+ comment="hmm", # 3 chars
254
+ already_found=[],
255
+ )
256
+ assert found == []
257
+ assert score == 0.0
258
+
259
+
260
+ def test_borderline_length_comment(grader0):
261
+ """A 17-char comment (just above the 15-char threshold) with keyword + correct line earns credit."""
262
+ score, found, _ = grader0.score_comment(
263
+ line_number=4,
264
+ comment="off-by-one range!", # 17 chars, > 15
265
+ already_found=[],
266
+ )
267
+ assert "bootstrap_off_by_one" in found
268
+ assert score > 0.0
269
+
270
+
271
+ # ── 8. False-positive penalty value ──────────────────────────────────────────
272
+
273
+ def test_false_positive_penalty_magnitude(grader0):
274
+ """Each wrong substantive comment must cost exactly -0.05."""
275
+ score, found, breakdown = grader0.score_comment(
276
+ line_number=99,
277
+ comment="This line has a performance issue with the loop structure.",
278
+ already_found=[],
279
+ )
280
+ assert found == []
281
+ assert breakdown["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
282
+
283
+
284
+ def test_multiple_false_positives_accumulate(grader0):
285
+ """Two wrong comments should each attract -0.05 independently."""
286
+ s1, _, bd1 = grader0.score_comment(
287
+ line_number=99,
288
+ comment="This line has a performance issue with the loop structure.",
289
+ already_found=[],
290
+ )
291
+ s2, _, bd2 = grader0.score_comment(
292
+ line_number=88,
293
+ comment="There is a design problem with this database call here.",
294
+ already_found=[],
295
+ )
296
+ assert bd1["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
297
+ assert bd2["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
298
+ # Combined penalty is -0.10 — within the -0.1 to -0.2 spec for 2 wrong claims
299
+ assert s1 + s2 == pytest.approx(-0.10, abs=0.001)
300
+
301
+
302
+ # ── 9. final_score — full coverage + correct decision ─────────────────────────
303
+
304
+ def test_final_score_full_coverage_correct_decision(grader1):
305
+ """100% coverage + correct decision → max terminal reward ~0.37-0.40."""
306
+ all_ids = [iss["id"] for iss in TASKS[1]["issues"]]
307
+ reward = grader1.final_score(
308
+ issues_found=all_ids,
309
+ review_decision="request_changes",
310
+ step_count=5,
311
+ max_steps=15,
312
+ current_step=5,
313
+ )
314
+ assert reward.total >= 0.30
315
+ assert reward.passed is True
316
+ assert reward.terminal is True
317
+ assert reward.components["coverage_bonus"] == pytest.approx(0.20, abs=0.01)
318
+ assert reward.components["decision_score"] == pytest.approx(0.10, abs=0.001)
319
+
320
+
321
+ # ── 10. final_score — zero coverage + wrong decision ─────────────────────────
322
+
323
+ def test_final_score_zero_coverage_wrong_decision(grader1):
324
+ reward = grader1.final_score(
325
+ issues_found=[],
326
+ review_decision="approve", # wrong — should be request_changes
327
+ step_count=15,
328
+ max_steps=15,
329
+ current_step=15,
330
+ )
331
+ assert reward.total <= 0.0
332
+ assert reward.passed is False
333
+ assert reward.components["decision_score"] == pytest.approx(-0.10, abs=0.001)
334
+ assert reward.components["coverage_bonus"] == pytest.approx(0.0, abs=0.001)
335
+
336
+
337
+ # ── 11. final_score — partial coverage ───────────────────────────────────────
338
+
339
+ def test_final_score_partial_coverage(grader1):
340
+ """Finding 1 out of 3 issues (weight 1.0 / 2.5 total) with correct decision."""
341
+ reward = grader1.final_score(
342
+ issues_found=["off_by_one"], # weight 1.0 out of 2.5 total
343
+ review_decision="request_changes",
344
+ step_count=10,
345
+ max_steps=15,
346
+ current_step=10,
347
+ )
348
+ # coverage = 1.0/2.5 = 0.40 → coverage_bonus = 0.08
349
+ # decision_score = +0.10
350
+ # efficiency_bonus = 0.0 (coverage < 0.60)
351
+ # total = 0.18
352
+ assert 0.10 <= reward.total <= 0.30
353
+ assert reward.passed is False # coverage < 60 %
354
+
355
+
356
+ # ── 12. Already-found deduplication ──────────────────────────────────────────
357
+
358
+ def test_already_found_not_double_credited(grader0):
359
+ """An issue already in already_found must not be credited again."""
360
+ score, found, _ = grader0.score_comment(
361
+ line_number=4,
362
+ comment="off-by-one indexerror range(len + 1) causes crash on last item",
363
+ already_found=["bootstrap_off_by_one"], # pre-marked as found
364
+ )
365
+ assert "bootstrap_off_by_one" not in found
366
+ assert score <= 0.0 # false-positive penalty since nothing was matched
367
+
368
+
369
+ # ── 13. None / empty comment guard ───────────────────────────────────────────
370
+
371
+ def test_none_comment_returns_zero(grader0):
372
+ score, found, breakdown = grader0.score_comment(
373
+ line_number=4,
374
+ comment=None,
375
+ already_found=[],
376
+ )
377
+ assert score == 0.0
378
+ assert found == []
379
+ assert breakdown == {}
380
+
381
+
382
+ def test_empty_comment_returns_zero(grader0):
383
+ score, found, _ = grader0.score_comment(
384
+ line_number=4,
385
+ comment="",
386
+ already_found=[],
387
+ )
388
+ assert score == 0.0
389
+ assert found == []
390
+
391
+
392
+ # ── 14. Task weight totals are non-zero (guards __init__) ────────────────────
393
+
394
+ def test_all_task_total_weights_positive():
395
+ for task in TASKS:
396
+ grader = CodeReviewGrader(task)
397
+ assert grader.total_weight > 0.0, f"Task {task['id']} has zero total weight"
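The `final_score` tests above describe the terminal reward as three parts: a coverage bonus of up to 0.20, a decision score of +0.10 or -0.10, and an efficiency bonus of up to 0.10 that only applies once weighted coverage reaches 60 %. The arithmetic can be sketched as below. The function name `terminal_score` and the linear efficiency formula `0.10 * (1 - step / max_steps)` are assumptions inferred from the inline comments (for example "~0.33-0.40" at step 4 of 6); the real `CodeReviewGrader.final_score` may compute the bonus differently.

```python
from typing import Dict


def terminal_score(
    found_weight: float,
    total_weight: float,
    decision_correct: bool,
    step: int,
    max_steps: int,
) -> Dict[str, float]:
    """Hypothetical terminal-reward breakdown matching the test comments."""
    coverage = found_weight / total_weight
    coverage_bonus = 0.20 * coverage
    decision_score = 0.10 if decision_correct else -0.10
    # Assumed: the bonus shrinks linearly with steps used and is gated
    # on at least 60 % weighted coverage.
    if coverage >= 0.60:
        efficiency_bonus = 0.10 * (1 - step / max_steps)
    else:
        efficiency_bonus = 0.0
    total = coverage_bonus + decision_score + efficiency_bonus
    return {
        "coverage_bonus": coverage_bonus,
        "decision_score": decision_score,
        "efficiency_bonus": efficiency_bonus,
        "total": total,
    }


# Partial-coverage case from the tests: one issue of weight 1.0 out of 2.5.
partial = terminal_score(1.0, 2.5, True, step=10, max_steps=15)
```

Under these assumptions the partial-coverage case reproduces the 0.18 total given in the test comments (0.08 coverage bonus, 0.10 decision score, no efficiency bonus).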
uv.lock CHANGED
@@ -882,6 +882,7 @@ dependencies = [
882
  { name = "gradio-client" },
883
  { name = "typer" },
884
  ]
 
885
  wheels = [
886
  { url = "https://files.pythonhosted.org/packages/30/2d/afff2ee87e75d8eb85c92bb8cf0e15b05c23c2ebd8fd8dec781d8601ed7f/hf_gradio-0.4.1-py3-none-any.whl", hash = "sha256:76b8cb8be6abe62d74c1ad2d35b42f0629db89aa9e1a8d033cecfe7c856eeab3", size = 4482, upload-time = "2026-04-17T19:53:31.827Z" },
887
  ]
@@ -1571,32 +1572,6 @@ wheels = [
1571
  { url = "https://files.pythonhosted.org/packages/12/cf/03675d8bd8ecbf4445504d8071adab19f5f993676795708e36402ab38263/openapi_pydantic-0.5.1-py3-none-any.whl", hash = "sha256:a3a09ef4586f5bd760a8df7f43028b60cafb6d9f61de2acba9574766255ab146", size = 96381, upload-time = "2025-01-08T19:29:25.275Z" },
1572
  ]
1573
 
1574
- [[package]]
1575
- name = "openenv-codereviewagent"
1576
- version = "0.1.0"
1577
- source = { editable = "." }
1578
- dependencies = [
1579
- { name = "openai" },
1580
- { name = "openenv-core", extra = ["core"] },
1581
- { name = "python-dotenv" },
1582
- ]
1583
-
1584
- [package.optional-dependencies]
1585
- dev = [
1586
- { name = "pytest" },
1587
- { name = "pytest-cov" },
1588
- ]
1589
-
1590
- [package.metadata]
1591
- requires-dist = [
1592
- { name = "openai", specifier = ">=1.0.0" },
1593
- { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
1594
- { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
1595
- { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
1596
- { name = "python-dotenv", specifier = ">=1.2.2" },
1597
- ]
1598
- provides-extras = ["dev"]
1599
-
1600
  [[package]]
1601
  name = "openenv-core"
1602
  version = "0.2.3"
@@ -1632,6 +1607,44 @@ core = [
1632
  { name = "websockets" },
1633
  ]
1634
 
1635
  [[package]]
1636
  name = "opentelemetry-api"
1637
  version = "1.41.0"
 
882
  { name = "gradio-client" },
883
  { name = "typer" },
884
  ]
885
+ sdist = { url = "https://files.pythonhosted.org/packages/ce/86/c9694b7cfada5780e75769e60dc161a161f4dd7fc91b61db5e3a3338bef9/hf_gradio-0.4.1.tar.gz", hash = "sha256:a017d942618f0d495a58ee4563047fa04bef614c00e0cb789a9a6d0633cffa7b", size = 6560, upload-time = "2026-04-22T14:01:32.334Z" }
886
  wheels = [
887
  { url = "https://files.pythonhosted.org/packages/30/2d/afff2ee87e75d8eb85c92bb8cf0e15b05c23c2ebd8fd8dec781d8601ed7f/hf_gradio-0.4.1-py3-none-any.whl", hash = "sha256:76b8cb8be6abe62d74c1ad2d35b42f0629db89aa9e1a8d033cecfe7c856eeab3", size = 4482, upload-time = "2026-04-17T19:53:31.827Z" },
888
  ]
 
1572
  { url = "https://files.pythonhosted.org/packages/12/cf/03675d8bd8ecbf4445504d8071adab19f5f993676795708e36402ab38263/openapi_pydantic-0.5.1-py3-none-any.whl", hash = "sha256:a3a09ef4586f5bd760a8df7f43028b60cafb6d9f61de2acba9574766255ab146", size = 96381, upload-time = "2025-01-08T19:29:25.275Z" },
1573
  ]
1574
 
1575
  [[package]]
1576
  name = "openenv-core"
1577
  version = "0.2.3"
 
1607
  { name = "websockets" },
1608
  ]
1609
 
1610
+ [[package]]
1611
+ name = "openenv-probe"
1612
+ version = "0.1.0"
1613
+ source = { editable = "." }
1614
+ dependencies = [
1615
+ { name = "openai" },
1616
+ { name = "openenv-core", extra = ["core"] },
1617
+ { name = "python-dotenv" },
1618
+ ]
1619
+
1620
+ [package.optional-dependencies]
1621
+ dev = [
1622
+ { name = "pytest" },
1623
+ { name = "pytest-cov" },
1624
+ ]
1625
+
1626
+ [package.dev-dependencies]
1627
+ dev = [
1628
+ { name = "pytest" },
1629
+ { name = "pytest-cov" },
1630
+ ]
1631
+
1632
+ [package.metadata]
1633
+ requires-dist = [
1634
+ { name = "openai", specifier = ">=1.0.0" },
1635
+ { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
1636
+ { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
1637
+ { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
1638
+ { name = "python-dotenv", specifier = ">=1.2.2" },
1639
+ ]
1640
+ provides-extras = ["dev"]
1641
+
1642
+ [package.metadata.requires-dev]
1643
+ dev = [
1644
+ { name = "pytest", specifier = ">=9.0.3" },
1645
+ { name = "pytest-cov", specifier = ">=7.1.0" },
1646
+ ]
1647
+
1648
  [[package]]
1649
  name = "opentelemetry-api"
1650
  version = "1.41.0"