Thakur, Mahipal committed on
Commit · ab287c4
Parent(s): 62f5d41
feat: add dynamic world modeling — mutation engine, GET_CONTEXT action, causal chain task
server/mutator.py: variable rename + line shift + constant variance per episode
server/tasks.py: Task 6 causal chain with progressive context unlock
server/CodeReviewAgent_environment.py: wire mutation, GET_CONTEXT, unlock logic
models.py: add GET_CONTEXT action type + context_hints observation field
tests/test_dynamic_world.py: 26 tests covering all new features
refactor: rename project from CodeReviewAgent to PRobe
All class names: ProbeAction, ProbeObservation, ProbeEnv, ProbeEnvironment
pyproject.toml, openenv.yaml, README.md, __init__.py fully updated
50/50 tests passing
- README.md +98 -34
- __init__.py +6 -6
- __pycache__/__init__.cpython-314.pyc +0 -0
- __pycache__/client.cpython-314.pyc +0 -0
- __pycache__/models.cpython-314.pyc +0 -0
- client.py +31 -44
- models.py +72 -35
- openenv.yaml +35 -14
- openenv_CodeReviewAgent.egg-info/SOURCES.txt +5 -1
- openenv_PRobe.egg-info/PKG-INFO +11 -0
- openenv_PRobe.egg-info/SOURCES.txt +19 -0
- openenv_PRobe.egg-info/dependency_links.txt +1 -0
- openenv_PRobe.egg-info/entry_points.txt +2 -0
- openenv_PRobe.egg-info/requires.txt +7 -0
- openenv_PRobe.egg-info/top_level.txt +1 -0
- pyproject.toml +11 -6
- server/CodeReviewAgent_environment.py +127 -32
- server/__init__.py +3 -3
- server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc +0 -0
- server/__pycache__/__init__.cpython-314.pyc +0 -0
- server/__pycache__/grader.cpython-314.pyc +0 -0
- server/__pycache__/mutator.cpython-314.pyc +0 -0
- server/__pycache__/tasks.cpython-314.pyc +0 -0
- server/app.py +28 -22
- server/grader.py +66 -39
- server/mutator.py +123 -0
- server/tasks.py +224 -0
- tests/__init__.py +0 -0
- tests/__pycache__/__init__.cpython-314.pyc +0 -0
- tests/__pycache__/test_dynamic_world.cpython-314-pytest-9.0.3.pyc +0 -0
- tests/__pycache__/test_grader.cpython-314-pytest-9.0.3.pyc +0 -0
- tests/test_dynamic_world.py +344 -0
- tests/test_grader.py +397 -0
- uv.lock +39 -26
README.md
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
emoji: 🔍
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: green
|
|
@@ -12,13 +12,22 @@ tags:
|
|
| 12 |
- code-review
|
| 13 |
- rl-training
|
| 14 |
- grpo
|
| 15 |
---
|
| 16 |
|
| 17 |
-
#
|
| 18 |
|
| 19 |
> **OpenEnv Hackathon 2026 · Theme #3.1 — World Modeling (Professional Tasks)**
|
| 20 |
|
| 21 |
-
An RL
|
| 22 |
|
| 23 |
---
|
| 24 |
|
|
@@ -31,15 +40,17 @@ This environment provides a **reward signal** that directly measures review qual
|
|
| 31 |
|
| 32 |
## Environment Design
|
| 33 |
|
| 34 |
-
### Tasks (
|
| 35 |
|
| 36 |
| ID | Difficulty | File | Issues | Domain |
|
| 37 |
|----|-----------|------|--------|--------|
|
| 38 |
-
| 0 |
|
| 39 |
-
| 1 |
|
| 40 |
-
| 2 |
|
| 41 |
-
| 3 |
|
| 42 |
-
| 4 |
|
| 43 |
|
| 44 |
Tasks cycle automatically on each `reset()` call.
|
| 45 |
|
|
@@ -47,18 +58,19 @@ Tasks cycle automatically on each `reset()` call.
|
|
| 47 |
|
| 48 |
```python
|
| 49 |
{
|
| 50 |
-
"code_snippet":
|
| 51 |
-
"task_description":
|
| 52 |
-
"file_name":
|
| 53 |
-
"task_id":
|
| 54 |
-
"task_difficulty":
|
| 55 |
-
"review_history":
|
| 56 |
-
"step_count":
|
| 57 |
-
"max_steps":
|
| 58 |
"issues_found_count": int,
|
| 59 |
-
"total_issues":
|
| 60 |
-
"
|
| 61 |
-
"
|
|
|
|
| 62 |
}
|
| 63 |
```
|
| 64 |
|
|
@@ -66,7 +78,8 @@ Tasks cycle automatically on each `reset()` call.
|
|
| 66 |
|
| 67 |
| action_type | Required fields | Effect |
|
| 68 |
|-------------|----------------|--------|
|
| 69 |
-
| `add_comment` | `line_number`, `comment`, `severity`, `category` | Annotate a line;
|
|
|
|
| 70 |
| `request_changes` | `comment` | Signal PR needs work |
|
| 71 |
| `approve` | — | Approve PR (penalised if issues remain) |
|
| 72 |
| `submit_review` | — | Finalise review; terminal reward |
|
|
@@ -86,7 +99,54 @@ Terminal (SUBMIT_REVIEW):
|
|
| 86 |
Maximum achievable: ~1.0
|
| 87 |
```
|
| 88 |
|
| 89 |
-
Grading uses **keyword + line-range matching** (±
|
| 90 |
|
| 91 |
---
|
| 92 |
|
|
@@ -141,11 +201,11 @@ All install, training, evaluation, and plotting cells are included.
|
|
| 141 |
|
| 142 |
*(Fill in after training run)*
|
| 143 |
|
| 144 |
-
| Model | Avg Reward | Task-0 | Task-1 | Task-2 | Task-3 | Task-4 |
|
| 145 |
-
|-------|-----------|--------|--------|--------|--------|--------|
|
| 146 |
-
| GPT-4o-mini (baseline) | — | — | — | — | — | — |
|
| 147 |
-
| Qwen2.5-1.5B (untrained) | — | — | — | — | — | — |
|
| 148 |
-
| Qwen2.5-1.5B (GRPO 3 epochs) | — | — | — | — | — | — |
|
| 149 |
|
| 150 |
Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png`
|
| 151 |
|
|
@@ -154,17 +214,21 @@ Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png
|
|
| 154 |
## Project Structure
|
| 155 |
|
| 156 |
```
|
| 157 |
-
|
| 158 |
├── openenv.yaml # OpenEnv manifest
|
| 159 |
├── pyproject.toml
|
| 160 |
├── models.py # Action + Observation types
|
| 161 |
├── client.py # OpenEnv client
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
train_grpo.py # GRPO training script
|
| 169 |
train_grpo_colab.ipynb # Colab notebook
|
| 170 |
baseline.py # GPT-4o-mini baseline
|
|
|
|
| 1 |
---
|
| 2 |
+
title: PRobe Environment
|
| 3 |
emoji: 🔍
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: green
|
|
|
|
| 12 |
- code-review
|
| 13 |
- rl-training
|
| 14 |
- grpo
|
| 15 |
+
- world-modeling
|
| 16 |
+
- probe
|
| 17 |
---
|
| 18 |
|
| 19 |
+
# PRobe — Pull Request Investigation Environment
|
| 20 |
|
| 21 |
> **OpenEnv Hackathon 2026 · Theme #3.1 — World Modeling (Professional Tasks)**
|
| 22 |
|
| 23 |
+
> *An RL environment where agents learn to investigate code like a security researcher, not scan it like a linter.*
|
| 24 |
+
|
| 25 |
+
PRobe is an RL training environment where an LLM learns to perform structured **pull-request code reviews** on real Python source files. The agent must identify bugs, security vulnerabilities, performance bottlenecks, and design issues — and submit a structured review with line-level comments.
|
| 26 |
+
|
| 27 |
+
The name has three meanings that map directly to the environment's design:
|
| 28 |
+
- **PR** — the domain: pull-request review
|
| 29 |
+
- **Probe** — the `get_context` action where the agent literally probes lines for deeper context
|
| 30 |
+
- **World Modeling** — an agent that *investigates* a partially observable system, updating its beliefs as new evidence is revealed
|
| 31 |
|
| 32 |
---
|
| 33 |
|
|
|
|
| 40 |
|
| 41 |
## Environment Design
|
| 42 |
|
| 43 |
+
### Tasks (7 total)
|
| 44 |
|
| 45 |
| ID | Difficulty | File | Issues | Domain |
|
| 46 |
|----|-----------|------|--------|--------|
|
| 47 |
+
| 0 | Ultra-easy | `bootstrap.py` | 2 | Off-by-one, hardcoded credential (hinted in comments) |
|
| 48 |
+
| 1 | Easy | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
|
| 49 |
+
| 2 | Medium | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
|
| 50 |
+
| 3 | Hard | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
|
| 51 |
+
| 4 | Medium | `async_worker.py` | 5 | Race condition, missing await, resource leak |
|
| 52 |
+
| 5 | Hard | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
|
| 53 |
+
| 6 | Hard | `auth_service.py` | 6 | **Causal chain** — JWT forgery → privilege escalation |
|
| 54 |
|
| 55 |
Tasks cycle automatically on each `reset()` call.
|
| 56 |
|
|
|
|
| 58 |
|
| 59 |
```python
|
| 60 |
{
|
| 61 |
+
"code_snippet": str, # Python source to review (mutated each episode)
|
| 62 |
+
"task_description": str, # What to look for
|
| 63 |
+
"file_name": str,
|
| 64 |
+
"task_id": int, # 0–6
|
| 65 |
+
"task_difficulty": str, # ultra-easy / easy / medium / hard
|
| 66 |
+
"review_history": list, # actions taken so far this episode
|
| 67 |
+
"step_count": int,
|
| 68 |
+
"max_steps": int,
|
| 69 |
"issues_found_count": int,
|
| 70 |
+
"total_issues": int,
|
| 71 |
+
"context_hints": list, # causal hints unlocked so far (Task 6)
|
| 72 |
+
"done": bool,
|
| 73 |
+
"reward": float,
|
| 74 |
}
|
| 75 |
```
|
| 76 |
|
|
|
|
| 78 |
|
| 79 |
| action_type | Required fields | Effect |
|
| 80 |
|-------------|----------------|--------|
|
| 81 |
+
| `add_comment` | `line_number`, `comment`, `severity`, `category` | Annotate a line; reward if it matches a ground-truth issue |
|
| 82 |
+
| `get_context` | `line_number` | Reveal ±5 lines of context around a line (free near issues, −0.01 elsewhere) |
|
| 83 |
| `request_changes` | `comment` | Signal PR needs work |
|
| 84 |
| `approve` | — | Approve PR (penalised if issues remain) |
|
| 85 |
| `submit_review` | — | Finalise review; terminal reward |
|
|
|
|
| 99 |
Maximum achievable: ~1.0
|
| 100 |
```
|
| 101 |
|
| 102 |
+
Grading uses **keyword + line-range matching** (±2 lines tolerance) against hand-labelled ground-truth issues — no LLM judge needed, fully deterministic.
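The matching rule above can be illustrated with a minimal sketch. The `GroundTruthIssue` record, its field names, and the `comment_matches` helper are hypothetical and are not taken from `server/grader.py`; only the ±2 line tolerance and the keyword-plus-line-range idea come from the README text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruthIssue:
    """Hypothetical ground-truth record; field names are illustrative only."""

    line_start: int
    line_end: int
    keywords: tuple[str, ...]


def comment_matches(
    issue: GroundTruthIssue, line_number: int, comment: str, tolerance: int = 2
) -> bool:
    """True when the comment is within `tolerance` lines of the issue
    and mentions at least one of its keywords (case-insensitive)."""
    in_range = (
        issue.line_start - tolerance <= line_number <= issue.line_end + tolerance
    )
    text = comment.lower()
    has_keyword = any(keyword.lower() in text for keyword in issue.keywords)
    return in_range and has_keyword


# Assumed example issue spanning lines 10-12.
issue = GroundTruthIssue(10, 12, ("sql injection", "parameterized"))
near_hit = comment_matches(issue, 14, "Possible SQL injection here")
too_far = comment_matches(issue, 15, "SQL injection")
no_keyword = comment_matches(issue, 11, "Looks fine to me")
```

Because the check is pure string and integer comparison, the same comment always receives the same verdict, which is what makes a grader of this kind deterministic.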
|
| 103 |
+
|
| 104 |
+
---
|
| 105 |
+
|
| 106 |
+
## Dynamic World Features (v3)
|
| 107 |
+
|
| 108 |
+
### Code Mutation
|
| 109 |
+
Every `reset()` applies three surface-level mutations so the agent must *read* code each episode rather than memorise tokens:
|
| 110 |
+
|
| 111 |
+
| Mutation | Effect |
|
| 112 |
+
|---|---|
|
| 113 |
+
| Variable rename | One identifier swapped for a synonym (e.g. `total` → `acc`) |
|
| 114 |
+
| Line shift | One blank line inserted above the first issue, shifting all `line_range` values by +1 |
|
| 115 |
+
| Constant variance | One numeric literal nudged ±1 (e.g. `range(1000)` → `range(999)`) |
|
| 116 |
+
|
| 117 |
+
Mutations are fully **deterministic** given the episode seed: the same seed reproduces the same mutated file, while a different seed yields a different surface form.
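A seeded line-shift mutation could look roughly like the sketch below. The function name, its signature, and the choice to pick the insertion point at random at or above the first issue are assumptions for illustration; the actual logic lives in `server/mutator.py` and may differ.

```python
import random


def insert_blank_line(source, line_ranges, seed):
    """Insert one blank line at or above the first issue (1-based lines)
    and shift every issue range by +1. Hypothetical helper."""
    rng = random.Random(seed)  # private RNG: same seed, same mutation
    lines = source.split("\n")
    first_issue_line = min(start for start, _ in line_ranges)
    # Any insertion point at or above the first issue shifts all ranges.
    insert_at = rng.randint(1, first_issue_line)
    lines.insert(insert_at - 1, "")
    shifted = [(start + 1, end + 1) for start, end in line_ranges]
    return "\n".join(lines), shifted


source = "a = 1\nb = 2\nc = 3\nd = 4"
mutated, shifted = insert_blank_line(source, [(3, 3)], seed=7)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the mutation independent of any other randomness in the process.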
|
| 118 |
+
|
| 119 |
+
### GET_CONTEXT Action
|
| 120 |
+
The agent can spend a step probing any line to receive ±5 lines of surrounding context:
|
| 121 |
+
|
| 122 |
+
```python
|
| 123 |
+
action = ProbeAction(
|
| 124 |
+
action_type="get_context",
|
| 125 |
+
line_number=37,
|
| 126 |
+
)
|
| 127 |
+
# Observation will contain a context snippet around line 37
|
| 128 |
+
# Cost: -0.01 if line is far from any real issue, 0.00 if near one
|
| 129 |
+
```
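For intuition, the window extraction and probe cost might be computed along these lines. Both helpers are hypothetical, and the "near an issue" threshold of 5 lines is an assumption: the README gives the ±5 context radius and the two cost values but does not define how close a probe must be to count as near.

```python
def context_window(source, line_number, radius=5):
    """Return the lines within `radius` of `line_number` (1-based),
    each prefixed with its line number, clipped to the file bounds."""
    lines = source.split("\n")
    start = max(1, line_number - radius)
    end = min(len(lines), line_number + radius)
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(start, end + 1))


def probe_cost(line_number, issue_ranges, near=5):
    """0.0 when the probe lands near any issue range, else -0.01.
    The `near` threshold is an assumed value."""
    for start, end in issue_ranges:
        if start - near <= line_number <= end + near:
            return 0.0
    return -0.01


source = "\n".join(f"line{i}" for i in range(1, 21))
window = context_window(source, 3)
cost_near = probe_cost(37, [(35, 36)])
cost_far = probe_cost(3, [(35, 36)])
```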
|
| 130 |
+
|
| 131 |
+
### Causal Unlock Chain (Task 6)
|
| 132 |
+
Task 6 implements a **progressive world model**. Finding certain issues unlocks new context hints that reveal deeper parts of the system:
|
| 133 |
+
|
| 134 |
+
```
|
| 135 |
+
Find hardcoded JWT secret
|
| 136 |
+
│
|
| 137 |
+
▼
|
| 138 |
+
DB schema revealed ──► agent sees plaintext passwords + role table
|
| 139 |
+
│
|
| 140 |
+
▼
|
| 141 |
+
Can now reason: leaked secret → forge admin token → privilege escalation
|
| 142 |
+
|
| 143 |
+
Find missing rate-limit
|
| 144 |
+
│
|
| 145 |
+
▼
|
| 146 |
+
nginx config revealed ──► confirms /auth fully exposed, no IP filtering
|
| 147 |
+
```
|
| 148 |
+
|
| 149 |
+
This rewards genuine *causal reasoning* — the agent must update its world model as new evidence arrives.
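One way to picture the unlock mechanism is a lookup from found issues to hints, as in this sketch. The issue identifiers, the hint wording, and the `unlock_hints` helper are all hypothetical; the real mapping is defined in `server/tasks.py` and wired up in the environment.

```python
# Hypothetical issue IDs mapped to the hint each one unlocks.
UNLOCKS = {
    "hardcoded_jwt_secret": "DB schema: plaintext passwords and a role table",
    "missing_rate_limit": "nginx config: /auth fully exposed, no IP filtering",
}


def unlock_hints(found_issue_ids, current_hints):
    """Return the hint list extended with any newly unlocked hints.
    Existing hints keep their order and are never duplicated."""
    hints = list(current_hints)
    for issue_id, hint in UNLOCKS.items():
        if issue_id in found_issue_ids and hint not in hints:
            hints.append(hint)
    return hints


after_first = unlock_hints({"hardcoded_jwt_secret"}, [])
after_both = unlock_hints({"hardcoded_jwt_secret", "missing_rate_limit"}, after_first)
```

Returning a new list each step would let the environment copy the result straight into the observation's `context_hints` field.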
|
| 150 |
|
| 151 |
---
|
| 152 |
|
|
|
|
| 201 |
|
| 202 |
*(Fill in after training run)*
|
| 203 |
|
| 204 |
+
| Model | Avg Reward | Task-0 | Task-1 | Task-2 | Task-3 | Task-4 | Task-5 | Task-6 |
|
| 205 |
+
|-------|-----------|--------|--------|--------|--------|--------|--------|--------|
|
| 206 |
+
| GPT-4o-mini (baseline) | — | — | — | — | — | — | — | — |
|
| 207 |
+
| Qwen2.5-1.5B (untrained) | — | — | — | — | — | — | — | — |
|
| 208 |
+
| Qwen2.5-1.5B (GRPO 3 epochs) | — | — | — | — | — | — | — | — |
|
| 209 |
|
| 210 |
Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png`
|
| 211 |
|
|
|
|
| 214 |
## Project Structure
|
| 215 |
|
| 216 |
```
|
| 217 |
+
PRobe/
|
| 218 |
├── openenv.yaml # OpenEnv manifest
|
| 219 |
├── pyproject.toml
|
| 220 |
├── models.py # Action + Observation types
|
| 221 |
├── client.py # OpenEnv client
|
| 222 |
+
├── server/
|
| 223 |
+
│ ├── app.py # FastAPI server
|
| 224 |
+
│ ├── PRobe_environment.py # Environment core
|
| 225 |
+
│ ├── grader.py # Deterministic reward grader
|
| 226 |
+
│ ├── mutator.py # Code mutation engine (dynamic world)
|
| 227 |
+
│ ├── tasks.py # 7 ground-truth tasks
|
| 228 |
+
│ └── Dockerfile
|
| 229 |
+
├── tests/
|
| 230 |
+
│ ├── test_grader.py # 24 grader tests (all 5 RL attacks)
|
| 231 |
+
│ └── test_dynamic_world.py # 26 dynamic world tests
|
| 232 |
train_grpo.py # GRPO training script
|
| 233 |
train_grpo_colab.ipynb # Colab notebook
|
| 234 |
baseline.py # GPT-4o-mini baseline
|
__init__.py
CHANGED
|
@@ -4,13 +4,13 @@
|
|
| 4 |
# This source code is licensed under the BSD-style license found in the
|
| 5 |
# LICENSE file in the root directory of this source tree.
|
| 6 |
|
| 7 |
-
"""
|
| 8 |
|
| 9 |
-
from .client import
|
| 10 |
-
from .models import
|
| 11 |
|
| 12 |
__all__ = [
|
| 13 |
-
"
|
| 14 |
-
"
|
| 15 |
-
"
|
| 16 |
]
|
|
|
|
| 4 |
# This source code is licensed under the BSD-style license found in the
|
| 5 |
# LICENSE file in the root directory of this source tree.
|
| 6 |
|
| 7 |
+
"""PRobe \u2014 Pull Request Investigation Environment."""
|
| 8 |
|
| 9 |
+
from .client import ProbeEnv
|
| 10 |
+
from .models import ProbeAction, ProbeObservation
|
| 11 |
|
| 12 |
__all__ = [
|
| 13 |
+
"ProbeAction",
|
| 14 |
+
"ProbeObservation",
|
| 15 |
+
"ProbeEnv",
|
| 16 |
]
|
__pycache__/__init__.cpython-314.pyc
CHANGED
|
Binary files a/__pycache__/__init__.cpython-314.pyc and b/__pycache__/__init__.cpython-314.pyc differ
|
|
|
__pycache__/client.cpython-314.pyc
CHANGED
|
Binary files a/__pycache__/client.cpython-314.pyc and b/__pycache__/client.cpython-314.pyc differ
|
|
|
__pycache__/models.cpython-314.pyc
CHANGED
|
Binary files a/__pycache__/models.cpython-314.pyc and b/__pycache__/models.cpython-314.pyc differ
|
|
|
client.py
CHANGED
|
@@ -1,40 +1,39 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
from
|
| 4 |
|
| 5 |
from openenv.core import EnvClient
|
| 6 |
from openenv.core.client_types import StepResult
|
| 7 |
from openenv.core.env_server.types import State
|
| 8 |
|
| 9 |
-
from .models import
|
| 10 |
|
| 11 |
|
| 12 |
-
class
|
| 13 |
-
EnvClient[CodereviewagentAction, CodereviewagentObservation, State]
|
| 14 |
-
):
|
| 15 |
"""
|
| 16 |
-
Client for the
|
| 17 |
|
| 18 |
Maintains a persistent WebSocket connection to the server.
|
| 19 |
|
| 20 |
-
Example:
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
|
|
|
| 34 |
"""
|
| 35 |
|
| 36 |
-
def _step_payload(self, action:
|
| 37 |
-
payload = {"action_type": action.action_type.value}
|
| 38 |
if action.line_number is not None:
|
| 39 |
payload["line_number"] = action.line_number
|
| 40 |
if action.comment is not None:
|
|
@@ -46,31 +45,19 @@ class CodereviewagentEnv(
|
|
| 46 |
return payload
|
| 47 |
|
| 48 |
def _parse_result(
|
| 49 |
-
self, payload:
|
| 50 |
-
) -> StepResult[
|
| 51 |
-
obs_data = payload.get("observation", {})
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
file_name=obs_data.get("file_name", ""),
|
| 56 |
-
task_id=obs_data.get("task_id", 0),
|
| 57 |
-
task_difficulty=obs_data.get("task_difficulty", "easy"),
|
| 58 |
-
review_history=obs_data.get("review_history", []),
|
| 59 |
-
step_count=obs_data.get("step_count", 0),
|
| 60 |
-
max_steps=obs_data.get("max_steps", 20),
|
| 61 |
-
issues_found_count=obs_data.get("issues_found_count", 0),
|
| 62 |
-
total_issues=obs_data.get("total_issues", 0),
|
| 63 |
-
done=payload.get("done", False),
|
| 64 |
-
reward=payload.get("reward"),
|
| 65 |
-
metadata=obs_data.get("metadata", {}),
|
| 66 |
-
)
|
| 67 |
return StepResult(
|
| 68 |
observation=observation,
|
| 69 |
-
reward=payload.get("reward"),
|
| 70 |
-
done=payload.get("done", False),
|
| 71 |
)
|
| 72 |
|
| 73 |
-
def _parse_state(self, payload:
|
| 74 |
return State(
|
| 75 |
episode_id=payload.get("episode_id"),
|
| 76 |
step_count=payload.get("step_count", 0),
|
|
|
|
| 1 |
+
"""PRobe Environment Client."""
|
| 2 |
|
| 3 |
+
from __future__ import annotations
|
| 4 |
|
| 5 |
from openenv.core import EnvClient
|
| 6 |
from openenv.core.client_types import StepResult
|
| 7 |
from openenv.core.env_server.types import State
|
| 8 |
|
| 9 |
+
from .models import ProbeAction, ProbeObservation
|
| 10 |
|
| 11 |
|
| 12 |
+
class ProbeEnv(EnvClient[ProbeAction, ProbeObservation, State]):
|
|
|
|
|
|
|
| 13 |
"""
|
| 14 |
+
Client for the PRobe environment.
|
| 15 |
|
| 16 |
Maintains a persistent WebSocket connection to the server.
|
| 17 |
|
| 18 |
+
Example::
|
| 19 |
+
|
| 20 |
+
with ProbeEnv(base_url="http://localhost:8000") as env:
|
| 21 |
+
result = env.reset()
|
| 22 |
+
print(result.observation.task_description)
|
| 23 |
+
|
| 24 |
+
action = ProbeAction(
|
| 25 |
+
action_type="add_comment",
|
| 26 |
+
line_number=4,
|
| 27 |
+
comment="Off-by-one: range(len+1) causes IndexError",
|
| 28 |
+
severity="error",
|
| 29 |
+
category="bug",
|
| 30 |
+
)
|
| 31 |
+
result = env.step(action)
|
| 32 |
+
print(result.reward)
|
| 33 |
"""
|
| 34 |
|
| 35 |
+
def _step_payload(self, action: ProbeAction) -> dict:
|
| 36 |
+
payload: dict = {"action_type": action.action_type.value}
|
| 37 |
if action.line_number is not None:
|
| 38 |
payload["line_number"] = action.line_number
|
| 39 |
if action.comment is not None:
|
|
|
|
| 45 |
return payload
|
| 46 |
|
| 47 |
def _parse_result(
|
| 48 |
+
self, payload: dict
|
| 49 |
+
) -> StepResult[ProbeObservation]:
|
| 50 |
+
obs_data: dict = payload.get("observation", {})
|
| 51 |
+
# Use model_validate so new fields added to ProbeObservation
|
| 52 |
+
# are picked up automatically without changing this method.
|
| 53 |
+
observation = ProbeObservation.model_validate(obs_data)
|
| 54 |
return StepResult(
|
| 55 |
observation=observation,
|
| 56 |
+
reward=float(payload.get("reward") or 0.0),
|
| 57 |
+
done=bool(payload.get("done", False)),
|
| 58 |
)
|
| 59 |
|
| 60 |
+
def _parse_state(self, payload: dict) -> State:
|
| 61 |
return State(
|
| 62 |
episode_id=payload.get("episode_id"),
|
| 63 |
step_count=payload.get("step_count", 0),
|
models.py
CHANGED
|
@@ -1,10 +1,12 @@
|
|
| 1 |
"""
|
| 2 |
-
Data models for the
|
| 3 |
|
| 4 |
An agent reviews Python source files, identifies bugs, security issues,
|
| 5 |
and design problems, then submits a structured review.
|
| 6 |
"""
|
| 7 |
|
|
|
|
|
|
|
| 8 |
from enum import Enum
|
| 9 |
from typing import Any
|
| 10 |
|
|
@@ -13,13 +15,18 @@ from pydantic import BaseModel, ConfigDict, Field
|
|
| 13 |
|
| 14 |
|
| 15 |
class ActionType(str, Enum):
|
|
|
|
|
|
|
| 16 |
ADD_COMMENT = "add_comment"
|
|
|
|
| 17 |
REQUEST_CHANGES = "request_changes"
|
| 18 |
APPROVE = "approve"
|
| 19 |
SUBMIT_REVIEW = "submit_review"
|
| 20 |
|
| 21 |
|
| 22 |
class Severity(str, Enum):
|
|
|
|
|
|
|
| 23 |
INFO = "info"
|
| 24 |
WARNING = "warning"
|
| 25 |
ERROR = "error"
|
|
@@ -27,6 +34,8 @@ class Severity(str, Enum):
|
|
| 27 |
|
| 28 |
|
| 29 |
class IssueCategory(str, Enum):
|
|
|
|
|
|
|
| 30 |
BUG = "bug"
|
| 31 |
SECURITY = "security"
|
| 32 |
PERFORMANCE = "performance"
|
|
@@ -36,62 +45,90 @@ class IssueCategory(str, Enum):
|
|
| 36 |
|
| 37 |
class RewardType(BaseModel):
|
| 38 |
"""
|
| 39 |
-
Structured reward returned by step().
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
|
|
|
| 47 |
"""
|
| 48 |
|
| 49 |
model_config = ConfigDict(frozen=True)
|
| 50 |
|
| 51 |
total: float = Field(..., ge=-1.0, le=1.0)
|
| 52 |
components: dict[str, float] = Field(default_factory=dict)
|
| 53 |
-
passed: bool = Field(False)
|
| 54 |
-
explanation: str = Field("")
|
| 55 |
-
step: int = Field(0)
|
| 56 |
-
terminal: bool = Field(False)
|
| 57 |
|
| 58 |
|
| 59 |
-
class
|
| 60 |
"""
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
"""
|
| 66 |
|
| 67 |
action_type: ActionType = Field(..., description="Type of review action")
|
| 68 |
-
line_number: int | None = Field(
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
|
| 73 |
|
| 74 |
-
class
|
| 75 |
"""
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
The `reward` field mirrors the most recent step
|
| 79 |
-
the authoritative reward is
|
| 80 |
"""
|
| 81 |
|
| 82 |
-
code_snippet: str = Field(default="", description="Python source code to review")
|
| 83 |
task_description: str = Field(default="", description="Review instructions and goals")
|
| 84 |
file_name: str = Field(default="", description="Name of the file being reviewed")
|
| 85 |
-
task_id: int = Field(default=0, description="Current task index")
|
| 86 |
task_difficulty: str = Field(default="ultra-easy", description="Task difficulty label")
|
| 87 |
review_history: list[dict[str, Any]] = Field(
|
| 88 |
default_factory=list,
|
| 89 |
-
description="Ordered list of actions taken so far this episode",
|
| 90 |
)
|
| 91 |
-
step_count: int = Field(default=0, description="Steps taken in current episode")
|
| 92 |
-
max_steps: int = Field(default=6, description="Step budget for this task")
|
| 93 |
-
issues_found_count: int = Field(default=0, description="Number of issues identified so far")
|
| 94 |
-
total_issues: int = Field(default=0, description="Total issues in this task")
|
| 95 |
done: bool = Field(default=False, description="Whether the episode has ended")
|
| 96 |
-
reward: float = Field(
|
| 97 |
metadata: dict[str, Any] = Field(default_factory=dict, description="Extra episode metadata")
|
|
|
|
| 1 |
"""
|
| 2 |
+
Data models for the PRobe Environment.
|
| 3 |
|
| 4 |
An agent reviews Python source files, identifies bugs, security issues,
|
| 5 |
and design problems, then submits a structured review.
|
| 6 |
"""
|
| 7 |
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
from enum import Enum
|
| 11 |
from typing import Any
|
| 12 |
|
|
|
|
| 15 |
|
| 16 |
|
| 17 |
class ActionType(str, Enum):
|
| 18 |
+
"""All actions the agent may take during a review episode."""
|
| 19 |
+
|
| 20 |
ADD_COMMENT = "add_comment"
|
| 21 |
+
GET_CONTEXT = "get_context" # probe a line for deeper causal context
|
| 22 |
REQUEST_CHANGES = "request_changes"
|
| 23 |
APPROVE = "approve"
|
| 24 |
SUBMIT_REVIEW = "submit_review"
|
| 25 |
|
| 26 |
|
| 27 |
class Severity(str, Enum):
|
| 28 |
+
"""Severity levels for review comments."""
|
| 29 |
+
|
| 30 |
INFO = "info"
|
| 31 |
WARNING = "warning"
|
| 32 |
ERROR = "error"
|
|
|
|
| 34 |
|
| 35 |
|
| 36 |
class IssueCategory(str, Enum):
|
| 37 |
+
"""Issue category taxonomy used in review comments."""
|
| 38 |
+
|
| 39 |
BUG = "bug"
|
| 40 |
SECURITY = "security"
|
| 41 |
PERFORMANCE = "performance"
|
|
|
|
| 45 |
|
| 46 |
class RewardType(BaseModel):
|
| 47 |
"""
|
| 48 |
+
Structured reward returned by ``step()``.
|
| 49 |
+
|
| 50 |
+
Attributes:
|
| 51 |
+
total: Final clamped score in ``[-1.0, 1.0]``.
|
| 52 |
+
components: Named sub-scores before clamping (may sum outside ``[-1, 1]``).
|
| 53 |
+
passed: ``True`` when the action produced a clear positive signal.
|
| 54 |
+
explanation: Human-readable breakdown for logging / debugging.
|
| 55 |
+
step: Environment step at which this reward was issued.
|
| 56 |
+
terminal: ``True`` only on the ``SUBMIT_REVIEW`` step.
|
| 57 |
"""
|
| 58 |
|
| 59 |
model_config = ConfigDict(frozen=True)
|
| 60 |
|
| 61 |
total: float = Field(..., ge=-1.0, le=1.0)
|
| 62 |
components: dict[str, float] = Field(default_factory=dict)
|
| 63 |
+
passed: bool = Field(default=False)
|
| 64 |
+
explanation: str = Field(default="")
|
| 65 |
+
step: int = Field(default=0, ge=0)
|
| 66 |
+
terminal: bool = Field(default=False)
|
| 67 |
|
| 68 |
|
| 69 |
+
class ProbeAction(Action):
|
| 70 |
"""
|
| 71 |
+
An action the agent submits during a review episode.
|
| 72 |
+
|
| 73 |
+
Action types:
|
| 74 |
+
ADD_COMMENT — annotate a specific line with a review comment.
|
| 75 |
+
GET_CONTEXT — reveal ±5 lines of context around a line number.
|
| 76 |
+
REQUEST_CHANGES — mark the PR as requiring changes before merge.
|
| 77 |
+
APPROVE — approve the PR (penalised if issues remain).
|
| 78 |
+
SUBMIT_REVIEW — finalise and submit the review (ends the episode).
|
| 79 |
"""
|
| 80 |
|
| 81 |
action_type: ActionType = Field(..., description="Type of review action")
|
| 82 |
+
line_number: int | None = Field(
|
| 83 |
+
default=None,
|
| 84 |
+
ge=1,
|
| 85 |
+
description="1-based source line being commented on or probed",
|
| 86 |
+
)
|
| 87 |
+
comment: str | None = Field(default=None, description="Review comment text")
|
| 88 |
+
severity: Severity | None = Field(default=None, description="Issue severity level")
|
| 89 |
+
category: IssueCategory | None = Field(default=None, description="Issue category")
|
| 90 |
|
| 91 |
|
| 92 |
+
class ProbeObservation(Observation):
|
| 93 |
"""
|
| 94 |
+
The observation returned to the agent after every ``reset()`` / ``step()``.
|
| 95 |
+
|
| 96 |
+
The ``reward`` field mirrors ``RewardType.total`` for the most recent step
|
| 97 |
+
as a convenience; the authoritative reward object is returned by ``step()``.
|
| 98 |
"""
|
| 99 |
|
| 100 |
+
code_snippet: str = Field(default="", description="Python source code to review (mutated each episode)")
|
| 101 |
task_description: str = Field(default="", description="Review instructions and goals")
|
| 102 |
file_name: str = Field(default="", description="Name of the file being reviewed")
|
| 103 |
+
task_id: int = Field(default=0, ge=0, description="Current task index (0–6)")
|
| 104 |
task_difficulty: str = Field(default="ultra-easy", description="Task difficulty label")
|
| 105 |
review_history: list[dict[str, Any]] = Field(
|
| 106 |
default_factory=list,
|
| 107 |
+
description="Ordered list of all actions taken so far this episode",
|
| 108 |
+
)
|
| 109 |
+
step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
|
| 110 |
+
max_steps: int = Field(default=6, ge=1, description="Step budget for this task")
|
| 111 |
+
issues_found_count: int = Field(default=0, ge=0, description="Distinct issues identified so far")
|
| 112 |
+
total_issues: int = Field(default=0, ge=0, description="Total ground-truth issues in this task")
|
| 113 |
+
context_hints: list[str] = Field(
|
| 114 |
+
default_factory=list,
|
| 115 |
+
description="Causal context unlocked by finding key issues — read these before continuing",
|
| 116 |
)
|
| 117 |
done: bool = Field(default=False, description="Whether the episode has ended")
|
| 118 |
+
reward: float = Field(
|
| 119 |
+
default=0.0,
|
| 120 |
+
ge=-1.0,
|
| 121 |
+
le=1.0,
|
| 122 |
+
description="Most recent step reward (mirrors RewardType.total)",
|
| 123 |
+
)
|
| 124 |
metadata: dict[str, Any] = Field(default_factory=dict, description="Extra episode metadata")
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
__all__ = [
|
| 128 |
+
"ActionType",
|
| 129 |
+
"IssueCategory",
|
| 130 |
+
"ProbeAction",
|
| 131 |
+
"ProbeObservation",
|
| 132 |
+
"RewardType",
|
| 133 |
+
"Severity",
|
| 134 |
+
]
|
openenv.yaml
CHANGED
|
@@ -1,32 +1,40 @@
|
|
| 1 |
spec_version: 1
|
| 2 |
-
name:
|
| 3 |
type: space
|
| 4 |
runtime: fastapi
|
| 5 |
app: server.app:app
|
| 6 |
port: 8000
|
| 7 |
|
| 8 |
description: >
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
and design issues, then submits a
|
| 12 |
-
|
|
|
|
| 13 |
|
| 14 |
tasks:
|
| 15 |
- id: 0
|
| 16 |
name: Basic Bug Detection
|
| 17 |
difficulty: easy
|
| 18 |
description: Identify logical bugs in a simple Python utility module
|
| 19 |
max_steps: 15
|
| 20 |
issues: 3
|
| 21 |
|
| 22 |
-
- id:
|
| 23 |
name: Security Vulnerability Review
|
| 24 |
difficulty: medium
|
| 25 |
description: Find security vulnerabilities in an authentication module
|
| 26 |
max_steps: 20
|
| 27 |
issues: 5
|
| 28 |
|
| 29 |
-
- id:
|
| 30 |
name: Full Architecture and Performance Review
|
| 31 |
difficulty: hard
|
| 32 |
description: >
|
|
@@ -35,14 +43,14 @@ tasks:
|
|
| 35 |
max_steps: 30
|
| 36 |
issues: 7
|
| 37 |
|
| 38 |
-
- id:
|
| 39 |
name: Async Worker Review
|
| 40 |
difficulty: medium
|
| 41 |
description: Find concurrency bugs and resource leaks in an async worker
|
| 42 |
max_steps: 20
|
| 43 |
issues: 5
|
| 44 |
|
| 45 |
-
- id:
|
| 46 |
name: Flask API Security Review
|
| 47 |
difficulty: hard
|
| 48 |
description: >
|
|
@@ -51,19 +59,30 @@ tasks:
|
|
| 51 |
max_steps: 30
|
| 52 |
issues: 6
|
| 53 |
|
| 54 |
observation:
|
| 55 |
type: object
|
| 56 |
fields:
|
| 57 |
-
code_snippet: {type: string, description: "Python source to review"}
|
| 58 |
task_description: {type: string, description: "Review instructions"}
|
| 59 |
file_name: {type: string}
|
| 60 |
-
task_id: {type: integer, range: [0,
|
| 61 |
-
task_difficulty: {type: string, values: [easy, medium, hard]}
|
| 62 |
review_history: {type: array, description: "Actions taken so far"}
|
| 63 |
step_count: {type: integer}
|
| 64 |
max_steps: {type: integer}
|
| 65 |
issues_found_count: {type: integer}
|
| 66 |
total_issues: {type: integer}
|
|
|
|
| 67 |
done: {type: boolean}
|
| 68 |
reward: {type: number}
|
| 69 |
|
|
@@ -72,7 +91,7 @@ action:
|
|
| 72 |
fields:
|
| 73 |
action_type:
|
| 74 |
type: enum
|
| 75 |
-
values: [add_comment, request_changes, approve, submit_review]
|
| 76 |
line_number: {type: integer, required: false}
|
| 77 |
comment: {type: string, required: false}
|
| 78 |
severity:
|
|
@@ -88,9 +107,11 @@ reward_design:
|
|
| 88 |
range: [-1.0, 1.0]
|
| 89 |
per_step:
|
| 90 |
issue_found: "up to 0.60 total (weight/total_weight × 0.60 per issue)"
|
| 91 |
-
false_positive: -0.
|
| 92 |
correct_request_changes: +0.05
|
| 93 |
bad_approval: -0.15
|
|
|
|
|
|
|
| 94 |
terminal:
|
| 95 |
coverage_bonus: "coverage × 0.20 (max +0.20)"
|
| 96 |
decision_correct: +0.10
|
|
|
|
| 1 |
spec_version: 1
|
| 2 |
+
name: PRobe
|
| 3 |
type: space
|
| 4 |
runtime: fastapi
|
| 5 |
app: server.app:app
|
| 6 |
port: 8000
|
| 7 |
|
| 8 |
description: >
|
| 9 |
+
PRobe (Pull Request Investigation Environment) — an RL training environment
|
| 10 |
+
where an agent reviews Python source files, identifies bugs, security
|
| 11 |
+
vulnerabilities, performance bottlenecks, and design issues, then submits a
|
| 12 |
+
structured review. Features dynamic code mutation, a GET_CONTEXT probe action,
|
| 13 |
+
and a causal unlock chain for genuine world-model reasoning.
|
| 14 |
|
| 15 |
tasks:
|
| 16 |
- id: 0
|
| 17 |
+
name: Bootstrap Obvious Issues
|
| 18 |
+
difficulty: ultra-easy
|
| 19 |
+
description: Off-by-one and hardcoded credential, both hinted in comments
|
| 20 |
+
max_steps: 6
|
| 21 |
+
issues: 2
|
| 22 |
+
|
| 23 |
+
- id: 1
|
| 24 |
name: Basic Bug Detection
|
| 25 |
difficulty: easy
|
| 26 |
description: Identify logical bugs in a simple Python utility module
|
| 27 |
max_steps: 15
|
| 28 |
issues: 3
|
| 29 |
|
| 30 |
+
- id: 2
|
| 31 |
name: Security Vulnerability Review
|
| 32 |
difficulty: medium
|
| 33 |
description: Find security vulnerabilities in an authentication module
|
| 34 |
max_steps: 20
|
| 35 |
issues: 5
|
| 36 |
|
| 37 |
+
- id: 3
|
| 38 |
name: Full Architecture and Performance Review
|
| 39 |
difficulty: hard
|
| 40 |
description: >
|
|
|
|
| 43 |
max_steps: 30
|
| 44 |
issues: 7
|
| 45 |
|
| 46 |
+
- id: 4
|
| 47 |
name: Async Worker Review
|
| 48 |
difficulty: medium
|
| 49 |
description: Find concurrency bugs and resource leaks in an async worker
|
| 50 |
max_steps: 20
|
| 51 |
issues: 5
|
| 52 |
|
| 53 |
+
- id: 5
|
| 54 |
name: Flask API Security Review
|
| 55 |
difficulty: hard
|
| 56 |
description: >
|
|
|
|
| 59 |
max_steps: 30
|
| 60 |
issues: 6
|
| 61 |
|
| 62 |
+
- id: 6
|
| 63 |
+
name: Causal Secrets Leak Investigation
|
| 64 |
+
difficulty: hard
|
| 65 |
+
description: >
|
| 66 |
+
JWT auth service review with causal unlock chain — finding key issues
|
| 67 |
+
reveals DB schema and nginx config, enabling deeper attack-path reasoning
|
| 68 |
+
max_steps: 35
|
| 69 |
+
issues: 6
|
| 70 |
+
causal_unlocks: true
|
| 71 |
+
|
| 72 |
observation:
|
| 73 |
type: object
|
| 74 |
fields:
|
| 75 |
+
code_snippet: {type: string, description: "Python source to review (mutated each episode)"}
|
| 76 |
task_description: {type: string, description: "Review instructions"}
|
| 77 |
file_name: {type: string}
|
| 78 |
+
task_id: {type: integer, range: [0, 6]}
|
| 79 |
+
task_difficulty: {type: string, values: [ultra-easy, easy, medium, hard]}
|
| 80 |
review_history: {type: array, description: "Actions taken so far"}
|
| 81 |
step_count: {type: integer}
|
| 82 |
max_steps: {type: integer}
|
| 83 |
issues_found_count: {type: integer}
|
| 84 |
total_issues: {type: integer}
|
| 85 |
+
context_hints: {type: array, description: "Causal hints unlocked by finding key issues"}
|
| 86 |
done: {type: boolean}
|
| 87 |
reward: {type: number}
|
| 88 |
|
|
|
|
| 91 |
fields:
|
| 92 |
action_type:
|
| 93 |
type: enum
|
| 94 |
+
values: [add_comment, get_context, request_changes, approve, submit_review]
|
| 95 |
line_number: {type: integer, required: false}
|
| 96 |
comment: {type: string, required: false}
|
| 97 |
severity:
|
|
|
|
| 107 |
range: [-1.0, 1.0]
|
| 108 |
per_step:
|
| 109 |
issue_found: "up to 0.60 total (weight/total_weight × 0.60 per issue)"
|
| 110 |
+
false_positive: -0.05
|
| 111 |
correct_request_changes: +0.05
|
| 112 |
bad_approval: -0.15
|
| 113 |
+
context_probe_near_issue: 0.00
|
| 114 |
+
context_probe_far: -0.01
|
| 115 |
terminal:
|
| 116 |
coverage_bonus: "coverage × 0.20 (max +0.20)"
|
| 117 |
decision_correct: +0.10
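The `reward_design` block can be read as simple arithmetic. Below is a minimal sketch of it, not the repository's grader: `per_issue_reward`, `terminal_reward`, and `clamp` are hypothetical helper names, the example weights are made up, and the sketch assumes a wrong final decision simply earns no bonus (the YAML does not state a terminal penalty).

```python
# Constants copied from the reward_design section of openenv.yaml.
ISSUE_BUDGET = 0.60    # "up to 0.60 total" shared across all issues
COVERAGE_BONUS = 0.20  # terminal: coverage x 0.20
DECISION_BONUS = 0.10  # terminal: decision_correct
REWARD_RANGE = (-1.0, 1.0)


def per_issue_reward(weight: float, total_weight: float) -> float:
    """Share of the 0.60 issue budget earned by finding one issue."""
    return weight / total_weight * ISSUE_BUDGET


def terminal_reward(found: int, total: int, decision_correct: bool) -> float:
    """Coverage bonus plus decision bonus (assumption: no penalty otherwise)."""
    coverage = found / total if total else 0.0
    bonus = DECISION_BONUS if decision_correct else 0.0
    return coverage * COVERAGE_BONUS + bonus


def clamp(value: float) -> float:
    """Keep a reward inside the declared range [-1.0, 1.0]."""
    low, high = REWARD_RANGE
    return max(low, min(high, value))


# Hypothetical task with three issues of unequal weight.
weights = [1.0, 2.0, 1.0]
step_rewards = [per_issue_reward(w, sum(weights)) for w in weights]
episode_total = clamp(sum(step_rewards) + terminal_reward(3, 3, True))
```

Under these assumptions a perfect episode collects 0.60 from issues, 0.20 from coverage, and 0.10 from the decision, so the remaining headroom up to 1.0 comes from per-step terms such as `correct_request_changes`.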
|
openenv_CodeReviewAgent.egg-info/SOURCES.txt
CHANGED
|
@@ -1,4 +1,7 @@
|
|
| 1 |
README.md
|
| 2 |
pyproject.toml
|
| 3 |
./__init__.py
|
| 4 |
./client.py
|
|
@@ -13,4 +16,5 @@ server/CodeReviewAgent_environment.py
|
|
| 13 |
server/__init__.py
|
| 14 |
server/app.py
|
| 15 |
server/grader.py
|
| 16 |
-
server/tasks.py
|
|
|
|
|
|
| 1 |
README.md
|
| 2 |
+
__init__.py
|
| 3 |
+
client.py
|
| 4 |
+
models.py
|
| 5 |
pyproject.toml
|
| 6 |
./__init__.py
|
| 7 |
./client.py
|
|
|
|
| 16 |
server/__init__.py
|
| 17 |
server/app.py
|
| 18 |
server/grader.py
|
| 19 |
+
server/tasks.py
|
| 20 |
+
tests/test_grader.py
|
openenv_PRobe.egg-info/PKG-INFO
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
Metadata-Version: 2.4
|
| 2 |
+
Name: openenv-PRobe
|
| 3 |
+
Version: 0.1.0
|
| 4 |
+
Summary: PRobe — Pull Request Investigation Environment for OpenEnv
|
| 5 |
+
Requires-Python: >=3.10
|
| 6 |
+
Requires-Dist: openenv-core[core]>=0.2.2
|
| 7 |
+
Requires-Dist: openai>=1.0.0
|
| 8 |
+
Requires-Dist: python-dotenv>=1.2.2
|
| 9 |
+
Provides-Extra: dev
|
| 10 |
+
Requires-Dist: pytest>=8.0.0; extra == "dev"
|
| 11 |
+
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
|
openenv_PRobe.egg-info/SOURCES.txt
ADDED
|
@@ -0,0 +1,19 @@
|
| 1 |
+
README.md
|
| 2 |
+
pyproject.toml
|
| 3 |
+
./__init__.py
|
| 4 |
+
./client.py
|
| 5 |
+
./models.py
|
| 6 |
+
openenv_PRobe.egg-info/PKG-INFO
|
| 7 |
+
openenv_PRobe.egg-info/SOURCES.txt
|
| 8 |
+
openenv_PRobe.egg-info/dependency_links.txt
|
| 9 |
+
openenv_PRobe.egg-info/entry_points.txt
|
| 10 |
+
openenv_PRobe.egg-info/requires.txt
|
| 11 |
+
openenv_PRobe.egg-info/top_level.txt
|
| 12 |
+
server/CodeReviewAgent_environment.py
|
| 13 |
+
server/__init__.py
|
| 14 |
+
server/app.py
|
| 15 |
+
server/grader.py
|
| 16 |
+
server/mutator.py
|
| 17 |
+
server/tasks.py
|
| 18 |
+
tests/test_dynamic_world.py
|
| 19 |
+
tests/test_grader.py
|
openenv_PRobe.egg-info/dependency_links.txt
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
|
openenv_PRobe.egg-info/entry_points.txt
ADDED
|
@@ -0,0 +1,2 @@
|
| 1 |
+
[console_scripts]
|
| 2 |
+
server = PRobe.server.app:main
|
openenv_PRobe.egg-info/requires.txt
ADDED
|
@@ -0,0 +1,7 @@
|
| 1 |
+
openenv-core[core]>=0.2.2
|
| 2 |
+
openai>=1.0.0
|
| 3 |
+
python-dotenv>=1.2.2
|
| 4 |
+
|
| 5 |
+
[dev]
|
| 6 |
+
pytest>=8.0.0
|
| 7 |
+
pytest-cov>=4.0.0
|
openenv_PRobe.egg-info/top_level.txt
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
PRobe
|
pyproject.toml
CHANGED
|
@@ -9,9 +9,9 @@ requires = ["setuptools>=45", "wheel"]
|
|
| 9 |
build-backend = "setuptools.build_meta"
|
| 10 |
|
| 11 |
[project]
|
| 12 |
-
name = "openenv-
|
| 13 |
version = "0.1.0"
|
| 14 |
-
description = "
|
| 15 |
requires-python = ">=3.10"
|
| 16 |
dependencies = [
|
| 17 |
# Core OpenEnv runtime (provides FastAPI server + HTTP client types)
|
|
@@ -31,10 +31,15 @@ dev = [
|
|
| 31 |
|
| 32 |
[project.scripts]
|
| 33 |
# Server entry point - enables running via: uv run --project . server
|
| 34 |
-
|
| 35 |
-
server = "CodeReviewAgent.server.app:main"
|
| 36 |
|
| 37 |
[tool.setuptools]
|
| 38 |
include-package-data = true
|
| 39 |
-
packages = ["
|
| 40 |
-
package-dir = { "
|
| 9 |
build-backend = "setuptools.build_meta"
|
| 10 |
|
| 11 |
[project]
|
| 12 |
+
name = "openenv-PRobe"
|
| 13 |
version = "0.1.0"
|
| 14 |
+
description = "PRobe — Pull Request Investigation Environment for OpenEnv"
|
| 15 |
requires-python = ">=3.10"
|
| 16 |
dependencies = [
|
| 17 |
# Core OpenEnv runtime (provides FastAPI server + HTTP client types)
|
|
|
|
| 31 |
|
| 32 |
[project.scripts]
|
| 33 |
# Server entry point - enables running via: uv run --project . server
|
| 34 |
+
server = "PRobe.server.app:main"
|
|
|
|
| 35 |
|
| 36 |
[tool.setuptools]
|
| 37 |
include-package-data = true
|
| 38 |
+
packages = ["PRobe", "PRobe.server"]
|
| 39 |
+
package-dir = { "PRobe" = ".", "PRobe.server" = "server" }
|
| 40 |
+
|
| 41 |
+
[dependency-groups]
|
| 42 |
+
dev = [
|
| 43 |
+
"pytest>=9.0.3",
|
| 44 |
+
"pytest-cov>=7.1.0",
|
| 45 |
+
]
|
server/CodeReviewAgent_environment.py
CHANGED
|
@@ -6,7 +6,18 @@ Episode lifecycle:
|
|
| 6 |
2. step(a) → (Obs, RewardType, done, info) (execute one action)
|
| 7 |
3. state() → dict (full internal snapshot)
|
| 8 |
|
| 9 |
-
Tasks cycle automatically: 0 (ultra-easy) → 1 (easy) → … →
|
| 10 |
|
| 11 |
Thread / task safety: each Environment instance owns its own state.
|
| 12 |
For concurrent GRPO rollouts spin up one instance per worker.
|
|
@@ -15,6 +26,8 @@ For concurrent GRPO rollouts spin up one instance per worker.
|
|
| 15 |
from __future__ import annotations
|
| 16 |
|
| 17 |
import asyncio
|
| 18 |
from typing import Any
|
| 19 |
from uuid import uuid4
|
| 20 |
|
|
@@ -24,30 +37,30 @@ from openenv.core.env_server.types import State
|
|
| 24 |
try:
|
| 25 |
from ..models import (
|
| 26 |
ActionType,
|
| 27 |
-
|
| 28 |
-
|
| 29 |
RewardType,
|
| 30 |
)
|
| 31 |
-
from .grader import CodeReviewGrader
|
|
|
|
| 32 |
from .tasks import TASKS
|
| 33 |
except ImportError:
|
| 34 |
from models import ( # type: ignore[no-redef]
|
| 35 |
ActionType,
|
| 36 |
-
|
| 37 |
-
|
| 38 |
RewardType,
|
| 39 |
)
|
| 40 |
-
from server.grader import CodeReviewGrader # type: ignore[no-redef]
|
| 41 |
-
from server.
|
|
|
|
| 42 |
|
| 43 |
-
|
| 44 |
-
_ZERO_REWARD = RewardType(total=0.0, components={}, passed=False,
|
| 45 |
-
explanation="No signal this step.", step=0, terminal=False)
|
| 46 |
|
| 47 |
|
| 48 |
-
class
|
| 49 |
"""
|
| 50 |
-
|
| 51 |
|
| 52 |
Public interface is fully async. The sync wrappers (reset / step / state)
|
| 53 |
required by openenv's create_app are also provided; they delegate to the
|
|
@@ -76,23 +89,28 @@ class CodereviewagentEnvironment(Environment):
|
|
| 76 |
"review_decision": None,
|
| 77 |
"review_submitted": False,
|
| 78 |
"cumulative_reward": 0.0,
|
| 79 |
}
|
| 80 |
|
| 81 |
# ── Async-native interface (primary) ──────────────────────────────────
|
| 82 |
|
| 83 |
-
async def async_reset(self) ->
|
| 84 |
task_id = self._reset_count % len(TASKS)
|
|
|
|
| 85 |
self._reset_count += 1
|
| 86 |
self._episode_id = str(uuid4())
|
| 87 |
self._step_count = 0
|
| 88 |
-
|
|
|
|
| 89 |
self._grader = CodeReviewGrader(task)
|
| 90 |
self._ep = self._fresh_episode(task)
|
| 91 |
return self._make_obs(reward=0.0, done=False)
|
| 92 |
|
| 93 |
async def async_step(
|
| 94 |
-
self, action:
|
| 95 |
-
) -> tuple[
|
| 96 |
self._step_count += 1
|
| 97 |
task = self._ep["task"]
|
| 98 |
done = False
|
|
@@ -101,6 +119,9 @@ class CodereviewagentEnvironment(Environment):
|
|
| 101 |
if action.action_type == ActionType.ADD_COMMENT:
|
| 102 |
reward_obj = self._handle_add_comment(action)
|
| 103 |
|
|
|
|
|
|
|
|
|
|
| 104 |
elif action.action_type == ActionType.REQUEST_CHANGES:
|
| 105 |
reward_obj = self._handle_request_changes(action)
|
| 106 |
|
|
@@ -165,32 +186,29 @@ class CodereviewagentEnvironment(Environment):
|
|
| 165 |
|
| 166 |
# ── Sync wrappers (openenv / create_app compatibility) ────────────────
|
| 167 |
|
| 168 |
-
def reset(self) ->
|
| 169 |
try:
|
| 170 |
-
|
| 171 |
except RuntimeError:
|
| 172 |
return asyncio.run(self.async_reset())
|
| 173 |
-
# Called from inside a running loop (e.g. pytest-asyncio)
|
| 174 |
-
|
| 175 |
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
|
| 176 |
-
|
| 177 |
-
return fut.result()
|
| 178 |
|
| 179 |
-
def step(self, action:
|
| 180 |
"""
|
| 181 |
Sync step for openenv compatibility.
|
| 182 |
Returns only the Observation (reward is embedded in obs.reward).
|
| 183 |
Use async_step() for the full (obs, reward, done, info) tuple.
|
| 184 |
"""
|
| 185 |
try:
|
| 186 |
-
|
| 187 |
except RuntimeError:
|
| 188 |
obs, _, _, _ = asyncio.run(self.async_step(action))
|
| 189 |
return obs
|
| 190 |
-
import concurrent.futures
|
| 191 |
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
|
| 192 |
-
|
| 193 |
-
obs, _, _, _ = fut.result()
|
| 194 |
return obs
|
| 195 |
|
| 196 |
@property
|
|
@@ -199,7 +217,7 @@ class CodereviewagentEnvironment(Environment):
|
|
| 199 |
|
| 200 |
# ── Action handlers ───────────────────────────────────────────────────
|
| 201 |
|
| 202 |
-
def _handle_add_comment(self, action:
|
| 203 |
entry = {
|
| 204 |
"type": "comment",
|
| 205 |
"line": action.line_number,
|
|
@@ -224,6 +242,9 @@ class CodereviewagentEnvironment(Environment):
|
|
| 224 |
else:
|
| 225 |
explanation = "Comment recorded; no new issue matched."
|
| 226 |
|
|
|
|
|
|
|
|
|
|
| 227 |
return RewardType(
|
| 228 |
total=clamped,
|
| 229 |
components=breakdown,
|
|
@@ -233,7 +254,79 @@ class CodereviewagentEnvironment(Environment):
|
|
| 233 |
terminal=False,
|
| 234 |
)
|
| 235 |
|
| 236 |
-
def
|
| 237 |
self._ep["review_decision"] = "request_changes"
|
| 238 |
self._ep["review_comments"].append(
|
| 239 |
{"type": "request_changes", "text": action.comment}
|
|
@@ -304,9 +397,9 @@ class CodereviewagentEnvironment(Environment):
|
|
| 304 |
|
| 305 |
# ── Observation builder ───────────────────────────────────────────────
|
| 306 |
|
| 307 |
-
def _make_obs(self, reward: float, done: bool) ->
|
| 308 |
task = self._ep["task"]
|
| 309 |
-
return
|
| 310 |
code_snippet=task["code"],
|
| 311 |
task_description=task["description"],
|
| 312 |
file_name=task["file_name"],
|
|
@@ -319,9 +412,11 @@ class CodereviewagentEnvironment(Environment):
|
|
| 319 |
total_issues=len(task["issues"]),
|
| 320 |
done=done,
|
| 321 |
reward=round(max(-1.0, min(1.0, reward)), 4),
|
|
|
|
| 322 |
metadata={
|
| 323 |
"cumulative_reward": self._ep.get("cumulative_reward", 0.0),
|
| 324 |
"review_decision": self._ep.get("review_decision"),
|
| 325 |
"episode_id": self._episode_id,
|
|
|
|
| 326 |
},
|
| 327 |
)
|
|
|
|
| 6 |
2. step(a) → (Obs, RewardType, done, info) (execute one action)
|
| 7 |
3. state() → dict (full internal snapshot)
|
| 8 |
|
| 9 |
+
Tasks cycle automatically: 0 (ultra-easy) → 1 (easy) → … → 6 (causal chain) → 0 …
|
| 10 |
+
|
| 11 |
+
Dynamic world features (v3)
|
| 12 |
+
───────────────────────────
|
| 13 |
+
• Code mutation — each episode applies surface-level variable renames,
|
| 14 |
+
a line shift, and a constant nudge so the agent must
|
| 15 |
+
read the code rather than memorise tokens.
|
| 16 |
+
• GET_CONTEXT — the agent can spend a step probing a specific line to
|
| 17 |
+
receive the surrounding ±5 lines of context.
|
| 18 |
+
• Causal unlocks — finding certain issues appends a new context hint to
|
| 19 |
+
the observation, modelling real-world situations where
|
| 20 |
+
one discovery leads to deeper investigation.
|
| 21 |
|
| 22 |
Thread / task safety: each Environment instance owns its own state.
|
| 23 |
For concurrent GRPO rollouts spin up one instance per worker.
|
|
|
|
| 26 |
from __future__ import annotations
|
| 27 |
|
| 28 |
import asyncio
|
| 29 |
+
import concurrent.futures
|
| 30 |
+
import logging
|
| 31 |
from typing import Any
|
| 32 |
from uuid import uuid4
|
| 33 |
|
|
|
|
| 37 |
try:
|
| 38 |
from ..models import (
|
| 39 |
ActionType,
|
| 40 |
+
ProbeAction,
|
| 41 |
+
ProbeObservation,
|
| 42 |
RewardType,
|
| 43 |
)
|
| 44 |
+
from .grader import CodeReviewGrader, LINE_TOLERANCE
|
| 45 |
+
from .mutator import mutate_task
|
| 46 |
from .tasks import TASKS
|
| 47 |
except ImportError:
|
| 48 |
from models import ( # type: ignore[no-redef]
|
| 49 |
ActionType,
|
| 50 |
+
ProbeAction,
|
| 51 |
+
ProbeObservation,
|
| 52 |
RewardType,
|
| 53 |
)
|
| 54 |
+
from server.grader import CodeReviewGrader, LINE_TOLERANCE # type: ignore[no-redef]
|
| 55 |
+
from server.mutator import mutate_task # type: ignore[no-redef]
|
| 56 |
+
from server.tasks import TASKS # type: ignore[no-redef]
|
| 57 |
|
| 58 |
+
log = logging.getLogger(__name__)
|
|
|
|
|
|
|
| 59 |
|
| 60 |
|
| 61 |
+
class ProbeEnvironment(Environment):
|
| 62 |
"""
|
| 63 |
+
PRobe — Pull Request Investigation Environment.
|
| 64 |
|
| 65 |
Public interface is fully async. The sync wrappers (reset / step / state)
|
| 66 |
required by openenv's create_app are also provided; they delegate to the
|
|
|
|
| 89 |
"review_decision": None,
|
| 90 |
"review_submitted": False,
|
| 91 |
"cumulative_reward": 0.0,
|
| 92 |
+
# causal world-modeling state
|
| 93 |
+
"context_hints": [], # list[str] of unlocked hint texts
|
| 94 |
+
"hints_unlocked": set(), # set[str] of hint keys already fired
|
| 95 |
}
|
| 96 |
|
| 97 |
# ── Async-native interface (primary) ──────────────────────────────────
|
| 98 |
|
| 99 |
+
async def async_reset(self) -> ProbeObservation:
|
| 100 |
task_id = self._reset_count % len(TASKS)
|
| 101 |
+
seed = self._reset_count # unique seed per episode
|
| 102 |
self._reset_count += 1
|
| 103 |
self._episode_id = str(uuid4())
|
| 104 |
self._step_count = 0
|
| 105 |
+
# Apply surface mutation so the agent cannot memorise tokens
|
| 106 |
+
task = mutate_task(TASKS[task_id], seed=seed)
|
| 107 |
self._grader = CodeReviewGrader(task)
|
| 108 |
self._ep = self._fresh_episode(task)
|
| 109 |
return self._make_obs(reward=0.0, done=False)
|
| 110 |
|
| 111 |
async def async_step(
|
| 112 |
+
self, action: ProbeAction
|
| 113 |
+
) -> tuple[ProbeObservation, RewardType, bool, dict[str, Any]]:
|
| 114 |
self._step_count += 1
|
| 115 |
task = self._ep["task"]
|
| 116 |
done = False
|
|
|
|
| 119 |
if action.action_type == ActionType.ADD_COMMENT:
|
| 120 |
reward_obj = self._handle_add_comment(action)
|
| 121 |
|
| 122 |
+
elif action.action_type == ActionType.GET_CONTEXT:
|
| 123 |
+
reward_obj = self._handle_get_context(action)
|
| 124 |
+
|
| 125 |
elif action.action_type == ActionType.REQUEST_CHANGES:
|
| 126 |
reward_obj = self._handle_request_changes(action)
|
| 127 |
|
|
|
|
| 186 |
|
| 187 |
# ── Sync wrappers (openenv / create_app compatibility) ────────────────
|
| 188 |
|
| 189 |
+
def reset(self) -> ProbeObservation: # type: ignore[override]
|
| 190 |
try:
|
| 191 |
+
asyncio.get_running_loop()
|
| 192 |
except RuntimeError:
|
| 193 |
return asyncio.run(self.async_reset())
|
| 194 |
+
# Called from inside a running loop (e.g. pytest-asyncio) -- run in a
|
| 195 |
+
# fresh thread that has its own event loop.
|
| 196 |
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
|
| 197 |
+
return pool.submit(asyncio.run, self.async_reset()).result()
|
|
|
|
| 198 |
|
| 199 |
+
def step(self, action: ProbeAction) -> ProbeObservation: # type: ignore[override]
|
| 200 |
"""
|
| 201 |
Sync step for openenv compatibility.
|
| 202 |
Returns only the Observation (reward is embedded in obs.reward).
|
| 203 |
Use async_step() for the full (obs, reward, done, info) tuple.
|
| 204 |
"""
|
| 205 |
try:
|
| 206 |
+
asyncio.get_running_loop()
|
| 207 |
except RuntimeError:
|
| 208 |
obs, _, _, _ = asyncio.run(self.async_step(action))
|
| 209 |
return obs
|
|
|
|
| 210 |
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
|
| 211 |
+
obs, _, _, _ = pool.submit(asyncio.run, self.async_step(action)).result()
|
|
|
|
| 212 |
return obs
|
| 213 |
|
| 214 |
@property
|
|
|
|
| 217 |
|
| 218 |
# ── Action handlers ───────────────────────────────────────────────────
|
| 219 |
|
| 220 |
+
def _handle_add_comment(self, action: ProbeAction) -> RewardType:
|
| 221 |
entry = {
|
| 222 |
"type": "comment",
|
| 223 |
"line": action.line_number,
|
|
|
|
| 242 |
else:
|
| 243 |
explanation = "Comment recorded; no new issue matched."
|
| 244 |
|
| 245 |
+
# ── Causal unlock: check whether any newly found issue reveals context
|
| 246 |
+
self._unlock_causal_hints(new_finds)
|
| 247 |
+
|
| 248 |
return RewardType(
|
| 249 |
total=clamped,
|
| 250 |
components=breakdown,
|
|
|
|
| 254 |
terminal=False,
|
| 255 |
)
|
| 256 |
|
| 257 |
+
def _unlock_causal_hints(self, newly_found: list[str]) -> None:
|
| 258 |
+
"""Append context hint text for any issue that has an 'unlocks' key."""
|
| 259 |
+
task = self._ep["task"]
|
| 260 |
+
hint_map: dict[str, str] = task.get("context_hints", {})
|
| 261 |
+
for issue in task["issues"]:
|
| 262 |
+
unlock_key = issue.get("unlocks")
|
| 263 |
+
if (
|
| 264 |
+
unlock_key
|
| 265 |
+
and issue["id"] in newly_found
|
| 266 |
+
and unlock_key not in self._ep["hints_unlocked"]
|
| 267 |
+
and unlock_key in hint_map
|
| 268 |
+
):
|
| 269 |
+
self._ep["hints_unlocked"].add(unlock_key)
|
| 270 |
+
self._ep["context_hints"].append(hint_map[unlock_key])
|
| 271 |
+
|
| 272 |
+
def _handle_get_context(
|
| 273 |
+
self, action: ProbeAction
|
| 274 |
+
) -> RewardType:
|
| 275 |
+
"""
|
| 276 |
+
GET_CONTEXT — reveal ±5 lines around the requested line number.
|
| 277 |
+
|
| 278 |
+
Costs a small step penalty (-0.01) to discourage random probing,
|
| 279 |
+
but rewards focused investigation (line near an actual issue: 0.0
|
| 280 |
+
net cost — penalty waived).
|
| 281 |
+
"""
|
| 282 |
+
line_number = action.line_number
|
| 283 |
+
task = self._ep["task"]
|
| 284 |
+
code_lines = task["code"].split("\n")
|
| 285 |
+
|
| 286 |
+
if line_number is None:
|
| 287 |
+
return RewardType(
|
| 288 |
+
total=-0.02,
|
| 289 |
+
components={"invalid_context_probe": -0.02},
|
| 290 |
+
passed=False,
|
| 291 |
+
explanation="GET_CONTEXT requires a line_number.",
|
| 292 |
+
step=self._step_count,
|
| 293 |
+
terminal=False,
|
| 294 |
+
)
|
| 295 |
+
|
| 296 |
+
# Build snippet
|
| 297 |
+
start = max(0, line_number - 6)
|
| 298 |
+
end = min(len(code_lines), line_number + 5)
|
| 299 |
+
snippet_lines = [
|
| 300 |
+
f"{i + 1:3}: {code_lines[i]}" for i in range(start, end)
|
| 301 |
+
]
|
| 302 |
+
snippet = "\n".join(snippet_lines)
|
| 303 |
+
|
| 304 |
+
# Check if probed line is near a real issue (within LINE_TOLERANCE).
|
| 305 |
+
near_issue = any(
|
| 306 |
+
(iss["line_range"][0] - LINE_TOLERANCE) <= line_number <= (iss["line_range"][1] + LINE_TOLERANCE)
|
| 307 |
+
for iss in task["issues"]
|
| 308 |
+
)
|
| 309 |
+
penalty = 0.0 if near_issue else -0.01
|
| 310 |
+
|
| 311 |
+
# Store the context result in review history so the agent can see it
|
| 312 |
+
self._ep["review_comments"].append({
|
| 313 |
+
"type": "context_probe",
|
| 314 |
+
"line": line_number,
|
| 315 |
+
"context": snippet,
|
| 316 |
+
})
|
| 317 |
+
|
| 318 |
+
return RewardType(
|
| 319 |
+
total=penalty,
|
| 320 |
+
components={"context_probe_penalty": penalty},
|
| 321 |
+
passed=near_issue,
|
| 322 |
+
explanation=(
|
| 323 |
+
f"Context around line {line_number}:\n{snippet}"
|
| 324 |
+
),
|
| 325 |
+
step=self._step_count,
|
| 326 |
+
terminal=False,
|
| 327 |
+
)
|
| 328 |
+
|
| 329 |
+
def _handle_request_changes(self, action: ProbeAction) -> RewardType:
|
| 330 |
self._ep["review_decision"] = "request_changes"
|
| 331 |
self._ep["review_comments"].append(
|
| 332 |
{"type": "request_changes", "text": action.comment}
|
|
|
|
| 397 |
|
| 398 |
# ── Observation builder ───────────────────────────────────────────────
|
| 399 |
|
| 400 |
+
def _make_obs(self, reward: float, done: bool) -> ProbeObservation:
|
| 401 |
task = self._ep["task"]
|
| 402 |
+
return ProbeObservation(
|
| 403 |
code_snippet=task["code"],
|
| 404 |
task_description=task["description"],
|
| 405 |
file_name=task["file_name"],
|
|
|
|
| 412 |
total_issues=len(task["issues"]),
|
| 413 |
done=done,
|
| 414 |
reward=round(max(-1.0, min(1.0, reward)), 4),
|
| 415 |
+
context_hints=list(self._ep.get("context_hints", [])),
|
| 416 |
metadata={
|
| 417 |
"cumulative_reward": self._ep.get("cumulative_reward", 0.0),
|
| 418 |
"review_decision": self._ep.get("review_decision"),
|
| 419 |
"episode_id": self._episode_id,
|
| 420 |
+
"mutation_seed": self._ep["task"].get("_mutation_seed"),
|
| 421 |
},
|
| 422 |
)
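The causal unlock logic added in `_unlock_causal_hints` can be tried outside the environment. The following is a standalone sketch under stated assumptions: `unlock_hints` is a hypothetical free function mirroring the method, and the task dictionary (issue ids, `unlocks` keys, hint texts) is invented for illustration rather than taken from `server/tasks.py`.

```python
from typing import Any


def unlock_hints(
    task: dict[str, Any],
    newly_found: list[str],
    hints_unlocked: set[str],
    context_hints: list[str],
) -> list[str]:
    """Append the hint text for each newly found issue that unlocks one.

    A hint fires at most once per episode: its key is recorded in
    hints_unlocked the first time it is appended.
    """
    hint_map: dict[str, str] = task.get("context_hints", {})
    for issue in task["issues"]:
        key = issue.get("unlocks")
        if (
            key
            and issue["id"] in newly_found
            and key not in hints_unlocked
            and key in hint_map
        ):
            hints_unlocked.add(key)
            context_hints.append(hint_map[key])
    return context_hints


# Hypothetical task data; the real ids and hint texts are not shown in the diff.
demo_task = {
    "issues": [
        {"id": "weak_jwt_secret", "unlocks": "db_schema"},
        {"id": "missing_expiry"},  # no "unlocks" key, so it never reveals a hint
    ],
    "context_hints": {"db_schema": "users(id, email, password_hash)"},
}

unlocked: set[str] = set()
hints: list[str] = []
unlock_hints(demo_task, ["weak_jwt_secret"], unlocked, hints)
unlock_hints(demo_task, ["weak_jwt_secret"], unlocked, hints)  # second call is a no-op
```

Keeping the fired keys in a set separate from the hint list is what makes the unlock idempotent, so a repeated comment on the same issue does not duplicate a hint in the observation.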
|
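The slicing in `_handle_get_context` mixes 1-based line numbers with 0-based list indices, so a small standalone sketch may help in checking the ±5 window. `context_window` and `probe_penalty` are hypothetical names, and the `LINE_TOLERANCE` value here is an assumption: the real constant is imported from `server/grader.py` and its value is not visible in this diff.

```python
LINE_TOLERANCE = 2  # assumed value, for illustration only


def context_window(code: str, line_number: int, radius: int = 5) -> str:
    """Return numbered lines within `radius` of a 1-based line number."""
    lines = code.split("\n")
    start = max(0, line_number - radius - 1)   # 0-based index of first line shown
    end = min(len(lines), line_number + radius)  # exclusive 0-based end index
    return "\n".join(f"{i + 1:3}: {lines[i]}" for i in range(start, end))


def probe_penalty(line_number: int, issues: list[dict], tolerance: int = LINE_TOLERANCE) -> float:
    """0.0 if the probed line is near any issue's line_range, else -0.01."""
    near_issue = any(
        iss["line_range"][0] - tolerance <= line_number <= iss["line_range"][1] + tolerance
        for iss in issues
    )
    return 0.0 if near_issue else -0.01


# Twenty dummy lines: "line1" .. "line20".
sample = "\n".join(f"line{n}" for n in range(1, 21))
window = context_window(sample, 10).split("\n")      # lines 5 through 15
edge_window = context_window(sample, 2).split("\n")  # clipped at the top of the file
```

With `radius=5` the default arguments reproduce the `line_number - 6` and `line_number + 5` bounds from the diff; near the start or end of the file the window is clipped rather than padded.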
server/__init__.py
CHANGED
|
@@ -4,8 +4,8 @@
|
|
| 4 |
# This source code is licensed under the BSD-style license found in the
|
| 5 |
# LICENSE file in the root directory of this source tree.
|
| 6 |
|
| 7 |
-
"""
|
| 8 |
|
| 9 |
-
from .CodeReviewAgent_environment import
|
| 10 |
|
| 11 |
-
__all__ = ["
|
|
|
|
| 4 |
# This source code is licensed under the BSD-style license found in the
|
| 5 |
# LICENSE file in the root directory of this source tree.
|
| 6 |
|
| 7 |
+
"""PRobe environment server components."""
|
| 8 |
|
| 9 |
+
from .CodeReviewAgent_environment import ProbeEnvironment
|
| 10 |
|
| 11 |
+
__all__ = ["ProbeEnvironment"]
|
server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc
CHANGED
|
Binary files a/server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc and b/server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc differ
|
|
|
server/__pycache__/__init__.cpython-314.pyc
CHANGED
|
Binary files a/server/__pycache__/__init__.cpython-314.pyc and b/server/__pycache__/__init__.cpython-314.pyc differ
|
|
|
server/__pycache__/grader.cpython-314.pyc
CHANGED
|
Binary files a/server/__pycache__/grader.cpython-314.pyc and b/server/__pycache__/grader.cpython-314.pyc differ
|
|
|
server/__pycache__/mutator.cpython-314.pyc
ADDED
|
Binary file (5.86 kB).
|
|
|
server/__pycache__/tasks.cpython-314.pyc
CHANGED
|
Binary files a/server/__pycache__/tasks.cpython-314.pyc and b/server/__pycache__/tasks.cpython-314.pyc differ
|
|
|
server/app.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
"""
|
| 2 |
-
Async FastAPI server for the
|
| 3 |
|
| 4 |
Endpoints:
|
| 5 |
POST /reset — start a new episode (HTTP session)
|
|
@@ -20,9 +20,11 @@ falls back to a minimal HTML redirect page.
|
|
| 20 |
from __future__ import annotations
|
| 21 |
|
| 22 |
import json
|
|
|
|
| 23 |
from contextlib import asynccontextmanager
|
| 24 |
from typing import Any
|
| 25 |
|
|
|
|
| 26 |
from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
|
| 27 |
from fastapi.responses import HTMLResponse
|
| 28 |
|
|
@@ -33,22 +35,24 @@ except Exception: # pragma: no cover
|
|
| 33 |
_OPENENV_AVAILABLE = False
|
| 34 |
|
| 35 |
try:
|
| 36 |
-
from ..models import
|
| 37 |
-
from .CodeReviewAgent_environment import
|
| 38 |
except ModuleNotFoundError:
|
| 39 |
-
from models import
|
| 40 |
-
from server.CodeReviewAgent_environment import
|
|
|
|
|
|
|
| 41 |
|
| 42 |
|
| 43 |
# ── Shared HTTP session env ───────────────────────────────────────────────────
|
| 44 |
|
| 45 |
-
_http_env:
|
| 46 |
|
| 47 |
|
| 48 |
@asynccontextmanager
|
| 49 |
async def lifespan(application: FastAPI):
|
| 50 |
global _http_env
|
| 51 |
-
_http_env =
|
| 52 |
yield
|
| 53 |
_http_env = None
|
| 54 |
|
|
@@ -58,7 +62,7 @@ async def lifespan(application: FastAPI):
|
|
| 58 |
class StepResponse:
|
| 59 |
def __init__(
|
| 60 |
self,
|
| 61 |
-
obs:
|
| 62 |
reward: RewardType,
|
| 63 |
done: bool,
|
| 64 |
info: dict[str, Any],
|
|
@@ -81,7 +85,7 @@ class StepResponse:
|
|
| 81 |
|
| 82 |
def _build_app() -> FastAPI:
|
| 83 |
application = FastAPI(
|
| 84 |
-
title="
|
| 85 |
description="OpenEnv code-review environment — async FastAPI server.",
|
| 86 |
version="2.0.0",
|
| 87 |
lifespan=lifespan,
|
|
@@ -91,19 +95,22 @@ def _build_app() -> FastAPI:
|
|
| 91 |
|
| 92 |
@application.post("/reset", summary="Start a new episode")
|
| 93 |
async def reset_endpoint() -> dict[str, Any]:
|
| 94 |
-
|
|
|
|
| 95 |
obs = await _http_env.async_reset()
|
| 96 |
return {"observation": obs.model_dump(), "reward": None, "done": False, "info": {}}
|
| 97 |
|
| 98 |
@application.post("/step", summary="Execute one action")
|
| 99 |
-
async def step_endpoint(action:
|
| 100 |
-
|
|
|
|
| 101 |
obs, reward, done, info = await _http_env.async_step(action)
|
| 102 |
return StepResponse(obs, reward, done, info).to_dict()
|
| 103 |
|
| 104 |
@application.get("/state", summary="Current episode state snapshot")
|
| 105 |
async def state_endpoint() -> dict[str, Any]:
|
| 106 |
-
|
|
|
|
| 107 |
return await _http_env.async_state()
|
| 108 |
|
| 109 |
@application.get("/health", summary="Liveness probe")
|
|
@@ -113,8 +120,8 @@ def _build_app() -> FastAPI:
|
|
| 113 |
@application.get("/schema", summary="Action and observation JSON schemas")
|
| 114 |
async def schema() -> dict[str, Any]:
|
| 115 |
return {
|
| 116 |
-
"action":
|
| 117 |
-
"observation":
|
| 118 |
"reward": RewardType.model_json_schema(),
|
| 119 |
}
|
| 120 |
|
|
@@ -123,7 +130,7 @@ def _build_app() -> FastAPI:
|
|
| 123 |
@application.websocket("/ws")
|
| 124 |
async def ws_endpoint(websocket: WebSocket) -> None:
|
| 125 |
await websocket.accept()
|
| 126 |
-
env =
|
| 127 |
try:
|
| 128 |
while True:
|
| 129 |
raw = await websocket.receive_text()
|
|
@@ -138,7 +145,7 @@ def _build_app() -> FastAPI:
|
|
| 138 |
|
| 139 |
elif cmd == "step":
|
| 140 |
try:
|
| 141 |
-
action =
|
| 142 |
except Exception as exc:
|
| 143 |
await websocket.send_json({"type": "error", "detail": str(exc)})
|
| 144 |
continue
|
|
@@ -170,9 +177,9 @@ def _build_app() -> FastAPI:
|
|
| 170 |
@application.get("/web", response_class=HTMLResponse, include_in_schema=False)
|
| 171 |
async def web_ui() -> str:
|
| 172 |
return """
|
| 173 |
-
<!doctype html><html><head><title>
|
| 174 |
-
<body>
|
| 175 |
-
<h2>
|
| 176 |
<p>API docs: <a href="/docs">/docs</a></p>
|
| 177 |
<p>Health: <a href="/health">/health</a></p>
|
| 178 |
<p>Schema: <a href="/schema">/schema</a></p>
|
|
@@ -185,8 +192,7 @@ def _build_app() -> FastAPI:
|
|
| 185 |
app = _build_app()
|
| 186 |
|
| 187 |
|
| 188 |
-
def main(host: str = "0.0.0.0", port: int = 8000) -> None:
|
| 189 |
-
import uvicorn
|
| 190 |
uvicorn.run(app, host=host, port=port)
|
| 191 |
|
| 192 |
|
|
|
|
| 1 |
"""
|
| 2 |
+
Async FastAPI server for the PRobe environment.
|
| 3 |
|
| 4 |
Endpoints:
|
| 5 |
POST /reset — start a new episode (HTTP session)
|
|
|
|
| 20 |
from __future__ import annotations
|
| 21 |
|
| 22 |
import json
|
| 23 |
+
import logging
|
| 24 |
from contextlib import asynccontextmanager
|
| 25 |
from typing import Any
|
| 26 |
|
| 27 |
+
import uvicorn
|
| 28 |
from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
|
| 29 |
from fastapi.responses import HTMLResponse
|
| 30 |
|
|
|
|
| 35 |
_OPENENV_AVAILABLE = False
|
| 36 |
|
| 37 |
try:
|
| 38 |
+
from ..models import ProbeAction, ProbeObservation, RewardType
|
| 39 |
+
from .CodeReviewAgent_environment import ProbeEnvironment
|
| 40 |
except ModuleNotFoundError:
|
| 41 |
+
from models import ProbeAction, ProbeObservation, RewardType # type: ignore
|
| 42 |
+
from server.CodeReviewAgent_environment import ProbeEnvironment # type: ignore
|
| 43 |
+
|
| 44 |
+
log = logging.getLogger(__name__)
|
| 45 |
|
| 46 |
|
| 47 |
# ── Shared HTTP session env ───────────────────────────────────────────────────
|
| 48 |
|
| 49 |
+
_http_env: ProbeEnvironment | None = None
|
| 50 |
|
| 51 |
|
| 52 |
@asynccontextmanager
|
| 53 |
async def lifespan(application: FastAPI):
|
| 54 |
global _http_env
|
| 55 |
+
_http_env = ProbeEnvironment()
|
| 56 |
yield
|
| 57 |
_http_env = None
|
| 58 |
|
|
|
|
| 62 |
class StepResponse:
|
| 63 |
def __init__(
|
| 64 |
self,
|
| 65 |
+
obs: ProbeObservation,
|
| 66 |
reward: RewardType,
|
| 67 |
done: bool,
|
| 68 |
info: dict[str, Any],
|
|
|
|
| 85 |
|
| 86 |
def _build_app() -> FastAPI:
|
| 87 |
application = FastAPI(
|
| 88 |
+
title="PRobe",
|
| 89 |
description="OpenEnv code-review environment — async FastAPI server.",
|
| 90 |
version="2.0.0",
|
| 91 |
lifespan=lifespan,
|
|
|
|
| 95 |
|
| 96 |
@application.post("/reset", summary="Start a new episode")
|
| 97 |
async def reset_endpoint() -> dict[str, Any]:
|
| 98 |
+
if _http_env is None:
|
| 99 |
+
raise HTTPException(status_code=503, detail="Environment not initialised")
|
| 100 |
obs = await _http_env.async_reset()
|
| 101 |
return {"observation": obs.model_dump(), "reward": None, "done": False, "info": {}}
|
| 102 |
|
| 103 |
@application.post("/step", summary="Execute one action")
|
| 104 |
+
async def step_endpoint(action: ProbeAction) -> dict[str, Any]:
|
| 105 |
+
if _http_env is None:
|
| 106 |
+
raise HTTPException(status_code=503, detail="Environment not initialised")
|
| 107 |
obs, reward, done, info = await _http_env.async_step(action)
|
| 108 |
return StepResponse(obs, reward, done, info).to_dict()
|
| 109 |
|
| 110 |
@application.get("/state", summary="Current episode state snapshot")
|
| 111 |
async def state_endpoint() -> dict[str, Any]:
|
| 112 |
+
if _http_env is None:
|
| 113 |
+
raise HTTPException(status_code=503, detail="Environment not initialised")
|
| 114 |
return await _http_env.async_state()
|
| 115 |
|
| 116 |
@application.get("/health", summary="Liveness probe")
|
|
|
|
| 120 |
@application.get("/schema", summary="Action and observation JSON schemas")
|
| 121 |
async def schema() -> dict[str, Any]:
|
| 122 |
return {
|
| 123 |
+
"action": ProbeAction.model_json_schema(),
|
| 124 |
+
"observation": ProbeObservation.model_json_schema(),
|
| 125 |
"reward": RewardType.model_json_schema(),
|
| 126 |
}
|
| 127 |
|
|
|
|
| 130 |
@application.websocket("/ws")
|
| 131 |
async def ws_endpoint(websocket: WebSocket) -> None:
|
| 132 |
await websocket.accept()
|
| 133 |
+
env = ProbeEnvironment()
|
| 134 |
try:
|
| 135 |
while True:
|
| 136 |
raw = await websocket.receive_text()
|
|
|
|
| 145 |
|
| 146 |
elif cmd == "step":
|
| 147 |
try:
|
| 148 |
+
action = ProbeAction(**msg["action"])
|
| 149 |
except Exception as exc:
|
| 150 |
await websocket.send_json({"type": "error", "detail": str(exc)})
|
| 151 |
continue
|
|
|
|
| 177 |
@application.get("/web", response_class=HTMLResponse, include_in_schema=False)
|
| 178 |
async def web_ui() -> str:
|
| 179 |
return """
|
| 180 |
+
<!doctype html><html><head><title>PRobe</title></head>
|
| 181 |
+
<body style="font-family:sans-serif;padding:2rem">
|
| 182 |
+
<h2>PRobe Environment</h2>
|
| 183 |
<p>API docs: <a href="/docs">/docs</a></p>
|
| 184 |
<p>Health: <a href="/health">/health</a></p>
|
| 185 |
<p>Schema: <a href="/schema">/schema</a></p>
|
|
|
|
| 192 |
app = _build_app()
|
| 193 |
|
| 194 |
|
| 195 |
+
def main(host: str = "0.0.0.0", port: int = 8000) -> None: # noqa: S104
|
|
|
|
| 196 |
uvicorn.run(app, host=host, port=port)
|
| 197 |
|
| 198 |
|
server/grader.py
CHANGED
|
@@ -1,27 +1,29 @@
|
|
| 1 |
"""
|
| 2 |
-
Deterministic grader for
|
| 3 |
|
| 4 |
Scoring design
|
| 5 |
--------------
|
| 6 |
During the episode (ADD_COMMENT actions):
|
| 7 |
-
+weight/total_weight *
|
| 8 |
-
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
+coverage *
|
| 12 |
-
+/-
|
| 13 |
-
+efficiency *
|
| 14 |
-
|
| 15 |
-
Maximum achievable total: ~1.0 Minimum:
|
| 16 |
-
|
| 17 |
-
Anti-exploit
|
| 18 |
-
A comment MUST satisfy
|
| 19 |
-
1. keyword_hit
|
| 20 |
-
2. line_hit
|
| 21 |
-
|
| 22 |
-
|
| 23 |
"""
|
| 24 |
|
|
|
|
|
|
|
| 25 |
from typing import Any
|
| 26 |
|
| 27 |
try:
|
|
@@ -29,15 +31,26 @@ try:
|
|
| 29 |
except ImportError:
|
| 30 |
from models import RewardType # type: ignore[no-redef]
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
|
| 35 |
class CodeReviewGrader:
|
|
|
|
|
|
|
| 36 |
def __init__(self, task: dict[str, Any]) -> None:
|
| 37 |
self.task = task
|
| 38 |
self.total_weight: float = sum(iss["weight"] for iss in task["issues"])
|
| 39 |
|
| 40 |
-
#
|
| 41 |
|
| 42 |
def score_comment(
|
| 43 |
self,
|
|
@@ -51,16 +64,19 @@ class CodeReviewGrader:
|
|
| 51 |
Returns:
|
| 52 |
(reward_delta, newly_found_issue_ids, component_breakdown)
|
| 53 |
|
| 54 |
-
Match condition (
|
| 55 |
-
|
|
|
|
| 56 |
"""
|
| 57 |
if not comment:
|
| 58 |
return 0.0, [], {}
|
| 59 |
|
| 60 |
comment_lower = comment.lower()
|
|
|
|
|
|
|
|
|
|
| 61 |
newly_found: list[str] = []
|
| 62 |
issue_credit: float = 0.0
|
| 63 |
-
false_positive_penalty: float = 0.0
|
| 64 |
|
| 65 |
for issue in self.task["issues"]:
|
| 66 |
if issue["id"] in already_found:
|
|
@@ -69,15 +85,15 @@ class CodeReviewGrader:
|
|
| 69 |
keyword_hit = any(kw.lower() in comment_lower for kw in issue["keywords"])
|
| 70 |
line_hit = self._line_in_range(line_number, issue["line_range"])
|
| 71 |
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
credit = (issue["weight"] / self.total_weight) * 0.60
|
| 75 |
newly_found.append(issue["id"])
|
| 76 |
issue_credit += credit
|
| 77 |
|
| 78 |
-
# Penalise substantive comments that matched nothing
|
| 79 |
-
|
| 80 |
-
|
|
|
|
| 81 |
|
| 82 |
total = round(issue_credit + false_positive_penalty, 4)
|
| 83 |
breakdown = {
|
|
@@ -86,7 +102,7 @@ class CodeReviewGrader:
|
|
| 86 |
}
|
| 87 |
return total, newly_found, breakdown
|
| 88 |
|
| 89 |
-
#
|
| 90 |
|
| 91 |
def final_score(
|
| 92 |
self,
|
|
@@ -98,9 +114,13 @@ class CodeReviewGrader:
|
|
| 98 |
) -> RewardType:
|
| 99 |
"""
|
| 100 |
Compute the terminal reward on SUBMIT_REVIEW.
|
| 101 |
-
|
|
|
|
|
|
|
|
|
|
| 102 |
"""
|
| 103 |
-
|
|
|
|
| 104 |
found_weight = sum(
|
| 105 |
iss["weight"]
|
| 106 |
for iss in self.task["issues"]
|
|
@@ -108,12 +128,18 @@ class CodeReviewGrader:
|
|
| 108 |
)
|
| 109 |
coverage = found_weight / self.total_weight if self.total_weight > 0 else 0.0
|
| 110 |
|
| 111 |
-
correct_decision = self.task.get("correct_decision", "request_changes")
|
| 112 |
-
decision_score =
|
|
|
|
|
|
|
| 113 |
|
| 114 |
efficiency = max(0.0, 1.0 - step_count / max_steps)
|
| 115 |
-
efficiency_bonus =
|
| 116 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 117 |
|
| 118 |
raw_total = coverage_bonus + decision_score + efficiency_bonus
|
| 119 |
clamped = round(max(-1.0, min(1.0, raw_total)), 4)
|
|
@@ -123,23 +149,24 @@ class CodeReviewGrader:
|
|
| 123 |
"decision_score": round(decision_score, 4),
|
| 124 |
"efficiency_bonus": efficiency_bonus,
|
| 125 |
}
|
|
|
|
| 126 |
explanation = (
|
| 127 |
-
f"Found {len(unique_found)}/{
|
| 128 |
f"(weighted coverage {coverage:.0%}). "
|
| 129 |
-
f"Decision
|
| 130 |
f"{'correct' if review_decision == correct_decision else 'incorrect'}. "
|
| 131 |
f"Used {step_count}/{max_steps} steps."
|
| 132 |
)
|
| 133 |
return RewardType(
|
| 134 |
total=clamped,
|
| 135 |
components=components,
|
| 136 |
-
passed=review_decision == correct_decision and coverage >=
|
| 137 |
explanation=explanation,
|
| 138 |
step=current_step,
|
| 139 |
terminal=True,
|
| 140 |
)
|
| 141 |
|
| 142 |
-
#
|
| 143 |
|
| 144 |
@staticmethod
|
| 145 |
def _line_in_range(
|
|
|
|
| 1 |
"""
|
| 2 |
+
Deterministic reward grader for PRobe tasks.
|
| 3 |
|
| 4 |
Scoring design
|
| 5 |
--------------
|
| 6 |
During the episode (ADD_COMMENT actions):
|
| 7 |
+
+ weight/total_weight * ISSUE_REWARD_POOL per newly found issue
|
| 8 |
+
- FALSE_POSITIVE_PENALTY per substantive unmatched comment
|
| 9 |
+
|
| 10 |
+
Terminal (SUBMIT_REVIEW):
|
| 11 |
+
+ coverage * COVERAGE_POOL weighted coverage bonus (max COVERAGE_POOL)
|
| 12 |
+
+/- DECISION_REWARD correct / incorrect final decision
|
| 13 |
+
+ efficiency * EFFICIENCY_POOL step-efficiency bonus when coverage >= COVERAGE_THRESHOLD
|
| 14 |
+
|
| 15 |
+
Maximum achievable total: ~1.0 Minimum: -1.0
|
| 16 |
+
|
| 17 |
+
Anti-exploit rules (v3):
|
| 18 |
+
A comment MUST satisfy ALL of:
|
| 19 |
+
1. keyword_hit -- at least one issue keyword appears in the comment text
|
| 20 |
+
2. line_hit -- comment line_number is within +/-LINE_TOLERANCE of the issue
|
| 21 |
+
3. substantive -- comment is longer than MIN_COMMENT_LENGTH characters
|
| 22 |
+
This prevents keyword-spam, wide-net line fishing, and trivial one-word matches.
|
| 23 |
"""
|
| 24 |
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
from typing import Any
|
| 28 |
|
| 29 |
try:
|
|
|
|
| 31 |
except ImportError:
|
| 32 |
from models import RewardType # type: ignore[no-redef]
|
| 33 |
|
| 34 |
+
# -- Grading hyper-parameters ------------------------------------------------
|
| 35 |
+
LINE_TOLERANCE: int = 2 # lines either side of an issue's declared range
|
| 36 |
+
MIN_COMMENT_LENGTH: int = 15 # chars -- comments of this length or shorter earn no credit
|
| 37 |
+
|
| 38 |
+
ISSUE_REWARD_POOL: float = 0.60 # max cumulative credit from ADD_COMMENT
|
| 39 |
+
COVERAGE_POOL: float = 0.20 # terminal coverage bonus ceiling
|
| 40 |
+
DECISION_REWARD: float = 0.10 # +/- for correct/incorrect final decision
|
| 41 |
+
EFFICIENCY_POOL: float = 0.10 # max terminal efficiency bonus
|
| 42 |
+
COVERAGE_THRESHOLD: float = 0.60 # min coverage to unlock efficiency bonus
|
| 43 |
+
FALSE_POSITIVE_PENALTY: float = -0.05 # per substantive unmatched comment
|
| 44 |
|
| 45 |
|
| 46 |
class CodeReviewGrader:
|
| 47 |
+
"""Scores agent actions against a task's ground-truth issue list."""
|
| 48 |
+
|
| 49 |
def __init__(self, task: dict[str, Any]) -> None:
|
| 50 |
self.task = task
|
| 51 |
self.total_weight: float = sum(iss["weight"] for iss in task["issues"])
|
| 52 |
|
| 53 |
+
# -- Per-comment scoring -------------------------------------------------
|
| 54 |
|
| 55 |
def score_comment(
|
| 56 |
self,
|
|
|
|
| 64 |
Returns:
|
| 65 |
(reward_delta, newly_found_issue_ids, component_breakdown)
|
| 66 |
|
| 67 |
+
Match condition (ALL required -- no shortcut)::
|
| 68 |
+
|
| 69 |
+
keyword_hit AND line_hit AND substantive
|
| 70 |
"""
|
| 71 |
if not comment:
|
| 72 |
return 0.0, [], {}
|
| 73 |
|
| 74 |
comment_lower = comment.lower()
|
| 75 |
+
# Compute once -- used for both the credit path and the penalty path.
|
| 76 |
+
substantive: bool = len(comment.strip()) > MIN_COMMENT_LENGTH
|
| 77 |
+
|
| 78 |
newly_found: list[str] = []
|
| 79 |
issue_credit: float = 0.0
|
|
|
|
| 80 |
|
| 81 |
for issue in self.task["issues"]:
|
| 82 |
if issue["id"] in already_found:
|
|
|
|
| 85 |
keyword_hit = any(kw.lower() in comment_lower for kw in issue["keywords"])
|
| 86 |
line_hit = self._line_in_range(line_number, issue["line_range"])
|
| 87 |
|
| 88 |
+
if keyword_hit and line_hit and substantive:
|
| 89 |
+
credit = (issue["weight"] / self.total_weight) * ISSUE_REWARD_POOL
|
|
|
|
| 90 |
newly_found.append(issue["id"])
|
| 91 |
issue_credit += credit
|
| 92 |
|
| 93 |
+
# Penalise substantive comments that matched nothing.
|
| 94 |
+
false_positive_penalty: float = (
|
| 95 |
+
FALSE_POSITIVE_PENALTY if (not newly_found and substantive) else 0.0
|
| 96 |
+
)
|
| 97 |
|
| 98 |
total = round(issue_credit + false_positive_penalty, 4)
|
| 99 |
breakdown = {
|
|
|
|
| 102 |
}
|
| 103 |
return total, newly_found, breakdown
|
| 104 |
|
| 105 |
+
# -- Terminal scoring ----------------------------------------------------
|
| 106 |
|
| 107 |
def final_score(
|
| 108 |
self,
|
|
|
|
| 114 |
) -> RewardType:
|
| 115 |
"""
|
| 116 |
Compute the terminal reward on SUBMIT_REVIEW.
|
| 117 |
+
|
| 118 |
+
Returns a fully-typed RewardType with a per-component breakdown.
|
| 119 |
+
De-duplicates issues_found with stable ordering so results are
|
| 120 |
+
deterministic regardless of insertion order.
|
| 121 |
"""
|
| 122 |
+
# sorted() gives stable ordering so results are reproducible.
|
| 123 |
+
unique_found: list[str] = sorted(set(issues_found))
|
| 124 |
found_weight = sum(
|
| 125 |
iss["weight"]
|
| 126 |
for iss in self.task["issues"]
|
|
|
|
| 128 |
)
|
| 129 |
coverage = found_weight / self.total_weight if self.total_weight > 0 else 0.0
|
| 130 |
|
| 131 |
+
correct_decision: str = self.task.get("correct_decision", "request_changes")
|
| 132 |
+
decision_score = (
|
| 133 |
+
DECISION_REWARD if review_decision == correct_decision else -DECISION_REWARD
|
| 134 |
+
)
|
| 135 |
|
| 136 |
efficiency = max(0.0, 1.0 - step_count / max_steps)
|
| 137 |
+
efficiency_bonus = (
|
| 138 |
+
round(EFFICIENCY_POOL * efficiency, 4)
|
| 139 |
+
if coverage >= COVERAGE_THRESHOLD
|
| 140 |
+
else 0.0
|
| 141 |
+
)
|
| 142 |
+
coverage_bonus = round(coverage * COVERAGE_POOL, 4)
|
| 143 |
|
| 144 |
raw_total = coverage_bonus + decision_score + efficiency_bonus
|
| 145 |
clamped = round(max(-1.0, min(1.0, raw_total)), 4)
|
|
|
|
| 149 |
"decision_score": round(decision_score, 4),
|
| 150 |
"efficiency_bonus": efficiency_bonus,
|
| 151 |
}
|
| 152 |
+
total_issues = len(self.task["issues"])
|
| 153 |
explanation = (
|
| 154 |
+
f"Found {len(unique_found)}/{total_issues} issues "
|
| 155 |
f"(weighted coverage {coverage:.0%}). "
|
| 156 |
+
f"Decision {review_decision!r} was "
|
| 157 |
f"{'correct' if review_decision == correct_decision else 'incorrect'}. "
|
| 158 |
f"Used {step_count}/{max_steps} steps."
|
| 159 |
)
|
| 160 |
return RewardType(
|
| 161 |
total=clamped,
|
| 162 |
components=components,
|
| 163 |
+
passed=review_decision == correct_decision and coverage >= COVERAGE_THRESHOLD,
|
| 164 |
explanation=explanation,
|
| 165 |
step=current_step,
|
| 166 |
terminal=True,
|
| 167 |
)
|
| 168 |
|
| 169 |
+
# -- Helper --------------------------------------------------------------
|
| 170 |
|
| 171 |
@staticmethod
|
| 172 |
def _line_in_range(
|
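The terminal scoring described in the grader.py docstring can be checked by hand. The following is a minimal sketch, assuming only the constants and formulas visible in the diff above; `terminal_reward` is a hypothetical stand-alone helper written for illustration, not a function in the repository.

```python
# Constants copied from the grader.py diff above.
COVERAGE_POOL = 0.20
DECISION_REWARD = 0.10
EFFICIENCY_POOL = 0.10
COVERAGE_THRESHOLD = 0.60


def terminal_reward(
    found_weight: float,
    total_weight: float,
    decision_correct: bool,
    step_count: int,
    max_steps: int,
) -> float:
    """Hypothetical mirror of the arithmetic in final_score()."""
    coverage = found_weight / total_weight if total_weight > 0 else 0.0
    coverage_bonus = round(coverage * COVERAGE_POOL, 4)
    decision_score = DECISION_REWARD if decision_correct else -DECISION_REWARD
    efficiency = max(0.0, 1.0 - step_count / max_steps)
    # The efficiency bonus is only unlocked once coverage reaches the threshold.
    efficiency_bonus = (
        round(EFFICIENCY_POOL * efficiency, 4)
        if coverage >= COVERAGE_THRESHOLD
        else 0.0
    )
    raw_total = coverage_bonus + decision_score + efficiency_bonus
    return round(max(-1.0, min(1.0, raw_total)), 4)


# All issue weight found, correct decision, 7 of 35 steps used:
# 0.20 (coverage) + 0.10 (decision) + 0.10 * 0.8 (efficiency) = 0.38
print(terminal_reward(5.75, 5.75, True, 7, 35))
```

The terminal part tops out at 0.40 (0.20 + 0.10 + 0.10), which together with the 0.60 ISSUE_REWARD_POOL earned through ADD_COMMENT gives the "~1.0" maximum quoted in the docstring.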
server/mutator.py
ADDED
|
@@ -0,0 +1,123 @@
|
|
| 1 |
+
"""
|
| 2 |
+
Code Mutation Engine -- makes the world dynamic.
|
| 3 |
+
|
| 4 |
+
Each call to ``mutate_task()`` returns a deep copy of a task with:
|
| 5 |
+
|
| 6 |
+
1. Variable renaming -- one identifier swapped for a synonym so the agent
|
| 7 |
+
cannot memorise exact token strings between episodes.
|
| 8 |
+
2. Line shifting -- an inert blank line inserted above the first issue,
|
| 9 |
+
shifting all issue line_ranges down by 1. The agent
|
| 10 |
+
must *read* the code each episode.
|
| 11 |
+
3. Constant variance -- numeric literals (e.g. range limits, sleep durations)
|
| 12 |
+
are nudged +/-1 so the agent sees a fresh surface
|
| 13 |
+
without changing the underlying bug.
|
| 14 |
+
|
| 15 |
+
Mutation is fully deterministic given a seed, so training runs are
|
| 16 |
+
reproducible while still being different across episodes.
|
| 17 |
+
|
| 18 |
+
Design principle
|
| 19 |
+
----------------
|
| 20 |
+
Mutations must NEVER change *whether* a bug exists or *which line category*
|
| 21 |
+
it falls in. They only change surface tokens and line positions so the agent
|
| 22 |
+
cannot exploit memorisation.
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
+
import copy
|
| 28 |
+
import random
|
| 29 |
+
import re
|
| 30 |
+
from typing import Any
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
# -- Variable synonym table --------------------------------------------------
|
| 34 |
+
# Maps original identifiers -> list of drop-in synonyms.
|
| 35 |
+
# Only single-token renames that do not affect semantics are listed.
|
| 36 |
+
|
| 37 |
+
_SYNONYMS: dict[str, list[str]] = {
|
| 38 |
+
"total": ["acc", "running_total", "summed"],
|
| 39 |
+
"numbers": ["values", "nums", "items"],
|
| 40 |
+
"result": ["output", "response", "ret"],
|
| 41 |
+
"data": ["payload", "records", "entries"],
|
| 42 |
+
"item": ["record", "entry", "obj"],
|
| 43 |
+
"items": ["records", "entries", "objects"],
|
| 44 |
+
"user": ["account", "principal", "member"],
|
| 45 |
+
"users": ["accounts", "principals", "members"],
|
| 46 |
+
"password": ["passwd", "secret", "credential"],
|
| 47 |
+
"username": ["user_name", "login", "uname"],
|
| 48 |
+
"command": ["cmd", "instruction", "directive"],
|
| 49 |
+
"filename": ["file_name", "fname", "path_name"],
|
| 50 |
+
"url": ["endpoint", "uri", "address"],
|
| 51 |
+
"attempt": ["try_num", "iteration", "retry_idx"],
|
| 52 |
+
"counter": ["count", "tally", "n"],
|
| 53 |
+
"session": ["conn", "http_session", "client"],
|
| 54 |
+
"results": ["findings", "collected", "gathered"],
|
| 55 |
+
"cache": ["store", "lookup", "memo"],
|
| 56 |
+
"transformed": ["processed", "mapped", "converted"],
|
| 57 |
+
}
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def mutate_task(base_task: dict[str, Any], seed: int) -> dict[str, Any]:
|
| 61 |
+
"""
|
| 62 |
+
Return a mutated deep-copy of *base_task* using *seed* for reproducibility.
|
| 63 |
+
|
| 64 |
+
The returned task is structurally identical to the original -- same keys,
|
| 65 |
+
same issue ids, same categories -- but with surface-level code changes and
|
| 66 |
+
adjusted line_ranges.
|
| 67 |
+
"""
|
| 68 |
+
rng = random.Random(seed)
|
| 69 |
+
task: dict[str, Any] = copy.deepcopy(base_task)
|
| 70 |
+
|
| 71 |
+
code: str = task["code"]
|
| 72 |
+
issues: list[dict[str, Any]] = task["issues"]
|
| 73 |
+
|
| 74 |
+
# -- 1. Variable rename --------------------------------------------------
|
| 75 |
+
candidates = [orig for orig in _SYNONYMS if re.search(rf"\b{orig}\b", code)]
|
| 76 |
+
if candidates:
|
| 77 |
+
original = rng.choice(candidates)
|
| 78 |
+
replacement = rng.choice(_SYNONYMS[original])
|
| 79 |
+
# Whole-word replace to avoid partial matches.
|
| 80 |
+
code = re.sub(rf"\b{original}\b", replacement, code)
|
| 81 |
+
# Keep the keyword list in sync so the grader still matches.
|
| 82 |
+
for issue in issues:
|
| 83 |
+
issue["keywords"] = [
|
| 84 |
+
replacement if kw == original else kw
|
| 85 |
+
for kw in issue["keywords"]
|
| 86 |
+
]
|
| 87 |
+
|
| 88 |
+
# -- 2. Line shift -- insert one blank line before the first issue --------
|
| 89 |
+
if issues:
|
| 90 |
+
first_line = min(iss["line_range"][0] for iss in issues)
|
| 91 |
+
# Convert 1-based line number to 0-based list index.
|
| 92 |
+
insert_before = max(0, first_line - 2)
|
| 93 |
+
lines = code.split("\n")
|
| 94 |
+
lines.insert(insert_before, "")
|
| 95 |
+
code = "\n".join(lines)
|
| 96 |
+
# Shift every issue line_range down by 1 to match the new positions.
|
| 97 |
+
for issue in issues:
|
| 98 |
+
start, end = issue["line_range"]
|
| 99 |
+
issue["line_range"] = (start + 1, end + 1)
|
| 100 |
+
|
| 101 |
+
# -- 3. Constant variance -- nudge one numeric literal -------------------
|
| 102 |
+
# Skip any literal whose digits already appear in a "#" comment at or before
|
| 103 |
+
# its position (a conservative filter), to avoid corrupting annotated line references.
|
| 104 |
+
numeric_matches = [
|
| 105 |
+
m
|
| 106 |
+
for m in re.finditer(r"\b([2-9]|[1-9]\d+)\b", code)
|
| 107 |
+
if not re.search(r"#[^\n]*" + re.escape(m.group()), code[: m.end()])
|
| 108 |
+
]
|
| 109 |
+
if numeric_matches:
|
| 110 |
+
chosen = rng.choice(numeric_matches)
|
| 111 |
+
original_val = int(chosen.group())
|
| 112 |
+
delta = rng.choice([-1, 1])
|
| 113 |
+
new_val = max(2, original_val + delta) # never go below 2
|
| 114 |
+
code = code[: chosen.start()] + str(new_val) + code[chosen.end() :]
|
| 115 |
+
|
| 116 |
+
task["code"] = code
|
| 117 |
+
task["issues"] = issues
|
| 118 |
+
# Tag the task so the environment can record mutation metadata.
|
| 119 |
+
task["_mutation_seed"] = seed
|
| 120 |
+
return task
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
__all__ = ["mutate_task"]
|
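The line-shift step in mutator.py is easiest to follow in isolation. Below is a minimal sketch of that step only, assuming the same 1-based `line_range` convention as the task definitions; `shift_issue_lines` is a hypothetical helper name used for illustration and does not exist in the commit.

```python
def shift_issue_lines(
    code: str, line_ranges: list[tuple[int, int]]
) -> tuple[str, list[tuple[int, int]]]:
    """Insert one blank line above the first issue and shift every range by 1.

    Hypothetical stand-alone version of step 2 of mutate_task().
    """
    if not line_ranges:
        return code, line_ranges
    first_line = min(start for start, _ in line_ranges)
    # Same index arithmetic as the diff: one line above the first issue,
    # clamped so an issue on line 1 gets the blank line at the top.
    insert_before = max(0, first_line - 2)
    lines = code.split("\n")
    lines.insert(insert_before, "")
    shifted = [(start + 1, end + 1) for start, end in line_ranges]
    return "\n".join(lines), shifted


source = "a = 1\nb = 2\nbug()\nc = 3"
new_code, new_ranges = shift_issue_lines(source, [(3, 3)])
start, _ = new_ranges[0]
# The shifted range still points at the buggy line.
print(new_code.split("\n")[start - 1])
```

Because the blank line always lands above the first issue, every issue moves down by exactly one line, which is why a uniform `+ 1` on all ranges keeps the grader's `line_range` checks aligned with the mutated code.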
server/tasks.py
CHANGED
|
@@ -716,4 +716,228 @@ def admin_panel():
|
|
| 716 |
],
|
| 717 |
"correct_decision": "request_changes",
|
| 718 |
},
|
| 719 |
]
|
|
|
|
| 716 |
],
|
| 717 |
"correct_decision": "request_changes",
|
| 718 |
},
|
| 719 |
+
|
| 720 |
+
# ── Task 6: Causal Chain — Secrets Leak Investigation ────────────────────
|
| 721 |
+
#
|
| 722 |
+
# WORLD-MODELING DESIGN
|
| 723 |
+
# ─────────────────────
|
| 724 |
+
# This task implements a *causal observation chain*:
|
| 725 |
+
#
|
| 726 |
+
# Phase 1 (lines visible from the start)
|
| 727 |
+
# The agent sees a Flask service with two obvious surface issues.
|
| 728 |
+
# Finding issue A (hardcoded JWT secret) *unlocks* Phase 2 context.
|
| 729 |
+
#
|
| 730 |
+
# Phase 2 (revealed after issue A is found)
|
| 731 |
+
# A hidden DB schema snippet is appended to the observation, exposing
|
| 732 |
+
# a privilege-escalation path that only makes sense once the secret
|
| 733 |
+
# leak is understood. This rewards genuine causal reasoning:
|
| 734 |
+
# "the leaked secret lets an attacker forge admin tokens → they can
|
| 735 |
+
# reach the unguarded /admin/promote endpoint → full privilege
|
| 736 |
+
# escalation."
|
| 737 |
+
#
|
| 738 |
+
# Phase 3 (revealed after issue B is found)
|
| 739 |
+
# After the agent flags the missing rate-limit, the server's nginx
|
| 740 |
+
# config fragment is revealed, showing that /auth is also missing
|
| 741 |
+
# the global IP-allowlist — confirming the attack surface is wider
|
| 742 |
+
# than the code alone suggests.
|
| 743 |
+
#
|
| 744 |
+
# The chained field `"unlocks"` in each issue entry names the context_key
|
| 745 |
+
# that the environment injects into the observation when that issue is found.
|
| 746 |
+
# The environment layer reads this and appends the hint to `context_hints`.
|
| 747 |
+
{
|
| 748 |
+
"id": 6,
|
| 749 |
+
"name": "Causal Secrets Leak Investigation",
|
| 750 |
+
"difficulty": "hard",
|
| 751 |
+
"file_name": "auth_service.py",
|
| 752 |
+
"description": (
|
| 753 |
+
"Review this authentication service carefully. "
|
| 754 |
+
"Some issues unlock additional context about the wider system — "
|
| 755 |
+
"read every new hint you receive before continuing. "
|
| 756 |
+
"Use get_context on any suspicious line to reveal surrounding detail. "
|
| 757 |
+
"Identify all issues, then submit your review."
|
| 758 |
+
),
|
| 759 |
+
"max_steps": 35,
|
| 760 |
+
"code": """\
|
| 761 |
+
import jwt
|
| 762 |
+
import sqlite3
|
| 763 |
+
import time
|
| 764 |
+
from flask import Flask, request, jsonify
|
| 765 |
+
|
| 766 |
+
app = Flask(__name__)
|
| 767 |
+
|
| 768 |
+
# ---- configuration ----------------------------------------------------------
|
| 769 |
+
JWT_SECRET = "super-secret-jwt-key-do-not-share" # line 9: hardcoded secret
|
| 770 |
+
JWT_ALGORITHM = "HS256"
|
| 771 |
+
|
| 772 |
+
# ---- helpers ----------------------------------------------------------------
|
| 773 |
+
|
| 774 |
+
def create_token(user_id: int, role: str) -> str:
|
| 775 |
+
payload = {
|
| 776 |
+
"sub": user_id,
|
| 777 |
+
"role": role,
|
| 778 |
+
"exp": time.time() + 3600,
|
| 779 |
+
}
|
| 780 |
+
return jwt.encode(payload, JWT_SECRET, algorithm=JWT_ALGORITHM)
|
| 781 |
+
|
| 782 |
+
|
| 783 |
+
def verify_token(token: str) -> dict:
|
| 784 |
+
# line 23: algorithm not pinned — accepts ["none"] attack if lib < 2.0
|
| 785 |
+
return jwt.decode(token, JWT_SECRET, algorithms=["HS256", "none"])
|
| 786 |
+
|
| 787 |
+
|
| 788 |
+
# ---- routes -----------------------------------------------------------------
|
| 789 |
+
|
| 790 |
+
@app.route("/auth", methods=["POST"])
|
| 791 |
+
def authenticate():
|
| 792 |
+
\"\"\"Issue a JWT for valid credentials.\"\"\"
|
| 793 |
+
body = request.get_json(force=True)
|
| 794 |
+
uname = body.get("username", "")
|
| 795 |
+
pwd = body.get("password", "")
|
| 796 |
+
# line 33: no rate limiting → brute-force possible
|
| 797 |
+
conn = sqlite3.connect("users.db")
|
| 798 |
+
cursor = conn.cursor()
|
| 799 |
+
# line 37: f-string SQL → injection
|
| 800 |
+
cursor.execute(f"SELECT id, role FROM users WHERE username='{uname}' AND password='{pwd}'")
|
| 801 |
+
row = cursor.fetchone()
|
| 802 |
+
conn.close()
|
| 803 |
+
if row:
|
| 804 |
+
return jsonify({"token": create_token(row[0], row[1])})
|
| 805 |
+
return jsonify({"error": "invalid credentials"}), 401
|
| 806 |
+
|
| 807 |
+
|
| 808 |
+
@app.route("/admin/promote", methods=["POST"])
|
| 809 |
+
def promote_user():
|
| 810 |
+
\"\"\"Promote a user to admin — JWT required.\"\"\"
|
| 811 |
+
token = request.headers.get("Authorization", "").replace("Bearer ", "")
|
| 812 |
+
try:
|
| 813 |
+
claims = verify_token(token)
|
| 814 |
+
except Exception:
|
| 815 |
+
return jsonify({"error": "unauthorized"}), 401
|
| 816 |
+
# line 51: role taken directly from token — no DB re-validation
|
| 817 |
+
if claims.get("role") == "admin":
|
| 818 |
+
target = request.json.get("user_id")
|
| 819 |
+
conn = sqlite3.connect("users.db")
|
| 820 |
+
conn.execute(f"UPDATE users SET role='admin' WHERE id={target}") # line 55: injection
|
| 821 |
+
conn.commit()
|
| 822 |
+
conn.close()
|
| 823 |
+
return jsonify({"promoted": target})
|
| 824 |
+
return jsonify({"error": "forbidden"}), 403
|
| 825 |
+
""",
|
| 826 |
+
# ── Ground-truth issues ───────────────────────────────────────────
|
| 827 |
+
"issues": [
|
| 828 |
+
{
|
| 829 |
+
"id": "hardcoded_jwt_secret",
|
| 830 |
+
"description": "JWT_SECRET is hard-coded; anyone with source access can forge tokens",
|
| 831 |
+
"line_range": (9, 9),
|
| 832 |
+
"keywords": [
|
| 833 |
+
"hardcoded", "hard-coded", "jwt_secret", "secret", "jwt",
|
| 834 |
+
"environment variable", "env var", "os.environ", "forge",
|
| 835 |
+
"hardcode", "token secret",
|
| 836 |
+
],
|
| 837 |
+
"category": "security",
|
| 838 |
+
"severity": "critical",
|
| 839 |
+
"weight": 1.0,
|
| 840 |
+
# Finding this issue unlocks the DB schema context hint
|
| 841 |
+
"unlocks": "db_schema_hint",
|
| 842 |
+
},
|
| 843 |
+
{
|
| 844 |
+
"id": "jwt_none_algorithm",
|
| 845 |
+
"description": (
|
| 846 |
+
"jwt.decode accepts 'none' algorithm — attacker can craft an "
|
| 847 |
+
"unsigned token and bypass signature verification"
|
| 848 |
+
),
|
| 849 |
+
"line_range": (23, 24),
|
| 850 |
+
"keywords": [
|
| 851 |
+
"none", "algorithm", "alg", "unsigned", "bypass",
|
| 852 |
+
"jwt", "signature", "verify", "none algorithm",
|
| 853 |
+
],
|
| 854 |
+
"category": "security",
|
| 855 |
+
"severity": "critical",
|
| 856 |
+
"weight": 1.0,
|
| 857 |
+
},
|
| 858 |
+
{
|
| 859 |
+
"id": "no_rate_limit",
|
| 860 |
+
"description": "/auth endpoint has no rate limiting — susceptible to brute-force",
|
| 861 |
+
"line_range": (33, 34),
|
| 862 |
+
"keywords": [
|
| 863 |
+
"rate limit", "rate-limit", "brute force", "brute-force",
|
| 864 |
+
"throttle", "throttling", "flood", "limit", "attempts",
|
| 865 |
+
],
|
| 866 |
+
"category": "security",
|
| 867 |
+
"severity": "error",
|
| 868 |
+
"weight": 0.75,
|
| 869 |
+
# Finding this unlocks the nginx config hint
|
| 870 |
+
"unlocks": "nginx_config_hint",
|
| 871 |
+
},
|
| 872 |
+
{
|
| 873 |
+
"id": "sql_injection_auth",
|
| 874 |
+
"description": "f-string interpolation in SQL query on /auth → injection",
|
| 875 |
+
"line_range": (37, 38),
|
| 876 |
+
"keywords": [
|
| 877 |
+
"sql injection", "sql", "injection", "f-string", "parameterized",
|
| 878 |
+
"sanitize", "escape", "prepared statement", "placeholder",
|
| 879 |
+
],
|
| 880 |
+
"category": "security",
|
| 881 |
+
"severity": "critical",
|
| 882 |
+
"weight": 1.0,
|
| 883 |
+
},
|
| 884 |
+
{
|
| 885 |
+
"id": "role_from_token_only",
|
| 886 |
+
"description": (
|
| 887 |
+
"Role is read directly from the JWT payload without re-checking the DB — "
|
| 888 |
+
"a forged or stale token grants permanent privilege"
|
| 889 |
+
),
|
| 890 |
+
"line_range": (51, 52),
|
| 891 |
+
"keywords": [
|
| 892 |
+
"role", "token", "db", "database", "re-check", "revalidat",
|
| 893 |
+
"stale", "privilege", "escalation", "claims", "payload",
|
| 894 |
+
"not verified", "trust",
|
| 895 |
+
],
|
| 896 |
+
"category": "security",
|
| 897 |
+
"severity": "critical",
|
| 898 |
+
"weight": 1.0,
|
| 899 |
+
},
|
| 900 |
+
{
|
| 901 |
+
"id": "sql_injection_promote",
|
| 902 |
+
"description": "f-string SQL in /admin/promote UPDATE query → second-order injection",
|
| 903 |
+
"line_range": (55, 55),
|
| 904 |
+
"keywords": [
|
| 905 |
+
"sql injection", "sql", "injection", "f-string", "parameterized",
|
| 906 |
+
"prepared statement", "placeholder", "update", "second order",
|
| 907 |
+
],
|
| 908 |
+
"category": "security",
|
| 909 |
+
"severity": "critical",
|
| 910 |
+
"weight": 1.0,
|
| 911 |
+
},
|
| 912 |
+
],
|
| 913 |
+
"correct_decision": "request_changes",
|
| 914 |
+
# ── Causal context hints — revealed progressively ─────────────────
|
| 915 |
+
# Each value is injected into the observation once the triggering
|
| 916 |
+
# issue is found. The agent must incorporate this new information
|
| 917 |
+
# into its ongoing world model.
|
| 918 |
+
"context_hints": {
|
| 919 |
+
"db_schema_hint": (
|
| 920 |
+
"=== UNLOCKED: Database Schema (users.db) ===\n"
|
| 921 |
+
" CREATE TABLE users (\n"
|
| 922 |
+
" id INTEGER PRIMARY KEY,\n"
|
| 923 |
+
" username TEXT UNIQUE NOT NULL,\n"
|
| 924 |
+
" password TEXT NOT NULL, -- stored as plaintext!\n"
|
| 925 |
+
" role TEXT DEFAULT 'viewer' -- 'viewer' | 'editor' | 'admin'\n"
|
| 926 |
+
" );\n"
|
| 927 |
+
"NOTE: The /admin/promote endpoint can elevate any user to 'admin'. "
|
| 928 |
+
"Combined with a forged JWT (from the leaked secret), an attacker "
|
| 929 |
+
"can reach this endpoint with admin claims and promote themselves."
|
| 930 |
+
),
|
| 931 |
+
"nginx_config_hint": (
|
| 932 |
+
"=== UNLOCKED: nginx reverse-proxy config (nginx.conf excerpt) ===\n"
|
| 933 |
+
" location /auth {\n"
|
| 934 |
+
" proxy_pass http://auth_service:5000;\n"
|
| 935 |
+
" # no ip_allowlist, no limit_req_zone\n"
|
| 936 |
+
" }\n"
|
| 937 |
+
"NOTE: The nginx layer adds no rate-limiting or IP filtering "
|
| 938 |
+
"in front of /auth, confirming the brute-force surface is "
|
| 939 |
+
"fully exposed to the internet."
|
| 940 |
+
),
|
| 941 |
+
},
|
| 942 |
+
},
|
| 943 |
]
|
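The `"unlocks"` / `"context_hints"` pairing in Task 6 implies a simple lookup on the environment side. The environment code is not shown in this part of the diff, so the following is only a sketch under that assumption; `unlocked_hints` and the trimmed-down `demo_task` are hypothetical names introduced for illustration.

```python
from typing import Any


def unlocked_hints(task: dict[str, Any], found_issue_ids: set[str]) -> list[str]:
    """Return the context hints whose triggering issue has been found.

    Hypothetical helper: the real unlock logic lives in the environment layer.
    """
    hints: dict[str, str] = task.get("context_hints", {})
    revealed: list[str] = []
    for issue in task["issues"]:
        key = issue.get("unlocks")  # only some issues carry an unlock key
        if key and issue["id"] in found_issue_ids and key in hints:
            revealed.append(hints[key])
    return revealed


# Trimmed-down stand-in for the Task 6 structure shown above.
demo_task = {
    "issues": [
        {"id": "hardcoded_jwt_secret", "unlocks": "db_schema_hint"},
        {"id": "jwt_none_algorithm"},
        {"id": "no_rate_limit", "unlocks": "nginx_config_hint"},
    ],
    "context_hints": {
        "db_schema_hint": "=== UNLOCKED: Database Schema (users.db) ===",
        "nginx_config_hint": "=== UNLOCKED: nginx reverse-proxy config ===",
    },
}

print(unlocked_hints(demo_task, {"hardcoded_jwt_secret"}))
```

Iterating over the issue list rather than the found set keeps the hint order stable across episodes, which matches the determinism goal stated for the grader and mutator.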
tests/__init__.py
ADDED
|
File without changes
|
tests/__pycache__/__init__.cpython-314.pyc
ADDED
|
Binary file (162 Bytes). View file
|
|
|
tests/__pycache__/test_dynamic_world.cpython-314-pytest-9.0.3.pyc
ADDED
|
Binary file (48.8 kB). View file
|
|
|
tests/__pycache__/test_grader.cpython-314-pytest-9.0.3.pyc
ADDED
|
Binary file (47.6 kB). View file
|
|
|
tests/test_dynamic_world.py
ADDED
|
@@ -0,0 +1,344 @@
|
|
| 1 |
+
"""
|
| 2 |
+
Tests for the dynamic world features:
|
| 3 |
+
- server/mutator.py (code mutation engine)
|
| 4 |
+
- Task 6 (causal chain / progressive observation)
|
| 5 |
+
- GET_CONTEXT action (line-context probing)
|
| 6 |
+
- Causal unlock chain (context_hints injected into observation)
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
import sys
|
| 10 |
+
import os
|
| 11 |
+
import copy
|
| 12 |
+
|
| 13 |
+
import pytest
|
| 14 |
+
|
| 15 |
+
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
| 16 |
+
|
| 17 |
+
from server.mutator import mutate_task
|
| 18 |
+
from server.tasks import TASKS
|
| 19 |
+
from server.grader import CodeReviewGrader
|
| 20 |
+
|
| 21 |
+
# ---------------------------------------------------------------------------
|
| 22 |
+
# Helpers
|
| 23 |
+
# ---------------------------------------------------------------------------
|
| 24 |
+
|
| 25 |
+
TASK6 = TASKS[6] # causal chain task
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def _grader(task):
|
| 29 |
+
return CodeReviewGrader(task)
|
| 30 |
+
|
| 31 |
+
# ===========================================================================
# MUTATOR TESTS
# ===========================================================================

class TestMutator:

    def test_returns_deep_copy(self):
        """mutate_task must not modify the original TASKS entry."""
        original_code = TASKS[1]["code"]
        _ = mutate_task(TASKS[1], seed=0)
        assert TASKS[1]["code"] == original_code

    def test_mutation_seed_tag(self):
        """Mutated task carries _mutation_seed matching the supplied seed."""
        t = mutate_task(TASKS[1], seed=42)
        assert t["_mutation_seed"] == 42

    def test_different_seeds_differ(self):
        """Two different seeds should (almost always) produce different code."""
        t1 = mutate_task(TASKS[1], seed=0)
        t2 = mutate_task(TASKS[1], seed=1)
        # Weak check: this only requires at least one variant to differ from the
        # original code; it does not compare t1 and t2 with each other.
        assert t1["code"] != TASKS[1]["code"] or t2["code"] != TASKS[1]["code"]

    def test_same_seed_is_deterministic(self):
        """Same seed must always produce identical output."""
        t1 = mutate_task(TASKS[2], seed=99)
        t2 = mutate_task(TASKS[2], seed=99)
        assert t1["code"] == t2["code"]
        assert t1["issues"] == t2["issues"]

    def test_line_shift_applied(self):
        """Line shift must move every issue line_range down by exactly 1."""
        original = copy.deepcopy(TASKS[1])
        mutated = mutate_task(TASKS[1], seed=7)
        orig_ranges = [iss["line_range"] for iss in original["issues"]]
        mut_ranges = [iss["line_range"] for iss in mutated["issues"]]
        for orig_r, mut_r in zip(orig_ranges, mut_ranges):
            assert mut_r[0] == orig_r[0] + 1
            assert mut_r[1] == orig_r[1] + 1

    def test_issue_count_preserved(self):
        """Mutation must not add or remove issues."""
        for task in TASKS[:6]:  # skip task 6 here, tested separately
            mutated = mutate_task(task, seed=5)
            assert len(mutated["issues"]) == len(task["issues"])

    def test_issue_ids_preserved(self):
        """Issue ids must be unchanged after mutation."""
        original_ids = [i["id"] for i in TASKS[2]["issues"]]
        mutated_ids = [i["id"] for i in mutate_task(TASKS[2], seed=3)["issues"]]
        assert original_ids == mutated_ids

    def test_grader_still_matches_after_mutation(self):
        """
        The grader must still award credit after mutation.
        Use the off-by-one issue in task 1 — keyword 'range' is always present
        and line_range shifts by exactly 1.
        """
        mutated = mutate_task(TASKS[1], seed=10)
        g = _grader(mutated)
        off_by_one = next(i for i in mutated["issues"] if i["id"] == "off_by_one")
        target_line = off_by_one["line_range"][0]

        score, found, _ = g.score_comment(
            line_number=target_line,
            comment="off-by-one error: range(len + 1) causes IndexError on the last iteration",
            already_found=[],
        )
        assert "off_by_one" in found
        assert score > 0.0

    def test_correct_decision_preserved(self):
        """correct_decision must be unchanged by mutation."""
        for task in TASKS:
            mutated = mutate_task(task, seed=1)
            assert mutated["correct_decision"] == task["correct_decision"]

# ===========================================================================
# TASK 6 STRUCTURE TESTS
# ===========================================================================

class TestTask6Structure:

    def test_task6_exists(self):
        assert len(TASKS) >= 7, "Task 6 (causal chain) must exist in TASKS"

    def test_task6_has_context_hints(self):
        assert "context_hints" in TASK6
        assert len(TASK6["context_hints"]) >= 2

    def test_task6_unlock_keys_present(self):
        """Every 'unlocks' key in an issue must exist in context_hints dict."""
        hints = TASK6["context_hints"]
        for issue in TASK6["issues"]:
            key = issue.get("unlocks")
            if key:
                assert key in hints, f"Issue {issue['id']} unlocks '{key}' but key not in context_hints"

    def test_task6_total_weight_positive(self):
        g = _grader(TASK6)
        assert g.total_weight > 0.0

    def test_task6_has_chained_issues(self):
        """At least two issues must have an 'unlocks' field."""
        unlocking = [i for i in TASK6["issues"] if i.get("unlocks")]
        assert len(unlocking) >= 2

    def test_task6_correct_decision(self):
        assert TASK6["correct_decision"] == "request_changes"

# ===========================================================================
# CAUSAL UNLOCK CHAIN TESTS (environment layer)
# ===========================================================================

class TestCausalUnlock:
    """
    Test the unlock mechanic via the environment's _unlock_causal_hints helper
    and _handle_add_comment pipeline.
    """

    def _make_env(self):
        """Return a fresh environment instance fast-forwarded to task 6."""
        import asyncio
        try:
            from server.CodeReviewAgent_environment import ProbeEnvironment
        except ImportError:
            from CodeReviewAgent_environment import ProbeEnvironment  # type: ignore

        env = ProbeEnvironment()
        # force-set episode to task 6 (bypass cycling for test speed)
        from server.mutator import mutate_task as _mt
        task = _mt(TASK6, seed=0)
        from server.grader import CodeReviewGrader as _G
        env._grader = _G(task)
        env._ep = env._fresh_episode(task)
        return env

    def test_no_hints_at_start(self):
        env = self._make_env()
        assert env._ep["context_hints"] == []

    def test_unlock_fires_after_finding_trigger_issue(self):
        """Finding hardcoded_jwt_secret must append db_schema_hint."""
        env = self._make_env()
        jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
        target_line = jwt_issue["line_range"][0]

        env._step_count = 1
        reward = env._handle_add_comment(
            type("A", (), {
                "line_number": target_line,
                "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable to prevent token forgery",
                "severity": type("S", (), {"value": "critical"})(),
                "category": type("C", (), {"value": "security"})(),
            })()
        )
        assert "hardcoded_jwt_secret" in env._ep["issues_found"]
        assert len(env._ep["context_hints"]) == 1
        assert "db_schema_hint" in env._ep["hints_unlocked"]
        assert "Database Schema" in env._ep["context_hints"][0]

    def test_unlock_fires_only_once(self):
        """The same hint must not be appended twice even if issue found again."""
        env = self._make_env()
        jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
        target_line = jwt_issue["line_range"][0]

        for _ in range(3):
            env._step_count += 1
            env._handle_add_comment(
                type("A", (), {
                    "line_number": target_line,
                    "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable",
                    "severity": type("S", (), {"value": "critical"})(),
                    "category": type("C", (), {"value": "security"})(),
                })()
            )
        assert len(env._ep["context_hints"]) == 1

    def test_second_unlock_fires_independently(self):
        """Finding no_rate_limit must append nginx_config_hint independently."""
        env = self._make_env()
        rate_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "no_rate_limit")
        target_line = rate_issue["line_range"][0]

        env._step_count = 1
        env._handle_add_comment(
            type("A", (), {
                "line_number": target_line,
                "comment": "No rate limiting on /auth endpoint — susceptible to brute-force attacks",
                "severity": type("S", (), {"value": "error"})(),
                "category": type("C", (), {"value": "security"})(),
            })()
        )
        assert "nginx_config_hint" in env._ep["hints_unlocked"]
        assert any("nginx" in h.lower() for h in env._ep["context_hints"])

    def test_both_unlocks_can_fire_in_same_episode(self):
        """Both hints can be unlocked within one episode."""
        env = self._make_env()
        task = env._ep["task"]

        jwt_issue = next(i for i in task["issues"] if i["id"] == "hardcoded_jwt_secret")
        rate_issue = next(i for i in task["issues"] if i["id"] == "no_rate_limit")

        for step, (issue, kw) in enumerate([
            (jwt_issue, "JWT_SECRET is hardcoded — must be loaded from environment variable to prevent forgery"),
            (rate_issue, "No rate limiting on /auth endpoint — susceptible to brute-force attacks"),
        ], start=1):
            env._step_count = step
            env._handle_add_comment(
                type("A", (), {
                    "line_number": issue["line_range"][0],
                    "comment": kw,
                    "severity": type("S", (), {"value": "critical"})(),
                    "category": type("C", (), {"value": "security"})(),
                })()
            )

        assert len(env._ep["context_hints"]) == 2
        assert env._ep["hints_unlocked"] == {"db_schema_hint", "nginx_config_hint"}

    def test_context_hints_appear_in_observation(self):
        """context_hints list must be non-empty in the observation after an unlock."""
        env = self._make_env()
        jwt_issue = next(i for i in env._ep["task"]["issues"] if i["id"] == "hardcoded_jwt_secret")
        env._step_count = 1
        env._handle_add_comment(
            type("A", (), {
                "line_number": jwt_issue["line_range"][0],
                "comment": "JWT_SECRET is hardcoded — must be loaded from environment variable",
                "severity": type("S", (), {"value": "critical"})(),
                "category": type("C", (), {"value": "security"})(),
            })()
        )
        obs = env._make_obs(reward=0.0, done=False)
        assert len(obs.context_hints) == 1
        assert "Database Schema" in obs.context_hints[0]

# ===========================================================================
# GET_CONTEXT ACTION TESTS
# ===========================================================================

class TestGetContext:

    def _make_env(self):
        try:
            from server.CodeReviewAgent_environment import ProbeEnvironment
        except ImportError:
            from CodeReviewAgent_environment import ProbeEnvironment  # type: ignore
        from server.mutator import mutate_task as _mt
        from server.grader import CodeReviewGrader as _G
        env = ProbeEnvironment()
        task = _mt(TASKS[1], seed=0)
        env._grader = _G(task)
        env._ep = env._fresh_episode(task)
        return env

    def test_get_context_near_issue_no_penalty(self):
        """Probing a line near a real issue must cost 0.0."""
        env = self._make_env()
        issue_line = env._ep["task"]["issues"][0]["line_range"][0]
        env._step_count = 1
        reward = env._handle_get_context(
            type("A", (), {"line_number": issue_line})()
        )
        assert reward.total == 0.0
        assert reward.passed is True

    def test_get_context_far_from_issue_costs_penalty(self):
        """Probing a line far from any issue must cost -0.01."""
        env = self._make_env()
        env._step_count = 1
        reward = env._handle_get_context(
            type("A", (), {"line_number": 999})()
        )
        assert reward.total == pytest.approx(-0.01, abs=0.001)
        assert reward.passed is False

    def test_get_context_no_line_number_penalised(self):
        """GET_CONTEXT with no line_number must return -0.02."""
        env = self._make_env()
        env._step_count = 1
        reward = env._handle_get_context(
            type("A", (), {"line_number": None})()
        )
        assert reward.total == pytest.approx(-0.02, abs=0.001)

    def test_get_context_snippet_stored_in_history(self):
        """The context probe must be recorded in review_comments."""
        env = self._make_env()
        env._step_count = 1
        env._handle_get_context(
            type("A", (), {"line_number": 4})()
        )
        probes = [c for c in env._ep["review_comments"] if c.get("type") == "context_probe"]
        assert len(probes) == 1
        assert probes[0]["line"] == 4
        assert "context" in probes[0]

    def test_get_context_snippet_contains_requested_line(self):
        """The returned snippet must reference the requested line number."""
        env = self._make_env()
        env._step_count = 1
        reward = env._handle_get_context(
            type("A", (), {"line_number": 4})()
        )
        # explanation contains the formatted snippet with line numbers
        assert "4:" in reward.explanation or "4 :" in reward.explanation

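The mutator tests above pin down a small contract: the input task is never modified, the same seed gives the same output, every issue's `line_range` moves down by exactly one line, and the result carries a `_mutation_seed` tag. The sketch below is one way that contract could be met. It is not the code in `server/mutator.py` (which this diff does not show in this section); `mutate_task_sketch`, the banner line it prepends, and `example_task` are hypothetical names used only for illustration, and the variable-rename and constant-variance steps mentioned in the commit message are left out.

```python
import copy
import random


def mutate_task_sketch(task: dict, seed: int) -> dict:
    """Hypothetical stand-in for mutate_task, covering only the line shift."""
    rng = random.Random(seed)        # seeded RNG: same seed, same output
    mutated = copy.deepcopy(task)    # the caller's task dict is left untouched

    # Assumption: prepend one seed-dependent comment line so the code moves
    # down by exactly one line (the real mutator may insert a blank line).
    banner = f"# variant {rng.randint(0, 9999)}"
    mutated["code"] = banner + "\n" + task["code"]

    # Keep the ground truth aligned with the shifted code.
    for issue in mutated["issues"]:
        start, end = issue["line_range"]
        issue["line_range"] = [start + 1, end + 1]

    mutated["_mutation_seed"] = seed
    return mutated


# Hypothetical task in the shape the tests read: code, issues, correct_decision.
example_task = {
    "code": "def f(data):\n    for i in range(len(data) + 1):\n        print(data[i])\n",
    "issues": [{"id": "off_by_one", "line_range": [2, 2]}],
    "correct_decision": "request_changes",
}
variant = mutate_task_sketch(example_task, seed=7)
```

Shifting the code and the issue ranges in the same step is what lets a grader built from the mutated task still match a comment placed on the new line, which is the property `test_grader_still_matches_after_mutation` checks.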
tests/test_grader.py
ADDED
|
@@ -0,0 +1,397 @@
"""
Tests for CodeReviewGrader — validates all 5 RL attack scenarios plus
edge cases for the three anti-exploit fixes made in grader.py.

Attack targets (from the task spec):
    Lazy / vague output → 0.00 – 0.15
    Average output      → 0.30 – 0.50
    Good output         → 0.60 – 0.80
    Perfect output      → 0.85 – 1.00
    Wrong bug reported  → penalty / 0.00

Coverage:
    1. Lazy attack
    2. Vague attack
    3. Wrong-bug / hallucination attack
    4. Perfect output
    5. Base-model (average) output
    6. LINE_TOLERANCE boundary (fix 1)
    7. Minimum comment length guard (fix 2)
    8. False-positive penalty value (fix 3)
    9. final_score — full coverage + correct decision
    10. final_score — zero coverage + wrong decision
    11. final_score — partial coverage
    12. Duplicate SUBMIT_REVIEW penalty (environment layer)
    13. already_found deduplication
    14. None / empty comment guard
"""

import sys
import os

import pytest

# Ensure the project root (containing the `server` package) is on the path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from server.grader import CodeReviewGrader, LINE_TOLERANCE
from server.tasks import TASKS


# ── Fixtures ──────────────────────────────────────────────────────────────────

@pytest.fixture
def task0():
    """Ultra-easy bootstrap task (2 issues, equal weight 1.0 each)."""
    return TASKS[0]


@pytest.fixture
def task1():
    """Easy task (3 issues)."""
    return TASKS[1]


@pytest.fixture
def grader0(task0):
    return CodeReviewGrader(task0)


@pytest.fixture
def grader1(task1):
    return CodeReviewGrader(task1)

# ── Sanity ────────────────────────────────────────────────────────────────────

def test_line_tolerance_value():
    """LINE_TOLERANCE must be 2 after the anti-exploit fix."""
    assert LINE_TOLERANCE == 2


# ── 1. Lazy attack ────────────────────────────────────────────────────────────

def test_lazy_attack_no_credit(grader0):
    """Generic comment with no matching keyword earns only false-positive penalty."""
    score, found, _ = grader0.score_comment(
        line_number=4,
        # deliberately avoids all task-0 keywords (off-by-one, index, range,
        # bug, security, password, credential, hardcoded, env, secret, etc.)
        comment="This function could probably be improved with some refactoring.",
        already_found=[],
    )
    assert found == []
    assert score <= 0.0  # pure false-positive penalty, no credit


def test_lazy_attack_wrong_line(grader0):
    """Keyword present but line number far from issue — no credit awarded."""
    score, found, _ = grader0.score_comment(
        line_number=99,  # far from issue at line 4
        comment="off-by-one indexerror range",
        already_found=[],
    )
    assert found == []
    assert score < 0.0  # false-positive penalty applied


# ── 2. Vague attack ───────────────────────────────────────────────────────────

def test_vague_attack_category_only(grader0):
    """Vague category-level remark on the correct line with no specific keyword: no credit."""
    score, found, _ = grader0.score_comment(
        line_number=4,
        comment="This code has a logical issue.",
        already_found=[],
    )
    assert found == []
    assert score <= 0.0


# ── 3. Wrong-bug / hallucination attack ──────────────────────────────────────

def test_wrong_bug_on_correct_line_wrong_keyword(grader0):
    """Hallucinated keyword on the correct line must not earn credit."""
    score, found, _ = grader0.score_comment(
        line_number=4,
        comment="This has a performance bottleneck and memory leak issue here.",
        already_found=[],
    )
    # 'performance' / 'memory' are not in bootstrap_off_by_one keywords
    assert found == []
    assert score <= 0.0


def test_wrong_bug_wrong_line_right_keyword(grader0):
    """Right keyword, wrong line — line_hit must block the credit."""
    score, found, _ = grader0.score_comment(
        line_number=50,  # nowhere near line 4 or 11
        comment="off-by-one indexerror range len + 1",
        already_found=[],
    )
    assert found == []
    assert score <= 0.0

# ── 4. Perfect output ─────────────────────────────────────────────────────────

def test_perfect_comment_task0_issue1(grader0):
    """Exact keyword + exact line → full credit for issue 1."""
    score, found, breakdown = grader0.score_comment(
        line_number=4,
        comment="Off-by-one error: range(len(data) + 1) causes IndexError on the last iteration.",
        already_found=[],
    )
    assert "bootstrap_off_by_one" in found
    assert breakdown["issue_credit"] == pytest.approx(0.30, abs=0.01)
    assert score > 0.0


def test_perfect_comment_task0_issue2(grader0):
    """Exact keyword + exact line → full credit for issue 2."""
    score, found, _ = grader0.score_comment(
        line_number=11,
        comment="Hardcoded password / credential in source — move to environment variable.",
        already_found=[],
    )
    assert "bootstrap_hardcoded_cred" in found
    assert score > 0.0


def test_perfect_final_score_task0(grader0):
    """Full coverage + correct decision gives max terminal reward.

    final_score() is the TERMINAL component only (coverage 0.20 + decision 0.10
    + efficiency 0.10 = max 0.40). The per-comment 0.60 accumulates separately
    during the episode via score_comment(). Assert the realistic terminal range.
    """
    reward = grader0.final_score(
        issues_found=["bootstrap_off_by_one", "bootstrap_hardcoded_cred"],
        review_decision="request_changes",
        step_count=4,
        max_steps=6,
        current_step=4,
    )
    # coverage_bonus=0.20 + decision_score=0.10 + efficiency_bonus>0 → ~0.33-0.40
    assert reward.total >= 0.30
    assert reward.components["coverage_bonus"] == pytest.approx(0.20, abs=0.01)
    assert reward.components["decision_score"] == pytest.approx(0.10, abs=0.001)
    assert reward.passed is True


# ── 5. Base-model (average) output ───────────────────────────────────────────

def test_base_model_finds_one_of_two(grader0):
    """Agent that finds 1/2 issues correctly should score in the average range."""
    # Step 1: correct comment finding issue 1
    score1, found1, _ = grader0.score_comment(
        line_number=4,
        comment="range(len(data) + 1) has an off-by-one bug causing IndexError.",
        already_found=[],
    )
    # Step 2: vague comment on issue 2 line — no keyword match
    score2, found2, _ = grader0.score_comment(
        line_number=11,
        comment="This line looks like it might have an issue with the connection string.",
        already_found=found1,
    )
    reward = grader0.final_score(
        issues_found=found1 + found2,
        review_decision="request_changes",
        step_count=4,
        max_steps=6,
        current_step=4,
    )
    # 50 % coverage → coverage_bonus=0.10, correct_decision=+0.10 → 0.20 total
    # Well below the 0.85 perfect ceiling, above 0.10 lazy floor
    assert 0.15 <= reward.total <= 0.55

# ── 6. LINE_TOLERANCE boundary ────────────────────────────────────────────────

def test_line_just_inside_tolerance(grader0):
    """line_number at start - LINE_TOLERANCE must still match."""
    issue_start = TASKS[0]["issues"][0]["line_range"][0]  # 4
    score, found, _ = grader0.score_comment(
        line_number=issue_start - LINE_TOLERANCE,  # exactly at boundary
        comment="off-by-one indexerror range(len + 1) causes crash here",
        already_found=[],
    )
    assert "bootstrap_off_by_one" in found


def test_line_just_outside_tolerance(grader0):
    """line_number at start - LINE_TOLERANCE - 1 must NOT match."""
    issue_start = TASKS[0]["issues"][0]["line_range"][0]  # 4
    score, found, _ = grader0.score_comment(
        line_number=issue_start - LINE_TOLERANCE - 1,  # one beyond boundary
        comment="off-by-one indexerror range(len + 1) causes crash here",
        already_found=[],
    )
    assert found == []
    assert score <= 0.0


# ── 7. Minimum comment length guard ──────────────────────────────────────────

def test_short_keyword_comment_no_credit(grader0):
    """A comment ≤ 15 chars containing a matching keyword must NOT earn credit."""
    score, found, _ = grader0.score_comment(
        line_number=4,
        comment="indexerror",  # 10 chars — below 15-char threshold
        already_found=[],
    )
    assert found == []
    # short comment → neither credit nor false-positive penalty
    assert score == 0.0


def test_short_comment_no_false_positive_penalty(grader0):
    """A short comment that matches nothing must NOT be penalised (too trivial)."""
    score, found, _ = grader0.score_comment(
        line_number=99,
        comment="hmm",  # 3 chars
        already_found=[],
    )
    assert found == []
    assert score == 0.0


def test_borderline_length_comment(grader0):
    """A 17-char comment (just above the 15-char threshold) with keyword + correct line earns credit."""
    score, found, _ = grader0.score_comment(
        line_number=4,
        comment="off-by-one range!",  # 17 chars, > 15
        already_found=[],
    )
    assert "bootstrap_off_by_one" in found
    assert score > 0.0


# ── 8. False-positive penalty value ──────────────────────────────────────────

def test_false_positive_penalty_magnitude(grader0):
    """Each wrong substantive comment must cost exactly -0.05."""
    score, found, breakdown = grader0.score_comment(
        line_number=99,
        comment="This line has a performance issue with the loop structure.",
        already_found=[],
    )
    assert found == []
    assert breakdown["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)


def test_multiple_false_positives_accumulate(grader0):
    """Two wrong comments should each attract -0.05 independently."""
    s1, _, bd1 = grader0.score_comment(
        line_number=99,
        comment="This line has a performance issue with the loop structure.",
        already_found=[],
    )
    s2, _, bd2 = grader0.score_comment(
        line_number=88,
        comment="There is a design problem with this database call here.",
        already_found=[],
    )
    assert bd1["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
    assert bd2["false_positive_penalty"] == pytest.approx(-0.05, abs=0.001)
    # Combined penalty is -0.10 — within the -0.1 to -0.2 spec for 2 wrong claims
    assert s1 + s2 == pytest.approx(-0.10, abs=0.001)

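Sections 1 to 8 of these tests together describe how a single comment is scored: it must be longer than 15 characters, land within `LINE_TOLERANCE` lines of an issue, and contain one of that issue's keywords; a substantive comment that matches nothing costs -0.05. A minimal sketch of that rule follows, assuming the 0.60 per-comment pool is split by issue weight (which reproduces the 0.30 `issue_credit` asserted for task 0). It is not the real `CodeReviewGrader.score_comment`: `score_comment_sketch`, `MIN_COMMENT_CHARS`, the `keywords` and `weight` field names, and the two example issues are assumptions.

```python
LINE_TOLERANCE = 2              # value asserted by test_line_tolerance_value
MIN_COMMENT_CHARS = 15          # assumed name; comments of <= 15 chars are ignored
FALSE_POSITIVE_PENALTY = -0.05  # cost of a substantive comment that matches nothing
PER_COMMENT_POOL = 0.60         # per-comment share named in the tests' docstring


def score_comment_sketch(issues, line_number, comment, already_found):
    """Return (score, newly_found_ids) for one review comment."""
    text = (comment or "").lower()
    if len(text) <= MIN_COMMENT_CHARS:
        return 0.0, []                   # too short: no credit and no penalty

    total_weight = sum(issue["weight"] for issue in issues)
    score, found = 0.0, []
    for issue in issues:
        if issue["id"] in already_found:
            continue                     # an issue is credited at most once
        start, end = issue["line_range"]
        line_hit = start - LINE_TOLERANCE <= line_number <= end + LINE_TOLERANCE
        keyword_hit = any(kw in text for kw in issue["keywords"])
        if line_hit and keyword_hit:     # both conditions are required
            found.append(issue["id"])
            score += PER_COMMENT_POOL * issue["weight"] / total_weight

    if not found:
        score += FALSE_POSITIVE_PENALTY
    return score, found


# Hypothetical issue data loosely modelled on the bootstrap task (task 0).
issues = [
    {"id": "bootstrap_off_by_one", "line_range": [4, 4], "weight": 1.0,
     "keywords": ["off-by-one", "indexerror", "range"]},
    {"id": "bootstrap_hardcoded_cred", "line_range": [11, 11], "weight": 1.0,
     "keywords": ["password", "credential", "hardcoded"]},
]
score, found = score_comment_sketch(
    issues, 4, "Off-by-one error: range(len(data) + 1) causes IndexError.", []
)
```

Requiring both a line hit and a keyword hit is what defeats the lazy, vague, and wrong-line attacks in sections 1 to 3: each of them satisfies at most one of the two conditions.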
|
| 302 |
+
# ── 9. final_score — full coverage + correct decision ─────────────────────────
|
| 303 |
+
|
| 304 |
+
def test_final_score_full_coverage_correct_decision(grader1):
|
| 305 |
+
"""100% coverage + correct decision → max terminal reward ~0.37-0.40."""
|
| 306 |
+
all_ids = [iss["id"] for iss in TASKS[1]["issues"]]
|
| 307 |
+
reward = grader1.final_score(
|
| 308 |
+
issues_found=all_ids,
|
| 309 |
+
review_decision="request_changes",
|
| 310 |
+
step_count=5,
|
| 311 |
+
max_steps=15,
|
| 312 |
+
current_step=5,
|
| 313 |
+
)
|
| 314 |
+
assert reward.total >= 0.30
|
| 315 |
+
assert reward.passed is True
|
| 316 |
+
assert reward.terminal is True
|
| 317 |
+
assert reward.components["coverage_bonus"] == pytest.approx(0.20, abs=0.01)
|
| 318 |
+
assert reward.components["decision_score"] == pytest.approx(0.10, abs=0.001)
|
| 319 |
+
|
| 320 |
+
+# ── 10. final_score — zero coverage + wrong decision ─────────────────────────
+
+def test_final_score_zero_coverage_wrong_decision(grader1):
+    reward = grader1.final_score(
+        issues_found=[],
+        review_decision="approve",  # wrong — should be request_changes
+        step_count=15,
+        max_steps=15,
+        current_step=15,
+    )
+    assert reward.total <= 0.0
+    assert reward.passed is False
+    assert reward.components["decision_score"] == pytest.approx(-0.10, abs=0.001)
+    assert reward.components["coverage_bonus"] == pytest.approx(0.0, abs=0.001)
+
+
+# ── 11. final_score — partial coverage ───────────────────────────────────────
+
+def test_final_score_partial_coverage(grader1):
+    """Finding 1 out of 3 issues (weight 1.0 / 2.5 total) with correct decision."""
+    reward = grader1.final_score(
+        issues_found=["off_by_one"],  # weight 1.0 out of 2.5 total
+        review_decision="request_changes",
+        step_count=10,
+        max_steps=15,
+        current_step=10,
+    )
+    # coverage = 1.0/2.5 = 0.40 → coverage_bonus = 0.08
+    # decision_score = +0.10
+    # efficiency_bonus = 0.0 (coverage < 0.60)
+    # total = 0.18
+    assert 0.10 <= reward.total <= 0.30
+    assert reward.passed is False  # coverage < 60 %
+
+
+# ── 12. Already-found deduplication ──────────────────────────────────────────
+
+def test_already_found_not_double_credited(grader0):
+    """An issue already in already_found must not be credited again."""
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="off-by-one indexerror range(len + 1) causes crash on last item",
+        already_found=["bootstrap_off_by_one"],  # pre-marked as found
+    )
+    assert "bootstrap_off_by_one" not in found
+    assert score <= 0.0  # false-positive penalty since nothing was matched
+
+
+# ── 13. None / empty comment guard ───────────────────────────────────────────
+
+def test_none_comment_returns_zero(grader0):
+    score, found, breakdown = grader0.score_comment(
+        line_number=4,
+        comment=None,
+        already_found=[],
+    )
+    assert score == 0.0
+    assert found == []
+    assert breakdown == {}
+
+
+def test_empty_comment_returns_zero(grader0):
+    score, found, _ = grader0.score_comment(
+        line_number=4,
+        comment="",
+        already_found=[],
+    )
+    assert score == 0.0
+    assert found == []
+
+
+# ── 14. Task weight totals are non-zero (guards __init__) ────────────────────
+
+def test_all_task_total_weights_positive():
+    for task in TASKS:
+        grader = CodeReviewGrader(task)
+        assert grader.total_weight > 0.0, f"Task {task['id']} has zero total weight"
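The terminal-reward arithmetic written out in the comments of `test_final_score_partial_coverage` can be checked by hand. The sketch below is only an illustration under stated assumptions: it assumes the coverage bonus scales linearly with weighted coverage up to 0.20 (consistent with the 0.20 asserted at full coverage and the 0.08 noted at 0.40 coverage), and it leaves the efficiency bonus at zero because the tests only state that it is zero below 0.60 coverage. The helper name `sketch_terminal_total` is hypothetical and is not part of `server/grader.py`.

```python
def sketch_terminal_total(found_weight: float, total_weight: float,
                          decision_correct: bool) -> float:
    """Hypothetical re-derivation of the totals quoted in the test comments."""
    # Weighted coverage: weight of issues found over total issue weight.
    coverage = found_weight / total_weight
    # ASSUMPTION: linear scaling, 0.20 at 100% coverage.
    coverage_bonus = 0.20 * coverage
    # +0.10 for the correct review decision, -0.10 for the wrong one (per the asserts).
    decision_score = 0.10 if decision_correct else -0.10
    # Efficiency bonus omitted: its value above 0.60 coverage is not shown in the tests.
    return round(coverage_bonus + decision_score, 4)


# Partial coverage case from the test: 1.0 of 2.5 weight found, correct decision.
print(sketch_terminal_total(1.0, 2.5, True))
# Zero coverage with the wrong decision.
print(sketch_terminal_total(0.0, 2.5, False))
```

Under these assumptions the partial-coverage case lands at 0.18, inside the `0.10 <= reward.total <= 0.30` band the test asserts, and the zero-coverage case is negative, matching `reward.total <= 0.0`.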
uv.lock
CHANGED
@@ -882,6 +882,7 @@ dependencies = [
     { name = "gradio-client" },
     { name = "typer" },
 ]
+sdist = { url = "https://files.pythonhosted.org/packages/ce/86/c9694b7cfada5780e75769e60dc161a161f4dd7fc91b61db5e3a3338bef9/hf_gradio-0.4.1.tar.gz", hash = "sha256:a017d942618f0d495a58ee4563047fa04bef614c00e0cb789a9a6d0633cffa7b", size = 6560, upload-time = "2026-04-22T14:01:32.334Z" }
 wheels = [
     { url = "https://files.pythonhosted.org/packages/30/2d/afff2ee87e75d8eb85c92bb8cf0e15b05c23c2ebd8fd8dec781d8601ed7f/hf_gradio-0.4.1-py3-none-any.whl", hash = "sha256:76b8cb8be6abe62d74c1ad2d35b42f0629db89aa9e1a8d033cecfe7c856eeab3", size = 4482, upload-time = "2026-04-17T19:53:31.827Z" },
 ]
@@ -1571,32 +1572,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/12/cf/03675d8bd8ecbf4445504d8071adab19f5f993676795708e36402ab38263/openapi_pydantic-0.5.1-py3-none-any.whl", hash = "sha256:a3a09ef4586f5bd760a8df7f43028b60cafb6d9f61de2acba9574766255ab146", size = 96381, upload-time = "2025-01-08T19:29:25.275Z" },
 ]
 
-[[package]]
-name = "openenv-codereviewagent"
-version = "0.1.0"
-source = { editable = "." }
-dependencies = [
-    { name = "openai" },
-    { name = "openenv-core", extra = ["core"] },
-    { name = "python-dotenv" },
-]
-
-[package.optional-dependencies]
-dev = [
-    { name = "pytest" },
-    { name = "pytest-cov" },
-]
-
-[package.metadata]
-requires-dist = [
-    { name = "openai", specifier = ">=1.0.0" },
-    { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
-    { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
-    { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
-    { name = "python-dotenv", specifier = ">=1.2.2" },
-]
-provides-extras = ["dev"]
-
 [[package]]
 name = "openenv-core"
 version = "0.2.3"
@@ -1632,6 +1607,44 @@ core = [
     { name = "websockets" },
 ]
 
+[[package]]
+name = "openenv-probe"
+version = "0.1.0"
+source = { editable = "." }
+dependencies = [
+    { name = "openai" },
+    { name = "openenv-core", extra = ["core"] },
+    { name = "python-dotenv" },
+]
+
+[package.optional-dependencies]
+dev = [
+    { name = "pytest" },
+    { name = "pytest-cov" },
+]
+
+[package.dev-dependencies]
+dev = [
+    { name = "pytest" },
+    { name = "pytest-cov" },
+]
+
+[package.metadata]
+requires-dist = [
+    { name = "openai", specifier = ">=1.0.0" },
+    { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
+    { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
+    { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
+    { name = "python-dotenv", specifier = ">=1.2.2" },
+]
+provides-extras = ["dev"]
+
+[package.metadata.requires-dev]
+dev = [
+    { name = "pytest", specifier = ">=9.0.3" },
+    { name = "pytest-cov", specifier = ">=7.1.0" },
+]
+
 [[package]]
 name = "opentelemetry-api"
 version = "1.41.0"