Spaces: mahithakur / PRobe


Thakur, Mahipal committed on Apr 23

Commit

62f5d41

0 Parent(s):

Initial modification

Files changed (27)

README.md +184 -0
__init__.py +16 -0
__pycache__/__init__.cpython-314.pyc +0 -0
__pycache__/client.cpython-314.pyc +0 -0
__pycache__/models.cpython-314.pyc +0 -0
client.py +77 -0
models.py +97 -0
openenv.yaml +99 -0
openenv_CodeReviewAgent.egg-info/PKG-INFO +11 -0
openenv_CodeReviewAgent.egg-info/SOURCES.txt +16 -0
openenv_CodeReviewAgent.egg-info/dependency_links.txt +1 -0
openenv_CodeReviewAgent.egg-info/entry_points.txt +2 -0
openenv_CodeReviewAgent.egg-info/requires.txt +7 -0
openenv_CodeReviewAgent.egg-info/top_level.txt +1 -0
pyproject.toml +40 -0
server/CodeReviewAgent_environment.py +327 -0
server/Dockerfile +80 -0
server/__init__.py +11 -0
server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc +0 -0
server/__pycache__/__init__.cpython-314.pyc +0 -0
server/__pycache__/grader.cpython-314.pyc +0 -0
server/__pycache__/tasks.cpython-314.pyc +0 -0
server/app.py +198 -0
server/grader.py +152 -0
server/requirements.txt +6 -0
server/tasks.py +719 -0
uv.lock +0 -0

README.md ADDED Viewed

	@@ -0,0 +1,184 @@

+---
+title: CodeReviewAgent Environment
+emoji: 🔍
+colorFrom: blue
+colorTo: green
+sdk: docker
+pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+  - code-review
+  - rl-training
+  - grpo
+---
+# CodeReviewAgent — OpenEnv Environment
+> **OpenEnv Hackathon 2026 · Theme #3.1 — World Modeling (Professional Tasks)**
+An RL training environment where an LLM learns to perform structured **pull-request code reviews** on real Python source files. The agent must identify bugs, security vulnerabilities, performance bottlenecks, and design issues, then submit a structured review with line-level comments and a final decision.
+---
+## Problem Motivation
+LLMs can already *do* code review, but they do it inconsistently: they miss critical security bugs, produce noisy false positives, and fail to categorise issues by severity.
+This environment provides a **reward signal** that directly measures review quality, enabling GRPO-style RL to close that gap in a measurable, repeatable way.
+---
+## Environment Design
+### Tasks (5 total)
+| ID | Difficulty | File | Issues | Domain |
+|----|-----------|------|--------|--------|
+| 0  | Easy      | `utils.py` | 3 | Logic bugs, off-by-one, dead code |
+| 1  | Medium    | `auth.py` | 5 | SQL injection, MD5, eval(), hardcoded creds |
+| 2  | Hard      | `data_pipeline.py` | 7 | N+1, SSL bypass, thread leak, OOM cache |
+| 3  | Medium    | `async_worker.py` | 5 | Race condition, missing await, resource leak |
+| 4  | Hard      | `api_server.py` | 6 | Command injection, path traversal, pickle RCE |
+Tasks cycle automatically on each `reset()` call.
+### Observation
+```python
+{
+  "code_snippet":     str,   # Python source to review
+  "task_description": str,   # What to look for
+  "file_name":        str,
+  "task_id":          int,   # 0–4
+  "task_difficulty":  str,   # easy / medium / hard
+  "review_history":   list,  # actions taken so far this episode
+  "step_count":       int,
+  "max_steps":        int,
+  "issues_found_count": int,
+  "total_issues":     int,
+  "done":             bool,
+  "reward":           float,
+}
+```
+### Actions
+| action_type | Required fields | Effect |
+|-------------|----------------|--------|
+| `add_comment` | `line_number`, `comment`, `severity`, `category` | Annotate a line; partial reward if it matches a ground-truth issue |
+| `request_changes` | `comment` | Signal PR needs work |
+| `approve` | — | Approve PR (penalised if issues remain) |
+| `submit_review` | — | Finalise review; terminal reward |
+### Reward Function
+```
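+Per-step (ADD_COMMENT):
+  + weight/total_weight × 0.60    per newly found issue (max 0.60 cumulative)
+  − 0.02                          per false-positive (substantive comment, no match)
+Per-step (decision actions):
+  + 0.05 / − 0.05                 REQUEST_CHANGES after / before any issue is found
+  − 0.15 / + 0.02                 APPROVE with fewer than half of the issues found / otherwise
+  − 0.05                          step budget exhausted before SUBMIT_REVIEW
+Terminal (SUBMIT_REVIEW):
+  + coverage × 0.20               weighted issue coverage bonus (max 0.20)
+  + 0.10 / −0.10                  correct / incorrect final decision
+  + efficiency × 0.10             step-efficiency bonus when coverage ≥ 60%
+Maximum achievable: ~1.0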
+```
+Grading uses **keyword + line-range matching** (±3 lines tolerance) against hand-labelled ground-truth issues — no LLM judge needed, fully deterministic.
+---
+## Training
+### GRPO (single-turn format)
+For efficient LLM training the environment is also exposed in a **single-turn format**: the model receives the full code and must output a **JSON array** of all issues in one response. The same keyword-matching reward function scores the output.
+```python
+# Input prompt
+{"role": "system", "content": "You are an expert code reviewer. Output a JSON array of issues..."}
+{"role": "user",   "content": "File: auth.py\n```python\n...\n```\nProvide your review:"}
+# Expected output
+[{"line": 5, "category": "security", "severity": "critical",
+  "comment": "Hardcoded DB_PASSWORD should be loaded from environment variable"},
+ ...]
+```
+### Files
+| File | Purpose |
+|------|---------|
+| `train_grpo.py` | Standalone GRPO training script (TRL, full-precision or LoRA) |
+| `train_grpo_colab.ipynb` | Colab notebook — T4 GPU, Unsloth 4-bit, plots included |
+| `baseline.py` | GPT-4o-mini baseline for comparison |
+### Quick Start
+```bash
+# Run baseline
+export OPENAI_API_KEY=sk-...
+python baseline.py
+# Run reward smoke test (no GPU needed)
+python train_grpo.py --test
+# Train (requires GPU + trl>=0.12)
+pip install trl datasets accelerate unsloth
+python train_grpo.py
+```
+### Colab Training
+Open `train_grpo_colab.ipynb` in Google Colab (T4 runtime).
+All install, training, evaluation, and plotting cells are included.
+---
+## Results
+*(Fill in after training run)*
+| Model | Avg Reward | Task-0 | Task-1 | Task-2 | Task-3 | Task-4 |
+|-------|-----------|--------|--------|--------|--------|--------|
+| GPT-4o-mini (baseline) | — | — | — | — | — | — |
+| Qwen2.5-1.5B (untrained) | — | — | — | — | — | — |
+| Qwen2.5-1.5B (GRPO 3 epochs) | — | — | — | — | — | — |
+Training curves: `training_curves.png` · Per-task rewards: `per_task_reward.png`
+---
+## Project Structure
+```
+CodeReviewAgent/
+├── openenv.yaml                    # OpenEnv manifest
+├── pyproject.toml
+├── models.py                       # Action + Observation types
+├── client.py                       # OpenEnv client
+└── server/
+    ├── app.py                      # FastAPI server
+    ├── CodeReviewAgent_environment.py
+    ├── grader.py                   # Deterministic reward grader
+    ├── tasks.py                    # 5 ground-truth tasks
+    └── Dockerfile
+train_grpo.py                       # GRPO training script
+train_grpo_colab.ipynb              # Colab notebook
+baseline.py                         # GPT-4o-mini baseline
+```
+---
+## API
+The environment server exposes standard OpenEnv HTTP + WebSocket endpoints:
+- `POST /reset` — start a new episode
+- `POST /step` — execute an action
+- `GET  /state` — current episode state
+- `WS   /ws` — persistent low-latency session
+- `GET  /web` — interactive web UI
+- `GET  /docs` — Swagger / OpenAPI docs
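
Reviewer note on the README's grading rule. The keyword + line-range matching with a ±3 line tolerance can be sketched roughly as below. This is a minimal illustration, not the contents of `server/grader.py` (which is not shown in this excerpt): the `Issue` record, its field names, and the helpers `matches_issue` and `score_comment` are hypothetical, and the 0.0 reward for re-flagging a known issue is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical ground-truth record; the real schema lives in server/tasks.py."""
    issue_id: str
    line: int
    keywords: tuple[str, ...]
    weight: float = 1.0


LINE_TOLERANCE = 3  # the README's "±3 lines" tolerance


def matches_issue(issue: Issue, line_number: int, comment: str) -> bool:
    """True when the comment is near the issue's line and mentions a keyword."""
    near = abs(line_number - issue.line) <= LINE_TOLERANCE
    text = comment.lower()
    return near and any(kw.lower() in text for kw in issue.keywords)


def score_comment(issues, line_number, comment, already_found):
    """Return (reward, newly_found_ids) for one add_comment action."""
    hits = [i for i in issues if matches_issue(i, line_number, comment)]
    new = [i for i in hits if i.issue_id not in already_found]
    if new:
        # Share of the 0.60 per-step budget, proportional to issue weight.
        total_weight = sum(i.weight for i in issues)
        reward = sum(i.weight for i in new) / total_weight * 0.60
        return reward, [i.issue_id for i in new]
    if hits:
        return 0.0, []   # assumed: re-flagging a known issue earns nothing
    return -0.02, []     # false positive: no ground-truth issue matched


ISSUES = [
    Issue("sql_injection", line=12, keywords=("sql injection", "parameterized")),
    Issue("md5", line=20, keywords=("md5", "weak hash")),
]
```

Because the match depends only on string containment and an integer distance, the same comment always receives the same score, which is what makes the reward deterministic without an LLM judge.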

__init__.py ADDED Viewed

	@@ -0,0 +1,16 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Codereviewagent Environment."""
+from .client import CodereviewagentEnv
+from .models import CodereviewagentAction, CodereviewagentObservation
+__all__ = [
+    "CodereviewagentAction",
+    "CodereviewagentObservation",
+    "CodereviewagentEnv",
+]

__pycache__/__init__.cpython-314.pyc ADDED Viewed

Binary file (382 Bytes). View file

__pycache__/client.cpython-314.pyc ADDED Viewed

Binary file (4.37 kB). View file

__pycache__/models.cpython-314.pyc ADDED Viewed

Binary file (6.52 kB). View file

client.py ADDED Viewed

	@@ -0,0 +1,77 @@

+"""CodeReviewAgent Environment Client."""
+from typing import Dict
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from openenv.core.env_server.types import State
+from .models import CodereviewagentAction, CodereviewagentObservation
+class CodereviewagentEnv(
+    EnvClient[CodereviewagentAction, CodereviewagentObservation, State]
+):
+    """
+    Client for the CodeReviewAgent environment.
+    Maintains a persistent WebSocket connection to the server.
+    Example:
+        >>> with CodereviewagentEnv(base_url="http://localhost:8000") as env:
+        ...     result = env.reset()
+        ...     print(result.observation.task_description)
+        ...
+        ...     action = CodereviewagentAction(
+        ...         action_type="add_comment",
+        ...         line_number=4,
+        ...         comment="Off-by-one: range(len+1) causes IndexError",
+        ...         severity="error",
+        ...         category="bug",
+        ...     )
+        ...     result = env.step(action)
+        ...     print(result.reward)
+    """
+    def _step_payload(self, action: CodereviewagentAction) -> Dict:
+        payload = {"action_type": action.action_type.value}
+        if action.line_number is not None:
+            payload["line_number"] = action.line_number
+        if action.comment is not None:
+            payload["comment"] = action.comment
+        if action.severity is not None:
+            payload["severity"] = action.severity.value
+        if action.category is not None:
+            payload["category"] = action.category.value
+        return payload
+    def _parse_result(
+        self, payload: Dict
+    ) -> StepResult[CodereviewagentObservation]:
+        obs_data = payload.get("observation", {})
+        observation = CodereviewagentObservation(
+            code_snippet=obs_data.get("code_snippet", ""),
+            task_description=obs_data.get("task_description", ""),
+            file_name=obs_data.get("file_name", ""),
+            task_id=obs_data.get("task_id", 0),
+            task_difficulty=obs_data.get("task_difficulty", "easy"),
+            review_history=obs_data.get("review_history", []),
+            step_count=obs_data.get("step_count", 0),
+            max_steps=obs_data.get("max_steps", 20),
+            issues_found_count=obs_data.get("issues_found_count", 0),
+            total_issues=obs_data.get("total_issues", 0),
+            done=payload.get("done", False),
+            reward=payload.get("reward") or 0.0,  # model field is a non-optional float
+            metadata=obs_data.get("metadata", {}),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict) -> State:
+        return State(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+        )
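
Reviewer note on `_step_payload`. The method always sends `action_type` and omits every optional field that is `None`, so `approve` and `submit_review` travel as one-key payloads. The sketch below reproduces that behaviour with standard-library stand-ins; the `Action` dataclass and the `step_payload` function are hypothetical simplifications of the pydantic-based `CodereviewagentAction` and the real method, and severity/category are plain strings here rather than enums.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    ADD_COMMENT = "add_comment"
    SUBMIT_REVIEW = "submit_review"


@dataclass
class Action:
    """Stand-in for CodereviewagentAction (the real one is a pydantic model)."""
    action_type: ActionType
    line_number: Optional[int] = None
    comment: Optional[str] = None
    severity: Optional[str] = None
    category: Optional[str] = None


def step_payload(action: Action) -> dict:
    """Always send action_type; drop optional fields that were left as None."""
    payload = {"action_type": action.action_type.value}
    for name in ("line_number", "comment", "severity", "category"):
        value = getattr(action, name)
        if value is not None:
            payload[name] = value
    return payload


comment_payload = step_payload(
    Action(ActionType.ADD_COMMENT, line_number=4,
           comment="Off-by-one in range()", severity="error", category="bug")
)
submit_payload = step_payload(Action(ActionType.SUBMIT_REVIEW))
```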

models.py ADDED Viewed

	@@ -0,0 +1,97 @@

+"""
+Data models for the CodeReviewAgent Environment.
+An agent reviews Python source files, identifies bugs, security issues,
+and design problems, then submits a structured review.
+"""
+from enum import Enum
+from typing import Any
+from openenv.core.env_server.types import Action, Observation
+from pydantic import BaseModel, ConfigDict, Field
+class ActionType(str, Enum):
+    ADD_COMMENT = "add_comment"
+    REQUEST_CHANGES = "request_changes"
+    APPROVE = "approve"
+    SUBMIT_REVIEW = "submit_review"
+class Severity(str, Enum):
+    INFO = "info"
+    WARNING = "warning"
+    ERROR = "error"
+    CRITICAL = "critical"
+class IssueCategory(str, Enum):
+    BUG = "bug"
+    SECURITY = "security"
+    PERFORMANCE = "performance"
+    STYLE = "style"
+    DESIGN = "design"
+class RewardType(BaseModel):
+    """
+    Structured reward returned by step().
+    total       : final clamped score in [-1.0, 1.0]
+    components  : named sub-scores before clamping (may sum outside [-1, 1])
+    passed      : True when the action was a clear positive signal
+    explanation : human-readable breakdown for logging / debugging
+    step        : environment step this reward was issued at
+    terminal    : True only on the SUBMIT_REVIEW step
+    """
+    model_config = ConfigDict(frozen=True)
+    total: float = Field(..., ge=-1.0, le=1.0)
+    components: dict[str, float] = Field(default_factory=dict)
+    passed: bool = Field(False)
+    explanation: str = Field("")
+    step: int = Field(0)
+    terminal: bool = Field(False)
+class CodereviewagentAction(Action):
+    """
+    - ADD_COMMENT    : annotate a specific line with a review comment
+    - REQUEST_CHANGES: mark the PR as needing changes
+    - APPROVE        : approve the PR (only when no significant issues remain)
+    - SUBMIT_REVIEW  : finalize and submit the review (ends the episode)
+    """
+    action_type: ActionType = Field(..., description="Type of review action")
+    line_number: int | None = Field(None, description="Source line being commented on")
+    comment: str | None = Field(None, description="Review comment text")
+    severity: Severity | None = Field(None, description="Issue severity level")
+    category: IssueCategory | None = Field(None, description="Issue category")
+class CodereviewagentObservation(Observation):
+    """
+    Contains the code to review, task instructions, and the running
+    review history so the agent can track what it has already flagged.
+    The `reward` field mirrors the most recent step reward for convenience;
+    the authoritative reward is the RewardType returned by step().
+    """
+    code_snippet: str = Field(default="", description="Python source code to review")
+    task_description: str = Field(default="", description="Review instructions and goals")
+    file_name: str = Field(default="", description="Name of the file being reviewed")
+    task_id: int = Field(default=0, description="Current task index")
+    task_difficulty: str = Field(default="easy", description="Task difficulty label (easy / medium / hard)")
+    review_history: list[dict[str, Any]] = Field(
+        default_factory=list,
+        description="Ordered list of actions taken so far this episode",
+    )
+    step_count: int = Field(default=0, description="Steps taken in current episode")
+    max_steps: int = Field(default=6, description="Step budget for this task")
+    issues_found_count: int = Field(default=0, description="Number of issues identified so far")
+    total_issues: int = Field(default=0, description="Total issues in this task")
+    done: bool = Field(default=False, description="Whether the episode has ended")
+    reward: float = Field(default=0.0, description="Most recent step reward (mirror of RewardType.total)")
+    metadata: dict[str, Any] = Field(default_factory=dict, description="Extra episode metadata")

openenv.yaml ADDED Viewed

	@@ -0,0 +1,99 @@

+spec_version: 1
+name: CodeReviewAgent
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000
+description: >
+  Code review environment where an agent reviews Python source files,
+  identifies bugs, security vulnerabilities, performance bottlenecks,
+  and design issues, then submits a structured review with comments
+  and a final decision (request_changes or approve).
+tasks:
+  - id: 0
+    name: Basic Bug Detection
+    difficulty: easy
+    description: Identify logical bugs in a simple Python utility module
+    max_steps: 15
+    issues: 3
+  - id: 1
+    name: Security Vulnerability Review
+    difficulty: medium
+    description: Find security vulnerabilities in an authentication module
+    max_steps: 20
+    issues: 5
+  - id: 2
+    name: Full Architecture and Performance Review
+    difficulty: hard
+    description: >
+      Comprehensive review of a data pipeline for bugs, security,
+      performance, and design issues
+    max_steps: 30
+    issues: 7
+  - id: 3
+    name: Async Worker Review
+    difficulty: medium
+    description: Find concurrency bugs and resource leaks in an async worker
+    max_steps: 20
+    issues: 5
+  - id: 4
+    name: Flask API Security Review
+    difficulty: hard
+    description: >
+      Comprehensive security review of a Flask REST API for injection flaws,
+      path traversal, insecure deserialization, and missing access controls
+    max_steps: 30
+    issues: 6
+observation:
+  type: object
+  fields:
+    code_snippet: {type: string, description: "Python source to review"}
+    task_description: {type: string, description: "Review instructions"}
+    file_name: {type: string}
+    task_id: {type: integer, range: [0, 4]}
+    task_difficulty: {type: string, values: [easy, medium, hard]}
+    review_history: {type: array, description: "Actions taken so far"}
+    step_count: {type: integer}
+    max_steps: {type: integer}
+    issues_found_count: {type: integer}
+    total_issues: {type: integer}
+    done: {type: boolean}
+    reward: {type: number}
+action:
+  type: object
+  fields:
+    action_type:
+      type: enum
+      values: [add_comment, request_changes, approve, submit_review]
+    line_number: {type: integer, required: false}
+    comment: {type: string, required: false}
+    severity:
+      type: enum
+      values: [info, warning, error, critical]
+      required: false
+    category:
+      type: enum
+      values: [bug, security, performance, style, design]
+      required: false
+reward_design:
+  range: [-1.0, 1.0]
+  per_step:
+    issue_found: "up to 0.60 total (weight/total_weight × 0.60 per issue)"
+    false_positive: -0.02
+    correct_request_changes: +0.05
+    bad_approval: -0.15
+  terminal:
+    coverage_bonus: "coverage × 0.20  (max +0.20)"
+    decision_correct: +0.10
+    decision_incorrect: -0.10
+    efficiency_bonus: "up to +0.10 when coverage ≥ 60%"
+  max_achievable: ~1.0
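
Reviewer note on `reward_design`. The components above can be added up as in the sketch below to see where the "~1.0" ceiling comes from. The function `episode_reward` and its parameters are hypothetical. `efficiency` is passed in as a number in [0, 1] because the grader's formula for it is not shown in this excerpt. The per-step `correct_request_changes` and `bad_approval` terms are left out, and clamping is applied once to the sum, whereas the environment clamps each step's reward.

```python
def episode_reward(found_weight, total_weight, false_positives,
                   decision_correct, efficiency):
    """Sum the main reward_design components for one episode.

    efficiency is an input in [0, 1]; deriving it from the step count is
    left to the grader and is not modelled here.
    """
    coverage = found_weight / total_weight
    components = {
        "issue_found": coverage * 0.60,
        "false_positive": -0.02 * false_positives,
        "coverage_bonus": coverage * 0.20,
        "decision": 0.10 if decision_correct else -0.10,
        "efficiency_bonus": efficiency * 0.10 if coverage >= 0.60 else 0.0,
    }
    total = max(-1.0, min(1.0, sum(components.values())))
    return round(total, 4), components


# Perfect episode: every issue found, no false positives, correct decision.
best_total, best_parts = episode_reward(5, 5, 0, True, 1.0)

# Weak episode: 2 of 5 equally weighted issues, 3 false positives, wrong decision.
weak_total, weak_parts = episode_reward(2, 5, 3, False, 1.0)
```

In the weak episode coverage is 0.4, below the 60% threshold, so the efficiency bonus is withheld even though `efficiency` was passed as 1.0.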

openenv_CodeReviewAgent.egg-info/PKG-INFO ADDED Viewed

	@@ -0,0 +1,11 @@

+Metadata-Version: 2.4
+Name: openenv-CodeReviewAgent
+Version: 0.1.0
+Summary: Codereviewagent environment for OpenEnv
+Requires-Python: >=3.10
+Requires-Dist: openenv-core[core]>=0.2.2
+Requires-Dist: openai>=1.0.0
+Requires-Dist: python-dotenv>=1.2.2
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"

openenv_CodeReviewAgent.egg-info/SOURCES.txt ADDED Viewed

	@@ -0,0 +1,16 @@

+README.md
+pyproject.toml
+./__init__.py
+./client.py
+./models.py
+openenv_CodeReviewAgent.egg-info/PKG-INFO
+openenv_CodeReviewAgent.egg-info/SOURCES.txt
+openenv_CodeReviewAgent.egg-info/dependency_links.txt
+openenv_CodeReviewAgent.egg-info/entry_points.txt
+openenv_CodeReviewAgent.egg-info/requires.txt
+openenv_CodeReviewAgent.egg-info/top_level.txt
+server/CodeReviewAgent_environment.py
+server/__init__.py
+server/app.py
+server/grader.py
+server/tasks.py

openenv_CodeReviewAgent.egg-info/dependency_links.txt ADDED Viewed

	@@ -0,0 +1 @@


+

openenv_CodeReviewAgent.egg-info/entry_points.txt ADDED Viewed

	@@ -0,0 +1,2 @@


+[console_scripts]
+server = CodeReviewAgent.server.app:main

openenv_CodeReviewAgent.egg-info/requires.txt ADDED Viewed

	@@ -0,0 +1,7 @@

+openenv-core[core]>=0.2.2
+openai>=1.0.0
+python-dotenv>=1.2.2
+[dev]
+pytest>=8.0.0
+pytest-cov>=4.0.0

openenv_CodeReviewAgent.egg-info/top_level.txt ADDED Viewed

	@@ -0,0 +1 @@


+CodeReviewAgent

pyproject.toml ADDED Viewed

	@@ -0,0 +1,40 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-CodeReviewAgent"
+version = "0.1.0"
+description = "Codereviewagent environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.2",
+    # Environment-specific dependencies
+    "openai>=1.0.0",
+    "python-dotenv>=1.2.2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m CodeReviewAgent.server.app
+server = "CodeReviewAgent.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["CodeReviewAgent", "CodeReviewAgent.server"]
+package-dir = { "CodeReviewAgent" = ".", "CodeReviewAgent.server" = "server" }

server/CodeReviewAgent_environment.py ADDED Viewed

	@@ -0,0 +1,327 @@

+"""
+CodeReviewAgent Environment — async-native implementation.
+Episode lifecycle:
+  1. async_reset()  → CodereviewagentObservation    (starts a new episode)
+  2. async_step(a)  → (obs, RewardType, done, info) (execute one action)
+  3. async_state()  → dict                          (full internal snapshot)
+Tasks cycle automatically: 0 (easy) → 1 (medium) → … → 4 (hard, Flask API) → 0 …
+Thread / task safety: each Environment instance owns its own state.
+For concurrent GRPO rollouts spin up one instance per worker.
+"""
+from __future__ import annotations
+import asyncio
+from typing import Any
+from uuid import uuid4
+from openenv.core.env_server.interfaces import Environment
+from openenv.core.env_server.types import State
+try:
+    from ..models import (
+        ActionType,
+        CodereviewagentAction,
+        CodereviewagentObservation,
+        RewardType,
+    )
+    from .grader import CodeReviewGrader
+    from .tasks import TASKS
+except ImportError:
+    from models import (  # type: ignore[no-redef]
+        ActionType,
+        CodereviewagentAction,
+        CodereviewagentObservation,
+        RewardType,
+    )
+    from server.grader import CodeReviewGrader  # type: ignore[no-redef]
+    from server.tasks import TASKS  # type: ignore[no-redef]
+# Sentinel reward returned on non-terminal steps that produce no signal
+_ZERO_REWARD = RewardType(total=0.0, components={}, passed=False,
+                           explanation="No signal this step.", step=0, terminal=False)
+class CodereviewagentEnvironment(Environment):
+    """
+    OpenEnv-compliant code-review environment.
+    Public interface is fully async.  The sync wrappers (reset / step)
+    required by openenv's create_app are also provided; they delegate to the
+    async versions via asyncio.run(), so they are safe to call from sync
+    contexts (e.g. tests without an event loop, openenv HTTP wrappers).
+    The `state` property is synchronous and returns a minimal State directly.
+    """
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    # ── Construction ──────────────────────────────────────────────────────
+    def __init__(self) -> None:
+        self._episode_id: str = str(uuid4())
+        self._step_count: int = 0
+        self._reset_count: int = 0
+        task = TASKS[0]
+        self._grader: CodeReviewGrader = CodeReviewGrader(task)
+        self._ep: dict[str, Any] = self._fresh_episode(task)
+    @staticmethod
+    def _fresh_episode(task: dict[str, Any]) -> dict[str, Any]:
+        return {
+            "task": task,
+            "review_comments": [],
+            "issues_found": [],
+            "review_decision": None,
+            "review_submitted": False,
+            "cumulative_reward": 0.0,
+        }
+    # ── Async-native interface (primary) ──────────────────────────────────
+    async def async_reset(self) -> CodereviewagentObservation:
+        task_id = self._reset_count % len(TASKS)
+        self._reset_count += 1
+        self._episode_id = str(uuid4())
+        self._step_count = 0
+        task = TASKS[task_id]
+        self._grader = CodeReviewGrader(task)
+        self._ep = self._fresh_episode(task)
+        return self._make_obs(reward=0.0, done=False)
+    async def async_step(
+        self, action: CodereviewagentAction
+    ) -> tuple[CodereviewagentObservation, RewardType, bool, dict[str, Any]]:
+        self._step_count += 1
+        task = self._ep["task"]
+        done = False
+        reward_obj: RewardType
+        if action.action_type == ActionType.ADD_COMMENT:
+            reward_obj = self._handle_add_comment(action)
+        elif action.action_type == ActionType.REQUEST_CHANGES:
+            reward_obj = self._handle_request_changes(action)
+        elif action.action_type == ActionType.APPROVE:
+            reward_obj = self._handle_approve()
+        elif action.action_type == ActionType.SUBMIT_REVIEW:
+            reward_obj, done = self._handle_submit_review()
+        else:
+            reward_obj = RewardType(
+                total=-0.05,
+                components={"illegal_action": -0.05},
+                passed=False,
+                explanation=f"Unknown action type: {action.action_type}",
+                step=self._step_count,
+                terminal=False,
+            )
+        # Step-budget exhaustion
+        if not done and self._step_count >= task["max_steps"]:
+            # merge budget penalty into existing reward
+            penalised = max(-1.0, reward_obj.total - 0.05)
+            components = {**reward_obj.components, "step_budget_penalty": -0.05}
+            reward_obj = RewardType(
+                total=round(penalised, 4),
+                components=components,
+                passed=False,
+                explanation=reward_obj.explanation + " [Step limit reached.]",
+                step=self._step_count,
+                terminal=True,
+            )
+            done = True
+        self._ep["cumulative_reward"] = round(
+            self._ep["cumulative_reward"] + reward_obj.total, 4
+        )
+        obs = self._make_obs(reward=reward_obj.total, done=done)
+        info = {
+            "episode_id": self._episode_id,
+            "cumulative_reward": self._ep["cumulative_reward"],
+            "issues_found": list(self._ep["issues_found"]),
+            "review_decision": self._ep.get("review_decision"),
+        }
+        return obs, reward_obj, done, info
+    async def async_state(self) -> dict[str, Any]:
+        task = self._ep["task"]
+        return {
+            "episode_id": self._episode_id,
+            "step_count": self._step_count,
+            "task_id": task["id"],
+            "task_difficulty": task["difficulty"],
+            "task_name": task["name"],
+            "issues_found": list(self._ep["issues_found"]),
+            "total_issues": len(task["issues"]),
+            "review_decision": self._ep.get("review_decision"),
+            "review_submitted": self._ep.get("review_submitted", False),
+            "cumulative_reward": self._ep.get("cumulative_reward", 0.0),
+            "max_steps": task["max_steps"],
+        }
+    # ── Sync wrappers (openenv / create_app compatibility) ────────────────
+    def reset(self) -> CodereviewagentObservation:  # type: ignore[override]
+        try:
+            asyncio.get_running_loop()
+        except RuntimeError:
+            # No loop is running in this thread, so asyncio.run() is safe.
+            return asyncio.run(self.async_reset())
+        # Called from inside a running loop (e.g. pytest-asyncio): asyncio.run()
+        # cannot be nested, so run the coroutine on a fresh loop in a worker thread.
+        import concurrent.futures
+        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+            fut = pool.submit(asyncio.run, self.async_reset())
+            return fut.result()
+    def step(self, action: CodereviewagentAction) -> CodereviewagentObservation:  # type: ignore[override]
+        """
+        Sync step for openenv compatibility.
+        Returns only the Observation (reward is embedded in obs.reward).
+        Use async_step() for the full (obs, reward, done, info) tuple.
+        """
+        try:
+            asyncio.get_running_loop()
+        except RuntimeError:
+            obs, _, _, _ = asyncio.run(self.async_step(action))
+            return obs
+        # Running loop detected: execute on a fresh loop in a worker thread.
+        import concurrent.futures
+        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+            fut = pool.submit(asyncio.run, self.async_step(action))
+            obs, _, _, _ = fut.result()
+            return obs
+    @property
+    def state(self) -> State:  # type: ignore[override]
+        return State(episode_id=self._episode_id, step_count=self._step_count)
+    # ── Action handlers ───────────────────────────────────────────────────
+    def _handle_add_comment(self, action: CodereviewagentAction) -> RewardType:
+        entry = {
+            "type": "comment",
+            "line": action.line_number,
+            "text": action.comment,
+            "severity": action.severity.value if action.severity else None,
+            "category": action.category.value if action.category else None,
+        }
+        self._ep["review_comments"].append(entry)
+        score, new_finds, breakdown = self._grader.score_comment(
+            line_number=action.line_number,
+            comment=action.comment,
+            already_found=self._ep["issues_found"],
+        )
+        self._ep["issues_found"].extend(new_finds)
+        clamped = round(max(-1.0, min(1.0, score)), 4)
+        if new_finds:
+            explanation = f"Identified issue(s): {new_finds}"
+        elif score < 0:
+            explanation = "False-positive comment — matched no known issue."
+        else:
+            explanation = "Comment recorded; no new issue matched."
+        return RewardType(
+            total=clamped,
+            components=breakdown,
+            passed=bool(new_finds),
+            explanation=explanation,
+            step=self._step_count,
+            terminal=False,
+        )
+    def _handle_request_changes(self, action: CodereviewagentAction) -> RewardType:
+        self._ep["review_decision"] = "request_changes"
+        self._ep["review_comments"].append(
+            {"type": "request_changes", "text": action.comment}
+        )
+        if self._ep["issues_found"]:
+            return RewardType(
+                total=0.05,
+                components={"decision_bonus": 0.05},
+                passed=True,
+                explanation="REQUEST_CHANGES after finding issues — correct.",
+                step=self._step_count,
+                terminal=False,
+            )
+        return RewardType(
+            total=-0.05,
+            components={"premature_decision_penalty": -0.05},
+            passed=False,
+            explanation="REQUEST_CHANGES with no issues found yet.",
+            step=self._step_count,
+            terminal=False,
+        )
+    def _handle_approve(self) -> RewardType:
+        self._ep["review_decision"] = "approve"
+        total_issues = len(self._ep["task"]["issues"])
+        found = len(set(self._ep["issues_found"]))
+        if total_issues > 0 and found < total_issues * 0.5:
+            return RewardType(
+                total=-0.15,
+                components={"bad_approval_penalty": -0.15},
+                passed=False,
+                explanation=f"APPROVE with only {found}/{total_issues} issues found.",
+                step=self._step_count,
+                terminal=False,
+            )
+        return RewardType(
+            total=0.02,
+            components={"approval_credit": 0.02},
+            passed=True,
+            explanation="APPROVE recorded.",
+            step=self._step_count,
+            terminal=False,
+        )
+    def _handle_submit_review(self) -> tuple[RewardType, bool]:
+        if self._ep.get("review_submitted"):
+            return (
+                RewardType(
+                    total=-0.05,
+                    components={"duplicate_submit_penalty": -0.05},
+                    passed=False,
+                    explanation="Review already submitted.",
+                    step=self._step_count,
+                    terminal=False,
+                ),
+                False,
+            )
+        self._ep["review_submitted"] = True
+        task = self._ep["task"]
+        reward_obj = self._grader.final_score(
+            issues_found=list(set(self._ep["issues_found"])),
+            review_decision=self._ep.get("review_decision"),
+            step_count=self._step_count,
+            max_steps=task["max_steps"],
+            current_step=self._step_count,
+        )
+        return reward_obj, True
+    # ── Observation builder ───────────────────────────────────────────────
+    def _make_obs(self, reward: float, done: bool) -> CodereviewagentObservation:
+        task = self._ep["task"]
+        return CodereviewagentObservation(
+            code_snippet=task["code"],
+            task_description=task["description"],
+            file_name=task["file_name"],
+            task_id=task["id"],
+            task_difficulty=task["difficulty"],
+            review_history=list(self._ep.get("review_comments", [])),
+            step_count=self._step_count,
+            max_steps=task["max_steps"],
+            issues_found_count=len(set(self._ep.get("issues_found", []))),
+            total_issues=len(task["issues"]),
+            done=done,
+            reward=round(max(-1.0, min(1.0, reward)), 4),
+            metadata={
+                "cumulative_reward": self._ep.get("cumulative_reward", 0.0),
+                "review_decision": self._ep.get("review_decision"),
+                "episode_id": self._episode_id,
+            },
+        )
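
Review note: the APPROVE branch in `_handle_approve` penalises an approval when fewer than half of the task's known issues have been found, and otherwise grants a small credit. The sketch below isolates that threshold rule so it can be checked by hand. `approval_reward` is a hypothetical helper written for illustration; it returns a plain float rather than the `RewardType` the environment actually builds.

```python
def approval_reward(found: int, total_issues: int) -> float:
    """Mirror the APPROVE branch: penalty below 50% coverage, small credit otherwise.

    Hypothetical standalone helper; the real handler wraps these values
    in a RewardType together with an explanation string.
    """
    if total_issues > 0 and found < total_issues * 0.5:
        return -0.15  # bad_approval_penalty
    return 0.02       # approval_credit


if __name__ == "__main__":
    # 1 of 3 issues found is below the 1.5 threshold, so the penalty applies.
    print(approval_reward(found=1, total_issues=3))
    # 2 of 3 issues found clears the threshold.
    print(approval_reward(found=2, total_issues=3))
```

Note that exactly half the issues (for example 1 of 2) is not penalised, because the comparison is a strict `<`.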

server/Dockerfile ADDED Viewed

	@@ -0,0 +1,80 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=CodeReviewAgent
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
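
Review note: the `HEALTHCHECK` above shells out to `curl`, so it assumes the runtime image ships that binary. As a rough sketch, the same probe can be written in Python using only the standard library. `fetch_json` and `is_healthy` are hypothetical names, and the `{"status": "ok"}` payload is taken from the `/health` handler in `server/app.py`.

```python
import json
import urllib.request
from typing import Any, Callable


def fetch_json(url: str, timeout: float = 3.0) -> Any:
    """Fetch a URL and decode the body as JSON (stdlib only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_healthy(
    url: str = "http://localhost:8000/health",
    fetch: Callable[[str], Any] = fetch_json,
) -> bool:
    """Return True when the endpoint answers with {"status": "ok"}.

    `fetch` is injectable so the check can be exercised without a server.
    Network failures (OSError, which covers URLError) and malformed JSON
    (ValueError) are both reported as unhealthy.
    """
    try:
        payload = fetch(url)
    except (OSError, ValueError):
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

A container could call `is_healthy()` from a small script and exit non-zero on `False`; whether that is preferable to `curl` depends on what the base image provides.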

server/__init__.py ADDED Viewed

	@@ -0,0 +1,11 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Codereviewagent environment server components."""
+from .CodeReviewAgent_environment import CodereviewagentEnvironment
+__all__ = ["CodereviewagentEnvironment"]

server/__pycache__/CodeReviewAgent_environment.cpython-314.pyc ADDED Viewed

Binary file (16.7 kB).

server/__pycache__/__init__.cpython-314.pyc ADDED Viewed

Binary file (333 Bytes).

server/__pycache__/grader.cpython-314.pyc ADDED Viewed

Binary file (8.04 kB).

server/__pycache__/tasks.cpython-314.pyc ADDED Viewed

Binary file (18.7 kB).

server/app.py ADDED Viewed

	@@ -0,0 +1,198 @@

+"""
+Async FastAPI server for the CodeReviewAgent environment.
+Endpoints:
+  POST /reset              — start a new episode (HTTP session)
+  POST /step               — execute one action
+  GET  /state              — current episode snapshot
+  GET  /health             — liveness probe
+  GET  /schema             — action / observation schema
+  WS   /ws                 — WebSocket session (own env per connection)
+HTTP endpoints share a single env instance, so HTTP clients must use it
+sequentially.
+Each WebSocket connection gets its own isolated env instance, enabling
+concurrent GRPO rollouts.
+GET /web serves a minimal HTML landing page that links to /docs, /health
+and /schema. The OpenEnv create_app helper is imported when available,
+but this module does not mount it.
+"""
+from __future__ import annotations
+import json
+from contextlib import asynccontextmanager
+from typing import Any
+from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
+from fastapi.responses import HTMLResponse
+try:
+    from openenv.core.env_server.http_server import create_app as _create_openenv_app
+    _OPENENV_AVAILABLE = True
+except Exception:  # pragma: no cover
+    _OPENENV_AVAILABLE = False
+try:
+    from ..models import CodereviewagentAction, CodereviewagentObservation, RewardType
+    from .CodeReviewAgent_environment import CodereviewagentEnvironment
+except ModuleNotFoundError:
+    from models import CodereviewagentAction, CodereviewagentObservation, RewardType  # type: ignore
+    from server.CodeReviewAgent_environment import CodereviewagentEnvironment  # type: ignore
+# ── Shared HTTP session env ───────────────────────────────────────────────────
+_http_env: CodereviewagentEnvironment | None = None
+@asynccontextmanager
+async def lifespan(application: FastAPI):
+    global _http_env
+    _http_env = CodereviewagentEnvironment()
+    yield
+    _http_env = None
+# ── Response shapes ───────────────────────────────────────────────────────────
+class StepResponse:
+    def __init__(
+        self,
+        obs: CodereviewagentObservation,
+        reward: RewardType,
+        done: bool,
+        info: dict[str, Any],
+    ) -> None:
+        self.obs = obs
+        self.reward = reward
+        self.done = done
+        self.info = info
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "observation": self.obs.model_dump(),
+            "reward": self.reward.model_dump(),
+            "done": self.done,
+            "info": self.info,
+        }
+# ── App factory ───────────────────────────────────────────────────────────────
+def _build_app() -> FastAPI:
+    application = FastAPI(
+        title="CodeReviewAgent",
+        description="OpenEnv code-review environment — async FastAPI server.",
+        version="2.0.0",
+        lifespan=lifespan,
+    )
+    # ── HTTP endpoints ────────────────────────────────────────────────────
+    @application.post("/reset", summary="Start a new episode")
+    async def reset_endpoint() -> dict[str, Any]:
+        assert _http_env is not None
+        obs = await _http_env.async_reset()
+        return {"observation": obs.model_dump(), "reward": None, "done": False, "info": {}}
+    @application.post("/step", summary="Execute one action")
+    async def step_endpoint(action: CodereviewagentAction) -> dict[str, Any]:
+        assert _http_env is not None
+        obs, reward, done, info = await _http_env.async_step(action)
+        return StepResponse(obs, reward, done, info).to_dict()
+    @application.get("/state", summary="Current episode state snapshot")
+    async def state_endpoint() -> dict[str, Any]:
+        assert _http_env is not None
+        return await _http_env.async_state()
+    @application.get("/health", summary="Liveness probe")
+    async def health() -> dict[str, str]:
+        return {"status": "ok"}
+    @application.get("/schema", summary="Action and observation JSON schemas")
+    async def schema() -> dict[str, Any]:
+        return {
+            "action": CodereviewagentAction.model_json_schema(),
+            "observation": CodereviewagentObservation.model_json_schema(),
+            "reward": RewardType.model_json_schema(),
+        }
+    # ── WebSocket endpoint (one env per connection) ───────────────────────
+    @application.websocket("/ws")
+    async def ws_endpoint(websocket: WebSocket) -> None:
+        await websocket.accept()
+        env = CodereviewagentEnvironment()
+        try:
+            while True:
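+                raw = await websocket.receive_text()
+                # Reject malformed or non-object payloads instead of crashing the socket.
+                try:
+                    msg = json.loads(raw)
+                except json.JSONDecodeError as exc:
+                    await websocket.send_json(
+                        {"type": "error", "detail": f"Invalid JSON: {exc}"}
+                    )
+                    continue
+                if not isinstance(msg, dict):
+                    await websocket.send_json(
+                        {"type": "error", "detail": "Message must be a JSON object."}
+                    )
+                    continue
+                cmd = msg.get("command")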
+                if cmd == "reset":
+                    obs = await env.async_reset()
+                    await websocket.send_json(
+                        {"type": "reset", "observation": obs.model_dump()}
+                    )
+                elif cmd == "step":
+                    try:
+                        action = CodereviewagentAction(**msg["action"])
+                    except Exception as exc:
+                        await websocket.send_json({"type": "error", "detail": str(exc)})
+                        continue
+                    obs, reward, done, info = await env.async_step(action)
+                    await websocket.send_json(
+                        {
+                            "type": "step",
+                            "observation": obs.model_dump(),
+                            "reward": reward.model_dump(),
+                            "done": done,
+                            "info": info,
+                        }
+                    )
+                elif cmd == "state":
+                    state = await env.async_state()
+                    await websocket.send_json({"type": "state", "state": state})
+                else:
+                    await websocket.send_json(
+                        {"type": "error", "detail": f"Unknown command: {cmd}"}
+                    )
+        except WebSocketDisconnect:
+            pass
+    # ── Web UI ────────────────────────────────────────────────────────────
+    @application.get("/web", response_class=HTMLResponse, include_in_schema=False)
+    async def web_ui() -> str:
+        return """
+        <!doctype html><html><head><title>CodeReviewAgent</title></head>
+        <body>
+        <h2>CodeReviewAgent Environment</h2>
+        <p>API docs: <a href="/docs">/docs</a></p>
+        <p>Health: <a href="/health">/health</a></p>
+        <p>Schema: <a href="/schema">/schema</a></p>
+        </body></html>
+        """
+    return application
+app = _build_app()
+def main(host: str = "0.0.0.0", port: int = 8000) -> None:
+    import uvicorn
+    uvicorn.run(app, host=host, port=port)
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8000)
+    args = parser.parse_args()
+    main(port=args.port)
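
Review note: the `/ws` handler above speaks a small JSON protocol with three commands (`reset`, `step`, `state`) and four reply types (`reset`, `step`, `state`, `error`). The sketch below shows how a client might build and interpret those messages. The helper names are hypothetical, and the action fields used in the example (`action_type`, `line_number`, `comment`) are assumptions, since `models.py` is not shown in this part of the diff.

```python
import json
from typing import Any


def reset_message() -> str:
    """Serialise the command that starts a new episode."""
    return json.dumps({"command": "reset"})


def step_message(action: dict[str, Any]) -> str:
    """Serialise a step command; `action` must match CodereviewagentAction."""
    return json.dumps({"command": "step", "action": action})


def state_message() -> str:
    """Serialise the command that requests an episode snapshot."""
    return json.dumps({"command": "state"})


def summarize_reply(raw: str) -> str:
    """Turn a server reply into a one-line summary keyed on its `type`."""
    reply = json.loads(raw)
    kind = reply.get("type")
    if kind == "error":
        return f"error: {reply.get('detail')}"
    if kind == "step":
        return f"step: done={reply.get('done')}"
    return str(kind)


if __name__ == "__main__":
    # Hypothetical action payload; the field names are assumptions.
    example_action = {
        "action_type": "add_comment",
        "line_number": 4,
        "comment": "Off-by-one: range(len(numbers) + 1) raises IndexError.",
    }
    print(step_message(example_action))
```

Any WebSocket client library can send these strings; each connection gets its own environment, so one rollout per connection keeps episodes isolated.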

server/grader.py ADDED Viewed

	@@ -0,0 +1,152 @@

+"""
+Deterministic grader for CodeReviewAgent tasks.
+Scoring design
+--------------
+During the episode (ADD_COMMENT actions):
+  +weight/total_weight * 0.60   per newly found issue (max 0.60 cumulative)
+  -0.02                         per false-positive (substantive comment, no match)
+Final (SUBMIT_REVIEW):
+  +coverage * 0.20              weighted coverage bonus   (max  0.20)
+  +/-0.10                       correct / incorrect final decision
+  +efficiency * 0.10            step-efficiency bonus when coverage >= 60%
+Maximum achievable total: ~1.0   Minimum: −1.0
+Anti-exploit rule (enforced since v2):
+  A comment MUST satisfy BOTH:
+    1. keyword_hit  — at least one issue keyword appears in the comment text
+    2. line_hit     — comment line_number is within ±LINE_TOLERANCE of the issue
+  `category` match is NOT sufficient on its own.  This closes the keyword-spam
+  exploit where a model dumps all known keywords on a single line.
+"""
+from typing import Any
+try:
+    from ..models import RewardType
+except ImportError:
+    from models import RewardType  # type: ignore[no-redef]
+LINE_TOLERANCE: int = 3  # lines either side of an issue's declared range
+class CodeReviewGrader:
+    def __init__(self, task: dict[str, Any]) -> None:
+        self.task = task
+        self.total_weight: float = sum(iss["weight"] for iss in task["issues"])
+    # ── Per-comment scoring ───────────────────────────────────────────────
+    def score_comment(
+        self,
+        line_number: int | None,
+        comment: str | None,
+        already_found: list[str],
+    ) -> tuple[float, list[str], dict[str, float]]:
+        """
+        Score an ADD_COMMENT action.
+        Returns:
+            (reward_delta, newly_found_issue_ids, component_breakdown)
+        Match condition (BOTH required — no shortcut):
+            keyword_hit  AND  line_hit
+        """
+        if not comment:
+            return 0.0, [], {}
+        comment_lower = comment.lower()
+        newly_found: list[str] = []
+        issue_credit: float = 0.0
+        false_positive_penalty: float = 0.0
+        for issue in self.task["issues"]:
+            if issue["id"] in already_found:
+                continue
+            keyword_hit = any(kw.lower() in comment_lower for kw in issue["keywords"])
+            line_hit = self._line_in_range(line_number, issue["line_range"])
+            # BOTH conditions required — no cat_hit shortcut
+            if keyword_hit and line_hit:
+                credit = (issue["weight"] / self.total_weight) * 0.60
+                newly_found.append(issue["id"])
+                issue_credit += credit
+        # Penalise substantive comments that matched nothing
+        if not newly_found and len(comment.strip()) > 15:
+            false_positive_penalty = -0.02
+        total = round(issue_credit + false_positive_penalty, 4)
+        breakdown = {
+            "issue_credit": round(issue_credit, 4),
+            "false_positive_penalty": round(false_positive_penalty, 4),
+        }
+        return total, newly_found, breakdown
+    # ── Terminal scoring ──────────────────────────────────────────────────
+    def final_score(
+        self,
+        issues_found: list[str],
+        review_decision: str | None,
+        step_count: int,
+        max_steps: int,
+        current_step: int = 0,
+    ) -> RewardType:
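+        """
+        Compute the terminal reward on SUBMIT_REVIEW.
+        step_count and max_steps drive the efficiency bonus; current_step is
+        only recorded as the `step` field of the returned RewardType.
+        Returns a fully typed RewardType with component breakdown.
+        """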
+        unique_found = list(set(issues_found))
+        found_weight = sum(
+            iss["weight"]
+            for iss in self.task["issues"]
+            if iss["id"] in unique_found
+        )
+        coverage = found_weight / self.total_weight if self.total_weight > 0 else 0.0
+        correct_decision = self.task.get("correct_decision", "request_changes")
+        decision_score = 0.10 if review_decision == correct_decision else -0.10
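+        # Guard against a zero step budget, mirroring the total_weight check above.
+        efficiency = max(0.0, 1.0 - step_count / max_steps) if max_steps > 0 else 0.0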
+        efficiency_bonus = round(0.10 * efficiency, 4) if coverage >= 0.60 else 0.0
+        coverage_bonus = round(coverage * 0.20, 4)
+        raw_total = coverage_bonus + decision_score + efficiency_bonus
+        clamped = round(max(-1.0, min(1.0, raw_total)), 4)
+        components = {
+            "coverage_bonus": coverage_bonus,
+            "decision_score": round(decision_score, 4),
+            "efficiency_bonus": efficiency_bonus,
+        }
+        explanation = (
+            f"Found {len(unique_found)}/{len(self.task['issues'])} issues "
+            f"(weighted coverage {coverage:.0%}). "
+            f"Decision '{review_decision}' was "
+            f"{'correct' if review_decision == correct_decision else 'incorrect'}. "
+            f"Used {step_count}/{max_steps} steps."
+        )
+        return RewardType(
+            total=clamped,
+            components=components,
+            passed=review_decision == correct_decision and coverage >= 0.60,
+            explanation=explanation,
+            step=current_step,
+            terminal=True,
+        )
+    # ── Helper ────────────────────────────────────────────────────────────
+    @staticmethod
+    def _line_in_range(
+        line_number: int | None,
+        line_range: tuple[int, int],
+    ) -> bool:
+        if line_number is None:
+            return False
+        start, end = line_range
+        return (start - LINE_TOLERANCE) <= line_number <= (end + LINE_TOLERANCE)
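
Review note: the scoring rules in `grader.py` are easier to sanity-check with concrete numbers. The sketch below re-implements the three formulas as free functions (hypothetical names, plain floats instead of `RewardType`) and applies them to the Task 1 weights from `server/tasks.py` (1.0, 0.5, 1.0, so a total weight of 2.5). It is a simplified illustration, not a replacement for `CodeReviewGrader`.

```python
LINE_TOLERANCE = 3  # same value as in grader.py


def line_in_range(line_number, line_range):
    """True when the comment line falls within the issue range plus tolerance."""
    if line_number is None:
        return False
    start, end = line_range
    return (start - LINE_TOLERANCE) <= line_number <= (end + LINE_TOLERANCE)


def comment_credit(weight, total_weight):
    """Per-issue credit during the episode: weighted share of the 0.60 budget."""
    return (weight / total_weight) * 0.60


def final_total(coverage, decision_correct, step_count, max_steps):
    """Terminal reward: coverage bonus, decision score, and efficiency bonus."""
    decision_score = 0.10 if decision_correct else -0.10
    efficiency = max(0.0, 1.0 - step_count / max_steps)
    # The efficiency bonus is only granted at 60% weighted coverage or more.
    efficiency_bonus = round(0.10 * efficiency, 4) if coverage >= 0.60 else 0.0
    coverage_bonus = round(coverage * 0.20, 4)
    raw_total = coverage_bonus + decision_score + efficiency_bonus
    return round(max(-1.0, min(1.0, raw_total)), 4)


if __name__ == "__main__":
    total_weight = 1.0 + 0.5 + 1.0
    # Credit for the weight-1.0 off-by-one issue in Task 1.
    print(round(comment_credit(1.0, total_weight), 4))
    # A comment on line 7 still matches an issue declared on lines 4-5.
    print(line_in_range(7, (4, 5)))
    # Full coverage, correct decision, 5 of 15 steps used.
    print(final_total(1.0, True, 5, 15))
```

With these weights, the off-by-one issue is worth 0.24 of the 0.60 in-episode budget, and a perfect review submitted after 5 of 15 steps earns a terminal reward of 0.3667 (0.20 coverage, 0.10 decision, 0.0667 efficiency).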

server/requirements.txt ADDED Viewed

	@@ -0,0 +1,6 @@

+openenv[core]>=0.2.0
+fastapi>=0.115.0
+uvicorn>=0.24.0

server/tasks.py ADDED Viewed

	@@ -0,0 +1,719 @@

+"""
+Task definitions for the CodeReviewAgent environment.
+Six tasks across four difficulty tiers. Each task defines:
+  - code: Python source to review
+  - issues: list of ground-truth issues with grading metadata
+  - correct_decision: expected final review decision
+Difficulty ladder:
+  0  ultra-easy  — hints embedded in comments; bootstraps GRPO positive trajectories
+  1  easy        — 3 clean logic bugs, no hints
+  2  medium      — 5 security issues in an auth module
+  3  hard        — 7 mixed issues in a data pipeline
+  4  medium      — 5 async concurrency bugs
+  5  hard        — 6 Flask API security issues
+"""
+from typing import Any
+TASKS: list[dict[str, Any]] = [
+    # ── Task 0: Ultra-easy (bootstrap) ───────────────────────────────────────
+    # DESIGN INTENT: both issues have their category name spelled out in a code
+    # comment directly above them.  A frozen weak model that simply reads the
+    # comments and echoes them back should reliably score > 0.  This task exists
+    # solely to guarantee that GRPO has at least a few positive trajectories from
+    # training step 1.
+    {
+        "id": 0,
+        "name": "Bootstrap: Obvious Issues",
+        "difficulty": "ultra-easy",
+        "file_name": "bootstrap.py",
+        "description": (
+            "Review this short Python module. "
+            "The comments above each function hint at the kind of issue present. "
+            "Add a comment for each bug you find (line number, severity, category), "
+            "call request_changes, then submit."
+        ),
+        "max_steps": 6,
+        "code": """\
+# BUG: this loop has an off-by-one error — it iterates one index too far
+def sum_items(data):
+    total = 0
+    for i in range(len(data) + 1):   # line 4: causes IndexError on last iteration
+        total += data[i]
+    return total
+# SECURITY: hardcoded credential — move to environment variable
+def connect_db():
+    db_password = "s3cr3t_prod_pw"   # line 11: hardcoded credential in source
+    return f"postgresql://admin:{db_password}@localhost/mydb"
+""",
+        "issues": [
+            {
+                "id": "bootstrap_off_by_one",
+                "description": "Off-by-one: range(len+1) causes IndexError on the last iteration",
+                "line_range": (4, 4),
+                "keywords": [
+                    "off-by-one", "off by one", "bug", "index", "indexerror",
+                    "range", "+ 1", "len + 1", "out of bounds",
+                ],
+                "category": "bug",
+                "severity": "error",
+                "weight": 1.0,
+            },
+            {
+                "id": "bootstrap_hardcoded_cred",
+                "description": "Hardcoded password in source should be an environment variable",
+                "line_range": (11, 11),
+                "keywords": [
+                    "hardcoded", "hard-coded", "security", "credential", "password",
+                    "secret", "env", "environment variable", "os.environ",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+        ],
+        "correct_decision": "request_changes",
+    },
+    # ── Task 1: Easy ─────────────────────────────────────────────────────────
+    {
+        "id": 1,
+        "name": "Basic Bug Detection",
+        "difficulty": "easy",
+        "file_name": "utils.py",
+        "description": (
+            "Review this Python utility module. "
+            "Identify any bugs, logical errors, or code quality issues. "
+            "Add a comment for each issue you find (include line number, severity, "
+            "and category), then submit your review."
+        ),
+        "max_steps": 15,
+        "code": """\
+def calculate_average(numbers):
+    \"\"\"Calculate the average of a list of numbers.\"\"\"
+    total = 0
+    for i in range(len(numbers) + 1):  # line 4
+        total += numbers[i]
+    average = total / len(numbers)
+    unused_result = sorted(numbers)  # line 7
+    return average
+def find_max(items):
+    \"\"\"Return the maximum value in a list.\"\"\"
+    if len(items) == 0:
+        return None
+    max_val = items[0]
+    for item in items:
+        if item > max_val:
+            max_val == item  # line 17: should be =, not ==
+    return max_val
+def is_palindrome(s):
+    \"\"\"Check if a string is a palindrome.\"\"\"
+    return s == s[::-1]
+""",
+        "issues": [
+            {
+                "id": "off_by_one",
+                "description": "Off-by-one: range(len+1) causes IndexError on the last iteration",
+                "line_range": (4, 5),
+                "keywords": [
+                    "off-by-one", "off by one", "range", "index", "indexerror",
+                    "out of bounds", "len + 1", "+ 1", "index out",
+                ],
+                "category": "bug",
+                "severity": "error",
+                "weight": 1.0,
+            },
+            {
+                "id": "unused_variable",
+                "description": "unused_result is assigned but never used",
+                "line_range": (7, 7),
+                "keywords": [
+                    "unused", "unused_result", "never used", "dead code",
+                    "not used", "unnecessary",
+                ],
+                "category": "style",
+                "severity": "info",
+                "weight": 0.5,
+            },
+            {
+                "id": "assignment_not_update",
+                "description": "max_val == item uses == (comparison) instead of = (assignment); max is never updated",
+                "line_range": (17, 17),
+                "keywords": [
+                    "==", "assignment", "comparison", "max_val", "never update",
+                    "not updating", "wrong operator", "should be =", "max never",
+                ],
+                "category": "bug",
+                "severity": "error",
+                "weight": 1.0,
+            },
+        ],
+        "correct_decision": "request_changes",
+    },
+    # ── Task 2: Medium ───────────────────────────────────────────────────────
+    {
+        "id": 2,
+        "name": "Security Vulnerability Review",
+        "difficulty": "medium",
+        "file_name": "auth.py",
+        "description": (
+            "Review this authentication module for security vulnerabilities. "
+            "Pay careful attention to credential handling, input sanitization, "
+            "and cryptographic choices. Annotate every issue with its severity "
+            "and category, then submit your review."
+        ),
+        "max_steps": 20,
+        "code": """\
+import sqlite3
+import hashlib
+import os
+DB_PASSWORD = "super_secret_123"   # line 5
+ADMIN_TOKEN = "tok_admin_abc123"   # line 6
+def authenticate_user(username, password):
+    \"\"\"Authenticate a user against the database.\"\"\"
+    conn = sqlite3.connect('app.db')
+    cursor = conn.cursor()
+    # line 12: f-string interpolation → SQL injection
+    query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
+    cursor.execute(query)
+    user = cursor.fetchone()
+    conn.close()
+    return user is not None
+def hash_password(password):
+    \"\"\"Hash a password for storage.\"\"\"
+    return hashlib.md5(password.encode()).hexdigest()  # line 21
+def execute_admin_command(command):
+    \"\"\"Execute an admin maintenance command.\"\"\"
+    result = eval(command)   # line 25
+    return result
+def get_user_data(user_id):
+    \"\"\"Fetch user profile from internal service.\"\"\"
+    import requests
+    url = f"https://internal-api/users/{user_id}"
+    response = requests.get(url, verify=False)  # line 32
+    return response.json()
+""",
+        "issues": [
+            {
+                "id": "hardcoded_credentials",
+                "description": "Credentials hard-coded in source (lines 5-6)",
+                "line_range": (5, 6),
+                "keywords": [
+                    "hardcoded", "hard-coded", "hard coded", "hardcode",
+                    "db_password", "admin_token", "plaintext credential",
+                    "environment variable", "env var", "os.environ",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "sql_injection",
+                "description": "SQL injection via unsanitised f-string interpolation",
+                "line_range": (12, 14),
+                "keywords": [
+                    "sql injection", "sql", "injection", "f-string", "parameterized",
+                    "sanitize", "escape", "prepared statement", "placeholder",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "weak_hashing",
+                "description": "MD5 is cryptographically broken for password storage",
+                "line_range": (21, 21),
+                "keywords": [
+                    "md5", "weak", "bcrypt", "argon2", "pbkdf2", "scrypt",
+                    "cryptographic", "password hashing", "hash", "broken",
+                ],
+                "category": "security",
+                "severity": "error",
+                "weight": 0.75,
+            },
+            {
+                "id": "arbitrary_code_execution",
+                "description": "eval() on untrusted input allows arbitrary code execution",
+                "line_range": (25, 25),
+                "keywords": [
+                    "eval", "arbitrary code", "code execution", "rce",
+                    "remote code", "dangerous", "unsafe",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "ssl_verification_disabled",
+                "description": "verify=False disables TLS cert validation, enabling MITM attacks",
+                "line_range": (32, 32),
+                "keywords": [
+                    "ssl", "verify", "certificate", "mitm",
+                    "man-in-the-middle", "tls", "verify=false", "cert",
+                ],
+                "category": "security",
+                "severity": "error",
+                "weight": 0.75,
+            },
+        ],
+        "correct_decision": "request_changes",
+    },
+    # ── Task 3: Hard ─────────────────────────────────────────────────────────
+    {
+        "id": 3,
+        "name": "Full Architecture and Performance Review",
+        "difficulty": "hard",
+        "file_name": "data_pipeline.py",
+        "description": (
+            "Perform a comprehensive review of this data pipeline. "
+            "Identify bugs, security vulnerabilities, performance bottlenecks, "
+            "and architectural design issues. Each comment should clearly explain "
+            "the problem and suggest a fix. Submit your review when done."
+        ),
+        "max_steps": 30,
+        "code": """\
+import requests
+import json
+import time
+from threading import Thread
+API_KEY = "sk-prod-abc123def456"   # line 6
+class DataPipeline:
+    def __init__(self, endpoint):
+        self.endpoint = endpoint
+        self.results = []
+        self.cache = {}   # line 13: unbounded
+    def fetch_batch(self, item_ids):
+        \"\"\"Fetch items from the API.\"\"\"
+        items = []
+        for item_id in item_ids:   # line 17: N+1 pattern
+            response = requests.get(
+                f"{self.endpoint}/items/{item_id}",
+                headers={"Authorization": f"Bearer {API_KEY}"},
+                verify=False,   # line 22
+            )
+            items.append(response.json())
+        return items
+    def process_items(self, items):
+        \"\"\"Transform items for storage.\"\"\"
+        results = []
+        for i in range(len(items)):   # line 28: use enumerate
+            item = items[i]
+            transformed = {
+                "id": item["id"],          # line 31: KeyError not handled
+                "value": item["value"] * 2,
+                "label": item.get("label", "unknown"),
+            }
+            results.append(transformed)
+            self.cache[item["id"]] = transformed   # line 36
+        return results
+    def run_async(self, func, *args):
+        \"\"\"Run function in a background thread.\"\"\"
+        t = Thread(target=func, args=args)
+        t.start()
+        # line 43: thread not tracked or joined — resource leak
+    def save_results(self, results, output_path):
+        \"\"\"Persist results to disk.\"\"\"
+        with open(output_path, "w") as f:
+            json.dump(results, f)
+    def retry_failed(self, failed_ids, max_retries=10):   # line 50
+        \"\"\"Re-fetch items that previously failed.\"\"\"
+        for item_id in failed_ids:
+            for attempt in range(max_retries):
+                try:
+                    result = requests.get(
+                        f"{self.endpoint}/items/{item_id}"
+                    )
+                    if result.status_code == 200:
+                        self.results.append(result.json())
+                        break
+                except Exception:
+                    time.sleep(1)   # line 60: no exponential backoff
+""",
+        "issues": [
+            {
+                "id": "hardcoded_api_key",
+                "description": "API key hard-coded in source instead of an environment variable",
+                "line_range": (6, 6),
+                "keywords": [
+                    "hardcoded", "hard-coded", "hardcode", "api key", "api_key",
+                    "environment variable", "env var", "os.environ", "sk-prod",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "n_plus_one_requests",
+                "description": "One HTTP request per item (N+1 pattern); should use a bulk/batch endpoint",
+                "line_range": (17, 24),
+                "keywords": [
+                    "n+1", "n plus 1", "batch", "bulk", "loop",
+                    "individual request", "serial", "one request per",
+                ],
+                "category": "performance",
+                "severity": "error",
+                "weight": 1.0,
+            },
+            {
+                "id": "ssl_disabled",
+                "description": "SSL certificate verification disabled (verify=False)",
+                "line_range": (22, 22),
+                "keywords": [
+                    "ssl", "verify", "certificate", "tls",
+                    "mitm", "verify=false", "cert",
+                ],
+                "category": "security",
+                "severity": "error",
+                "weight": 0.75,
+            },
+            {
+                "id": "missing_key_error_handling",
+                "description": "Direct dict access item['id'] / item['value'] raises KeyError on unexpected payloads",
+                "line_range": (31, 32),
+                "keywords": [
+                    "keyerror", "key error", "error handling", "missing key",
+                    "exception", "try", ".get(", "dict access",
+                ],
+                "category": "bug",
+                "severity": "warning",
+                "weight": 0.75,
+            },
+            {
+                "id": "unbounded_cache",
+                "description": "self.cache grows without bound; will cause OOM on large inputs",
+                "line_range": (13, 13),
+                "keywords": [
+                    "unbounded", "memory leak", "cache size", "limit",
+                    "lru", "eviction", "grow", "oom", "memory",
+                ],
+                "category": "design",
+                "severity": "warning",
+                "weight": 0.75,
+            },
+            {
+                "id": "thread_not_joined",
+                "description": "Thread is started but never stored or joined — silent resource/exception leak",
+                "line_range": (40, 43),
+                "keywords": [
+                    "thread", "join", "track", "resource leak",
+                    "daemon", "not joined", "not tracked",
+                ],
+                "category": "bug",
+                "severity": "error",
+                "weight": 1.0,
+            },
+            {
+                "id": "no_exponential_backoff",
+                "description": "Retry loop sleeps 1 s flat; needs exponential backoff to avoid hammering the API",
+                "line_range": (50, 60),
+                "keywords": [
+                    "backoff", "exponential", "retry", "sleep", "rate limit",
+                    "jitter", "aggressive",
+                ],
+                "category": "design",
+                "severity": "warning",
+                "weight": 0.5,
+            },
+        ],
+        "correct_decision": "request_changes",
+    },
+    # ── Task 4: Medium — Async Concurrency ───────────────────────────────
+    {
+        "id": 4,
+        "name": "Async Worker Review",
+        "difficulty": "medium",
+        "file_name": "async_worker.py",
+        "description": (
+            "Review this async worker module for concurrency bugs, "
+            "resource leaks, and exception-handling problems. "
+            "Comment on every issue with its line number, severity, "
+            "and category, then submit your review."
+        ),
+        "max_steps": 20,
+        "code": """\
+import asyncio
+import aiohttp
+_counter = 0           # line 3: shared mutable state, not thread/task-safe
+async def fetch_url(url: str) -> dict:
+    \"\"\"Fetch a URL and return JSON.\"\"\"
+    session = aiohttp.ClientSession()   # line 7: session never closed → resource leak
+    async with session.get(url) as resp:
+        return await resp.json()
+async def increment_and_fetch(url: str) -> dict:
+    \"\"\"Increment shared counter then fetch.\"\"\"
+    global _counter
+    _counter += 1          # line 15: race condition — not atomic in concurrent tasks
+    data = fetch_url(url)  # line 16: missing await → returns coroutine, not result
+    return data
+async def run_all(urls: list) -> list:
+    \"\"\"Run all fetches concurrently.\"\"\"
+    tasks = [increment_and_fetch(u) for u in urls]
+    results = []
+    for coro in tasks:
+        try:
+            result = await coro
+            results.append(result)
+        except Exception:
+            pass           # line 27: swallows all exceptions silently
+    return results
+async def retry_fetch(url: str, retries: int = 3) -> dict:
+    \"\"\"Fetch with retry logic.\"\"\"
+    for attempt in range(retries):
+        try:
+            return await fetch_url(url)
+        except Exception as e:
+            if attempt == retries - 1:
+                raise
+            await asyncio.sleep(1)  # line 38: flat sleep, no exponential backoff
+""",
+        "issues": [
+            {
+                "id": "shared_mutable_state",
+                "description": "Module-level _counter mutated by concurrent tasks without a lock",
+                "line_range": (3, 3),
+                "keywords": [
+                    "shared", "race condition", "thread-safe", "task-safe",
+                    "atomic", "lock", "asyncio.lock", "concurrent", "global",
+                    "mutable", "not safe",
+                ],
+                "category": "bug",
+                "severity": "error",
+                "weight": 1.0,
+            },
+            {
+                "id": "unclosed_session",
+                "description": "aiohttp.ClientSession created inside function is never closed → resource leak",
+                "line_range": (7, 9),
+                "keywords": [
+                    "session", "not closed", "resource leak", "close", "context manager",
+                    "async with", "clientsession", "leak", "aiohttp",
+                ],
+                "category": "bug",
+                "severity": "error",
+                "weight": 1.0,
+            },
+            {
+                "id": "missing_await",
+                "description": "fetch_url(url) called without await — returns unawaited coroutine",
+                "line_range": (16, 16),
+                "keywords": [
+                    "await", "missing await", "coroutine", "not awaited", "unawaited",
+                    "returns coroutine",
+                ],
+                "category": "bug",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "silent_exception",
+                "description": "except Exception: pass swallows all exceptions silently, hiding errors",
+                "line_range": (27, 27),
+                "keywords": [
+                    "swallow", "silent", "bare except", "exception", "pass",
+                    "ignore", "hidden", "suppress", "logging",
+                ],
+                "category": "design",
+                "severity": "warning",
+                "weight": 0.75,
+            },
+            {
+                "id": "no_backoff",
+                "description": "Retry sleep is flat 1 s; should use exponential backoff with jitter",
+                "line_range": (38, 38),
+                "keywords": [
+                    "backoff", "exponential", "jitter", "retry", "sleep",
+                    "flat", "rate limit",
+                ],
+                "category": "design",
+                "severity": "warning",
+                "weight": 0.5,
+            },
+        ],
+        "correct_decision": "request_changes",
+    },
+    # ── Task 5: Hard — Flask API Vulnerabilities ──────────────────────────
+    {
+        "id": 5,
+        "name": "Flask API Security Review",
+        "difficulty": "hard",
+        "file_name": "api_server.py",
+        "description": (
+            "Perform a thorough security review of this Flask REST API. "
+            "Look for injection flaws, path traversal, insecure deserialization, "
+            "sensitive data exposure, and missing access controls. "
+            "Comment on every issue, then submit your review."
+        ),
+        "max_steps": 30,
+        "code": """\
+import os
+import pickle
+import subprocess
+import logging
+from flask import Flask, request, jsonify, send_file
+app = Flask(__name__)
+SECRET_KEY = "flask-secret-hardcoded"   # line 8
+logging.basicConfig(level=logging.DEBUG)
+@app.route("/run", methods=["POST"])
+def run_command():
+    \"\"\"Run a system command and return output.\"\"\"
+    cmd = request.json.get("command", "")
+    # line 15: unsanitised shell command → OS command injection
+    result = subprocess.check_output(cmd, shell=True, text=True)
+    return jsonify({"output": result})
+@app.route("/files", methods=["GET"])
+def get_file():
+    \"\"\"Serve a file from the data directory.\"\"\"
+    filename = request.args.get("name", "")
+    # line 23: no path normalisation → path traversal
+    path = os.path.join("/app/data", filename)
+    return send_file(path)
+@app.route("/load", methods=["POST"])
+def load_object():
+    \"\"\"Deserialise a user-supplied payload.\"\"\"
+    data = request.get_data()
+    # line 30: pickle.loads on untrusted data → arbitrary code execution
+    obj = pickle.loads(data)
+    return jsonify({"type": str(type(obj))})
+@app.route("/login", methods=["POST"])
+def login():
+    \"\"\"Authenticate and return a token.\"\"\"
+    username = request.json.get("username")
+    password = request.json.get("password")
+    # line 38: credentials logged at DEBUG level
+    logging.debug(f"Login attempt: username={username} password={password}")
+    if username == "admin" and password == SECRET_KEY:
+        return jsonify({"token": SECRET_KEY})   # line 41: secret returned in response
+    return jsonify({"error": "unauthorized"}), 401
+@app.route("/admin", methods=["GET"])
+def admin_panel():
+    \"\"\"Return admin data — no auth check.\"\"\"
+    # line 47: no authentication or authorisation check
+    return jsonify({"users": ["alice", "bob", "admin"], "config": {"debug": True}})
+""",
+        "issues": [
+            {
+                "id": "hardcoded_secret",
+                "description": "Flask SECRET_KEY hard-coded in source; should come from env var",
+                "line_range": (8, 8),
+                "keywords": [
+                    "hardcoded", "hard-coded", "secret_key", "environment variable",
+                    "env var", "os.environ", "secret", "hardcode",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 0.75,
+            },
+            {
+                "id": "command_injection",
+                "description": "subprocess.check_output with shell=True and unsanitised user input → OS command injection",
+                "line_range": (15, 16),
+                "keywords": [
+                    "command injection", "shell injection", "shell=true", "subprocess",
+                    "os injection", "arbitrary command", "unsanitised", "sanitize",
+                    "injection",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "path_traversal",
+                "description": "No path normalisation allows ../../../etc/passwd-style traversal",
+                "line_range": (23, 24),
+                "keywords": [
+                    "path traversal", "directory traversal", "path normaliz",
+                    "os.path.abspath", "realpath", "../", "dot dot",
+                    "escape", "filename", "traversal",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "insecure_deserialization",
+                "description": "pickle.loads on untrusted user data allows arbitrary code execution",
+                "line_range": (30, 31),
+                "keywords": [
+                    "pickle", "deserialization", "deserialisation", "arbitrary code",
+                    "untrusted", "rce", "remote code", "insecure deserialization",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+            {
+                "id": "credentials_in_logs",
+                "description": "Plaintext username and password written to DEBUG log",
+                "line_range": (38, 38),
+                "keywords": [
+                    "log", "logging", "credential", "password", "sensitive",
+                    "plaintext", "debug", "leak", "exposure",
+                ],
+                "category": "security",
+                "severity": "error",
+                "weight": 0.75,
+            },
+            {
+                "id": "missing_auth_check",
+                "description": "Admin endpoint has no authentication or authorisation guard",
+                "line_range": (47, 47),
+                "keywords": [
+                    "auth", "authentication", "authorization", "authorisation",
+                    "access control", "no check", "unprotected", "unauthenticated",
+                    "missing auth",
+                ],
+                "category": "security",
+                "severity": "critical",
+                "weight": 1.0,
+            },
+        ],
+        "correct_decision": "request_changes",
+    },
+]
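
Each issue entry in the task list above pairs a `line_range` with a `keywords` list and a `weight`. How `server/grader.py` consumes these fields is not shown in this part of the diff, so the following is only a minimal sketch. It assumes a review comment counts as a hit when its line falls inside `line_range` and its text contains at least one keyword (case-insensitive substring match). The names `comment_matches` and `weighted_recall`, the `tolerance` parameter, and the comment dict shape (`line`, `text`) are hypothetical and not taken from the repository.

```python
# Hypothetical issue entry, shaped like the dicts in the task list above.
ISSUE = {
    "id": "missing_await",
    "line_range": (16, 16),
    "keywords": ["await", "missing await", "coroutine", "not awaited"],
    "weight": 1.0,
}


def comment_matches(issue: dict, line: int, text: str, tolerance: int = 0) -> bool:
    """Return True when a comment lands in the issue's line range and mentions a keyword.

    `tolerance` (an assumption, not from the source) widens the range by N lines
    on each side.
    """
    start, end = issue["line_range"]
    in_range = start - tolerance <= line <= end + tolerance
    lowered = text.lower()
    has_keyword = any(keyword in lowered for keyword in issue["keywords"])
    return in_range and has_keyword


def weighted_recall(issues: list, comments: list) -> float:
    """Fraction of total issue weight covered by at least one matching comment."""
    total = sum(issue["weight"] for issue in issues)
    if total == 0:
        return 0.0
    found = sum(
        issue["weight"]
        for issue in issues
        if any(comment_matches(issue, c["line"], c["text"]) for c in comments)
    )
    return found / total


if __name__ == "__main__":
    review = [{"line": 16, "text": "fetch_url returns an unawaited coroutine"}]
    print(weighted_recall([ISSUE], review))
```

Substring matching is deliberately crude: a short keyword such as `try` also matches inside `retry`, which is one reason to require the line-range check alongside the keyword hit, as sketched here.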

uv.lock ADDED

The diff for this file is too large to render.