codemaverick2 committed
Commit ff9fcbd · 0 Parent(s)

Code Review Environment OpenEnv hackathon submission
.env.example ADDED
# Copy to .env and fill in values

# Required for baseline.py LLM inference
OPENAI_API_KEY=sk-...

# Optional: override the environment URL for baseline.py
ENV_URL=http://localhost:7860

# Optional: override the model for baseline.py
BASELINE_MODEL=gpt-4o-mini
.gitignore ADDED
__pycache__/
*.pyc
*.pyo
.pytest_cache/
*.egg-info/
dist/
build/
.env
.DS_Store
Dockerfile ADDED
# Code Review Environment — Docker Image
#
# Build: docker build -t code-review-env .
# Run:   docker run -p 7860:7860 code-review-env
# Test:  curl http://localhost:7860/health

FROM python:3.11-slim

# Create non-root user (HF Spaces requirement)
RUN useradd -m -u 1000 appuser

WORKDIR /app

# Install dependencies first (caching layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY --chown=appuser:appuser . .

# Switch to non-root user
USER appuser

# Expose port (HF Spaces uses 7860)
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')" || exit 1

# Start server
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
README.md ADDED
---
title: Code Review Environment
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
- openenv
- code-review
- security-audit
- reinforcement-learning
---

# Code Review Environment

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible environment for training and evaluating AI agents on code review and security auditing tasks.

The agent inspects code files, flags bugs and vulnerabilities with precise line numbers and severity ratings, and receives graded feedback — enabling reinforcement learning from human-quality code review signal.

## Why This Environment

Code review is one of the highest-value tasks in software engineering: every professional software team does it daily. Training AI agents to perform thorough, accurate code reviews is commercially valuable and technically challenging:

- **Precise reasoning required**: the agent must count lines, understand language semantics, and reason about control flow
- **Real impact**: bugs found prevent production incidents; vulnerabilities found prevent security breaches
- **Natural difficulty progression**: obvious logic errors → subtle security vulnerabilities → complex architectural issues
- **Clear grading**: issues exist at specific lines with specific types — objective F1-based scoring

## Action Space

```json
{
  "action_type": "flag_issue | clear_flag | request_hint | submit_review",
  "line_number": 6,
  "filename": "utils.py",
  "issue_type": "bug | security | performance | logic",
  "severity": "low | medium | high | critical",
  "description": "Description of the issue",
  "fix_suggestion": "How to fix it (optional)"
}
```

| Action | Description | Reward |
|--------|-------------|--------|
| `flag_issue` | Mark a line as containing an issue | +0.10 if correct, −0.05 if wrong |
| `clear_flag` | Remove a previously flagged issue | +0.03 if it was a false positive, −0.03 if it was a true positive |
| `request_hint` | Get a hint about what to look for | −0.01 |
| `submit_review` | Finalize and receive graded score | Final graded score (see Scoring) |

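The per-action reward rules in the table above can be sketched as a tiny lookup. This is an illustrative reimplementation, not the server's code; `action_reward` is a hypothetical helper name:

```python
# Illustrative sketch of the per-action reward table; not part of the
# environment API. `correct` means the flag/clear was justified.

def action_reward(action_type: str, correct: bool = True) -> float:
    """Return the reward delta for one action, per the README table."""
    if action_type == "flag_issue":
        return 0.10 if correct else -0.05
    if action_type == "clear_flag":
        # +0.03 for clearing a false positive, -0.03 for a true positive
        return 0.03 if correct else -0.03
    if action_type == "request_hint":
        return -0.01
    # submit_review: the reward is the final graded score, not a fixed delta
    return 0.0

print(action_reward("flag_issue", correct=False))  # -0.05
```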
## Observation Space

```json
{
  "task_id": "bug-detection",
  "task_description": "Review this Python utility module...",
  "code_files": {"utils.py": "def calculate_average(numbers):\n..."},
  "language": "python",
  "flagged_issues": [...],
  "step_count": 3,
  "max_steps": 15,
  "hints_remaining": 2,
  "feedback": "Good catch! Issue flagged at utils.py:6 [+0.10 reward]",
  "current_score": 0.333,
  "done": false,
  "reward": 0.1
}
```

Note: `code_files` is only populated in the first observation (after `reset()`). Subsequent step observations omit it to keep payloads small.

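Because later observations omit `code_files`, a client loop should cache it from the reset observation. A minimal sketch with plain dicts standing in for real observations (`run_episode` is a hypothetical helper):

```python
# Sketch: cache code_files from the reset observation, since later
# step observations omit the field. Observations are plain dicts here.

def run_episode(reset_obs: dict, step_obs_stream: list) -> dict:
    code_files = reset_obs.get("code_files", {})  # cache once, up front
    for obs in step_obs_stream:
        # step observations carry no code_files; fall back to the cache
        files = obs.get("code_files") or code_files
        assert files, "code listing lost"
    return code_files

cached = run_episode(
    {"code_files": {"utils.py": "def f(): ..."}},
    [{"step_count": 1}, {"step_count": 2}],
)
print(sorted(cached))  # ['utils.py']
```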
## Tasks

### Task 1: `bug-detection` — Easy

Identify 3 logical bugs in a Python utility module (`utils.py`).

| Line | Issue | Severity |
|------|-------|----------|
| 6 | Off-by-one error: `range(len(numbers) + 1)` causes `IndexError` | High |
| 13 | Binary search upper bound: `len(arr)` should be `len(arr) - 1` | Medium |
| 33 | Word count initializes new entries to `0` instead of `1` | Low |

**Max steps:** 15

### Task 2: `security-audit` — Medium

Audit a Flask web application (`app.py`) for OWASP Top-10 vulnerabilities.

| Line | Issue | Severity |
|------|-------|----------|
| 8 | Hardcoded `SECRET_KEY` in source | High |
| 9 | Hardcoded `DB_PASSWORD` in source | High |
| 19 | SQL injection via f-string query | Critical |
| 27 | XSS via unsanitized `render_template_string` | High |
| 34 | Path traversal via `os.path.join` | High |
| 40 | Missing authentication on admin endpoint | Critical |
| 51 | Command injection via `shell=True` | Critical |

**Max steps:** 20

### Task 3: `comprehensive-review` — Hard

Comprehensive review of a Django e-commerce API across two files (`views.py`, `models.py`).

| File | Line | Issue | Severity |
|------|------|-------|----------|
| views.py | 21 | N+1 query in order creation loop | High |
| views.py | 26 | Race condition — stock check not atomic | Critical |
| views.py | 29 | Order created outside transaction | High |
| views.py | 47 | No max cap on `per_page` parameter | Medium |
| views.py | 66 | MD5 for payment verification (broken crypto) | Medium |
| views.py | 67 | Timing attack in payment hash comparison | Medium |
| models.py | 8 | Plaintext password storage | Critical |
| models.py | 16 | `FloatField` for monetary values | Medium |
| models.py | 18 | `BinaryField` with pickled data (RCE risk) | High |

**Max steps:** 30

## Scoring

```
final_score = 0.70 × F1 + 0.30 × severity_accuracy

where:
  F1                = 2 × precision × recall / (precision + recall)
  precision         = correct_flags / total_flags
  recall            = correct_flags / total_gt_issues
  severity_accuracy = avg(1 − |flag_sev_rank − gt_sev_rank| × 0.34) for matched issues

Matching tolerance: ±2 lines, same filename, compatible issue type
```

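A sketch of this formula in Python. This is not the server's actual grader: issue-type compatibility is simplified to exact equality, and a flag may match more than one ground-truth issue here.

```python
# Sketch of the scoring formula above (not the server's grader).
# Issues are dicts with filename, line_number, issue_type, severity.

SEV_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def match(flag: dict, gt: dict) -> bool:
    # Matching tolerance: ±2 lines, same filename; issue-type
    # compatibility is simplified here to exact equality.
    return (flag["filename"] == gt["filename"]
            and abs(flag["line_number"] - gt["line_number"]) <= 2
            and flag["issue_type"] == gt["issue_type"])

def final_score(flags: list, ground_truth: list) -> float:
    # Pair each ground-truth issue with the first flag that matches it.
    matched = []
    for gt in ground_truth:
        hit = next((f for f in flags if match(f, gt)), None)
        if hit is not None:
            matched.append((hit, gt))
    correct = len(matched)
    if not flags or not ground_truth or correct == 0:
        return 0.0
    precision = correct / len(flags)
    recall = correct / len(ground_truth)
    f1 = 2 * precision * recall / (precision + recall)
    # Severity accuracy: 1 minus 0.34 per rank of severity mismatch
    sev_acc = sum(
        1 - abs(SEV_RANK[f["severity"]] - SEV_RANK[g["severity"]]) * 0.34
        for f, g in matched
    ) / correct
    return 0.70 * f1 + 0.30 * sev_acc

perfect = [{"filename": "utils.py", "line_number": 6,
            "issue_type": "bug", "severity": "high"}]
print(round(final_score(perfect, perfect), 3))  # 1.0
```

A perfect single-issue review scores 1.0; flagging the same line with severity `low` instead of `high` keeps F1 at 1.0 but drops severity accuracy to 0.32, for a final score of 0.796.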
## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start new episode. Body: `{"task_id": "bug-detection", "seed": 42}` |
| `POST` | `/step` | Take action. Body: `ReviewAction` JSON |
| `GET` | `/state` | Get current episode state |
| `GET` | `/health` | Health check → `{"status": "healthy"}` |
| `GET` | `/tasks` | List all tasks + action schema |
| `POST` | `/grader` | Grade findings: `{"task_id": "...", "flagged_issues": [...]}` |
| `POST` | `/baseline` | Run keyword heuristic on all tasks |
| `WS` | `/ws` | WebSocket session (OpenEnv standard) |
| `GET` | `/docs` | Swagger UI |

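The standalone `/grader` endpoint takes a batch of findings outside any episode. A sketch of building its request body; the HTTP call itself is shown commented so the snippet runs offline (`grader_payload` is a hypothetical helper, and the local URL assumes a server started as in Setup below):

```python
# Build a /grader request body for the bug-detection task.
# Field names follow the ReviewAction/Issue schema in this README.

def grader_payload(task_id: str, issues: list) -> dict:
    return {"task_id": task_id, "flagged_issues": issues}

payload = grader_payload("bug-detection", [{
    "filename": "utils.py",
    "line_number": 6,
    "issue_type": "bug",
    "severity": "high",
    "description": "Off-by-one in range()",
}])

# With a server running locally:
# import httpx
# print(httpx.post("http://localhost:7860/grader", json=payload).json())
print(sorted(payload))  # ['flagged_issues', 'task_id']
```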
## Setup & Usage

### Local (uvicorn)

```bash
git clone https://github.com/CodeMaverick2/code-review-env
cd code-review-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
docker build -t code-review-env .
docker run -p 7860:7860 code-review-env
```

### Quick test

```bash
curl http://localhost:7860/health

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "bug-detection"}'

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "flag_issue", "line_number": 6, "filename": "utils.py", "issue_type": "bug", "severity": "high", "description": "Off-by-one"}'

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit_review"}'
```

### Python client

```python
from client import CodeReviewEnv, ReviewAction

with CodeReviewEnv("http://localhost:7860").sync() as env:
    result = env.reset(task_id="bug-detection")
    print(result.observation.code_files["utils.py"])

    result = env.step(ReviewAction(
        action_type="flag_issue",
        line_number=6,
        filename="utils.py",
        issue_type="bug",
        severity="high",
        description="Off-by-one error in range()"
    ))
    print(result.observation.feedback)

    result = env.step(ReviewAction(action_type="submit_review"))
    print(f"Final score: {result.reward:.3f}")
```

### Inference script

```bash
# No API key needed — uses built-in keyword heuristic
python inference.py

# With LLM (OpenAI-compatible API)
export API_BASE_URL=https://openrouter.ai/api/v1
export MODEL_NAME=openai/gpt-4o-mini
export HF_TOKEN=sk-...
python inference.py
```

### Demo

```bash
python demo.py
python demo.py --task security-audit
python demo.py --task comprehensive-review
```

### Tests

```bash
pip install pytest
pytest tests/ -v
```

## Baseline Scores

| Task | Keyword heuristic | GPT-4o-mini |
|------|-------------------|-------------|
| bug-detection | 1.00 | ~0.52 |
| security-audit | 0.75 | ~0.59 |
| comprehensive-review | 0.67 | ~0.17 |
| **Overall** | **0.81** | **~0.43** |

The keyword heuristic runs via `inference.py` with no API key. LLM scores use `API_BASE_URL` + `HF_TOKEN`.

## Project Structure

```
code-review-env/
├── README.md
├── openenv.yaml          ← OpenEnv manifest
├── Dockerfile            ← Container (HF Spaces, port 7860)
├── pyproject.toml        ← Package config + entry points
├── requirements.txt
├── uv.lock
├── inference.py          ← Inference script
├── demo.py               ← Demo script (no API key needed)
├── client.py             ← HTTP client
├── models.py             ← ReviewAction, ReviewObservation, ReviewState, Issue
├── tasks/
│   └── data.py           ← 3 task definitions + ground truth
├── server/
│   ├── app.py            ← FastAPI application
│   ├── environment.py    ← Core environment logic
│   └── graders.py        ← F1 grading + keyword baseline
└── tests/
    ├── test_environment.py
    └── test_graders.py
```
client.py ADDED
"""
HTTP client for the Code Review Environment.

Usage:
    from client import CodeReviewEnv, ReviewAction

    with CodeReviewEnv(base_url="http://localhost:7860").sync() as env:
        result = env.reset(task_id="bug-detection")
        obs = result.observation
        print(obs.task_description)
        print(obs.code_files)

        # Flag an issue
        result = env.step(ReviewAction(
            action_type="flag_issue",
            line_number=6,
            filename="utils.py",
            issue_type="bug",
            severity="high",
            description="Off-by-one in range()"
        ))
        print(result.observation.feedback)

        # Submit
        result = env.step(ReviewAction(action_type="submit_review"))
        print(f"Score: {result.reward:.3f}")
"""
from __future__ import annotations

import os
import sys

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from typing import Generic, Optional, TypeVar

from models import Issue, ReviewAction, ReviewObservation, ReviewState

ObsT = TypeVar("ObsT")


class StepResult(Generic[ObsT]):
    """Container pairing an observation with its reward and done flag."""

    def __init__(
        self,
        observation: ObsT,
        reward: Optional[float] = None,
        done: bool = False,
    ):
        self.observation = observation
        self.reward = reward
        self.done = done

    def __repr__(self) -> str:
        return (
            f"StepResult(done={self.done}, reward={self.reward}, "
            f"score={getattr(self.observation, 'current_score', None)})"
        )


try:
    import httpx
    _HAS_HTTPX = True
except ImportError:
    _HAS_HTTPX = False

try:
    from openenv.core.http_env_client import HTTPEnvClient as _OfficialClient
    _HAS_OPENENV_CLIENT = True
except ImportError:
    _HAS_OPENENV_CLIENT = False


class SyncCodeReviewEnv:
    """Synchronous HTTP client bound to a running environment server."""

    def __init__(self, base_url: str = "http://localhost:7860"):
        if not _HAS_HTTPX:
            raise ImportError("httpx is required: pip install httpx")
        self.base_url = base_url.rstrip("/")
        self._client = httpx.Client(timeout=30.0)

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()

    def close(self):
        self._client.close()

    def reset(
        self,
        task_id: Optional[str] = None,
        seed: Optional[int] = None,
        episode_id: Optional[str] = None,
    ) -> StepResult[ReviewObservation]:
        body = {}
        if task_id:
            body["task_id"] = task_id
        if seed is not None:
            body["seed"] = seed
        if episode_id:
            body["episode_id"] = episode_id

        resp = self._client.post(f"{self.base_url}/reset", json=body)
        resp.raise_for_status()
        obs = ReviewObservation.from_dict(resp.json())
        return StepResult(observation=obs, reward=obs.reward, done=obs.done)

    def step(self, action: ReviewAction) -> StepResult[ReviewObservation]:
        resp = self._client.post(f"{self.base_url}/step", json=action.to_dict())
        resp.raise_for_status()
        obs = ReviewObservation.from_dict(resp.json())
        return StepResult(observation=obs, reward=obs.reward, done=obs.done)

    def state(self) -> ReviewState:
        resp = self._client.get(f"{self.base_url}/state")
        resp.raise_for_status()
        data = resp.json()
        return ReviewState(
            task_id=data.get("task_id", ""),
            difficulty=data.get("difficulty", ""),
            episode_id=data.get("episode_id"),
            step_count=data.get("step_count", 0),
            flagged_issues=[Issue.from_dict(i) for i in data.get("flagged_issues", [])],
            current_score=data.get("current_score", 0.0),
            submitted=data.get("submitted", False),
        )

    def health(self) -> dict:
        resp = self._client.get(f"{self.base_url}/health")
        resp.raise_for_status()
        return resp.json()

    def list_tasks(self) -> dict:
        resp = self._client.get(f"{self.base_url}/tasks")
        resp.raise_for_status()
        return resp.json()


class CodeReviewEnv:
    """Client factory; also usable directly as a context manager."""

    def __init__(self, base_url: str = "http://localhost:7860"):
        self.base_url = base_url

    def sync(self) -> SyncCodeReviewEnv:
        return SyncCodeReviewEnv(self.base_url)

    def __enter__(self):
        self._sync = self.sync()
        return self._sync

    def __exit__(self, *args):
        if hasattr(self, "_sync"):
            self._sync.close()
demo.py ADDED
"""
Demo script for the Code Review Environment.

Runs a complete episode against the live environment using the
keyword-heuristic agent (no API key required).

Usage:
    python demo.py
    python demo.py --url https://tejasghatule-code-review-env.hf.space
    python demo.py --task security-audit
"""
from __future__ import annotations

import argparse
import json

import httpx

DEFAULT_URL = "https://tejasghatule-code-review-env.hf.space"
TASKS = ["bug-detection", "security-audit", "comprehensive-review"]


def run_keyword_agent(base_url: str, task_id: str) -> dict:
    """Run the built-in keyword-heuristic agent via the /baseline endpoint."""
    with httpx.Client(timeout=30) as client:
        # Health check
        health = client.get(f"{base_url}/health")
        health.raise_for_status()
        print(f"  Health : {health.json()}")

        # Reset
        resp = client.post(f"{base_url}/reset", json={"task_id": task_id})
        resp.raise_for_status()
        obs = resp.json()

        print(f"  Task   : {obs['task_id']} ({obs.get('difficulty', '')})")
        print(f"  Files  : {list(obs['code_files'].keys())}")
        print(f"  Steps  : 0 / {obs['max_steps']}")
        print()

        # Use /baseline endpoint (deterministic, no LLM)
        baseline = client.post(f"{base_url}/baseline")
        baseline.raise_for_status()
        return baseline.json()


def run_manual_episode(base_url: str, task_id: str) -> None:
    """Walk through a full episode step-by-step to demonstrate the API."""
    with httpx.Client(timeout=30) as client:
        print(f"=== Episode Demo: {task_id} ===\n")

        # 1. Reset
        resp = client.post(f"{base_url}/reset", json={"task_id": task_id})
        resp.raise_for_status()
        obs = resp.json()

        print(f"Task      : {obs['task_description'][:120]}...")
        print(f"Files     : {list(obs['code_files'].keys())}")
        print(f"Max steps : {obs['max_steps']}")
        print(f"Score     : {obs['current_score']}")
        print()

        # 2. Flag a known issue (task-specific)
        actions = {
            "bug-detection": {
                "action_type": "flag_issue",
                "line_number": 6,
                "filename": "utils.py",
                "issue_type": "bug",
                "severity": "high",
                "description": "Off-by-one: range(len(numbers) + 1) causes IndexError",
                "fix_suggestion": "Change to range(len(numbers))",
            },
            "security-audit": {
                "action_type": "flag_issue",
                "line_number": 8,
                "filename": "app.py",
                "issue_type": "security",
                "severity": "high",
                "description": "Hardcoded SECRET_KEY in source code",
                "fix_suggestion": "Use os.environ.get('SECRET_KEY')",
            },
            "comprehensive-review": {
                "action_type": "flag_issue",
                "line_number": 8,
                "filename": "models.py",
                "issue_type": "security",
                "severity": "critical",
                "description": "Plaintext password storage in database",
                "fix_suggestion": "Use Django's make_password / check_password",
            },
        }

        action = actions.get(task_id, actions["bug-detection"])
        print(f"Step 1 — flag_issue at {action['filename']}:{action['line_number']}")
        resp = client.post(f"{base_url}/step", json=action)
        resp.raise_for_status()
        obs = resp.json()
        print(f"  Feedback : {obs['feedback']}")
        print(f"  Reward   : {obs['reward']}")
        print(f"  Score    : {obs['current_score']}")
        print()

        # 3. Request a hint
        print("Step 2 — request_hint")
        resp = client.post(f"{base_url}/step", json={"action_type": "request_hint"})
        resp.raise_for_status()
        obs = resp.json()
        print(f"  Feedback : {obs['feedback']}")
        print()

        # 4. Submit
        print("Step 3 — submit_review")
        resp = client.post(f"{base_url}/step", json={"action_type": "submit_review"})
        resp.raise_for_status()
        obs = resp.json()
        print(f"  Feedback    : {obs['feedback']}")
        print(f"  Final score : {obs['reward']:.4f}")
        print(f"  Done        : {obs['done']}")
        print()

        # 5. Check state
        state = client.get(f"{base_url}/state")
        state.raise_for_status()
        s = state.json()
        print(f"State — episode_id: {s['episode_id']}, steps: {s['step_count']}, submitted: {s['submitted']}")


def main():
    parser = argparse.ArgumentParser(description="Code Review Environment demo")
    parser.add_argument("--url", default=DEFAULT_URL, help="Environment base URL")
    parser.add_argument("--task", default="bug-detection", choices=TASKS)
    parser.add_argument("--baseline", action="store_true", help="Run full baseline on all tasks")
    args = parser.parse_args()

    base_url = args.url.rstrip("/")
    print("Code Review Environment — Demo")
    print(f"  URL  : {base_url}")
    print(f"  Task : {args.task}\n")

    if args.baseline:
        print("Running keyword-heuristic baseline on all tasks...\n")
        results = run_keyword_agent(base_url, args.task)
        print(json.dumps(results, indent=2))
    else:
        run_manual_episode(base_url, args.task)


if __name__ == "__main__":
    main()
inference.py ADDED
"""
Inference script for the Code Review Environment.

Environment variables:
    API_BASE_URL — LLM API endpoint (e.g. https://openrouter.ai/api/v1)
    MODEL_NAME   — Model identifier (e.g. openai/gpt-4o-mini)
    HF_TOKEN     — API key for the LLM provider
    ENV_URL      — Environment base URL (default: http://localhost:7860)

Usage:
    export API_BASE_URL=https://openrouter.ai/api/v1
    export MODEL_NAME=openai/gpt-4o-mini
    export HF_TOKEN=sk-...
    python inference.py
"""
from __future__ import annotations

import json
import os
import sys
import time

import httpx

API_BASE_URL: str = os.environ.get("API_BASE_URL", "").rstrip("/")
MODEL_NAME: str = os.environ.get("MODEL_NAME", "gpt-4o-mini")
HF_TOKEN: str = os.environ.get("HF_TOKEN", "")
ENV_URL: str = os.environ.get("ENV_URL", "http://localhost:7860").rstrip("/")

TASK_IDS = ["bug-detection", "security-audit", "comprehensive-review"]

SYSTEM_PROMPT = """\
You are an expert software engineer performing a thorough code review.

Your job is to identify bugs, security vulnerabilities, and performance issues in code.

For each issue you find, respond with a single JSON object:
{"action_type": "flag_issue", "line_number": <int>, "filename": "<file>", "issue_type": "bug|security|performance|logic", "severity": "low|medium|high|critical", "description": "<explanation>", "fix_suggestion": "<fix>"}

When done, respond with:
{"action_type": "submit_review"}

Rules:
- Respond with raw JSON only — no markdown fences, no extra text
- One action per response
- Be precise with line numbers (count from line 1)
- Only flag real issues, not style preferences
"""


def chat_completion(messages: list) -> str:
    try:
        from openai import OpenAI
    except ImportError:
        raise ImportError("The openai package is required: pip install openai")

    kwargs = {"api_key": HF_TOKEN or "no-key"}
    if API_BASE_URL:
        kwargs["base_url"] = API_BASE_URL

    client = OpenAI(**kwargs)
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.0,
        max_tokens=400,
    )
    return response.choices[0].message.content.strip()


def parse_action(text: str) -> dict:
    """Extract the first JSON object from a (possibly fenced) LLM response."""
    text = text.strip()

    if "```" in text:
        parts = text.split("```")
        for part in parts:
            part = part.strip()
            if part.startswith("json"):
                part = part[4:].strip()
            if part.startswith("{") or part.startswith("["):
                text = part
                break

    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in ("{", "["):
            try:
                obj, _ = decoder.raw_decode(text, i)
                if isinstance(obj, dict):
                    return obj
                if isinstance(obj, list):
                    for item in obj:
                        if isinstance(item, dict):
                            return item
            except json.JSONDecodeError:
                continue

    # Fall back to ending the episode if no JSON could be parsed
    return {"action_type": "submit_review"}


def run_keyword_fallback(base_url: str, task_id: str) -> dict:
    """Fallback: use the built-in /baseline endpoint (no LLM needed)."""
    with httpx.Client(timeout=30) as client:
        resp = client.post(f"{base_url}/baseline")
        resp.raise_for_status()
        results = resp.json()
        score = results["baseline_scores"].get(task_id, {}).get("score", 0.0)
        return {"task_id": task_id, "score": score, "steps": 0, "method": "keyword_heuristic"}


def run_task(task_id: str, http_client: httpx.Client) -> dict:
    resp = http_client.post(f"{ENV_URL}/reset", json={"task_id": task_id}, timeout=30)
    resp.raise_for_status()
    obs = resp.json()

    code_display = "\n\n".join(
        f"=== {fname} ===\n{code}"
        for fname, code in obs.get("code_files", {}).items()
    )

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                f"Task: {obs.get('task_description', '')}\n\n"
                f"{code_display}\n\n"
                f"Review this code carefully. Flag every issue you find. "
                f"You have {obs.get('max_steps', 20)} steps total."
            ),
        },
    ]

    done = False
    step_count = 0
    max_steps = obs.get("max_steps", 20)
    final_score = 0.0

    while not done and step_count < max_steps:
        action_text = chat_completion(messages)
        action = parse_action(action_text)

        try:
            step_resp = http_client.post(f"{ENV_URL}/step", json=action, timeout=30)
            step_resp.raise_for_status()
            obs = step_resp.json()
        except Exception as e:
            print(f"  Step error: {e}")
            break

        done = obs.get("done", False)
        step_count += 1
        final_score = obs.get("current_score", 0.0)
        reward = obs.get("reward")

        messages.append({"role": "assistant", "content": action_text})
        messages.append({
            "role": "user",
            "content": (
                f"Feedback: {obs.get('feedback', '')} "
                f"(step {step_count}/{max_steps}, score: {obs.get('current_score', 0.0):.3f})"
            ),
        })

        atype = action.get("action_type", "")
        print(f"  Step {step_count:2d}: {atype:20s} | reward={str(reward):8s} | score={obs.get('current_score', 0.0):.3f}")

        if atype == "submit_review":
            final_score = obs.get("reward", obs.get("current_score", 0.0)) or 0.0
            break

        time.sleep(0.3)

    return {
        "task_id": task_id,
        "score": float(final_score),
        "steps": step_count,
        "method": "llm",
    }


def main():
    use_llm = bool(HF_TOKEN and API_BASE_URL)

    print("Code Review Environment — Inference")
    print(f"  Model   : {MODEL_NAME}")
    print(f"  API URL : {API_BASE_URL or '(not set — using keyword heuristic)'}")
    print(f"  Env URL : {ENV_URL}")
    print(f"  Tasks   : {TASK_IDS}\n")

    try:
        with httpx.Client(timeout=10) as probe:
            health = probe.get(f"{ENV_URL}/health")
            health.raise_for_status()
            print(f"  Health: {health.json()}\n")
    except Exception as e:
        print(f"ERROR: Cannot reach environment at {ENV_URL}: {e}")
        sys.exit(1)

    results = {}

    if use_llm:
        with httpx.Client(timeout=60) as client:
            for task_id in TASK_IDS:
                print(f"Running task: {task_id}")
                result = run_task(task_id, client)
                results[task_id] = result
                print(f"  → score: {result['score']:.4f} ({result['steps']} steps)\n")
    else:
        print("HF_TOKEN / API_BASE_URL not set — using built-in keyword heuristic baseline.\n")
        for task_id in TASK_IDS:
            print(f"Running task: {task_id}")
            result = run_keyword_fallback(ENV_URL, task_id)
            results[task_id] = result
            print(f"  → score: {result['score']:.4f}\n")

    print("=" * 50)
    print("INFERENCE RESULTS")
    print("=" * 50)
    for task_id, r in results.items():
        print(f"  {task_id:30s} score={r['score']:.4f}")

    overall = sum(r["score"] for r in results.values()) / len(results)
    print(f"\n  Overall average: {overall:.4f}")
    print("=" * 50)

    return results


if __name__ == "__main__":
    main()
models.py ADDED
from __future__ import annotations

import os
import sys

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Issue:
    line_number: int
    filename: str
    issue_type: str  # bug | security | performance | logic
    severity: str    # low | medium | high | critical
    description: str = ""
    fix_suggestion: Optional[str] = None

    def to_dict(self) -> dict:
        return {
            "line_number": self.line_number,
            "filename": self.filename,
            "issue_type": self.issue_type,
            "severity": self.severity,
            "description": self.description,
            "fix_suggestion": self.fix_suggestion,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "Issue":
        return cls(
            line_number=int(d.get("line_number", 0)),
            filename=str(d.get("filename", "")),
            issue_type=str(d.get("issue_type", "bug")),
            severity=str(d.get("severity", "medium")),
            description=str(d.get("description", "")),
            fix_suggestion=d.get("fix_suggestion"),
        )


try:
    from openenv.core.env_server import (
        Action as _BaseAction,
        Observation as _BaseObservation,
        State as _BaseState,
    )
except ImportError:
    # Minimal stand-ins so the models work without openenv installed
    @dataclass
    class _BaseAction:
        metadata: Dict[str, Any] = field(default_factory=dict)

    @dataclass
    class _BaseObservation:
        done: bool = False
        reward: Optional[float] = None
        metadata: Dict[str, Any] = field(default_factory=dict)

    @dataclass
    class _BaseState:
        episode_id: Optional[str] = None
        step_count: int = 0


@dataclass
class ReviewAction(_BaseAction):
    """
    Agent action during a code review episode.

    action_type:
        flag_issue    — mark a line as containing an issue
        clear_flag    — remove a previously flagged issue
        request_hint  — get a hint (−0.01 reward)
        submit_review — end the episode and receive the final grade
    """
    action_type: str = "flag_issue"
    line_number: Optional[int] = None
    filename: Optional[str] = None
    issue_type: Optional[str] = None
    severity: Optional[str] = None
    description: str = ""
    fix_suggestion: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {
            "action_type": self.action_type,
            "line_number": self.line_number,
            "filename": self.filename,
            "issue_type": self.issue_type,
            "severity": self.severity,
            "description": self.description,
            "fix_suggestion": self.fix_suggestion,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "ReviewAction":
        return cls(
            action_type=str(d.get("action_type", "flag_issue")),
            line_number=d.get("line_number"),
            filename=d.get("filename"),
            issue_type=d.get("issue_type"),
            severity=d.get("severity"),
            description=str(d.get("description", "")),
            fix_suggestion=d.get("fix_suggestion"),
        )


@dataclass
class ReviewObservation(_BaseObservation):
    """
    Observation returned after each reset/step call.
    code_files is only populated on reset; subsequent steps omit it.
    """
    task_id: str = ""
    task_description: str = ""
    code_files: Dict[str, str] = field(default_factory=dict)
+ code_files: Dict[str, str] = field(default_factory=dict)
118
+ language: str = "python"
119
+ flagged_issues: List[Issue] = field(default_factory=list)
120
+ step_count: int = 0
121
+ max_steps: int = 20
122
+ hints_remaining: int = 3
123
+ feedback: str = ""
124
+ current_score: float = 0.0
125
+ done: bool = False
126
+ reward: Optional[float] = None
127
+ metadata: Dict[str, Any] = field(default_factory=dict)
128
+
129
+ def to_dict(self) -> dict:
130
+ return {
131
+ "task_id": self.task_id,
132
+ "task_description": self.task_description,
133
+ "code_files": self.code_files,
134
+ "language": self.language,
135
+ "flagged_issues": [i.to_dict() for i in self.flagged_issues],
136
+ "step_count": self.step_count,
137
+ "max_steps": self.max_steps,
138
+ "hints_remaining": self.hints_remaining,
139
+ "feedback": self.feedback,
140
+ "current_score": self.current_score,
141
+ "done": self.done,
142
+ "reward": self.reward,
143
+ "metadata": self.metadata,
144
+ }
145
+
146
+ @classmethod
147
+ def from_dict(cls, d: dict) -> "ReviewObservation":
148
+ return cls(
149
+ task_id=d.get("task_id", ""),
150
+ task_description=d.get("task_description", ""),
151
+ code_files=d.get("code_files", {}),
152
+ language=d.get("language", "python"),
153
+ flagged_issues=[Issue.from_dict(i) for i in d.get("flagged_issues", [])],
154
+ step_count=d.get("step_count", 0),
155
+ max_steps=d.get("max_steps", 20),
156
+ hints_remaining=d.get("hints_remaining", 3),
157
+ feedback=d.get("feedback", ""),
158
+ current_score=d.get("current_score", 0.0),
159
+ done=d.get("done", False),
160
+ reward=d.get("reward"),
161
+ )
162
+
163
+
164
+ @dataclass
165
+ class ReviewState(_BaseState):
166
+ task_id: str = ""
167
+ difficulty: str = ""
168
+ episode_id: Optional[str] = None
169
+ step_count: int = 0
170
+ flagged_issues: List[Issue] = field(default_factory=list)
171
+ current_score: float = 0.0
172
+ submitted: bool = False
173
+ metadata: Dict[str, Any] = field(default_factory=dict)
174
+
175
+ def to_dict(self) -> dict:
176
+ return {
177
+ "task_id": self.task_id,
178
+ "difficulty": self.difficulty,
179
+ "episode_id": self.episode_id,
180
+ "step_count": self.step_count,
181
+ "flagged_issues": [i.to_dict() for i in self.flagged_issues],
182
+ "current_score": self.current_score,
183
+ "submitted": self.submitted,
184
+ }
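The dataclasses above serialize to and from plain dicts, which is what travels over HTTP and the WebSocket. A minimal standalone sketch of the same round-trip pattern (a trimmed mirror of `Issue`, not the real import, assuming only the stdlib):

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Trimmed-down mirror of models.Issue, to illustrate the dict round-trip.
@dataclass
class Issue:
    line_number: int
    filename: str
    issue_type: str = "bug"
    severity: str = "medium"
    description: str = ""
    fix_suggestion: Optional[str] = None

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "Issue":
        # Missing keys fall back to safe defaults, as in models.py.
        return cls(
            line_number=int(d.get("line_number", 0)),
            filename=str(d.get("filename", "")),
            issue_type=str(d.get("issue_type", "bug")),
            severity=str(d.get("severity", "medium")),
            description=str(d.get("description", "")),
            fix_suggestion=d.get("fix_suggestion"),
        )

issue = Issue(6, "utils.py", "bug", "high", "Off-by-one in range()")
restored = Issue.from_dict(issue.to_dict())
assert restored == issue  # dataclass equality makes the round-trip checkable
```

The `from_dict` defaults mean a partially filled dict from a client never raises; invalid values are instead normalized server-side.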
openenv.yaml ADDED
@@ -0,0 +1,11 @@
+ spec_version: 1
+ name: code_review_env
+ version: "1.0.0"
+ description: >
+   A code review and security audit environment for training AI agents.
+   The agent identifies bugs, security vulnerabilities, and performance issues
+   across three tasks of increasing difficulty (easy → medium → hard).
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 7860
pyproject.toml ADDED
@@ -0,0 +1,25 @@
+ [project]
+ name = "code-review-env"
+ version = "1.0.0"
+ description = "OpenEnv environment for code review and security audit training"
+ requires-python = ">=3.10"
+ dependencies = [
+     "fastapi>=0.100.0",
+     "uvicorn[standard]>=0.23.0",
+     "pydantic>=2.0.0",
+     "httpx>=0.24.0",
+     "openai>=1.0.0",
+     "python-dotenv>=1.0.0",
+     "websockets>=11.0",
+     "openenv-core>=0.2.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = ["pytest>=7.0", "pytest-asyncio"]
+
+ [project.scripts]
+ serve = "server.app:main"
+
+ [build-system]
+ requires = ["setuptools>=68"]
+ build-backend = "setuptools.build_meta"
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ fastapi>=0.100.0
+ uvicorn[standard]>=0.23.0
+ pydantic>=2.0.0
+ httpx>=0.24.0
+ openai>=1.0.0
+ python-dotenv>=1.0.0
+ websockets>=11.0
server/__init__.py ADDED
@@ -0,0 +1 @@
+ # server package
server/app.py ADDED
@@ -0,0 +1,306 @@
+ """
+ FastAPI application for the Code Review Environment.
+
+ Endpoints:
+     POST /reset    — start new episode
+     POST /step     — take an action
+     GET  /state    — get episode state
+     GET  /health   — health check
+     GET  /tasks    — list all tasks + action schema
+     POST /grader   — grade a set of findings (stateless)
+     POST /baseline — run keyword-heuristic baseline on all tasks
+     WS   /ws       — persistent WebSocket session
+     GET  /docs     — Swagger UI (auto-generated)
+ """
+ from __future__ import annotations
+
+ import sys
+ import os
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ import json
+ import dataclasses
+ from typing import Optional, List, Dict, Any
+
+ from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+
+ from models import ReviewAction, Issue
+ from server.environment import CodeReviewEnvironment
+ from server.graders import grade_episode, run_keyword_baseline
+ from tasks.data import ALL_TASKS, TASK_IDS
+
+
+ def _serialize(obj) -> dict:
+     if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
+         # asdict handles nested dataclasses and lists recursively
+         return dataclasses.asdict(obj)
+     if isinstance(obj, dict):
+         return obj
+     raise TypeError(f"Cannot serialize {type(obj)}")
+
+
+ _env_instance = CodeReviewEnvironment()
+
+
+ def _make_app() -> FastAPI:
+     try:
+         from openenv.core.env_server import create_fastapi_app
+         return create_fastapi_app(CodeReviewEnvironment)
+     except Exception:
+         pass
+
+     _app = FastAPI(
+         title="Code Review Environment",
+         description=(
+             "An OpenEnv environment for training AI agents to perform "
+             "code review and security audits."
+         ),
+         version="1.0.0",
+     )
+
+     _app.add_middleware(
+         CORSMiddleware,
+         allow_origins=["*"],
+         allow_methods=["*"],
+         allow_headers=["*"],
+     )
+
+     @_app.get("/health")
+     async def health():
+         return {"status": "healthy"}
+
+     @_app.post("/reset")
+     async def reset(body: Optional[dict] = None):
+         body = body or {}
+         obs = _env_instance.reset(
+             task_id=body.get("task_id"),
+             seed=body.get("seed"),
+             episode_id=body.get("episode_id"),
+         )
+         return _serialize(obs)
+
+     @_app.post("/step")
+     async def step(body: dict):
+         action = ReviewAction.from_dict(body)
+         obs = _env_instance.step(action)
+         return _serialize(obs)
+
+     @_app.get("/state")
+     async def state():
+         return _serialize(_env_instance.state)
+
+     @_app.websocket("/ws")
+     async def websocket_endpoint(websocket: WebSocket):
+         await websocket.accept()
+         ws_env = CodeReviewEnvironment()
+         try:
+             while True:
+                 raw = await websocket.receive_text()
+                 msg = json.loads(raw)
+                 msg_type = msg.get("type", "")
+
+                 if msg_type == "reset":
+                     data = msg.get("data", {})
+                     obs = ws_env.reset(
+                         task_id=data.get("task_id"),
+                         seed=data.get("seed"),
+                         episode_id=data.get("episode_id"),
+                     )
+                     await websocket.send_text(json.dumps({
+                         "type": "observation",
+                         "data": _serialize(obs),
+                     }))
+
+                 elif msg_type == "step":
+                     action = ReviewAction.from_dict(msg.get("data", {}))
+                     obs = ws_env.step(action)
+                     await websocket.send_text(json.dumps({
+                         "type": "observation",
+                         "data": _serialize(obs),
+                     }))
+
+                 elif msg_type == "state":
+                     await websocket.send_text(json.dumps({
+                         "type": "state",
+                         "data": _serialize(ws_env.state),
+                     }))
+
+                 elif msg_type == "close":
+                     break
+
+                 else:
+                     await websocket.send_text(json.dumps({
+                         "type": "error",
+                         "data": f"Unknown message type: {msg_type}",
+                     }))
+
+         except WebSocketDisconnect:
+             pass
+         except Exception as e:
+             try:
+                 await websocket.send_text(json.dumps({"type": "error", "data": str(e)}))
+             except Exception:
+                 pass
+
+     return _app
+
+
+ app = _make_app()
+
+
+ @app.get("/tasks")
+ async def list_tasks():
+     tasks_list = []
+     for task in ALL_TASKS.values():
+         tasks_list.append({
+             "task_id": task["task_id"],
+             "difficulty": task["difficulty"],
+             "description": task["description"],
+             "language": task.get("language", "python"),
+             "max_steps": task["max_steps"],
+             "num_issues": len(task["ground_truth_issues"]),
+             "files": list(task["code_files"].keys()),
+         })
+
+     action_schema = {
+         "type": "object",
+         "description": "ReviewAction — one action per /step call",
+         "required": ["action_type"],
+         "properties": {
+             "action_type": {
+                 "type": "string",
+                 "enum": ["flag_issue", "clear_flag", "request_hint", "submit_review"],
+                 "description": (
+                     "flag_issue: mark a line as problematic. "
+                     "clear_flag: remove a previous flag. "
+                     "request_hint: get a hint (-0.01 reward). "
+                     "submit_review: end episode and receive final grade."
+                 ),
+             },
+             "line_number": {
+                 "type": "integer",
+                 "description": "Line number of the issue (required for flag_issue / clear_flag)",
+             },
+             "filename": {
+                 "type": "string",
+                 "description": "File where the issue is (required for flag_issue / clear_flag)",
+             },
+             "issue_type": {
+                 "type": "string",
+                 "enum": ["bug", "security", "performance", "logic"],
+                 "description": "Category of issue (required for flag_issue)",
+             },
+             "severity": {
+                 "type": "string",
+                 "enum": ["low", "medium", "high", "critical"],
+                 "description": "Severity level (required for flag_issue)",
+             },
+             "description": {
+                 "type": "string",
+                 "description": "Human-readable description of the issue",
+             },
+             "fix_suggestion": {
+                 "type": "string",
+                 "description": "Optional suggested fix",
+             },
+         },
+         "examples": [
+             {
+                 "action_type": "flag_issue",
+                 "line_number": 6,
+                 "filename": "utils.py",
+                 "issue_type": "bug",
+                 "severity": "high",
+                 "description": "Off-by-one error in range()",
+                 "fix_suggestion": "Change range(len(numbers) + 1) to range(len(numbers))",
+             },
+             {"action_type": "submit_review"},
+         ],
+     }
+
+     return {
+         "tasks": tasks_list,
+         "action_schema": action_schema,
+         "total_tasks": len(tasks_list),
+     }
+
+
+ class GraderRequest(BaseModel):
+     task_id: str
+     flagged_issues: List[Dict[str, Any]]
+
+
+ @app.post("/grader")
+ async def run_grader(request: GraderRequest):
+     task = ALL_TASKS.get(request.task_id)
+     if not task:
+         raise HTTPException(
+             status_code=404,
+             detail=f"Unknown task_id '{request.task_id}'. Valid: {TASK_IDS}",
+         )
+
+     flagged = [Issue.from_dict(i) for i in request.flagged_issues]
+     ground_truth = [Issue.from_dict(gt) for gt in task["ground_truth_issues"]]
+     score = grade_episode(flagged, ground_truth)
+
+     tp = sum(
+         1 for f in flagged
+         if any(
+             abs(f.line_number - gt.line_number) <= 2 and f.filename == gt.filename
+             for gt in ground_truth
+         )
+     )
+
+     return {
+         "task_id": request.task_id,
+         "difficulty": task["difficulty"],
+         "score": score,
+         "max_score": 1.0,
+         "details": {
+             "total_flagged": len(flagged),
+             "true_positives": tp,
+             "false_positives": len(flagged) - tp,
+             "total_ground_truth": len(ground_truth),
+         },
+     }
+
+
+ @app.post("/baseline")
+ async def run_baseline():
+     results = {}
+     for task_id, task in ALL_TASKS.items():
+         findings = run_keyword_baseline(task)
+         ground_truth = [Issue.from_dict(gt) for gt in task["ground_truth_issues"]]
+         score = grade_episode(findings, ground_truth)
+         results[task_id] = {
+             "difficulty": task["difficulty"],
+             "score": score,
+             "findings_count": len(findings),
+             "ground_truth_count": len(ground_truth),
+         }
+
+     overall = sum(r["score"] for r in results.values()) / len(results)
+     return {
+         "baseline_scores": results,
+         "overall_average": round(overall, 4),
+         "method": "keyword_heuristic",
+         "note": (
+             "Run 'python baseline.py' with OPENAI_API_KEY for the LLM-based baseline. "
+             "This endpoint uses a deterministic regex heuristic."
+         ),
+     }
+
+
+ def main():
+     import uvicorn
+     port = int(os.environ.get("PORT", 7860))
+     uvicorn.run("server.app:app", host="0.0.0.0", port=port)
+
+
+ if __name__ == "__main__":
+     main()
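Clients only ever exchange plain JSON with these endpoints, so no SDK is required. A sketch of the request bodies for `POST /reset` and `POST /step`, shaped to match the `action_schema` served by `GET /tasks` (the `task_id` value here is illustrative, not one defined by this repo):

```python
import json

# Body for POST /reset — all keys optional; an unknown task_id
# makes the server pick a random task.
reset_body = {"task_id": "example_task", "seed": 42}

# Body for POST /step — one ReviewAction per call.
step_body = {
    "action_type": "flag_issue",
    "line_number": 6,
    "filename": "utils.py",
    "issue_type": "bug",
    "severity": "high",
    "description": "Off-by-one error in range()",
}

# The four legal action types, per the schema's enum.
VALID_ACTIONS = {"flag_issue", "clear_flag", "request_hint", "submit_review"}
assert step_body["action_type"] in VALID_ACTIONS

# Both bodies survive JSON round-trips, so the same dicts work over
# HTTP or as the "data" field of a /ws message.
assert json.loads(json.dumps(step_body)) == step_body
```

The WebSocket protocol wraps the same payloads as `{"type": "reset"|"step"|"state", "data": {...}}`.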
server/environment.py ADDED
@@ -0,0 +1,310 @@
+ """
+ Core environment logic for the Code Review Environment.
+ """
+ from __future__ import annotations
+
+ import random
+ import uuid
+ import sys
+ import os
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from typing import Optional, List
+
+ from models import Issue, ReviewAction, ReviewObservation, ReviewState
+ from tasks.data import ALL_TASKS, TASK_IDS
+ from server.graders import grade_episode, compute_live_score, match_issue
+
+ try:
+     from openenv.core.env_server import Environment as _BaseEnv
+     _HAS_OPENENV = True
+ except ImportError:
+     _HAS_OPENENV = False
+
+     class _BaseEnv:  # type: ignore[no-redef]
+         pass
+
+
+ class CodeReviewEnvironment(_BaseEnv):
+     """
+     A code review and security audit environment.
+
+     The agent receives code files and must identify bugs, security
+     vulnerabilities, and performance issues by flagging them with
+     exact line numbers, types, and severity ratings.
+
+     Episode flow:
+         1. reset(task_id)      — agent sees code files and task description
+         2. step(flag_issue)    — flag a problem; get per-step reward
+         3. step(clear_flag)    — remove an incorrectly flagged issue
+         4. step(request_hint)  — get a hint (costs -0.01 reward)
+         5. step(submit_review) — episode ends, final grade is returned
+            (or auto-ends when max_steps is reached)
+     """
+
+     SUPPORTS_CONCURRENT_SESSIONS = False
+
+     def __init__(self) -> None:
+         self._state = ReviewState()
+         self._task: Optional[dict] = None
+         self._ground_truth: List[Issue] = []
+         self._hint_index: int = 0
+
+     def reset(
+         self,
+         task_id: Optional[str] = None,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         **kwargs,
+     ) -> ReviewObservation:
+         """Start a new review episode."""
+         if seed is not None:
+             random.seed(seed)
+
+         if task_id is None or task_id not in ALL_TASKS:
+             task_id = random.choice(TASK_IDS)
+
+         self._task = ALL_TASKS[task_id]
+         self._ground_truth = [
+             Issue.from_dict(gt)
+             for gt in self._task["ground_truth_issues"]
+         ]
+         self._hint_index = 0
+
+         self._state = ReviewState(
+             task_id=task_id,
+             difficulty=self._task["difficulty"],
+             episode_id=episode_id or str(uuid.uuid4()),
+             step_count=0,
+             flagged_issues=[],
+             current_score=0.0,
+             submitted=False,
+         )
+
+         return ReviewObservation(
+             task_id=task_id,
+             task_description=self._task["description"],
+             code_files=self._task["code_files"],
+             language=self._task.get("language", "python"),
+             flagged_issues=[],
+             step_count=0,
+             max_steps=self._task["max_steps"],
+             hints_remaining=len(self._task.get("hints", [])),
+             feedback=(
+                 f"New episode started. Task: {self._task['difficulty'].upper()}. "
+                 f"Review the code carefully and flag all issues you find. "
+                 f"Use 'submit_review' when done."
+             ),
+             current_score=0.0,
+             done=False,
+             reward=None,
+         )
+
+     def step(
+         self,
+         action: ReviewAction,
+         timeout_s: Optional[float] = None,
+         **kwargs,
+     ) -> ReviewObservation:
+         """Process one agent action and return the new observation."""
+         if self._task is None:
+             return ReviewObservation(
+                 done=True,
+                 reward=0.0,
+                 feedback="Episode not initialized. Call reset() first.",
+             )
+
+         if self._state.submitted:
+             return ReviewObservation(
+                 task_id=self._state.task_id,
+                 task_description="",
+                 code_files={},
+                 flagged_issues=list(self._state.flagged_issues),
+                 step_count=self._state.step_count,
+                 max_steps=self._task["max_steps"],
+                 hints_remaining=0,
+                 feedback="Episode already submitted. Call reset() to start a new episode.",
+                 current_score=self._state.current_score,
+                 done=True,
+                 reward=0.0,
+             )
+
+         if isinstance(action, dict):
+             action = ReviewAction.from_dict(action)
+
+         self._state.step_count += 1
+         reward, feedback = self._process_action(action)
+
+         max_steps = self._task["max_steps"]
+         auto_end = self._state.step_count >= max_steps and not self._state.submitted
+         done = self._state.submitted or auto_end
+
+         if auto_end:
+             # Grade whatever was flagged so far
+             final = grade_episode(self._state.flagged_issues, self._ground_truth)
+             self._state.current_score = final
+             reward = final * 0.5  # partial credit for auto-end
+             feedback += (
+                 f" Max steps reached. Auto-graded: {final:.3f}. "
+                 f"Submit earlier for best score."
+             )
+             self._state.submitted = True
+
+         if not done:
+             # Live F1-only score while the episode is still running; keep the
+             # final grade intact once the review is submitted or auto-ended.
+             self._state.current_score = compute_live_score(
+                 self._state.flagged_issues, self._ground_truth
+             )
+
+         return ReviewObservation(
+             task_id=self._state.task_id,
+             task_description="",
+             code_files={},
+             language=self._task.get("language", "python"),
+             flagged_issues=list(self._state.flagged_issues),
+             step_count=self._state.step_count,
+             max_steps=max_steps,
+             hints_remaining=max(0, len(self._task.get("hints", [])) - self._hint_index),
+             feedback=feedback,
+             current_score=self._state.current_score,
+             done=done,
+             reward=reward,
+         )
+
+     @property
+     def state(self) -> ReviewState:
+         return self._state
+
+     def _process_action(self, action: ReviewAction):
+         atype = (action.action_type or "").strip().lower()
+
+         if atype == "flag_issue":
+             return self._handle_flag(action)
+         elif atype == "clear_flag":
+             return self._handle_clear(action)
+         elif atype == "request_hint":
+             return self._handle_hint()
+         elif atype == "submit_review":
+             return self._handle_submit()
+         else:
+             return 0.0, (
+                 f"Unknown action_type '{action.action_type}'. "
+                 "Use: flag_issue | clear_flag | request_hint | submit_review"
+             )
+
+     def _handle_flag(self, action: ReviewAction):
+         if action.line_number is None:
+             return -0.02, "flag_issue requires 'line_number'."
+         if not action.filename:
+             return -0.02, "flag_issue requires 'filename'."
+         if action.issue_type not in ("bug", "security", "performance", "logic", None):
+             action.issue_type = "bug"
+         if action.severity not in ("low", "medium", "high", "critical", None):
+             action.severity = "medium"
+
+         for existing in self._state.flagged_issues:
+             if (existing.line_number == action.line_number
+                     and existing.filename == action.filename):
+                 return 0.0, (
+                     f"Line {action.line_number} in {action.filename} already flagged. "
+                     "Use clear_flag first if you want to change the finding."
+                 )
+
+         new_issue = Issue(
+             line_number=action.line_number,
+             filename=action.filename or "",
+             issue_type=action.issue_type or "bug",
+             severity=action.severity or "medium",
+             description=action.description or "",
+             fix_suggestion=action.fix_suggestion,
+         )
+
+         is_tp = any(match_issue(new_issue, gt) for gt in self._ground_truth)
+
+         self._state.flagged_issues.append(new_issue)
+
+         if is_tp:
+             reward = 0.10
+             feedback = (
+                 f"Good catch! Issue flagged at {action.filename}:{action.line_number}. "
+                 f"[+0.10 reward — correct finding]"
+             )
+         else:
+             reward = -0.05
+             feedback = (
+                 f"Issue flagged at {action.filename}:{action.line_number}. "
+                 f"[-0.05 reward — no matching ground-truth issue nearby]"
+             )
+
+         return reward, feedback
+
+     def _handle_clear(self, action: ReviewAction):
+         if action.line_number is None or not action.filename:
+             return -0.02, "clear_flag requires 'line_number' and 'filename'."
+
+         removed = next(
+             (f for f in self._state.flagged_issues
+              if f.line_number == action.line_number
+              and f.filename == action.filename),
+             None,
+         )
+         if removed is None:
+             return 0.0, (
+                 f"No flagged issue found at {action.filename}:{action.line_number}."
+             )
+
+         self._state.flagged_issues = [
+             f for f in self._state.flagged_issues if f is not removed
+         ]
+
+         # Judge the retraction against the flag that was actually removed, so
+         # its issue_type is taken into account when matching ground truth.
+         was_tp = any(match_issue(removed, gt) for gt in self._ground_truth)
+
+         if was_tp:
+             reward = -0.03
+             feedback = (
+                 f"Removed a correct finding at {action.filename}:{action.line_number}. "
+                 f"[-0.03 reward]"
+             )
+         else:
+             reward = 0.03
+             feedback = (
+                 f"Removed a false positive at {action.filename}:{action.line_number}. "
+                 f"[+0.03 reward — good correction]"
+             )
+
+         return reward, feedback
+
+     def _handle_hint(self):
+         hints = self._task.get("hints", [])
+         if self._hint_index >= len(hints):
+             return -0.01, "No more hints available for this task."
+
+         hint = hints[self._hint_index]
+         self._hint_index += 1
+         remaining = len(hints) - self._hint_index
+         return -0.01, f"Hint {self._hint_index}/{len(hints)}: {hint} ({remaining} hints left)"
+
+     def _handle_submit(self):
+         self._state.submitted = True
+         final_score = grade_episode(self._state.flagged_issues, self._ground_truth)
+         self._state.current_score = final_score
+
+         tp_count = sum(
+             1 for f in self._state.flagged_issues
+             if any(match_issue(f, gt) for gt in self._ground_truth)
+         )
+         total_gt = len(self._ground_truth)
+         total_flagged = len(self._state.flagged_issues)
+
+         feedback = (
+             f"Review submitted! Final score: {final_score:.3f}. "
+             f"Found {tp_count}/{total_gt} real issues. "
+             f"Total flags: {total_flagged} "
+             f"({'perfect' if total_flagged == tp_count else f'{total_flagged - tp_count} false positives'})."
+         )
+
+         return final_score, feedback
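The per-step rewards in the handlers above are fixed constants, so an episode's shaping reward is just a sum over the actions taken. A standalone sketch of that accounting (the constants mirror `_handle_flag`, `_handle_clear`, and `_handle_hint`; the episode itself is hypothetical):

```python
# Per-step reward constants, mirrored from CodeReviewEnvironment's handlers.
REWARDS = {
    "correct_flag": 0.10,          # flag matches a ground-truth issue
    "false_flag": -0.05,           # flag with no nearby ground-truth match
    "clear_false_positive": 0.03,  # retracting a wrong flag
    "clear_correct_finding": -0.03,  # retracting a right flag
    "hint": -0.01,                 # each request_hint call
    "malformed_action": -0.02,     # missing line_number / filename
}

# A hypothetical episode: two good catches, one false positive that the
# agent later retracts, and one hint.
steps = ["correct_flag", "correct_flag", "false_flag",
         "clear_false_positive", "hint"]
total = sum(REWARDS[s] for s in steps)
assert round(total, 2) == 0.17
```

The final `submit_review` reward (the graded score) and the `0.5 ×` auto-end penalty come on top of this running total.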
server/graders.py ADDED
@@ -0,0 +1,170 @@
1
+ """
2
+ Grading logic for the Code Review Environment.
3
+ """
4
+ from __future__ import annotations
5
+
6
+ import re
7
+ from typing import List, Tuple, Set
8
+
9
+ import sys
10
+ import os
11
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
12
+
13
+ from models import Issue
14
+
15
+ _SEV_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}
16
+
17
+ _TYPE_COMPAT = {
18
+ "bug": {"bug", "logic"},
19
+ "logic": {"bug", "logic"},
20
+ "security": {"security"},
21
+ "performance": {"performance"},
22
+ }
23
+
24
+
25
+ def match_issue(flagged: Issue, gt: Issue, line_tolerance: int = 2) -> bool:
26
+ if flagged.filename != gt.filename:
27
+ return False
28
+ if abs(flagged.line_number - gt.line_number) > line_tolerance:
29
+ return False
30
+ compat = _TYPE_COMPAT.get(gt.issue_type, {gt.issue_type})
31
+ if flagged.issue_type not in compat:
32
+ return False
33
+ return True
34
+
35
+
36
+ def grade_episode(
37
+ flagged: List[Issue],
38
+ ground_truth: List[Issue],
39
+ line_tolerance: int = 2,
40
+ ) -> float:
41
+ """Compute a 0.0–1.0 score: 0.70 * F1 + 0.30 * severity_accuracy."""
42
+ if not ground_truth:
43
+ return 1.0 if not flagged else 0.0
44
+
45
+ tp = 0
46
+ fp = 0
47
+ matched_gt_indices: Set[int] = set()
48
+ severity_scores: List[float] = []
49
+
50
+ for flag in flagged:
51
+ matched = False
52
+ for i, gt in enumerate(ground_truth):
53
+ if i in matched_gt_indices:
54
+ continue
55
+ if match_issue(flag, gt, line_tolerance):
56
+ tp += 1
57
+ matched_gt_indices.add(i)
58
+ matched = True
59
+ flag_rank = _SEV_RANK.get(flag.severity, 1)
60
+ gt_rank = _SEV_RANK.get(gt.severity, 1)
61
+ distance = abs(flag_rank - gt_rank)
62
+ severity_scores.append(max(0.0, 1.0 - distance * 0.34))
63
+ break
64
+ if not matched:
65
+ fp += 1
66
+
67
+ fn = len(ground_truth) - len(matched_gt_indices)
68
+
69
+ precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
70
+ recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
71
+ f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0
72
+
73
+ if severity_scores:
74
+ severity_accuracy = sum(severity_scores) / len(ground_truth)
75
+ else:
76
+ severity_accuracy = 0.0
77
+
78
+ final = 0.70 * f1 + 0.30 * severity_accuracy
79
+ return round(min(1.0, max(0.0, final)), 4)
80
+
81
+
82
+ def compute_live_score(flagged: List[Issue], ground_truth: List[Issue]) -> float:
83
+ """F1-only score for per-step feedback (no severity bonus)."""
84
+ if not ground_truth:
85
+ return 1.0 if not flagged else 0.0
86
+
87
+ tp = 0
88
+ fp = 0
89
+ matched: Set[int] = set()
90
+
91
+ for flag in flagged:
92
+ hit = False
93
+ for i, gt in enumerate(ground_truth):
94
+ if i not in matched and match_issue(flag, gt):
95
+ tp += 1
96
+ matched.add(i)
97
+ hit = True
98
+ break
99
+ if not hit:
100
+ fp += 1
101
+
102
+ fn = len(ground_truth) - len(matched)
103
+ precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
104
+ recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
105
+ f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0
106
+ return round(f1, 4)
107
+
108
+
109
+ _PATTERNS = [
110
+ (r"range\(len\(\w+\)\s*\+\s*1\)", None, "bug", "high",
111
+ "Off-by-one error: range(len(x) + 1) iterates one past the end"),
112
+ (r"left,\s*right\s*=\s*0,\s*len\(", None, "bug", "medium",
113
+ "Binary search upper bound should be len(arr) - 1"),
114
+ (r"counts\[word\]\s*=\s*0\b", None, "bug", "low",
115
+ "Counter initialized to 0 instead of 1"),
116
+
117
+ (r'SECRET_KEY\s*=\s*["\']', None, "security", "high",
118
+ "Hardcoded SECRET_KEY in source code"),
119
+ (r'PASSWORD\s*=\s*["\']', None, "security", "high",
120
+ "Hardcoded password in source code"),
121
+ (r"f['\"].*SELECT.*\{", None, "security", "critical",
122
+ "SQL injection via f-string query construction"),
123
+ (r"f['\"].*DELETE.*\{", None, "security", "critical",
124
+ "SQL injection via f-string DELETE query"),
125
+ (r"render_template_string\(f['\"]", None, "security", "high",
126
+ "XSS: unsanitized user input in render_template_string"),
127
+ (r"shell\s*=\s*True", None, "security", "critical",
128
+ "Command injection risk: shell=True with user input"),
129
+ (r"hashlib\.md5\(", None, "security", "medium",
130
+ "MD5 is cryptographically broken, use SHA-256 or HMAC-SHA256"),
131
+ (r"expected\s*==\s*\w+_hash", None, "security", "medium",
132
+ "Timing attack: use hmac.compare_digest() for constant-time comparison"),
133
+ (r"password\s*=\s*models\.CharField", None, "security", "critical",
134
+ "Plaintext password storage in database"),
135
+ (r"os\.path\.join\(['\"]\/", None, "security", "high",
136
+ "Path traversal: os.path.join with absolute prefix doesn't prevent traversal"),
137
+
138
+ (r"\.objects\.get\(id=item\.", None, "performance", "high",
139
+ "N+1 query: database lookup inside a loop"),
140
+
141
+ (r"FloatField\(\)", None, "bug", "medium",
142
+ "FloatField for monetary values causes precision errors, use DecimalField"),
143
+ (r"BinaryField\(\)", None, "security", "high",
144
+ "BinaryField with pickled data is a deserialization vulnerability"),
145
+ ]
146
+
147
+
148
+ def run_keyword_baseline(task: dict) -> List[Issue]:
149
+ findings: List[Issue] = []
150
+ seen_lines: set = set()
151
+
152
+ for filename, code in task.get("code_files", {}).items():
153
+ lines = code.splitlines()
154
+ for line_idx, line in enumerate(lines, start=1):
155
+ for pattern, fname_hint, itype, severity, desc in _PATTERNS:
156
+ # Optional filename filter
157
+ if fname_hint and fname_hint not in filename:
158
+ continue
159
+ if re.search(pattern, line):
160
+ key = (filename, line_idx)
161
+ if key not in seen_lines:
162
+ seen_lines.add(key)
163
+ findings.append(Issue(
164
+ line_number=line_idx,
165
+ filename=filename,
166
+ issue_type=itype,
167
+ severity=severity,
168
+ description=desc,
169
+ ))
170
+ return findings
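The scan loop in `run_keyword_baseline` above can be exercised without the rest of the server. The sketch below reimplements the same loop with a stand-in `Issue` dataclass and a trimmed two-entry pattern table; the field names mirror the module, but `Issue` itself and the shortened pattern list are assumptions for illustration:

```python
# Standalone sketch of the keyword baseline's scan loop.
# `Issue` is a stand-in for the project's model class (assumed field names).
import re
from dataclasses import dataclass

@dataclass
class Issue:
    line_number: int
    filename: str
    issue_type: str
    severity: str
    description: str

# Trimmed pattern table: (regex, filename hint, issue type, severity, description)
PATTERNS = [
    (r"shell\s*=\s*True", None, "security", "critical",
     "Command injection risk: shell=True with user input"),
    (r"hashlib\.md5\(", None, "security", "medium",
     "MD5 is cryptographically broken"),
]

def scan(task: dict) -> list[Issue]:
    findings: list[Issue] = []
    seen: set[tuple[str, int]] = set()
    for filename, code in task.get("code_files", {}).items():
        for line_idx, line in enumerate(code.splitlines(), start=1):
            for pattern, fname_hint, itype, severity, desc in PATTERNS:
                if fname_hint and fname_hint not in filename:
                    continue  # pattern restricted to matching filenames
                if re.search(pattern, line) and (filename, line_idx) not in seen:
                    seen.add((filename, line_idx))  # at most one flag per line
                    findings.append(Issue(line_idx, filename, itype, severity, desc))
    return findings

task = {"code_files": {"app.py": "import hashlib\nh = hashlib.md5(b'x')\n"}}
print([(f.filename, f.line_number) for f in scan(task)])  # [('app.py', 2)]
```

Note that the `seen` set deduplicates per (file, line): when two patterns hit the same line, only the first wins, which keeps the baseline from double-counting a single flagged line.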
tasks/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ from tasks.data import ALL_TASKS, get_task, TASK_IDS
2
+
3
+ __all__ = ["ALL_TASKS", "get_task", "TASK_IDS"]
tasks/data.py ADDED
@@ -0,0 +1,434 @@
1
+ """
2
+ Task definitions for the Code Review Environment.
3
+ """
4
+ from __future__ import annotations
5
+ from typing import List, Dict, Any
6
+
7
+ def _issue(line: int, filename: str, itype: str, severity: str, desc: str, fix: str = "") -> dict:
8
+ return {
9
+ "line_number": line,
10
+ "filename": filename,
11
+ "issue_type": itype,
12
+ "severity": severity,
13
+ "description": desc,
14
+ "fix_suggestion": fix,
15
+ }
16
+
17
+
18
+ _UTILS_CODE = """\
19
+ def calculate_average(numbers):
20
+ \"\"\"Calculate the average of a list of numbers.\"\"\"
21
+ if not numbers:
22
+ return 0
23
+ total = 0
24
+ for i in range(len(numbers) + 1):
25
+ total += numbers[i]
26
+ return total / len(numbers)
27
+
28
+
29
+ def binary_search(arr, target):
30
+ \"\"\"Search for target in sorted array. Returns index or -1.\"\"\"
31
+ left, right = 0, len(arr)
32
+ while left <= right:
33
+ mid = (left + right) // 2
34
+ if arr[mid] == target:
35
+ return mid
36
+ elif arr[mid] < target:
37
+ left = mid + 1
38
+ else:
39
+ right = mid - 1
40
+ return -1
41
+
42
+
43
+ def count_words(text):
44
+ \"\"\"Count word frequency in a text string.\"\"\"
45
+ words = text.lower().split()
46
+ counts = {}
47
+ for word in words:
48
+ if word in counts:
49
+ counts[word] += 1
50
+ else:
51
+ counts[word] = 0
52
+ return counts
53
+
54
+
55
+ def reverse_string(s):
56
+ \"\"\"Return the reversed version of a string (no bug here).\"\"\"
57
+ return s[::-1]
58
+ """
59
+
60
+ TASK_BUG_DETECTION: Dict[str, Any] = {
61
+ "task_id": "bug-detection",
62
+ "difficulty": "easy",
63
+ "description": (
64
+ "Review this Python utility module for logical bugs and errors.\n"
65
+ "The code contains several functions with subtle bugs that would cause\n"
66
+ "incorrect results or crashes. Identify all issues with exact line numbers,\n"
67
+ "issue type, severity, and a clear description of the problem.\n\n"
68
+ "File to review: utils.py"
69
+ ),
70
+ "language": "python",
71
+ "code_files": {
72
+ "utils.py": _UTILS_CODE,
73
+ },
74
+ "ground_truth_issues": [
75
+ _issue(
76
+ 6, "utils.py", "bug", "high",
77
+ "Off-by-one error: range(len(numbers) + 1) iterates one past the end, "
78
+ "causing IndexError on the last iteration.",
79
+ "Change to: range(len(numbers))"
80
+ ),
81
+ _issue(
82
+ 13, "utils.py", "bug", "medium",
83
+ "Binary search upper bound is wrong: right = len(arr) causes IndexError "
84
+ "when accessing arr[mid] on a full array.",
85
+ "Change to: right = len(arr) - 1"
86
+ ),
87
+ _issue(
88
+ 33, "utils.py", "bug", "low",
89
+ "Word count initializes new entries to 0 instead of 1, so every word's "
90
+ "count is underreported by 1.",
91
+ "Change to: counts[word] = 1"
92
+ ),
93
+ ],
94
+ "max_steps": 15,
95
+ "hints": [
96
+ "Look carefully at loop boundary conditions — are they off by one?",
97
+ "The binary_search function has an issue with its initial right bound.",
98
+ "Check how new keys are initialized in the word count dictionary.",
99
+ ],
100
+ }
101
+
102
+
103
+ _APP_CODE = """\
104
+ import sqlite3
105
+ import os
106
+ import subprocess
107
+ from flask import Flask, request, render_template_string
108
+
109
+ app = Flask(__name__)
110
+
111
+ SECRET_KEY = "hardcoded_secret_key_123"
112
+ DB_PASSWORD = "admin123"
113
+
114
+
115
+ def get_db():
116
+ return sqlite3.connect('users.db')
117
+
118
+
119
+ @app.route('/user/<username>')
120
+ def get_user(username):
121
+ db = get_db()
122
+ query = f"SELECT * FROM users WHERE username = '{username}'"
123
+ result = db.execute(query).fetchone()
124
+ return str(result)
125
+
126
+
127
+ @app.route('/search')
128
+ def search():
129
+ term = request.args.get('term', '')
130
+ template = f"<h1>Results for: {term}</h1>"
131
+ return render_template_string(template)
132
+
133
+
134
+ @app.route('/file')
135
+ def read_file():
136
+ filename = request.args.get('name', '')
137
+ filepath = os.path.join('/data', filename)
138
+ with open(filepath, 'r') as f:
139
+ return f.read()
140
+
141
+
142
+ @app.route('/admin/delete', methods=['POST'])
143
+ def admin_delete():
144
+ user_id = request.form.get('user_id')
145
+ db = get_db()
146
+ db.execute(f"DELETE FROM users WHERE id = {user_id}")
147
+ db.commit()
148
+ return "Deleted"
149
+
150
+
151
+ @app.route('/ping')
152
+ def ping():
153
+ host = request.args.get('host', '')
154
+ result = subprocess.run(f"ping -c 1 {host}", shell=True, capture_output=True)
155
+ return result.stdout.decode()
156
+ """
157
+
158
+ TASK_SECURITY_AUDIT: Dict[str, Any] = {
159
+ "task_id": "security-audit",
160
+ "difficulty": "medium",
161
+ "description": (
162
+ "Perform a security audit on this Flask web application.\n"
163
+ "The code contains multiple OWASP Top-10 security vulnerabilities.\n"
164
+ "Identify all security issues with their exact line numbers, severity ratings,\n"
165
+ "and recommended fixes. Consider: injection attacks, broken authentication,\n"
166
+ "sensitive data exposure, and improper input handling.\n\n"
167
+ "File to review: app.py"
168
+ ),
169
+ "language": "python",
170
+ "code_files": {
171
+ "app.py": _APP_CODE,
172
+ },
173
+ "ground_truth_issues": [
174
+ _issue(
175
+ 8, "app.py", "security", "high",
176
+ "Hardcoded SECRET_KEY in source code. Anyone with repo access can forge sessions.",
177
+ "Use: SECRET_KEY = os.environ.get('SECRET_KEY') and set it as an env var."
178
+ ),
179
+ _issue(
180
+ 9, "app.py", "security", "high",
181
+ "Hardcoded database password in source code. Credentials should never be in code.",
182
+ "Use: DB_PASSWORD = os.environ.get('DB_PASSWORD')"
183
+ ),
184
+ _issue(
185
+ 19, "app.py", "security", "critical",
186
+ "SQL injection: username is interpolated directly into the query string. "
187
+ "An attacker can supply username = \\' OR 1=1 -- to dump the database.",
188
+ "Use parameterized queries: db.execute('SELECT * FROM users WHERE username = ?', (username,))"
189
+ ),
190
+ _issue(
191
+ 27, "app.py", "security", "high",
192
+ "Cross-site scripting (XSS): user-supplied 'term' is rendered directly in an "
193
+ "HTML template via render_template_string without escaping.",
194
+ "Use flask.escape(term) or Markup.escape(term) before interpolating into HTML."
195
+ ),
196
+ _issue(
197
+ 34, "app.py", "security", "high",
198
+ "Path traversal: os.path.join('/data', filename) does not prevent filenames "
199
+ "like '../etc/passwd' from escaping the /data directory.",
200
+ "Use: filename = os.path.basename(filename) and validate against an allowlist."
201
+ ),
202
+ _issue(
203
+ 40, "app.py", "security", "critical",
204
+ "Missing authentication: the /admin/delete endpoint has no access control. "
205
+ "Any unauthenticated user can delete records.",
206
+ "Add @login_required decorator and check that request.user.is_admin is True."
207
+ ),
208
+ _issue(
209
+ 51, "app.py", "security", "critical",
210
+ "Command injection: user-supplied 'host' is interpolated into a shell command "
211
+ "with shell=True. Attacker can supply 'x; rm -rf /' to execute arbitrary commands.",
212
+ "Use: subprocess.run(['ping', '-c', '1', host], shell=False) after validating host."
213
+ ),
214
+ ],
215
+ "max_steps": 20,
216
+ "hints": [
217
+ "Look for hardcoded credentials and secrets at the top of the file.",
218
+ "Check every place user input (request.args, request.form) touches a database query, "
219
+ "template, file path, or shell command.",
220
+ "The admin endpoint is missing an authorization check.",
221
+ ],
222
+ }
223
+
224
+
225
+ _VIEWS_CODE = """\
226
+ import threading
227
+ from django.db import transaction
228
+ from django.contrib.auth.decorators import login_required
229
+ from django.http import JsonResponse
230
+ from .models import Order, Product, Cart
231
+ import hashlib
232
+
233
+ _lock = threading.Lock()
234
+
235
+
236
+ @login_required
237
+ def place_order(request):
238
+ user = request.user
239
+ cart_items = Cart.objects.filter(user=user)
240
+
241
+ if not cart_items.exists():
242
+ return JsonResponse({'error': 'Cart is empty'}, status=400)
243
+
244
+ total = 0
245
+ for item in cart_items:
246
+ product = Product.objects.get(id=item.product_id)
247
+ total += product.price * item.quantity
248
+
249
+ for item in cart_items:
250
+ product = Product.objects.get(id=item.product_id)
251
+ if product.stock < item.quantity:
252
+ return JsonResponse({'error': f'Insufficient stock for {product.name}'}, status=400)
253
+
254
+ order = Order.objects.create(
255
+ user=user,
256
+ total=total,
257
+ status='pending'
258
+ )
259
+
260
+ for item in cart_items:
261
+ product = Product.objects.get(id=item.product_id)
262
+ product.stock -= item.quantity
263
+ product.save()
264
+
265
+ cart_items.delete()
266
+ return JsonResponse({'order_id': order.id, 'total': float(total)})
267
+
268
+
269
+ @login_required
270
+ def get_order_history(request):
271
+ page = int(request.GET.get('page', 1))
272
+ per_page = int(request.GET.get('per_page', 10))
273
+
274
+ orders = Order.objects.filter(user=request.user)[
275
+ (page - 1) * per_page: page * per_page
276
+ ]
277
+
278
+ result = []
279
+ for order in orders:
280
+ result.append({
281
+ 'id': order.id,
282
+ 'total': order.total,
283
+ 'status': order.status,
284
+ })
285
+
286
+ return JsonResponse({'orders': result})
287
+
288
+
289
+ def verify_payment(order_id, payment_hash):
290
+ order = Order.objects.get(id=order_id)
291
+ expected = hashlib.md5(f"{order_id}{order.total}".encode()).hexdigest()
292
+ return expected == payment_hash
293
+ """
294
+
295
+ _MODELS_CODE = """\
296
+ from django.db import models
297
+ import pickle
298
+
299
+
300
+ class User(models.Model):
301
+ username = models.CharField(max_length=150)
302
+ email = models.CharField(max_length=255)
303
+ password = models.CharField(max_length=255)
304
+
305
+ class Meta:
306
+ db_table = 'users'
307
+
308
+
309
+ class Product(models.Model):
310
+ name = models.CharField(max_length=255)
311
+ price = models.FloatField()
312
+ stock = models.IntegerField(default=0)
313
+ metadata = models.BinaryField()
314
+
315
+ class Meta:
316
+ db_table = 'products'
317
+
318
+
319
+ class Order(models.Model):
320
+ user = models.ForeignKey(User, on_delete=models.CASCADE)
321
+ total = models.FloatField()
322
+ status = models.CharField(max_length=50)
323
+ created_at = models.DateTimeField(auto_now_add=True)
324
+
325
+ class Meta:
326
+ db_table = 'orders'
327
+
328
+
329
+ class Cart(models.Model):
330
+ user = models.ForeignKey(User, on_delete=models.CASCADE)
331
+ product_id = models.IntegerField()
332
+ quantity = models.IntegerField()
333
+
334
+ class Meta:
335
+ db_table = 'cart'
336
+ """
337
+
338
+ TASK_COMPREHENSIVE: Dict[str, Any] = {
339
+ "task_id": "comprehensive-review",
340
+ "difficulty": "hard",
341
+ "description": (
342
+ "Perform a comprehensive code review of this Django e-commerce API.\n"
343
+ "The code spans two files and contains bugs, security vulnerabilities,\n"
344
+ "performance issues, and data modeling problems.\n"
345
+ "Find ALL issues across BOTH files. This is a hard task — look carefully\n"
346
+ "for subtle architectural problems, not just surface-level issues.\n\n"
347
+ "Files to review: views.py, models.py"
348
+ ),
349
+ "language": "python",
350
+ "code_files": {
351
+ "views.py": _VIEWS_CODE,
352
+ "models.py": _MODELS_CODE,
353
+ },
354
+ "ground_truth_issues": [
355
+ _issue(
356
+ 21, "views.py", "performance", "high",
357
+ "N+1 query: Product.objects.get() is called inside a loop, issuing one SQL "
358
+ "query per cart item. With 100 items this means 100 DB roundtrips.",
359
+ "Use: Product.objects.filter(id__in=[i.product_id for i in cart_items]) "
360
+ "then build a dict for O(1) lookup."
361
+ ),
362
+ _issue(
363
+ 26, "views.py", "bug", "critical",
364
+ "Race condition: the stock check and stock decrement are not atomic. "
365
+ "Two concurrent requests can both pass the check and oversell the product.",
366
+ "Wrap in transaction.atomic() and use Product.objects.select_for_update() "
367
+ "to lock rows during the check."
368
+ ),
369
+ _issue(
370
+ 29, "views.py", "bug", "high",
371
+ "Order is created outside a database transaction. If stock decrement fails "
372
+ "after the order is created, the database is left in an inconsistent state.",
373
+ "Wrap the entire order creation flow in: with transaction.atomic():"
374
+ ),
375
+ _issue(
376
+ 47, "views.py", "security", "medium",
377
+ "No maximum cap on per_page: an attacker can request per_page=1000000 "
378
+ "to dump the entire orders table in one request, causing DoS or data leak.",
379
+ "Add: per_page = min(int(request.GET.get('per_page', 10)), 100)"
380
+ ),
381
+ _issue(
382
+ 66, "views.py", "security", "medium",
383
+ "MD5 is a cryptographically broken hash function and should not be used "
384
+ "for payment verification. Collisions can be manufactured.",
385
+ "Use HMAC-SHA256: hmac.new(SECRET.encode(), payload.encode(), hashlib.sha256).hexdigest()"
386
+ ),
387
+ _issue(
388
+ 67, "views.py", "security", "medium",
389
+ "Timing attack: string comparison with == leaks timing information that "
390
+ "allows an attacker to forge valid hashes byte-by-byte.",
391
+ "Use: hmac.compare_digest(expected, payment_hash) for constant-time comparison."
392
+ ),
393
+ _issue(
394
+ 8, "models.py", "security", "critical",
395
+ "Plaintext password storage: passwords are stored as raw strings in the "
396
+ "database. Any DB breach immediately exposes all user passwords.",
397
+ "Use Django's built-in: from django.contrib.auth.hashers import make_password, check_password"
398
+ ),
399
+ _issue(
400
+ 16, "models.py", "bug", "medium",
401
+ "FloatField for monetary values causes floating-point precision errors "
402
+ "(e.g., 0.1 + 0.2 != 0.3). This will produce wrong totals over time.",
403
+ "Use: DecimalField(max_digits=10, decimal_places=2) for all monetary fields."
404
+ ),
405
+ _issue(
406
+ 18, "models.py", "security", "high",
407
+ "BinaryField storing pickled data is dangerous: pickle.loads() on untrusted "
408
+ "data can execute arbitrary code. Anyone who can write to this field can RCE.",
409
+ "Use: JSONField() instead. If binary storage is required, validate/sign the data."
410
+ ),
411
+ ],
412
+ "max_steps": 30,
413
+ "hints": [
414
+ "Look for database queries inside for loops — this is a classic N+1 problem.",
415
+ "Check whether stock checks and order creation happen inside a database transaction.",
416
+ "Look at models.py: how are passwords and monetary values stored?",
417
+ ],
418
+ }
419
+
420
+
421
+ ALL_TASKS: Dict[str, Dict[str, Any]] = {
422
+ TASK_BUG_DETECTION["task_id"]: TASK_BUG_DETECTION,
423
+ TASK_SECURITY_AUDIT["task_id"]: TASK_SECURITY_AUDIT,
424
+ TASK_COMPREHENSIVE["task_id"]: TASK_COMPREHENSIVE,
425
+ }
426
+
427
+ TASK_IDS: List[str] = list(ALL_TASKS.keys())
428
+
429
+
430
+ def get_task(task_id: str) -> Dict[str, Any]:
431
+ """Return task definition by ID, raising KeyError if not found."""
432
+ if task_id not in ALL_TASKS:
433
+ raise KeyError(f"Unknown task_id '{task_id}'. Valid: {TASK_IDS}")
434
+ return ALL_TASKS[task_id]
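Because `ground_truth_issues` reference line numbers inside the embedded code strings, any drift between the two silently corrupts grading. A small consistency check is cheap insurance; the sketch below is not part of the repo, and `TASK` is a hypothetical minimal entry shaped like the dicts above:

```python
# Check that every ground-truth issue points at a real line in a real file.
# TASK is a hypothetical, minimal task entry (same shape as tasks/data.py).
TASK = {
    "code_files": {"utils.py": "def f():\n    return 0\n"},
    "ground_truth_issues": [
        {"line_number": 2, "filename": "utils.py", "issue_type": "bug",
         "severity": "low", "description": "example"},
    ],
}

def validate_task(task: dict) -> list[str]:
    errors = []
    for issue in task["ground_truth_issues"]:
        code = task["code_files"].get(issue["filename"])
        if code is None:
            errors.append(f"unknown file: {issue['filename']}")
        elif not 1 <= issue["line_number"] <= len(code.splitlines()):
            errors.append(f"{issue['filename']}:{issue['line_number']} out of range")
    return errors

print(validate_task(TASK))  # [] — the example entry is consistent
```

Run over `ALL_TASKS.values()`, a check like this would catch a ground-truth line number that no longer matches its code string after an edit.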
tests/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # tests package
tests/test_environment.py ADDED
@@ -0,0 +1,314 @@
1
+ """
2
+ Tests for CodeReviewEnvironment.
3
+
4
+ Run with: pytest tests/ -v
5
+ Or: python -m pytest tests/ -v
6
+ """
7
+ import sys
8
+ import os
9
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
10
+
11
+ import pytest
12
+ from models import ReviewAction, ReviewObservation, ReviewState
13
+ from server.environment import CodeReviewEnvironment
14
+ from tasks.data import ALL_TASKS, TASK_IDS
15
+
16
+
17
+ # ---------------------------------------------------------------------------
18
+ # Fixtures
19
+ # ---------------------------------------------------------------------------
20
+
21
+ @pytest.fixture
22
+ def env():
23
+ return CodeReviewEnvironment()
24
+
25
+
26
+ @pytest.fixture
27
+ def env_bug(env):
28
+ env.reset(task_id="bug-detection")
29
+ return env
30
+
31
+
32
+ @pytest.fixture
33
+ def env_sec(env):
34
+ env.reset(task_id="security-audit")
35
+ return env
36
+
37
+
38
+ @pytest.fixture
39
+ def env_hard(env):
40
+ env.reset(task_id="comprehensive-review")
41
+ return env
42
+
43
+
44
+ # ---------------------------------------------------------------------------
45
+ # reset() tests
46
+ # ---------------------------------------------------------------------------
47
+
48
+ class TestReset:
49
+ def test_reset_returns_observation(self, env):
50
+ obs = env.reset()
51
+ assert isinstance(obs, ReviewObservation)
52
+
53
+ def test_reset_done_is_false(self, env):
54
+ obs = env.reset()
55
+ assert obs.done is False
56
+
57
+ def test_reset_reward_is_none(self, env):
58
+ obs = env.reset()
59
+ assert obs.reward is None
60
+
61
+ def test_reset_has_code_files(self, env):
62
+ obs = env.reset()
63
+ assert isinstance(obs.code_files, dict)
64
+ assert len(obs.code_files) > 0
65
+
66
+ def test_reset_step_count_zero(self, env):
67
+ obs = env.reset()
68
+ assert obs.step_count == 0
69
+
70
+ def test_reset_no_flagged_issues(self, env):
71
+ obs = env.reset()
72
+ assert obs.flagged_issues == []
73
+
74
+ def test_reset_specific_task(self, env):
75
+ for task_id in TASK_IDS:
76
+ obs = env.reset(task_id=task_id)
77
+ assert obs.task_id == task_id
78
+
79
+ def test_reset_bug_detection(self, env):
80
+ obs = env.reset(task_id="bug-detection")
81
+ assert "utils.py" in obs.code_files
82
+
83
+ def test_reset_security_audit(self, env):
84
+ obs = env.reset(task_id="security-audit")
85
+ assert "app.py" in obs.code_files
86
+
87
+ def test_reset_comprehensive(self, env):
88
+ obs = env.reset(task_id="comprehensive-review")
89
+ assert "views.py" in obs.code_files
90
+ assert "models.py" in obs.code_files
91
+
92
+ def test_reset_with_seed_is_reproducible(self, env):
93
+ obs1 = env.reset(seed=42)
94
+ task1 = obs1.task_id
95
+ obs2 = env.reset(seed=42)
96
+ task2 = obs2.task_id
97
+ assert task1 == task2
98
+
99
+ def test_reset_clears_previous_state(self, env):
100
+ env.reset(task_id="bug-detection")
101
+ env.step(ReviewAction(
102
+ action_type="flag_issue", line_number=6, filename="utils.py",
103
+ issue_type="bug", severity="high", description="test"
104
+ ))
105
+ obs = env.reset(task_id="bug-detection")
106
+ assert obs.flagged_issues == []
107
+ assert obs.step_count == 0
108
+
109
+
110
+ # ---------------------------------------------------------------------------
111
+ # step() — flag_issue tests
112
+ # ---------------------------------------------------------------------------
113
+
114
+ class TestFlagIssue:
115
+ def test_flag_increments_step_count(self, env_bug):
116
+ obs = env_bug.step(ReviewAction(
117
+ action_type="flag_issue", line_number=6, filename="utils.py",
118
+ issue_type="bug", severity="high", description="test"
119
+ ))
120
+ assert obs.step_count == 1
121
+
122
+ def test_flag_adds_to_flagged_issues(self, env_bug):
123
+ obs = env_bug.step(ReviewAction(
124
+ action_type="flag_issue", line_number=6, filename="utils.py",
125
+ issue_type="bug", severity="high", description="test"
126
+ ))
127
+ assert len(obs.flagged_issues) == 1
128
+
129
+ def test_flag_true_positive_gives_positive_reward(self, env_bug):
130
+ obs = env_bug.step(ReviewAction(
131
+ action_type="flag_issue", line_number=6, filename="utils.py",
132
+ issue_type="bug", severity="high", description="off-by-one"
133
+ ))
134
+ assert obs.reward is not None and obs.reward > 0
135
+
136
+ def test_flag_false_positive_gives_negative_reward(self, env_bug):
137
+ obs = env_bug.step(ReviewAction(
138
+ action_type="flag_issue", line_number=100, filename="utils.py",
139
+ issue_type="bug", severity="low", description="nonexistent issue"
140
+ ))
141
+ assert obs.reward is not None and obs.reward < 0
142
+
143
+ def test_flag_missing_line_number_gives_penalty(self, env_bug):
144
+ obs = env_bug.step(ReviewAction(
145
+ action_type="flag_issue", filename="utils.py",
146
+ issue_type="bug", severity="high", description="test"
147
+ ))
148
+ assert obs.reward is not None and obs.reward <= 0
149
+
150
+ def test_flag_duplicate_line_no_change(self, env_bug):
151
+ env_bug.step(ReviewAction(
152
+ action_type="flag_issue", line_number=6, filename="utils.py",
153
+ issue_type="bug", severity="high", description="test"
154
+ ))
155
+ obs = env_bug.step(ReviewAction(
156
+ action_type="flag_issue", line_number=6, filename="utils.py",
157
+ issue_type="bug", severity="high", description="same line again"
158
+ ))
159
+ assert len(obs.flagged_issues) == 1 # not doubled
160
+
161
+ def test_flag_multiple_issues(self, env_bug):
162
+ for line in [6, 13, 33]:
163
+ env_bug.step(ReviewAction(
164
+ action_type="flag_issue", line_number=line, filename="utils.py",
165
+ issue_type="bug", severity="medium", description=f"bug at {line}"
166
+ ))
167
+ obs = env_bug.state
168
+ assert len(obs.flagged_issues) == 3
169
+
170
+
171
+ # ---------------------------------------------------------------------------
172
+ # step() — clear_flag tests
173
+ # ---------------------------------------------------------------------------
174
+
175
+ class TestClearFlag:
176
+ def test_clear_removes_flag(self, env_bug):
177
+ env_bug.step(ReviewAction(
178
+ action_type="flag_issue", line_number=6, filename="utils.py",
179
+ issue_type="bug", severity="high", description="test"
180
+ ))
181
+ obs = env_bug.step(ReviewAction(
182
+ action_type="clear_flag", line_number=6, filename="utils.py",
183
+ description=""
184
+ ))
185
+ assert len(obs.flagged_issues) == 0
186
+
187
+ def test_clear_nonexistent_flag_no_reward(self, env_bug):
188
+ obs = env_bug.step(ReviewAction(
189
+ action_type="clear_flag", line_number=999, filename="utils.py",
190
+ description=""
191
+ ))
192
+ assert obs.reward == 0.0
193
+
194
+ def test_clear_false_positive_gives_positive_reward(self, env_bug):
195
+ # First flag a FP
196
+ env_bug.step(ReviewAction(
197
+ action_type="flag_issue", line_number=100, filename="utils.py",
198
+ issue_type="bug", severity="low", description="wrong"
199
+ ))
200
+ obs = env_bug.step(ReviewAction(
201
+ action_type="clear_flag", line_number=100, filename="utils.py",
202
+ description=""
203
+ ))
204
+ assert obs.reward is not None and obs.reward > 0
205
+
206
+
207
+ # ---------------------------------------------------------------------------
208
+ # step() — request_hint tests
209
+ # ---------------------------------------------------------------------------
210
+
211
+ class TestRequestHint:
212
+ def test_hint_gives_small_negative_reward(self, env_bug):
213
+ obs = env_bug.step(ReviewAction(action_type="request_hint"))
214
+ assert obs.reward is not None and obs.reward < 0
215
+
216
+ def test_hint_decrements_hints_remaining(self, env_bug):
217
+ # request two hints and compare the remaining-hint counters
218
+ obs1 = env_bug.step(ReviewAction(action_type="request_hint"))
219
+ obs2 = env_bug.step(ReviewAction(action_type="request_hint"))
220
+ assert obs2.hints_remaining < obs1.hints_remaining
221
+
222
+ def test_hint_content_in_feedback(self, env_bug):
223
+ obs = env_bug.step(ReviewAction(action_type="request_hint"))
224
+ assert "hint" in obs.feedback.lower() or "loop" in obs.feedback.lower()
225
+
226
+
227
+ # ---------------------------------------------------------------------------
228
+ # step() — submit_review tests
229
+ # ---------------------------------------------------------------------------
230
+
231
+ class TestSubmitReview:
232
+ def test_submit_ends_episode(self, env_bug):
233
+ obs = env_bug.step(ReviewAction(action_type="submit_review"))
234
+ assert obs.done is True
235
+
236
+ def test_submit_reward_is_float_in_range(self, env_bug):
237
+ obs = env_bug.step(ReviewAction(action_type="submit_review"))
238
+ assert obs.reward is not None
239
+ assert 0.0 <= obs.reward <= 1.0
240
+
241
+ def test_submit_all_bugs_gives_high_score(self, env_bug):
242
+ # Flag all 3 correct bugs
243
+ for line, sev in [(6, "high"), (13, "medium"), (33, "low")]:
244
+ env_bug.step(ReviewAction(
245
+ action_type="flag_issue", line_number=line, filename="utils.py",
246
+ issue_type="bug", severity=sev, description=f"bug at line {line}"
247
+ ))
248
+ obs = env_bug.step(ReviewAction(action_type="submit_review"))
249
+ assert obs.reward is not None and obs.reward >= 0.7
250
+
251
+ def test_submit_no_flags_gives_zero(self, env_bug):
252
+ obs = env_bug.step(ReviewAction(action_type="submit_review"))
253
+ assert obs.reward == 0.0
254
+
255
+ def test_submit_after_done_is_noop(self, env_bug):
256
+ env_bug.step(ReviewAction(action_type="submit_review"))
257
+ obs2 = env_bug.step(ReviewAction(action_type="submit_review"))
258
+ assert obs2.done is True # still done
259
+
260
+
261
+ # ---------------------------------------------------------------------------
262
+ # state property tests
263
+ # ---------------------------------------------------------------------------
264
+
265
+ class TestState:
266
+ def test_state_returns_review_state(self, env):
267
+ env.reset(task_id="bug-detection")
268
+ st = env.state
269
+ assert isinstance(st, ReviewState)
270
+
271
+ def test_state_has_episode_id(self, env):
272
+ env.reset(task_id="bug-detection")
273
+ assert env.state.episode_id is not None
274
+
275
+ def test_state_tracks_step_count(self, env_bug):
276
+ env_bug.step(ReviewAction(action_type="request_hint"))
277
+ assert env_bug.state.step_count == 1
278
+
279
+ def test_state_tracks_flagged_issues(self, env_bug):
280
+ env_bug.step(ReviewAction(
281
+ action_type="flag_issue", line_number=6, filename="utils.py",
282
+ issue_type="bug", severity="high", description="test"
283
+ ))
284
+ assert len(env_bug.state.flagged_issues) == 1
285
+
286
+
287
+ # ---------------------------------------------------------------------------
288
+ # Unknown action type
289
+ # ---------------------------------------------------------------------------
290
+
291
+ class TestUnknownAction:
292
+ def test_unknown_action_type_no_crash(self, env_bug):
293
+ obs = env_bug.step(ReviewAction(action_type="invalid_action"))
294
+ assert obs is not None
295
+ assert isinstance(obs.done, bool)
296
+
297
+
298
+ # ---------------------------------------------------------------------------
299
+ # Max steps auto-end
300
+ # ---------------------------------------------------------------------------
301
+
302
+ class TestMaxSteps:
303
+ def test_episode_auto_ends_at_max_steps(self):
304
+ """Verify episode ends when step budget is exhausted."""
305
+ env = CodeReviewEnvironment()
306
+ obs = env.reset(task_id="bug-detection")
307
+ max_steps = obs.max_steps
308
+
309
+ for _ in range(max_steps):
310
+ obs = env.step(ReviewAction(action_type="request_hint"))
311
+ if obs.done:
312
+ break
313
+
314
+ assert obs.done is True
tests/test_graders.py ADDED
@@ -0,0 +1,215 @@
1
+ """
2
+ Tests for the grading logic.
3
+ """
4
+ import sys
5
+ import os
6
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
7
+
8
+ import pytest
9
+ from models import Issue
10
+ from server.graders import grade_episode, match_issue, run_keyword_baseline
11
+ from tasks.data import ALL_TASKS, TASK_IDS
12
+
13
+
14
+ def _issue(line, filename, itype="bug", severity="medium", desc=""):
15
+ return Issue(line_number=line, filename=filename, issue_type=itype,
16
+ severity=severity, description=desc)
17
+
18
+
19
+ # ---------------------------------------------------------------------------
20
+ # match_issue()
21
+ # ---------------------------------------------------------------------------
22
+
23
+ class TestMatchIssue:
24
+ def test_exact_match(self):
25
+ f = _issue(6, "utils.py", "bug", "high")
26
+ gt = _issue(6, "utils.py", "bug", "high")
27
+ assert match_issue(f, gt) is True
28
+
29
+ def test_line_within_tolerance(self):
30
+ f = _issue(7, "utils.py", "bug", "high")
31
+ gt = _issue(6, "utils.py", "bug", "high")
32
+ assert match_issue(f, gt, line_tolerance=2) is True
33
+
34
+ def test_line_outside_tolerance(self):
35
+ f = _issue(10, "utils.py", "bug", "high")
36
+ gt = _issue(6, "utils.py", "bug", "high")
37
+ assert match_issue(f, gt, line_tolerance=2) is False
38
+
39
+ def test_wrong_filename(self):
40
+ f = _issue(6, "other.py", "bug", "high")
41
+ gt = _issue(6, "utils.py", "bug", "high")
42
+ assert match_issue(f, gt) is False
43
+
44
+ def test_bug_logic_interchangeable(self):
45
+ f = _issue(6, "utils.py", "logic", "high")
46
+ gt = _issue(6, "utils.py", "bug", "high")
47
+ assert match_issue(f, gt) is True
48
+
49
+ def test_logic_bug_interchangeable(self):
50
+ f = _issue(6, "utils.py", "bug", "high")
51
+ gt = _issue(6, "utils.py", "logic", "high")
52
+ assert match_issue(f, gt) is True
53
+
54
+ def test_wrong_type_no_match(self):
55
+ f = _issue(6, "utils.py", "performance", "high")
56
+ gt = _issue(6, "utils.py", "bug", "high")
57
+ assert match_issue(f, gt) is False
58
+
59
+
60
+ # ---------------------------------------------------------------------------
+ # grade_episode()
+ # ---------------------------------------------------------------------------
+
+ class TestGradeEpisode:
+     def test_empty_both_is_perfect(self):
+         assert grade_episode([], []) == 1.0
+
+     def test_empty_flagged_is_zero(self):
+         gt = [_issue(6, "utils.py")]
+         assert grade_episode([], gt) == 0.0
+
+     def test_false_positives_only_is_zero(self):
+         flagged = [_issue(100, "utils.py"), _issue(200, "utils.py")]
+         gt = [_issue(6, "utils.py")]
+         score = grade_episode(flagged, gt)
+         assert score == 0.0
+
+     def test_perfect_match_is_near_one(self):
+         gt = [
+             _issue(6, "utils.py", "bug", "high"),
+             _issue(13, "utils.py", "bug", "medium"),
+         ]
+         score = grade_episode(gt, gt)
+         assert score >= 0.9
+
+     def test_partial_match(self):
+         gt = [
+             _issue(6, "utils.py", "bug", "high"),
+             _issue(13, "utils.py", "bug", "medium"),
+             _issue(33, "utils.py", "bug", "low"),
+         ]
+         flagged = [_issue(6, "utils.py", "bug", "high")]  # only 1 of 3
+         score = grade_episode(flagged, gt)
+         # recall = 1/3, precision = 1/1, F1 = 0.5
+         assert 0.3 < score < 0.6
+
+     def test_false_positives_lower_score(self):
+         gt = [_issue(6, "utils.py", "bug", "high")]
+         perfect = [_issue(6, "utils.py", "bug", "high")]
+         with_fp = [_issue(6, "utils.py", "bug", "high"), _issue(100, "utils.py")]
+         assert grade_episode(perfect, gt) > grade_episode(with_fp, gt)
+
+     def test_severity_mismatch_lowers_score(self):
+         gt = [_issue(6, "utils.py", "bug", "critical")]
+         exact = [_issue(6, "utils.py", "bug", "critical")]
+         wrong_sev = [_issue(6, "utils.py", "bug", "low")]
+         assert grade_episode(exact, gt) > grade_episode(wrong_sev, gt)
+
+     def test_score_is_always_in_0_1(self):
+         import random
+         random.seed(0)
+         gt = [_issue(i * 10, "f.py") for i in range(5)]
+         for _ in range(20):
+             n = random.randint(0, 10)
+             flagged = [_issue(random.randint(1, 100), "f.py") for _ in range(n)]
+             score = grade_episode(flagged, gt)
+             assert 0.0 <= score <= 1.0, f"Score {score} out of range"
+
+     def test_multifile_match(self):
+         gt = [
+             _issue(21, "views.py", "performance", "high"),
+             _issue(8, "models.py", "security", "critical"),
+         ]
+         flagged = [
+             _issue(21, "views.py", "performance", "high"),
+             _issue(8, "models.py", "security", "critical"),
+         ]
+         score = grade_episode(flagged, gt)
+         assert score >= 0.85
+
+     def test_multifile_wrong_file_no_match(self):
+         gt = [_issue(21, "views.py", "performance", "high")]
+         flagged = [_issue(21, "models.py", "performance", "high")]  # wrong file
+         assert grade_episode(flagged, gt) == 0.0
+
+
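The expectations in the partial-match and false-positive tests are consistent with an F1-style score over matched issues. A minimal sketch of that arithmetic, assuming `grade_episode` is F1-based (the real grader may add severity weighting on top; `f1_score` here is an illustrative helper, not part of the environment):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 over matched issues: tp = correct findings, fp = spurious, fn = missed."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The partial-match case above: 1 of 3 ground-truth issues found, no false positives.
# recall = 1/3, precision = 1 -> F1 ~ 0.5, inside the 0.3 < score < 0.6 band tested.
print(f1_score(tp=1, fp=0, fn=2))
```

This also explains why `test_false_positives_only_is_zero` expects exactly 0.0: with no true positives, both precision and recall are zero.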
+ # ---------------------------------------------------------------------------
+ # run_keyword_baseline()
+ # ---------------------------------------------------------------------------
+
+ class TestKeywordBaseline:
+     def test_baseline_returns_list(self):
+         from tasks.data import TASK_BUG_DETECTION
+         findings = run_keyword_baseline(TASK_BUG_DETECTION)
+         assert isinstance(findings, list)
+
+     def test_baseline_issues_have_correct_types(self):
+         from tasks.data import TASK_BUG_DETECTION
+         findings = run_keyword_baseline(TASK_BUG_DETECTION)
+         for f in findings:
+             assert isinstance(f, Issue)
+             assert f.issue_type in ("bug", "security", "performance", "logic")
+             assert f.severity in ("low", "medium", "high", "critical")
+
+     def test_baseline_finds_some_security_issues(self):
+         from tasks.data import TASK_SECURITY_AUDIT
+         findings = run_keyword_baseline(TASK_SECURITY_AUDIT)
+         security_finds = [f for f in findings if f.issue_type == "security"]
+         assert len(security_finds) >= 2
+
+     def test_baseline_score_in_range(self):
+         for task_id in TASK_IDS:
+             task = ALL_TASKS[task_id]
+             findings = run_keyword_baseline(task)
+             gt = [Issue.from_dict(i) for i in task["ground_truth_issues"]]
+             score = grade_episode(findings, gt)
+             assert 0.0 <= score <= 1.0, f"Task {task_id}: score={score} out of range"
+
+     def test_baseline_score_is_nonzero(self):
+         """Heuristic should find at least something in most tasks."""
+         for task_id in TASK_IDS:
+             task = ALL_TASKS[task_id]
+             findings = run_keyword_baseline(task)
+             gt = [Issue.from_dict(i) for i in task["ground_truth_issues"]]
+             score = grade_episode(findings, gt)
+             # Not every task may have regex hits, but security-audit should
+             if task_id == "security-audit":
+                 assert score > 0.0, f"Heuristic found nothing in {task_id}"
+
+
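These tests treat `run_keyword_baseline` as a regex-driven heuristic over task source files. A minimal sketch of that style of scanner; the pattern table, severities, and the `scan` helper are illustrative assumptions, not the environment's actual implementation:

```python
import re

# Hypothetical keyword table: (pattern, issue_type, severity).
# The real baseline's patterns may differ.
PATTERNS = [
    (re.compile(r"eval\(|exec\("), "security", "critical"),
    (re.compile(r"password\s*=\s*['\"]"), "security", "high"),
    (re.compile(r"except\s*:\s*$"), "bug", "medium"),
]

def scan(source: str, filename: str):
    """Return (line_number, filename, issue_type, severity) tuples for regex hits."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, issue_type, severity in PATTERNS:
            if pattern.search(line):
                findings.append((lineno, filename, issue_type, severity))
    return findings

code = 'password = "hunter2"\nexcept:\n'
print(scan(code, "utils.py"))
```

A scanner of this shape explains the hedged comment in `test_baseline_score_is_nonzero`: tasks whose planted issues happen to contain no keyword hits can legitimately score zero, so only the security-audit task is asserted to be nonzero.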
+
181
+ # ---------------------------------------------------------------------------
182
+ # Ground truth sanity checks
183
+ # ---------------------------------------------------------------------------
184
+
185
+ class TestGroundTruth:
186
+ def test_all_tasks_have_3_plus_issues(self):
187
+ for task_id, task in ALL_TASKS.items():
188
+ assert len(task["ground_truth_issues"]) >= 3, (
189
+ f"Task {task_id} has fewer than 3 issues"
190
+ )
191
+
192
+ def test_all_tasks_have_valid_difficulties(self):
193
+ difficulties = {t["difficulty"] for t in ALL_TASKS.values()}
194
+ assert "easy" in difficulties
195
+ assert "medium" in difficulties
196
+ assert "hard" in difficulties
197
+
198
+ def test_all_issues_have_required_fields(self):
199
+ for task_id, task in ALL_TASKS.items():
200
+ for i, issue in enumerate(task["ground_truth_issues"]):
201
+ assert "line_number" in issue, f"{task_id}[{i}] missing line_number"
202
+ assert "filename" in issue, f"{task_id}[{i}] missing filename"
203
+ assert "issue_type" in issue, f"{task_id}[{i}] missing issue_type"
204
+ assert "severity" in issue, f"{task_id}[{i}] missing severity"
205
+
206
+ def test_bug_detection_issues_in_utils_py(self):
207
+ task = ALL_TASKS["bug-detection"]
208
+ for issue in task["ground_truth_issues"]:
209
+ assert issue["filename"] == "utils.py"
210
+
211
+ def test_comprehensive_has_multifile_issues(self):
212
+ task = ALL_TASKS["comprehensive-review"]
213
+ files = {i["filename"] for i in task["ground_truth_issues"]}
214
+ assert "views.py" in files
215
+ assert "models.py" in files
uv.lock ADDED
The diff for this file is too large to render. See raw diff