Spaces:
Configuration error
Configuration error
Upload 34 files
Browse files- Dockerfile +40 -0
- README.md +326 -5
- RULES.md +261 -0
- __pycache__/app.cpython-313.pyc +0 -0
- app.py +145 -0
- baseline_agent.py +316 -0
- corpus/__init__.py +1 -0
- corpus/__pycache__/__init__.cpython-313.pyc +0 -0
- corpus/__pycache__/snippets.cpython-313.pyc +0 -0
- corpus/snippets.py +390 -0
- env/__init__.py +1 -0
- env/__pycache__/__init__.cpython-313.pyc +0 -0
- env/__pycache__/environment.cpython-313.pyc +0 -0
- env/__pycache__/models.cpython-313.pyc +0 -0
- env/environment.py +317 -0
- env/models.py +117 -0
- graders/__init__.py +1 -0
- graders/__pycache__/__init__.cpython-313.pyc +0 -0
- graders/__pycache__/graders.cpython-313.pyc +0 -0
- graders/graders.py +313 -0
- inference.py +304 -0
- openenv-code-review.tar.gz +3 -0
- openenv.yaml +163 -0
- pyproject.toml +31 -0
- requirements.txt +7 -0
- server/__init__.py +1 -0
- server/app.py +34 -0
- templates/index.html +807 -0
- tests/__init__.py +1 -0
- tests/__pycache__/__init__.cpython-313.pyc +0 -0
- tests/__pycache__/test_env.cpython-313-pytest-9.0.3.pyc +0 -0
- tests/test_env.py +269 -0
- uv.lock +0 -0
- validate-submission.sh +185 -0
Dockerfile
ADDED
|
@@ -0,0 +1,40 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# ---- Build stage ----
FROM python:3.11-slim AS builder

WORKDIR /app

# Install dependencies into a virtual environment so the runtime stage can
# copy a self-contained /opt/venv without carrying any build tooling.
COPY requirements.txt .
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --upgrade pip && \
    /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# ---- Runtime stage ----
FROM python:3.11-slim

# HF Spaces expects the app to listen on port 7860
ENV PORT=7860 \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PATH="/opt/venv/bin:$PATH"

WORKDIR /app

# Create the non-root user (HF Spaces security requirement) BEFORE copying
# files so ownership can be set at COPY time. The previous post-hoc
# `RUN chown -R appuser:appuser /app` rewrote every file in a new layer,
# roughly doubling the size contributed by /app in the final image.
RUN useradd -m -u 1000 appuser

# Copy virtual env from builder, owned by the runtime user
COPY --from=builder --chown=appuser:appuser /opt/venv /opt/venv

# Copy application code, owned by the runtime user.
# NOTE(review): this copies the whole build context, including __pycache__/
# directories and openenv-code-review.tar.gz — add a .dockerignore to slim
# the image.
COPY --chown=appuser:appuser . .

USER appuser

EXPOSE 7860

# Health check — uses stdlib urllib so no extra packages (e.g. curl) are needed
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')"

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
|
README.md
CHANGED
|
@@ -1,10 +1,331 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 9 |
|
| 10 |
-
|
|
|
|
| 1 |
+
# 🔍 CodeReview OpenEnv
|
| 2 |
+
|
| 3 |
+
An **OpenEnv-compliant AI training environment** that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps — exactly as a senior engineer would in a real pull-request workflow.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Why Code Review?
|
| 8 |
+
|
| 9 |
+
Code review is one of the highest-leverage tasks in software engineering. It is:
|
| 10 |
+
|
| 11 |
+
- **Real-world**: Every professional software team does it daily
|
| 12 |
+
- **Structured enough to grade**: Issues have objectively correct or incorrect assessments
|
| 13 |
+
- **Rich in partial signal**: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
|
| 14 |
+
- **Scalable in difficulty**: Easy (bugs only) → Hard (all categories + written summary)
|
| 15 |
+
|
| 16 |
+
This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.
|
| 17 |
+
|
| 18 |
+
---
|
| 19 |
+
|
| 20 |
+
## Environment Description
|
| 21 |
+
|
| 22 |
+
```
|
| 23 |
+
CodeReviewEnv
|
| 24 |
+
├── Task 1 – Easy : Bug detection + Code style (calculator.py, 31 lines)
|
| 25 |
+
├── Task 2 – Medium : Security + Performance audit (user_service.py, 55 lines)
|
| 26 |
+
└── Task 3 – Hard : Full review, all 5 categories (data_pipeline.py, 49 lines)
|
| 27 |
+
```
|
| 28 |
+
|
| 29 |
+
Each task presents a Python snippet containing intentional flaws. The agent submits `ReviewComment` objects across one or more steps, then finalises with `submit=True`. A deterministic grader scores the review against ground-truth issues.
|
| 30 |
+
|
| 31 |
+
---
|
| 32 |
+
|
| 33 |
+
## Observation Space
|
| 34 |
+
|
| 35 |
+
What the agent sees on each step:
|
| 36 |
+
|
| 37 |
+
| Field | Type | Description |
|
| 38 |
+
|---|---|---|
|
| 39 |
+
| `task_id` | `str` | Active task identifier |
|
| 40 |
+
| `step` | `int` | Current step (0-indexed) |
|
| 41 |
+
| `snippet.file_name` | `str` | Logical file name (e.g. `auth.py`) |
|
| 42 |
+
| `snippet.source` | `str` | Full Python source code |
|
| 43 |
+
| `instructions` | `str` | Review scope, difficulty, and guidance |
|
| 44 |
+
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
|
| 45 |
+
| `feedback` | `str \| None` | Env feedback on the last action |
|
| 46 |
+
| `done` | `bool` | Whether the episode has ended |
|
| 47 |
+
|
| 48 |
+
---
|
| 49 |
+
|
| 50 |
+
## Action Space
|
| 51 |
+
|
| 52 |
+
What the agent submits on each step:
|
| 53 |
+
|
| 54 |
+
```json
|
| 55 |
+
{
|
| 56 |
+
"comments": [
|
| 57 |
+
{
|
| 58 |
+
"line": 10,
|
| 59 |
+
"category": "security",
|
| 60 |
+
"severity": "critical",
|
| 61 |
+
"message": "SQL injection via string interpolation in query.",
|
| 62 |
+
"suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
|
| 63 |
+
}
|
| 64 |
+
],
|
| 65 |
+
"summary": "Overall review summary (required for task_3_hard)",
|
| 66 |
+
"submit": true
|
| 67 |
+
}
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
| Field | Type | Values |
|
| 71 |
+
|---|---|---|
|
| 72 |
+
| `comments[].line` | `int \| null` | 1-indexed line number; `null` for file-level |
|
| 73 |
+
| `comments[].category` | `enum` | `bug`, `security`, `performance`, `style`, `documentation` |
|
| 74 |
+
| `comments[].severity` | `enum` | `low`, `medium`, `high`, `critical` |
|
| 75 |
+
| `comments[].message` | `str` | 5–500 chars |
|
| 76 |
+
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
|
| 77 |
+
| `summary` | `str \| null` | Required for `task_3_hard`, optional otherwise |
|
| 78 |
+
| `submit` | `bool` | `true` finalises the review and triggers the grader |
|
| 79 |
+
|
| 80 |
---
|
| 81 |
+
|
| 82 |
+
## Reward Function
|
| 83 |
+
|
| 84 |
+
Rewards are shaped to provide signal over the **full trajectory**, not just on terminal submit.
|
| 85 |
+
|
| 86 |
+
### Per-step (incremental) rewards
|
| 87 |
+
|
| 88 |
+
| Event | Reward |
|
| 89 |
+
|---|---|
|
| 90 |
+
| New valid comment added | `+0.05` per comment (max `+0.15`) |
|
| 91 |
+
| Progress signal (grader score delta) | `+0.5 × Δscore` |
|
| 92 |
+
| Empty step (no new comments) | `−0.05` |
|
| 93 |
+
| Spam (> 2.5× expected comments) | `−0.10` |
|
| 94 |
+
|
| 95 |
+
### On `submit=True` (terminal)
|
| 96 |
+
|
| 97 |
+
```
|
| 98 |
+
submit_reward = score × 0.8 + (0.2 if score ≥ threshold else −0.2)
|
| 99 |
+
```
|
| 100 |
+
|
| 101 |
+
### Per-category penalties (applied to terminal grader score)
|
| 102 |
+
|
| 103 |
+
| Event | Penalty |
|
| 104 |
+
|---|---|
|
| 105 |
+
| False positive (fabricated issue) | `−0.08–0.12` per comment |
|
| 106 |
+
| Missed CRITICAL security issue | `−0.15–0.20` |
|
| 107 |
+
| Missed HIGH issue | `−0.08–0.10` |
|
| 108 |
+
| No summary on task 3 | `−0.10` |
|
| 109 |
+
|
| 110 |
+
All rewards are clipped to `[−1.0, 1.0]`.
|
| 111 |
+
|
| 112 |
+
---
|
| 113 |
+
|
| 114 |
+
## Task Descriptions
|
| 115 |
+
|
| 116 |
+
### Task 1 – Easy: Bug Detection & Style Review
|
| 117 |
+
**File**: `calculator.py` (31 lines) | **Max steps**: 5 | **Pass threshold**: 0.55
|
| 118 |
+
|
| 119 |
+
Covers basic utility functions: `divide`, `average`, `celsius_to_fahrenheit`, `find_max`, `count_words`.
|
| 120 |
+
|
| 121 |
+
**Ground-truth issues (6)**:
|
| 122 |
+
- `divide()` — no zero-division guard (HIGH bug)
|
| 123 |
+
- `average()` — crashes on empty list (HIGH bug)
|
| 124 |
+
- `celsius_to_fahrenheit` — off-by-one (+31 vs +32) (MEDIUM bug)
|
| 125 |
+
- `find_max()` — crashes on empty list (MEDIUM bug)
|
| 126 |
+
- `for i in range(len(lst))` — unpythonic iteration (LOW style)
|
| 127 |
+
- Manual `Counter` reimplementation (LOW style)
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
### Task 2 – Medium: Security & Performance Audit
|
| 132 |
+
**File**: `user_service.py` (55 lines) | **Max steps**: 7 | **Pass threshold**: 0.60
|
| 133 |
+
|
| 134 |
+
A SQLite-backed user management service with authentication.
|
| 135 |
+
|
| 136 |
+
**Ground-truth issues (6)**:
|
| 137 |
+
- SQL injection in `get_user()` — f-string query (CRITICAL security)
|
| 138 |
+
- MD5 password hashing in `create_user()` (CRITICAL security)
|
| 139 |
+
- SQL injection in `delete_user()` (CRITICAL security)
|
| 140 |
+
- MD5 reuse in `authenticate()` (HIGH security)
|
| 141 |
+
- `fetchall()` on unbounded table (HIGH performance)
|
| 142 |
+
- New DB connection per query, no pooling (MEDIUM performance)
|
| 143 |
+
|
| 144 |
+
---
|
| 145 |
+
|
| 146 |
+
### Task 3 – Hard: Comprehensive Code Review
|
| 147 |
+
**File**: `data_pipeline.py` (49 lines) | **Max steps**: 10 | **Pass threshold**: 0.65
|
| 148 |
+
|
| 149 |
+
An analytics data pipeline with CSV loading, row transformation, caching, and stats.
|
| 150 |
+
|
| 151 |
+
**Ground-truth issues (13 across all 5 categories)**:
|
| 152 |
+
- `subprocess.run(shell=True)` with user input — OS command injection (CRITICAL security)
|
| 153 |
+
- `pickle.loads()` on arbitrary cache data — RCE risk (CRITICAL security)
|
| 154 |
+
- Pickling into module-level dict (HIGH security)
|
| 155 |
+
- `compute_stats()` ZeroDivisionError on empty data (HIGH bug)
|
| 156 |
+
- Missing `"value"` key → silent KeyError (MEDIUM bug)
|
| 157 |
+
- `open()` without encoding (MEDIUM bug)
|
| 158 |
+
- Two-pass iteration in `compute_stats` (MEDIUM performance)
|
| 159 |
+
- Subprocess per row instead of batching (MEDIUM performance)
|
| 160 |
+
- `str(stats)` instead of JSON export (LOW style)
|
| 161 |
+
- Module-level mutable global cache (LOW style)
|
| 162 |
+
- `load_data()` missing docstring (LOW documentation)
|
| 163 |
+
- `process_row()` missing docstring (LOW documentation)
|
| 164 |
+
- Insufficient module-level docstring (LOW documentation)
|
| 165 |
+
|
| 166 |
+
A **written summary** is required (`summary` field) — absence incurs a `−0.10` score penalty.
|
| 167 |
+
|
| 168 |
+
---
|
| 169 |
+
|
| 170 |
+
## Expected Baseline Scores (gpt-4o)
|
| 171 |
+
|
| 172 |
+
| Task | Score | Pass? | Notes |
|
| 173 |
+
|---|---|---|---|
|
| 174 |
+
| `task_1_easy` | ~0.75 | ✅ | GPT-4o reliably spots ZeroDivisionError and off-by-one |
|
| 175 |
+
| `task_2_medium` | ~0.65 | ✅ | SQL injection found; MD5 usually flagged; perf issues partial |
|
| 176 |
+
| `task_3_hard` | ~0.55 | ❌ | Pickle RCE and shell injection found; docs often missed (below the 0.65 pass threshold) |
|
| 177 |
+
|
| 178 |
+
---
|
| 179 |
+
|
| 180 |
+
## Setup & Usage
|
| 181 |
+
|
| 182 |
+
### Option A — Docker (recommended)
|
| 183 |
+
|
| 184 |
+
```bash
|
| 185 |
+
# Build
|
| 186 |
+
docker build -t code-review-env .
|
| 187 |
+
|
| 188 |
+
# Run (port 7860)
|
| 189 |
+
docker run -p 7860:7860 code-review-env
|
| 190 |
+
|
| 191 |
+
# Test it
|
| 192 |
+
curl http://localhost:7860/health
|
| 193 |
+
```
|
| 194 |
+
|
| 195 |
+
### Option B — Local Python
|
| 196 |
+
|
| 197 |
+
```bash
|
| 198 |
+
# Install dependencies
|
| 199 |
+
pip install -r requirements.txt
|
| 200 |
+
|
| 201 |
+
# Start the server
|
| 202 |
+
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
|
| 203 |
+
|
| 204 |
+
# Open docs
|
| 205 |
+
open http://localhost:7860/docs
|
| 206 |
+
```
|
| 207 |
+
|
| 208 |
+
### Run the test suite
|
| 209 |
+
|
| 210 |
+
```bash
|
| 211 |
+
pytest tests/ -v
|
| 212 |
+
# Expected: 25 passed
|
| 213 |
+
```
|
| 214 |
+
|
| 215 |
+
### Run the baseline agent
|
| 216 |
+
|
| 217 |
+
```bash
|
| 218 |
+
export OPENAI_API_KEY=sk-...
|
| 219 |
+
|
| 220 |
+
# All tasks (direct mode — no server needed)
|
| 221 |
+
python baseline_agent.py
|
| 222 |
+
|
| 223 |
+
# Single task
|
| 224 |
+
python baseline_agent.py --task task_2_medium
|
| 225 |
+
|
| 226 |
+
# Against a running HTTP server
|
| 227 |
+
python baseline_agent.py --mode http --base-url http://localhost:7860
|
| 228 |
+
```
|
| 229 |
+
|
| 230 |
+
---
|
| 231 |
+
|
| 232 |
+
## API Reference
|
| 233 |
+
|
| 234 |
+
| Endpoint | Method | Description |
|
| 235 |
+
|---|---|---|
|
| 236 |
+
| `/` | GET | HTML landing page |
|
| 237 |
+
| `/health` | GET | Health check |
|
| 238 |
+
| `/tasks` | GET | List all task specs |
|
| 239 |
+
| `/reset` | POST | Start or restart an episode |
|
| 240 |
+
| `/step` | POST | Submit an action |
|
| 241 |
+
| `/state` | GET | Get full serialisable state |
|
| 242 |
+
| `/docs` | GET | Interactive Swagger UI |
|
| 243 |
+
|
| 244 |
+
### Example: Full episode via curl
|
| 245 |
+
|
| 246 |
+
```bash
|
| 247 |
+
# 1. Reset
|
| 248 |
+
curl -X POST http://localhost:7860/reset \
|
| 249 |
+
-H 'Content-Type: application/json' \
|
| 250 |
+
-d '{"task_id": "task_1_easy", "session_id": "demo"}'
|
| 251 |
+
|
| 252 |
+
# 2. Step
|
| 253 |
+
curl -X POST http://localhost:7860/step \
|
| 254 |
+
-H 'Content-Type: application/json' \
|
| 255 |
+
-d '{
|
| 256 |
+
"session_id": "demo",
|
| 257 |
+
"action": {
|
| 258 |
+
"comments": [
|
| 259 |
+
{
|
| 260 |
+
"line": 2,
|
| 261 |
+
"category": "bug",
|
| 262 |
+
"severity": "high",
|
| 263 |
+
"message": "divide() will raise ZeroDivisionError when b is 0.",
|
| 264 |
+
"suggestion": "Guard with: if b == 0: raise ValueError"
|
| 265 |
+
}
|
| 266 |
+
],
|
| 267 |
+
"submit": true
|
| 268 |
+
}
|
| 269 |
+
}'
|
| 270 |
+
|
| 271 |
+
# 3. Check state
|
| 272 |
+
curl "http://localhost:7860/state?session_id=demo"
|
| 273 |
+
```
|
| 274 |
+
|
| 275 |
+
---
|
| 276 |
+
|
| 277 |
+
## Project Structure
|
| 278 |
+
|
| 279 |
+
```
|
| 280 |
+
openenv-code-review/
|
| 281 |
+
├── app.py # FastAPI HTTP server
|
| 282 |
+
├── openenv.yaml # OpenEnv spec metadata
|
| 283 |
+
├── Dockerfile # Container definition
|
| 284 |
+
├── requirements.txt
|
| 285 |
+
├── baseline_agent.py # gpt-4o baseline inference script
|
| 286 |
+
│
|
| 287 |
+
├── env/
|
| 288 |
+
│ ├── models.py # Pydantic typed models (Observation, Action, Reward, …)
|
| 289 |
+
│ └── environment.py # CodeReviewEnv — step() / reset() / state()
|
| 290 |
+
│
|
| 291 |
+
├── corpus/
|
| 292 |
+
│ └── snippets.py # Python snippets with ground-truth issues
|
| 293 |
+
│
|
| 294 |
+
├── graders/
|
| 295 |
+
│ └── graders.py # Task1Grader, Task2Grader, Task3Grader
|
| 296 |
+
│
|
| 297 |
+
└── tests/
|
| 298 |
+
└── test_env.py # 25-test pytest suite (all passing)
|
| 299 |
+
```
|
| 300 |
+
|
| 301 |
+
---
|
| 302 |
+
|
| 303 |
+
## Deploying to Hugging Face Spaces
|
| 304 |
+
|
| 305 |
+
1. Create a new Space with **Docker** SDK
|
| 306 |
+
2. Push this repository to the Space
|
| 307 |
+
3. Set `OPENAI_API_KEY` as a Space secret (only needed for baseline script)
|
| 308 |
+
4. The Space will auto-build and expose port 7860
|
| 309 |
+
|
| 310 |
+
```yaml
|
| 311 |
+
# README.md frontmatter for HF Spaces
|
| 312 |
+
---
|
| 313 |
+
title: CodeReview OpenEnv
|
| 314 |
+
emoji: 🔍
|
| 315 |
+
colorFrom: blue
|
| 316 |
+
colorTo: indigo
|
| 317 |
sdk: docker
|
| 318 |
pinned: false
|
| 319 |
+
tags:
|
| 320 |
+
- openenv
|
| 321 |
+
- code-review
|
| 322 |
+
- ai-agent
|
| 323 |
+
- evaluation
|
| 324 |
---
|
| 325 |
+
```
|
| 326 |
+
|
| 327 |
+
---
|
| 328 |
+
|
| 329 |
+
## License
|
| 330 |
|
| 331 |
+
MIT
|
RULES.md
ADDED
|
@@ -0,0 +1,261 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# RULES.md — CodeReview OpenEnv Agent Grounding Rules
|
| 2 |
+
|
| 3 |
+
You are an AI agent operating inside the **CodeReview OpenEnv** environment.
|
| 4 |
+
Read every rule below before generating any action. Violating these rules
|
| 5 |
+
will cause your score to drop or your episode to terminate with a penalty.
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. YOUR ONLY JOB
|
| 10 |
+
|
| 11 |
+
You are reviewing a **Python source file** for real issues.
|
| 12 |
+
You are **not** writing code. You are **not** explaining Python concepts.
|
| 13 |
+
You are **not** summarising the file. You are finding specific, locatable
|
| 14 |
+
problems and describing them precisely.
|
| 15 |
+
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
## 2. OUTPUT FORMAT — NON-NEGOTIABLE
|
| 19 |
+
|
| 20 |
+
You must respond with **one JSON object and nothing else**.
|
| 21 |
+
No markdown. No backticks. No preamble. No explanation outside the JSON.
|
| 22 |
+
|
| 23 |
+
```
|
| 24 |
+
{
|
| 25 |
+
"comments": [ ...ReviewComment objects... ],
|
| 26 |
+
"summary": "string or null",
|
| 27 |
+
"submit": true or false
|
| 28 |
+
}
|
| 29 |
+
```
|
| 30 |
+
|
| 31 |
+
Any response that is not valid JSON will be treated as an empty action
|
| 32 |
+
and penalised with −0.05 reward.
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## 3. ReviewComment SCHEMA — EXACT TYPES REQUIRED
|
| 37 |
+
|
| 38 |
+
Every object inside `comments` must have exactly these fields:
|
| 39 |
+
|
| 40 |
+
| Field | Type | Allowed values / constraints |
|
| 41 |
+
|--------------|-----------------|-----------------------------------------------------------|
|
| 42 |
+
| `line` | int or null | 1-indexed line number from the code. null = file-level |
|
| 43 |
+
| `category` | string (enum) | `"bug"` `"security"` `"performance"` `"style"` `"documentation"` |
|
| 44 |
+
| `severity` | string (enum) | `"low"` `"medium"` `"high"` `"critical"` |
|
| 45 |
+
| `message` | string | 5–500 characters. Must describe the SPECIFIC issue. |
|
| 46 |
+
| `suggestion` | string or null | Optional fix. Max 500 characters. |
|
| 47 |
+
|
| 48 |
+
Do not add extra fields. Do not omit required fields. Do not use integers
|
| 49 |
+
for `category` or `severity`.
|
| 50 |
+
|
| 51 |
+
---
|
| 52 |
+
|
| 53 |
+
## 4. CATEGORY SCOPE — ONLY FLAG WHAT YOU ARE ASKED TO FLAG
|
| 54 |
+
|
| 55 |
+
The `instructions` field in the observation tells you which categories
|
| 56 |
+
to check. **Do not submit comments for categories outside that scope.**
|
| 57 |
+
|
| 58 |
+
- Task 1 (Easy): `bug`, `style` only
|
| 59 |
+
- Task 2 (Medium): `security`, `performance` only
|
| 60 |
+
- Task 3 (Hard): all five categories
|
| 61 |
+
|
| 62 |
+
Submitting comments in the wrong category is treated as a false positive
|
| 63 |
+
and incurs a penalty. The grader will ignore them.
|
| 64 |
+
|
| 65 |
+
---
|
| 66 |
+
|
| 67 |
+
## 5. LINE NUMBERS — BE PRECISE
|
| 68 |
+
|
| 69 |
+
- Count lines from **1** (the first line of the source is line 1).
|
| 70 |
+
- The source shown in the observation has line numbers prefixed — use them.
|
| 71 |
+
- If you cannot pinpoint a line, use `null` (file-level comment).
|
| 72 |
+
- Do not guess or approximate. Off-by-more-than-3 lines reduces your score.
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
## 6. NO FABRICATION
|
| 77 |
+
|
| 78 |
+
Do not invent issues that are not present in the code.
|
| 79 |
+
Every comment you submit must correspond to a real, demonstrable problem
|
| 80 |
+
in the snippet as written. Ask yourself:
|
| 81 |
+
|
| 82 |
+
> "Can I point to the exact line where this fails and show the failure?"
|
| 83 |
+
|
| 84 |
+
If the answer is no, do not submit that comment.
|
| 85 |
+
|
| 86 |
+
False positives reduce your score. Many false positives can bring your
|
| 87 |
+
score below zero.
|
| 88 |
+
|
| 89 |
+
---
|
| 90 |
+
|
| 91 |
+
## 7. SEVERITY CALIBRATION
|
| 92 |
+
|
| 93 |
+
Use severity consistently:
|
| 94 |
+
|
| 95 |
+
| Severity | Meaning | Examples |
|
| 96 |
+
|------------|------------------------------------------------------------|---------------------------------------------------|
|
| 97 |
+
| `critical` | Exploitable in production. Immediate risk of data loss, RCE, auth bypass. | SQL injection, pickle.loads on untrusted data, shell=True with user input |
|
| 98 |
+
| `high` | Causes crashes, data corruption, or major security weakness under normal use. | ZeroDivisionError on empty input, MD5 passwords, fetchall() on unbounded table |
|
| 99 |
+
| `medium` | Incorrect behaviour in edge cases, significant performance hit, notable security weakness. | Missing encoding param, off-by-one in formula, O(n) per-row subprocess |
|
| 100 |
+
| `low` | Style, readability, minor inefficiency, missing docs. | Unpythonic loop, manual Counter, missing docstring |
|
| 101 |
+
|
| 102 |
+
Do not mark everything as `critical`. Severity inflation is penalised.
|
| 103 |
+
|
| 104 |
+
---
|
| 105 |
+
|
| 106 |
+
## 8. MESSAGE QUALITY
|
| 107 |
+
|
| 108 |
+
A good message answers three questions:
|
| 109 |
+
1. **What** is wrong?
|
| 110 |
+
2. **Where** exactly (line / function)?
|
| 111 |
+
3. **Why** does it matter?
|
| 112 |
+
|
| 113 |
+
**Good**: `"average() divides by len(numbers) without checking for an empty list; raises ZeroDivisionError when called with []."`
|
| 114 |
+
|
| 115 |
+
**Bad**: `"This function has a bug."` — too vague, will not match ground truth.
|
| 116 |
+
**Bad**: `"Consider adding error handling."` — not specific enough.
|
| 117 |
+
**Bad**: `"Line 8 is problematic."` — no description of the actual problem.
|
| 118 |
+
|
| 119 |
+
Minimum 5 characters. Maximum 500 characters.
|
| 120 |
+
|
| 121 |
+
---
|
| 122 |
+
|
| 123 |
+
## 9. SUGGESTIONS ARE OPTIONAL BUT VALUABLE
|
| 124 |
+
|
| 125 |
+
- If you include a `suggestion`, make it concrete and correct Python.
|
| 126 |
+
- Do not include suggestions that are themselves buggy or insecure.
|
| 127 |
+
- A suggestion that introduces a new vulnerability is worse than no suggestion.
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## 10. THE `summary` FIELD
|
| 132 |
+
|
| 133 |
+
- **Task 3 (Hard) only**: `summary` is **required**. Omitting it deducts 0.10 from your score.
|
| 134 |
+
- For Tasks 1 and 2: `summary` is optional. Include it if it adds value.
|
| 135 |
+
- The summary should cover the overall risk level and the main themes found.
|
| 136 |
+
- Mention key categories found: e.g. "security", "injection", "pickle", "performance", "documentation".
|
| 137 |
+
- More relevant keywords in the summary = small score bonus (up to +0.15).
|
| 138 |
+
|
| 139 |
+
---
|
| 140 |
+
|
| 141 |
+
## 11. WHEN TO SET `"submit": true`
|
| 142 |
+
|
| 143 |
+
Set `submit` to `true` when you believe your review is complete.
|
| 144 |
+
The grader runs immediately on submit and the episode ends.
|
| 145 |
+
|
| 146 |
+
Set `submit` to `false` if you want to add more comments in the next step.
|
| 147 |
+
You have `max_steps` steps per episode (varies by task: 5 / 7 / 10).
|
| 148 |
+
|
| 149 |
+
Rules:
|
| 150 |
+
- You MUST set `submit: true` on your final step.
|
| 151 |
+
- If you run out of steps without submitting, the episode auto-terminates.
|
| 152 |
+
- Do not waste steps submitting empty comment lists. Each empty step costs −0.05.
|
| 153 |
+
|
| 154 |
+
Recommended strategy: submit everything in **one step** unless you are
|
| 155 |
+
doing iterative refinement across multiple steps.
|
| 156 |
+
|
| 157 |
+
---
|
| 158 |
+
|
| 159 |
+
## 12. DEDUPLICATION — DO NOT REPEAT YOURSELF
|
| 160 |
+
|
| 161 |
+
The environment deduplicates comments across steps by `(line, category, message[:40])`.
|
| 162 |
+
Submitting the same comment again in a later step gives you zero credit for it.
|
| 163 |
+
Check `previous_comments` in the observation and do not re-submit anything
|
| 164 |
+
already there.
|
| 165 |
+
|
| 166 |
+
---
|
| 167 |
+
|
| 168 |
+
## 13. DO NOT SPAM
|
| 169 |
+
|
| 170 |
+
Submitting more than 2.5× the expected number of comments triggers a spam penalty (−0.10).
|
| 171 |
+
Quality over quantity. If you find 6 real issues, submit 6.
|
| 172 |
+
Do not pad with speculative or low-confidence comments to boost apparent coverage.
|
| 173 |
+
|
| 174 |
+
---
|
| 175 |
+
|
| 176 |
+
## 14. MULTI-STEP STRATEGY (if using more than 1 step)
|
| 177 |
+
|
| 178 |
+
Step 1 — Read carefully. Submit your highest-confidence comments.
|
| 179 |
+
Step 2 — Review `feedback` and `previous_comments` in the observation.
|
| 180 |
+
Add only NEW comments not already submitted.
|
| 181 |
+
Step N — Set `submit: true` when confident you have covered all categories.
|
| 182 |
+
|
| 183 |
+
Do not submit `submit: true` before you have reviewed the whole file.
|
| 184 |
+
|
| 185 |
+
---
|
| 186 |
+
|
| 187 |
+
## 15. WHAT THE GRADER CHECKS
|
| 188 |
+
|
| 189 |
+
The grader matches your comments against a hidden ground-truth list using:
|
| 190 |
+
- **Category match** (exact)
|
| 191 |
+
- **Line proximity** (within ±3 lines)
|
| 192 |
+
- **Keyword overlap** (≥25% of significant words from the truth message appear in yours)
|
| 193 |
+
- **Severity proximity** (within 1 level)
|
| 194 |
+
|
| 195 |
+
You get full credit for exact matches, partial credit (0.5×) for right issue
|
| 196 |
+
wrong line. You get nothing for wrong category, and a penalty for fabricated issues.
|
| 197 |
+
|
| 198 |
+
**Implication**: Write messages in plain, specific language that describes the
|
| 199 |
+
actual vulnerability or flaw. Technical terms matter (e.g. "SQL injection",
|
| 200 |
+
"ZeroDivisionError", "MD5", "shell=True", "pickle.loads").
|
| 201 |
+
|
| 202 |
+
---
|
| 203 |
+
|
| 204 |
+
## 16. FORBIDDEN BEHAVIOURS
|
| 205 |
+
|
| 206 |
+
The following will actively hurt your score:
|
| 207 |
+
|
| 208 |
+
| Behaviour | Consequence |
|
| 209 |
+
|---|---|
|
| 210 |
+
| Responding with non-JSON text | Treated as empty action, −0.05 |
|
| 211 |
+
| Submitting comments in wrong category | False positive penalty |
|
| 212 |
+
| Using categories not in the task scope | False positive penalty |
|
| 213 |
+
| Inventing issues not in the code | False positive penalty per comment |
|
| 214 |
+
| Marking all issues as `critical` | Severity mismatch reduces match score |
|
| 215 |
+
| Repeating already-submitted comments | No credit (deduped) |
|
| 216 |
+
| Submitting > 2.5× expected comments | Spam penalty −0.10 |
|
| 217 |
+
| Omitting `summary` on Task 3 | −0.10 from final score |
|
| 218 |
+
| Calling `submit: true` with 0 comments | Episode ends with near-zero score |
|
| 219 |
+
|
| 220 |
+
---
|
| 221 |
+
|
| 222 |
+
## 17. CHECKLIST BEFORE YOU RESPOND
|
| 223 |
+
|
| 224 |
+
Before generating your JSON, run through this mentally:
|
| 225 |
+
|
| 226 |
+
- [ ] Is my response a single valid JSON object with no surrounding text?
|
| 227 |
+
- [ ] Does every comment have all 5 fields with correct types?
|
| 228 |
+
- [ ] Are all my categories within the task scope defined in `instructions`?
|
| 229 |
+
- [ ] Is every line number accurate (1-indexed from the source)?
|
| 230 |
+
- [ ] Can I justify every comment with a specific line and a concrete failure mode?
|
| 231 |
+
- [ ] Have I avoided re-submitting comments from `previous_comments`?
|
| 232 |
+
- [ ] For Task 3: have I included a `summary` with key technical themes?
|
| 233 |
+
- [ ] Is my severity realistic (not everything is `critical`)?
|
| 234 |
+
- [ ] Should I set `submit: true` now, or do I have more to add?
|
| 235 |
+
|
| 236 |
+
---
|
| 237 |
+
|
| 238 |
+
## QUICK REFERENCE
|
| 239 |
+
|
| 240 |
+
```json
|
| 241 |
+
{
|
| 242 |
+
"comments": [
|
| 243 |
+
{
|
| 244 |
+
"line": 10,
|
| 245 |
+
"category": "security",
|
| 246 |
+
"severity": "critical",
|
| 247 |
+
"message": "get_user() interpolates username directly into the SQL query string, enabling SQL injection attacks.",
|
| 248 |
+
"suggestion": "Use parameterised queries: cursor.execute('SELECT * FROM users WHERE username=?', (username,))"
|
| 249 |
+
},
|
| 250 |
+
{
|
| 251 |
+
"line": 19,
|
| 252 |
+
"category": "security",
|
| 253 |
+
"severity": "critical",
|
| 254 |
+
"message": "MD5 is a cryptographically broken and extremely fast hash, unsuitable for password storage: attackers can brute-force billions of MD5 guesses per second.",
|
| 255 |
+
"suggestion": "Replace with bcrypt.hashpw(password.encode(), bcrypt.gensalt()) or hashlib.scrypt."
|
| 256 |
+
}
|
| 257 |
+
],
|
| 258 |
+
"summary": "Critical security issues found: SQL injection on lines 10 and 52, broken MD5 password hashing on lines 19 and 46. Performance issue: fetchall() loads entire table. Connection pooling absent.",
|
| 259 |
+
"submit": true
|
| 260 |
+
}
|
| 261 |
+
```
|
__pycache__/app.cpython-313.pyc
ADDED
|
Binary file (6.58 kB). View file
|
|
|
app.py
ADDED
|
@@ -0,0 +1,145 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
FastAPI HTTP server for CodeReview OpenEnv.
|
| 3 |
+
|
| 4 |
+
Exposes the environment as a REST API for agents to interact with.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
from typing import Any, Dict, Optional
|
| 10 |
+
|
| 11 |
+
from fastapi import FastAPI, HTTPException, Query
|
| 12 |
+
from fastapi.responses import HTMLResponse
|
| 13 |
+
from pydantic import BaseModel
|
| 14 |
+
|
| 15 |
+
from env.environment import CodeReviewEnv, TASK_SPECS
|
| 16 |
+
from env.models import Action, ReviewCategory, ReviewComment, Severity
|
| 17 |
+
|
| 18 |
+
# ---------------------------------------------------------------------------
|
| 19 |
+
# App setup
|
| 20 |
+
# ---------------------------------------------------------------------------
|
| 21 |
+
|
| 22 |
+
# Single FastAPI application instance; served by uvicorn (HF Spaces port 7860).
app = FastAPI(
    title="CodeReview OpenEnv",
    description="An OpenEnv-compliant AI training environment for Python code review.",
    version="1.0.0",
)

# In-memory session store: session_id -> live environment instance.
# NOTE(review): per-process only — state is lost on restart and would not be
# shared across multiple uvicorn workers.
SESSIONS: Dict[str, CodeReviewEnv] = {}
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
# ---------------------------------------------------------------------------
|
| 33 |
+
# Request / Response schemas
|
| 34 |
+
# ---------------------------------------------------------------------------
|
| 35 |
+
|
| 36 |
+
class ResetRequest(BaseModel):
    """Body for POST /reset: which task to start and under which session."""

    # Must be a key of TASK_SPECS; validated in the /reset endpoint.
    task_id: str = "task_1_easy"
    # Client-chosen identifier; one environment instance is kept per session.
    session_id: str = "default"
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
class StepRequest(BaseModel):
    """Body for POST /step: the target session and the raw action payload."""

    # Session previously created by POST /reset.
    session_id: str = "default"
    # Raw action dict; parsed leniently into an Action model in the endpoint.
    action: Dict[str, Any]
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
# ---------------------------------------------------------------------------
|
| 47 |
+
# Endpoints
|
| 48 |
+
# ---------------------------------------------------------------------------
|
| 49 |
+
|
| 50 |
+
import os
|
| 51 |
+
|
| 52 |
+
@app.get("/", response_class=HTMLResponse)
|
| 53 |
+
def landing_page():
    """Serve the static HTML landing page, or a minimal error page if absent."""
    here = os.path.dirname(__file__)
    template_path = os.path.join(here, "templates", "index.html")
    if not os.path.exists(template_path):
        return "<html><body><h1>Error: templates/index.html not found.</h1></body></html>"
    with open(template_path, "r", encoding="utf-8") as fh:
        return fh.read()
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
@app.get("/health")
|
| 64 |
+
def health():
    """Liveness probe used by the hosting platform; always reports OK."""
    status = {"status": "ok"}
    return status
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
@app.get("/tasks")
|
| 70 |
+
def list_tasks():
    """Return the spec of every available task, keyed by task id."""
    specs = {}
    for task_id, spec in TASK_SPECS.items():
        specs[task_id] = spec.model_dump()
    return specs
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
@app.post("/reset")
|
| 79 |
+
def reset(req: ResetRequest):
    """Create a fresh environment for req.task_id and bind it to req.session_id."""
    if req.task_id not in TASK_SPECS:
        valid = list(TASK_SPECS.keys())
        raise HTTPException(
            status_code=400,
            detail=f"Unknown task_id '{req.task_id}'. Choose from: {valid}",
        )
    # A reset always replaces any environment already bound to this session.
    environment = CodeReviewEnv(task_id=req.task_id)
    first_obs = environment.reset()
    SESSIONS[req.session_id] = environment
    return {"observation": first_obs.model_dump(), "session_id": req.session_id}
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
@app.post("/step")
|
| 93 |
+
def step(req: StepRequest):
|
| 94 |
+
"""Submit an action for the given session."""
|
| 95 |
+
env = SESSIONS.get(req.session_id)
|
| 96 |
+
if env is None:
|
| 97 |
+
raise HTTPException(
|
| 98 |
+
status_code=404,
|
| 99 |
+
detail=f"Session '{req.session_id}' not found. Call /reset first.",
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
# Parse the action dict into an Action model
|
| 103 |
+
action_dict = req.action
|
| 104 |
+
comments = []
|
| 105 |
+
for c in action_dict.get("comments", []):
|
| 106 |
+
try:
|
| 107 |
+
comments.append(ReviewComment(
|
| 108 |
+
line=c.get("line"),
|
| 109 |
+
category=ReviewCategory(c.get("category", "bug")),
|
| 110 |
+
severity=Severity(c.get("severity", "medium")),
|
| 111 |
+
message=c.get("message", ""),
|
| 112 |
+
suggestion=c.get("suggestion"),
|
| 113 |
+
))
|
| 114 |
+
except Exception:
|
| 115 |
+
pass # skip malformed comments
|
| 116 |
+
|
| 117 |
+
action = Action(
|
| 118 |
+
comments=comments,
|
| 119 |
+
summary=action_dict.get("summary"),
|
| 120 |
+
submit=action_dict.get("submit", False),
|
| 121 |
+
)
|
| 122 |
+
|
| 123 |
+
try:
|
| 124 |
+
result = env.step(action)
|
| 125 |
+
except RuntimeError as e:
|
| 126 |
+
raise HTTPException(status_code=400, detail=str(e))
|
| 127 |
+
|
| 128 |
+
return {
|
| 129 |
+
"observation": result.observation.model_dump(),
|
| 130 |
+
"reward": result.reward.model_dump(),
|
| 131 |
+
"done": result.done,
|
| 132 |
+
"info": result.info,
|
| 133 |
+
}
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
@app.get("/state")
|
| 137 |
+
def get_state(session_id: str = Query(default="default")):
|
| 138 |
+
"""Return full serialisable state for the given session."""
|
| 139 |
+
env = SESSIONS.get(session_id)
|
| 140 |
+
if env is None:
|
| 141 |
+
raise HTTPException(
|
| 142 |
+
status_code=404,
|
| 143 |
+
detail=f"Session '{session_id}' not found. Call /reset first.",
|
| 144 |
+
)
|
| 145 |
+
return env.state().model_dump()
|
baseline_agent.py
ADDED
|
@@ -0,0 +1,316 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
baseline_agent.py – Baseline inference script for CodeReview OpenEnv.
|
| 4 |
+
|
| 5 |
+
Runs gpt-4o against all three tasks using the OpenAI client.
|
| 6 |
+
Reads credentials from OPENAI_API_KEY environment variable.
|
| 7 |
+
Connects to the env either locally (direct Python import) or via HTTP.
|
| 8 |
+
|
| 9 |
+
Usage
|
| 10 |
+
-----
|
| 11 |
+
# Direct mode (no server needed):
|
| 12 |
+
python baseline_agent.py
|
| 13 |
+
|
| 14 |
+
# Against a running server:
|
| 15 |
+
python baseline_agent.py --mode http --base-url http://localhost:7860
|
| 16 |
+
|
| 17 |
+
# Single task:
|
| 18 |
+
python baseline_agent.py --task task_2_medium
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
import argparse
|
| 24 |
+
import json
|
| 25 |
+
import os
|
| 26 |
+
import sys
|
| 27 |
+
import textwrap
|
| 28 |
+
import time
|
| 29 |
+
from typing import Any, Dict, List, Optional
|
| 30 |
+
|
| 31 |
+
import requests
|
| 32 |
+
from openai import OpenAI
|
| 33 |
+
|
| 34 |
+
# ---------------------------------------------------------------------------
|
| 35 |
+
# Configuration
|
| 36 |
+
# ---------------------------------------------------------------------------
|
| 37 |
+
|
| 38 |
+
# Model and credentials come from the environment so no code change is needed
# to switch models or keys.
MODEL = os.environ.get("BASELINE_MODEL", "gpt-4o")
API_KEY = os.environ.get("OPENAI_API_KEY", "")
# Only used in --mode http (address of a running CodeReview OpenEnv server).
ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
# All task ids, in increasing difficulty order.
TASKS = ["task_1_easy", "task_2_medium", "task_3_hard"]
|
| 42 |
+
|
| 43 |
+
# ---------------------------------------------------------------------------
|
| 44 |
+
# Prompt construction
|
| 45 |
+
# ---------------------------------------------------------------------------
|
| 46 |
+
|
| 47 |
+
SYSTEM_PROMPT = textwrap.dedent("""
|
| 48 |
+
You are an expert Python code reviewer.
|
| 49 |
+
You will be given a code snippet along with review instructions.
|
| 50 |
+
Your job is to produce a JSON action object that identifies issues in the code.
|
| 51 |
+
|
| 52 |
+
The JSON object you return must match this schema exactly:
|
| 53 |
+
{
|
| 54 |
+
"comments": [
|
| 55 |
+
{
|
| 56 |
+
"line": <int or null>,
|
| 57 |
+
"category": <"bug"|"security"|"performance"|"style"|"documentation">,
|
| 58 |
+
"severity": <"low"|"medium"|"high"|"critical">,
|
| 59 |
+
"message": "<clear description of the issue>",
|
| 60 |
+
"suggestion": "<optional fix>"
|
| 61 |
+
}
|
| 62 |
+
],
|
| 63 |
+
"summary": "<overall assessment – required for hard tasks, optional otherwise>",
|
| 64 |
+
"submit": true
|
| 65 |
+
}
|
| 66 |
+
|
| 67 |
+
Rules:
|
| 68 |
+
- Only flag genuine issues. Do not fabricate problems.
|
| 69 |
+
- Be precise about line numbers (1-indexed from the code).
|
| 70 |
+
- Match the categories listed in the instructions.
|
| 71 |
+
- Always set "submit": true when you believe your review is complete.
|
| 72 |
+
- Return ONLY the JSON object. No markdown, no explanations.
|
| 73 |
+
""").strip()
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
def build_user_message(observation: dict) -> str:
    """Render one environment observation into the user prompt for the LLM."""
    snippet = observation["snippet"]
    instructions = observation["instructions"]
    previous = observation.get("previous_comments", [])

    # Prefix each source line with a 1-indexed, right-aligned line number.
    source_lines = snippet["source"].splitlines()
    numbered_source = "\n".join(
        f"{idx:3d} {text}" for idx, text in enumerate(source_lines, start=1)
    )

    parts = [f"""
{instructions}

### File: {snippet['file_name']}
```python
{numbered_source}
```
"""]
    if previous:
        # Remind the model what it already said so it does not repeat itself.
        parts.append(f"\n### Your previous comments ({len(previous)} so far):\n")
        for c in previous:
            parts.append(
                f" - L{c.get('line','?')} [{c['category']}] {c['message'][:80]}\n"
            )

    return "".join(parts).strip()
|
| 100 |
+
|
| 101 |
+
|
| 102 |
+
# ---------------------------------------------------------------------------
|
| 103 |
+
# Direct mode (import env directly)
|
| 104 |
+
# ---------------------------------------------------------------------------
|
| 105 |
+
|
| 106 |
+
def run_direct(task_id: str, client: OpenAI) -> dict:
    """Run the agent against the environment by direct Python import.

    Loops the LLM until the episode is done or max_steps is reached.
    Returns a summary dict: task_id, steps, total_reward, final_score,
    passed, threshold.
    """
    # Import here to avoid circular dependency when running in HTTP mode
    sys.path.insert(0, os.path.dirname(__file__))
    from env.environment import CodeReviewEnv
    from env.models import Action, ReviewComment, ReviewCategory, Severity

    env = CodeReviewEnv(task_id=task_id)
    obs = env.reset()

    total_reward = 0.0
    final_score = 0.0
    steps_taken = 0

    for step_num in range(env.spec.max_steps):
        user_msg = build_user_message(obs.model_dump())

        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_msg},
                ],
                temperature=0.2,
                response_format={"type": "json_object"},
            )
            raw = response.choices[0].message.content or "{}"
            action_dict = json.loads(raw)
        except Exception as e:
            # On any API/JSON failure, submit an empty review so the episode
            # still terminates instead of hanging.
            print(f" [!] LLM error on step {step_num}: {e}")
            action_dict = {"comments": [], "submit": True}

        # Build Action — malformed comment dicts are dropped, not fatal.
        comments = []
        for c in action_dict.get("comments", []):
            try:
                comments.append(ReviewComment(
                    line=c.get("line"),
                    category=ReviewCategory(c.get("category", "bug")),
                    severity=Severity(c.get("severity", "medium")),
                    message=c.get("message", ""),
                    suggestion=c.get("suggestion"),
                ))
            except Exception:
                pass  # skip malformed comments

        action = Action(
            comments=comments,
            summary=action_dict.get("summary"),
            # Default True here (unlike the server's False): the baseline
            # should finish an episode even if the model omits "submit".
            submit=action_dict.get("submit", True),
        )

        result = env.step(action)
        total_reward += result.reward.value
        steps_taken += 1
        # Last grader score seen wins; defaults to 0.0 if absent.
        final_score = result.info.get("grader", {}).get("score", 0.0)

        print(f" Step {step_num+1}: reward={result.reward.value:+.3f} | "
              f"comments={result.info['total_comments']} | "
              f"score={final_score:.3f}")

        obs = result.observation
        if result.done:
            break

    passed = final_score >= env.spec.passing_threshold
    return {
        "task_id": task_id,
        "steps": steps_taken,
        "total_reward": round(total_reward, 4),
        "final_score": round(final_score, 4),
        "passed": passed,
        "threshold": env.spec.passing_threshold,
    }
|
| 181 |
+
|
| 182 |
+
|
| 183 |
+
# ---------------------------------------------------------------------------
|
| 184 |
+
# HTTP mode (against a running server)
|
| 185 |
+
# ---------------------------------------------------------------------------
|
| 186 |
+
|
| 187 |
+
def run_http(task_id: str, client: OpenAI, base_url: str) -> dict:
    """Run the agent against a live HTTP server.

    Same loop as run_direct(), but every env interaction goes through the
    REST API at base_url. Returns the same summary dict shape.
    """
    # Unique per run so parallel invocations don't share server-side state.
    session_id = f"baseline-{task_id}-{int(time.time())}"
    headers = {"Content-Type": "application/json"}

    # Reset
    r = requests.post(f"{base_url}/reset",
                      json={"task_id": task_id, "session_id": session_id}, headers=headers)
    r.raise_for_status()
    obs = r.json()["observation"]

    # Get task spec for threshold
    tasks_r = requests.get(f"{base_url}/tasks")
    spec = tasks_r.json()[task_id]
    max_steps = spec["max_steps"]
    threshold = spec["passing_threshold"]

    total_reward = 0.0
    final_score = 0.0
    steps_taken = 0

    for step_num in range(max_steps):
        user_msg = build_user_message(obs)

        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_msg},
                ],
                temperature=0.2,
                response_format={"type": "json_object"},
            )
            action_dict = json.loads(response.choices[0].message.content or "{}")
        except Exception as e:
            # Fall back to an empty submitted review so the episode ends.
            print(f" [!] LLM error: {e}")
            action_dict = {"comments": [], "submit": True}

        # The server parses the raw dict leniently; no client-side validation.
        step_r = requests.post(
            f"{base_url}/step",
            json={"session_id": session_id, "action": action_dict},
            headers=headers,
        )
        step_r.raise_for_status()
        result = step_r.json()

        total_reward += result["reward"]["value"]
        steps_taken += 1
        final_score = result["info"].get("grader", {}).get("score", 0.0)

        print(f" Step {step_num+1}: reward={result['reward']['value']:+.3f} | "
              f"comments={result['info']['total_comments']} | "
              f"score={final_score:.3f}")

        obs = result["observation"]
        if result["done"]:
            break

    return {
        "task_id": task_id,
        "steps": steps_taken,
        "total_reward": round(total_reward, 4),
        "final_score": round(final_score, 4),
        "passed": final_score >= threshold,
        "threshold": threshold,
    }
|
| 254 |
+
|
| 255 |
+
|
| 256 |
+
# ---------------------------------------------------------------------------
|
| 257 |
+
# Main
|
| 258 |
+
# ---------------------------------------------------------------------------
|
| 259 |
+
|
| 260 |
+
def main():
    """CLI entry point: run the baseline on one or all tasks and print a table.

    Exits with status 1 if OPENAI_API_KEY is unset. Writes a machine-readable
    copy of the results to baseline_results.json in the working directory.
    """
    parser = argparse.ArgumentParser(description="Baseline agent for CodeReview OpenEnv")
    parser.add_argument("--mode", choices=["direct", "http"], default="direct")
    parser.add_argument("--base-url", default=ENV_BASE_URL)
    parser.add_argument("--task", choices=TASKS + ["all"], default="all")
    args = parser.parse_args()

    if not API_KEY:
        print("ERROR: OPENAI_API_KEY environment variable not set.")
        sys.exit(1)

    client = OpenAI(api_key=API_KEY)
    tasks_to_run = TASKS if args.task == "all" else [args.task]

    print(f"\n{'='*60}")
    print(f" CodeReview OpenEnv – Baseline Agent ({MODEL})")
    print(f" Mode: {args.mode}")
    print(f"{'='*60}\n")

    results: List[dict] = []
    for task_id in tasks_to_run:
        print(f"▶ Running {task_id} ...")
        t0 = time.time()
        if args.mode == "direct":
            r = run_direct(task_id, client)
        else:
            r = run_http(task_id, client, args.base_url)
        elapsed = round(time.time() - t0, 1)
        r["elapsed_s"] = elapsed
        results.append(r)
        status = "✅ PASSED" if r["passed"] else "❌ FAILED"
        print(f" → {status} | score={r['final_score']:.3f} | reward={r['total_reward']:+.3f} | {elapsed}s\n")

    # Summary table
    # results is never empty here: tasks_to_run always has >= 1 entry, so the
    # averages below cannot divide by zero.
    print(f"\n{'='*60}")
    print(f" BASELINE RESULTS")
    print(f"{'='*60}")
    print(f" {'Task':<22} {'Score':>7} {'Threshold':>10} {'Reward':>8} {'Pass':>6}")
    print(f" {'-'*55}")
    for r in results:
        print(f" {r['task_id']:<22} {r['final_score']:>7.3f} {r['threshold']:>10.2f} "
              f"{r['total_reward']:>+8.3f} {'✅' if r['passed'] else '❌':>6}")
    avg_score = sum(r["final_score"] for r in results) / len(results)
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    print(f" {'-'*55}")
    print(f" {'AVERAGE':<22} {avg_score:>7.3f} {'':>10} {'':>8} {pass_rate*100:>5.0f}%")
    print(f"{'='*60}\n")

    # Save results
    out_path = "baseline_results.json"
    with open(out_path, "w") as f:
        json.dump({"model": MODEL, "results": results}, f, indent=2)
    print(f" Results saved to {out_path}")
|
| 313 |
+
|
| 314 |
+
|
| 315 |
+
if __name__ == "__main__":
|
| 316 |
+
main()
|
corpus/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# corpus package
|
corpus/__pycache__/__init__.cpython-313.pyc
ADDED
|
Binary file (155 Bytes). View file
|
|
|
corpus/__pycache__/snippets.cpython-313.pyc
ADDED
|
Binary file (11.7 kB). View file
|
|
|
corpus/snippets.py
ADDED
|
@@ -0,0 +1,390 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Code corpus: Python snippets with embedded ground-truth issues.
|
| 3 |
+
|
| 4 |
+
Each entry has:
|
| 5 |
+
- snippet : CodeSnippet to show the agent
|
| 6 |
+
- issues : list of ground-truth ReviewComment objects the grader checks against
|
| 7 |
+
- task_id : which task this belongs to
|
| 8 |
+
"""
|
| 9 |
+
|
| 10 |
+
from __future__ import annotations
|
| 11 |
+
|
| 12 |
+
from env.models import CodeSnippet, ReviewCategory, ReviewComment, Severity
|
| 13 |
+
|
| 14 |
+
# ---------------------------------------------------------------------------
|
| 15 |
+
# TASK 1 – Easy (Bug detection + Code style)
|
| 16 |
+
# ---------------------------------------------------------------------------
|
| 17 |
+
|
| 18 |
+
TASK1_SNIPPET = CodeSnippet(
|
| 19 |
+
file_name="calculator.py",
|
| 20 |
+
source='''\
|
| 21 |
+
def divide(a, b):
|
| 22 |
+
return a / b # line 2
|
| 23 |
+
|
| 24 |
+
def average(numbers):
|
| 25 |
+
total = 0
|
| 26 |
+
for n in numbers:
|
| 27 |
+
total = total + n
|
| 28 |
+
return total / len(numbers) # line 8
|
| 29 |
+
|
| 30 |
+
def celsius_to_fahrenheit(c):
|
| 31 |
+
return c * 9/5 + 31 # line 11 (bug: should be +32)
|
| 32 |
+
|
| 33 |
+
def is_palindrome(s):
|
| 34 |
+
return s == s[::-1] # line 14
|
| 35 |
+
|
| 36 |
+
def find_max(lst):
|
| 37 |
+
max_val = lst[0] # line 17
|
| 38 |
+
for i in range(len(lst)):
|
| 39 |
+
if lst[i] > max_val:
|
| 40 |
+
max_val = lst[i]
|
| 41 |
+
return max_val # line 21
|
| 42 |
+
|
| 43 |
+
def count_words(text):
|
| 44 |
+
words = text.split(" ")
|
| 45 |
+
wordcount = {}
|
| 46 |
+
for w in words:
|
| 47 |
+
if w in wordcount:
|
| 48 |
+
wordcount[w] = wordcount[w]+1
|
| 49 |
+
else:
|
| 50 |
+
wordcount[w] = 1
|
| 51 |
+
return wordcount # line 30
|
| 52 |
+
''',
|
| 53 |
+
)
|
| 54 |
+
|
| 55 |
+
# Ground truth for task 1. `line` numbers are 1-indexed into TASK1_SNIPPET's
# source; the grader matches agent comments against these entries.
TASK1_ISSUES: list[ReviewComment] = [
    # ---- Bugs ----
    ReviewComment(
        line=2,
        category=ReviewCategory.BUG,
        severity=Severity.HIGH,
        message="divide() has no guard against division by zero; will raise ZeroDivisionError when b=0.",
        suggestion="Add `if b == 0: raise ValueError('b must not be zero')` before returning.",
    ),
    ReviewComment(
        line=8,
        category=ReviewCategory.BUG,
        severity=Severity.HIGH,
        message="average() crashes with ZeroDivisionError on an empty list.",
        suggestion="Guard with `if not numbers: return 0.0` or raise ValueError.",
    ),
    ReviewComment(
        line=11,
        category=ReviewCategory.BUG,
        severity=Severity.MEDIUM,
        message="celsius_to_fahrenheit uses +31 instead of +32, giving wrong results.",
        suggestion="Change `+ 31` to `+ 32`.",
    ),
    ReviewComment(
        line=17,
        category=ReviewCategory.BUG,
        severity=Severity.MEDIUM,
        message="find_max() crashes with IndexError on an empty list.",
        suggestion="Add `if not lst: raise ValueError('list is empty')` at the top.",
    ),
    # ---- Style ----
    ReviewComment(
        line=18,
        category=ReviewCategory.STYLE,
        severity=Severity.LOW,
        message="Iterating with `for i in range(len(lst))` is unpythonic; prefer `for val in lst`.",
        suggestion="Replace loop body with `for val in lst: if val > max_val: max_val = val`.",
    ),
    ReviewComment(
        line=25,
        category=ReviewCategory.STYLE,
        severity=Severity.LOW,
        message="count_words manually reimplements collections.Counter; use the stdlib instead.",
        suggestion="Replace with `from collections import Counter; return Counter(text.split())`.",
    ),
]
|
| 101 |
+
|
| 102 |
+
# ---------------------------------------------------------------------------
|
| 103 |
+
# TASK 2 – Medium (Security + Performance)
|
| 104 |
+
# ---------------------------------------------------------------------------
|
| 105 |
+
|
| 106 |
+
TASK2_SNIPPET = CodeSnippet(
|
| 107 |
+
file_name="user_service.py",
|
| 108 |
+
source='''\
|
| 109 |
+
import sqlite3
|
| 110 |
+
import hashlib
|
| 111 |
+
import os
|
| 112 |
+
|
| 113 |
+
DB_PATH = "users.db"
|
| 114 |
+
|
| 115 |
+
def get_user(username):
|
| 116 |
+
conn = sqlite3.connect(DB_PATH)
|
| 117 |
+
cursor = conn.cursor()
|
| 118 |
+
query = f"SELECT * FROM users WHERE username = \'{ username }\'" # line 10
|
| 119 |
+
cursor.execute(query)
|
| 120 |
+
result = cursor.fetchone()
|
| 121 |
+
conn.close()
|
| 122 |
+
return result
|
| 123 |
+
|
| 124 |
+
def create_user(username, password):
|
| 125 |
+
conn = sqlite3.connect(DB_PATH)
|
| 126 |
+
cursor = conn.cursor()
|
| 127 |
+
pw_hash = hashlib.md5(password.encode()).hexdigest() # line 19
|
| 128 |
+
cursor.execute(
|
| 129 |
+
"INSERT INTO users (username, password) VALUES (?, ?)",
|
| 130 |
+
(username, pw_hash),
|
| 131 |
+
)
|
| 132 |
+
conn.commit()
|
| 133 |
+
conn.close()
|
| 134 |
+
|
| 135 |
+
def load_all_users():
|
| 136 |
+
conn = sqlite3.connect(DB_PATH)
|
| 137 |
+
cursor = conn.cursor()
|
| 138 |
+
cursor.execute("SELECT * FROM users")
|
| 139 |
+
rows = cursor.fetchall() # line 31
|
| 140 |
+
conn.close()
|
| 141 |
+
users = []
|
| 142 |
+
for row in rows:
|
| 143 |
+
users.append({
|
| 144 |
+
"id": row[0],
|
| 145 |
+
"username": row[1],
|
| 146 |
+
"password": row[2],
|
| 147 |
+
})
|
| 148 |
+
return users
|
| 149 |
+
|
| 150 |
+
def authenticate(username, password):
|
| 151 |
+
user = get_user(username)
|
| 152 |
+
if user is None:
|
| 153 |
+
return False
|
| 154 |
+
pw_hash = hashlib.md5(password.encode()).hexdigest() # line 46
|
| 155 |
+
return user[2] == pw_hash
|
| 156 |
+
|
| 157 |
+
def delete_user(username):
|
| 158 |
+
conn = sqlite3.connect(DB_PATH)
|
| 159 |
+
cursor = conn.cursor()
|
| 160 |
+
query = f"DELETE FROM users WHERE username = \'{ username }\'" # line 52
|
| 161 |
+
cursor.execute(query)
|
| 162 |
+
conn.commit()
|
| 163 |
+
conn.close()
|
| 164 |
+
''',
|
| 165 |
+
)
|
| 166 |
+
|
| 167 |
+
# Ground truth for task 2. `line` numbers are 1-indexed into TASK2_SNIPPET's
# source; the grader matches agent comments against these entries.
TASK2_ISSUES: list[ReviewComment] = [
    # ---- Security ----
    ReviewComment(
        line=10,
        category=ReviewCategory.SECURITY,
        severity=Severity.CRITICAL,
        message="SQL injection vulnerability: username is interpolated directly into the query string.",
        suggestion="Use parameterised queries: `cursor.execute('SELECT * FROM users WHERE username=?', (username,))`",
    ),
    ReviewComment(
        line=19,
        category=ReviewCategory.SECURITY,
        severity=Severity.CRITICAL,
        message="MD5 is cryptographically broken and must not be used for password hashing.",
        suggestion="Replace with `bcrypt.hashpw(password.encode(), bcrypt.gensalt())` or `hashlib.scrypt`.",
    ),
    ReviewComment(
        line=52,
        category=ReviewCategory.SECURITY,
        severity=Severity.CRITICAL,
        message="delete_user() is also vulnerable to SQL injection via string interpolation.",
        suggestion="Use parameterised queries: `cursor.execute('DELETE FROM users WHERE username=?', (username,))`",
    ),
    ReviewComment(
        line=46,
        category=ReviewCategory.SECURITY,
        severity=Severity.HIGH,
        message="authenticate() re-hashes with MD5 for comparison; same broken-hash issue as create_user.",
        suggestion="Adopt bcrypt.checkpw() or equivalent constant-time comparison.",
    ),
    # ---- Performance ----
    ReviewComment(
        line=31,
        category=ReviewCategory.PERFORMANCE,
        severity=Severity.HIGH,
        message="fetchall() loads the entire users table into memory; will OOM on large tables.",
        suggestion="Use `cursor.fetchmany(size=1000)` in a loop or add a LIMIT clause.",
    ),
    ReviewComment(
        line=8,
        category=ReviewCategory.PERFORMANCE,
        severity=Severity.MEDIUM,
        message="A new DB connection is opened and closed for every single query; connection pooling should be used.",
        suggestion="Use a module-level connection or a context-manager pool (e.g. `sqlite3.connect` as a shared resource).",
    ),
]
|
| 213 |
+
|
| 214 |
+
# ---------------------------------------------------------------------------
|
| 215 |
+
# TASK 3 – Hard (All categories: Bug + Security + Performance + Style + Docs)
|
| 216 |
+
# ---------------------------------------------------------------------------
|
| 217 |
+
|
| 218 |
+
# Deliberately flawed pipeline used as the "hard" review target.
# NOTE(review): the inline "# line NN" markers inside the source string pin
# the positions that TASK3_ISSUES refers to — do not reflow this string, or
# the graders' line-number tolerance checks will drift.
TASK3_SNIPPET = CodeSnippet(
    file_name="data_pipeline.py",
    source='''\
"""Data pipeline for processing CSV exports from the analytics platform."""

import csv
import os
import pickle
import subprocess
import time

CACHE = {}

def load_data(filepath):
    with open(filepath) as f:  # line 12
        reader = csv.DictReader(f)
        data = []
        for row in reader:
            data.append(row)
        return data

def process_row(row, transform_script):
    result = subprocess.run(transform_script, shell=True, input=str(row))  # line 20
    return result.stdout

def cache_result(key, value):
    CACHE[key] = pickle.dumps(value)  # line 24

def get_cached(key):
    if key in CACHE:
        return pickle.loads(CACHE[key])  # line 28

def compute_stats(data):
    n = len(data)  # line 31
    total = sum(float(row["value"]) for row in data)
    mean = total / n
    variance = sum((float(row["value"]) - mean) ** 2 for row in data) / n
    return {"mean": mean, "variance": variance, "count": n}

def run_pipeline(filepath, transform_script=None):
    data = load_data(filepath)
    if transform_script:
        processed = []
        for row in data:
            processed.append(process_row(row, transform_script))
        data = processed
    stats = compute_stats(data)
    cache_result(filepath, stats)
    return stats

def export_results(stats, output_path):
    with open(output_path, "w") as f:  # line 47
        f.write(str(stats))
''',
)
|
| 273 |
+
|
| 274 |
+
# Ground truth for the hard task: one entry per intentional defect in
# TASK3_SNIPPET, spanning all five review categories. Line numbers refer to
# the "# line NN" markers embedded in the snippet's source string.
TASK3_ISSUES: list[ReviewComment] = [
    # ---- Security ----
    ReviewComment(
        line=20,
        category=ReviewCategory.SECURITY,
        severity=Severity.CRITICAL,
        message="subprocess.run with shell=True and user-supplied transform_script enables arbitrary OS command injection.",
        suggestion="Avoid shell=True; pass args as a list or whitelist allowed scripts.",
    ),
    ReviewComment(
        line=28,
        category=ReviewCategory.SECURITY,
        severity=Severity.CRITICAL,
        message="pickle.loads() on untrusted/arbitrary cache data allows arbitrary code execution.",
        suggestion="Replace pickle with json.dumps/loads for serialisable data, or sign+verify the payload.",
    ),
    ReviewComment(
        line=24,
        category=ReviewCategory.SECURITY,
        severity=Severity.HIGH,
        message="Storing pickled data in a module-level dict means deserialization risk persists across calls.",
        suggestion="Use JSON for the cache and validate schemas on retrieval.",
    ),
    # ---- Bugs ----
    ReviewComment(
        line=31,
        category=ReviewCategory.BUG,
        severity=Severity.HIGH,
        message="compute_stats() raises ZeroDivisionError when data is empty (n=0).",
        suggestion="Guard with `if not data: return {'mean': 0, 'variance': 0, 'count': 0}`.",
    ),
    ReviewComment(
        line=32,
        category=ReviewCategory.BUG,
        severity=Severity.MEDIUM,
        message="If any row is missing the 'value' key, a KeyError will silently abort the pipeline.",
        suggestion="Use `row.get('value', 0)` or validate schema at load time.",
    ),
    ReviewComment(
        line=12,
        category=ReviewCategory.BUG,
        severity=Severity.MEDIUM,
        message="open(filepath) without encoding='utf-8' will use the system locale; may fail on non-ASCII data.",
        suggestion="Use `open(filepath, encoding='utf-8')`.",
    ),
    # ---- Performance ----
    ReviewComment(
        line=31,
        category=ReviewCategory.PERFORMANCE,
        severity=Severity.MEDIUM,
        message="compute_stats() iterates over data twice (once for sum, once for variance); single-pass Welford's algorithm is more efficient.",
        suggestion="Use Welford's online algorithm or numpy for large datasets.",
    ),
    ReviewComment(
        line=38,
        category=ReviewCategory.PERFORMANCE,
        severity=Severity.MEDIUM,
        message="process_row() spawns a new subprocess for every row; should batch or vectorise the transformation.",
        suggestion="Pass all rows to a single subprocess call or use a Python-native transform function.",
    ),
    # ---- Style ----
    ReviewComment(
        line=47,
        category=ReviewCategory.STYLE,
        severity=Severity.LOW,
        message="export_results writes str(stats) (a Python dict repr) rather than valid JSON or CSV.",
        suggestion="Use `import json; f.write(json.dumps(stats, indent=2))`.",
    ),
    ReviewComment(
        line=9,
        category=ReviewCategory.STYLE,
        severity=Severity.LOW,
        message="Module-level mutable CACHE dict is a global side-effect; makes the pipeline hard to test and thread-unsafe.",
        suggestion="Encapsulate state inside a Pipeline class or pass cache explicitly.",
    ),
    # ---- Documentation ----
    ReviewComment(
        line=12,
        category=ReviewCategory.DOCUMENTATION,
        severity=Severity.LOW,
        message="load_data() has no docstring; expected CSV schema (required columns, types) is undocumented.",
        suggestion="Add a docstring describing filepath, expected columns, and return type.",
    ),
    ReviewComment(
        line=19,
        category=ReviewCategory.DOCUMENTATION,
        severity=Severity.LOW,
        message="process_row() does not document what transform_script should be, its expected format, or return value.",
        suggestion="Add docstring: args, expected script interface, return type, and example.",
    ),
    ReviewComment(
        # line=None marks a file-level comment (matches any position in graders).
        line=None,
        category=ReviewCategory.DOCUMENTATION,
        severity=Severity.LOW,
        message="Module-level docstring is too vague; doesn't mention side-effects, required CSV schema, or dependencies.",
        suggestion="Expand the module docstring with usage example, required columns, and external dependencies.",
    ),
]
|
| 372 |
+
|
| 373 |
+
# ---------------------------------------------------------------------------
|
| 374 |
+
# Registry
|
| 375 |
+
# ---------------------------------------------------------------------------
|
| 376 |
+
|
| 377 |
+
# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------

# Maps task_id -> {"snippet": CodeSnippet, "issues": list[ReviewComment]}.
# Keys must stay in sync with TASK_SPECS in env/environment.py, which indexes
# CORPUS by the same task_id strings.
CORPUS: dict[str, dict] = {
    "task_1_easy": {
        "snippet": TASK1_SNIPPET,
        "issues": TASK1_ISSUES,
    },
    "task_2_medium": {
        "snippet": TASK2_SNIPPET,
        "issues": TASK2_ISSUES,
    },
    "task_3_hard": {
        "snippet": TASK3_SNIPPET,
        "issues": TASK3_ISSUES,
    },
}
|
env/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# env package
|
env/__pycache__/__init__.cpython-313.pyc
ADDED
|
Binary file (152 Bytes). View file
|
|
|
env/__pycache__/environment.cpython-313.pyc
ADDED
|
Binary file (11.8 kB). View file
|
|
|
env/__pycache__/models.cpython-313.pyc
ADDED
|
Binary file (5.37 kB). View file
|
|
|
env/environment.py
ADDED
|
@@ -0,0 +1,317 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
CodeReviewEnv – main OpenEnv environment.
|
| 3 |
+
|
| 4 |
+
Interface
|
| 5 |
+
---------
|
| 6 |
+
env = CodeReviewEnv(task_id="task_1_easy")
|
| 7 |
+
obs = env.reset()
|
| 8 |
+
result = env.step(action)
|
| 9 |
+
state = env.state()
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import time
|
| 15 |
+
from typing import Any, Dict, List, Optional
|
| 16 |
+
|
| 17 |
+
from corpus.snippets import CORPUS
|
| 18 |
+
from env.models import (
|
| 19 |
+
Action,
|
| 20 |
+
CodeSnippet,
|
| 21 |
+
EnvironmentState,
|
| 22 |
+
Observation,
|
| 23 |
+
Reward,
|
| 24 |
+
ReviewComment,
|
| 25 |
+
StepResult,
|
| 26 |
+
TaskDifficulty,
|
| 27 |
+
TaskSpec,
|
| 28 |
+
)
|
| 29 |
+
from graders.graders import GRADERS
|
| 30 |
+
|
| 31 |
+
# ---------------------------------------------------------------------------
|
| 32 |
+
# Task specs
|
| 33 |
+
# ---------------------------------------------------------------------------
|
| 34 |
+
|
| 35 |
+
# Static catalogue of the three benchmark tasks. Keys must match the task_id
# keys of corpus.snippets.CORPUS (CodeReviewEnv indexes both by the same id).
TASK_SPECS: dict[str, TaskSpec] = {
    "task_1_easy": TaskSpec(
        task_id="task_1_easy",
        title="Bug Detection & Style Review",
        difficulty=TaskDifficulty.EASY,
        categories=["bug", "style"],
        description=(
            "Review calculator.py for correctness bugs (division by zero, off-by-one, "
            "empty collection crashes) and Python style issues. "
            "You do NOT need to check for security or performance."
        ),
        max_steps=5,
        passing_threshold=0.55,
    ),
    "task_2_medium": TaskSpec(
        task_id="task_2_medium",
        title="Security & Performance Audit",
        difficulty=TaskDifficulty.MEDIUM,
        categories=["security", "performance"],
        description=(
            "Audit user_service.py for security vulnerabilities (SQL injection, weak "
            "hashing, unsafe deserialization) and performance problems (unbounded queries, "
            "connection churn). Identify ALL critical security issues – missing one costs "
            "heavily."
        ),
        max_steps=7,
        passing_threshold=0.60,
    ),
    "task_3_hard": TaskSpec(
        task_id="task_3_hard",
        title="Comprehensive Code Review",
        difficulty=TaskDifficulty.HARD,
        categories=["bug", "security", "performance", "style", "documentation"],
        description=(
            "Perform a full production-grade review of data_pipeline.py covering bugs, "
            "security flaws, performance issues, code style, and documentation gaps. "
            "You MUST provide a written summary of overall findings. "
            "This snippet has intentional issues across all five categories."
        ),
        max_steps=10,
        passing_threshold=0.65,
    ),
}
|
| 78 |
+
|
| 79 |
+
# ---------------------------------------------------------------------------
|
| 80 |
+
# Environment
|
| 81 |
+
# ---------------------------------------------------------------------------
|
| 82 |
+
|
| 83 |
+
INSTRUCTIONS_TEMPLATE = """
|
| 84 |
+
You are performing a Python code review.
|
| 85 |
+
|
| 86 |
+
Task: {title}
|
| 87 |
+
Difficulty: {difficulty}
|
| 88 |
+
Categories to check: {categories}
|
| 89 |
+
|
| 90 |
+
{description}
|
| 91 |
+
|
| 92 |
+
Your job:
|
| 93 |
+
1. Read the code snippet carefully.
|
| 94 |
+
2. Identify issues matching the specified categories.
|
| 95 |
+
3. For each issue, provide: line number (if applicable), category, severity, a clear message, and an optional fix suggestion.
|
| 96 |
+
4. When you are satisfied, set `submit=True` in your action.
|
| 97 |
+
{summary_note}
|
| 98 |
+
|
| 99 |
+
The code will be shown in the observation. Previous comments you have already submitted are also included so you can refine or expand them across steps.
|
| 100 |
+
""".strip()
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
class CodeReviewEnv:
    """
    OpenEnv-compliant environment for Python code review tasks.

    Episode flow: ``reset()`` yields the snippet plus instructions; each
    ``step(action)`` merges the action's (deduplicated) comments into the
    running review, grades the accumulated set, and returns an incremental
    reward. The episode ends when the agent sets ``submit=True`` or when
    ``max_steps`` is exhausted.
    """

    def __init__(self, task_id: str = "task_1_easy"):
        # Fail fast on an unknown task so callers get a clear error message.
        if task_id not in TASK_SPECS:
            raise ValueError(f"Unknown task_id '{task_id}'. Choose from: {list(TASK_SPECS)}")

        self.task_id = task_id
        self.spec: TaskSpec = TASK_SPECS[task_id]
        self.corpus_entry: dict = CORPUS[task_id]
        self.grader = GRADERS[task_id]
        self.ground_truth: List[ReviewComment] = self.corpus_entry["issues"]
        self.snippet: CodeSnippet = self.corpus_entry["snippet"]

        # Mutable per-episode state; reset() re-initialises all of these.
        self._step: int = 0
        self._done: bool = False
        self._comments: List[ReviewComment] = []
        self._total_reward: float = 0.0
        self._grader_scores: Dict[str, float] = {}
        self._last_feedback: Optional[str] = None

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    def reset(self) -> Observation:
        """Reset the environment to initial state and return first observation."""
        self._step = 0
        self._done = False
        self._comments = []
        self._total_reward = 0.0
        self._grader_scores = {}
        self._last_feedback = None
        return self._build_observation()

    def step(self, action: Action) -> StepResult:
        """
        Advance the environment by one step.

        Parameters
        ----------
        action : Action
            Comments produced this step plus optional submit flag.

        Returns
        -------
        StepResult with (observation, reward, done, info)

        Raises
        ------
        RuntimeError
            If called after the episode has already finished.
        """
        if self._done:
            raise RuntimeError("Episode is done; call reset() first.")

        self._step += 1

        # Accumulate comments (deduplicate by message fingerprint).
        # Ordering matters: extend first, then grade the full accumulated set.
        new_comments = self._deduplicate(action.comments)
        self._comments.extend(new_comments)

        # Compute incremental reward for new comments
        reward, feedback, grader_result = self._compute_reward(action, new_comments)
        self._grader_scores = grader_result
        # Running total is re-rounded each step to keep the float tidy.
        self._total_reward = round(self._total_reward + reward.value, 4)
        self._last_feedback = feedback

        # Determine done: explicit submit, or step budget exhausted.
        done = action.submit or self._step >= self.spec.max_steps
        self._done = done

        obs = self._build_observation(feedback=feedback, done=done)
        info: Dict[str, Any] = {
            "step": self._step,
            "new_comments": len(new_comments),
            "total_comments": len(self._comments),
            "grader": grader_result,
            "passed": grader_result.get("score", 0.0) >= self.spec.passing_threshold,
        }

        return StepResult(observation=obs, reward=reward, done=done, info=info)

    def state(self) -> EnvironmentState:
        """Return full serialisable state snapshot."""
        return EnvironmentState(
            task_id=self.task_id,
            step=self._step,
            max_steps=self.spec.max_steps,
            total_reward=self._total_reward,
            comments_so_far=self._comments,
            done=self._done,
            grader_scores=self._grader_scores,
        )

    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------

    def _build_observation(
        self,
        feedback: Optional[str] = None,
        done: bool = False,
    ) -> Observation:
        """Render the instructions template and package the current view."""
        # Only the hard task requires a written summary from the agent.
        summary_note = (
            "\n5. You MUST include a `summary` field with your overall assessment."
            if self.task_id == "task_3_hard"
            else ""
        )
        instructions = INSTRUCTIONS_TEMPLATE.format(
            title=self.spec.title,
            difficulty=self.spec.difficulty.value.upper(),
            categories=", ".join(self.spec.categories),
            description=self.spec.description,
            summary_note=summary_note,
        )
        return Observation(
            task_id=self.task_id,
            step=self._step,
            snippet=self.snippet,
            instructions=instructions,
            # Copy so later mutation of self._comments cannot alias the observation.
            previous_comments=list(self._comments),
            feedback=feedback or self._last_feedback,
            done=done,
        )

    def _compute_reward(
        self,
        action: Action,
        new_comments: List[ReviewComment],
    ) -> tuple[Reward, str, dict]:
        """
        Compute reward with partial progress signals.

        Components
        ----------
        * +step_signal : positive if new valid comments were added
        * +submit_bonus : grader score applied on final submit
        * -loop_penalty : penalty for submitting zero new comments repeatedly
        * -over_comment : penalty for > 2× the expected number of comments
        """
        # Run grader against ALL accumulated comments
        full_action = Action(
            comments=self._comments,
            summary=action.summary,
            submit=action.submit,
        )
        grader_result = self.grader.grade(full_action, self.ground_truth)
        current_score = grader_result["score"]

        breakdown: Dict[str, float] = {}
        reward_val = 0.0

        if action.submit:
            # Final reward = full grader score (0–1 mapped to -0.2–1.0)
            submit_reward = current_score * 0.8 + (0.2 if current_score >= self.spec.passing_threshold else -0.2)
            reward_val += submit_reward
            breakdown["submit_reward"] = round(submit_reward, 4)
            feedback = (
                f"Review submitted. Score: {current_score:.3f} "
                f"({'PASSED' if current_score >= self.spec.passing_threshold else 'FAILED'}). "
                f"Matched {grader_result['matched_count']}/{grader_result['total_ground_truth']} issues."
            )
        else:
            # Incremental reward: positive if new valid comments detected
            if new_comments:
                # Small positive signal for adding comments (+0.05 per comment, capped)
                step_reward = min(0.05 * len(new_comments), 0.15)
                reward_val += step_reward
                breakdown["step_reward"] = round(step_reward, 4)

                # Progress signal: reward increase in grader score.
                # We run a "previous" grader check without new comments to get delta.
                # NOTE(review): exclusion is by value equality (pydantic __eq__),
                # which relies on _deduplicate having kept comments distinct.
                prev_action = Action(
                    comments=[c for c in self._comments if c not in new_comments],
                    summary=None,
                    submit=False,
                )
                prev_result = self.grader.grade(prev_action, self.ground_truth)
                score_delta = current_score - prev_result["score"]
                if score_delta > 0:
                    progress_reward = round(score_delta * 0.5, 4)
                    reward_val += progress_reward
                    breakdown["progress_reward"] = progress_reward
            else:
                # Penalty for empty step
                reward_val -= 0.05
                breakdown["empty_step_penalty"] = -0.05

            # Penalty for too many comments (spam)
            expected = grader_result["total_ground_truth"]
            if len(self._comments) > expected * 2.5:
                spam_penalty = -0.10
                reward_val += spam_penalty
                breakdown["spam_penalty"] = spam_penalty

            feedback = (
                f"Step {self._step}: Added {len(new_comments)} comment(s). "
                f"Running score: {current_score:.3f}. "
                f"Steps remaining: {self.spec.max_steps - self._step}."
            )

        # Clamp to [-1, 1] so a single step can never swamp the episode.
        reward_val = round(max(-1.0, min(1.0, reward_val)), 4)
        return Reward(value=reward_val, breakdown=breakdown, reason=feedback), feedback, grader_result

    def _deduplicate(self, incoming: List[ReviewComment]) -> List[ReviewComment]:
        """Remove comments whose (line, category, message[:40]) already exist.

        The 40-char message prefix is a cheap fingerprint: two comments on the
        same line/category whose messages agree for 40 chars count as one.
        """
        existing_keys = {
            (c.line, c.category, c.message[:40]) for c in self._comments
        }
        new: List[ReviewComment] = []
        for c in incoming:
            key = (c.line, c.category, c.message[:40])
            if key not in existing_keys:
                existing_keys.add(key)
                new.append(c)
        return new
|
env/models.py
ADDED
|
@@ -0,0 +1,117 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Pydantic typed models for CodeReview OpenEnv.
|
| 3 |
+
|
| 4 |
+
Defines all core data structures: enums for review categories and severities,
|
| 5 |
+
code snippets, review comments, actions, observations, rewards, step results,
|
| 6 |
+
task specifications, and environment state.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
from enum import Enum
|
| 12 |
+
from typing import Any, Dict, List, Optional
|
| 13 |
+
|
| 14 |
+
from pydantic import BaseModel, Field
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
# ---------------------------------------------------------------------------
|
| 18 |
+
# Enums
|
| 19 |
+
# ---------------------------------------------------------------------------
|
| 20 |
+
|
| 21 |
+
class ReviewCategory(str, Enum):
    """Categories of code review issues.

    Mixing in ``str`` makes members compare equal to their plain string
    values, so they serialise cleanly in pydantic models.
    """
    BUG = "bug"
    SECURITY = "security"
    PERFORMANCE = "performance"
    STYLE = "style"
    DOCUMENTATION = "documentation"
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
class Severity(str, Enum):
    """Severity levels for review comments, ordered LOW < MEDIUM < HIGH < CRITICAL.

    Graders weight matched issues by severity (see graders.SEVERITY_WEIGHT).
    """
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class TaskDifficulty(str, Enum):
    """Difficulty levels for tasks (one per benchmark task)."""
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
# ---------------------------------------------------------------------------
|
| 46 |
+
# Core models
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
|
| 49 |
+
class CodeSnippet(BaseModel):
    """A Python source code snippet for review."""
    # Logical file name shown to the agent (e.g. "data_pipeline.py").
    file_name: str
    # Full source text of the snippet, including any inline line markers.
    source: str
    # Language tag; the corpus currently only contains Python.
    language: str = "python"
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
class ReviewComment(BaseModel):
    """A single review comment identifying an issue in the code."""
    # 1-based line number within the snippet; None marks a file-level comment.
    line: Optional[int] = None
    category: ReviewCategory
    severity: Severity = Severity.MEDIUM
    # Human-readable description of the issue.
    message: str
    # Optional concrete fix proposal.
    suggestion: Optional[str] = None
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
class Action(BaseModel):
    """Agent action: a list of review comments plus control flags."""
    # Comments added this step (may be empty).
    comments: List[ReviewComment] = Field(default_factory=list)
    # Overall written assessment; required by the hard task on submit.
    summary: Optional[str] = None
    # True to finish the episode and trigger final grading.
    submit: bool = False
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
class Observation(BaseModel):
    """What the agent sees on each step."""
    task_id: str
    step: int
    # The code under review (constant for the whole episode).
    snippet: CodeSnippet
    # Rendered task instructions (see INSTRUCTIONS_TEMPLATE).
    instructions: str
    # Comments accumulated so far, so the agent can refine across steps.
    previous_comments: List[ReviewComment] = Field(default_factory=list)
    # Environment feedback string from the previous step, if any.
    feedback: Optional[str] = None
    done: bool = False
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
class Reward(BaseModel):
    """Reward signal returned after each step."""
    # Scalar reward, clamped to [-1.0, 1.0] by the environment.
    value: float = 0.0
    # Per-component contributions (e.g. "step_reward", "spam_penalty").
    breakdown: Dict[str, float] = Field(default_factory=dict)
    # Human-readable explanation of this step's reward.
    reason: str = ""
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
class StepResult(BaseModel):
    """Result of a single environment step (gym-style tuple as a model)."""
    observation: Observation
    reward: Reward
    done: bool
    # Diagnostics: step counter, comment counts, grader output, pass flag.
    info: Dict[str, Any] = Field(default_factory=dict)
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
class TaskSpec(BaseModel):
    """Specification for a single task."""
    task_id: str
    title: str
    difficulty: TaskDifficulty
    # Review categories the agent is asked to cover (lowercase names).
    categories: List[str]
    description: str
    # Episode ends after this many steps even without submit.
    max_steps: int
    # Minimum grader score (0-1) counted as a pass.
    passing_threshold: float
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
class EnvironmentState(BaseModel):
    """Full serialisable state snapshot of the environment."""
    task_id: str
    step: int
    max_steps: int
    # Sum of per-step reward values, rounded each step.
    total_reward: float
    comments_so_far: List[ReviewComment] = Field(default_factory=list)
    done: bool
    # Most recent grader result dict (score, matched_count, ...).
    grader_scores: Dict[str, Any] = Field(default_factory=dict)
|
graders/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# graders package
|
graders/__pycache__/__init__.cpython-313.pyc
ADDED
|
Binary file (156 Bytes). View file
|
|
|
graders/__pycache__/graders.cpython-313.pyc
ADDED
|
Binary file (13.9 kB). View file
|
|
|
graders/graders.py
ADDED
|
@@ -0,0 +1,313 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Agent graders for all three tasks.
|
| 3 |
+
|
| 4 |
+
Each grader implements:
|
| 5 |
+
grade(action: Action, ground_truth: list[ReviewComment]) -> dict
|
| 6 |
+
|
| 7 |
+
Scoring philosophy
|
| 8 |
+
------------------
|
| 9 |
+
* True positive (found real issue) → positive reward
|
| 10 |
+
* False positive (fabricated issue) → small penalty
|
| 11 |
+
* Missed critical issue → large penalty
|
| 12 |
+
* Summary quality (task 3) → bonus
|
| 13 |
+
* Partial credit for correct category/severity with wrong line
|
| 14 |
+
"""
|
| 15 |
+
|
| 16 |
+
from __future__ import annotations
|
| 17 |
+
|
| 18 |
+
import re
|
| 19 |
+
from dataclasses import dataclass, field
|
| 20 |
+
from typing import List, Optional
|
| 21 |
+
|
| 22 |
+
from env.models import Action, ReviewCategory, ReviewComment, Severity
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
# ---------------------------------------------------------------------------
|
| 26 |
+
# Helpers
|
| 27 |
+
# ---------------------------------------------------------------------------
|
| 28 |
+
|
| 29 |
+
# Reward weight applied when a ground-truth issue of this severity is matched:
# finding a CRITICAL issue is worth four times a LOW one.
SEVERITY_WEIGHT: dict[Severity, float] = {
    Severity.CRITICAL: 1.0,
    Severity.HIGH: 0.75,
    Severity.MEDIUM: 0.5,
    Severity.LOW: 0.25,
}
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def _category_match(a: ReviewComment, b: ReviewComment) -> bool:
    """Exact category agreement; a hard prerequisite for any match."""
    return a.category == b.category
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
def _severity_close(a: ReviewComment, b: ReviewComment) -> bool:
    """Return True when the two severities are identical or adjacent.

    E.g. HIGH matches MEDIUM/HIGH/CRITICAL but not LOW.
    """
    rank = {Severity.LOW: 0, Severity.MEDIUM: 1, Severity.HIGH: 2, Severity.CRITICAL: 3}
    return abs(rank[a.severity] - rank[b.severity]) <= 1
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def _line_close(a: ReviewComment, b: ReviewComment, tolerance: int = 3) -> bool:
|
| 47 |
+
if a.line is None or b.line is None:
|
| 48 |
+
return True # file-level comments always match positionally
|
| 49 |
+
return abs(a.line - b.line) <= tolerance
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def _message_relevant(comment: ReviewComment, truth: ReviewComment) -> bool:
|
| 53 |
+
"""Check if comment message contains keywords from the truth message."""
|
| 54 |
+
# Pull significant words (>4 chars) from the ground truth message
|
| 55 |
+
truth_keywords = {
|
| 56 |
+
w.lower()
|
| 57 |
+
for w in re.findall(r"\b\w{4,}\b", truth.message)
|
| 58 |
+
if w.lower() not in {"this", "that", "with", "from", "will", "should", "must", "have", "been", "when"}
|
| 59 |
+
}
|
| 60 |
+
comment_text = (comment.message + " " + (comment.suggestion or "")).lower()
|
| 61 |
+
if not truth_keywords:
|
| 62 |
+
return True
|
| 63 |
+
overlap = sum(1 for kw in truth_keywords if kw in comment_text)
|
| 64 |
+
return overlap / len(truth_keywords) >= 0.25 # 25% keyword overlap
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
@dataclass
class MatchResult:
    """Outcome of matching one agent comment against the ground truth."""

    matched: bool = False   # full match: category + line + message all agree
    partial: bool = False   # right category/message, but wrong line
    score: float = 0.0      # reward contribution of this match
    reason: str = ""        # human-readable explanation for debugging
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def _match_comment_to_truth(
    comment: ReviewComment,
    truth_list: List[ReviewComment],
    already_matched: set[int],
) -> tuple[MatchResult, Optional[int]]:
    """Try to match a single agent comment against the ground-truth list.

    Returns the best-scoring ``MatchResult`` plus the index of the matched
    ground-truth entry (``None`` if nothing matched). Entries whose index
    is in *already_matched* are skipped so each ground-truth issue is
    credited at most once.

    Note: the original ``elif`` re-tested ``_category_match`` even though
    the loop already ``continue``s on a category mismatch; that redundant
    check is removed here without changing behavior.
    """
    best = MatchResult()
    best_idx: Optional[int] = None

    for idx, truth in enumerate(truth_list):
        if idx in already_matched:
            continue
        if not _category_match(comment, truth):
            continue

        line_ok = _line_close(comment, truth)
        sev_ok = _severity_close(comment, truth)
        msg_ok = _message_relevant(comment, truth)

        # Without any keyword overlap there is neither a full nor a
        # partial match for this truth entry.
        if not msg_ok:
            continue

        if line_ok:
            # Full match: weight by severity; discount by 30% when the
            # agent's severity is more than one rank off.
            score = SEVERITY_WEIGHT[truth.severity] * (1.0 if sev_ok else 0.7)
            result = MatchResult(matched=True, partial=False, score=score,
                                 reason=f"TP: {truth.category} L{truth.line}")
        else:
            # Partial credit: right issue, wrong line.
            score = SEVERITY_WEIGHT[truth.severity] * 0.5
            result = MatchResult(matched=False, partial=True, score=score,
                                 reason=f"Partial: right issue wrong line for {truth.category}")

        if result.score > best.score:
            best = result
            best_idx = idx

    return best, best_idx
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
# ---------------------------------------------------------------------------
|
| 119 |
+
# Base grader
|
| 120 |
+
# ---------------------------------------------------------------------------
|
| 121 |
+
|
| 122 |
+
class BaseGrader:
    """Common interface shared by the three task graders."""

    TASK_ID: str = ""                        # task identifier this grader serves
    CATEGORIES: list[ReviewCategory] = []    # review categories in scope

    def grade(
        self,
        action: Action,
        ground_truth: List[ReviewComment],
    ) -> dict:
        """Score *action* against *ground_truth*; implemented by subclasses."""
        raise NotImplementedError
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
# ---------------------------------------------------------------------------
|
| 135 |
+
# Task 1 – Easy (Bug + Style)
|
| 136 |
+
# ---------------------------------------------------------------------------
|
| 137 |
+
|
| 138 |
+
class Task1Grader(BaseGrader):
    """Easy task: bug + style detection on a short snippet."""

    TASK_ID = "task_1_easy"
    CATEGORIES = [ReviewCategory.BUG, ReviewCategory.STYLE]

    def grade(self, action: Action, ground_truth: List[ReviewComment]) -> dict:
        """Score a review: recall minus false-positive and missed-issue penalties.

        Returns a dict with the final score in [0, 1], a per-component
        breakdown, and match counts.
        """
        credited: set[int] = set()
        earned = 0.0
        false_positive = 0.0

        for comment in action.comments:
            if comment.category not in self.CATEGORIES:
                false_positive += 0.05  # off-topic category
                continue
            match, truth_idx = _match_comment_to_truth(comment, ground_truth, credited)
            if not (match.matched or match.partial):
                false_positive += 0.1  # fabricated issue
                continue
            earned += match.score
            if truth_idx is not None:
                credited.add(truth_idx)

        # Best achievable TP score for the in-scope categories.
        achievable = sum(SEVERITY_WEIGHT[t.severity]
                         for t in ground_truth if t.category in self.CATEGORIES)
        recall = earned / achievable if achievable > 0 else 0.0

        # Penalise in-scope HIGH/CRITICAL issues the agent never found.
        missed = 0.0
        for truth_idx, truth in enumerate(ground_truth):
            if (truth_idx not in credited
                    and truth.severity in (Severity.HIGH, Severity.CRITICAL)
                    and truth.category in self.CATEGORIES):
                missed += 0.15

        capped_fp = min(false_positive, 0.3)
        final = round(max(0.0, min(1.0, recall - capped_fp - missed)), 4)

        return {
            "score": final,
            "breakdown": {
                "recall": round(recall, 4),
                "fp_penalty": round(-capped_fp, 4),
                "missed_critical_penalty": round(-missed, 4),
            },
            "matched_count": len(credited),
            "total_ground_truth": len([t for t in ground_truth
                                       if t.category in self.CATEGORIES]),
        }
|
| 186 |
+
|
| 187 |
+
|
| 188 |
+
# ---------------------------------------------------------------------------
|
| 189 |
+
# Task 2 – Medium (Security + Performance)
|
| 190 |
+
# ---------------------------------------------------------------------------
|
| 191 |
+
|
| 192 |
+
class Task2Grader(BaseGrader):
    """Medium task: security + performance audit."""

    TASK_ID = "task_2_medium"
    CATEGORIES = [ReviewCategory.SECURITY, ReviewCategory.PERFORMANCE]

    def grade(self, action: Action, ground_truth: List[ReviewComment]) -> dict:
        """Score the audit; missed security issues carry heavy penalties."""
        credited: set[int] = set()
        earned = 0.0
        false_positive = 0.0

        for comment in action.comments:
            if comment.category not in self.CATEGORIES:
                false_positive += 0.03  # off-topic category
                continue
            match, truth_idx = _match_comment_to_truth(comment, ground_truth, credited)
            if not (match.matched or match.partial):
                false_positive += 0.12  # fabricated issue
                continue
            earned += match.score
            if truth_idx is not None:
                credited.add(truth_idx)

        achievable = sum(SEVERITY_WEIGHT[t.severity]
                         for t in ground_truth if t.category in self.CATEGORIES)
        recall = earned / achievable if achievable > 0 else 0.0

        # Missed security issues: criticals cost double what highs do.
        missed = 0.0
        for truth_idx, truth in enumerate(ground_truth):
            if truth_idx in credited or truth.category != ReviewCategory.SECURITY:
                continue
            if truth.severity == Severity.CRITICAL:
                missed += 0.20
            elif truth.severity == Severity.HIGH:
                missed += 0.10

        capped_fp = min(false_positive, 0.3)
        final = round(max(0.0, min(1.0, recall - capped_fp - missed)), 4)

        return {
            "score": final,
            "breakdown": {
                "recall": round(recall, 4),
                "fp_penalty": round(-capped_fp, 4),
                "missed_security_penalty": round(-missed, 4),
            },
            "matched_count": len(credited),
            "total_ground_truth": len([t for t in ground_truth
                                       if t.category in self.CATEGORIES]),
        }
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
# ---------------------------------------------------------------------------
|
| 243 |
+
# Task 3 – Hard (All categories + summary required)
|
| 244 |
+
# ---------------------------------------------------------------------------
|
| 245 |
+
|
| 246 |
+
class Task3Grader(BaseGrader):
    """Hard task: all categories, plus a required written summary."""

    TASK_ID = "task_3_hard"
    CATEGORIES = list(ReviewCategory)

    def grade(self, action: Action, ground_truth: List[ReviewComment]) -> dict:
        """Score the full review.

        Adds a summary-quality bonus (up to +0.15), penalises a missing
        summary (-0.10), and penalises missed HIGH/CRITICAL issues of any
        category.
        """
        credited: set[int] = set()
        earned = 0.0
        false_positive = 0.0

        # All categories are in scope, so every comment goes to matching.
        for comment in action.comments:
            match, truth_idx = _match_comment_to_truth(comment, ground_truth, credited)
            if not (match.matched or match.partial):
                false_positive += 0.08  # fabricated issue
                continue
            earned += match.score
            if truth_idx is not None:
                credited.add(truth_idx)

        achievable = sum(SEVERITY_WEIGHT[t.severity] for t in ground_truth)
        recall = earned / achievable if achievable > 0 else 0.0

        # Summary quality bonus: +0.025 per key theme mentioned, capped at 0.15.
        summary_bonus = 0.0
        if action.summary:
            text = action.summary.lower()
            themes = ["security", "injection", "pickle",
                      "performance", "documentation", "bug"]
            summary_bonus = min(0.15, sum(1 for kw in themes if kw in text) * 0.025)

        # The hard task requires an overall summary.
        summary_penalty = 0.0 if action.summary else 0.10

        # Missed HIGH/CRITICAL issues of any category.
        missed = 0.0
        for truth_idx, truth in enumerate(ground_truth):
            if truth_idx in credited:
                continue
            if truth.severity == Severity.CRITICAL:
                missed += 0.15
            elif truth.severity == Severity.HIGH:
                missed += 0.08

        capped_fp = min(false_positive, 0.3)
        raw = recall + summary_bonus - capped_fp - missed - summary_penalty
        final = round(max(0.0, min(1.0, raw)), 4)

        return {
            "score": final,
            "breakdown": {
                "recall": round(recall, 4),
                "summary_bonus": round(summary_bonus, 4),
                "fp_penalty": round(-capped_fp, 4),
                "missed_critical_penalty": round(-missed, 4),
                "summary_penalty": round(-summary_penalty, 4),
            },
            "matched_count": len(credited),
            "total_ground_truth": len(ground_truth),
        }
|
| 303 |
+
|
| 304 |
+
|
| 305 |
+
# ---------------------------------------------------------------------------
|
| 306 |
+
# Registry
|
| 307 |
+
# ---------------------------------------------------------------------------
|
| 308 |
+
|
| 309 |
+
# Task-id → grader-instance registry used by the environment; each grader
# is keyed by its own TASK_ID attribute.
GRADERS: dict[str, BaseGrader] = {
    grader.TASK_ID: grader
    for grader in (Task1Grader(), Task2Grader(), Task3Grader())
}
|
inference.py
ADDED
|
@@ -0,0 +1,304 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Inference Script for CodeReview OpenEnv
|
| 4 |
+
===================================
|
| 5 |
+
MANDATORY
|
| 6 |
+
- Before submitting, ensure the following variables are defined in your environment configuration:
|
| 7 |
+
API_BASE_URL The API endpoint for the LLM.
|
| 8 |
+
MODEL_NAME The model identifier to use for inference.
|
| 9 |
+
HF_TOKEN Your Hugging Face / API key.
|
| 10 |
+
|
| 11 |
+
- Defaults are set only for API_BASE_URL and MODEL_NAME
|
| 12 |
+
(and should reflect your active inference setup):
|
| 13 |
+
API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
|
| 14 |
+
MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
|
| 15 |
+
|
| 16 |
+
- The inference script must be named `inference.py` and placed in the root directory of the project
|
| 17 |
+
- Participants must use OpenAI Client for all LLM calls using above variables
|
| 18 |
+
|
| 19 |
+
STDOUT FORMAT
|
| 20 |
+
- The script must emit exactly three line types to stdout, in this order:
|
| 21 |
+
|
| 22 |
+
[START] task=<task_name> env=<benchmark> model=<model_name>
|
| 23 |
+
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
|
| 24 |
+
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
|
| 25 |
+
|
| 26 |
+
Rules:
|
| 27 |
+
- One [START] line at episode begin.
|
| 28 |
+
- One [STEP] line per step, immediately after env.step() returns.
|
| 29 |
+
- One [END] line after the episode, always emitted (even on exception).
|
| 30 |
+
- reward and rewards are formatted to 2 decimal places.
|
| 31 |
+
- done and success are lowercase booleans: true or false.
|
| 32 |
+
- error is the raw last_action_error string, or null if none.
|
| 33 |
+
- All fields on a single line with no newlines within a line.
|
| 34 |
+
- Each task should return score in [0, 1]
|
| 35 |
+
|
| 36 |
+
Example:
|
| 37 |
+
[START] task=task_1_easy env=code_review model=Qwen/Qwen2.5-72B-Instruct
|
| 38 |
+
[STEP] step=1 action=review(comments=6,submit=true) reward=0.85 done=true error=null
|
| 39 |
+
[END] success=true steps=1 score=0.850 rewards=0.85
|
| 40 |
+
"""
|
| 41 |
+
|
| 42 |
+
from __future__ import annotations
|
| 43 |
+
|
| 44 |
+
import asyncio
|
| 45 |
+
import json
|
| 46 |
+
import os
|
| 47 |
+
import sys
|
| 48 |
+
import textwrap
|
| 49 |
+
from typing import Any, Dict, List, Optional
|
| 50 |
+
|
| 51 |
+
from openai import OpenAI
|
| 52 |
+
|
| 53 |
+
# Ensure project root is on the import path
|
| 54 |
+
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
| 55 |
+
|
| 56 |
+
from env.environment import CodeReviewEnv, TASK_SPECS
|
| 57 |
+
from env.models import Action, ReviewComment, ReviewCategory, Severity
|
| 58 |
+
|
| 59 |
+
# ---------------------------------------------------------------------------
|
| 60 |
+
# Configuration
|
| 61 |
+
# ---------------------------------------------------------------------------
|
| 62 |
+
|
| 63 |
+
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME") # If using docker image
|
| 64 |
+
HF_TOKEN = os.getenv("HF_TOKEN")
|
| 65 |
+
|
| 66 |
+
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
|
| 67 |
+
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
|
| 68 |
+
BENCHMARK = os.getenv("BENCHMARK", "code_review")
|
| 69 |
+
TASK_NAME = os.getenv("CODE_REVIEW_TASK", "all") # "all" or a specific task id
|
| 70 |
+
TASKS = ["task_1_easy", "task_2_medium", "task_3_hard"]
|
| 71 |
+
TEMPERATURE = 0.2
|
| 72 |
+
MAX_TOKENS = 2048
|
| 73 |
+
|
| 74 |
+
# ---------------------------------------------------------------------------
|
| 75 |
+
# System prompt for code review
|
| 76 |
+
# ---------------------------------------------------------------------------
|
| 77 |
+
|
| 78 |
+
SYSTEM_PROMPT = textwrap.dedent("""
|
| 79 |
+
You are an expert Python code reviewer.
|
| 80 |
+
You will be given a code snippet along with review instructions.
|
| 81 |
+
Your job is to produce a JSON action object that identifies issues in the code.
|
| 82 |
+
|
| 83 |
+
The JSON object you return must match this schema exactly:
|
| 84 |
+
{
|
| 85 |
+
"comments": [
|
| 86 |
+
{
|
| 87 |
+
"line": <int or null>,
|
| 88 |
+
"category": <"bug"|"security"|"performance"|"style"|"documentation">,
|
| 89 |
+
"severity": <"low"|"medium"|"high"|"critical">,
|
| 90 |
+
"message": "<clear description of the issue>",
|
| 91 |
+
"suggestion": "<optional fix>"
|
| 92 |
+
}
|
| 93 |
+
],
|
| 94 |
+
"summary": "<overall assessment – required for hard tasks, optional otherwise>",
|
| 95 |
+
"submit": true
|
| 96 |
+
}
|
| 97 |
+
|
| 98 |
+
Rules:
|
| 99 |
+
- Only flag genuine issues. Do not fabricate problems.
|
| 100 |
+
- Be precise about line numbers (1-indexed from the code).
|
| 101 |
+
- Match the categories listed in the instructions.
|
| 102 |
+
- Always set "submit": true when you believe your review is complete.
|
| 103 |
+
- Return ONLY the JSON object. No markdown, no explanations.
|
| 104 |
+
""").strip()
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
# ---------------------------------------------------------------------------
|
| 108 |
+
# Logging helpers (exact STDOUT format from spec)
|
| 109 |
+
# ---------------------------------------------------------------------------
|
| 110 |
+
|
| 111 |
+
def log_start(task: str, env: str, model: str) -> None:
    """Emit the mandatory [START] line at the beginning of an episode."""
    line = f"[START] task={task} env={env} model={model}"
    print(line, flush=True)
|
| 113 |
+
|
| 114 |
+
|
| 115 |
+
def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    """Emit one [STEP] line immediately after env.step() returns.

    ``reward`` is rendered at 2 decimal places, ``done`` as a lowercase
    boolean, and a missing error as the literal string ``null``.
    """
    fields = (
        f"step={step}",
        f"action={action}",
        f"reward={reward:.2f}",
        f"done={str(done).lower()}",
        f"error={error if error else 'null'}",
    )
    print("[STEP] " + " ".join(fields), flush=True)
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    """Emit the final [END] line; rewards are comma-joined at 2 dp."""
    joined = ",".join(f"{r:.2f}" for r in rewards)
    line = (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.3f} rewards={joined}")
    print(line, flush=True)
|
| 130 |
+
|
| 131 |
+
|
| 132 |
+
# ---------------------------------------------------------------------------
|
| 133 |
+
# LLM interaction
|
| 134 |
+
# ---------------------------------------------------------------------------
|
| 135 |
+
|
| 136 |
+
def build_user_message(obs_dict: dict) -> str:
    """Build the LLM user prompt from an observation dict.

    Includes the task instructions, the snippet with 1-indexed line
    numbers inside a fenced code block, and (when present) a recap of
    previously submitted comments.
    """
    snippet = obs_dict["snippet"]
    history = obs_dict.get("previous_comments", [])

    # Right-aligned, 1-indexed line numbers for the snippet source.
    numbered = "\n".join(
        f"{lineno:3d} {text}"
        for lineno, text in enumerate(snippet["source"].splitlines(), start=1)
    )

    msg = "\n".join([
        "",
        obs_dict["instructions"],
        "",
        f"### File: {snippet['file_name']}",
        "```python",
        numbered,
        "```",
        "",
    ])

    if history:
        msg += f"\n### Your previous comments ({len(history)} so far):\n"
        for item in history:
            msg += " - L{} [{}] {}\n".format(
                item.get("line", "?"),
                item.get("category", "?"),
                item.get("message", "")[:80],
            )

    return msg.strip()
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
def get_model_action(client: OpenAI, obs_dict: dict) -> dict:
    """Call the LLM and return a parsed action dict.

    Falls back to an empty submitted review on any request or JSON-parse
    failure so that an episode can always proceed.
    """
    prompt = build_user_message(obs_dict)

    try:
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            response_format={"type": "json_object"},
            stream=False,
        )
        payload = (response.choices[0].message.content or "{}").strip()
        return json.loads(payload)
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return {"comments": [], "submit": True}
|
| 189 |
+
|
| 190 |
+
|
| 191 |
+
# ---------------------------------------------------------------------------
|
| 192 |
+
# Action parsing
|
| 193 |
+
# ---------------------------------------------------------------------------
|
| 194 |
+
|
| 195 |
+
def parse_action(action_dict: dict) -> Action:
    """Convert a raw action dict into a typed Action model.

    Malformed comment entries are dropped rather than failing the step.
    """
    parsed: List[ReviewComment] = []
    for raw in action_dict.get("comments", []):
        try:
            parsed.append(
                ReviewComment(
                    line=raw.get("line"),
                    category=ReviewCategory(raw.get("category", "bug")),
                    severity=Severity(raw.get("severity", "medium")),
                    message=raw.get("message", ""),
                    suggestion=raw.get("suggestion"),
                )
            )
        except Exception:
            continue  # skip malformed comments

    return Action(
        comments=parsed,
        summary=action_dict.get("summary"),
        submit=action_dict.get("submit", True),
    )
|
| 215 |
+
|
| 216 |
+
|
| 217 |
+
def format_action_str(action_dict: dict) -> str:
    """Format an action dict into a compact string for STEP logging."""
    comment_count = len(action_dict.get("comments", []))
    submit_flag = str(action_dict.get("submit", False)).lower()
    return f"review(comments={comment_count},submit={submit_flag})"
|
| 222 |
+
|
| 223 |
+
|
| 224 |
+
# ---------------------------------------------------------------------------
|
| 225 |
+
# Task runner
|
| 226 |
+
# ---------------------------------------------------------------------------
|
| 227 |
+
|
| 228 |
+
async def run_task(task_id: str, client: OpenAI) -> dict:
    """Run a single code-review task episode and return its results.

    Emits exactly one [START] line, one [STEP] line per env.step(), and
    always one [END] line (even on exception). The returned dict holds
    the task id, the clamped final score, and a success flag.
    """
    env = CodeReviewEnv(task_id=task_id)
    obs = env.reset()

    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)

    try:
        for step_no in range(1, env.spec.max_steps + 1):
            observation = obs.model_dump()

            # Query the model and convert its JSON into a typed Action.
            raw_action = get_model_action(client, observation)
            typed_action = parse_action(raw_action)

            result = env.step(typed_action)
            step_reward = result.reward.value

            rewards.append(step_reward)
            steps_taken = step_no

            # NOTE(review): error is always logged as null; if the env
            # exposes a last_action_error it could be surfaced here.
            log_step(step=step_no, action=format_action_str(raw_action),
                     reward=step_reward, done=result.done, error=None)

            obs = result.observation

            if result.done:
                score = result.info.get("grader", {}).get("score", 0.0)
                success = score >= env.spec.passing_threshold
                break

    except Exception as exc:
        print(f"[DEBUG] Error during task {task_id}: {exc}", flush=True)

    finally:
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)

    return {"task_id": task_id, "score": score, "success": success}
|
| 277 |
+
|
| 278 |
+
|
| 279 |
+
# ---------------------------------------------------------------------------
|
| 280 |
+
# Main
|
| 281 |
+
# ---------------------------------------------------------------------------
|
| 282 |
+
|
| 283 |
+
async def main() -> None:
    """Run every requested task sequentially and print a debug summary."""
    client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)

    selected = TASKS if TASK_NAME == "all" else [TASK_NAME]

    results: List[dict] = []
    for task_id in selected:
        outcome = await run_task(task_id, client)
        results.append(outcome)

    # Debug summary on stderr (not part of the stdout spec).
    avg = sum(r["score"] for r in results) / len(results) if results else 0.0
    passed = sum(1 for r in results if r["success"])
    print(
        f"\n[SUMMARY] tasks={len(results)} passed={passed} avg_score={avg:.3f}",
        file=sys.stderr,
        flush=True,
    )


if __name__ == "__main__":
    asyncio.run(main())
|
openenv-code-review.tar.gz
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d389c58d2d84185dae21c86ccda3422c3bec52ab239859e93b90f721e5ce7fe1
|
| 3 |
+
size 50132
|
openenv.yaml
ADDED
|
@@ -0,0 +1,163 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: code-review-env
|
| 2 |
+
version: "1.0.0"
|
| 3 |
+
description: >
|
| 4 |
+
An OpenEnv-compliant AI training environment that simulates professional
|
| 5 |
+
Python code review. Agents learn to identify bugs, security vulnerabilities,
|
| 6 |
+
performance issues, style problems, and documentation gaps across three
|
| 7 |
+
progressively harder tasks.
|
| 8 |
+
|
| 9 |
+
tags:
|
| 10 |
+
- openenv
|
| 11 |
+
- code-review
|
| 12 |
+
- python
|
| 13 |
+
- security
|
| 14 |
+
- software-engineering
|
| 15 |
+
|
| 16 |
+
author: imaginephoenix / rawgenn.tech
|
| 17 |
+
license: MIT
|
| 18 |
+
|
| 19 |
+
environment:
|
| 20 |
+
class: CodeReviewEnv
|
| 21 |
+
module: env.environment
|
| 22 |
+
entrypoint: app.py
|
| 23 |
+
framework: fastapi
|
| 24 |
+
|
| 25 |
+
observation_space:
|
| 26 |
+
type: object
|
| 27 |
+
description: >
|
| 28 |
+
What the agent sees each step. Contains the code snippet to review,
|
| 29 |
+
task instructions, all previously submitted comments, and optional
|
| 30 |
+
feedback from the last step.
|
| 31 |
+
fields:
|
| 32 |
+
task_id:
|
| 33 |
+
type: string
|
| 34 |
+
description: Identifier of the active task
|
| 35 |
+
step:
|
| 36 |
+
type: integer
|
| 37 |
+
description: Current step number (0-indexed)
|
| 38 |
+
snippet:
|
| 39 |
+
type: object
|
| 40 |
+
description: Python source code to review
|
| 41 |
+
fields:
|
| 42 |
+
file_name: { type: string }
|
| 43 |
+
source: { type: string, description: "Full Python source with line numbers" }
|
| 44 |
+
language: { type: string, const: "python" }
|
| 45 |
+
instructions:
|
| 46 |
+
type: string
|
| 47 |
+
description: Review instructions and scope for this task
|
| 48 |
+
previous_comments:
|
| 49 |
+
type: array
|
| 50 |
+
description: All review comments submitted in prior steps
|
| 51 |
+
feedback:
|
| 52 |
+
type: string
|
| 53 |
+
nullable: true
|
| 54 |
+
description: Environment feedback on the most recent action
|
| 55 |
+
done:
|
| 56 |
+
type: boolean
|
| 57 |
+
|
| 58 |
+
action_space:
|
| 59 |
+
type: object
|
| 60 |
+
description: >
|
| 61 |
+
What the agent submits. A list of review comments (each with line,
|
| 62 |
+
category, severity, message, optional suggestion) plus an optional
|
| 63 |
+
overall summary and a submit flag.
|
| 64 |
+
fields:
|
| 65 |
+
comments:
|
| 66 |
+
type: array
|
| 67 |
+
items:
|
| 68 |
+
type: object
|
| 69 |
+
fields:
|
| 70 |
+
line: { type: integer, nullable: true, description: "1-indexed line number" }
|
| 71 |
+
category:
|
| 72 |
+
type: string
|
| 73 |
+
enum: [bug, security, performance, style, documentation]
|
| 74 |
+
severity:
|
| 75 |
+
type: string
|
| 76 |
+
enum: [low, medium, high, critical]
|
| 77 |
+
message: { type: string, minLength: 5, maxLength: 500 }
|
| 78 |
+
suggestion: { type: string, nullable: true, maxLength: 500 }
|
| 79 |
+
summary:
|
| 80 |
+
type: string
|
| 81 |
+
nullable: true
|
| 82 |
+
description: "Required for task_3_hard; optional otherwise"
|
| 83 |
+
submit:
|
| 84 |
+
type: boolean
|
| 85 |
+
description: "Set true to finalise the review and trigger the grader"
|
| 86 |
+
|
| 87 |
+
reward:
|
| 88 |
+
type: float
|
| 89 |
+
range: [-1.0, 1.0]
|
| 90 |
+
description: >
|
| 91 |
+
Shaped reward with partial progress signals. Incremental positive reward
|
| 92 |
+
for each new valid comment added (proportional to issue severity). On
|
| 93 |
+
submit: final grader score mapped to [-0.2, 1.0]. Penalties for false
|
| 94 |
+
positives, missed criticals, and spamming low-quality comments.
|
| 95 |
+
|
| 96 |
+
tasks:
|
| 97 |
+
- id: task_1_easy
|
| 98 |
+
title: "Bug Detection & Style Review"
|
| 99 |
+
difficulty: easy
|
| 100 |
+
categories: [bug, style]
|
| 101 |
+
max_steps: 5
|
| 102 |
+
passing_threshold: 0.55
|
| 103 |
+
description: >
|
| 104 |
+
Review calculator.py (31 lines) for division-by-zero bugs, off-by-one
|
| 105 |
+
errors, empty-collection crashes, and Python style anti-patterns.
|
| 106 |
+
|
| 107 |
+
- id: task_2_medium
|
| 108 |
+
title: "Security & Performance Audit"
|
| 109 |
+
difficulty: medium
|
| 110 |
+
categories: [security, performance]
|
| 111 |
+
max_steps: 7
|
| 112 |
+
passing_threshold: 0.60
|
| 113 |
+
description: >
|
| 114 |
+
Audit user_service.py (55 lines) for SQL injection, broken MD5 password
|
| 115 |
+
hashing, unbounded DB queries, and connection churn. Missed critical
|
| 116 |
+
security issues carry heavy penalties.
|
| 117 |
+
|
| 118 |
+
- id: task_3_hard
|
| 119 |
+
title: "Comprehensive Code Review"
|
| 120 |
+
difficulty: hard
|
| 121 |
+
categories: [bug, security, performance, style, documentation]
|
| 122 |
+
max_steps: 10
|
| 123 |
+
passing_threshold: 0.65
|
| 124 |
+
description: >
|
| 125 |
+
Full production-grade review of data_pipeline.py (49 lines). Covers
|
| 126 |
+
all five categories including shell injection, unsafe pickle
|
| 127 |
+
deserialization, ZeroDivisionError, and missing docstrings. An overall
|
| 128 |
+
written summary is required.
|
| 129 |
+
|
| 130 |
+
api_endpoints:
|
| 131 |
+
- path: /reset
|
| 132 |
+
method: POST
|
| 133 |
+
description: Start or restart an episode
|
| 134 |
+
- path: /step
|
| 135 |
+
method: POST
|
| 136 |
+
description: Submit an action
|
| 137 |
+
- path: /state
|
| 138 |
+
method: GET
|
| 139 |
+
description: Get full serialisable state
|
| 140 |
+
- path: /tasks
|
| 141 |
+
method: GET
|
| 142 |
+
description: List all available tasks
|
| 143 |
+
- path: /health
|
| 144 |
+
method: GET
|
| 145 |
+
description: Health check
|
| 146 |
+
|
| 147 |
+
baseline:
|
| 148 |
+
model: gpt-4o
|
| 149 |
+
script: baseline_agent.py
|
| 150 |
+
expected_scores:
|
| 151 |
+
task_1_easy: ~0.75
|
| 152 |
+
task_2_medium: ~0.65
|
| 153 |
+
task_3_hard: ~0.55
|
| 154 |
+
|
| 155 |
+
docker:
|
| 156 |
+
base_image: python:3.11-slim
|
| 157 |
+
port: 7860
|
| 158 |
+
build: docker build -t code-review-env .
|
| 159 |
+
run: docker run -p 7860:7860 code-review-env
|
| 160 |
+
|
| 161 |
+
huggingface:
|
| 162 |
+
space_sdk: docker
|
| 163 |
+
tags: [openenv, code-review, ai-agent, evaluation]
|
pyproject.toml
ADDED
|
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[project]
|
| 2 |
+
name = "code-review-env"
|
| 3 |
+
version = "1.0.0"
|
| 4 |
+
description = "An OpenEnv-compliant AI training environment that simulates professional Python code review."
|
| 5 |
+
readme = "README.md"
|
| 6 |
+
license = { text = "MIT" }
|
| 7 |
+
requires-python = ">=3.11"
|
| 8 |
+
|
| 9 |
+
dependencies = [
|
| 10 |
+
"fastapi>=0.104.0",
|
| 11 |
+
"uvicorn[standard]>=0.24.0",
|
| 12 |
+
"pydantic>=2.5.0",
|
| 13 |
+
"requests>=2.31.0",
|
| 14 |
+
"openai>=1.6.0",
|
| 15 |
+
"openenv-core>=0.2.0",
|
| 16 |
+
]
|
| 17 |
+
|
| 18 |
+
[project.optional-dependencies]
|
| 19 |
+
dev = [
|
| 20 |
+
"pytest>=7.4.0",
|
| 21 |
+
]
|
| 22 |
+
|
| 23 |
+
[project.scripts]
|
| 24 |
+
server = "server.app:main"
|
| 25 |
+
|
| 26 |
+
[build-system]
|
| 27 |
+
requires = ["setuptools>=68.0", "wheel"]
|
| 28 |
+
build-backend = "setuptools.build_meta"
|
| 29 |
+
|
| 30 |
+
[tool.setuptools.packages.find]
|
| 31 |
+
include = ["env*", "corpus*", "graders*", "server*"]
|
requirements.txt
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
fastapi>=0.104.0
|
| 2 |
+
uvicorn[standard]>=0.24.0
|
| 3 |
+
pydantic>=2.5.0
|
| 4 |
+
requests>=2.31.0
|
| 5 |
+
openai>=1.6.0
|
| 6 |
+
pytest>=7.4.0
|
| 7 |
+
openenv-core>=0.2.0
|
server/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# server package
|
server/app.py
ADDED
|
@@ -0,0 +1,34 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Server entry point for CodeReview OpenEnv.
|
| 3 |
+
|
| 4 |
+
This module provides the main() entry point used by:
|
| 5 |
+
- pyproject.toml [project.scripts] server = "server.app:main"
|
| 6 |
+
- openenv serve
|
| 7 |
+
- uv run server
|
| 8 |
+
|
| 9 |
+
It imports and runs the FastAPI app defined in the root app.py.
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import sys
|
| 15 |
+
import os
|
| 16 |
+
|
| 17 |
+
# Ensure project root is importable
|
| 18 |
+
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def main(host: str = "0.0.0.0", port: int = 7860, workers: int = 1) -> None:
|
| 22 |
+
"""Start the CodeReview OpenEnv server."""
|
| 23 |
+
import uvicorn
|
| 24 |
+
|
| 25 |
+
uvicorn.run(
|
| 26 |
+
"app:app",
|
| 27 |
+
host=host,
|
| 28 |
+
port=int(os.environ.get("PORT", port)),
|
| 29 |
+
workers=workers,
|
| 30 |
+
)
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
if __name__ == "__main__":
|
| 34 |
+
main()
|
templates/index.html
ADDED
|
@@ -0,0 +1,807 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="UTF-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 6 |
+
<title>CodeReview OpenEnv</title>
|
| 7 |
+
<!-- Google Fonts for modern typography -->
|
| 8 |
+
<link href="https://fonts.googleapis.com/css2?family=Outfit:wght@300;400;600;700&family=Fira+Code:wght@400;500&display=swap" rel="stylesheet">
|
| 9 |
+
<!-- PrismJS for code syntax highlighting -->
|
| 10 |
+
<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css" rel="stylesheet" />
|
| 11 |
+
<style>
|
| 12 |
+
:root {
|
| 13 |
+
--bg-color: #0b1120;
|
| 14 |
+
--surface-color: rgba(30, 41, 59, 0.7);
|
| 15 |
+
--surface-border: rgba(255, 255, 255, 0.08);
|
| 16 |
+
--text-primary: #f8fafc;
|
| 17 |
+
--text-secondary: #94a3b8;
|
| 18 |
+
--primary-accent: #3b82f6;
|
| 19 |
+
--primary-glow: rgba(59, 130, 246, 0.5);
|
| 20 |
+
--secondary-accent: #8b5cf6;
|
| 21 |
+
--danger: #ef4444;
|
| 22 |
+
--success: #10b981;
|
| 23 |
+
--warning: #f59e0b;
|
| 24 |
+
}
|
| 25 |
+
|
| 26 |
+
body {
|
| 27 |
+
margin: 0;
|
| 28 |
+
padding: 0;
|
| 29 |
+
font-family: 'Outfit', sans-serif;
|
| 30 |
+
background-color: var(--bg-color);
|
| 31 |
+
background-image:
|
| 32 |
+
radial-gradient(at 0% 0%, rgba(59, 130, 246, 0.15) 0px, transparent 50%),
|
| 33 |
+
radial-gradient(at 100% 100%, rgba(139, 92, 246, 0.15) 0px, transparent 50%);
|
| 34 |
+
background-attachment: fixed;
|
| 35 |
+
color: var(--text-primary);
|
| 36 |
+
min-height: 100vh;
|
| 37 |
+
display: flex;
|
| 38 |
+
flex-direction: column;
|
| 39 |
+
overflow-x: hidden;
|
| 40 |
+
}
|
| 41 |
+
|
| 42 |
+
/* Glassmorphism Classes */
|
| 43 |
+
.glass-panel {
|
| 44 |
+
background: var(--surface-color);
|
| 45 |
+
backdrop-filter: blur(12px);
|
| 46 |
+
-webkit-backdrop-filter: blur(12px);
|
| 47 |
+
border: 1px solid var(--surface-border);
|
| 48 |
+
border-radius: 16px;
|
| 49 |
+
box-shadow: 0 4px 30px rgba(0, 0, 0, 0.1);
|
| 50 |
+
}
|
| 51 |
+
|
| 52 |
+
header {
|
| 53 |
+
padding: 20px 40px;
|
| 54 |
+
display: flex;
|
| 55 |
+
justify-content: space-between;
|
| 56 |
+
align-items: center;
|
| 57 |
+
border-bottom: 1px solid var(--surface-border);
|
| 58 |
+
background: rgba(11, 17, 32, 0.8);
|
| 59 |
+
backdrop-filter: blur(8px);
|
| 60 |
+
position: sticky;
|
| 61 |
+
top: 0;
|
| 62 |
+
z-index: 100;
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
h1 {
|
| 66 |
+
margin: 0;
|
| 67 |
+
font-size: 1.5rem;
|
| 68 |
+
font-weight: 700;
|
| 69 |
+
background: linear-gradient(135deg, #fff, #94a3b8);
|
| 70 |
+
-webkit-background-clip: text;
|
| 71 |
+
-webkit-text-fill-color: transparent;
|
| 72 |
+
display: flex;
|
| 73 |
+
align-items: center;
|
| 74 |
+
gap: 10px;
|
| 75 |
+
}
|
| 76 |
+
|
| 77 |
+
.controls {
|
| 78 |
+
display: flex;
|
| 79 |
+
gap: 15px;
|
| 80 |
+
align-items: center;
|
| 81 |
+
}
|
| 82 |
+
|
| 83 |
+
select, button, input, textarea {
|
| 84 |
+
font-family: 'Outfit', sans-serif;
|
| 85 |
+
outline: none;
|
| 86 |
+
transition: all 0.3s ease;
|
| 87 |
+
}
|
| 88 |
+
|
| 89 |
+
select {
|
| 90 |
+
padding: 10px 16px;
|
| 91 |
+
background: rgba(30, 41, 59, 0.8);
|
| 92 |
+
color: white;
|
| 93 |
+
border: 1px solid var(--surface-border);
|
| 94 |
+
border-radius: 8px;
|
| 95 |
+
font-size: 0.95rem;
|
| 96 |
+
cursor: pointer;
|
| 97 |
+
}
|
| 98 |
+
|
| 99 |
+
select:focus {
|
| 100 |
+
border-color: var(--primary-accent);
|
| 101 |
+
box-shadow: 0 0 0 2px var(--primary-glow);
|
| 102 |
+
}
|
| 103 |
+
|
| 104 |
+
button {
|
| 105 |
+
padding: 10px 20px;
|
| 106 |
+
border: none;
|
| 107 |
+
border-radius: 8px;
|
| 108 |
+
font-weight: 600;
|
| 109 |
+
cursor: pointer;
|
| 110 |
+
display: inline-flex;
|
| 111 |
+
align-items: center;
|
| 112 |
+
gap: 8px;
|
| 113 |
+
}
|
| 114 |
+
|
| 115 |
+
.btn-primary {
|
| 116 |
+
background: linear-gradient(135deg, var(--primary-accent), var(--secondary-accent));
|
| 117 |
+
color: white;
|
| 118 |
+
box-shadow: 0 4px 15px var(--primary-glow);
|
| 119 |
+
}
|
| 120 |
+
|
| 121 |
+
.btn-primary:hover {
|
| 122 |
+
transform: translateY(-2px);
|
| 123 |
+
box-shadow: 0 6px 20px var(--primary-glow);
|
| 124 |
+
}
|
| 125 |
+
|
| 126 |
+
.btn-primary:active {
|
| 127 |
+
transform: translateY(1px);
|
| 128 |
+
}
|
| 129 |
+
|
| 130 |
+
.btn-outline {
|
| 131 |
+
background: transparent;
|
| 132 |
+
color: var(--text-primary);
|
| 133 |
+
border: 1px solid var(--surface-border);
|
| 134 |
+
}
|
| 135 |
+
|
| 136 |
+
.btn-outline:hover {
|
| 137 |
+
background: rgba(255, 255, 255, 0.05);
|
| 138 |
+
}
|
| 139 |
+
|
| 140 |
+
main {
|
| 141 |
+
display: flex;
|
| 142 |
+
flex: 1;
|
| 143 |
+
padding: 20px;
|
| 144 |
+
gap: 20px;
|
| 145 |
+
height: calc(100vh - 100px);
|
| 146 |
+
box-sizing: border-box;
|
| 147 |
+
}
|
| 148 |
+
|
| 149 |
+
/* Loading Overlay */
|
| 150 |
+
#loader {
|
| 151 |
+
position: fixed;
|
| 152 |
+
top: 0; left: 0; right: 0; bottom: 0;
|
| 153 |
+
background: var(--bg-color);
|
| 154 |
+
display: flex;
|
| 155 |
+
justify-content: center;
|
| 156 |
+
align-items: center;
|
| 157 |
+
z-index: 1000;
|
| 158 |
+
transition: opacity 0.5s ease;
|
| 159 |
+
}
|
| 160 |
+
.spinner {
|
| 161 |
+
width: 50px;
|
| 162 |
+
height: 50px;
|
| 163 |
+
border: 3px solid rgba(255,255,255,0.1);
|
| 164 |
+
border-radius: 50%;
|
| 165 |
+
border-top-color: var(--primary-accent);
|
| 166 |
+
animation: spin 1s ease-in-out infinite;
|
| 167 |
+
}
|
| 168 |
+
@keyframes spin { to { transform: rotate(360deg); } }
|
| 169 |
+
|
| 170 |
+
/* Left Pane - Code Snippet */
|
| 171 |
+
.pane-left {
|
| 172 |
+
flex: 1.2;
|
| 173 |
+
display: flex;
|
| 174 |
+
flex-direction: column;
|
| 175 |
+
overflow: hidden;
|
| 176 |
+
animation: slideInLeft 0.5s cubic-bezier(0.16, 1, 0.3, 1);
|
| 177 |
+
}
|
| 178 |
+
|
| 179 |
+
.pane-header {
|
| 180 |
+
padding: 15px 20px;
|
| 181 |
+
border-bottom: 1px solid var(--surface-border);
|
| 182 |
+
font-weight: 600;
|
| 183 |
+
display: flex;
|
| 184 |
+
justify-content: space-between;
|
| 185 |
+
align-items: center;
|
| 186 |
+
font-size: 0.9rem;
|
| 187 |
+
color: var(--text-secondary);
|
| 188 |
+
}
|
| 189 |
+
|
| 190 |
+
.code-container {
|
| 191 |
+
flex: 1;
|
| 192 |
+
overflow: auto;
|
| 193 |
+
border-bottom-left-radius: 16px;
|
| 194 |
+
border-bottom-right-radius: 16px;
|
| 195 |
+
background: rgba(0, 0, 0, 0.2);
|
| 196 |
+
padding: 0;
|
| 197 |
+
margin: 0;
|
| 198 |
+
}
|
| 199 |
+
|
| 200 |
+
pre[class*="language-"] {
|
| 201 |
+
margin: 0;
|
| 202 |
+
padding: 20px;
|
| 203 |
+
background: transparent !important;
|
| 204 |
+
font-family: 'Fira Code', monospace;
|
| 205 |
+
font-size: 0.9rem;
|
| 206 |
+
line-height: 1.5;
|
| 207 |
+
}
|
| 208 |
+
|
| 209 |
+
/* Right Pane - Instructions & Actions */
|
| 210 |
+
.pane-right {
|
| 211 |
+
flex: 0.8;
|
| 212 |
+
display: flex;
|
| 213 |
+
flex-direction: column;
|
| 214 |
+
gap: 20px;
|
| 215 |
+
overflow-y: auto;
|
| 216 |
+
animation: slideInRight 0.5s cubic-bezier(0.16, 1, 0.3, 1);
|
| 217 |
+
}
|
| 218 |
+
|
| 219 |
+
.card {
|
| 220 |
+
padding: 20px;
|
| 221 |
+
}
|
| 222 |
+
|
| 223 |
+
.card-title {
|
| 224 |
+
font-size: 1.1rem;
|
| 225 |
+
font-weight: 600;
|
| 226 |
+
margin-top: 0;
|
| 227 |
+
margin-bottom: 15px;
|
| 228 |
+
color: var(--text-primary);
|
| 229 |
+
display: flex;
|
| 230 |
+
align-items: center;
|
| 231 |
+
gap: 10px;
|
| 232 |
+
}
|
| 233 |
+
|
| 234 |
+
.instructions-text {
|
| 235 |
+
font-size: 0.95rem;
|
| 236 |
+
color: var(--text-secondary);
|
| 237 |
+
line-height: 1.6;
|
| 238 |
+
white-space: pre-wrap;
|
| 239 |
+
}
|
| 240 |
+
|
| 241 |
+
.badge {
|
| 242 |
+
display: inline-block;
|
| 243 |
+
padding: 4px 10px;
|
| 244 |
+
border-radius: 20px;
|
| 245 |
+
font-size: 0.75rem;
|
| 246 |
+
font-weight: 600;
|
| 247 |
+
background: rgba(255, 255, 255, 0.1);
|
| 248 |
+
text-transform: uppercase;
|
| 249 |
+
letter-spacing: 0.5px;
|
| 250 |
+
}
|
| 251 |
+
|
| 252 |
+
.badge.easy { color: var(--success); background: rgba(16, 185, 129, 0.1); }
|
| 253 |
+
.badge.medium { color: var(--warning); background: rgba(245, 158, 11, 0.1); }
|
| 254 |
+
.badge.hard { color: var(--danger); background: rgba(239, 68, 68, 0.1); }
|
| 255 |
+
|
| 256 |
+
/* Comments Form */
|
| 257 |
+
.form-group {
|
| 258 |
+
margin-bottom: 15px;
|
| 259 |
+
}
|
| 260 |
+
|
| 261 |
+
.form-group label {
|
| 262 |
+
display: block;
|
| 263 |
+
margin-bottom: 8px;
|
| 264 |
+
font-size: 0.85rem;
|
| 265 |
+
color: var(--text-secondary);
|
| 266 |
+
font-weight: 500;
|
| 267 |
+
}
|
| 268 |
+
|
| 269 |
+
.form-row {
|
| 270 |
+
display: flex;
|
| 271 |
+
gap: 15px;
|
| 272 |
+
}
|
| 273 |
+
|
| 274 |
+
.form-row > div {
|
| 275 |
+
flex: 1;
|
| 276 |
+
}
|
| 277 |
+
|
| 278 |
+
input[type="number"], input[type="text"], textarea {
|
| 279 |
+
width: 100%;
|
| 280 |
+
padding: 10px 12px;
|
| 281 |
+
background: rgba(15, 23, 42, 0.6);
|
| 282 |
+
border: 1px solid var(--surface-border);
|
| 283 |
+
border-radius: 8px;
|
| 284 |
+
color: white;
|
| 285 |
+
box-sizing: border-box;
|
| 286 |
+
resize: vertical;
|
| 287 |
+
}
|
| 288 |
+
|
| 289 |
+
input:focus, textarea:focus {
|
| 290 |
+
border-color: var(--primary-accent);
|
| 291 |
+
box-shadow: 0 0 0 2px rgba(59, 130, 246, 0.2);
|
| 292 |
+
}
|
| 293 |
+
|
| 294 |
+
/* Staged Comments List */
|
| 295 |
+
.comments-list {
|
| 296 |
+
margin-top: 15px;
|
| 297 |
+
display: flex;
|
| 298 |
+
flex-direction: column;
|
| 299 |
+
gap: 10px;
|
| 300 |
+
}
|
| 301 |
+
|
| 302 |
+
.comment-item {
|
| 303 |
+
background: rgba(0, 0, 0, 0.2);
|
| 304 |
+
border-left: 3px solid var(--primary-accent);
|
| 305 |
+
padding: 12px 15px;
|
| 306 |
+
border-radius: 4px 8px 8px 4px;
|
| 307 |
+
font-size: 0.9rem;
|
| 308 |
+
animation: fadeIn 0.3s ease;
|
| 309 |
+
position: relative;
|
| 310 |
+
}
|
| 311 |
+
|
| 312 |
+
.comment-item .meta {
|
| 313 |
+
font-size: 0.8rem;
|
| 314 |
+
color: var(--text-secondary);
|
| 315 |
+
margin-bottom: 5px;
|
| 316 |
+
display: flex;
|
| 317 |
+
justify-content: space-between;
|
| 318 |
+
}
|
| 319 |
+
|
| 320 |
+
.comment-item .remove-btn {
|
| 321 |
+
position: absolute;
|
| 322 |
+
top: 10px;
|
| 323 |
+
right: 10px;
|
| 324 |
+
background: none;
|
| 325 |
+
border: none;
|
| 326 |
+
padding: 0;
|
| 327 |
+
color: var(--danger);
|
| 328 |
+
font-size: 1.2rem;
|
| 329 |
+
cursor: pointer;
|
| 330 |
+
box-shadow: none;
|
| 331 |
+
opacity: 0.6;
|
| 332 |
+
}
|
| 333 |
+
|
| 334 |
+
.comment-item .remove-btn:hover {
|
| 335 |
+
opacity: 1;
|
| 336 |
+
transform: scale(1.1);
|
| 337 |
+
}
|
| 338 |
+
|
| 339 |
+
textarea#summary {
|
| 340 |
+
height: 80px;
|
| 341 |
+
margin-top: 10px;
|
| 342 |
+
}
|
| 343 |
+
|
| 344 |
+
.submit-section {
|
| 345 |
+
display: flex;
|
| 346 |
+
justify-content: flex-end;
|
| 347 |
+
margin-top: 20px;
|
| 348 |
+
border-top: 1px solid var(--surface-border);
|
| 349 |
+
padding-top: 20px;
|
| 350 |
+
}
|
| 351 |
+
|
| 352 |
+
/* Modal */
|
| 353 |
+
#result-modal {
|
| 354 |
+
position: fixed;
|
| 355 |
+
top: 0; left: 0; right: 0; bottom: 0;
|
| 356 |
+
background: rgba(0,0,0,0.7);
|
| 357 |
+
backdrop-filter: blur(5px);
|
| 358 |
+
display: none;
|
| 359 |
+
justify-content: center;
|
| 360 |
+
align-items: center;
|
| 361 |
+
z-index: 2000;
|
| 362 |
+
}
|
| 363 |
+
|
| 364 |
+
.modal-content {
|
| 365 |
+
width: 100%;
|
| 366 |
+
max-width: 500px;
|
| 367 |
+
padding: 30px;
|
| 368 |
+
text-align: center;
|
| 369 |
+
animation: popIn 0.4s cubic-bezier(0.16, 1, 0.3, 1);
|
| 370 |
+
}
|
| 371 |
+
|
| 372 |
+
.score-circle {
|
| 373 |
+
width: 120px;
|
| 374 |
+
height: 120px;
|
| 375 |
+
border-radius: 50%;
|
| 376 |
+
background: linear-gradient(135deg, var(--bg-color), rgba(30,41,59,1));
|
| 377 |
+
border: 4px solid var(--success);
|
| 378 |
+
display: flex;
|
| 379 |
+
flex-direction: column;
|
| 380 |
+
justify-content: center;
|
| 381 |
+
align-items: center;
|
| 382 |
+
margin: 0 auto 20px;
|
| 383 |
+
box-shadow: 0 0 30px rgba(16, 185, 129, 0.3);
|
| 384 |
+
position: relative;
|
| 385 |
+
}
|
| 386 |
+
|
| 387 |
+
.score-circle.failed {
|
| 388 |
+
border-color: var(--danger);
|
| 389 |
+
box-shadow: 0 0 30px rgba(239, 68, 68, 0.3);
|
| 390 |
+
}
|
| 391 |
+
|
| 392 |
+
.score-value {
|
| 393 |
+
font-size: 2.5rem;
|
| 394 |
+
font-weight: 700;
|
| 395 |
+
margin: 0;
|
| 396 |
+
line-height: 1;
|
| 397 |
+
}
|
| 398 |
+
|
| 399 |
+
.score-label {
|
| 400 |
+
font-size: 0.8rem;
|
| 401 |
+
color: var(--text-secondary);
|
| 402 |
+
text-transform: uppercase;
|
| 403 |
+
}
|
| 404 |
+
|
| 405 |
+
.reward-breakdown {
|
| 406 |
+
text-align: left;
|
| 407 |
+
margin-top: 25px;
|
| 408 |
+
background: rgba(0,0,0,0.2);
|
| 409 |
+
padding: 15px;
|
| 410 |
+
border-radius: 8px;
|
| 411 |
+
font-size: 0.9rem;
|
| 412 |
+
}
|
| 413 |
+
|
| 414 |
+
/* Animations */
|
| 415 |
+
@keyframes slideInLeft {
|
| 416 |
+
from { transform: translateX(-30px); opacity: 0; }
|
| 417 |
+
to { transform: translateX(0); opacity: 1; }
|
| 418 |
+
}
|
| 419 |
+
@keyframes slideInRight {
|
| 420 |
+
from { transform: translateX(30px); opacity: 0; }
|
| 421 |
+
to { transform: translateX(0); opacity: 1; }
|
| 422 |
+
}
|
| 423 |
+
@keyframes fadeIn {
|
| 424 |
+
from { opacity: 0; transform: translateY(10px); }
|
| 425 |
+
to { opacity: 1; transform: translateY(0); }
|
| 426 |
+
}
|
| 427 |
+
@keyframes popIn {
|
| 428 |
+
0% { transform: scale(0.9); opacity: 0; }
|
| 429 |
+
70% { transform: scale(1.02); opacity: 1; }
|
| 430 |
+
100% { transform: scale(1); opacity: 1; }
|
| 431 |
+
}
|
| 432 |
+
|
| 433 |
+
/* Highlight specific lines */
|
| 434 |
+
.line-highlight {
|
| 435 |
+
background: rgba(239, 68, 68, 0.2);
|
| 436 |
+
display: inline-block;
|
| 437 |
+
width: 100%;
|
| 438 |
+
}
|
| 439 |
+
|
| 440 |
+
</style>
|
| 441 |
+
</head>
|
| 442 |
+
<body>
|
| 443 |
+
|
| 444 |
+
<div id="loader">
|
| 445 |
+
<div class="spinner"></div>
|
| 446 |
+
</div>
|
| 447 |
+
|
| 448 |
+
<header>
|
| 449 |
+
<h1>
|
| 450 |
+
<svg width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
|
| 451 |
+
<polyline points="16 18 22 12 16 6"></polyline>
|
| 452 |
+
<polyline points="8 6 2 12 8 18"></polyline>
|
| 453 |
+
</svg>
|
| 454 |
+
CodeReview Hub
|
| 455 |
+
</h1>
|
| 456 |
+
<div class="controls">
|
| 457 |
+
<select id="task-select">
|
| 458 |
+
<option value="">Loading tasks...</option>
|
| 459 |
+
</select>
|
| 460 |
+
<button class="btn-primary" onclick="initSession()">
|
| 461 |
+
<svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M2 12h4l2-9 5 18 3-10 4 3"/></svg>
|
| 462 |
+
Load Environment
|
| 463 |
+
</button>
|
| 464 |
+
</div>
|
| 465 |
+
</header>
|
| 466 |
+
|
| 467 |
+
<main id="main-content" style="opacity: 0; transition: opacity 0.5s ease;">
|
| 468 |
+
|
| 469 |
+
<div class="glass-panel pane-left">
|
| 470 |
+
<div class="pane-header">
|
| 471 |
+
<span id="file-name">filename.py</span>
|
| 472 |
+
<span class="badge" id="task-difficulty">EASY</span>
|
| 473 |
+
</div>
|
| 474 |
+
<div class="code-container">
|
| 475 |
+
<pre><code class="language-python" id="code-block"># Load a task to see code</code></pre>
|
| 476 |
+
</div>
|
| 477 |
+
</div>
|
| 478 |
+
|
| 479 |
+
<div class="pane-right">
|
| 480 |
+
|
| 481 |
+
<div class="glass-panel card">
|
| 482 |
+
<h2 class="card-title">
|
| 483 |
+
<svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="var(--primary-accent)" stroke-width="2"><circle cx="12" cy="12" r="10"/><path d="M12 16v-4"/><path d="M12 8h.01"/></svg>
|
| 484 |
+
Review Instructions
|
| 485 |
+
</h2>
|
| 486 |
+
<div class="instructions-text" id="instructions">
|
| 487 |
+
Select a task and click 'Load Environment' to begin your code review session.
|
| 488 |
+
</div>
|
| 489 |
+
</div>
|
| 490 |
+
|
| 491 |
+
<div class="glass-panel card" id="review-panel" style="display: none;">
|
| 492 |
+
<h2 class="card-title">
|
| 493 |
+
<svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="var(--secondary-accent)" stroke-width="2"><path d="M21 15a2 2 0 0 1-2 2H7l-4 4V5a2 2 0 0 1 2-2h14a2 2 0 0 1 2 2z"/></svg>
|
| 494 |
+
Add Comment
|
| 495 |
+
</h2>
|
| 496 |
+
|
| 497 |
+
<div class="form-row">
|
| 498 |
+
<div class="form-group">
|
| 499 |
+
<label>Line # (optional)</label>
|
| 500 |
+
<input type="number" id="c-line" placeholder="e.g. 15" min="1">
|
| 501 |
+
</div>
|
| 502 |
+
<div class="form-group">
|
| 503 |
+
<label>Category</label>
|
| 504 |
+
<select id="c-category">
|
| 505 |
+
<option value="bug">Bug</option>
|
| 506 |
+
<option value="security">Security</option>
|
| 507 |
+
<option value="performance">Performance</option>
|
| 508 |
+
<option value="style">Style</option>
|
| 509 |
+
<option value="documentation">Documentation</option>
|
| 510 |
+
</select>
|
| 511 |
+
</div>
|
| 512 |
+
<div class="form-group">
|
| 513 |
+
<label>Severity</label>
|
| 514 |
+
<select id="c-severity">
|
| 515 |
+
<option value="low">Low</option>
|
| 516 |
+
<option value="medium" selected>Medium</option>
|
| 517 |
+
<option value="high">High</option>
|
| 518 |
+
<option value="critical">Critical</option>
|
| 519 |
+
</select>
|
| 520 |
+
</div>
|
| 521 |
+
</div>
|
| 522 |
+
|
| 523 |
+
<div class="form-group">
|
| 524 |
+
<label>Message</label>
|
| 525 |
+
<textarea id="c-message" rows="2" placeholder="Describe the issue..."></textarea>
|
| 526 |
+
</div>
|
| 527 |
+
|
| 528 |
+
<div class="form-group">
|
| 529 |
+
<label>Suggestion (optional)</label>
|
| 530 |
+
<input type="text" id="c-suggestion" placeholder="Proposed fix code...">
|
| 531 |
+
</div>
|
| 532 |
+
|
| 533 |
+
<button class="btn-outline" style="width: 100%; justify-content: center;" onclick="stageComment()">
|
| 534 |
+
+ Add to Review
|
| 535 |
+
</button>
|
| 536 |
+
|
| 537 |
+
<div class="comments-list" id="staged-comments">
|
| 538 |
+
<!-- Dynamic comments go here -->
|
| 539 |
+
</div>
|
| 540 |
+
|
| 541 |
+
<div class="form-group" style="margin-top: 20px; padding-top: 20px; border-top: 1px solid var(--surface-border);">
|
| 542 |
+
<label>Overall Summary (required for Hard tasks)</label>
|
| 543 |
+
<textarea id="summary" placeholder="Provide an overall assessment of the code quality..."></textarea>
|
| 544 |
+
</div>
|
| 545 |
+
|
| 546 |
+
<div class="submit-section">
|
| 547 |
+
<button class="btn-primary" onclick="submitReview()">
|
| 548 |
+
<svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M22 11.08V12a10 10 0 1 1-5.93-9.14"/><polyline points="22 4 12 14.01 9 11.01"/></svg>
|
| 549 |
+
Submit Review
|
| 550 |
+
</button>
|
| 551 |
+
</div>
|
| 552 |
+
</div>
|
| 553 |
+
|
| 554 |
+
</div>
|
| 555 |
+
</main>
|
| 556 |
+
|
| 557 |
+
<!-- Results Modal -->
|
| 558 |
+
<div id="result-modal">
|
| 559 |
+
<div class="glass-panel modal-content">
|
| 560 |
+
<h2 style="margin-top: 0;">Evaluation Complete</h2>
|
| 561 |
+
|
| 562 |
+
<div class="score-circle" id="modal-score-circle">
|
| 563 |
+
<p class="score-value" id="modal-score">0.0</p>
|
| 564 |
+
<p class="score-label">Score</p>
|
| 565 |
+
</div>
|
| 566 |
+
|
| 567 |
+
<h3 id="modal-status" style="margin-bottom: 5px;">Passed!</h3>
|
| 568 |
+
<p id="modal-desc" style="color: var(--text-secondary); margin-top: 0; font-size: 0.9rem;"></p>
|
| 569 |
+
|
| 570 |
+
<div class="reward-breakdown" id="modal-breakdown">
|
| 571 |
+
<!-- Breakdown inserted here -->
|
| 572 |
+
</div>
|
| 573 |
+
|
| 574 |
+
<button class="btn-primary" style="margin-top: 20px; width: 100%; justify-content: center;" onclick="closeModal()">
|
| 575 |
+
Continue
|
| 576 |
+
</button>
|
| 577 |
+
</div>
|
| 578 |
+
</div>
|
| 579 |
+
|
| 580 |
+
|
| 581 |
+
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
|
| 582 |
+
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>
|
| 583 |
+
|
| 584 |
+
<script>
|
| 585 |
+
// ---- Client-side state ----
// Random per-page session id sent to the backend so concurrent browser
// tabs get independent episodes.
let currentSessionId = 'session_' + Math.floor(Math.random() * 100000);
// Comments staged locally before the review is submitted as one action.
let stagedComments = [];
// NOTE(review): never updated after declaration — presumably meant to be
// populated from the task spec's max_steps; confirm against the backend.
let maxSteps = 1;

// ---- Page initialisation ----
// On load: populate the task <select> from GET /tasks, then fade the
// loader out and the main content in.
document.addEventListener('DOMContentLoaded', async () => {
    try {
        const res = await fetch('/tasks');
        const tasks = await res.json();
        const select = document.getElementById('task-select');
        // Drop the "Loading tasks..." placeholder option.
        select.innerHTML = '';

        for (const [id, spec] of Object.entries(tasks)) {
            const option = document.createElement('option');
            option.value = id;
            option.textContent = `${spec.title} (${spec.difficulty})`;
            select.appendChild(option);
        }

        // Fade the loader out, then remove it from the layout once the
        // 0.5s CSS opacity transition has finished.
        document.getElementById('loader').style.opacity = '0';
        setTimeout(() => document.getElementById('loader').style.display = 'none', 500);
        document.getElementById('main-content').style.opacity = '1';

    } catch (e) {
        alert("Failed to connect to backend api.");
    }
});
|
| 613 |
+
|
| 614 |
+
// Return `source` with 1-based line numbers prefixed to every line,
// e.g. "1 | first\n2 | second\n". Numbers are right-aligned to the width
// of the largest line number so the code columns line up.
//
// Fix: the original claimed to inject line numbering but returned the
// source unchanged apart from appending a trailing newline.
function getLineNumberedSource(source) {
    const lines = source.split('\n');
    // Width of the largest line number, for right alignment.
    const width = String(lines.length).length;
    let result = '';
    for (let i = 0; i < lines.length; i++) {
        // i is 0-based; displayed numbers are 1-based.
        result += `${String(i + 1).padStart(width)} | ${lines[i]}\n`;
    }
    return result;
}
|
| 623 |
+
|
| 624 |
+
// Start (or restart) an episode for the task selected in the header:
// clears locally staged state, POSTs /reset, and populates the UI from
// the returned observation.
async function initSession() {
    const taskId = document.getElementById('task-select').value;
    if(!taskId) return;

    // Show the full-screen loader while the reset round-trip is in flight.
    document.getElementById('loader').style.display = 'flex';
    document.getElementById('loader').style.opacity = '1';

    // Discard any comments/summary staged during the previous episode.
    stagedComments = [];
    renderComments();
    document.getElementById('c-message').value = '';
    document.getElementById('c-suggestion').value = '';
    document.getElementById('c-line').value = '';
    document.getElementById('summary').value = '';

    try {
        const res = await fetch('/reset', {
            method: 'POST',
            headers: {'Content-Type': 'application/json'},
            body: JSON.stringify({task_id: taskId, session_id: currentSessionId})
        });
        const data = await res.json();
        const obs = data.observation;

        document.getElementById('instructions').textContent = obs.instructions;
        document.getElementById('file-name').textContent = obs.snippet.file_name;

        // Difficulty badge colour is inferred from the task-id naming
        // convention (task_*_easy / _medium / _hard), not from the response.
        let diffBadge = document.getElementById('task-difficulty');
        diffBadge.className = 'badge';
        if(taskId.includes('easy')) diffBadge.classList.add('easy');
        else if(taskId.includes('medium')) diffBadge.classList.add('medium');
        else diffBadge.classList.add('hard');

        // classList[1] is the difficulty class added just above ('badge' is [0]).
        diffBadge.textContent = diffBadge.classList[1].toUpperCase();

        // textContent (not innerHTML) keeps the snippet inert; Prism then
        // highlights it in place.
        const codeBlock = document.getElementById('code-block');
        codeBlock.textContent = obs.snippet.source;
        Prism.highlightElement(codeBlock);

        document.getElementById('review-panel').style.display = 'block';

    } catch (e) {
        alert("Error starting session.");
    } finally {
        // Always fade the loader back out, whether reset succeeded or not.
        document.getElementById('loader').style.opacity = '0';
        setTimeout(() => document.getElementById('loader').style.display = 'none', 500);
    }
}
|
| 671 |
+
|
| 672 |
+
// Read the comment form, validate it, append the comment to the staged
// list, re-render the list, and reset the form fields for the next entry.
function stageComment() {
    const field = (id) => document.getElementById(id);

    const lineText = field('c-line').value;
    const body = field('c-message').value;

    // A comment without a message is meaningless — reject it up front.
    if (!body) {
        alert("Message is required.");
        return;
    }

    const staged = {
        category: field('c-category').value,
        severity: field('c-severity').value,
        message: body,
    };
    // Line and suggestion are optional; omit the keys entirely when blank
    // so the backend payload only carries what the user provided.
    if (lineText) staged.line = parseInt(lineText);
    const fix = field('c-suggestion').value;
    if (fix) staged.suggestion = fix;

    stagedComments.push(staged);
    renderComments();

    // Reset the form for the next comment.
    field('c-message').value = '';
    field('c-suggestion').value = '';
    field('c-line').value = '';
}
|
| 698 |
+
|
| 699 |
+
// Remove the staged comment at `index` (in place, so the module-level
// stagedComments binding is preserved) and re-render the visible list.
function removeComment(index) {
    stagedComments.splice(index, 1);
    renderComments();
}
|
| 703 |
+
|
| 704 |
+
// Render stagedComments into the #staged-comments container.
// User-entered text (category, severity, message, suggestion) is HTML-escaped
// before being injected via innerHTML, preventing markup/script injection.
// The HTML is assembled into one string and assigned once, instead of the
// original `innerHTML +=` per item which re-parses the container each time.
function renderComments() {
    const list = document.getElementById('staged-comments');
    // Minimal HTML escaper for text placed inside element content/attributes.
    const esc = (s) => String(s)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
    const html = stagedComments.map((c, i) => {
        const color = c.category === 'bug' ? 'var(--danger)' : c.category === 'security' ? '#ef4444' : 'var(--primary-accent)';
        return `
            <div class="comment-item" style="border-left-color: ${color}">
                <button class="remove-btn" onclick="removeComment(${i})">×</button>
                <div class="meta">
                    <span style="color: ${color}; font-weight: bold;">[${esc(c.category.toUpperCase())}] ${esc(c.severity)}</span>
                    <span>Line: ${c.line || 'global'}</span>
                </div>
                <div style="margin-bottom: ${c.suggestion ? '5px' : '0'}">${esc(c.message)}</div>
                ${c.suggestion ? `<div style="font-family: monospace; font-size: 0.8rem; background: rgba(0,0,0,0.3); padding: 5px; border-radius: 4px;">Fix: ${esc(c.suggestion)}</div>` : ''}
            </div>`;
    }).join('');
    list.innerHTML = html;
}
|
| 722 |
+
|
| 723 |
+
// Collect the staged comments (plus an optional summary) into a submit
// action, POST it to /step for grading, and show the resulting score modal.
// The loader overlay is shown for the duration and faded out afterwards.
async function submitReview() {
    const summaryText = document.getElementById('summary').value;
    const loader = document.getElementById('loader');

    const action = { comments: stagedComments, submit: true };
    if (summaryText) {
        action.summary = summaryText;
    }

    loader.style.display = 'flex';
    loader.style.opacity = '1';

    try {
        const payload = {
            session_id: currentSessionId,
            action: action
        };
        const response = await fetch('/step', {
            method: 'POST',
            headers: {'Content-Type': 'application/json'},
            body: JSON.stringify(payload)
        });
        const data = await response.json();
        showResults(data);
    } catch (err) {
        alert("Failed to submit review.");
    } finally {
        // Fade out, then remove the loader from layout once the fade ends.
        loader.style.opacity = '0';
        setTimeout(() => loader.style.display = 'none', 500);
    }
}
|
| 755 |
+
|
| 756 |
+
// Populate and open the results modal from a /step response.
// Expects data.info.grader {score, threshold} and data.reward
// {breakdown, reason}; every field is treated as optional.
function showResults(data) {
    const modal = document.getElementById('result-modal');
    const scoreVal = document.getElementById('modal-score');
    const circle = document.getElementById('modal-score-circle');
    const status = document.getElementById('modal-status');
    const desc = document.getElementById('modal-desc');
    const breakdown = document.getElementById('modal-breakdown');

    // `??` (not `||`) so a legitimate numeric 0 score/threshold is kept
    // instead of being replaced by the fallback. `data.info` is also
    // optional-chained so a missing info object cannot throw.
    const score = data.info?.grader?.score ?? 0;
    const threshold = data.info?.grader?.threshold ?? 0.5;
    const passed = score >= threshold;

    scoreVal.textContent = score.toFixed(2);

    if(passed) {
        circle.classList.remove('failed');
        status.textContent = "Great Review!";
        status.style.color = "var(--success)";
    } else {
        circle.classList.add('failed');
        status.textContent = "Needs Improvement";
        status.style.color = "var(--danger)";
    }

    desc.textContent = `Passing threshold was ${threshold.toFixed(2)}`;

    let bkHTML = `<h4 style="margin-top:0; color:var(--text-secondary);">Reward Breakdown</h4>`;
    // Guard against a missing reward object entirely.
    const reward = data.reward || {};
    if(reward.breakdown) {
        for (const [key, val] of Object.entries(reward.breakdown)) {
            const color = val >= 0 ? 'var(--success)' : 'var(--danger)';
            bkHTML += `<div style="display:flex; justify-content:space-between; margin-bottom:5px;">
                <span>${key}</span>
                <strong style="color:${color}">${val > 0 ? '+'+val.toFixed(2) : val.toFixed(2)}</strong>
            </div>`;
        }
    }
    bkHTML += `<div style="border-top:1px solid #333; margin-top:10px; padding-top:10px;">
        <strong>Reason: </strong> <span style="color:#bbb">${reward.reason ?? ''}</span>
    </div>`;

    breakdown.innerHTML = bkHTML;

    modal.style.display = 'flex';
}
|
| 800 |
+
|
| 801 |
+
// Hide the results modal (shown by showResults).
function closeModal() {
    document.getElementById('result-modal').style.display = 'none';
}
|
| 804 |
+
|
| 805 |
+
</script>
|
| 806 |
+
</body>
|
| 807 |
+
</html>
|
tests/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# tests package
|
tests/__pycache__/__init__.cpython-313.pyc
ADDED
|
Binary file (154 Bytes). View file
|
|
|
tests/__pycache__/test_env.cpython-313-pytest-9.0.3.pyc
ADDED
|
Binary file (34.4 kB). View file
|
|
|
tests/test_env.py
ADDED
|
@@ -0,0 +1,269 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Test suite for CodeReview OpenEnv.
|
| 3 |
+
Run with: pytest tests/ -v
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
from __future__ import annotations
|
| 7 |
+
|
| 8 |
+
import pytest
|
| 9 |
+
import sys
|
| 10 |
+
import os
|
| 11 |
+
|
| 12 |
+
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
| 13 |
+
|
| 14 |
+
from env.environment import CodeReviewEnv
|
| 15 |
+
from env.models import Action, ReviewCategory, ReviewComment, Severity
|
| 16 |
+
from graders.graders import Task1Grader, Task2Grader, Task3Grader
|
| 17 |
+
from corpus.snippets import CORPUS
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
# ---------------------------------------------------------------------------
|
| 21 |
+
# Fixtures
|
| 22 |
+
# ---------------------------------------------------------------------------
|
| 23 |
+
|
| 24 |
+
def perfect_action(task_id: str) -> Action:
    """Build an action containing all ground-truth comments for a task."""
    ground_truth = CORPUS[task_id]["issues"]
    # Copy the corpus list so the action never aliases shared ground truth.
    return Action(
        comments=[issue for issue in ground_truth],
        summary="Perfect review.",
        submit=True,
    )
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
def empty_action(submit: bool = False) -> Action:
    """Return an action carrying no comments; *submit* controls episode end."""
    no_comments = []
    return Action(comments=no_comments, submit=submit)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def single_bug_action() -> Action:
    """Return a submitting action with one correct division-by-zero finding."""
    bug_comment = ReviewComment(
        line=2,
        category=ReviewCategory.BUG,
        severity=Severity.HIGH,
        message=(
            "divide() has no guard against division by zero "
            "will raise ZeroDivisionError"
        ),
        suggestion="Add a check for b==0",
    )
    return Action(comments=[bug_comment], submit=True)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
# Grader unit tests
|
| 51 |
+
# ---------------------------------------------------------------------------
|
| 52 |
+
|
| 53 |
+
class TestTask1Grader:
    """Unit tests for Task1Grader against the easy task's ground truth."""

    # Shared across tests: grader instance and the corpus ground-truth issues.
    grader = Task1Grader()
    ground_truth = CORPUS["task_1_easy"]["issues"]

    def test_perfect_score_close_to_one(self):
        # Submitting every ground-truth comment should land near the top score.
        action = perfect_action("task_1_easy")
        result = self.grader.grade(action, self.ground_truth)
        assert result["score"] >= 0.80, f"Expected ≥0.80 got {result['score']}"

    def test_empty_action_scores_zero(self):
        # An empty submission should be graded close to zero.
        result = self.grader.grade(empty_action(submit=True), self.ground_truth)
        assert result["score"] < 0.15

    def test_single_correct_bug_gives_positive_score(self):
        # A single correct finding must earn some credit.
        result = self.grader.grade(single_bug_action(), self.ground_truth)
        assert result["score"] > 0.0

    def test_wrong_category_penalised(self):
        # The same finding mislabelled (SECURITY instead of BUG) must not
        # outscore the correctly categorised version.
        action = Action(
            comments=[
                ReviewComment(
                    line=2, category=ReviewCategory.SECURITY,
                    severity=Severity.HIGH,
                    message="divide has no guard against division by zero",
                )
            ],
            submit=True,
        )
        result_wrong = self.grader.grade(action, self.ground_truth)
        result_right = self.grader.grade(single_bug_action(), self.ground_truth)
        assert result_right["score"] >= result_wrong["score"]

    def test_fabricated_comment_penalised(self):
        # Ten copies of an invented issue should collapse the score: the
        # grader must punish fabrication, not reward comment volume.
        fabricated = Action(
            comments=[
                ReviewComment(
                    line=5, category=ReviewCategory.BUG,
                    severity=Severity.CRITICAL,
                    message="Imaginary crash that does not exist in the code at all",
                )
            ] * 10,
            submit=True,
        )
        result = self.grader.grade(fabricated, self.ground_truth)
        assert result["score"] <= 0.1

    def test_score_in_range(self):
        # Scores are normalised to [0, 1].
        action = perfect_action("task_1_easy")
        result = self.grader.grade(action, self.ground_truth)
        assert 0.0 <= result["score"] <= 1.0
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
class TestTask2Grader:
    """Unit tests for Task2Grader against the medium task's ground truth."""

    grader = Task2Grader()
    ground_truth = CORPUS["task_2_medium"]["issues"]

    def test_perfect_score_close_to_one(self):
        # The full ground-truth review should score near the top.
        action = perfect_action("task_2_medium")
        result = self.grader.grade(action, self.ground_truth)
        assert result["score"] >= 0.75

    def test_missing_critical_sql_injection_penalised(self):
        # Remove the SQL injection comment from perfect action
        issues = [i for i in self.ground_truth
                  if not ("SQL injection" in i.message or "injection" in i.message.lower())]
        action = Action(comments=issues, submit=True)
        full_action = perfect_action("task_2_medium")
        full_result = self.grader.grade(full_action, self.ground_truth)
        partial_result = self.grader.grade(action, self.ground_truth)
        # Dropping a critical finding must strictly lower the score.
        assert full_result["score"] > partial_result["score"]

    def test_score_in_range(self):
        # Scores are normalised to [0, 1].
        action = perfect_action("task_2_medium")
        result = self.grader.grade(action, self.ground_truth)
        assert 0.0 <= result["score"] <= 1.0
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
class TestTask3Grader:
    """Unit tests for Task3Grader, which also rewards a written summary."""

    grader = Task3Grader()
    ground_truth = CORPUS["task_3_hard"]["issues"]

    def test_perfect_with_summary_beats_without(self):
        # Identical comments with vs. without a summary: including the
        # summary must never lower the score.
        with_summary = perfect_action("task_3_hard")
        without_summary = Action(
            comments=list(self.ground_truth), summary=None, submit=True
        )
        r_with = self.grader.grade(with_summary, self.ground_truth)
        r_without = self.grader.grade(without_summary, self.ground_truth)
        assert r_with["score"] >= r_without["score"]

    def test_summary_penalty_applied_when_missing(self):
        # A submission without a summary must carry a negative
        # "summary_penalty" entry in the breakdown.
        action = Action(comments=[], summary=None, submit=True)
        result = self.grader.grade(action, self.ground_truth)
        assert result["breakdown"].get("summary_penalty", 0) < 0

    def test_score_in_range(self):
        # Scores are normalised to [0, 1].
        action = perfect_action("task_3_hard")
        result = self.grader.grade(action, self.ground_truth)
        assert 0.0 <= result["score"] <= 1.0
|
| 152 |
+
|
| 153 |
+
|
| 154 |
+
# ---------------------------------------------------------------------------
|
| 155 |
+
# Environment integration tests
|
| 156 |
+
# ---------------------------------------------------------------------------
|
| 157 |
+
|
| 158 |
+
class TestEnvironmentAPI:
    """Integration tests for the CodeReviewEnv reset/step/state lifecycle."""

    def test_reset_returns_observation(self):
        # reset() yields the initial observation: correct task, step 0,
        # and a non-empty Python snippet.
        env = CodeReviewEnv("task_1_easy")
        obs = env.reset()
        assert obs.task_id == "task_1_easy"
        assert obs.step == 0
        assert obs.snippet.language == "python"
        assert len(obs.snippet.source) > 0

    def test_step_increments_step_counter(self):
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        result = env.step(empty_action(submit=False))
        assert result.observation.step == 1

    def test_step_submit_ends_episode(self):
        # submit=True terminates the episode regardless of comment content.
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        result = env.step(empty_action(submit=True))
        assert result.done is True

    def test_step_after_done_raises(self):
        # Stepping a finished episode is a usage error, not a silent no-op.
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        env.step(empty_action(submit=True))
        with pytest.raises(RuntimeError):
            env.step(empty_action())

    def test_state_matches_step(self):
        # state() reflects the number of steps taken and the active task.
        env = CodeReviewEnv("task_2_medium")
        env.reset()
        env.step(single_bug_action())
        state = env.state()
        assert state.step == 1
        assert state.task_id == "task_2_medium"

    def test_max_steps_auto_terminates(self):
        # Even without submit=True the episode ends after spec.max_steps.
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        result = None
        for _ in range(env.spec.max_steps):
            result = env.step(empty_action(submit=False))
        assert result.done is True

    def test_reward_in_range(self):
        # Per-step reward stays within the documented [-1, 1] band.
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        result = env.step(single_bug_action())
        assert -1.0 <= result.reward.value <= 1.0

    def test_reset_clears_state(self):
        # A second reset() wipes step count, accumulated reward and comments.
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        env.step(single_bug_action())
        env.reset()
        state = env.state()
        assert state.step == 0
        assert state.total_reward == 0.0
        assert len(state.comments_so_far) == 0

    def test_deduplication_prevents_duplicate_comments(self):
        # Submitting the same comment across two steps should be stored once.
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        # First step: submit=False so episode stays open
        step1_action = Action(comments=[
            ReviewComment(
                line=2, category=ReviewCategory.BUG, severity=Severity.HIGH,
                message="divide() has no guard against division by zero will raise ZeroDivisionError",
                suggestion="Add a check for b==0",
            )
        ], submit=False)
        env.step(step1_action)
        # Second step: same comment again (should be deduped)
        step2_action = Action(comments=[
            ReviewComment(
                line=2, category=ReviewCategory.BUG, severity=Severity.HIGH,
                message="divide() has no guard against division by zero will raise ZeroDivisionError",
                suggestion="Add a check for b==0",
            )
        ], submit=True)
        env.step(step2_action)
        state = env.state()
        assert len(state.comments_so_far) == 1

    def test_all_three_tasks_init(self):
        # Every shipped task id constructs and resets cleanly.
        for tid in ["task_1_easy", "task_2_medium", "task_3_hard"]:
            env = CodeReviewEnv(tid)
            obs = env.reset()
            assert obs.task_id == tid

    def test_invalid_task_raises(self):
        # Unknown task ids are rejected at construction time.
        with pytest.raises(ValueError):
            CodeReviewEnv("task_9_impossible")

    def test_hard_task_requires_summary_field(self):
        env = CodeReviewEnv("task_3_hard")
        env.reset()
        # Submit without summary – should still work but score less
        action = Action(comments=[], summary=None, submit=True)
        result = env.step(action)
        assert result.done is True
        # Verify summary penalty is applied
        assert result.info["grader"]["breakdown"].get("summary_penalty", 0) < 0

    def test_full_episode_task1(self):
        """Full happy-path episode: submit all ground truth → should pass."""
        env = CodeReviewEnv("task_1_easy")
        env.reset()
        action = perfect_action("task_1_easy")
        result = env.step(action)
        assert result.done
        assert result.info["passed"] is True
|
uv.lock
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
validate-submission.sh
ADDED
|
@@ -0,0 +1,185 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
#
|
| 3 |
+
# validate-submission.sh — OpenEnv Submission Validator
|
| 4 |
+
#
|
| 5 |
+
# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
|
| 6 |
+
#
|
| 7 |
+
# Prerequisites:
|
| 8 |
+
# - Docker: https://docs.docker.com/get-docker/
|
| 9 |
+
# - openenv-core: pip install openenv-core
|
| 10 |
+
# - curl (usually pre-installed)
|
| 11 |
+
#
|
| 12 |
+
# Run:
|
| 13 |
+
# curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
|
| 14 |
+
#
|
| 15 |
+
# Or download and run locally:
|
| 16 |
+
# chmod +x validate-submission.sh
|
| 17 |
+
# ./validate-submission.sh <ping_url> [repo_dir]
|
| 18 |
+
#
|
| 19 |
+
# Arguments:
|
| 20 |
+
# ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
|
| 21 |
+
# repo_dir Path to your repo (default: current directory)
|
| 22 |
+
#
|
| 23 |
+
# Examples:
|
| 24 |
+
# ./validate-submission.sh https://my-team.hf.space
|
| 25 |
+
# ./validate-submission.sh https://my-team.hf.space ./my-repo
|
| 26 |
+
#
|
| 27 |
+
|
| 28 |
+
# -u: unset variables are errors; pipefail: a pipeline fails if any stage
# fails. NOTE(review): -e is omitted — presumably so each step can inspect
# exit statuses and report via fail()/stop_at() instead of aborting; confirm.
set -uo pipefail

# Maximum seconds to allow `docker build` before treating it as hung.
DOCKER_BUILD_TIMEOUT=600
# Enable ANSI colours only when stdout is a terminal; emit plain text
# otherwise (e.g. piped into a log file or CI output).
if [ -t 1 ]; then
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    BOLD='\033[1m'
    NC='\033[0m'
else
    RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi
|
| 40 |
+
|
| 41 |
+
# Run a command with a time limit.
# Usage: run_with_timeout SECS CMD [ARGS...]
# Prefers GNU `timeout` (or `gtimeout`, the coreutils name on macOS);
# otherwise falls back to a background watcher that kills the command
# after SECS. Returns the command's exit status.
run_with_timeout() {
    local secs="$1"; shift
    if command -v timeout &>/dev/null; then
        timeout "$secs" "$@"
    elif command -v gtimeout &>/dev/null; then
        gtimeout "$secs" "$@"
    else
        # Portable fallback: run the command in the background and spawn a
        # watcher subshell that kills it once the deadline passes.
        "$@" &
        local pid=$!
        ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
        local watcher=$!
        # Capture the command's exit status before touching the watcher.
        wait "$pid" 2>/dev/null
        local rc=$?
        # The command finished (or was killed); stop and reap the watcher
        # so it doesn't linger for the full SECS.
        kill "$watcher" 2>/dev/null
        wait "$watcher" 2>/dev/null
        return $rc
    fi
}
|
| 59 |
+
|
| 60 |
+
# Create a temp file named with a recognisable prefix; fall back to a plain
# mktemp call on systems where the template form is unsupported.
portable_mktemp() {
    local prefix="${1:-validate}"
    local tmpdir="${TMPDIR:-/tmp}"
    mktemp "${tmpdir}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}
|
| 64 |
+
|
| 65 |
+
# Temp files registered for deletion on exit. The ${arr[@]+...} expansion
# keeps `rm -f` safe under `set -u` when the array is still empty.
CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

# --- Argument parsing -------------------------------------------------------
PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
    printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
    printf "\n"
    printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
    printf " repo_dir Path to your repo (default: current directory)\n"
    exit 1
fi

# Canonicalise the repo path; fail fast if the directory doesn't exist.
if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
    printf "Error: directory '%s' not found\n" "${2:-.}"
    exit 1
fi
# Strip a trailing slash so URL concatenation below is predictable.
PING_URL="${PING_URL%/}"
export PING_URL
# Count of passed checks, incremented by pass().
PASS=0
|
| 87 |
+
|
| 88 |
+
# Logging helpers: timestamped output, pass/fail reporting, user hints.
log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
# Abort the run after a failed step, naming where validation stopped.
stop_at() {
    printf "\n"
    printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
    exit 1
}
|
| 97 |
+
|
| 98 |
+
# --- Banner -----------------------------------------------------------------
printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD} OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo: $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

# --- Step 1/3: the deployed Space must answer POST /reset with HTTP 200 -----
log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
# NOTE(review): stderr is redirected into the same file as the response body
# (-o and 2> share $CURL_OUTPUT), so a curl error message can clobber the
# captured body — confirm this is intended; the body is never shown anyway.
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
 -H "Content-Type: application/json" -d '{}' \
 "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
    pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
    # 000 is the sentinel from the `|| printf` fallback: curl never got a
    # status (connection refused, DNS failure, or --max-time exceeded).
    fail "HF Space not reachable (connection failed or timed out)"
    hint "Check your network connection and that the Space is running."
    hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
    stop_at "Step 1"
else
    fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
    hint "Make sure your Space is running and the URL is correct."
    hint "Try opening $PING_URL in your browser first."
    stop_at "Step 1"
fi

# --- Step 2/3: the Docker image must build locally --------------------------
log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
    fail "docker command not found"
    hint "Install Docker: https://docs.docker.com/get-docker/"
    stop_at "Step 2"
fi

# Accept a Dockerfile either at the repo root or under server/.
if [ -f "$REPO_DIR/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR/server"
else
    fail "No Dockerfile found in repo root or server/ directory"
    stop_at "Step 2"
fi

log " Found Dockerfile in $DOCKER_CONTEXT"

# Build under a timeout; BUILD_OK flips to true only on a zero exit status.
BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
    pass "Docker build succeeded"
else
    fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
    # Show only the tail of the build log — the error is almost always last.
    printf "%s\n" "$BUILD_OUTPUT" | tail -20
    stop_at "Step 2"
fi

# --- Step 3/3: the openenv CLI's own validation must pass -------------------
log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
    fail "openenv command not found"
    hint "Install it: pip install openenv-core"
    stop_at "Step 3"
fi

# Run from inside the repo (subshell, so the caller's cwd is untouched).
VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
    pass "openenv validate passed"
    [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
else
    fail "openenv validate failed"
    printf "%s\n" "$VALIDATE_OUTPUT"
    stop_at "Step 3"
fi

# --- Success banner (reached only if all three steps passed) ----------------
printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0
|