codemaverick2 committed on
Commit
e48a1e4
·
1 Parent(s): 8b75c34

Add 7-task RL env with PBRS, CAMRL curriculum, VL norm, RC-GRPO inference

Files changed (10)
  1. README.md +109 -10
  2. inference.py +293 -51
  3. models.py +20 -0
  4. openenv.yaml +50 -3
  5. server/app.py +191 -14
  6. server/environment.py +414 -66
  7. server/graders.py +446 -6
  8. tasks/data.py +523 -0
  9. tests/test_environment.py +526 -0
  10. tests/test_graders.py +403 -1
README.md CHANGED
@@ -117,6 +117,71 @@ Comprehensive review of a Django e-commerce API across two files (`views.py`, `m
117
 
118
  **Max steps:** 30
119
 
120
  ## Scoring
121
 
122
  ```
@@ -129,8 +194,36 @@ where:
129
  severity_accuracy = avg(1 − |flag_sev_rank − gt_sev_rank| × 0.34) for matched issues
130
 
131
  Matching tolerance: ±2 lines, same filename, compatible issue type
 
132
  ```
133
 
134
  ## API Endpoints
135
 
136
  | Method | Endpoint | Description |
@@ -234,14 +327,18 @@ pytest tests/ -v
234
 
235
  ## Baseline Scores
236
 
237
- | Task | Keyword heuristic | GPT-4o-mini |
238
- |------|-------------------|-------------|
239
- | bug-detection | 1.00 | ~0.52 |
240
- | security-audit | 0.75 | ~0.59 |
241
- | comprehensive-review | 0.67 | ~0.17 |
242
- | **Overall** | **0.81** | **~0.43** |
243
 
244
- Keyword heuristic runs via `inference.py` with no API key. LLM scores use `API_BASE_URL` + `HF_TOKEN`.
245
 
246
  ## Project Structure
247
 
@@ -258,11 +355,13 @@ code-review-env/
258
  ├── client.py ← HTTP client
259
  ├── models.py ← ReviewAction, ReviewObservation, ReviewState, Issue
260
  ├── tasks/
261
- │ └── data.py ← 3 task definitions + ground truth
262
  ├── server/
263
  │ ├── app.py ← FastAPI application
264
- │ ├── environment.py ← Core environment logic
265
- │ └── graders.py ← F1 grading + keyword baseline
266
  └── tests/
267
  ├── test_environment.py
268
  └── test_graders.py
 
117
 
118
  **Max steps:** 30
119
 
120
+ ### Task 4: `async-review` — Medium-Hard
121
+
122
+ Review an async Python module (`async.py`) for concurrency bugs, resource leaks, and performance issues with `asyncio` and `aiohttp`.
123
+
124
+ | Line | Issue | Severity |
125
+ |------|-------|----------|
126
+ | 5 | Shared mutable cache dict without `asyncio.Lock` — race condition | High |
127
+ | 9 | `timeout=5` wrong type for aiohttp; requires `ClientTimeout(total=5)` | Medium |
128
+ | 22 | `ClientSession` created but never closed — resource leak | High |
129
+ | 24 | Sequential `await` in loop — use `asyncio.gather()` for concurrency | High |
130
+ | 37 | Off-by-one in retry condition: `attempt == retries` never true | High |
131
+ | 48 | Tasks awaited sequentially; `self.results` accumulates across calls | Medium |
132
+
133
+ **Max steps:** 20
134
+
135
+ ### Task 5: `data-pipeline` — Hard
136
+
137
+ Security and correctness audit of a SQLite data pipeline module (`pipeline.py`).
138
+
139
+ | Line | Issue | Severity |
140
+ |------|-------|----------|
141
+ | 20 | MD5 for password hashing — cryptographically broken | High |
142
+ | 27 | SQL injection via f-string in `INSERT` query | Critical |
143
+ | 35 | SQL injection via f-string in `LIKE` query | Critical |
144
+ | 41 | One transaction per row in `bulk_load` — severe performance issue | High |
145
+ | 46 | `float()` conversion without error handling — crashes on bad input | Medium |
146
+ | 52 | `export_records` leaks `password_hash` field in JSON output | High |
147
+ | 59 | SQL injection: `limit` interpolated into `LIMIT` clause | Critical |
148
+
149
+ **Max steps:** 25
150
+
151
+ ### Task 6: `api-security` — Hard
152
+
153
+ Security audit of a FastAPI REST API (`api.py`) with authentication, authorization, and injection vulnerabilities.
154
+
155
+ | Line | Issue | Severity |
156
+ |------|-------|----------|
157
+ | 12 | Hardcoded `SECRET_KEY` in source | High |
158
+ | 13 | Hardcoded `ADMIN_TOKEN` in source | High |
159
+ | 16 | MD5 for password hashing | High |
160
+ | 27 | JWT issued without `exp` expiry claim | Medium |
161
+ | 33 | IDOR — any user can fetch any other user's data | Critical |
162
+ | 38 | SQL injection via f-string in `SELECT` query | Critical |
163
+ | 47 | Command injection via `os.system()` with env-interpolated path | Critical |
164
+ | 53 | `pickle.loads()` on untrusted user bytes — RCE | Critical |
165
+
166
+ **Max steps:** 25
167
+
168
+ ### Task 7: `js-security` — Hard
169
+
170
+ Security audit of an Express.js REST API (`server.js`) in JavaScript/Node.js.
171
+
172
+ | Line | Issue | Severity |
173
+ |------|-------|----------|
174
+ | 11 | Hardcoded `JWT_SECRET` in source | High |
175
+ | 16 | SQL injection via template literal in `prepare()` | Critical |
176
+ | 18 | JWT issued without `expiresIn` — tokens valid forever | Medium |
177
+ | 25 | IDOR + SQL injection: unauthenticated user access + unparameterized query | Critical |
178
+ | 31 | XSS: user query param reflected directly in HTML response | High |
179
+ | 36 | Command injection via `execSync()` with user-supplied filename | Critical |
180
+ | 42 | Path traversal: `path.join` with user-supplied filename | High |
181
+ | 48 | `new Function()` with user template — arbitrary code execution | Critical |
182
+
183
+ **Max steps:** 25
184
+
185
  ## Scoring
186
 
187
  ```
 
194
  severity_accuracy = avg(1 − |flag_sev_rank − gt_sev_rank| × 0.34) for matched issues
195
 
196
  Matching tolerance: ±2 lines, same filename, compatible issue type
197
+ Near-miss (±3-5 lines): graduated partial credit via exponential decay
198
  ```
199
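The terminal score combines detection F1 with severity accuracy, weighted as in `openenv.yaml` (`0.70 * F1 + 0.30 * severity_accuracy`). A minimal sketch of that arithmetic, assuming the formulas quoted above (function names are illustrative, not the graders' actual API; see `server/graders.py` for the real implementation):

```python
# Severity ranks used by the |flag_sev_rank - gt_sev_rank| * 0.34 penalty.
SEV_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def severity_accuracy(matches):
    """matches: (flagged_severity, ground_truth_severity) pairs for matched issues."""
    if not matches:
        return 0.0
    return sum(
        1 - abs(SEV_RANK[f] - SEV_RANK[g]) * 0.34 for f, g in matches
    ) / len(matches)

def terminal_score(tp, fp, fn, matches):
    # Standard F1 over flagged vs. ground-truth issues, then the 0.70/0.30 blend.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 0.70 * f1 + 0.30 * severity_accuracy(matches)
```

Note that finding all issues but mislabeling one severity by a single rank on a 4-issue task still scores about 0.97, so severity errors are cheap relative to missed issues.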
 
200
+ ## Reward Design
201
+
202
+ ### Per-step rewards
203
+
204
+ | Event | Reward |
205
+ |-------|--------|
206
+ | True positive (TP) | +0.10 base |
207
+ | TP + severity exact match | +0.02 bonus |
208
+ | TP + early (first 40% of steps) | +0.02 bonus |
209
+ | TP + high confidence (≥0.7) | +0.01 bonus |
210
+ | PBRS potential shaping (Φ(s')−Φ(s)) | +0.03–0.08 |
211
+ | Near-miss (±3-5 lines, exponential decay) | +0.020–0.055 |
212
+ | False positive | −0.05 |
213
+ | False positive flood (4th+ FP) | escalating −0.03 extra |
214
+ | High-confidence FP | −0.03 extra |
215
+ | `clear_flag` on a true positive (undoes a correct flag) | −0.03 |
216
+ | `clear_flag` on a false positive | +0.03 |
217
+ | Hint | −0.01 |
218
+ | Submit / auto-end | Final F1 score |
219
+
220
+ ### Reward shaping foundations
221
+
222
+ - **Potential-Based Reward Shaping** (Ng et al. 1999): Φ(s) = (tp/total_gt) × 0.5. Policy-invariant shaping that improves sample efficiency without changing the optimal policy.
223
+ - **Graduated near-miss** (exponential decay): reward = 0.10 × e^(−0.6 × (line_diff − 2)) for lines 3-5 off. Gives smooth gradient signal for line-number refinement.
224
+ - **Variable-Length Return Normalization** (VL Norm 2025): normalized_return = cumulative_reward / steps_used. Makes return comparable across tasks of different lengths.
225
+ - **Flood protection**: escalating FP penalty prevents reward hacking via flag-spamming.
226
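The shaping terms above are small enough to sketch directly. A hedged reconstruction from the constants in this section (function names are mine, not the environment's):

```python
import math

def pbrs_term(tp_before, tp_after, total_gt, gamma=1.0):
    # Ng et al. 1999: adding gamma*phi(s') - phi(s) to the reward leaves the
    # optimal policy unchanged; here phi(s) = (tp / total_gt) * 0.5.
    phi = lambda tp: (tp / total_gt) * 0.5 if total_gt else 0.0
    return gamma * phi(tp_after) - phi(tp_before)

def near_miss_reward(line_diff):
    # Graduated credit for flags 3-5 lines off: 0.10 * exp(-0.6 * (diff - 2)).
    if 3 <= line_diff <= 5:
        return 0.10 * math.exp(-0.6 * (line_diff - 2))
    return 0.0

def vl_normalized_return(cumulative_reward, steps_used):
    # VL Norm: per-step return, comparable across tasks of different lengths.
    return cumulative_reward / steps_used if steps_used else 0.0
```

For example, confirming one more ground-truth issue on a 7-issue task yields a PBRS bump of 0.5/7 ≈ 0.071, consistent with the +0.03–0.08 row in the per-step table.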
+
227
  ## API Endpoints
228
 
229
  | Method | Endpoint | Description |
 
327
 
328
  ## Baseline Scores
329
 
330
+ | Task | Keyword heuristic |
331
+ |------|-------------------|
332
+ | bug-detection | 1.00 |
333
+ | security-audit | 0.75 |
334
+ | async-review | 0.71 |
335
+ | comprehensive-review | 0.66 |
336
+ | api-security | 0.83 |
337
+ | js-security | 0.70 |
338
+ | data-pipeline | 0.55 |
339
+ | **Overall (7 tasks)** | **0.74** |
340
 
341
+ Keyword heuristic runs via `inference.py` with no API key (uses `/baseline` endpoint). LLM scores use `API_BASE_URL` + `HF_TOKEN`.
342
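The baseline flow can also be reproduced with the standard library alone. A sketch mirroring `run_keyword_fallback` (the URL default and response shape come from `inference.py`; treat this as illustrative, not a supported client):

```python
import json
import urllib.request

def fetch_baseline(base_url="http://localhost:7860"):
    # POST /baseline, as run_keyword_fallback does (no API key required).
    req = urllib.request.Request(f"{base_url}/baseline", method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

def extract_score(payload, task_id):
    # Response shape: {"baseline_scores": {"<task_id>": {"score": <float>}, ...}}
    return payload.get("baseline_scores", {}).get(task_id, {}).get("score", 0.0)
```

Missing tasks fall back to 0.0, matching the inference script's behavior.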
 
343
  ## Project Structure
344
 
 
355
  ├── client.py ← HTTP client
356
  ├── models.py ← ReviewAction, ReviewObservation, ReviewState, Issue
357
  ├── tasks/
358
+ │ └── data.py ← 7 task definitions + ground truth
359
+ │ (bug-detection, security-audit, comprehensive-review,
360
+ │ async-review, data-pipeline, api-security, js-security)
361
  ├── server/
362
  │ ├── app.py ← FastAPI application
363
+ │ ├── environment.py ← Core environment logic (adaptive hints, rich rewards)
364
+ │ └── graders.py ← F1 grading + detailed grading + keyword baseline
365
  └── tests/
366
  ├── test_environment.py
367
  └── test_graders.py
inference.py CHANGED
@@ -4,7 +4,7 @@ Inference script for the Code Review Environment.
4
  Environment variables:
5
  API_BASE_URL — LLM API endpoint (e.g. https://openrouter.ai/api/v1)
6
  MODEL_NAME — Model identifier (e.g. openai/gpt-4o-mini)
7
- HF_TOKEN — API key for the LLM provider
8
  ENV_URL — Environment base URL (default: localhost:7860)
9
 
10
  Usage:
@@ -19,6 +19,7 @@ import os
19
  import sys
20
  import json
21
  import time
 
22
 
23
  import httpx
24
 
@@ -27,24 +28,76 @@ MODEL_NAME: str = os.environ.get("MODEL_NAME", "gpt-4o-mini")
27
  HF_TOKEN: str = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY", "")
28
  ENV_URL: str = os.environ.get("ENV_URL", "http://localhost:7860").rstrip("/")
29
 
30
- TASK_IDS = ["bug-detection", "security-audit", "comprehensive-review"]
31
 
32
  SYSTEM_PROMPT = """\
33
- You are an expert software engineer performing a thorough code review.
34
-
35
- Your job is to identify bugs, security vulnerabilities, and performance issues in code.
36
-
37
- For each issue you find, respond with a single JSON object:
38
- {"action_type": "flag_issue", "line_number": <int>, "filename": "<file>", "issue_type": "bug|security|performance|logic", "severity": "low|medium|high|critical", "description": "<explanation>", "fix_suggestion": "<fix>"}
39
-
40
- When done, respond with:
41
- {"action_type": "submit_review"}
42
-
43
- Rules:
44
- - Respond with raw JSON only — no markdown fences, no extra text
45
  - One action per response
46
- - Be precise with line numbers (count from line 1)
47
- - Only flag real issues, not style preferences
48
  """
49
 
50
 
@@ -59,13 +112,17 @@ def chat_completion(messages: list) -> str:
59
  kwargs["base_url"] = API_BASE_URL
60
 
61
  client = OpenAI(**kwargs)
62
- response = client.chat.completions.create(
63
- model=MODEL_NAME,
64
- messages=messages,
65
- temperature=0.0,
66
- max_tokens=400,
67
- )
68
- return response.choices[0].message.content.strip()
69
 
70
 
71
  def parse_action(text: str) -> dict:
@@ -100,45 +157,217 @@ def parse_action(text: str) -> dict:
100
 
101
  def run_keyword_fallback(base_url: str, task_id: str) -> dict:
102
  """Fallback: use the built-in /baseline endpoint (no LLM needed)."""
103
- with httpx.Client(timeout=30) as client:
104
- resp = client.post(f"{base_url}/baseline")
105
- resp.raise_for_status()
106
- results = resp.json()
107
- score = results["baseline_scores"].get(task_id, {}).get("score", 0.0)
108
- return {"task_id": task_id, "score": score, "steps": 0, "method": "keyword_heuristic"}
109
 
110
 
111
  def run_task(task_id: str, http_client: httpx.Client) -> dict:
112
- resp = http_client.post(f"{ENV_URL}/reset", json={"task_id": task_id}, timeout=30)
113
- resp.raise_for_status()
114
- obs = resp.json()
115
 
116
  code_display = "\n\n".join(
117
- f"=== {fname} ===\n{code}"
118
  for fname, code in obs.get("code_files", {}).items()
119
  )
120
 
121
  messages = [
122
  {"role": "system", "content": SYSTEM_PROMPT},
123
  {
124
  "role": "user",
125
  "content": (
126
- f"Task: {obs.get('task_description', '')}\n\n"
127
- f"{code_display}\n\n"
128
- f"Review this code carefully. Flag every issue you find. "
129
- f"You have {obs.get('max_steps', 20)} steps total."
130
  ),
131
  },
132
  ]
133
 
134
  done = False
135
  step_count = 0
136
- max_steps = obs.get("max_steps", 20)
137
  final_score = 0.0
138
 
139
  while not done and step_count < max_steps:
140
- action_text = chat_completion(messages)
141
- action = parse_action(action_text)
142
 
143
  try:
144
  step_resp = http_client.post(f"{ENV_URL}/step", json=action, timeout=30)
@@ -150,20 +379,33 @@ def run_task(task_id: str, http_client: httpx.Client) -> dict:
150
 
151
  done = obs.get("done", False)
152
  step_count += 1
153
- final_score = obs.get("current_score", 0.0)
154
- reward = obs.get("reward")
155
 
156
  messages.append({"role": "assistant", "content": action_text})
157
- messages.append({
158
- "role": "user",
159
- "content": (
160
- f"Feedback: {obs.get('feedback', '')} "
161
- f"(step {step_count}/{max_steps}, score: {obs.get('current_score', 0.0):.3f})"
162
- ),
163
- })
164
 
165
  atype = action.get("action_type", "")
166
- print(f" Step {step_count:2d}: {atype:20s} | reward={str(reward):8s} | score={obs.get('current_score', 0.0):.3f}")
167
 
168
  if atype == "submit_review":
169
  final_score = obs.get("reward", obs.get("current_score", 0.0)) or 0.0
@@ -205,7 +447,7 @@ def main():
205
  print(f"Running task: {task_id}")
206
  result = run_task(task_id, client)
207
  results[task_id] = result
208
- print(f" → score: {result['score']:.4f} ({result['steps']} steps)\n")
209
  else:
210
  print("HF_TOKEN / API_BASE_URL not set — using built-in keyword heuristic baseline.\n")
211
  for task_id in TASK_IDS:
 
4
  Environment variables:
5
  API_BASE_URL — LLM API endpoint (e.g. https://openrouter.ai/api/v1)
6
  MODEL_NAME — Model identifier (e.g. openai/gpt-4o-mini)
7
+ HF_TOKEN — API key for the LLM provider (also accepts OPENAI_API_KEY)
8
  ENV_URL — Environment base URL (default: localhost:7860)
9
 
10
  Usage:
 
19
  import sys
20
  import json
21
  import time
22
+ from typing import Optional
23
 
24
  import httpx
25
 
 
28
  HF_TOKEN: str = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY", "")
29
  ENV_URL: str = os.environ.get("ENV_URL", "http://localhost:7860").rstrip("/")
30
 
31
+ # Curriculum ordering: easy → medium → medium-hard → hard
32
+ # Research (CAMRL, Curriculum RL): start with simpler tasks to build
33
+ # foundational skills, progress to harder multi-file and multi-language tasks.
34
+ TASK_IDS = [
35
+ "bug-detection", # easy: pure logic bugs, single file
36
+ "security-audit", # medium: OWASP Top-10, single file
37
+ "async-review", # medium-hard: async concurrency, subtle bugs
38
+ "data-pipeline", # hard: SQL injection + crypto + performance
39
+ "comprehensive-review", # hard: multi-file Django, mixed issue types
40
+ "api-security", # hard: FastAPI auth/authz/injection
41
+ "js-security", # hard: JavaScript (cross-language generalization)
42
+ ]
43
 
44
  SYSTEM_PROMPT = """\
45
+ You are an expert software engineer performing a thorough, methodical code review.
46
+
47
+ Your mission: identify ALL real bugs, security vulnerabilities, and performance issues.
48
+
49
+ ## REVIEW CHECKLIST: work through EVERY category for EVERY function
50
+
51
+ ### Security (check EVERY function for these)
52
+ - Hardcoded secrets / API keys / passwords / tokens
53
+ - SQL injection: f-strings/template literals/string concat in queries
54
+ - Command injection: shell=True, os.system(), execSync() with user input
55
+ - XSS: unsanitized user input in HTML templates / res.send()
56
+ - Path traversal: path.join/os.path.join with user-supplied paths
57
+ - IDOR: missing authorization — authenticated vs authorized
58
+ - Insecure deserialization: pickle.loads(), new Function(), eval() on user input
59
+ - Broken crypto: MD5/SHA1 for passwords; missing salt; weak PRNG
60
+ - JWT issues: missing expiry ('exp'), algorithm confusion, hardcoded secret
61
+ - Missing authentication on sensitive endpoints
62
+
63
+ ### Bugs & Logic Errors (check EVERY function for these)
64
+ - Off-by-one errors in ranges, slices, loop bounds, retry conditions
65
+ - Wrong initial values (counters starting at 0 instead of 1)
66
+ - Race conditions (shared mutable state without locks/atomicity)
67
+ - Missing transaction atomicity (partial writes to DB)
68
+ - Wrong type arguments (int where object required, e.g. aiohttp timeout)
69
+ - State that accumulates across calls (class fields not reset)
70
+
71
+ ### Performance (check EVERY loop and DB call)
72
+ - N+1 database queries (DB call inside a loop)
73
+ - Sequential async where gather() should be used
74
+ - One transaction per row in bulk operations
75
+ - Uncapped pagination (no max limit on per_page)
76
+
77
+ ### Resource Management
78
+ - Unclosed sessions/connections/file handles
79
+ - Missing context managers (async with, with)
80
+
81
+ ## RESPONSE FORMAT
82
+
83
+ For each issue you find, respond with ONE raw JSON object:
84
+ {"action_type": "flag_issue", "line_number": <int>, "filename": "<file>",
85
+ "issue_type": "bug|security|performance|logic",
86
+ "severity": "low|medium|high|critical",
87
+ "description": "<specific explanation>",
88
+ "fix_suggestion": "<concrete fix>",
89
+ "confidence": <0.0-1.0>}
90
+
91
+ When finished, respond with:
92
+ {"action_type": "submit_review"}
93
+
94
+ ## RULES
95
+ - Raw JSON only — no markdown fences, no extra text
96
  - One action per response
97
+ - Count lines carefully from line 1 (including blank lines and comments)
98
+ - Only flag REAL issues: no style preferences, no hypothetical issues
99
+ - Be precise: "SQL injection at line 19 via f-string in SELECT query" not just "SQL injection"
100
+ - Flag the EXACT line where the problem code is (the f-string line, not the function def)
101
  """
102
 
103
 
 
112
  kwargs["base_url"] = API_BASE_URL
113
 
114
  client = OpenAI(**kwargs)
115
+ try:
116
+ response = client.chat.completions.create(
117
+ model=MODEL_NAME,
118
+ messages=messages,
119
+ temperature=0.0,
120
+ max_tokens=500,
121
+ )
122
+ return response.choices[0].message.content.strip()
123
+ except Exception as e:
124
+ print(f" LLM call failed: {e}")
125
+ raise
126
 
127
 
128
  def parse_action(text: str) -> dict:
 
157
 
158
  def run_keyword_fallback(base_url: str, task_id: str) -> dict:
159
  """Fallback: use the built-in /baseline endpoint (no LLM needed)."""
160
+ try:
161
+ with httpx.Client(timeout=30) as client:
162
+ resp = client.post(f"{base_url}/baseline")
163
+ resp.raise_for_status()
164
+ results = resp.json()
165
+ score = results["baseline_scores"].get(task_id, {}).get("score", 0.0)
166
+ return {"task_id": task_id, "score": score, "steps": 0, "method": "keyword_heuristic"}
167
+ except Exception as e:
168
+ print(f" Keyword fallback failed: {e}")
169
+ return {"task_id": task_id, "score": 0.0, "steps": 0, "method": "error"}
170
+
171
+
172
+ def _build_progress_feedback(obs: dict) -> str:
173
+ """Build a rich feedback string from observation progress data."""
174
+ progress = obs.get("progress") or {}
175
+ flagged_summary = obs.get("flagged_summary") or {}
176
+
177
+ parts = []
178
+ if progress:
179
+ f1 = progress.get("f1", 0)
180
+ precision = progress.get("precision", 0)
181
+ recall = progress.get("recall", 0)
182
+ tp = int(progress.get("true_positives", 0))
183
+ total_gt = int(progress.get("total_ground_truth", 0))
184
+ steps_rem = int(progress.get("steps_remaining", 0))
185
+ unfound = progress.get("unfound_issue_types", [])
186
+
187
+ parts.append(
188
+ f"Score progress: {tp}/{total_gt} issues confirmed | "
189
+ f"F1={f1:.2f} Precision={precision:.2f} Recall={recall:.2f} | "
190
+ f"{steps_rem} steps remaining"
191
+ )
192
+ if unfound:
193
+ parts.append(
194
+ f"IMPORTANT — still need to find: {unfound}. "
195
+ f"Search specifically for those issue types."
196
+ )
197
+
198
+ if flagged_summary:
199
+ incorrect = flagged_summary.get("incorrect", 0)
200
+ near = flagged_summary.get("near_misses", 0)
201
+ if incorrect > 0:
202
+ parts.append(
203
+ f"WARNING: {incorrect} false positive(s) hurting precision. "
204
+ f"Consider using clear_flag to remove uncertain flags."
205
+ )
206
+ if near > 0:
207
+ parts.append(
208
+ f"NOTE: {near} near-miss(es) — you're close on line numbers, "
209
+ f"but slightly off. Re-check exact line and try reflagging."
210
+ )
211
+
212
+ return "\n".join(parts) if parts else ""
213
+
214
+
215
+ def _should_submit(obs: dict, step_count: int, max_steps: int) -> bool:
216
+ """
217
+ Smart submission: submit when recall is high or steps are nearly exhausted.
218
+ Avoids wasting steps after all real issues are found.
219
+ """
220
+ progress = obs.get("progress", {})
221
+ recall = progress.get("recall", 0.0)
222
+ tp = int(progress.get("true_positives", 0))
223
+ total_gt = int(progress.get("total_ground_truth", 0))
224
+ steps_rem = int(progress.get("steps_remaining", 0))
225
+ unfound = progress.get("unfound_issue_types", [])
226
+ fp = int(progress.get("false_positives", 0))
227
+
228
+ # All issues found
229
+ if total_gt > 0 and tp >= total_gt:
230
+ return True
231
+
232
+ # No unfound categories and high recall
233
+ if not unfound and recall >= 0.85:
234
+ return True
235
+
236
+ # High recall overall (≥80%) and precision is decent (not too many FPs)
237
+ if recall >= 0.80 and (fp <= 2 or tp / max(tp + fp, 1) >= 0.6):
238
+ return True
239
+
240
+ # Very few steps left and we've done a reasonable scan
241
+ if steps_rem <= 2 and step_count >= 5:
242
+ return True
243
+
244
+ return False
245
+
246
+
247
+ def _should_clear_flag(obs: dict, last_reward: float, last_action: dict) -> Optional[dict]:
248
+ """
249
+ Recovery strategy: if the last flag was a false positive with high penalty,
250
+ suggest clearing it to recover partial reward and improve precision.
251
+
252
+ Returns a clear_flag action dict if we should recover, else None.
253
+ """
254
+ if last_reward is None or last_reward >= 0:
255
+ return None
256
+ if last_action.get("action_type") != "flag_issue":
257
+ return None
258
+
259
+ # Only clear if it was a clear FP (no near-miss indicator in feedback)
260
+ # and we've got too many false positives
261
+ progress = obs.get("progress", {})
262
+ fp = int(progress.get("false_positives", 0))
263
+ tp = int(progress.get("true_positives", 0))
264
+
265
+ # If FP > TP and last reward was notably negative, clear the bad flag
266
+ if fp > tp and last_reward <= -0.05:
267
+ return {
268
+ "action_type": "clear_flag",
269
+ "line_number": last_action.get("line_number"),
270
+ "filename": last_action.get("filename"),
271
+ }
272
+
273
+ return None
274
 
275
 
276
  def run_task(task_id: str, http_client: httpx.Client) -> dict:
277
+ try:
278
+ resp = http_client.post(f"{ENV_URL}/reset", json={"task_id": task_id}, timeout=30)
279
+ resp.raise_for_status()
280
+ obs = resp.json()
281
+ except Exception as e:
282
+ print(f" Reset failed: {e} — falling back to keyword heuristic")
283
+ return run_keyword_fallback(ENV_URL, task_id)
284
 
285
  code_display = "\n\n".join(
286
+ f"=== {fname} (starting at line 1) ===\n{code}"
287
  for fname, code in obs.get("code_files", {}).items()
288
  )
289
 
290
+ # Include function map hint if available
291
+ code_metadata = obs.get("code_metadata") or {}
292
+ function_ranges = code_metadata.get("function_ranges", [])
293
+ fn_map_hint = ""
294
+ if function_ranges:
295
+ fn_lines = [f" {fr['name']}() in {fr['file']} (lines {fr['start']}-{fr['end']})"
296
+ for fr in function_ranges]
297
+ fn_map_hint = "\n\nFunction map:\n" + "\n".join(fn_lines)
298
+
299
+ task_desc = obs.get("task_description", "")
300
+ max_steps = obs.get("max_steps", 20)
301
+ issue_categories = code_metadata.get("issue_categories", [])
302
+ n_gt = len(obs.get("code_files", {})) # file count, used only as a rough complexity hint
303
+ category_hint = ""
304
+ if issue_categories:
305
+ category_hint = f"\nIssue categories to look for: {sorted(set(issue_categories))}"
306
+
307
+ # RC-GRPO style reward conditioning (2025): tell the agent what quality level
308
+ # it should aim for, so it calibrates confidence appropriately.
309
+ state_features = code_metadata.get("state_features", [])
310
+ complexity_label = "medium"
311
+ if state_features and len(state_features) >= 4:
312
+ complexity_score = state_features[3]
313
+ complexity_label = "high" if complexity_score >= 1.0 else "medium" if complexity_score >= 0.5 else "low"
314
+
315
+ reward_conditioning = (
316
+ f"[TARGET: high-quality review, score ≥ 0.85. "
317
+ f"Code complexity: {complexity_label}. "
318
+ f"Be thorough — missing issues costs more than a single FP.]"
319
+ )
320
+
321
  messages = [
322
  {"role": "system", "content": SYSTEM_PROMPT},
323
  {
324
  "role": "user",
325
  "content": (
326
+ f"{reward_conditioning}\n\n"
327
+ f"Task: {task_desc}\n\n"
328
+ f"{code_display}"
329
+ f"{fn_map_hint}"
330
+ f"{category_hint}\n\n"
331
+ f"You have {max_steps} steps total. "
332
+ f"Work through the checklist systematically, function by function. "
333
+ f"Flag each issue one at a time as a raw JSON object."
334
  ),
335
  },
336
  ]
337
 
338
  done = False
339
  step_count = 0
 
340
  final_score = 0.0
341
+ last_action: dict = {}
342
+ last_reward: Optional[float] = None
343
+ consecutive_fp = 0
344
 
345
  while not done and step_count < max_steps:
346
+ # --- Auto clear_flag recovery: undo recent FP if hurting precision ---
347
+ recovery_action = _should_clear_flag(obs, last_reward, last_action)
348
+ if recovery_action and step_count < max_steps - 1:
349
+ action = recovery_action
350
+ action_text = json.dumps(action)
351
+ print(f" Auto-recovery: clearing FP at {action.get('filename')}:{action.get('line_number')}")
352
+ else:
353
+ # --- Normal LLM action ---
354
+ try:
355
+ action_text = chat_completion(messages)
356
+ except Exception as e:
357
+ print(f" LLM unavailable ({e}) — submitting and falling back to keyword heuristic")
358
+ try:
359
+ http_client.post(f"{ENV_URL}/step", json={"action_type": "submit_review"}, timeout=30)
360
+ except Exception:
361
+ pass
362
+ return run_keyword_fallback(ENV_URL, task_id)
363
+
364
+ action = parse_action(action_text)
365
+
366
+ # Smart submission: inject submit if progress shows we're done
367
+ if action.get("action_type") != "submit_review" and _should_submit(obs, step_count, max_steps):
368
+ print(f" Smart submit at step {step_count + 1} (recall target met)")
369
+ action = {"action_type": "submit_review"}
370
+ action_text = json.dumps(action)
371
 
372
  try:
373
  step_resp = http_client.post(f"{ENV_URL}/step", json=action, timeout=30)
 
379
 
380
  done = obs.get("done", False)
381
  step_count += 1
382
+ last_reward = obs.get("reward")
383
+ # Use terminal reward (final grade) when done, else intermediate score
384
+ if done:
385
+ final_score = last_reward or obs.get("current_score", 0.0)
386
+ else:
387
+ final_score = obs.get("current_score", 0.0)
388
+ last_action = action
389
+
390
+ # Track consecutive FPs for logging
391
+ if last_reward is not None and last_reward < 0 and action.get("action_type") == "flag_issue":
392
+ consecutive_fp += 1
393
+ else:
394
+ consecutive_fp = 0
395
+
396
+ # Build rich feedback for next LLM turn
397
+ progress_feedback = _build_progress_feedback(obs)
398
+ env_feedback = obs.get("feedback", "")
399
+ combined_feedback = env_feedback
400
+ if progress_feedback:
401
+ combined_feedback += f"\n{progress_feedback}"
402
 
403
  messages.append({"role": "assistant", "content": action_text})
404
+ if combined_feedback:
405
+ messages.append({"role": "user", "content": combined_feedback})
406
 
407
  atype = action.get("action_type", "")
408
+ print(f" Step {step_count:2d}: {atype:20s} | reward={str(last_reward):8s} | score={obs.get('current_score', 0.0):.3f}")
409
 
410
  if atype == "submit_review":
411
  final_score = obs.get("reward", obs.get("current_score", 0.0)) or 0.0
 
447
  print(f"Running task: {task_id}")
448
  result = run_task(task_id, client)
449
  results[task_id] = result
450
+ print(f" → score: {result['score']:.4f} ({result['steps']} steps, method={result['method']})\n")
451
  else:
452
  print("HF_TOKEN / API_BASE_URL not set — using built-in keyword heuristic baseline.\n")
453
  for task_id in TASK_IDS:
models.py CHANGED
@@ -80,6 +80,8 @@ class ReviewAction(_BaseAction):
80
  severity: Optional[str] = None
81
  description: str = ""
82
  fix_suggestion: Optional[str] = None
83
  metadata: Dict[str, Any] = field(default_factory=dict)
84
 
85
  def to_dict(self) -> dict:
@@ -91,6 +93,8 @@ class ReviewAction(_BaseAction):
91
  "severity": self.severity,
92
  "description": self.description,
93
  "fix_suggestion": self.fix_suggestion,
94
  }
95
 
96
  @classmethod
@@ -103,6 +107,8 @@ class ReviewAction(_BaseAction):
103
  severity=d.get("severity"),
104
  description=str(d.get("description", "")),
105
  fix_suggestion=d.get("fix_suggestion"),
106
  )
107
 
108
 
@@ -125,6 +131,11 @@ class ReviewObservation(_BaseObservation):
125
  done: bool = False
126
  reward: Optional[float] = None
127
  metadata: Dict[str, Any] = field(default_factory=dict)
128
 
129
  def to_dict(self) -> dict:
130
  return {
@@ -141,6 +152,10 @@ class ReviewObservation(_BaseObservation):
141
  "done": self.done,
142
  "reward": self.reward,
143
  "metadata": self.metadata,
144
  }
145
 
146
  @classmethod
@@ -158,6 +173,11 @@ class ReviewObservation(_BaseObservation):
158
  current_score=d.get("current_score", 0.0),
159
  done=d.get("done", False),
160
  reward=d.get("reward"),
161
  )
162
 
163
 
 
80
  severity: Optional[str] = None
81
  description: str = ""
82
  fix_suggestion: Optional[str] = None
83
+ confidence: Optional[float] = None # agent's confidence 0.0–1.0
84
+ related_lines: Optional[List[int]] = None # multi-line issues
85
  metadata: Dict[str, Any] = field(default_factory=dict)
86
 
87
  def to_dict(self) -> dict:
 
93
  "severity": self.severity,
94
  "description": self.description,
95
  "fix_suggestion": self.fix_suggestion,
96
+ "confidence": self.confidence,
97
+ "related_lines": self.related_lines,
98
  }
99
 
100
  @classmethod
 
107
  severity=d.get("severity"),
108
  description=str(d.get("description", "")),
109
  fix_suggestion=d.get("fix_suggestion"),
110
+ confidence=d.get("confidence"),
111
+ related_lines=d.get("related_lines"),
112
  )
113
 
114
 
 
131
  done: bool = False
132
  reward: Optional[float] = None
133
  metadata: Dict[str, Any] = field(default_factory=dict)
134
+ # New fields
135
+ reward_breakdown: Dict[str, float] = field(default_factory=dict)
136
+ progress: Dict[str, float] = field(default_factory=dict)
137
+ flagged_summary: Dict[str, Any] = field(default_factory=dict)
138
+ code_metadata: Dict[str, Any] = field(default_factory=dict)
139
 
140
  def to_dict(self) -> dict:
141
  return {
 
152
  "done": self.done,
153
  "reward": self.reward,
154
  "metadata": self.metadata,
155
+ "reward_breakdown": self.reward_breakdown,
156
+ "progress": self.progress,
157
+ "flagged_summary": self.flagged_summary,
158
+ "code_metadata": self.code_metadata,
159
  }
160
 
161
  @classmethod
 
173
  current_score=d.get("current_score", 0.0),
174
  done=d.get("done", False),
175
  reward=d.get("reward"),
176
+ metadata=d.get("metadata", {}),
177
+ reward_breakdown=d.get("reward_breakdown", {}),
178
+ progress=d.get("progress", {}),
179
+ flagged_summary=d.get("flagged_summary", {}),
180
+ code_metadata=d.get("code_metadata", {}),
181
  )
182
 
183
 
openenv.yaml CHANGED
@@ -1,11 +1,58 @@
1
  spec_version: 1
2
  name: code_review_env
3
- version: "1.0.0"
4
  description: >
5
- A code review and security audit environment for training AI agents.
6
  The agent identifies bugs, security vulnerabilities, and performance issues
7
- across three tasks of increasing difficulty (easy → medium → hard).
 
 
8
  type: space
9
  runtime: fastapi
10
  app: server.app:app
 
11
  port: 7860
1
  spec_version: 1
2
  name: code_review_env
3
+ version: "2.0.0"
4
  description: >
5
+ A code review and security audit RL environment for training AI agents.
6
  The agent identifies bugs, security vulnerabilities, and performance issues
7
+ across 7 tasks of increasing difficulty (easy → medium → medium-hard → hard).
8
+ Features: PBRS reward shaping, graduated near-miss rewards, flood protection,
9
+ CAMRL curriculum selector, VL return normalization, and cross-language tasks.
10
  type: space
11
  runtime: fastapi
12
  app: server.app:app
13
+ entry_point: server
14
  port: 7860
15
+
16
+ tasks:
17
+ - id: bug-detection
18
+ difficulty: easy
19
+ language: python
20
+ num_issues: 3
21
+ max_steps: 15
22
+ - id: security-audit
23
+ difficulty: medium
24
+ language: python
25
+ num_issues: 7
26
+ max_steps: 20
27
+ - id: async-review
28
+ difficulty: medium-hard
29
+ language: python
30
+ num_issues: 6
31
+ max_steps: 20
32
+ - id: data-pipeline
33
+ difficulty: hard
34
+ language: python
35
+ num_issues: 7
36
+ max_steps: 25
37
+ - id: comprehensive-review
38
+ difficulty: hard
39
+ language: python
40
+ num_issues: 9
41
+ max_steps: 30
42
+ - id: api-security
43
+ difficulty: hard
44
+ language: python
45
+ num_issues: 8
46
+ max_steps: 25
47
+ - id: js-security
48
+ difficulty: hard
49
+ language: javascript
50
+ num_issues: 8
51
+ max_steps: 25
52
+
53
+ reward_design:
54
+ terminal: "0.70 * F1 + 0.30 * severity_accuracy"
55
+ shaping: "PBRS (Ng et al. 1999): phi(s) = (tp/total_gt) * 0.5"
56
+ near_miss: "exponential decay: 0.10 * exp(-0.6 * (line_diff - 2))"
57
+ flood_protection: "escalating FP penalty after 3rd false positive"
58
+ normalization: "VL Norm (2025): normalized_return = cumulative / steps_used"
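The `reward_design` formulas above can be sketched in plain Python (a minimal illustration of the stated math; the function names here are assumptions, not identifiers from `server/graders.py`):

```python
import math

def potential(tp: int, total_gt: int) -> float:
    # phi(s) = (tp / total_gt) * 0.5, as stated under reward_design
    return 0.5 * tp / total_gt if total_gt else 0.0

def pbrs_shaping(tp_before: int, tp_after: int, total_gt: int,
                 gamma: float = 1.0) -> float:
    # Potential-based shaping term gamma * phi(s') - phi(s) (Ng et al. 1999),
    # which leaves the optimal policy unchanged
    return gamma * potential(tp_after, total_gt) - potential(tp_before, total_gt)

def near_miss_reward(line_diff: int) -> float:
    # Exponential decay outside the +/-2-line exact-match window
    return 0.10 * math.exp(-0.6 * (line_diff - 2))

def vl_normalized_return(step_rewards: list, steps_used: int) -> float:
    # VL Norm: cumulative return divided by episode length, making returns
    # comparable across tasks with different max_steps
    return sum(step_rewards) / max(steps_used, 1)
```

With the stated constants, `near_miss_reward(2)` returns the full 0.10 at the edge of the exact-match window and decays smoothly beyond it.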
server/app.py CHANGED
@@ -21,6 +21,7 @@ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
21
  import json
22
  import asyncio
23
  import dataclasses
 
24
  from typing import Optional, List, Dict, Any
25
 
26
  from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException
@@ -29,7 +30,10 @@ from pydantic import BaseModel
29
 
30
  from models import ReviewAction, Issue
31
  from server.environment import CodeReviewEnvironment
32
- from server.graders import grade_episode, run_keyword_baseline
 
 
 
33
  from tasks.data import ALL_TASKS, TASK_IDS
34
 
35
 
@@ -45,6 +49,7 @@ def _serialize(obj) -> dict:
45
 
46
 
47
  _env_instance = CodeReviewEnvironment()
 
48
 
49
 
50
  def _make_app() -> FastAPI:
@@ -245,27 +250,25 @@ async def run_grader(request: GraderRequest):
245
 
246
  flagged = [Issue.from_dict(i) for i in request.flagged_issues]
247
  ground_truth = [Issue.from_dict(gt) for gt in task["ground_truth_issues"]]
248
- score = grade_episode(flagged, ground_truth)
249
-
250
- tp = sum(
251
- 1 for f in flagged
252
- if any(
253
- True for gt in ground_truth
254
- if abs(f.line_number - gt.line_number) <= 2
255
- and f.filename == gt.filename
256
- )
257
- )
258
 
259
  return {
260
  "task_id": request.task_id,
261
  "difficulty": task["difficulty"],
262
- "score": score,
263
  "max_score": 1.0,
 
 
 
 
264
  "details": {
265
  "total_flagged": len(flagged),
266
- "true_positives": tp,
267
- "false_positives": len(flagged) - tp,
 
 
268
  "total_ground_truth": len(ground_truth),
 
269
  },
270
  }
271
 
@@ -296,6 +299,180 @@ async def run_baseline():
296
  }
297
 
298
 
299
  def main():
300
  import uvicorn
301
  port = int(os.environ.get("PORT", 7860))
 
21
  import json
22
  import asyncio
23
  import dataclasses
24
+ import random
25
  from typing import Optional, List, Dict, Any
26
 
27
  from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException
 
30
 
31
  from models import ReviewAction, Issue
32
  from server.environment import CodeReviewEnvironment
33
+ from server.graders import (
34
+ grade_episode, grade_episode_detailed, run_keyword_baseline,
35
+ compute_code_state_features, RewardNormalizer,
36
+ )
37
  from tasks.data import ALL_TASKS, TASK_IDS
38
 
39
 
 
49
 
50
 
51
  _env_instance = CodeReviewEnvironment()
52
+ _reward_normalizer = RewardNormalizer(window_size=100)
53
 
54
 
55
  def _make_app() -> FastAPI:
 
250
 
251
  flagged = [Issue.from_dict(i) for i in request.flagged_issues]
252
  ground_truth = [Issue.from_dict(gt) for gt in task["ground_truth_issues"]]
253
+ detailed = grade_episode_detailed(flagged, ground_truth)
 
254
 
255
  return {
256
  "task_id": request.task_id,
257
  "difficulty": task["difficulty"],
258
+ "score": detailed["score"],
259
  "max_score": 1.0,
260
+ "f1": detailed["f1"],
261
+ "precision": detailed["precision"],
262
+ "recall": detailed["recall"],
263
+ "severity_accuracy": detailed["severity_accuracy"],
264
  "details": {
265
  "total_flagged": len(flagged),
266
+ "true_positives": detailed["true_positives"],
267
+ "false_positives": detailed["false_positives"],
268
+ "false_negatives": detailed["false_negatives"],
269
+ "near_misses": detailed["near_misses"],
270
  "total_ground_truth": len(ground_truth),
271
+ "per_file": detailed["per_file"],
272
  },
273
  }
274
 
 
299
  }
300
 
301
 
302
+ class CurriculumRequest(BaseModel):
303
+ agent_performance: Optional[Dict[str, Any]] = None
304
+ easy_threshold: float = 0.30
305
+ hard_threshold: float = 0.70
306
+
307
+
308
+ @app.post("/curriculum")
309
+ async def curriculum_task_selector(request: CurriculumRequest):
310
+ """
311
+ CAMRL-style curriculum task selector (Curriculum-based Asymmetric Multi-Task RL, TPAMI 2023).
312
+
313
+ Given agent performance metrics per task, returns the recommended next task_id
314
+ based on curriculum phase:
315
+ - easy phase (avg_success < 0.30): focus on task with fewest issues
316
+ - medium phase (0.30-0.70): mix easy/hard (70% easy, 30% hard)
317
+ - hard phase (avg_success > 0.70): focus on least-solved hard tasks
318
+
319
+ Body:
320
+ agent_performance: {task_id: {success_rate: 0.5, episodes: 10, avg_score: 0.4}}
321
+ easy_threshold: float (default 0.3)
322
+ hard_threshold: float (default 0.7)
323
+ """
324
+ perf = request.agent_performance or {}
325
+ easy_thresh = request.easy_threshold
326
+ hard_thresh = request.hard_threshold
327
+
328
+ # Build difficulty estimate per task: (1 - success_rate) × complexity
329
+ task_difficulty: Dict[str, float] = {}
330
+ for task_id, task in ALL_TASKS.items():
331
+ n_issues = len(task["ground_truth_issues"])
332
+ complexity = min(1.0, n_issues / 10.0)
333
+ task_perf = perf.get(task_id, {})
334
+ success_rate = float(task_perf.get("success_rate", task_perf.get("avg_score", 0.0)))
335
+ task_difficulty[task_id] = round((1.0 - success_rate) * complexity, 4)
336
+
337
+ # Determine curriculum phase
338
+ if perf:
339
+ all_success = [float(p.get("success_rate", p.get("avg_score", 0.0))) for p in perf.values()]
340
+ avg_success = sum(all_success) / len(all_success)
341
+ else:
342
+ avg_success = 0.0
343
+
344
+ if avg_success < easy_thresh:
345
+ phase = "easy"
346
+ # Focus on task with lowest ground truth issue count (most approachable)
347
+ recommended = min(ALL_TASKS.keys(), key=lambda t: len(ALL_TASKS[t]["ground_truth_issues"]))
348
+ elif avg_success > hard_thresh:
349
+ phase = "hard"
350
+ # Focus on hardest unsolved task (highest difficulty score)
351
+ recommended = max(task_difficulty, key=task_difficulty.get)
352
+ else:
353
+ phase = "medium"
354
+ # Mix: pick a task proportional to difficulty (harder = more likely)
355
356
+ weights = list(task_difficulty.values())
357
+ total_w = sum(weights) or 1.0
358
+ probs = [w / total_w for w in weights]
359
+ recommended = random.choices(list(task_difficulty.keys()), weights=probs, k=1)[0]
360
+
361
+ return {
362
+ "recommended_task_id": recommended,
363
+ "curriculum_phase": phase,
364
+ "avg_success_rate": round(avg_success, 4),
365
+ "task_difficulty_scores": task_difficulty,
366
+ "thresholds": {"easy": easy_thresh, "hard": hard_thresh},
367
+ "method": "CAMRL",
368
+ }
369
+
370
+
371
+ @app.get("/reward_normalizer")
372
+ async def get_reward_normalizer_stats():
373
+ """
374
+ Return current RewardNormalizer statistics for the running environment.
375
+ Useful for monitoring VL Norm across training runs.
376
+ """
377
+ return _reward_normalizer.to_dict()
378
+
379
+
380
+ @app.post("/record_episode")
381
+ async def record_episode(body: Dict[str, Any]):
382
+ """
383
+ Record a completed episode's return and length for VL Norm statistics.
384
+ Body: {"episode_return": 0.72, "episode_length": 12}
385
+ """
386
+ episode_return = float(body.get("episode_return", 0.0))
387
+ episode_length = int(body.get("episode_length", 1))
388
+ _reward_normalizer.update(episode_return, episode_length)
389
+ normalized = _reward_normalizer.normalize(episode_return, episode_length)
390
+ return {
391
+ "normalized_return": normalized,
392
+ "stats": _reward_normalizer.to_dict(),
393
+ }
394
+
395
+
396
+ class TRLRolloutRequest(BaseModel):
397
+ task_id: Optional[str] = None
398
+ seed: Optional[int] = None
399
+ actions: List[Dict[str, Any]] # Pre-generated action sequence from LLM
400
+
401
+
402
+ @app.post("/trl_rollout")
403
+ async def trl_rollout(request: TRLRolloutRequest):
404
+ """
405
+ Run a full episode from a pre-generated action sequence.
406
+
407
+ Designed for TRL GRPOTrainer custom rollout_fn integration:
408
+ - Takes a sequence of LLM-generated actions
409
+ - Runs them through the environment
410
+ - Returns trajectory dict with per-step rewards and final score
411
+
412
+ This enables offline rollout: LLM generates all actions first,
413
+ then this endpoint evaluates them, matching TRL's batch-rollout pattern.
414
+
415
+ Body:
416
+ task_id: str (optional, random if not set)
417
+ seed: int (optional)
418
+ actions: [{action_type, line_number, filename, ...}, ...]
419
+
420
+ Returns:
421
+ trajectory: [{step, action, reward, feedback, done}]
422
+ episode_return: float (sum of step rewards)
423
+ final_score: float (terminal grade)
424
+ normalized_return: float (episode_return / num_steps)
425
+ state_features: [float] (12-dim feature vector at end of episode)
426
+ """
427
+ rollout_env = CodeReviewEnvironment()
428
+ obs = rollout_env.reset(task_id=request.task_id, seed=request.seed)
429
+
430
+ trajectory = []
431
+ episode_return = 0.0
432
+ final_score = 0.0
433
+
434
+ for step_idx, action_dict in enumerate(request.actions):
435
+ action = ReviewAction.from_dict(action_dict)
436
+ obs_step = rollout_env.step(action)
437
+ step_data = _serialize(obs_step)
438
+
439
+ reward = step_data.get("reward") or 0.0
440
+ episode_return += reward
441
+
442
+ trajectory.append({
443
+ "step": step_idx + 1,
444
+ "action": action_dict,
445
+ "reward": reward,
446
+ "reward_breakdown": step_data.get("reward_breakdown", {}),
447
+ "feedback": step_data.get("feedback", ""),
448
+ "current_score": step_data.get("current_score", 0.0),
449
+ "done": step_data.get("done", False),
450
+ })
451
+
452
+ if step_data.get("done", False):
453
+ final_score = step_data.get("reward", step_data.get("current_score", 0.0)) or 0.0
454
+ break
455
+
456
+ n_steps = max(len(trajectory), 1)
457
+ # Record in global normalizer for VL Norm statistics
458
+ _reward_normalizer.update(episode_return, n_steps)
459
+ normalized = _reward_normalizer.normalize(episode_return, n_steps)
460
+
461
+ # Get final state features
462
+ final_progress = rollout_env._compute_progress(rollout_env._task["max_steps"] if rollout_env._task else 20)
463
+
464
+ return {
465
+ "task_id": request.task_id,
466
+ "trajectory": trajectory,
467
+ "episode_return": round(episode_return, 4),
468
+ "final_score": round(final_score, 4),
469
+ "normalized_return": normalized,
470
+ "num_steps": n_steps,
471
+ "state_features": final_progress.get("state_features", []),
472
+ "final_progress": {k: v for k, v in final_progress.items() if k != "state_features"},
473
+ }
474
+
475
+
476
  def main():
477
  import uvicorn
478
  port = int(os.environ.get("PORT", 7860))
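The phase thresholds used by the `/curriculum` endpoint reduce to a small pure function (a sketch of the handler's branching with its default 0.30/0.70 thresholds; not the server's actual code):

```python
def curriculum_phase(avg_success: float,
                     easy_threshold: float = 0.30,
                     hard_threshold: float = 0.70) -> str:
    # Below easy_threshold: drill the most approachable task ("easy" phase).
    # Above hard_threshold: attack the least-solved hard task ("hard" phase).
    # In between: sample tasks weighted by difficulty ("medium" phase).
    if avg_success < easy_threshold:
        return "easy"
    if avg_success > hard_threshold:
        return "hard"
    return "medium"
```

Note that the comparisons are strict, so an agent sitting exactly at a threshold stays in the "medium" phase.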
server/environment.py CHANGED
@@ -9,11 +9,15 @@ import sys
9
  import os
10
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
11
 
12
- from typing import Optional, List
13
 
14
  from models import Issue, ReviewAction, ReviewObservation, ReviewState
15
  from tasks.data import ALL_TASKS, TASK_IDS
16
- from server.graders import grade_episode, compute_live_score, match_issue
 
 
 
 
17
 
18
  try:
19
  from openenv.core.env_server import Environment as _BaseEnv
@@ -25,21 +29,44 @@ except ImportError:
25
  pass
26
 
27
 
28
  class CodeReviewEnvironment(_BaseEnv):
29
  """
30
- A code review and security audit environment.
31
 
32
  The agent receives code files and must identify bugs, security
33
  vulnerabilities, and performance issues by flagging them with
34
  exact line numbers, types, and severity ratings.
35
 
36
- Episode flow:
37
- 1. reset(task_id) agent sees code files and task description
38
- 2. step(flag_issue) flag a problem; get per-step reward
39
- 3. step(clear_flag) remove an incorrectly flagged issue
40
- 4. step(request_hint) get a hint (costs -0.01 reward)
41
- 5. step(submit_review) episode ends, final grade is returned
42
- (or auto-ends when max_steps is reached)
 
 
 
 
43
  """
44
 
45
  SUPPORTS_CONCURRENT_SESSIONS = False
@@ -49,6 +76,10 @@ class CodeReviewEnvironment(_BaseEnv):
49
  self._task: Optional[dict] = None
50
  self._ground_truth: List[Issue] = []
51
  self._hint_index: int = 0
 
 
 
 
52
 
53
  def reset(
54
  self,
@@ -70,6 +101,9 @@ class CodeReviewEnvironment(_BaseEnv):
70
  for gt in self._task["ground_truth_issues"]
71
  ]
72
  self._hint_index = 0
 
 
 
73
 
74
  self._state = ReviewState(
75
  task_id=task_id,
@@ -81,6 +115,16 @@ class CodeReviewEnvironment(_BaseEnv):
81
  submitted=False,
82
  )
83
 
84
  return ReviewObservation(
85
  task_id=task_id,
86
  task_description=self._task["description"],
@@ -93,11 +137,16 @@ class CodeReviewEnvironment(_BaseEnv):
93
  feedback=(
94
  f"New episode started. Task: {self._task['difficulty'].upper()}. "
95
  f"Review the code carefully and flag all issues you find. "
96
- f"Use 'submit_review' when done."
 
97
  ),
98
  current_score=0.0,
99
  done=False,
100
  reward=None,
 
 
 
 
101
  )
102
 
103
  def step(
@@ -133,26 +182,43 @@ class CodeReviewEnvironment(_BaseEnv):
133
  action = ReviewAction.from_dict(action)
134
 
135
  self._state.step_count += 1
136
- reward, feedback = self._process_action(action)
 
 
 
 
137
 
138
  max_steps = self._task["max_steps"]
139
  auto_end = self._state.step_count >= max_steps and not self._state.submitted
140
  done = self._state.submitted or auto_end
141
 
142
  if auto_end and not self._state.submitted:
143
- # Grade what was submitted so far
144
  final = grade_episode(self._state.flagged_issues, self._ground_truth)
145
  self._state.current_score = final
146
- reward = final * 0.5 # partial credit for auto-end
 
147
  feedback += (
148
- f" Max steps reached. Auto-graded: {final:.3f}. "
149
- f"Submit earlier for best score."
150
  )
151
  self._state.submitted = True
152
 
153
  live = compute_live_score(self._state.flagged_issues, self._ground_truth)
154
  self._state.current_score = live
155
 
156
  return ReviewObservation(
157
  task_id=self._state.task_id,
158
  task_description="",
@@ -166,12 +232,130 @@ class CodeReviewEnvironment(_BaseEnv):
166
  current_score=live,
167
  done=done,
168
  reward=reward,
 
169
  )
170
 
171
  @property
172
  def state(self) -> ReviewState:
173
  return self._state
174
 
175
  def _process_action(self, action: ReviewAction):
176
  atype = (action.action_type or "").strip().lower()
177
 
@@ -187,25 +371,26 @@ class CodeReviewEnvironment(_BaseEnv):
187
  return 0.0, (
188
  f"Unknown action_type '{action.action_type}'. "
189
  "Use: flag_issue | clear_flag | request_hint | submit_review"
190
- )
191
 
192
  def _handle_flag(self, action: ReviewAction):
193
  if action.line_number is None:
194
- return -0.02, "flag_issue requires 'line_number'."
195
  if not action.filename:
196
- return -0.02, "flag_issue requires 'filename'."
197
  if action.issue_type not in ("bug", "security", "performance", "logic", None):
198
  action.issue_type = "bug"
199
  if action.severity not in ("low", "medium", "high", "critical", None):
200
  action.severity = "medium"
201
 
 
202
  for existing in self._state.flagged_issues:
203
  if (existing.line_number == action.line_number
204
  and existing.filename == action.filename):
205
  return 0.0, (
206
  f"Line {action.line_number} in {action.filename} already flagged. "
207
- "Use clear_flag first if you want to change the finding."
208
- )
209
 
210
  new_issue = Issue(
211
  line_number=action.line_number,
@@ -216,95 +401,258 @@ class CodeReviewEnvironment(_BaseEnv):
216
  fix_suggestion=action.fix_suggestion,
217
  )
218
 
219
- is_tp = any(
220
- match_issue(new_issue, gt)
221
- for gt in self._ground_truth
222
- )
223
 
224
  self._state.flagged_issues.append(new_issue)
225
 
226
- if is_tp:
227
- reward = 0.10
 
228
  feedback = (
229
- f"Good catch! Issue flagged at {action.filename}:{action.line_number}. "
230
- f"[+0.10 reward — correct finding]"
231
  )
 
232
  else:
233
- reward = -0.05
 
234
  feedback = (
235
- f"Issue flagged at {action.filename}:{action.line_number}. "
236
- f"[-0.05 reward no matching ground-truth issue nearby]"
237
  )
238
 
239
- return reward, feedback
240
 
241
  def _handle_clear(self, action: ReviewAction):
242
  if action.line_number is None or not action.filename:
243
- return -0.02, "clear_flag requires 'line_number' and 'filename'."
244
-
245
- before = len(self._state.flagged_issues)
246
- removed = None
247
- self._state.flagged_issues = [
248
- f for f in self._state.flagged_issues
249
- if not (f.line_number == action.line_number
250
- and f.filename == action.filename)
251
- ]
252
 
253
- if len(self._state.flagged_issues) == before:
 
254
  return 0.0, (
255
  f"No flagged issue found at {action.filename}:{action.line_number}."
256
- )
257
 
258
- removed_issue = Issue(
259
- line_number=action.line_number,
260
- filename=action.filename,
261
- issue_type="bug",
262
- severity="medium",
263
- )
264
  was_tp = any(match_issue(removed_issue, gt) for gt in self._ground_truth)
265
 
266
  if was_tp:
267
- reward = -0.03
 
 
 
 
 
 
268
  feedback = (
269
  f"Removed a correct finding at {action.filename}:{action.line_number}. "
270
- f"[-0.03 reward]"
271
  )
272
  else:
273
- reward = 0.03
 
 
 
274
  feedback = (
275
  f"Removed a false positive at {action.filename}:{action.line_number}. "
276
- f"[+0.03 reward — good correction]"
277
  )
278
 
279
- return reward, feedback
280
 
281
  def _handle_hint(self):
282
  hints = self._task.get("hints", [])
 
 
 
 
 
283
  if self._hint_index >= len(hints):
284
- return -0.01, "No more hints available for this task."
285
 
286
  hint = hints[self._hint_index]
287
  self._hint_index += 1
288
  remaining = len(hints) - self._hint_index
289
- return -0.01, f"Hint {self._hint_index}/{len(hints)}: {hint} ({remaining} hints left)"
 
290
 
291
  def _handle_submit(self):
292
  self._state.submitted = True
293
  final_score = grade_episode(self._state.flagged_issues, self._ground_truth)
294
  self._state.current_score = final_score
295
 
296
- tp_count = sum(
297
- 1 for f in self._state.flagged_issues
298
- if any(match_issue(f, gt) for gt in self._ground_truth)
299
- )
300
  total_gt = len(self._ground_truth)
301
  total_flagged = len(self._state.flagged_issues)
 
 
 
 
302
 
303
  feedback = (
304
  f"Review submitted! Final score: {final_score:.3f}. "
305
- f"Found {tp_count}/{total_gt} real issues. "
306
- f"Total flags: {total_flagged} "
307
- f"({'perfect' if total_flagged == tp_count else f'{total_flagged - tp_count} false positives'})."
308
  )
309
-
310
- return final_score, feedback
 
9
  import os
10
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
11
 
12
+ from typing import Optional, List, Dict, Any, Set
13
 
14
  from models import Issue, ReviewAction, ReviewObservation, ReviewState
15
  from tasks.data import ALL_TASKS, TASK_IDS
16
+ from server.graders import (
17
+ grade_episode, compute_live_score, match_issue, match_quality,
18
+ compute_code_metadata, grade_episode_detailed,
19
+ graduated_near_reward, compute_potential, compute_code_state_features,
20
+ )
21
 
22
  try:
23
  from openenv.core.env_server import Environment as _BaseEnv
 
29
  pass
30
 
31
 
32
+ # Reward constants
33
+ _BASE_TP_REWARD = 0.10
34
+ _NEAR_MISS_REWARD = 0.03
35
+ _BASE_FP_PENALTY = -0.05
36
+ _SEVERITY_EXACT_BONUS = 0.02 # when severity exactly matches GT
37
+ _TEMPORAL_BONUS = 0.02 # early correct flag (first 40% of steps)
38
+ _CONFIDENCE_TP_BONUS = 0.01 # high-confidence TP
39
+ _CONFIDENCE_FP_EXTRA = -0.03 # high-confidence FP (penalty multiplier)
40
+ _HINT_COST = -0.01
41
+ _REMOVE_TP_PENALTY = -0.03
42
+ _REMOVE_FP_REWARD = 0.03
43
+ _VALIDATION_PENALTY = -0.02
44
+ # Flood protection: escalating FP penalty
45
+ _FP_FLOOD_THRESHOLD = 3 # FPs before escalation kicks in
46
+ _FP_FLOOD_MULTIPLIER = 1.5 # each extra FP beyond threshold costs 1.5x more
47
+
48
+ _SEV_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}
49
+
50
+
51
  class CodeReviewEnvironment(_BaseEnv):
52
  """
53
+ A code review and security audit RL environment.
54
 
55
  The agent receives code files and must identify bugs, security
56
  vulnerabilities, and performance issues by flagging them with
57
  exact line numbers, types, and severity ratings.
58
 
59
+ Reward design:
60
+ - True positive flag: +0.10 base, +0.02 severity exact match,
61
+ +0.02 early (first 40% steps), +0.01 high-confidence TP
62
+ - Near-miss (±3-5 lines): +0.03 partial credit
63
+ - False positive: -0.05 base, escalating penalty after 3rd FP,
64
+ extra -0.03 for high-confidence FP
65
+ - Clear false positive: +0.03
66
+ - Clear true positive: -0.03
67
+ - Hint: -0.01
68
+ - Submit: final F1+severity score (0.0–1.0)
69
+ - Auto-end (max_steps): full grade score (no penalty)
70
  """
71
 
72
  SUPPORTS_CONCURRENT_SESSIONS = False
 
76
  self._task: Optional[dict] = None
77
  self._ground_truth: List[Issue] = []
78
  self._hint_index: int = 0
79
+ self._code_metadata: Dict[str, Any] = {}
80
+ self._fp_count: int = 0 # total false positives this episode
81
+ self._matched_gt_indices: Set[int] = set() # GT indices already matched
82
+ self._episode_rewards: List[float] = [] # for VL return normalization
83
 
84
  def reset(
85
  self,
 
101
  for gt in self._task["ground_truth_issues"]
102
  ]
103
  self._hint_index = 0
104
+ self._fp_count = 0
105
+ self._matched_gt_indices = set()
106
+ self._episode_rewards = []
107
 
108
  self._state = ReviewState(
109
  task_id=task_id,
 
115
  submitted=False,
116
  )
117
 
118
+ issue_categories = list({gt.issue_type for gt in self._ground_truth})
119
+ self._code_metadata = compute_code_metadata(
120
+ self._task["code_files"],
121
+ issue_categories=issue_categories,
122
+ )
123
+ # Pre-compute initial state features (progress=empty at reset)
124
+ self._code_metadata["state_features"] = compute_code_state_features(
125
+ self._code_metadata, progress={}
126
+ )
127
+
128
  return ReviewObservation(
129
  task_id=task_id,
130
  task_description=self._task["description"],
 
137
  feedback=(
138
  f"New episode started. Task: {self._task['difficulty'].upper()}. "
139
  f"Review the code carefully and flag all issues you find. "
140
+ f"Use 'submit_review' when done. "
141
+ f"Issue categories present: {sorted(set(issue_categories))}."
142
  ),
143
  current_score=0.0,
144
  done=False,
145
  reward=None,
146
+ reward_breakdown={},
147
+ progress={},
148
+ flagged_summary={},
149
+ code_metadata=self._code_metadata,
150
  )
151
 
152
  def step(
 
182
  action = ReviewAction.from_dict(action)
183
 
184
  self._state.step_count += 1
185
+ reward, feedback, reward_breakdown = self._process_action(action)
186
+
187
+ # Track episode rewards for VL return normalization
188
+ if reward is not None:
189
+ self._episode_rewards.append(float(reward))
190
 
191
  max_steps = self._task["max_steps"]
192
  auto_end = self._state.step_count >= max_steps and not self._state.submitted
193
  done = self._state.submitted or auto_end
194
 
195
  if auto_end and not self._state.submitted:
196
+ # Auto-end: grade in full (no penalty for hitting step limit)
197
  final = grade_episode(self._state.flagged_issues, self._ground_truth)
198
  self._state.current_score = final
199
+ reward = final # full score, no 0.5x penalty
200
+ reward_breakdown = {"auto_end_grade": final, "total": final}
201
  feedback += (
202
+ f" Step budget exhausted — auto-graded: {final:.3f}. "
203
+ f"Submit earlier next time for slightly cleaner feedback."
204
  )
205
  self._state.submitted = True
206
 
207
  live = compute_live_score(self._state.flagged_issues, self._ground_truth)
208
  self._state.current_score = live
209
 
210
+ progress = self._compute_progress(max_steps)
211
+ flagged_summary = self._compute_flagged_summary()
212
+
213
+ # PRM-style dense signal: expected reward-to-go
214
+ # Based on Process Reward Models research: give agent an estimate of
215
+ # how much reward is still available, so it can plan remaining steps.
216
+ tp_found = len(self._matched_gt_indices)
217
+ total_gt = len(self._ground_truth)
218
+ issues_remaining = total_gt - tp_found
219
+ # Expected: each remaining TP gives ~0.12 (base + avg severity bonus)
220
+ expected_reward_to_go = round(issues_remaining * 0.12, 3)
221
+
222
  return ReviewObservation(
223
  task_id=self._state.task_id,
224
  task_description="",
 
232
  current_score=live,
233
  done=done,
234
  reward=reward,
235
+ reward_breakdown=reward_breakdown,
236
+ progress=progress,
237
+ flagged_summary=flagged_summary,
238
+ code_metadata={}, # Only populated on reset
239
+ metadata={
240
+ "issues_remaining": issues_remaining,
241
+ "expected_reward_to_go": expected_reward_to_go,
242
+ },
243
  )
244
 
245
  @property
246
  def state(self) -> ReviewState:
247
  return self._state
248
 
249
+ # ------------------------------------------------------------------
250
+ # Progress and summary helpers
251
+ # ------------------------------------------------------------------
252
+
253
+ def _compute_progress(self, max_steps: int) -> Dict[str, Any]:
254
+ """Compute live precision/recall/f1, step stats, and unfound issue types."""
255
+ flagged = self._state.flagged_issues
256
+ gt = self._ground_truth
257
+
258
+ tp = 0
259
+ fp = 0
260
+ matched: Set[int] = set()
261
+ found_types: Set[str] = set()
262
+
263
+ for flag in flagged:
264
+ hit = False
265
+ for i, g in enumerate(gt):
266
+ if i not in matched and match_issue(flag, g):
267
+ tp += 1
268
+ matched.add(i)
269
+ found_types.add(g.issue_type)
270
+ hit = True
271
+ break
272
+ if not hit:
273
+ fp += 1
274
+
275
+ fn = len(gt) - len(matched)
276
+ precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
277
+ recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
278
+ f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0
279
+
280
+ all_types = {g.issue_type for g in gt}
281
+ unfound_types = sorted(all_types - found_types)
282
+
283
+ steps_used = self._state.step_count
284
+ steps_remaining = max(0, max_steps - steps_used)
285
+
286
+ # Variable-Length Return Normalization (VL Norm 2025):
287
+ # normalized_return = cumulative_reward / max(steps_used, 1)
288
+ # This makes return comparable across episodes of different length,
289
+ # which is key for multi-task RL where tasks have different max_steps.
290
+ cumulative_reward = sum(self._episode_rewards)
291
+ normalized_return = round(cumulative_reward / max(steps_used, 1), 4)
292
+
293
+ progress = {
294
+ "precision": round(precision, 4),
295
+ "recall": round(recall, 4),
296
+ "f1": round(f1, 4),
297
+ "true_positives": float(tp),
298
+ "false_positives": float(fp),
299
+ "total_ground_truth": float(len(gt)),
300
+ "steps_used": float(steps_used),
301
+ "steps_remaining": float(steps_remaining),
302
+ "unfound_issue_types": unfound_types,
303
+ "normalized_return": normalized_return,
304
+ "cumulative_reward": round(cumulative_reward, 4),
305
+ }
306
+
307
+ # 12-dim state feature vector for RL policy/value networks (code2vec/PBRS literature)
308
+ progress["state_features"] = compute_code_state_features(
309
+ self._code_metadata, progress=progress
310
+ )
311
+
312
+ return progress
313
+
314
+ def _compute_flagged_summary(self) -> Dict[str, Any]:
315
+ """Compute correct/incorrect/near_miss counts."""
316
+ flagged = self._state.flagged_issues
317
+ gt = self._ground_truth
318
+
319
+ correct = 0
320
+ near_misses = 0
321
+ incorrect = 0
322
+ matched_gt: Set[int] = set()
323
+
324
+ for flag in flagged:
325
+ matched = False
326
+ for i, g in enumerate(gt):
327
+ if i in matched_gt:
328
+ continue
329
+ if match_issue(flag, g):
330
+                    correct += 1
+                    matched_gt.add(i)
+                    matched = True
+                    break
+
+            if not matched:
+                is_near = False
+                for i, g in enumerate(gt):
+                    if i in matched_gt:
+                        continue
+                    if match_quality(flag, g) == "near":
+                        is_near = True
+                        break
+                if is_near:
+                    near_misses += 1
+                else:
+                    incorrect += 1
+
+        return {
+            "total_flagged": len(flagged),
+            "correct": correct,
+            "incorrect": incorrect,
+            "near_misses": near_misses,
+        }
+
+    # ------------------------------------------------------------------
+    # Action handlers
+    # ------------------------------------------------------------------
+
     def _process_action(self, action: ReviewAction):
         atype = (action.action_type or "").strip().lower()

         return 0.0, (
             f"Unknown action_type '{action.action_type}'. "
             "Use: flag_issue | clear_flag | request_hint | submit_review"
+        ), {}

     def _handle_flag(self, action: ReviewAction):
         if action.line_number is None:
+            return _VALIDATION_PENALTY, "flag_issue requires 'line_number'.", {"validation_penalty": _VALIDATION_PENALTY}
         if not action.filename:
+            return _VALIDATION_PENALTY, "flag_issue requires 'filename'.", {"validation_penalty": _VALIDATION_PENALTY}
         if action.issue_type not in ("bug", "security", "performance", "logic", None):
             action.issue_type = "bug"
         if action.severity not in ("low", "medium", "high", "critical", None):
             action.severity = "medium"

+        # Duplicate check
         for existing in self._state.flagged_issues:
             if (existing.line_number == action.line_number
                     and existing.filename == action.filename):
                 return 0.0, (
                     f"Line {action.line_number} in {action.filename} already flagged. "
+                    "Use clear_flag first to change it."
+                ), {"duplicate": 0.0}

         new_issue = Issue(
             line_number=action.line_number,

             fix_suggestion=action.fix_suggestion,
         )

+        # Classify: TP, near-miss (with line distance), or FP
+        is_tp = False
+        is_near = False
+        near_line_diff = 0
+        matched_gt_issue: Optional[Issue] = None
+        matched_gt_idx: Optional[int] = None
+
+        for i, gt in enumerate(self._ground_truth):
+            q = match_quality(new_issue, gt)
+            if q == "exact" and i not in self._matched_gt_indices:
+                is_tp = True
+                matched_gt_issue = gt
+                matched_gt_idx = i
+                break
+            elif q == "near" and not is_near:
+                is_near = True
+                near_line_diff = abs(new_issue.line_number - gt.line_number)

         self._state.flagged_issues.append(new_issue)

+        # PBRS: compute potential before and after this flag
+        tp_before = len(self._matched_gt_indices)
+        total_gt = len(self._ground_truth)
+
+        reward_breakdown: Dict[str, float] = {}
+
+        if is_tp and matched_gt_issue is not None and matched_gt_idx is not None:
+            self._matched_gt_indices.add(matched_gt_idx)
+            tp_after = len(self._matched_gt_indices)
+
+            base_reward = _BASE_TP_REWARD
+            reward_breakdown["base_tp"] = base_reward
+
+            # Severity exact match bonus
+            severity_bonus = 0.0
+            if new_issue.severity == matched_gt_issue.severity:
+                severity_bonus = _SEVERITY_EXACT_BONUS
+            reward_breakdown["severity_exact"] = severity_bonus
+
+            # Temporal bonus: TP caught in first 40% of max_steps
+            max_steps = self._task["max_steps"]
+            early_threshold = max(1, int(max_steps * 0.4))
+            temporal_bonus = 0.0
+            if self._state.step_count <= early_threshold:
+                temporal_bonus = _TEMPORAL_BONUS
+            reward_breakdown["temporal_bonus"] = temporal_bonus
+
+            # Confidence calibration: high confidence TP → small bonus
+            confidence_bonus = 0.0
+            if action.confidence is not None and action.confidence >= 0.7:
+                confidence_bonus = _CONFIDENCE_TP_BONUS
+            reward_breakdown["confidence_bonus"] = confidence_bonus
+
+            # PBRS: Φ(s') - Φ(s) (potential-based shaping, policy-invariant)
+            phi_before = compute_potential(tp_before, total_gt)
+            phi_after = compute_potential(tp_after, total_gt)
+            pbrs_bonus = round(phi_after - phi_before, 4)
+            reward_breakdown["pbrs_shaping"] = pbrs_bonus
+
+            reward = base_reward + severity_bonus + temporal_bonus + confidence_bonus + pbrs_bonus
+            reward_breakdown["total"] = round(reward, 4)
+
+            sev_note = f", severity +{severity_bonus:.2f}" if severity_bonus else ""
+            temp_note = f", early +{temporal_bonus:.2f}" if temporal_bonus else ""
+            conf_note = f", conf +{confidence_bonus:.2f}" if confidence_bonus else ""
+            pbrs_note = f", progress +{pbrs_bonus:.2f}" if pbrs_bonus > 0 else ""
             feedback = (
+                f"Correct! Issue at {action.filename}:{action.line_number} confirmed. "
+                f"[+{reward:.2f}{sev_note}{temp_note}{conf_note}{pbrs_note}]"
             )
+
+        elif is_near:
+            # Graduated near-miss: smooth exponential decay by line distance
+            near_reward = graduated_near_reward(near_line_diff)
+            reward_breakdown["near_miss"] = near_reward
+            reward_breakdown["line_diff"] = float(near_line_diff)
+            reward_breakdown["total"] = near_reward
+            feedback = (
+                f"Close! Near a real issue at {action.filename}:{action.line_number}. "
+                f"[+{near_reward:.3f} — {near_line_diff} lines off, adjust line number]"
+            )
+            reward = near_reward
+
         else:
+            # False positive — with flood protection
+            self._fp_count += 1
+
+            base_penalty = _BASE_FP_PENALTY
+            reward_breakdown["base_fp"] = base_penalty
+
+            # Escalating penalty after _FP_FLOOD_THRESHOLD FPs
+            flood_penalty = 0.0
+            if self._fp_count > _FP_FLOOD_THRESHOLD:
+                extra = self._fp_count - _FP_FLOOD_THRESHOLD
+                flood_penalty = round(-0.02 * extra * _FP_FLOOD_MULTIPLIER, 3)
+            reward_breakdown["flood_penalty"] = flood_penalty
+
+            # High-confidence FP: extra penalty
+            confidence_penalty = 0.0
+            if action.confidence is not None and action.confidence >= 0.7:
+                confidence_penalty = _CONFIDENCE_FP_EXTRA
+            reward_breakdown["confidence_penalty"] = confidence_penalty
+
+            reward = base_penalty + flood_penalty + confidence_penalty
+            reward_breakdown["total"] = round(reward, 4)
+
+            flood_note = f", over-flagging -{abs(flood_penalty):.2f}" if flood_penalty else ""
+            conf_note = f", high-confidence penalty {confidence_penalty:.2f}" if confidence_penalty else ""
             feedback = (
+                f"No match at {action.filename}:{action.line_number}. "
+                f"[{reward:.2f} — false positive{flood_note}{conf_note}]"
             )

+        return reward, feedback, reward_breakdown

     def _handle_clear(self, action: ReviewAction):
         if action.line_number is None or not action.filename:
+            return _VALIDATION_PENALTY, "clear_flag requires 'line_number' and 'filename'.", {"validation_penalty": _VALIDATION_PENALTY}

+        removed_issue = None
+        new_list = []
+        for f in self._state.flagged_issues:
+            if f.line_number == action.line_number and f.filename == action.filename:
+                removed_issue = f
+            else:
+                new_list.append(f)
+
+        if removed_issue is None:
             return 0.0, (
                 f"No flagged issue found at {action.filename}:{action.line_number}."
+            ), {"no_op": 0.0}

+        self._state.flagged_issues = new_list
+
+        # Check if removed issue was TP
         was_tp = any(match_issue(removed_issue, gt) for gt in self._ground_truth)

         if was_tp:
+            # Un-track it from matched set
+            for i, gt in enumerate(self._ground_truth):
+                if match_issue(removed_issue, gt):
+                    self._matched_gt_indices.discard(i)
+                    break
+            reward = _REMOVE_TP_PENALTY
+            reward_breakdown = {"removed_tp": reward, "total": reward}
             feedback = (
                 f"Removed a correct finding at {action.filename}:{action.line_number}. "
+                f"[{reward:.2f}]"
             )
         else:
+            # Removing a FP — decrement counter
+            self._fp_count = max(0, self._fp_count - 1)
+            reward = _REMOVE_FP_REWARD
+            reward_breakdown = {"removed_fp": reward, "total": reward}
             feedback = (
                 f"Removed a false positive at {action.filename}:{action.line_number}. "
+                f"[+{reward:.2f} — good correction]"
             )

+        return reward, feedback, reward_breakdown

     def _handle_hint(self):
         hints = self._task.get("hints", [])
+
+        adaptive_hint = self._get_adaptive_hint()
+        if adaptive_hint:
+            return _HINT_COST, f"Hint: {adaptive_hint} ({_HINT_COST} reward)", {"hint_cost": _HINT_COST}
+
         if self._hint_index >= len(hints):
+            return _HINT_COST, "No more hints available for this task.", {"hint_cost": _HINT_COST}

         hint = hints[self._hint_index]
         self._hint_index += 1
         remaining = len(hints) - self._hint_index
+        return _HINT_COST, f"Hint {self._hint_index}/{len(hints)}: {hint} ({remaining} hints left)", {"hint_cost": _HINT_COST}
+
+    def _get_adaptive_hint(self) -> Optional[str]:
+        """Generate a context-aware hint based on current episode state."""
+        flagged = self._state.flagged_issues
+        gt = self._ground_truth
+
+        if not gt:
+            return None
+
+        tp_count = len(self._matched_gt_indices)
+        fp_count = len(flagged) - tp_count - sum(
+            1 for f in flagged
+            if any(match_quality(f, g) == "near" for g in gt)
+        )
+
+        issue_categories = self._code_metadata.get("issue_categories", [])
+
+        # Many false positives: over-flagging
+        if fp_count > tp_count and fp_count >= 2:
+            return (
+                "You are over-flagging. Focus only on confident, concrete findings. "
+                "Consider using clear_flag to remove uncertain flags."
+            )
+
+        # No correct flags at all yet
+        if len(flagged) > 0 and tp_count == 0:
+            if issue_categories:
+                cats = ", ".join(sorted(set(issue_categories)))
+                return (
+                    f"Focus on [{cats}] issues. "
+                    "None of your current flags match real issues. Re-examine carefully."
+                )
+
+        # Found some but missed whole categories
+        if tp_count > 0 and issue_categories:
+            found_types: Set[str] = set()
+            for i in self._matched_gt_indices:
+                found_types.add(gt[i].issue_type)
+            missed = sorted(set(issue_categories) - found_types)
+            if missed:
+                missed_str = ", ".join(missed)
+                return (
+                    f"Good progress! You've found some issues but haven't flagged any "
+                    f"[{missed_str}] issues yet — look again for those specifically."
+                )
+
+        return None  # Fall through to static hints

     def _handle_submit(self):
         self._state.submitted = True
         final_score = grade_episode(self._state.flagged_issues, self._ground_truth)
         self._state.current_score = final_score

+        tp_count = len(self._matched_gt_indices)
         total_gt = len(self._ground_truth)
         total_flagged = len(self._state.flagged_issues)
+        fp_count = total_flagged - tp_count
+
+        # Breakdown for detailed feedback
+        detailed = grade_episode_detailed(self._state.flagged_issues, self._ground_truth)

         feedback = (
             f"Review submitted! Final score: {final_score:.3f}. "
+            f"Found {tp_count}/{total_gt} issues. "
+            f"Precision: {detailed['precision']:.2f}, Recall: {detailed['recall']:.2f}, "
+            f"F1: {detailed['f1']:.2f}. "
         )
+        if fp_count > 0:
+            feedback += f"{fp_count} false positive(s). "
+        if detailed["false_negatives"] > 0:
+            fn = detailed["false_negatives"]
+            feedback += f"{fn} issue(s) missed."
+
+        reward_breakdown = {
+            "final_f1": detailed["f1"],
+            "severity_accuracy": detailed["severity_accuracy"],
+            "final_score": final_score,
+            "total": final_score,
+        }
+        return final_score, feedback, reward_breakdown
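
The per-flag shaping above reduces to two small pure functions. As a sanity check of the numbers quoted in the feedback strings, here is a standalone sketch mirroring `compute_potential` and `graduated_near_reward` from `server/graders.py` in this commit (constants taken from the diff; not the shipped module itself):

```python
import math

EXACT_TOLERANCE = 2    # ±2 lines counts as an exact match
BASE_TP_REWARD = 0.10  # base reward for a true positive
NEAR_DECAY = 0.6       # exponential decay per line beyond the exact tolerance
POTENTIAL_SCALE = 0.5  # Φ(s) = (tp_found / total_gt) * POTENTIAL_SCALE

def compute_potential(tp_count: int, total_gt: int) -> float:
    """PBRS potential Φ(s); the caller rewards Φ(s') - Φ(s)."""
    if total_gt <= 0:
        return 0.0
    return (tp_count / total_gt) * POTENTIAL_SCALE

def graduated_near_reward(line_diff: int) -> float:
    """Smooth decay of the TP reward beyond the exact-match tolerance."""
    if line_diff <= EXACT_TOLERANCE:
        return BASE_TP_REWARD
    extra = line_diff - EXACT_TOLERANCE
    return round(BASE_TP_REWARD * math.exp(-NEAR_DECAY * extra), 4)

# PBRS delta for finding the 2nd of 4 issues: Φ(2/4) - Φ(1/4)
delta = compute_potential(2, 4) - compute_potential(1, 4)
print(round(delta, 4))           # 0.125
print(graduated_near_reward(3))  # 0.0549
```

Because the shaping term is a pure difference of potentials, the bonuses telescope: over an episode the total PBRS contribution depends only on the final recall, which is what keeps the shaping policy-invariant.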
server/graders.py CHANGED
@@ -1,10 +1,21 @@
1
  """
2
  Grading logic for the Code Review Environment.
 
 
 
 
 
 
 
 
 
3
  """
4
  from __future__ import annotations
5
 
 
 
6
  import re
7
- from typing import List, Tuple, Set
8
 
9
  import sys
10
  import os
@@ -21,8 +32,18 @@ _TYPE_COMPAT = {
21
  "performance": {"performance"},
22
  }
23
 
 
 
 
 
 
 
 
 
24
 
25
- def match_issue(flagged: Issue, gt: Issue, line_tolerance: int = 2) -> bool:
 
 
26
  if flagged.filename != gt.filename:
27
  return False
28
  if abs(flagged.line_number - gt.line_number) > line_tolerance:
@@ -33,6 +54,274 @@ def match_issue(flagged: Issue, gt: Issue, line_tolerance: int = 2) -> bool:
33
  return True
34
 
35
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
36
  def grade_episode(
37
  flagged: List[Issue],
38
  ground_truth: List[Issue],
@@ -79,6 +368,105 @@ def grade_episode(
79
  return round(min(1.0, max(0.0, final)), 4)
80
 
81
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82
  def compute_live_score(flagged: List[Issue], ground_truth: List[Issue]) -> float:
83
  """F1-only score for per-step feedback (no severity bonus)."""
84
  if not ground_truth:
@@ -107,6 +495,7 @@ def compute_live_score(flagged: List[Issue], ground_truth: List[Issue]) -> float
107
 
108
 
109
  _PATTERNS = [
 
110
  (r"range\(len\(\w+\)\s*\+\s*1\)", None, "bug", "high",
111
  "Off-by-one error: range(len(x) + 1) iterates one past the end"),
112
  (r"left,\s*right\s*=\s*0,\s*len\(", None, "bug", "medium",
@@ -114,30 +503,81 @@ _PATTERNS = [
114
  (r"counts\[word\]\s*=\s*0\b", None, "bug", "low",
115
  "Counter initialized to 0 instead of 1"),
116
 
 
117
  (r'SECRET_KEY\s*=\s*["\']', None, "security", "high",
118
  "Hardcoded SECRET_KEY in source code"),
 
 
119
  (r'PASSWORD\s*=\s*["\']', None, "security", "high",
120
  "Hardcoded password in source code"),
 
 
121
  (r"f['\"].*SELECT.*\{", None, "security", "critical",
122
  "SQL injection via f-string query construction"),
 
 
123
  (r"f['\"].*DELETE.*\{", None, "security", "critical",
124
  "SQL injection via f-string DELETE query"),
 
 
 
 
125
  (r"render_template_string\(f['\"]", None, "security", "high",
126
  "XSS: unsanitized user input in render_template_string"),
127
  (r"shell\s*=\s*True", None, "security", "critical",
128
  "Command injection risk: shell=True with user input"),
129
- (r"hashlib\.md5\(", None, "security", "medium",
130
- "MD5 is cryptographically broken, use SHA-256 or HMAC-SHA256"),
 
 
 
 
 
 
 
 
131
  (r"expected\s*==\s*\w+_hash", None, "security", "medium",
132
  "Timing attack: use hmac.compare_digest() for constant-time comparison"),
 
 
 
 
 
 
 
 
133
  (r"password\s*=\s*models\.CharField", None, "security", "critical",
134
  "Plaintext password storage in database"),
135
- (r"os\.path\.join\(['\"]\/", None, "security", "high",
136
- "Path traversal: os.path.join with absolute prefix doesn't prevent traversal"),
137
 
 
 
 
 
 
 
 
 
 
 
 
138
  (r"\.objects\.get\(id=item\.", None, "performance", "high",
139
  "N+1 query: database lookup inside a loop"),
140
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
141
  (r"FloatField\(\)", None, "bug", "medium",
142
  "FloatField for monetary values causes precision errors, use DecimalField"),
143
  (r"BinaryField\(\)", None, "security", "high",
 
1
  """
2
  Grading logic for the Code Review Environment.
3
+
4
+ Reward design is grounded in:
5
+ - Potential-Based Reward Shaping (PBRS): Ng et al. 1999
6
+ R_shaped(s,a,s') = R(s,a,s') + γ·Φ(s') - Φ(s)
7
+ where Φ(s) = (tp_found / total_gt) · POTENTIAL_SCALE
8
+ - Graduated line-proximity rewards: exponential decay over line distance
9
+ reward = BASE_TP · exp(-DECAY · max(0, line_diff - EXACT_TOLERANCE))
10
+ for 0 < line_diff ≤ NEAR_TOLERANCE
11
+ - F1-based terminal scoring: 0.70·F1 + 0.30·severity_accuracy
12
  """
13
  from __future__ import annotations
14
 
15
+ import ast
16
+ import math
17
  import re
18
+ from typing import List, Tuple, Set, Dict, Optional
19
 
20
  import sys
21
  import os
 
32
  "performance": {"performance"},
33
  }
34
 
35
+ # Tolerances
36
+ NEAR_TOLERANCE = 5
37
+ EXACT_TOLERANCE = 2
38
+
39
+ # Graduated reward constants (PBRS + smooth near-miss)
40
+ BASE_TP_REWARD = 0.10
41
+ NEAR_DECAY = 0.6 # exponential decay per line beyond EXACT_TOLERANCE
42
+ POTENTIAL_SCALE = 0.5 # Φ(s) = (tp/total_gt) * POTENTIAL_SCALE
43
 
44
+
45
+ def match_issue(flagged: Issue, gt: Issue, line_tolerance: int = EXACT_TOLERANCE, near_tolerance: int = NEAR_TOLERANCE) -> bool:
46
+ """Return True if flagged matches gt within line_tolerance lines and same type."""
47
  if flagged.filename != gt.filename:
48
  return False
49
  if abs(flagged.line_number - gt.line_number) > line_tolerance:
 
54
  return True
55
 
56
 
57
+ def match_quality(flagged: Issue, gt: Issue) -> str:
58
+ """
59
+ Return quality of match between flagged and gt:
60
+ "exact" — within ±2 lines and right issue type
61
+ "near" — within ±3-5 lines and same file (regardless of type)
62
+ "none" — no meaningful match
63
+ """
64
+ if flagged.filename != gt.filename:
65
+ return "none"
66
+
67
+ line_diff = abs(flagged.line_number - gt.line_number)
68
+
69
+ if line_diff <= EXACT_TOLERANCE:
70
+ compat = _TYPE_COMPAT.get(gt.issue_type, {gt.issue_type})
71
+ if flagged.issue_type in compat:
72
+ return "exact"
73
+
74
+ if line_diff <= NEAR_TOLERANCE:
75
+ return "near"
76
+
77
+ return "none"
78
+
79
+
80
+ def graduated_near_reward(line_diff: int) -> float:
81
+ """
82
+ Graduated reward for near-miss flags using exponential decay.
83
+
84
+ Implements continuous reward shaping based on proximity:
85
+ line_diff = 0-2 → 0.10 (full TP, handled separately)
86
+ line_diff = 3 → 0.10 * exp(-0.6*1) ≈ 0.055
87
+ line_diff = 4 → 0.10 * exp(-0.6*2) ≈ 0.033
88
+ line_diff = 5 → 0.10 * exp(-0.6*3) ≈ 0.020
89
+
90
+ This gives smooth gradient signal rather than a hard 0.03 step function,
91
+ encouraging the agent to refine line numbers progressively.
92
+ """
93
+ if line_diff <= EXACT_TOLERANCE:
94
+ return BASE_TP_REWARD
95
+ extra = line_diff - EXACT_TOLERANCE
96
+ return round(BASE_TP_REWARD * math.exp(-NEAR_DECAY * extra), 4)
97
+
98
+
99
+ def compute_potential(tp_count: int, total_gt: int) -> float:
100
+ """
101
+ Potential function Φ(s) for Potential-Based Reward Shaping (PBRS).
102
+
103
+ Φ(s) = (tp_found / total_gt) * POTENTIAL_SCALE
104
+
105
+ The shaped reward R_shaped = r + Φ(s') - Φ(s) ensures policy invariance
106
+ (Ng et al. 1999): the optimal policy under shaped rewards is the same as
107
+ under the original rewards, but with better intermediate gradient signal.
108
+
109
+ Here we compute just Φ(s); the caller computes Φ(s') - Φ(s).
110
+ """
111
+ if total_gt <= 0:
112
+ return 0.0
113
+ return (tp_count / total_gt) * POTENTIAL_SCALE
114
+
115
+
116
+ def compute_function_map(code: str) -> Dict[int, str]:
117
+ """
118
+ Map each line number to the name of its enclosing function (or class method).
119
+ Lines outside any function map to "module". Non-parseable code returns empty dict.
120
+ """
121
+ result: Dict[int, str] = {}
122
+ try:
123
+ tree = ast.parse(code)
124
+ for node in ast.walk(tree):
125
+ if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
126
+ end = getattr(node, "end_lineno", node.lineno)
127
+ for lineno in range(node.lineno, end + 1):
128
+ result[lineno] = node.name
129
+ except SyntaxError:
130
+ pass
131
+ return result
132
+
133
+
134
+ def compute_code_metadata(code_files: Dict[str, str], issue_categories: Optional[List[str]] = None) -> Dict:
135
+ """
136
+ Extract code structure metadata using Python's ast module.
137
+
138
+ Returns:
139
+ total_lines, num_functions, function_names, num_classes, class_names,
140
+ imports, complexity_estimate, issue_categories, function_ranges
141
+ """
142
+ total_lines = 0
143
+ num_functions = 0
144
+ function_names: List[str] = []
145
+ num_classes = 0
146
+ class_names: List[str] = []
147
+ imports: List[str] = []
148
+ branch_count = 0
149
+ function_ranges: List[Dict] = [] # [{name, file, start, end}]
150
+
151
+ for filename, code in code_files.items():
152
+ lines = code.splitlines()
153
+ total_lines += len(lines)
154
+ try:
155
+ tree = ast.parse(code)
156
+ for node in ast.walk(tree):
157
+ if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
158
+ num_functions += 1
159
+ function_names.append(node.name)
160
+ end = getattr(node, "end_lineno", node.lineno)
161
+ function_ranges.append({
162
+ "name": node.name,
163
+ "file": filename,
164
+ "start": node.lineno,
165
+ "end": end,
166
+ })
167
+ elif isinstance(node, ast.ClassDef):
168
+ num_classes += 1
169
+ class_names.append(node.name)
170
+ elif isinstance(node, ast.Import):
171
+ for alias in node.names:
172
+ imports.append(alias.name.split(".")[0])
173
+ elif isinstance(node, ast.ImportFrom):
174
+ if node.module:
175
+ imports.append(node.module.split(".")[0])
176
+ elif isinstance(node, (ast.If, ast.For, ast.While, ast.Try,
177
+ ast.ExceptHandler, ast.With)):
178
+ branch_count += 1
179
+ except SyntaxError:
180
+ # If ast can't parse (e.g. non-Python file), just count lines
181
+ pass
182
+
183
+ # Deduplicate imports
184
+ imports = list(dict.fromkeys(imports))
185
+
186
+ # Complexity estimate
187
+ if branch_count <= 5:
188
+ complexity_estimate = "low"
189
+ elif branch_count <= 15:
190
+ complexity_estimate = "medium"
191
+ else:
192
+ complexity_estimate = "high"
193
+
194
+ return {
195
+ "total_lines": total_lines,
196
+ "num_functions": num_functions,
197
+ "function_names": function_names,
198
+ "num_classes": num_classes,
199
+ "class_names": class_names,
200
+ "imports": imports,
201
+ "complexity_estimate": complexity_estimate,
202
+ "issue_categories": list(set(issue_categories)) if issue_categories else [],
203
+ "function_ranges": function_ranges,
204
+ }
205
+
206
+
207
+ def compute_code_state_features(
208
+ code_metadata: Dict,
209
+ progress: Optional[Dict] = None,
210
+ ) -> List[float]:
211
+ """
212
+ Compute a normalized 12-dimensional feature vector for RL training.
213
+
214
+ Based on state representation research (code2vec, GraphCodeBERT, 2023-2024),
215
+ combining AST-derived structural features with episode progress metrics.
216
+ This vector is suitable as input to a policy network or value estimator.
217
+
218
+ Dimensions:
219
+ 0: total_lines / 200 — code size (normalized)
220
+ 1: num_functions / 20 — function count
221
+ 2: num_classes / 10 — class count
222
+ 3: complexity_score — 0=low, 0.5=medium, 1.0=high
223
+ 4: has_bug_issues — 1 if "bug" in issue_categories
224
+ 5: has_security_issues — 1 if "security" in issue_categories
225
+ 6: has_performance_issues — 1 if "performance" in issue_categories
226
+ 7: has_logic_issues — 1 if "logic" in issue_categories
227
+ 8: progress_recall — tp / total_gt (0 if no progress yet)
228
+ 9: progress_precision — precision so far
229
+ 10: steps_used_frac — steps_used / max_steps
230
+ 11: fp_pressure — false_positives / max(total_flagged, 1)
231
+ """
232
+ if progress is None:
233
+ progress = {}
234
+
235
+ complexity_map = {"low": 0.0, "medium": 0.5, "high": 1.0}
236
+ cats = set(code_metadata.get("issue_categories", []))
237
+
238
+ total_gt = float(progress.get("total_ground_truth", 1.0)) or 1.0
239
+ tp = float(progress.get("true_positives", 0.0))
240
+ fp = float(progress.get("false_positives", 0.0))
241
+ total_flagged = tp + fp
242
+ steps_used = float(progress.get("steps_used", 0.0))
243
+ steps_rem = float(progress.get("steps_remaining", 1.0))
244
+ max_steps = steps_used + steps_rem or 1.0
245
+
246
+ features = [
247
+ min(1.0, code_metadata.get("total_lines", 0) / 200.0),
248
+ min(1.0, code_metadata.get("num_functions", 0) / 20.0),
249
+ min(1.0, code_metadata.get("num_classes", 0) / 10.0),
250
+ complexity_map.get(code_metadata.get("complexity_estimate", "low"), 0.0),
251
+ 1.0 if "bug" in cats else 0.0,
252
+ 1.0 if "security" in cats else 0.0,
253
+ 1.0 if "performance" in cats else 0.0,
254
+ 1.0 if "logic" in cats else 0.0,
255
+ min(1.0, tp / total_gt),
256
+ min(1.0, tp / total_flagged) if total_flagged > 0 else 0.0,
257
+ min(1.0, steps_used / max_steps),
258
+ min(1.0, fp / total_flagged) if total_flagged > 0 else 0.0,
259
+ ]
260
+ return [round(f, 4) for f in features]
261
+
262
+
263
+ class RewardNormalizer:
264
+ """
265
+ Variable-Length Return Normalizer for multi-task RL training.
266
+
267
+ Based on VL Norm (2025) and Return-based Scaling (2021):
268
+ Normalizes episode returns accounting for variable episode lengths,
269
+ preventing long episodes from dominating gradient computation.
270
+
271
+ Usage:
272
+ normalizer = RewardNormalizer(window_size=100)
273
+ # After each episode:
274
+ normalizer.update(episode_return, episode_length)
275
+ normalized_r = normalizer.normalize(episode_return, episode_length)
276
+ """
277
+
278
+ def __init__(self, window_size: int = 100, eps: float = 1e-8) -> None:
279
+ self.window_size = window_size
280
+ self.eps = eps
281
+ self._returns: List[float] = []
282
+ self._lengths: List[int] = []
283
+ self.mean: float = 0.0
284
+ self.std: float = 1.0
285
+
286
+ def update(self, episode_return: float, episode_length: int) -> None:
287
+ """Record a completed episode for running statistics."""
288
+ self._returns.append(episode_return)
289
+ self._lengths.append(max(1, episode_length))
290
+ if len(self._returns) > self.window_size:
291
+ self._returns.pop(0)
292
+ self._lengths.pop(0)
293
+ self._recompute()
294
+
295
+ def _recompute(self) -> None:
296
+ if len(self._returns) < 2:
297
+ return
298
+ returns = [r for r in self._returns]
299
+ lengths = [l for l in self._lengths]
300
+ mean_len = sum(lengths) / len(lengths)
301
+ # Length-adjusted std: longer episodes have proportionally less weight
302
+ self.mean = sum(returns) / len(returns)
303
+ raw_std = (sum((r - self.mean) ** 2 for r in returns) / len(returns)) ** 0.5
304
+ length_factors = [(l / mean_len) ** 0.5 for l in lengths]
305
+ avg_lf = sum(length_factors) / len(length_factors)
306
+ self.std = max(self.eps, raw_std * avg_lf)
307
+
308
+ def normalize(self, episode_return: float, episode_length: int) -> float:
309
+ """Return the length-adjusted normalized return."""
310
+ if len(self._returns) < 2:
311
+ return episode_return
312
+ mean_len = sum(self._lengths) / len(self._lengths)
313
+ length_factor = (max(1, episode_length) / mean_len) ** 0.5
314
+ return round((episode_return - self.mean) / (self.std * length_factor + self.eps), 4)
315
+
316
+ def to_dict(self) -> Dict:
317
+ return {
318
+ "mean": round(self.mean, 4),
319
+ "std": round(self.std, 4),
320
+ "n_episodes": len(self._returns),
321
+ "window_size": self.window_size,
322
+ }
323
+
324
+
325
  def grade_episode(
326
  flagged: List[Issue],
327
  ground_truth: List[Issue],
 
368
  return round(min(1.0, max(0.0, final)), 4)
369
 
370
 
371
+ def grade_episode_detailed(
372
+ flagged: List[Issue],
373
+ ground_truth: List[Issue],
374
+ line_tolerance: int = 2,
375
+ ) -> Dict:
376
+ """
377
+ Full breakdown of grading results.
378
+
379
+ Returns:
380
+ score, f1, precision, recall, severity_accuracy,
381
+ true_positives, false_positives, false_negatives,
382
+ near_misses, per_file
383
+ """
384
+ if not ground_truth:
385
+ score = 1.0 if not flagged else 0.0
386
+ return {
387
+ "score": score,
388
+ "f1": score,
389
+ "precision": score,
390
+ "recall": score,
391
+ "severity_accuracy": score,
392
+ "true_positives": 0,
393
+ "false_positives": len(flagged),
394
+ "false_negatives": 0,
395
+ "near_misses": 0,
396
+ "per_file": {},
397
+ }
398
+
399
+ tp = 0
400
+ fp = 0
401
+ near_misses = 0
402
+ matched_gt_indices: Set[int] = set()
403
+ severity_scores: List[float] = []
404
+ per_file: Dict[str, Dict] = {}
405
+
406
+ for flag in flagged:
407
+ fname = flag.filename
408
+ if fname not in per_file:
409
+ per_file[fname] = {"tp": 0, "fp": 0, "near_miss": 0}
410
+
411
+ matched = False
412
+ for i, gt in enumerate(ground_truth):
413
+ if i in matched_gt_indices:
414
+ continue
415
+ if match_issue(flag, gt, line_tolerance):
416
+ tp += 1
417
+ matched_gt_indices.add(i)
418
+ matched = True
419
+ per_file[fname]["tp"] += 1
420
+ flag_rank = _SEV_RANK.get(flag.severity, 1)
421
+ gt_rank = _SEV_RANK.get(gt.severity, 1)
422
+ distance = abs(flag_rank - gt_rank)
423
+ severity_scores.append(max(0.0, 1.0 - distance * 0.34))
424
+ break
425
+
426
+ if not matched:
427
+ # Check for near miss (3-5 lines off, same file)
428
+ is_near = False
429
+ for i, gt in enumerate(ground_truth):
430
+ if i in matched_gt_indices:
431
+ continue
432
+ q = match_quality(flag, gt)
433
+ if q == "near":
434
+ is_near = True
435
+ break
436
+ if is_near:
437
+ near_misses += 1
438
+ per_file[fname]["near_miss"] += 1
439
+ else:
440
+ fp += 1
441
+ per_file[fname]["fp"] += 1
442
+
443
+ fn = len(ground_truth) - len(matched_gt_indices)
444
+
445
+ precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
446
+ recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
447
+ f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0
448
+
449
+ if severity_scores:
450
+ severity_accuracy = sum(severity_scores) / len(ground_truth)
451
+ else:
452
+ severity_accuracy = 0.0
453
+
454
+ score = round(min(1.0, max(0.0, 0.70 * f1 + 0.30 * severity_accuracy)), 4)
455
+
456
+ return {
457
+ "score": score,
458
+ "f1": round(f1, 4),
459
+ "precision": round(precision, 4),
460
+ "recall": round(recall, 4),
461
+ "severity_accuracy": round(severity_accuracy, 4),
462
+ "true_positives": tp,
463
+ "false_positives": fp,
464
+ "false_negatives": fn,
465
+ "near_misses": near_misses,
466
+ "per_file": per_file,
467
+ }
468
+
469
+
470
  def compute_live_score(flagged: List[Issue], ground_truth: List[Issue]) -> float:
471
  """F1-only score for per-step feedback (no severity bonus)."""
472
  if not ground_truth:
 
495
 
496
 
497
  _PATTERNS = [
498
+ # --- Bug patterns ---
499
  (r"range\(len\(\w+\)\s*\+\s*1\)", None, "bug", "high",
500
  "Off-by-one error: range(len(x) + 1) iterates one past the end"),
501
  (r"left,\s*right\s*=\s*0,\s*len\(", None, "bug", "medium",
 
503
  (r"counts\[word\]\s*=\s*0\b", None, "bug", "low",
504
  "Counter initialized to 0 instead of 1"),
505
 
506
+ # --- Hardcoded secrets ---
507
  (r'SECRET_KEY\s*=\s*["\']', None, "security", "high",
508
  "Hardcoded SECRET_KEY in source code"),
509
+ (r'ADMIN_TOKEN\s*=\s*["\']', None, "security", "high",
510
+ "Hardcoded ADMIN_TOKEN in source code"),
511
  (r'PASSWORD\s*=\s*["\']', None, "security", "high",
512
  "Hardcoded password in source code"),
513
+
514
+ # --- Injection attacks ---
515
  (r"f['\"].*SELECT.*\{", None, "security", "critical",
516
  "SQL injection via f-string query construction"),
517
+ (r"f['\"].*INSERT.*\{", None, "security", "critical",
518
+ "SQL injection via f-string INSERT query"),
519
  (r"f['\"].*DELETE.*\{", None, "security", "critical",
520
  "SQL injection via f-string DELETE query"),
521
+ (r"f['\"].*LIKE.*%\{", None, "security", "critical",
522
+ "SQL injection via f-string LIKE clause"),
523
+ (r"LIMIT\s*\{", None, "security", "critical",
524
+ "SQL injection: LIMIT clause uses unparameterized variable"),
525
  (r"render_template_string\(f['\"]", None, "security", "high",
526
  "XSS: unsanitized user input in render_template_string"),
527
  (r"shell\s*=\s*True", None, "security", "critical",
528
  "Command injection risk: shell=True with user input"),
529
+ (r"os\.system\(", None, "security", "critical",
530
+ "Command injection risk: os.system() executes shell commands"),
531
+ (r"os\.path\.join\(['\"]\/", None, "security", "high",
532
+ "Path traversal: os.path.join with absolute prefix doesn't prevent traversal"),
533
+
534
+ # --- Broken cryptography ---
535
+ (r"hashlib\.md5\(", None, "security", "high",
536
+ "MD5 is cryptographically broken for security use; use SHA-256 or bcrypt"),
537
+ (r"hashlib\.sha1\(", None, "security", "medium",
538
+ "SHA-1 is deprecated for security use; use SHA-256 or better"),
539
  (r"expected\s*==\s*\w+_hash", None, "security", "medium",
540
  "Timing attack: use hmac.compare_digest() for constant-time comparison"),
541
+
542
+ # --- Dangerous deserialization ---
543
+ (r"pickle\.loads\(", None, "security", "critical",
544
+ "Unsafe deserialization: pickle.loads() on untrusted data allows remote code execution"),
545
+ (r"yaml\.load\(", None, "security", "high",
546
+ "Unsafe YAML deserialization: use yaml.safe_load() instead"),
547
+
548
+ # --- Auth / access control ---
549
  (r"password\s*=\s*models\.CharField", None, "security", "critical",
550
  "Plaintext password storage in database"),
 
 
551
 
552
+ # --- Async / concurrency bugs ---
553
+ (r"aiohttp\.ClientSession\(\)", None, "bug", "high",
554
+ "ClientSession created outside 'async with' — may not be closed (resource leak)"),
555
+ (r"timeout\s*=\s*\d+\b", None, "bug", "medium",
556
+ "aiohttp timeout should be aiohttp.ClientTimeout(total=N), not a bare integer"),
557
+ (r"attempt\s*==\s*retries\b", None, "bug", "high",
558
+ "Off-by-one: range(retries) yields 0..retries-1, so attempt==retries is never true"),
559
+ (r"for\s+\w+\s+in\s+\w+_ids\s*:", None, "performance", "high",
560
+ "Sequential loop over IDs — consider asyncio.gather() for concurrent fetching"),
561
+
562
+ # --- Performance ---
563
  (r"\.objects\.get\(id=item\.", None, "performance", "high",
564
  "N+1 query: database lookup inside a loop"),
565
 
566
+ # --- JavaScript-specific patterns ---
567
+ (r"new\s+Function\(", None, "security", "critical",
568
+ "Unsafe dynamic code execution: new Function() with user input is equivalent to eval()"),
569
+ (r"\beval\(", None, "security", "critical",
570
+ "eval() with user-supplied input allows arbitrary code execution"),
571
+ (r"execSync\(", None, "security", "critical",
572
+ "Command injection risk: execSync() with user-supplied data"),
573
+ (r"jwt\.sign\(.*\{(?!.*expiresIn)", None, "security", "medium",
574
+ "JWT issued without expiry (expiresIn) — tokens are valid forever"),
575
+ (r"JWT_SECRET\s*=\s*['\"]", None, "security", "high",
576
+ "Hardcoded JWT secret in source code"),
577
+ (r"res\.send\(`.*\$\{", None, "security", "high",
578
+ "XSS: template literal with user input sent directly in response"),
579
+
580
+ # --- Data model bugs ---
581
  (r"FloatField\(\)", None, "bug", "medium",
582
  "FloatField for monetary values causes precision errors, use DecimalField"),
583
  (r"BinaryField\(\)", None, "security", "high",
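The grader additions above are plain `re` patterns applied line-by-line to the submitted code. A minimal sketch of how such a keyword baseline might apply them (the pattern tuples are copied from the table above; the `scan` helper is illustrative, not the actual `graders.py` API):

```python
import re

# Two patterns copied from the grader table above; severity labels match theirs.
PATTERNS = [
    (re.compile(r"pickle\.loads\("), "security", "critical"),
    (re.compile(r"attempt\s*==\s*retries\b"), "bug", "high"),
]

def scan(source: str):
    """Return (line_number, issue_type, severity) for every pattern hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pat, issue_type, severity in PATTERNS:
            if pat.search(line):
                hits.append((lineno, issue_type, severity))
    return hits

code = "data = pickle.loads(payload)\nif attempt == retries:\n    raise\n"
print(scan(code))  # → [(1, 'security', 'critical'), (2, 'bug', 'high')]
```

Line-by-line scanning is what makes the flagged line numbers directly comparable to the ground-truth issue positions.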
tasks/data.py CHANGED
@@ -418,10 +418,533 @@ TASK_COMPREHENSIVE: Dict[str, Any] = {
418
  }
419
 
420
 
421
+ _ASYNC_CODE = """\
422
+ import asyncio
423
+ import aiohttp
424
+ from typing import List, Optional
425
+
426
+ _cache: dict = {}
427
+
428
+
429
+ async def fetch_json(url: str, session: aiohttp.ClientSession) -> dict:
430
+ async with session.get(url, timeout=5) as resp:
431
+ return await resp.json()
432
+
433
+
434
+ async def get_user(user_id: int, session: aiohttp.ClientSession) -> dict:
435
+ if user_id in _cache:
436
+ return _cache[user_id]
437
+ data = await fetch_json(f"https://api.example.com/users/{user_id}", session)
438
+ _cache[user_id] = data
439
+ return data
440
+
441
+
442
+ async def process_users(user_ids: List[int]) -> List[dict]:
443
+ session = aiohttp.ClientSession()
444
+ results = []
445
+ for uid in user_ids:
446
+ result = await get_user(uid, session)
447
+ results.append(result)
448
+ return results
449
+
450
+
451
+ async def run_with_retry(url: str, retries: int = 3) -> Optional[str]:
452
+ for attempt in range(retries):
453
+ try:
454
+ async with aiohttp.ClientSession() as session:
455
+ async with session.get(url) as resp:
456
+ return await resp.text()
457
+ except Exception:
458
+ if attempt == retries:
459
+ raise
460
+ return None
461
+
462
+
463
+ class TaskRunner:
464
+ def __init__(self, concurrency: int = 5):
465
+ self.concurrency = concurrency
466
+ self.results = []
467
+
468
+ async def run_all(self, tasks: List) -> List:
469
+ for task in tasks:
470
+ result = await task
471
+ self.results.append(result)
472
+ return self.results
473
+ """
474
+
475
+ TASK_ASYNC_REVIEW: Dict[str, Any] = {
476
+ "task_id": "async-review",
477
+ "difficulty": "medium-hard",
478
+ "description": (
479
+ "Review this async Python module for concurrency bugs, resource leaks,\n"
480
+ "and performance issues with asyncio and aiohttp.\n"
481
+ "The code has subtle async-specific bugs that would cause failures or\n"
482
+ "degraded performance in production. Identify all issues with exact\n"
483
+ "line numbers, types, and severity.\n\n"
484
+ "File to review: async.py"
485
+ ),
486
+ "language": "python",
487
+ "code_files": {
488
+ "async.py": _ASYNC_CODE,
489
+ },
490
+ "ground_truth_issues": [
491
+ _issue(
492
+ 5, "async.py", "bug", "high",
493
+ "Shared mutable dict without asyncio.Lock; concurrent coroutines can read "
494
+ "stale data or overwrite each other's writes. Use async with _lock: around "
495
+ "cache check and write.",
496
+ "Add _lock = asyncio.Lock() and use: async with _lock: around cache check and write."
497
+ ),
498
+ _issue(
499
+ 9, "async.py", "bug", "medium",
500
+ "timeout=5 is wrong type for aiohttp; requires aiohttp.ClientTimeout(total=5). "
501
+ "Passing an int raises TypeError at runtime.",
502
+ "Use: timeout=aiohttp.ClientTimeout(total=5)"
503
+ ),
504
+ _issue(
505
+ 22, "async.py", "bug", "high",
506
+ "ClientSession created but never closed, causing resource leak. "
507
+ "Use: async with aiohttp.ClientSession() as session: and pass it in.",
508
+ "Replace with: async with aiohttp.ClientSession() as session:"
509
+ ),
510
+ _issue(
511
+ 24, "async.py", "performance", "high",
512
+ "Sequential for loop with await serializes all requests. "
513
+ "Use asyncio.gather(*[get_user(uid, session) for uid in user_ids]) "
514
+ "for true concurrency.",
515
+ "Replace loop with: results = await asyncio.gather(*[get_user(uid, session) for uid in user_ids])"
516
+ ),
517
+ _issue(
518
+ 37, "async.py", "bug", "high",
519
+ "Off-by-one: range(retries) yields 0..retries-1, so attempt==retries is never true. "
520
+ "Exception is never re-raised. Fix: attempt == retries - 1.",
521
+ "Change: if attempt == retries - 1: raise"
522
+ ),
523
+ _issue(
524
+ 48, "async.py", "performance", "medium",
525
+ "Tasks awaited sequentially instead of concurrently. "
526
+ "Use asyncio.gather(*tasks). Also self.results accumulates across multiple run_all calls.",
527
+ "Replace loop with: self.results.extend(await asyncio.gather(*tasks))"
528
+ ),
529
+ ],
530
+ "max_steps": 20,
531
+ "hints": [
532
+ "Check all places where ClientSession is created — are they properly closed?",
533
+ "Look for sequential awaits inside loops where gather() would be more appropriate.",
534
+ "The retry function has an off-by-one error in its condition.",
535
+ ],
536
+ }
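The ground-truth fixes for `process_users` ask for a single shared session plus `asyncio.gather`. A runnable sketch of that corrected shape, with `asyncio.sleep` standing in for the aiohttp fetch so the example needs no network (in the real fix, one `async with aiohttp.ClientSession()` would wrap the gather):

```python
import asyncio
from typing import List

async def get_user(user_id: int) -> dict:
    # Stand-in for the aiohttp fetch; sleep simulates network latency.
    await asyncio.sleep(0.01)
    return {"id": user_id}

async def process_users(user_ids: List[int]) -> List[dict]:
    # Concurrent fan-out instead of a sequential `for ... await` loop.
    return await asyncio.gather(*[get_user(uid) for uid in user_ids])

results = asyncio.run(process_users([1, 2, 3]))
print(results)  # → [{'id': 1}, {'id': 2}, {'id': 3}]
```

`gather` preserves input order, so the results list lines up with `user_ids` exactly as the sequential loop did.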
537
+
538
+
539
+ _PIPELINE_CODE = """\
540
+ import csv
541
+ import json
542
+ import hashlib
543
+ import sqlite3
544
+ from typing import List, Dict, Optional
545
+
546
+
547
+ def init_db(path: str) -> sqlite3.Connection:
548
+ conn = sqlite3.connect(path)
549
+ conn.execute(
550
+ "CREATE TABLE IF NOT EXISTS records "
551
+ "(id INTEGER PRIMARY KEY AUTOINCREMENT, username TEXT NOT NULL, "
552
+ "email TEXT NOT NULL, password_hash TEXT, score REAL DEFAULT 0)"
553
+ )
554
+ conn.commit()
555
+ return conn
556
+
557
+
558
+ def hash_password(password: str) -> str:
559
+ return hashlib.md5(password.encode()).hexdigest()
560
+
561
+
562
+ def insert_record(conn: sqlite3.Connection, username: str,
563
+ email: str, password: str, score: float) -> None:
564
+ pwd = hash_password(password)
565
+ conn.execute(
566
+ f"INSERT INTO records (username, email, password_hash, score) "
567
+ f"VALUES ('{username}', '{email}', '{pwd}', {score})"
568
+ )
569
+ conn.commit()
570
+
571
+
572
+ def search_records(conn: sqlite3.Connection, query: str) -> List[Dict]:
573
+ cursor = conn.execute(
574
+ f"SELECT id, username, email, score FROM records WHERE username LIKE '%{query}%'"
575
+ )
576
+ cols = [d[0] for d in cursor.description]
577
+ return [dict(zip(cols, row)) for row in cursor.fetchall()]
578
+
579
+
580
+ def bulk_load(conn: sqlite3.Connection, filepath: str) -> int:
581
+ count = 0
582
+ with open(filepath, newline='') as f:
583
+ for row in csv.DictReader(f):
584
+ insert_record(conn, row['username'], row['email'],
585
+ row.get('password', ''), float(row.get('score', 0)))
586
+ count += 1
587
+ return count
588
+
589
+
590
+ def export_records(conn: sqlite3.Connection, out_path: str) -> None:
591
+ rows = search_records(conn, '')
592
+ with open(out_path, 'w') as f:
593
+ json.dump(rows, f, indent=2)
594
+
595
+
596
+ def get_top_scores(conn: sqlite3.Connection, limit: int) -> List[Dict]:
597
+ cursor = conn.execute(
598
+ f"SELECT username, score FROM records ORDER BY score DESC LIMIT {limit}"
599
+ )
600
+ return [{'username': r[0], 'score': r[1]} for r in cursor.fetchall()]
601
+ """
602
+
603
+ TASK_DATA_PIPELINE: Dict[str, Any] = {
604
+ "task_id": "data-pipeline",
605
+ "difficulty": "hard",
606
+ "description": (
607
+ "Perform a security and correctness review of this data pipeline module.\n"
608
+ "The module handles user records in SQLite. It contains multiple critical\n"
609
+ "security vulnerabilities, a performance issue, and an error handling gap.\n"
610
+ "Find ALL issues across the file.\n\n"
611
+ "File to review: pipeline.py"
612
+ ),
613
+ "language": "python",
614
+ "code_files": {
615
+ "pipeline.py": _PIPELINE_CODE,
616
+ },
617
+ "ground_truth_issues": [
618
+ _issue(
619
+ 20, "pipeline.py", "security", "high",
620
+ "MD5 is cryptographically broken for password hashing. "
621
+ "Use bcrypt, argon2, or hashlib.pbkdf2_hmac instead.",
622
+ "Use: hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100000)"
623
+ ),
624
+ _issue(
625
+ 27, "pipeline.py", "security", "critical",
626
+ "SQL injection: username, email, and pwd interpolated directly into query string. "
627
+ "Use parameterized queries: conn.execute('INSERT INTO records ... VALUES (?,?,?,?)', "
628
+ "(username, email, pwd, score))",
629
+ "Use: conn.execute('INSERT INTO records (username, email, password_hash, score) VALUES (?,?,?,?)', (username, email, pwd, score))"
630
+ ),
631
+ _issue(
632
+ 35, "pipeline.py", "security", "critical",
633
+ "SQL injection in LIKE clause: user-supplied query interpolated directly. "
634
+ "Use: conn.execute('... WHERE username LIKE ?', (f'%{query}%',))",
635
+ "Use: conn.execute('SELECT ... WHERE username LIKE ?', (f'%{query}%',))"
636
+ ),
637
+ _issue(
638
+ 41, "pipeline.py", "performance", "high",
639
+ "bulk_load commits one transaction per row via insert_record. "
640
+ "Wrap entire loop in with conn: for a single transaction — 10-100x faster for large imports.",
641
+ "Wrap loop body with: with conn: conn.executemany(...)"
642
+ ),
643
+ _issue(
644
+ 46, "pipeline.py", "bug", "medium",
645
+ "float() conversion has no error handling. A single malformed score field "
646
+ "crashes the entire import. Wrap in try/except ValueError.",
647
+ "Use: float(row.get('score', 0) or 0) inside try/except ValueError"
648
+ ),
649
+ _issue(
650
+ 52, "pipeline.py", "security", "high",
651
+ "export_records calls search_records(conn, '') which returns all records including "
652
+ "password_hash field. Strip sensitive fields before export.",
653
+ "Filter out password_hash: rows = [{k: v for k, v in r.items() if k != 'password_hash'} for r in rows]"
654
+ ),
655
+ _issue(
656
+ 59, "pipeline.py", "security", "critical",
657
+ "SQL injection: limit value interpolated into query. Although limit is an int here, "
658
+ "use parameterized query: conn.execute('... LIMIT ?', (limit,))",
659
+ "Use: conn.execute('SELECT username, score FROM records ORDER BY score DESC LIMIT ?', (limit,))"
660
+ ),
661
+ ],
662
+ "max_steps": 25,
663
+ "hints": [
664
+ "Look for every place user-supplied values touch a SQL query string — are they parameterized?",
665
+ "The bulk_load function has both a performance issue and an error handling gap.",
666
+ "Check what fields export_records includes in its output — are any sensitive?",
667
+ ],
668
+ }
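The pipeline task's SQL-injection fixes all reduce to one pattern: bind values with `?` placeholders instead of interpolating them. A sketch against an in-memory SQLite database, reusing the task's `records` schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "username TEXT NOT NULL, email TEXT NOT NULL, "
    "password_hash TEXT, score REAL DEFAULT 0)"
)

# Parameterized INSERT: values are bound by the driver, never spliced into SQL.
conn.execute(
    "INSERT INTO records (username, email, password_hash, score) VALUES (?, ?, ?, ?)",
    ("alice", "a@example.com", "hash", 9.5),
)

# Parameterized LIKE: the wildcard pattern is built in Python, bound as one value.
query = "' OR '1'='1"  # an injection attempt is treated as a literal string
rows = conn.execute(
    "SELECT username FROM records WHERE username LIKE ?", (f"%{query}%",)
).fetchall()
print(rows)  # → [] (the payload matches nothing)
```

Because the whole pattern arrives as a single bound value, the quote characters lose their SQL meaning and the classic `' OR '1'='1` bypass returns no rows.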
669
+
670
+
671
+ _API_SECURITY_CODE = """\
672
+ from fastapi import FastAPI, Depends, HTTPException, Header
673
+ from fastapi.security import HTTPBasic, HTTPBasicCredentials
674
+ import jwt
675
+ import hashlib
676
+ import pickle
677
+ import os
678
+ import sqlite3
679
+
680
+ app = FastAPI()
681
+ security = HTTPBasic()
682
+
683
+ SECRET_KEY = "dev-secret-do-not-use-in-prod"
684
+ ADMIN_TOKEN = "admin-hardcoded-token-123"
685
+
686
+ users_db = {
687
+ "admin": hashlib.md5(b"password123").hexdigest(),
688
+ "user": hashlib.md5(b"user123").hexdigest(),
689
+ }
690
+
691
+
692
+ @app.post("/login")
693
+ def login(credentials: HTTPBasicCredentials = Depends(security)):
694
+ username = credentials.username
695
+ stored = users_db.get(username, "")
696
+ if stored != hashlib.md5(credentials.password.encode()).hexdigest():
697
+ raise HTTPException(status_code=401, detail="Invalid credentials")
698
+ token = jwt.encode({"user": username, "admin": username == "admin"},
699
+ SECRET_KEY, algorithm="HS256")
700
+ return {"token": token}
701
+
702
+
703
+ @app.get("/users/{user_id}")
704
+ def get_user(user_id: str, authorization: str = Header(None)):
705
+ if not authorization:
706
+ raise HTTPException(status_code=401, detail="Missing token")
707
+ payload = jwt.decode(authorization, SECRET_KEY, algorithms=["HS256"])
708
+ conn = sqlite3.connect("app.db")
709
+ cursor = conn.execute(f"SELECT * FROM users WHERE id = '{user_id}'")
710
+ return {"user": cursor.fetchone()}
711
+
712
+
713
+ @app.post("/admin/export")
714
+ def admin_export(authorization: str = Header(None)):
715
+ if authorization != ADMIN_TOKEN:
716
+ raise HTTPException(status_code=403, detail="Forbidden")
717
+ path = os.environ.get("EXPORT_PATH", "/tmp/export")
718
+ os.system(f"mysqldump mydb > {path}/dump.sql")
719
+ return {"status": "export complete", "path": path}
720
+
721
+
722
+ @app.post("/import")
723
+ def import_data(payload: bytes):
724
+ data = pickle.loads(payload)
725
+ return {"records": len(data)}
726
+
727
+
728
+ @app.get("/search")
729
+ def search_users(q: str, limit: int = 100):
730
+ conn = sqlite3.connect("app.db")
731
+ rows = conn.execute(
732
+ f"SELECT id, name, email FROM users WHERE name LIKE '%{q}%' LIMIT {limit}"
733
+ ).fetchall()
734
+ return {"results": rows}
735
+ """
736
+
737
+ TASK_API_SECURITY: Dict[str, Any] = {
738
+ "task_id": "api-security",
739
+ "difficulty": "hard",
740
+ "description": (
741
+ "Perform a security audit on this FastAPI REST API.\n"
742
+ "The service handles user authentication and data operations.\n"
743
+ "It contains multiple critical security flaws across authentication,\n"
744
+ "authorization, injection attacks, and cryptography.\n"
745
+ "Find ALL issues with exact line numbers and severity ratings.\n\n"
746
+ "File to review: api.py"
747
+ ),
748
+ "language": "python",
749
+ "code_files": {
750
+ "api.py": _API_SECURITY_CODE,
751
+ },
752
+ "ground_truth_issues": [
753
+ _issue(
754
+ 12, "api.py", "security", "high",
755
+ "Hardcoded SECRET_KEY in source code. Any developer with repo access can forge "
756
+ "JWT tokens and impersonate any user.",
757
+ "Use: SECRET_KEY = os.environ.get('SECRET_KEY') and rotate it as a secret."
758
+ ),
759
+ _issue(
760
+ 13, "api.py", "security", "high",
761
+ "Hardcoded ADMIN_TOKEN in source code. Static tokens in code are trivially "
762
+ "leaked via version control, logs, or error messages.",
763
+ "Use: ADMIN_TOKEN = os.environ.get('ADMIN_TOKEN') and generate it securely."
764
+ ),
765
+ _issue(
766
+ 16, "api.py", "security", "high",
767
+ "MD5 used for password hashing. MD5 is cryptographically broken; precomputed "
768
+ "rainbow tables can reverse any MD5 hash in seconds.",
769
+ "Use bcrypt, argon2, or hashlib.pbkdf2_hmac with a random salt."
770
+ ),
771
+ _issue(
772
+ 27, "api.py", "security", "medium",
773
+ "JWT token issued without an expiry claim ('exp'). Tokens are valid forever; "
774
+ "a stolen token can never be invalidated without rotating the secret.",
775
+ "Add: {'exp': datetime.utcnow() + timedelta(hours=1)} to the JWT payload."
776
+ ),
777
+ _issue(
778
+ 33, "api.py", "security", "critical",
779
+ "Missing authorization check: any authenticated user can fetch any user_id. "
780
+ "This is an Insecure Direct Object Reference (IDOR) — user A can read user B's data.",
781
+ "Check: if payload.get('user') != user_id and not payload.get('admin'): raise 403."
782
+ ),
783
+ _issue(
784
+ 38, "api.py", "security", "critical",
785
+ "SQL injection: user_id is interpolated directly into the query string. "
786
+ "An attacker can supply user_id = \"' OR '1'='1\" to dump the users table.",
787
+ "Use parameterized query: conn.execute('SELECT * FROM users WHERE id = ?', (user_id,))"
788
+ ),
789
+ _issue(
790
+ 47, "api.py", "security", "critical",
791
+ "Command injection: EXPORT_PATH from environment is interpolated into an "
792
+ "os.system() shell command. A misconfigured env var like '/tmp; rm -rf /' "
793
+ "executes arbitrary commands as the server process.",
794
+ "Use subprocess.run(['mysqldump', 'mydb'], stdout=open(path, 'w'), shell=False)."
795
+ ),
796
+ _issue(
797
+ 53, "api.py", "security", "critical",
798
+ "Unsafe deserialization: pickle.loads() on untrusted user-supplied bytes allows "
799
+ "remote code execution. Any client can craft a pickle payload that runs arbitrary code.",
800
+ "Use json.loads() or a schema-validated format. Never unpickle untrusted data."
801
+ ),
802
+ ],
803
+ "max_steps": 25,
804
+ "hints": [
805
+ "Check every hardcoded string assigned to variables like SECRET_KEY, TOKEN, PASSWORD.",
806
+ "Look at every endpoint: which ones verify the caller's identity vs just authentication?",
807
+ "Find all places user-supplied data touches: SQL queries, shell commands, deserialization.",
808
+ ],
809
+ }
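Several ground-truth fixes above point at `hashlib.pbkdf2_hmac` and constant-time comparison. A self-contained sketch of salted hashing plus verification (the iteration count and the `(salt, digest)` layout are illustrative choices, not part of the task code):

```python
import hashlib
import hmac
import os

def hash_password(password: str):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a random salt."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    # Constant-time compare avoids the timing side channel the graders flag.
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("password123")
print(verify_password("password123", salt, digest))  # → True
print(verify_password("wrong", salt, digest))        # → False
```

Unlike the MD5 lookup in `users_db`, the random per-user salt defeats precomputed rainbow tables, and `compare_digest` keeps comparison time independent of where the digests first differ.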
810
+
811
+
812
+ _JS_CODE = """\
813
+ const express = require('express');
814
+ const jwt = require('jsonwebtoken');
815
+ const { execSync } = require('child_process');
816
+ const path = require('path');
817
+ const fs = require('fs');
818
+ const sqlite3 = require('better-sqlite3');
819
+
820
+ const app = express();
821
+ app.use(express.json());
822
+
823
+ const JWT_SECRET = 'super-secret-key-hardcoded';
824
+ const db = new sqlite3('./data.db');
825
+
826
+ app.post('/login', (req, res) => {
827
+ const { username, password } = req.body;
828
+ const user = db.prepare(`SELECT * FROM users WHERE username = '${username}' AND password = '${password}'`).get();
829
+ if (!user) return res.status(401).json({ error: 'Invalid credentials' });
830
+ const token = jwt.sign({ id: user.id, role: user.role }, JWT_SECRET);
831
+ res.json({ token });
832
+ });
833
+
834
+ app.get('/user/:id', (req, res) => {
835
+ const token = req.headers.authorization;
836
+ const payload = jwt.verify(token, JWT_SECRET);
837
+ const user = db.prepare(`SELECT * FROM users WHERE id = ${req.params.id}`).get();
838
+ res.json(user);
839
+ });
840
+
841
+ app.get('/search', (req, res) => {
842
+ const q = req.query.q;
843
+ res.send(`<h1>Results for: ${q}</h1>`);
844
+ });
845
+
846
+ app.post('/run-report', (req, res) => {
847
+ const { filename } = req.body;
848
+ const output = execSync(`node reports/${filename}`);
849
+ res.send(output.toString());
850
+ });
851
+
852
+ app.get('/files', (req, res) => {
853
+ const name = req.query.name;
854
+ const filePath = path.join(__dirname, 'uploads', name);
855
+ res.send(fs.readFileSync(filePath, 'utf8'));
856
+ });
857
+
858
+ app.post('/template', (req, res) => {
859
+ const { template, data } = req.body;
860
+ const fn = new Function('data', `return \\`${template}\\``);
861
+ res.json({ result: fn(data) });
862
+ });
863
+
864
+ app.listen(3000);
865
+ """
866
+
867
+ TASK_JS_SECURITY: Dict[str, Any] = {
868
+ "task_id": "js-security",
869
+ "difficulty": "hard",
870
+ "description": (
871
+ "Perform a security audit on this Express.js REST API.\n"
872
+ "The service handles authentication and user data operations in Node.js.\n"
873
+ "It contains critical security vulnerabilities common in JavaScript backends.\n"
874
+ "Identify ALL issues with exact line numbers, types, and severity.\n\n"
875
+ "File to review: server.js"
876
+ ),
877
+ "language": "javascript",
878
+ "code_files": {
879
+ "server.js": _JS_CODE,
880
+ },
881
+ "ground_truth_issues": [
882
+ _issue(
883
+ 11, "server.js", "security", "high",
884
+ "Hardcoded JWT secret 'super-secret-key-hardcoded' in source. "
885
+ "Anyone with code access can forge tokens for any user.",
886
+ "Use: const JWT_SECRET = process.env.JWT_SECRET and rotate it as an env secret."
887
+ ),
888
+ _issue(
889
+ 16, "server.js", "security", "critical",
890
+ "SQL injection: username and password are interpolated directly into a template "
891
+ "literal inside prepare(). An attacker can bypass authentication with username = ' OR '1'='1'--.",
892
+ "Use parameterized queries: db.prepare('SELECT * FROM users WHERE username = ? AND password = ?').get(username, password)"
893
+ ),
894
+ _issue(
895
+ 18, "server.js", "security", "medium",
896
+ "JWT issued without expiry ('expiresIn' option missing). Tokens are valid forever; "
897
+ "a stolen token can never be invalidated without rotating the secret.",
898
+ "Add: jwt.sign({ id: user.id, role: user.role }, JWT_SECRET, { expiresIn: '1h' })"
899
+ ),
900
+ _issue(
901
+ 25, "server.js", "security", "critical",
902
+ "Missing authorization + SQL injection: any authenticated user can fetch any "
903
+ "user by changing req.params.id (IDOR). Also id is interpolated directly into SQL.",
904
+ "Check payload.id === req.params.id (or admin role). Use parameterized: db.prepare('SELECT * FROM users WHERE id = ?').get(req.params.id)"
905
+ ),
906
+ _issue(
907
+ 31, "server.js", "security", "high",
908
+ "Cross-site scripting (XSS): user-supplied query parameter q is reflected "
909
+ "directly into HTML response without escaping.",
910
+ "Use a templating engine with auto-escaping, or: res.send(`<h1>Results for: ${escapeHtml(q)}</h1>`)"
911
+ ),
912
+ _issue(
913
+ 36, "server.js", "security", "critical",
914
+ "Command injection: user-supplied filename is passed directly to execSync() "
915
+ "in a shell command. An attacker can supply 'x; rm -rf /' as filename.",
916
+ "Validate filename against a strict allowlist. Use execFileSync(['node', 'reports/' + sanitizedName]) with shell:false."
917
+ ),
918
+ _issue(
919
+ 42, "server.js", "security", "high",
920
+ "Path traversal: user-supplied 'name' is joined to uploads directory with path.join. "
921
+ "An attacker can supply '../../../etc/passwd' to read arbitrary files.",
922
+ "Use: path.resolve(__dirname, 'uploads', path.basename(name)) and validate the result starts with the uploads dir."
923
+ ),
924
+ _issue(
925
+ 48, "server.js", "security", "critical",
926
+ "Unsafe dynamic code execution: new Function() with user-supplied template string "
927
+ "is equivalent to eval(). Any client can execute arbitrary JavaScript on the server.",
928
+ "Never use new Function() or eval() with user input. Use a safe template engine like Handlebars or Mustache."
929
+ ),
930
+ ],
931
+ "max_steps": 25,
932
+ "hints": [
933
+ "Check every place user input (req.body, req.params, req.query) touches a database query, shell command, or HTML response.",
934
+ "Look for hardcoded secrets at the top of the file.",
935
+ "The /template and /run-report endpoints have particularly dangerous patterns.",
936
+ ],
937
+ }
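The path-traversal fix for `/files` is given in JavaScript above; the same containment check translated to Python (the `uploads` directory name is illustrative), using `realpath` plus `commonpath` rather than a string-prefix test, which a sibling directory like `uploads_evil` would fool:

```python
import os
from typing import Optional

UPLOADS = os.path.realpath("uploads")

def safe_path(name: str) -> Optional[str]:
    """Resolve name under UPLOADS; return None if it escapes the directory."""
    candidate = os.path.realpath(os.path.join(UPLOADS, name))
    # A contained path must resolve to somewhere inside the uploads directory.
    if os.path.commonpath([UPLOADS, candidate]) != UPLOADS:
        return None
    return candidate

print(safe_path("../../etc/passwd") is None)  # → True
print(safe_path("report.txt") is not None)    # → True
```

`realpath` collapses `..` segments before the check, so the traversal payload resolves outside `UPLOADS` and is rejected.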
938
+
939
+
940
  ALL_TASKS: Dict[str, Dict[str, Any]] = {
941
  TASK_BUG_DETECTION["task_id"]: TASK_BUG_DETECTION,
942
  TASK_SECURITY_AUDIT["task_id"]: TASK_SECURITY_AUDIT,
943
  TASK_COMPREHENSIVE["task_id"]: TASK_COMPREHENSIVE,
944
+ TASK_ASYNC_REVIEW["task_id"]: TASK_ASYNC_REVIEW,
945
+ TASK_DATA_PIPELINE["task_id"]: TASK_DATA_PIPELINE,
946
+ TASK_API_SECURITY["task_id"]: TASK_API_SECURITY,
947
+ TASK_JS_SECURITY["task_id"]: TASK_JS_SECURITY,
948
  }
949
 
950
  TASK_IDS: List[str] = list(ALL_TASKS.keys())
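With seven tasks now in the registry, a quick consistency check over the task shape is cheap insurance against ground-truth line numbers drifting out of sync with the embedded code. A sketch under assumed field names (the toy task mirrors the structure above; issues are shown as plain dicts, not the repo's `_issue` helper):

```python
from typing import Any, Dict

# Toy task in the same shape as the definitions above (field names illustrative).
TASK: Dict[str, Any] = {
    "task_id": "demo",
    "code_files": {"demo.py": "import os\n\nos.system('ls')\n"},
    "ground_truth_issues": [{"line": 3, "filename": "demo.py"}],
}

def check_task(task: Dict[str, Any]) -> bool:
    """Every ground-truth issue must point at an existing line of an existing file."""
    for issue in task["ground_truth_issues"]:
        code = task["code_files"].get(issue["filename"])
        if code is None:
            return False
        if not (1 <= issue["line"] <= len(code.splitlines())):
            return False
    return True

print(check_task(TASK))  # → True
```

Running such a check in the test suite would catch an edit to a `_CODE` string that shifts line numbers without updating the corresponding `_issue` entries.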
tests/test_environment.py CHANGED
41
  return env
42
 
43
 
44
+ @pytest.fixture
45
+ def env_async(env):
46
+ env.reset(task_id="async-review")
47
+ return env
48
+
49
+
50
+ @pytest.fixture
51
+ def env_pipeline(env):
52
+ env.reset(task_id="data-pipeline")
53
+ return env
54
+
55
+
56
  # ---------------------------------------------------------------------------
57
  # reset() tests
58
  # ---------------------------------------------------------------------------
 
118
  assert obs.flagged_issues == []
119
  assert obs.step_count == 0
120
 
121
+ def test_reset_has_code_metadata(self, env):
122
+ """Reset observation should include code_metadata."""
123
+ obs = env.reset(task_id="bug-detection")
124
+ assert isinstance(obs.code_metadata, dict)
125
+ assert "total_lines" in obs.code_metadata
126
+ assert "num_functions" in obs.code_metadata
127
+ assert "complexity_estimate" in obs.code_metadata
128
+
129
+ def test_reset_code_metadata_has_issue_categories(self, env):
130
+ """code_metadata should list the issue categories present in ground truth."""
131
+ obs = env.reset(task_id="bug-detection")
132
+ assert "issue_categories" in obs.code_metadata
133
+ # bug-detection has only bug type issues
134
+ assert "bug" in obs.code_metadata["issue_categories"]
135
+
136
+ def test_reset_has_empty_progress(self, env):
137
+ """Reset observation progress may be empty or absent (populated on step)."""
138
+ obs = env.reset(task_id="bug-detection")
139
+ assert isinstance(obs.progress, dict)
140
+
141
+ def test_reset_has_empty_reward_breakdown(self, env):
142
+ obs = env.reset(task_id="bug-detection")
143
+ assert isinstance(obs.reward_breakdown, dict)
144
+
145
+ def test_reset_async_task(self, env):
146
+ obs = env.reset(task_id="async-review")
147
+ assert obs.task_id == "async-review"
148
+ assert "async.py" in obs.code_files
149
+
150
+ def test_reset_pipeline_task(self, env):
151
+ obs = env.reset(task_id="data-pipeline")
152
+ assert obs.task_id == "data-pipeline"
153
+ assert "pipeline.py" in obs.code_files
154
+
155
 
156
  # ---------------------------------------------------------------------------
157
  # step() — flag_issue tests
 
213
  obs = env_bug.state
214
  assert len(obs.flagged_issues) == 3
215
 
216
+ def test_flag_has_reward_breakdown(self, env_bug):
217
+ """Every step should have a reward_breakdown dict."""
218
+ obs = env_bug.step(ReviewAction(
219
+ action_type="flag_issue", line_number=6, filename="utils.py",
220
+ issue_type="bug", severity="high", description="test"
221
+ ))
222
+ assert isinstance(obs.reward_breakdown, dict)
223
+ assert len(obs.reward_breakdown) > 0
224
+
225
+ def test_flag_has_progress(self, env_bug):
226
+ """Every step should have a progress dict with required keys."""
227
+ obs = env_bug.step(ReviewAction(
228
+ action_type="flag_issue", line_number=6, filename="utils.py",
229
+ issue_type="bug", severity="high", description="test"
230
+ ))
231
+ assert isinstance(obs.progress, dict)
232
+ for key in ("precision", "recall", "f1", "true_positives", "steps_remaining"):
233
+ assert key in obs.progress, f"Missing key: {key}"
234
+
235
+ def test_flag_has_flagged_summary(self, env_bug):
236
+ """Every step should have a flagged_summary dict."""
237
+ obs = env_bug.step(ReviewAction(
238
+ action_type="flag_issue", line_number=6, filename="utils.py",
239
+ issue_type="bug", severity="high", description="test"
240
+ ))
241
+ assert isinstance(obs.flagged_summary, dict)
242
+ assert "total_flagged" in obs.flagged_summary
243
+ assert "correct" in obs.flagged_summary
244
+ assert "incorrect" in obs.flagged_summary
245
+ assert "near_misses" in obs.flagged_summary
246
+
247
+
248
+ # ---------------------------------------------------------------------------
249
+ # Near-miss tests
250
+ # ---------------------------------------------------------------------------
251
+
252
+ class TestNearMiss:
253
+ def test_near_miss_gives_partial_credit(self, env_bug):
254
+        """A flag within 3-5 lines of a GT issue should give +0.03 not -0.05."""
+        # GT issue is at line 6 (off-by-one), so line 10 is 4 away = near miss
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=10, filename="utils.py",
+            issue_type="bug", severity="high", description="near miss test"
+        ))
+        # Near miss gives +0.03
+        assert obs.reward is not None and obs.reward > 0, (
+            f"Expected near-miss +0.03 but got {obs.reward}"
+        )
+        assert obs.reward == pytest.approx(0.03, abs=0.01)
+
+    def test_near_miss_counted_in_summary(self, env_bug):
+        """Near-miss flags should appear in flagged_summary.near_misses."""
+        # Line 10 is 4 lines from GT at line 6 → near miss
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=10, filename="utils.py",
+            issue_type="bug", severity="high", description="near miss"
+        ))
+        assert obs.flagged_summary.get("near_misses", 0) >= 1
+
+    def test_true_positive_not_counted_as_near_miss(self, env_bug):
+        """An exact TP should not be counted as a near miss."""
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="exact match"
+        ))
+        assert obs.flagged_summary.get("correct", 0) >= 1
+        assert obs.flagged_summary.get("near_misses", 0) == 0
+
+
+# ---------------------------------------------------------------------------
+# Confidence field tests
+# ---------------------------------------------------------------------------
+
+class TestConfidenceField:
+    def test_action_with_confidence(self, env_bug):
+        """ReviewAction should accept a confidence field."""
+        action = ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="test",
+            confidence=0.9
+        )
+        assert action.confidence == 0.9
+
+    def test_high_confidence_tp_gets_bonus(self, env_bug):
+        """High confidence + TP should give more than base 0.10."""
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="test",
+            confidence=0.9
+        ))
+        assert obs.reward is not None and obs.reward > 0.10
+
+    def test_high_confidence_fp_gets_extra_penalty(self, env_bug):
+        """High confidence + FP should give more penalty than -0.05."""
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=100, filename="utils.py",
+            issue_type="bug", severity="low", description="wrong",
+            confidence=0.9
+        ))
+        assert obs.reward is not None and obs.reward < -0.05
+
+    def test_low_confidence_tp_base_reward_only(self, env_bug):
+        """Low confidence + TP should give exactly base 0.10 (no bonus)."""
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="test",
+            confidence=0.5
+        ))
+        assert obs.reward is not None
+        # Should be 0.10 base + possible temporal bonus but no confidence bonus
+        assert obs.reward >= 0.10
+
+    def test_no_confidence_field_is_none(self):
+        """ReviewAction without confidence defaults to None."""
+        action = ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+        )
+        assert action.confidence is None
+
+    def test_confidence_in_action_to_dict(self):
+        """confidence should round-trip through to_dict/from_dict."""
+        action = ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            confidence=0.75
+        )
+        d = action.to_dict()
+        assert d["confidence"] == 0.75
+        action2 = ReviewAction.from_dict(d)
+        assert action2.confidence == 0.75
+
+    def test_related_lines_field(self):
+        """ReviewAction should accept a related_lines field."""
+        action = ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            related_lines=[6, 7, 8]
+        )
+        assert action.related_lines == [6, 7, 8]
+        d = action.to_dict()
+        assert d["related_lines"] == [6, 7, 8]
+        action2 = ReviewAction.from_dict(d)
+        assert action2.related_lines == [6, 7, 8]
+
 
 # ---------------------------------------------------------------------------
 # step() — clear_flag tests

             break
 
         assert obs.done is True
+
+
+# ---------------------------------------------------------------------------
+# New task tests
+# ---------------------------------------------------------------------------
+
+class TestNewTasks:
+    def test_async_review_task_exists(self, env):
+        obs = env.reset(task_id="async-review")
+        assert obs.task_id == "async-review"
+        assert obs.done is False
+
+    def test_async_review_has_correct_issue_count(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["async-review"]
+        assert len(task["ground_truth_issues"]) == 6
+
+    def test_async_review_has_async_py(self, env):
+        obs = env.reset(task_id="async-review")
+        assert "async.py" in obs.code_files
+        code = obs.code_files["async.py"]
+        assert "asyncio" in code
+        assert "aiohttp" in code
+
+    def test_async_review_max_steps(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["async-review"]
+        assert task["max_steps"] == 20
+
+    def test_data_pipeline_task_exists(self, env):
+        obs = env.reset(task_id="data-pipeline")
+        assert obs.task_id == "data-pipeline"
+        assert obs.done is False
+
+    def test_data_pipeline_has_correct_issue_count(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["data-pipeline"]
+        assert len(task["ground_truth_issues"]) == 7
+
+    def test_data_pipeline_has_pipeline_py(self, env):
+        obs = env.reset(task_id="data-pipeline")
+        assert "pipeline.py" in obs.code_files
+        code = obs.code_files["pipeline.py"]
+        assert "sqlite3" in code
+        assert "hashlib" in code
+
+    def test_data_pipeline_max_steps(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["data-pipeline"]
+        assert task["max_steps"] == 25
+
+    def test_task_count(self):
+        from tasks.data import TASK_IDS
+        assert len(TASK_IDS) >= 6
+
+    def test_async_review_correct_tp_reward(self, env_async):
+        """Flagging a known issue in async-review should give positive reward."""
+        obs = env_async.step(ReviewAction(
+            action_type="flag_issue", line_number=22, filename="async.py",
+            issue_type="bug", severity="high",
+            description="ClientSession not closed"
+        ))
+        assert obs.reward is not None and obs.reward > 0
+
+    def test_data_pipeline_correct_tp_reward(self, env_pipeline):
+        """Flagging a known SQL injection in pipeline.py should give positive reward."""
+        obs = env_pipeline.step(ReviewAction(
+            action_type="flag_issue", line_number=27, filename="pipeline.py",
+            issue_type="security", severity="critical",
+            description="SQL injection"
+        ))
+        assert obs.reward is not None and obs.reward > 0
+
+    def test_all_tasks_have_hints(self):
+        from tasks.data import ALL_TASKS
+        for task_id, task in ALL_TASKS.items():
+            assert "hints" in task, f"Task {task_id} missing hints"
+            assert len(task["hints"]) >= 3, f"Task {task_id} has fewer than 3 hints"
+
+
+# ---------------------------------------------------------------------------
+# Observation serialization
+# ---------------------------------------------------------------------------
+
+class TestObservationSerialization:
+    def test_reset_obs_to_dict_has_new_fields(self, env):
+        """to_dict() should include all new fields."""
+        obs = env.reset(task_id="bug-detection")
+        d = obs.to_dict()
+        assert "reward_breakdown" in d
+        assert "progress" in d
+        assert "flagged_summary" in d
+        assert "code_metadata" in d
+
+    def test_obs_from_dict_handles_missing_new_fields(self):
+        """from_dict() should handle missing new fields gracefully."""
+        d = {
+            "task_id": "bug-detection",
+            "task_description": "test",
+            "code_files": {},
+            "language": "python",
+            "flagged_issues": [],
+            "step_count": 0,
+            "max_steps": 15,
+            "hints_remaining": 3,
+            "feedback": "",
+            "current_score": 0.0,
+            "done": False,
+            "reward": None,
+            # No reward_breakdown, progress, flagged_summary, code_metadata
+        }
+        obs = ReviewObservation.from_dict(d)
+        assert obs.reward_breakdown == {}
+        assert obs.progress == {}
+        assert obs.flagged_summary == {}
+        assert obs.code_metadata == {}
+
+    def test_step_obs_to_dict_round_trip(self, env_bug):
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="test"
+        ))
+        d = obs.to_dict()
+        obs2 = ReviewObservation.from_dict(d)
+        assert obs2.task_id == obs.task_id
+        assert obs2.step_count == obs.step_count
+        assert isinstance(obs2.reward_breakdown, dict)
+        assert isinstance(obs2.progress, dict)
+        assert isinstance(obs2.flagged_summary, dict)
+
+
+# ---------------------------------------------------------------------------
+# Severity exact match bonus
+# ---------------------------------------------------------------------------
+
+class TestSeverityBonus:
+    def test_severity_match_gives_extra_reward(self, env_bug):
+        """Exact severity match should give more than a severity mismatch."""
+        # GT at line 6 is "high"
+        obs_match = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="exact severity"
+        ))
+        env_bug.reset(task_id="bug-detection")
+        obs_wrong = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="low", description="wrong severity"
+        ))
+        assert obs_match.reward > obs_wrong.reward
+
+    def test_severity_bonus_in_reward_breakdown(self, env_bug):
+        """reward_breakdown should include 'severity_exact' key on correct severity."""
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="correct severity"
+        ))
+        assert "severity_exact" in obs.reward_breakdown
+
+    def test_severity_mismatch_no_severity_bonus(self, env_bug):
+        """Wrong severity should not include 'severity_exact' key."""
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="low", description="wrong severity"
+        ))
+        assert "severity_exact" not in obs.reward_breakdown
+
+
+# ---------------------------------------------------------------------------
+# Flood protection (escalating FP penalty)
+# ---------------------------------------------------------------------------
+
+class TestFloodProtection:
+    def test_many_fps_escalate_penalty(self, env_bug):
+        """After 3 false positives, each subsequent FP should have larger penalty."""
+        rewards = []
+        for line in [101, 102, 103, 104, 105]:
+            obs = env_bug.step(ReviewAction(
+                action_type="flag_issue", line_number=line, filename="utils.py",
+                issue_type="bug", severity="low", description="fp"
+            ))
+            if obs.reward is not None and obs.reward < 0:
+                rewards.append(obs.reward)
+
+        # The 4th and 5th FPs should have larger absolute penalty
+        if len(rewards) >= 4:
+            assert abs(rewards[-1]) >= abs(rewards[0]), (
+                f"Expected escalating penalty but got {rewards}"
+            )
+
+    def test_fp_below_threshold_normal_penalty(self, env_bug):
+        """First FP should get standard -0.05 penalty."""
+        obs = env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=200, filename="utils.py",
+            issue_type="bug", severity="low", description="first fp"
+        ))
+        assert obs.reward is not None
+        assert obs.reward == pytest.approx(-0.05, abs=0.01)
+
+    def test_clearing_fp_reduces_penalty_track(self, env_bug):
+        """Clearing a FP should give positive reward."""
+        env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=200, filename="utils.py",
+            issue_type="bug", severity="low", description="fp"
+        ))
+        obs = env_bug.step(ReviewAction(
+            action_type="clear_flag", line_number=200, filename="utils.py",
+        ))
+        assert obs.reward is not None and obs.reward > 0
+
+
+# ---------------------------------------------------------------------------
+# Unfound issue types in progress
+# ---------------------------------------------------------------------------
+
+class TestUnfoundIssueTypes:
+    def test_unfound_types_present_at_start(self, env_bug):
+        """Before flagging anything, all GT issue types should be in unfound_issue_types."""
+        obs = env_bug.step(ReviewAction(action_type="request_hint"))
+        unfound = obs.progress.get("unfound_issue_types", [])
+        assert "bug" in unfound
+
+    def test_unfound_types_shrinks_when_issue_found(self, env_bug):
+        """Finding a bug should remove 'bug' from unfound_issue_types."""
+        obs_before = env_bug.step(ReviewAction(action_type="request_hint"))
+        unfound_before = set(obs_before.progress.get("unfound_issue_types", []))
+
+        env_bug.step(ReviewAction(
+            action_type="flag_issue", line_number=6, filename="utils.py",
+            issue_type="bug", severity="high", description="found a bug"
+        ))
+        obs_after = env_bug.step(ReviewAction(action_type="request_hint"))
+        unfound_after = set(obs_after.progress.get("unfound_issue_types", []))
+
+        # bug should now be gone from unfound
+        assert "bug" not in unfound_after or len(unfound_after) < len(unfound_before)
+
+    def test_unfound_types_is_list(self, env_bug):
+        obs = env_bug.step(ReviewAction(action_type="request_hint"))
+        assert isinstance(obs.progress.get("unfound_issue_types", []), list)
+
+
+# ---------------------------------------------------------------------------
+# API security task
+# ---------------------------------------------------------------------------
+
+class TestApiSecurityTask:
+    def test_api_security_task_exists(self, env):
+        obs = env.reset(task_id="api-security")
+        assert obs.task_id == "api-security"
+        assert obs.done is False
+
+    def test_api_security_has_api_py(self, env):
+        obs = env.reset(task_id="api-security")
+        assert "api.py" in obs.code_files
+
+    def test_api_security_has_8_issues(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        assert len(task["ground_truth_issues"]) == 8
+
+    def test_api_security_has_critical_issues(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        severities = {i["severity"] for i in task["ground_truth_issues"]}
+        assert "critical" in severities
+
+    def test_api_security_tp_reward(self, env):
+        env.reset(task_id="api-security")
+        obs = env.step(ReviewAction(
+            action_type="flag_issue", line_number=38, filename="api.py",
+            issue_type="security", severity="critical",
+            description="SQL injection via f-string"
+        ))
+        assert obs.reward is not None and obs.reward > 0
+
+    def test_api_security_keyword_baseline_finds_issues(self):
+        from tasks.data import ALL_TASKS
+        from server.graders import run_keyword_baseline
+        task = ALL_TASKS["api-security"]
+        findings = run_keyword_baseline(task)
+        assert len(findings) >= 2
+
+    def test_api_security_difficulty_hard(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        assert task["difficulty"] == "hard"
+
+
+# ---------------------------------------------------------------------------
+# Auto-end gives full score (not 0.5x)
+# ---------------------------------------------------------------------------
+
+class TestAutoEndFullScore:
+    def test_auto_end_uses_full_grade(self, env_bug):
+        """Auto-end should give full grade_episode score, not a penalized value."""
+        # Flag all 3 correct bugs first
+        for line, sev in [(6, "high"), (13, "medium"), (33, "low")]:
+            env_bug.step(ReviewAction(
+                action_type="flag_issue", line_number=line, filename="utils.py",
+                issue_type="bug", severity=sev, description=f"bug at {line}"
+            ))
+        # Exhaust remaining steps with hints
+        max_steps = 15
+        for _ in range(max_steps - 3 - 1):
+            obs = env_bug.step(ReviewAction(action_type="request_hint"))
+            if obs.done:
+                break
+
+        obs = env_bug.step(ReviewAction(action_type="request_hint"))
+        if obs.done and obs.reward_breakdown.get("auto_end_grade") is not None:
+            # If auto-ended, score should be >= 0.7 since all 3 bugs found
+            assert obs.reward >= 0.7, f"Auto-end gave {obs.reward} instead of full grade"
+
+
+# ---------------------------------------------------------------------------
+# Function ranges in code_metadata
+# ---------------------------------------------------------------------------
+
+class TestFunctionRanges:
+    def test_reset_has_function_ranges(self, env):
+        obs = env.reset(task_id="bug-detection")
+        assert "function_ranges" in obs.code_metadata
+
+    def test_function_ranges_is_list(self, env):
+        obs = env.reset(task_id="bug-detection")
+        assert isinstance(obs.code_metadata["function_ranges"], list)
+
+    def test_function_ranges_have_required_fields(self, env):
+        obs = env.reset(task_id="bug-detection")
+        for fr in obs.code_metadata["function_ranges"]:
+            assert "name" in fr
+            assert "file" in fr
+            assert "start" in fr
+            assert "end" in fr
+
+    def test_function_ranges_nonempty_for_python(self, env):
+        obs = env.reset(task_id="bug-detection")
+        assert len(obs.code_metadata["function_ranges"]) > 0
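The per-flag reward constants these tests pin down (+0.10 for an exact true positive, +0.03 for a near miss within 3-5 lines, -0.05 for a false positive, with an escalating flood penalty after repeated FPs) can be summarized in a minimal sketch. This is an illustration of the shaping the tests assert, not the environment's actual implementation; `FLOOD_MULTIPLIER` and the helper name `flag_reward` are hypothetical, and the real environment layers further temporal and confidence terms on top.

```python
# Sketch of the per-flag reward shaping exercised by the tests above.
# Constants come from the test expectations; the escalation factor is assumed.
TP_REWARD = 0.10
NEAR_MISS_REWARD = 0.03
FP_PENALTY = -0.05
FLOOD_THRESHOLD = 3      # false positives tolerated before escalation
FLOOD_MULTIPLIER = 1.5   # hypothetical escalation factor

def flag_reward(distance, prior_fps, line_tolerance=2, near_tolerance=5):
    """Reward for one flag_issue action.

    distance:  abs(line gap) to the nearest same-file ground-truth issue,
               or None if the file has no GT issue.
    prior_fps: false positives already flagged this episode.
    """
    if distance is not None and distance <= line_tolerance:
        return TP_REWARD          # exact match (±2 lines)
    if distance is not None and distance <= near_tolerance:
        return NEAR_MISS_REWARD   # near miss (3-5 lines away)
    penalty = FP_PENALTY
    if prior_fps >= FLOOD_THRESHOLD:
        penalty *= FLOOD_MULTIPLIER   # flood protection kicks in
    return penalty

assert flag_reward(0, 0) == 0.10       # exact true positive
assert flag_reward(4, 0) == 0.03       # near miss
assert flag_reward(50, 0) == -0.05     # first false positive
assert abs(flag_reward(50, 4)) > 0.05  # escalated penalty after flooding
```

Keeping the near-miss band strictly positive while false positives stay negative is what makes the shaping informative without rewarding indiscriminate flagging.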
tests/test_graders.py CHANGED
@@ -7,7 +7,11 @@ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
7
 
8
  import pytest
9
  from models import Issue
10
- from server.graders import grade_episode, match_issue, run_keyword_baseline
 
 
 
 
11
  from tasks.data import ALL_TASKS, TASK_IDS
12
 
13
 
@@ -56,6 +60,231 @@ class TestMatchIssue:
56
  gt = _issue(6, "utils.py", "bug", "high")
57
  assert match_issue(f, gt) is False
58
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
59
 
60
  # ---------------------------------------------------------------------------
61
  # grade_episode()
@@ -177,6 +406,23 @@ class TestKeywordBaseline:
177
  if task_id == "security-audit":
178
  assert score > 0.0, f"Heuristic found nothing in {task_id}"
179
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
180
 
181
  # ---------------------------------------------------------------------------
182
  # Ground truth sanity checks
@@ -213,3 +459,159 @@ class TestGroundTruth:
213
  files = {i["filename"] for i in task["ground_truth_issues"]}
214
  assert "views.py" in files
215
  assert "models.py" in files
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7
 
8
  import pytest
9
  from models import Issue
10
+ from server.graders import (
11
+ grade_episode, match_issue, run_keyword_baseline,
12
+ match_quality, compute_code_metadata, grade_episode_detailed,
13
+ NEAR_TOLERANCE,
14
+ )
15
  from tasks.data import ALL_TASKS, TASK_IDS
16
 
17
 
 
60
  gt = _issue(6, "utils.py", "bug", "high")
61
  assert match_issue(f, gt) is False
62
 
63
+ def test_near_tolerance_param_accepted(self):
64
+ """match_issue should accept near_tolerance param without error."""
65
+ f = _issue(6, "utils.py", "bug", "high")
66
+ gt = _issue(6, "utils.py", "bug", "high")
67
+ result = match_issue(f, gt, line_tolerance=2, near_tolerance=5)
68
+ assert result is True
69
+
70
+
71
+ # ---------------------------------------------------------------------------
72
+ # match_quality()
73
+ # ---------------------------------------------------------------------------
74
+
75
+ class TestMatchQuality:
76
+ def test_exact_match_within_2_lines(self):
77
+ f = _issue(7, "utils.py", "bug", "high")
78
+ gt = _issue(6, "utils.py", "bug", "high")
79
+ assert match_quality(f, gt) == "exact"
80
+
81
+ def test_near_match_3_to_5_lines(self):
82
+ # 4 lines away from GT at 6 → near
83
+ f = _issue(10, "utils.py", "bug", "high")
84
+ gt = _issue(6, "utils.py", "bug", "high")
85
+ assert match_quality(f, gt) == "near"
86
+
87
+ def test_near_match_exactly_3_lines(self):
88
+ f = _issue(9, "utils.py", "bug", "high")
89
+ gt = _issue(6, "utils.py", "bug", "high")
90
+ assert match_quality(f, gt) == "near"
91
+
92
+ def test_near_match_exactly_5_lines(self):
93
+ f = _issue(11, "utils.py", "bug", "high")
94
+ gt = _issue(6, "utils.py", "bug", "high")
95
+ assert match_quality(f, gt) == "near"
96
+
97
+ def test_no_match_beyond_5_lines(self):
98
+ f = _issue(12, "utils.py", "bug", "high")
99
+ gt = _issue(6, "utils.py", "bug", "high")
100
+ assert match_quality(f, gt) == "none"
101
+
102
+ def test_no_match_wrong_file(self):
103
+ f = _issue(6, "other.py", "bug", "high")
104
+ gt = _issue(6, "utils.py", "bug", "high")
105
+ assert match_quality(f, gt) == "none"
106
+
107
+ def test_near_ignores_type_difference(self):
108
+ """Near match checks same file + line range, ignores type."""
109
+ f = _issue(10, "utils.py", "performance", "high")
110
+ gt = _issue(6, "utils.py", "bug", "high")
111
+ # 4 lines away → near
112
+ assert match_quality(f, gt) == "near"
113
+
114
+ def test_near_tolerance_constant(self):
115
+ assert NEAR_TOLERANCE == 5
116
+
117
+
118
+ # ---------------------------------------------------------------------------
119
+ # compute_code_metadata()
120
+ # ---------------------------------------------------------------------------
121
+
122
+ class TestComputeCodeMetadata:
123
+ def test_returns_dict(self):
124
+ code = {"test.py": "def foo(): pass\n"}
125
+ result = compute_code_metadata(code)
126
+ assert isinstance(result, dict)
127
+
128
+ def test_total_lines(self):
129
+ code = {"test.py": "line1\nline2\nline3\n"}
130
+ result = compute_code_metadata(code)
131
+ assert result["total_lines"] == 3
132
+
133
+ def test_num_functions(self):
134
+ code = {"test.py": "def foo():\n pass\n\ndef bar():\n pass\n"}
135
+ result = compute_code_metadata(code)
136
+ assert result["num_functions"] == 2
137
+
138
+ def test_function_names(self):
139
+ code = {"test.py": "def foo():\n pass\n\ndef bar():\n pass\n"}
140
+ result = compute_code_metadata(code)
141
+ assert "foo" in result["function_names"]
142
+ assert "bar" in result["function_names"]
143
+
144
+ def test_num_classes(self):
145
+ code = {"test.py": "class Foo:\n pass\n\nclass Bar:\n pass\n"}
146
+ result = compute_code_metadata(code)
147
+ assert result["num_classes"] == 2
148
+
149
+ def test_class_names(self):
150
+ code = {"test.py": "class Foo:\n pass\n"}
151
+ result = compute_code_metadata(code)
152
+ assert "Foo" in result["class_names"]
153
+
154
+ def test_imports(self):
155
+ code = {"test.py": "import os\nimport sys\nfrom typing import List\n"}
156
+ result = compute_code_metadata(code)
157
+ assert "os" in result["imports"]
158
+ assert "sys" in result["imports"]
159
+ assert "typing" in result["imports"]
160
+
161
+ def test_complexity_low(self):
162
+ code = {"test.py": "def foo():\n return 1\n"}
163
+ result = compute_code_metadata(code)
164
+ assert result["complexity_estimate"] == "low"
165
+
166
+ def test_complexity_medium(self):
167
+ # 6-15 branches — each if is top-level so indent is fine
168
+ lines = ["def foo(x):"]
169
+ for i in range(8):
170
+ lines.append(f" if x > {i}:")
171
+ lines.append(" pass")
172
+ code = {"test.py": "\n".join(lines) + "\n"}
173
+ result = compute_code_metadata(code)
174
+ assert result["complexity_estimate"] in ("medium", "high")
175
+
176
+ def test_complexity_high(self):
177
+ # 16+ branches
178
+ lines = ["def foo(x):"]
179
+ for i in range(20):
180
+ lines.append(f" if x > {i}:")
181
+ lines.append(" pass")
182
+ code = {"test.py": "\n".join(lines) + "\n"}
183
+ result = compute_code_metadata(code)
184
+ assert result["complexity_estimate"] == "high"
185
+
186
+ def test_issue_categories_passed_through(self):
187
+ code = {"test.py": "x = 1\n"}
188
+ result = compute_code_metadata(code, issue_categories=["bug", "security", "bug"])
189
+ # Should deduplicate
190
+ cats = result["issue_categories"]
191
+ assert "bug" in cats
192
+ assert "security" in cats
193
+
194
+ def test_syntax_error_no_crash(self):
195
+ """Non-parseable code should not raise."""
196
+ code = {"bad.py": "this is not valid python !!!\n def broken("}
197
+ result = compute_code_metadata(code)
198
+ assert "total_lines" in result
199
+ assert result["total_lines"] >= 1
200
+
201
+ def test_multi_file(self):
202
+ code = {
203
+ "a.py": "def foo():\n pass\n",
204
+ "b.py": "def bar():\n pass\n",
205
+ }
206
+ result = compute_code_metadata(code)
207
+ assert result["num_functions"] == 2
208
+ assert result["total_lines"] == 4
209
+
210
+ def test_utils_task_metadata(self):
211
+ from tasks.data import ALL_TASKS
212
+ task = ALL_TASKS["bug-detection"]
213
+ result = compute_code_metadata(task["code_files"])
214
+ assert result["total_lines"] > 0
215
+ assert result["num_functions"] >= 4 # utils.py has 4 functions
216
+
217
+
218
+ # ---------------------------------------------------------------------------
219
+ # grade_episode_detailed()
220
+ # ---------------------------------------------------------------------------
221
+
222
+ class TestGradeEpisodeDetailed:
223
+ def test_returns_dict(self):
224
+ gt = [_issue(6, "utils.py", "bug", "high")]
225
+ result = grade_episode_detailed(gt, gt)
226
+ assert isinstance(result, dict)
227
+
228
+ def test_required_keys(self):
229
+ gt = [_issue(6, "utils.py", "bug", "high")]
230
+ result = grade_episode_detailed(gt, gt)
231
+ for key in ("score", "f1", "precision", "recall", "severity_accuracy",
232
+ "true_positives", "false_positives", "false_negatives",
233
+ "near_misses", "per_file"):
234
+ assert key in result, f"Missing key: {key}"
235
+
236
+ def test_perfect_match(self):
237
+ gt = [_issue(6, "utils.py", "bug", "high")]
238
+ result = grade_episode_detailed(gt, gt)
239
+ assert result["true_positives"] == 1
240
+ assert result["false_positives"] == 0
241
+ assert result["false_negatives"] == 0
242
+
243
+ def test_false_positive_counted(self):
244
+ gt = [_issue(6, "utils.py", "bug", "high")]
245
+ flagged = [_issue(6, "utils.py", "bug", "high"),
246
+ _issue(100, "utils.py", "bug", "low")]
247
+ result = grade_episode_detailed(flagged, gt)
248
+ assert result["false_positives"] >= 1
249
+
250
+ def test_near_miss_counted(self):
251
+ gt = [_issue(6, "utils.py", "bug", "high")]
252
+ # 4 lines away = near miss
253
+ flagged = [_issue(10, "utils.py", "bug", "high")]
254
+ result = grade_episode_detailed(flagged, gt)
255
+ assert result["near_misses"] >= 1
256
+
257
+ def test_per_file_breakdown(self):
258
+ gt = [
259
+ _issue(6, "utils.py", "bug", "high"),
260
+ _issue(10, "other.py", "security", "critical"),
261
+ ]
262
+ flagged = [_issue(6, "utils.py", "bug", "high")]
263
+ result = grade_episode_detailed(flagged, gt)
264
+ assert "utils.py" in result["per_file"]
265
+
266
+ def test_score_matches_grade_episode(self):
267
+ """Detailed score should match grade_episode for simple cases."""
268
+ gt = [
269
+ _issue(6, "utils.py", "bug", "high"),
270
+ _issue(13, "utils.py", "bug", "medium"),
271
+ ]
272
+ flagged = [_issue(6, "utils.py", "bug", "high")]
273
+ simple_score = grade_episode(flagged, gt)
274
+ detailed = grade_episode_detailed(flagged, gt)
275
+ # Scores may differ slightly (near_miss handling), but should be close
276
+ assert abs(detailed["score"] - simple_score) <= 0.15
277
+
278
+ def test_empty_ground_truth_perfect(self):
279
+ result = grade_episode_detailed([], [])
280
+ assert result["score"] == 1.0
281
+
282
+ def test_empty_flagged_zero(self):
283
+ gt = [_issue(6, "utils.py")]
284
+ result = grade_episode_detailed([], gt)
285
+ assert result["score"] == 0.0
286
+ assert result["false_negatives"] == 1
287
+
288
 
289
  # ---------------------------------------------------------------------------
290
  # grade_episode()
 
406
  if task_id == "security-audit":
407
  assert score > 0.0, f"Heuristic found nothing in {task_id}"
408
 
409
+ def test_baseline_finds_md5_in_pipeline(self):
410
+ """Keyword baseline should find the MD5 issue in data-pipeline."""
411
+ from tasks.data import ALL_TASKS
412
+ task = ALL_TASKS["data-pipeline"]
413
+ findings = run_keyword_baseline(task)
414
+ md5_finds = [f for f in findings if "md5" in f.description.lower() or "MD5" in f.description]
415
+ assert len(md5_finds) >= 1
416
+
417
+ def test_baseline_finds_sql_injection_in_pipeline(self):
418
+ """Keyword baseline should find SQL injection via f-string in pipeline.py."""
419
+ from tasks.data import ALL_TASKS
420
+ task = ALL_TASKS["data-pipeline"]
421
+ findings = run_keyword_baseline(task)
422
+ sql_finds = [f for f in findings if f.issue_type == "security"
423
+ and "sql" in f.description.lower()]
424
+ assert len(sql_finds) >= 1
425
+
426
 
427
  # ---------------------------------------------------------------------------
428
  # Ground truth sanity checks
 
459
  files = {i["filename"] for i in task["ground_truth_issues"]}
460
  assert "views.py" in files
461
  assert "models.py" in files
462
+
463
+ def test_async_review_has_6_issues(self):
464
+ task = ALL_TASKS["async-review"]
465
+ assert len(task["ground_truth_issues"]) == 6
466
+
467
+ def test_data_pipeline_has_7_issues(self):
468
+ task = ALL_TASKS["data-pipeline"]
469
+ assert len(task["ground_truth_issues"]) == 7
470
+
471
+ def test_async_review_issues_in_async_py(self):
472
+ task = ALL_TASKS["async-review"]
473
+ for issue in task["ground_truth_issues"]:
474
+ assert issue["filename"] == "async.py"
475
+
476
+ def test_data_pipeline_issues_in_pipeline_py(self):
477
+ task = ALL_TASKS["data-pipeline"]
478
+ for issue in task["ground_truth_issues"]:
479
+ assert issue["filename"] == "pipeline.py"
480
+
481
+ def test_data_pipeline_has_security_and_performance(self):
482
+ task = ALL_TASKS["data-pipeline"]
483
+ types = {i["issue_type"] for i in task["ground_truth_issues"]}
484
+ assert "security" in types
485
+ assert "performance" in types
486
+
487
+ def test_async_review_has_bug_and_performance(self):
488
+ task = ALL_TASKS["async-review"]
489
+ types = {i["issue_type"] for i in task["ground_truth_issues"]}
490
+ assert "bug" in types
491
+ assert "performance" in types
492
+
493
+ def test_all_tasks_count(self):
494
+ assert len(ALL_TASKS) >= 6
495
+
496
+    def test_async_review_line_numbers_are_valid(self):
+        """GT issue line numbers should be within the code file."""
+        from tasks.data import TASK_ASYNC_REVIEW
+        code = TASK_ASYNC_REVIEW["code_files"]["async.py"]
+        total_lines = len(code.splitlines())
+        for issue in TASK_ASYNC_REVIEW["ground_truth_issues"]:
+            assert 1 <= issue["line_number"] <= total_lines, (
+                f"Line {issue['line_number']} out of range (file has {total_lines} lines)"
+            )
+
+    def test_pipeline_line_numbers_are_valid(self):
+        """GT issue line numbers should be within the code file."""
+        from tasks.data import TASK_DATA_PIPELINE
+        code = TASK_DATA_PIPELINE["code_files"]["pipeline.py"]
+        total_lines = len(code.splitlines())
+        for issue in TASK_DATA_PIPELINE["ground_truth_issues"]:
+            assert 1 <= issue["line_number"] <= total_lines, (
+                f"Line {issue['line_number']} out of range (file has {total_lines} lines)"
+            )
+
+    def test_api_security_has_8_issues(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        assert len(task["ground_truth_issues"]) == 8
+
+    def test_api_security_line_numbers_are_valid(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        code = task["code_files"]["api.py"]
+        total_lines = len(code.splitlines())
+        for issue in task["ground_truth_issues"]:
+            assert 1 <= issue["line_number"] <= total_lines, (
+                f"Line {issue['line_number']} out of range (file has {total_lines} lines)"
+            )
+
+    def test_api_security_has_security_issues(self):
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        types = {i["issue_type"] for i in task["ground_truth_issues"]}
+        assert "security" in types
+
+
+# ---------------------------------------------------------------------------
+# compute_function_map and function_ranges in metadata
+# ---------------------------------------------------------------------------
+
+class TestFunctionRangesMetadata:
+    def test_function_ranges_in_metadata(self):
+        code = {"test.py": "def foo():\n return 1\n\ndef bar(x):\n return x\n"}
+        result = compute_code_metadata(code)
+        assert "function_ranges" in result
+        assert len(result["function_ranges"]) == 2
+
+    def test_function_ranges_have_correct_fields(self):
+        code = {"test.py": "def foo():\n return 1\n"}
+        result = compute_code_metadata(code)
+        fr = result["function_ranges"][0]
+        assert fr["name"] == "foo"
+        assert fr["file"] == "test.py"
+        assert "start" in fr
+        assert "end" in fr
+        assert fr["start"] <= fr["end"]
+
+    def test_function_ranges_empty_for_no_functions(self):
+        code = {"test.py": "x = 1\ny = 2\n"}
+        result = compute_code_metadata(code)
+        assert result["function_ranges"] == []
+
+    def test_function_ranges_multifile(self):
+        code = {
+            "a.py": "def foo():\n pass\n",
+            "b.py": "def bar():\n pass\n\ndef baz():\n pass\n",
+        }
+        result = compute_code_metadata(code)
+        names = {fr["name"] for fr in result["function_ranges"]}
+        assert names == {"foo", "bar", "baz"}
+
+    def test_function_ranges_correct_line_numbers(self):
+        code = {"test.py": "x = 1\n\ndef foo():\n return 1\n"}
+        result = compute_code_metadata(code)
+        assert len(result["function_ranges"]) == 1
+        assert result["function_ranges"][0]["start"] == 3  # line 3
+
+
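The `TestFunctionRangesMetadata` cases pin down the contract these tests expect of `compute_code_metadata`: a `{filename: source}` dict in, a `function_ranges` list of `{name, file, start, end}` dicts out, with 1-based line numbers. A minimal sketch consistent with those assertions might parse each file with the stdlib `ast` module (this is an illustrative implementation, not necessarily the one in `server/environment.py`):

```python
import ast


def compute_code_metadata(code_files: dict[str, str]) -> dict:
    """Collect per-function line ranges across a dict of {filename: source}."""
    function_ranges = []
    for filename, source in code_files.items():
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                function_ranges.append({
                    "name": node.name,
                    "file": filename,
                    "start": node.lineno,       # 1-based line of the `def`
                    "end": node.end_lineno,     # last line of the body (Py 3.8+)
                })
    return {"function_ranges": function_ranges}
```

`ast.walk` also visits nested functions and methods, which is usually what a reviewer-facing function map wants; restrict to `tree.body` if only top-level functions should count.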
+# ---------------------------------------------------------------------------
+# New keyword patterns
+# ---------------------------------------------------------------------------
+
+class TestNewKeywordPatterns:
+    def test_baseline_finds_hardcoded_admin_token(self):
+        from server.graders import run_keyword_baseline
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        findings = run_keyword_baseline(task)
+        token_finds = [f for f in findings if "ADMIN_TOKEN" in f.description or "token" in f.description.lower()]
+        assert len(token_finds) >= 1
+
+    def test_baseline_finds_pickle_loads(self):
+        from server.graders import run_keyword_baseline
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        findings = run_keyword_baseline(task)
+        pickle_finds = [f for f in findings if "pickle" in f.description.lower()]
+        assert len(pickle_finds) >= 1
+
+    def test_baseline_finds_os_system(self):
+        from server.graders import run_keyword_baseline
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        findings = run_keyword_baseline(task)
+        sys_finds = [f for f in findings if "os.system" in f.description.lower() or "command" in f.description.lower()]
+        assert len(sys_finds) >= 1
+
+    def test_baseline_api_security_score_nonzero(self):
+        from server.graders import run_keyword_baseline, grade_episode
+        from models import Issue
+        from tasks.data import ALL_TASKS
+        task = ALL_TASKS["api-security"]
+        findings = run_keyword_baseline(task)
+        gt = [Issue.from_dict(i) for i in task["ground_truth_issues"]]
+        score = grade_episode(findings, gt)
+        assert score > 0.0, "Keyword baseline should find at least 1 issue in api-security"
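The `TestNewKeywordPatterns` cases imply that `run_keyword_baseline` scans each task's code files line by line for known-risky patterns and emits findings with a `description` field. A minimal sketch under that assumption — the pattern list, `Finding` shape, and descriptions here are illustrative; the real list lives in `server/graders.py`:

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    file: str
    line_number: int
    description: str


# Illustrative patterns only; the grader's actual list may differ.
KEYWORD_PATTERNS = [
    (re.compile(r"ADMIN_TOKEN\s*="), "Hardcoded admin token (ADMIN_TOKEN)"),
    (re.compile(r"pickle\.loads?\("), "Unsafe pickle deserialization"),
    (re.compile(r"os\.system\("), "Possible command injection via os.system"),
]


def run_keyword_baseline(task: dict) -> list[Finding]:
    """Grep every code file for risky patterns; one Finding per hit."""
    findings = []
    for filename, source in task["code_files"].items():
        for lineno, line in enumerate(source.splitlines(), start=1):
            for pattern, description in KEYWORD_PATTERNS:
                if pattern.search(line):
                    findings.append(Finding(filename, lineno, description))
    return findings
```

Because each hit carries the exact line number, such a baseline scores well under the ±2-line matching tolerance whenever the ground-truth issue sits on the flagged line, which is consistent with the nonzero-score assertion above.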