Spaces: 100XZX001/CodeReview-Professional-Workflow


100XZX001 committed 27 days ago

Commit 1588266 (verified) · 1 parent: a9cad0e

Upload 16 files


Files changed (15)

LICENSE +21 -0
README.md +118 -103
__init__.py +16 -16
author.py +219 -219
bugs.json +127 -127
client.py +4 -4
environment.py +613 -556
grader.py +141 -147
models.py +87 -114
openenv.yaml +135 -135
pyproject.toml +29 -29
redteam.py +274 -274
rubrics.py +136 -123
test_runner.py +208 -181
training.py +792 -708

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026 YUTA
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,103 +1,118 @@
----
-title: Code Review Professional Workflow
-emoji: 🔥
-colorFrom: blue
-colorTo: purple
-sdk: docker
-app_port: 7860
-pinned: false
----
-# Code Review Professional Workflow
-            "Multi‑turn code review environment for professional‑level bug fixing. "
-            "The agent must inspect, test, lint, query documentation, and negotiate with "
-            "a simulated (persona‑driven) author to get a fix accepted. "
-            "Includes 25 bugs across 5 difficulty levels, AST‑based injection, "
-            "a reward‑shaping system, and curriculum learning. "
-            "Designed for RL training (PPO, DPO, or any policy‑gradient method)
-## Quick Start
-```python
-from environment import CodeReviewEnv
-env = CodeReviewEnv()
-obs = env.reset()
-print(obs.code_snippet)
-```
-## Environment Endpoints
-- `POST /reset` – reset environment (optional `task` parameter)
-- `POST /step` – take an action (JSON)
-- `GET /state` – get full environment state
-- `GET /health` – health check
-- `GET /metadata` – environment metadata
-- `GET /schema` – action/observation schemas
-- `POST /mcp` – minimal MCP endpoint
-## Tasks
-## 🐛 Bug Taxonomy (25 bugs across 5 difficulty levels)
-The **RedTeam** randomly selects one bug from the current difficulty level at the start of every episode.
-Your agent must figure out what’s broken, gather evidence, and convince the simulated author – or it won’t stick.
-### 🟢 Easy – Null‑Checks & Simple Logic Errors
-| # | Bug ID | What’s wrong | Injection method |
-|---|--------|--------------|------------------|
-| 1 | `null_check` | Missing `if key in dict:` guard → KeyError | AST: remove the if‑statement |
-| 2 | `simple_typo` | Misspelled variable `users` → `usres` | AST: rename variable |
-| 3 | `string_index` | String index shifted by +1 | AST: change constant in index |
-| 4 | `default_value` | `dict.get(key)` used without a fallback | AST: replace `dict.get(key)` with `dict[key]` |
-| 5 | `empty_return` | Function returns `None` prematurely | AST: insert `return None` early |
-### 🟡 Medium – Off‑By‑One, Loop Logic & Simple Arithmetic
-| # | Bug ID | What’s wrong | Injection method |
-|---|--------|--------------|------------------|
-| 6 | `off_by_one` | `range(x)` becomes `range(1, x-1)` – skips first & last | AST: modify range arguments |
-| 7 | `loop_skip` | `range(len(arr))` becomes `range(len(arr)-1)` – misses last element | AST: change range length |
-| 8 | `sign_error` | `sum += item` turned into `sum -= item` | AST: swap Add / Sub |
-| 9 | `swap_args` | Function arguments swapped | AST: swap first two arguments |
-|10 | `uninitialised_var` | Variable used before assignment in a loop | AST: remove the assignment statement |
-### 🟠 Hard – Division‑By‑Zero, Floating‑Point & Edge Cases
-| # | Bug ID | What’s wrong | Injection method |
-|---|--------|--------------|------------------|
-|11 | `division_by_zero_empty` | Empty‑list guard removed before averaging | AST: delete `if not data:` |
-|12 | `division_by_zero_zero` | Denominator check removed | AST: remove the zero‑check |
-|13 | `float_precision` | True division `/` replaced by integer division `//` | AST: change Div → FloorDiv |
-|14 | `abs_usage` | `abs()` call removed when comparing differences | AST: delete `abs()` wrapper |
-|15 | `round_error` | `round()` placed too early, causing precision drift | AST: inject `round()` prematurely |
-### 🔴 Harder – Race Conditions & Atomicity Bugs
-| # | Bug ID | What’s wrong | Injection method |
-|---|--------|--------------|------------------|
-|16 | `missing_lock` | Shared counter incremented without a lock | Template: remove `with lock:` |
-|17 | `double_lock` | Acquiring the same lock twice → deadlock risk | Template: add extra `lock.acquire()` |
-|18 | `global_nonatomic` | `count = count + 1` (read‑modify‑write) instead of `+=` | AST: modify assignment node |
-|19 | `thread_safe_list` | List append across threads without synchronisation | Template: remove lock from list operation |
-|20 | `volatile_read` | Shared flag read outside a lock → stale value | Template: remove synchronisation block |
-### ⚫ Hardest – Deadlocks, Ordering & Complex Concurrency
-| # | Bug ID | What’s wrong | Injection method |
-|---|--------|--------------|------------------|
-|21 | `deadlock_order` | Locks acquired in opposite order in two threads | Template: swap lock order |
-|22 | `nested_lock_timeout` | `lock.acquire()` without a timeout → permanent hang | Template: remove timeout logic |
-|23 | `fork_join` | Thread started but not joined (`join()` missing) | AST: remove `thread.join()` |
-|24 | `mutex_release` | Lock released by a thread that never acquired it | Template: incorrect release logic |
-|25 | `race_on_init` | Shared resource initialised after threads have started | Template: move initialisation after `join()` |
-## Deployment
-```bash
-openenv push
-```
-## License
-MIT

+---
+title: Code Review Professional Workflow
+emoji: 🔥
+colorFrom: blue
+colorTo: purple
+sdk: docker
+app_port: 7860
+pinned: false
+---
+# Code Review Professional Workflow
+This project is a multi-turn RL environment where an agent plays the role of a senior code reviewer.
+Instead of just patching code, the agent must gather evidence (`inspect`, `run_tests`, `run_linter`,
+`query_docs`) and convince a simulated developer persona to accept the fix.
+### Why this environment is interesting
+- It combines **technical correctness** (tests/lint) with **human acceptance** (negotiation).
+- It includes **25 injected bug types** across 5 difficulty levels via `RedTeam`.
+- It supports both a **full reward profile** (rich shaping) and a **core reward profile**
+  (minimal, baseline-friendly signal for ablations).
+## Quick Start
+```python
+from environment import CodeReviewEnv
+env = CodeReviewEnv(task="easy", reward_profile="full")
+obs = env.reset()
+print(obs.code_snippet)
+```
+## Demo Script (Non-Technical Friendly)
+Use this 60-90 second flow in a demo:
+1. Reset on `easy` and show the buggy snippet.
+2. Take `inspect` and `run_tests` actions to show evidence gathering.
+3. Ask `query_docs` once to show retrieval-assisted reasoning.
+4. Propose a fix and show accepted/denied feedback from the author persona.
+5. Repeat once on `harder` to show increased challenge.
+Message for audience: "The agent is learning not only to fix code, but to justify and communicate the fix."
+## Environment Endpoints
+- `POST /reset` – reset environment (optional `task` parameter)
+- `POST /step` – take an action (JSON)
+- `GET /state` – get full environment state
+- `GET /health` – health check
+- `GET /metadata` – environment metadata
+- `GET /schema` – action/observation schemas
+- `POST /mcp` – minimal MCP endpoint
+## Tasks
+## 🐛 Bug Taxonomy (25 bugs across 5 difficulty levels)
+The **RedTeam** randomly selects one bug from the current difficulty level at the start of every episode.
+Your agent must figure out what’s broken, gather evidence, and convince the simulated author – or it won’t stick.
+### 🟢 Easy – Null‑Checks & Simple Logic Errors
+| # | Bug ID | What’s wrong | Injection method |
+|---|--------|--------------|------------------|
+| 1 | `null_check` | Missing `if key in dict:` guard → KeyError | AST: remove the if‑statement |
+| 2 | `simple_typo` | Misspelled variable `users` → `usres` | AST: rename variable |
+| 3 | `string_index` | String index shifted by +1 | AST: change constant in index |
+| 4 | `default_value` | `dict[key]` used where `dict.get(key, default)` is needed → KeyError on missing keys | AST: replace `dict.get(key)` with `dict[key]` |
+| 5 | `empty_return` | Function returns `None` prematurely | AST: insert `return None` early |
+### 🟡 Medium – Off‑By‑One, Loop Logic & Simple Arithmetic
+| # | Bug ID | What’s wrong | Injection method |
+|---|--------|--------------|------------------|
+| 6 | `off_by_one` | `range(x)` becomes `range(1, x-1)` – skips first & last | AST: modify range arguments |
+| 7 | `loop_skip` | `range(len(arr))` becomes `range(len(arr)-1)` – misses last element | AST: change range length |
+| 8 | `sign_error` | `sum += item` turned into `sum -= item` | AST: swap Add / Sub |
+| 9 | `swap_args` | Function arguments swapped | AST: swap first two arguments |
+|10 | `uninitialised_var` | Variable used before assignment in a loop | AST: remove the assignment statement |
+### 🟠 Hard – Division‑By‑Zero, Floating‑Point & Edge Cases
+| # | Bug ID | What’s wrong | Injection method |
+|---|--------|--------------|------------------|
+|11 | `division_by_zero_empty` | Empty‑list guard removed before averaging | AST: delete `if not data:` |
+|12 | `division_by_zero_zero` | Denominator check removed | AST: remove the zero‑check |
+|13 | `float_precision` | True division `/` replaced by integer division `//` | AST: change Div → FloorDiv |
+|14 | `abs_usage` | `abs()` call removed when comparing differences | AST: delete `abs()` wrapper |
+|15 | `round_error` | `round()` placed too early, causing precision drift | AST: inject `round()` prematurely |
+### 🔴 Harder – Race Conditions & Atomicity Bugs
+| # | Bug ID | What’s wrong | Injection method |
+|---|--------|--------------|------------------|
+|16 | `missing_lock` | Shared counter incremented without a lock | Template: remove `with lock:` |
+|17 | `double_lock` | Acquiring the same lock twice → deadlock risk | Template: add extra `lock.acquire()` |
+|18 | `global_nonatomic` | `count = count + 1` (read‑modify‑write) instead of `+=` | AST: modify assignment node |
+|19 | `thread_safe_list` | List append across threads without synchronisation | Template: remove lock from list operation |
+|20 | `volatile_read` | Shared flag read outside a lock → stale value | Template: remove synchronisation block |
+### ⚫ Hardest – Deadlocks, Ordering & Complex Concurrency
+| # | Bug ID | What’s wrong | Injection method |
+|---|--------|--------------|------------------|
+|21 | `deadlock_order` | Locks acquired in opposite order in two threads | Template: swap lock order |
+|22 | `nested_lock_timeout` | `lock.acquire()` without a timeout → permanent hang | Template: remove timeout logic |
+|23 | `fork_join` | Thread started but not joined (`join()` missing) | AST: remove `thread.join()` |
+|24 | `mutex_release` | Lock released by a thread that never acquired it | Template: incorrect release logic |
+|25 | `race_on_init` | Shared resource initialised after threads have started | Template: move initialisation after `join()` |
+## Deployment
+```bash
+openenv push
+```
+## License
+MIT

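Review note: the taxonomy above describes most injections as AST rewrites, for example "AST: swap Add / Sub" for `sign_error`. The sketch below shows one way such a rewrite could look using the standard `ast` module. `redteam.py` is not shown in this part of the diff, so `SwapAddToSub`, `inject_sign_error`, and `call_total` are hypothetical names chosen for illustration, not the project's API. It assumes Python 3.9+ for `ast.unparse`.

```python
import ast


class SwapAddToSub(ast.NodeTransformer):
    """Hypothetical injector: rewrite every `x += y` as `x -= y`."""

    def visit_AugAssign(self, node: ast.AugAssign) -> ast.AST:
        self.generic_visit(node)  # handle nested statements first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def inject_sign_error(source: str) -> str:
    """Parse the source, apply the transformer, and return the mutated code."""
    tree = SwapAddToSub().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse needs Python 3.9+


def call_total(source: str, items: list) -> int:
    """Execute the snippet in a fresh namespace and call its `total` function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["total"](items)


CLEAN = '''
def total(items):
    acc = 0
    for item in items:
        acc += item
    return acc
'''

BUGGY = inject_sign_error(CLEAN)

print(BUGGY)
print(call_total(CLEAN, [1, 2, 3]), call_total(BUGGY, [1, 2, 3]))
```

The clean snippet sums `[1, 2, 3]` to 6, while the mutated one returns -6, which is the kind of behavioural difference a `run_tests` action would be expected to surface.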
__init__.py CHANGED

@@ -1,16 +1,16 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""Criticrl  Environment."""
-from .client import CriticrlEnv
-from .models import CriticrlAction, CriticrlObservation
-__all__ = [
-    "CriticrlAction",
-    "CriticrlObservation",
-    "CriticrlEnv",
-]

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Criticrl  Environment."""
+from .client import CriticrlEnv
+from .models import CriticrlAction, CriticrlObservation
+__all__ = [
+    "CriticrlAction",
+    "CriticrlObservation",
+    "CriticrlEnv",
+]

author.py CHANGED

@@ -1,219 +1,219 @@
-# cell 7 author.py – Final production version: stateful, evidence-driven, belief tracking
-import re
-import ast
-from dataclasses import dataclass, field
-from typing import List, Dict, Any, Optional
-@dataclass
-class PersonaAuthor:
-    """
-    Simulates a human developer with:
-    - Continuous belief (confidence)
-    - Evidence-based reasoning
-    - Conversation memory
-    - Code inspection awareness
-    """
-    personality: str = "defensive"   # defensive | junior | collaborative
-    max_persuasion_rounds: int = 5
-    # Evidence weights
-    weight_test_pass: float = 0.5
-    weight_lint_clean: float = 0.2
-    weight_doc_found: float = 0.15
-    weight_explanation_quality: float = 0.15
-    # Personality thresholds
-    thresholds: Dict[str, float] = field(default_factory=lambda: {
-        "defensive": 0.7,
-        "junior": 0.3,
-        "collaborative": 0.5,
-    })
-    # Internal state
-    _confidence: float = 0.0
-    _conversation: List[Dict[str, Any]] = field(default_factory=list)
-    _pushback_count: int = 0
-    _last_evidence_score: float = 0.0
-    _stagnation_counter: int = 0
-    # ------------------------------------------------------------------
-    # Lifecycle
-    # ------------------------------------------------------------------
-    def __post_init__(self):
-        self.reset()
-    def reset(self):
-        self._confidence = 0.0
-        self._conversation.clear()
-        self._pushback_count = 0
-        self._last_evidence_score = 0.0
-        self._stagnation_counter = 0
-    # ------------------------------------------------------------------
-    # Main interaction
-    # ------------------------------------------------------------------
-    # Added weight for code change magnitude
-    weight_code_change: float = 0.1   # small change is better
-    def respond(self,
-                agent_comment: str = "",
-                agent_question: str = "",
-                test_results: Optional[str] = None,
-                lint_results: Optional[str] = None,
-                doc_results: Optional[str] = None,
-                proposed_fix: Optional[str] = None,
-                original_code: Optional[str] = None) -> str:
-        # Store conversation
-        self._conversation.append({
-            "comment": agent_comment,
-            "question": agent_question,
-            "test": test_results,
-            "lint": lint_results,
-            "docs": doc_results
-        })
-        # Extract structured evidence
-        evidence = self._extract_evidence(test_results, lint_results, doc_results)
-        # Code inspection
-        code_change = 0.0
-        if proposed_fix and original_code:
-            code_change = self._inspect_code(proposed_fix, original_code)
-            evidence["code_change"] = code_change
-        # Explanation score
-        text = (agent_comment + " " + agent_question).lower()
-        explanation_score = self._score_explanation(text)
-        # Compute evidence score – now includes code change penalty (1 - change)
-        evidence_score = (
-            self.weight_test_pass * evidence.get("test_pass_ratio", 0.0) +
-            self.weight_lint_clean * (1 - min(1.0, evidence.get("lint_errors", 0)/10)) +
-            self.weight_doc_found * (1.0 if evidence.get("doc_found") else 0.0) +
-            self.weight_explanation_quality * explanation_score +
-            self.weight_code_change * (1.0 - code_change)   # surgical fix rewarded
-        )
-        evidence_score = max(0.0, min(1.0, evidence_score))
-        # Detect improvement
-        delta = evidence_score - self._last_evidence_score
-        self._last_evidence_score = evidence_score
-        if delta > 0.05:
-            self._stagnation_counter = 0
-        else:
-            self._stagnation_counter += 1
-        # Update belief (momentum)
-        lr = 0.3
-        self._confidence = (1 - lr) * self._confidence + lr * evidence_score
-        # Penalise stagnation
-        if self._stagnation_counter >= 2:
-            self._confidence *= 0.9
-        # Decision
-        threshold = self.thresholds.get(self.personality, 0.5)
-        if self._confidence >= threshold or self._pushback_count >= self.max_persuasion_rounds:
-            return "Alright, I'm convinced. Let's proceed with your fix."
-        # Otherwise push back
-        self._pushback_count += 1
-        return self._generate_pushback(evidence, text)
-    # ------------------------------------------------------------------
-    # Evidence extraction
-    # ------------------------------------------------------------------
-    def _extract_evidence(self, test_results, lint_results, doc_results):
-        evidence = {
-            "test_pass_ratio": 0.0,
-            "lint_errors": 0,
-            "doc_found": False
-        }
-        # Parse test results
-        if test_results:
-            match = re.search(r'(\d+)\s*/\s*(\d+)', test_results)
-            if match:
-                p, t = int(match.group(1)), int(match.group(2))
-                evidence["test_pass_ratio"] = p / t if t else 0.0
-            elif "true" in test_results.lower():
-                evidence["test_pass_ratio"] = 1.0
-            elif "false" in test_results.lower():
-                evidence["test_pass_ratio"] = 0.0
-        # Lint errors
-        if lint_results:
-            evidence["lint_errors"] = len(re.findall(r'error', lint_results.lower()))
-        # Docs
-        if doc_results and "no relevant" not in doc_results.lower():
-            evidence["doc_found"] = True
-        return evidence
-    # ------------------------------------------------------------------
-    # Explanation scoring
-    # ------------------------------------------------------------------
-    def _score_explanation(self, text: str) -> float:
-        score = 0.0
-        if "because" in text or "therefore" in text:
-            score += 0.3
-        if "test" in text or "example" in text:
-            score += 0.2
-        if len(text.split()) > 30:
-            score += 0.2
-        if "error" in text or "fix" in text:
-            score += 0.1
-        return min(1.0, score)
-    # ------------------------------------------------------------------
-    # Code inspection
-    # ------------------------------------------------------------------
-    def _inspect_code(self, new_code: str, old_code: str) -> float:
-        try:
-            t1 = ast.parse(old_code)
-            t2 = ast.parse(new_code)
-            n1 = len(list(ast.walk(t1)))
-            n2 = len(list(ast.walk(t2)))
-            change = abs(n2 - n1) / max(n1, 1)
-            return min(1.0, change)
-        except:
-            return 0.0
-    # ------------------------------------------------------------------
-    # Pushback generator
-    # ------------------------------------------------------------------
-    def _generate_pushback(self, evidence, text):
-        if evidence["test_pass_ratio"] < 0.5:
-            return "Tests are still failing. Show a passing case."
-        if evidence["lint_errors"] > 0:
-            return f"There are {evidence['lint_errors']} lint errors. Fix them."
-        if not evidence["doc_found"]:
-            return "Provide documentation or reference."
-        if "because" not in text:
-            return "Explain why this works."
-        if len(text.split()) < 20:
-            return "Too brief. Expand your reasoning."
-        return "Not convinced yet. Give a concrete example."
-    # ------------------------------------------------------------------
-    # Score
-    # ------------------------------------------------------------------
-    def get_negotiation_score(self) -> float:
-        penalty = 0.1 * min(3, self._pushback_count)
-        return max(0.0, min(1.0, self._confidence - penalty))

+# cell 7 author.py – Final production version: stateful, evidence-driven, belief tracking
+import re
+import ast
+from dataclasses import dataclass, field
+from typing import List, Dict, Any, Optional
+@dataclass
+class PersonaAuthor:
+    """
+    Simulates a human developer with:
+    - Continuous belief (confidence)
+    - Evidence-based reasoning
+    - Conversation memory
+    - Code inspection awareness
+    """
+    personality: str = "defensive"   # defensive | junior | collaborative
+    max_persuasion_rounds: int = 5
+    # Evidence weights
+    weight_test_pass: float = 0.5
+    weight_lint_clean: float = 0.2
+    weight_doc_found: float = 0.15
+    weight_explanation_quality: float = 0.15
+    # Personality thresholds
+    thresholds: Dict[str, float] = field(default_factory=lambda: {
+        "defensive": 0.7,
+        "junior": 0.3,
+        "collaborative": 0.5,
+    })
+    # Internal state
+    _confidence: float = 0.0
+    _conversation: List[Dict[str, Any]] = field(default_factory=list)
+    _pushback_count: int = 0
+    _last_evidence_score: float = 0.0
+    _stagnation_counter: int = 0
+    # ------------------------------------------------------------------
+    # Lifecycle
+    # ------------------------------------------------------------------
+    def __post_init__(self):
+        self.reset()
+    def reset(self):
+        self._confidence = 0.0
+        self._conversation.clear()
+        self._pushback_count = 0
+        self._last_evidence_score = 0.0
+        self._stagnation_counter = 0
+    # ------------------------------------------------------------------
+    # Main interaction
+    # ------------------------------------------------------------------
+    # Added weight for code change magnitude
+    weight_code_change: float = 0.1   # small change is better
+    def respond(self,
+                agent_comment: str = "",
+                agent_question: str = "",
+                test_results: Optional[str] = None,
+                lint_results: Optional[str] = None,
+                doc_results: Optional[str] = None,
+                proposed_fix: Optional[str] = None,
+                original_code: Optional[str] = None) -> str:
+        # Store conversation
+        self._conversation.append({
+            "comment": agent_comment,
+            "question": agent_question,
+            "test": test_results,
+            "lint": lint_results,
+            "docs": doc_results
+        })
+        # Extract structured evidence
+        evidence = self._extract_evidence(test_results, lint_results, doc_results)
+        # Code inspection
+        code_change = 0.0
+        if proposed_fix and original_code:
+            code_change = self._inspect_code(proposed_fix, original_code)
+            evidence["code_change"] = code_change
+        # Explanation score
+        text = (agent_comment + " " + agent_question).lower()
+        explanation_score = self._score_explanation(text)
+        # Compute evidence score – now includes code change penalty (1 - change)
+        evidence_score = (
+            self.weight_test_pass * evidence.get("test_pass_ratio", 0.0) +
+            self.weight_lint_clean * (1 - min(1.0, evidence.get("lint_errors", 0)/10)) +
+            self.weight_doc_found * (1.0 if evidence.get("doc_found") else 0.0) +
+            self.weight_explanation_quality * explanation_score +
+            self.weight_code_change * (1.0 - code_change)   # surgical fix rewarded
+        )
+        evidence_score = max(0.0, min(1.0, evidence_score))
+        # Detect improvement
+        delta = evidence_score - self._last_evidence_score
+        self._last_evidence_score = evidence_score
+        if delta > 0.05:
+            self._stagnation_counter = 0
+        else:
+            self._stagnation_counter += 1
+        # Update belief (momentum)
+        lr = 0.3
+        self._confidence = (1 - lr) * self._confidence + lr * evidence_score
+        # Penalise stagnation
+        if self._stagnation_counter >= 2:
+            self._confidence *= 0.9
+        # Decision
+        threshold = self.thresholds.get(self.personality, 0.5)
+        if self._confidence >= threshold or self._pushback_count >= self.max_persuasion_rounds:
+            return "Alright, I'm convinced. Let's proceed with your fix."
+        # Otherwise push back
+        self._pushback_count += 1
+        return self._generate_pushback(evidence, text)
+    # ------------------------------------------------------------------
+    # Evidence extraction
+    # ------------------------------------------------------------------
+    def _extract_evidence(self, test_results, lint_results, doc_results):
+        evidence = {
+            "test_pass_ratio": 0.0,
+            "lint_errors": 0,
+            "doc_found": False
+        }
+        # Parse test results
+        if test_results:
+            match = re.search(r'(\d+)\s*/\s*(\d+)', test_results)
+            if match:
+                p, t = int(match.group(1)), int(match.group(2))
+                evidence["test_pass_ratio"] = p / t if t else 0.0
+            elif "true" in test_results.lower():
+                evidence["test_pass_ratio"] = 1.0
+            elif "false" in test_results.lower():
+                evidence["test_pass_ratio"] = 0.0
+        # Lint errors
+        if lint_results:
+            evidence["lint_errors"] = len(re.findall(r'error', lint_results.lower()))
+        # Docs
+        if doc_results and "no relevant" not in doc_results.lower():
+            evidence["doc_found"] = True
+        return evidence
+    # ------------------------------------------------------------------
+    # Explanation scoring
+    # ------------------------------------------------------------------
+    def _score_explanation(self, text: str) -> float:
+        score = 0.0
+        if "because" in text or "therefore" in text:
+            score += 0.3
+        if "test" in text or "example" in text:
+            score += 0.2
+        if len(text.split()) > 30:
+            score += 0.2
+        if "error" in text or "fix" in text:
+            score += 0.1
+        return min(1.0, score)
+    # ------------------------------------------------------------------
+    # Code inspection
+    # ------------------------------------------------------------------
+    def _inspect_code(self, new_code: str, old_code: str) -> float:
+        try:
+            t1 = ast.parse(old_code)
+            t2 = ast.parse(new_code)
+            n1 = len(list(ast.walk(t1)))
+            n2 = len(list(ast.walk(t2)))
+            change = abs(n2 - n1) / max(n1, 1)
+            return min(1.0, change)
+        except (SyntaxError, ValueError):  # either snippet failed to parse
+            return 0.0
+    # ------------------------------------------------------------------
+    # Pushback generator
+    # ------------------------------------------------------------------
+    def _generate_pushback(self, evidence, text):
+        if evidence["test_pass_ratio"] < 0.5:
+            return "Tests are still failing. Show a passing case."
+        if evidence["lint_errors"] > 0:
+            return f"There are {evidence['lint_errors']} lint errors. Fix them."
+        if not evidence["doc_found"]:
+            return "Provide documentation or reference."
+        if "because" not in text:
+            return "Explain why this works."
+        if len(text.split()) < 20:
+            return "Too brief. Expand your reasoning."
+        return "Not convinced yet. Give a concrete example."
+    # ------------------------------------------------------------------
+    # Score
+    # ------------------------------------------------------------------
+    def get_negotiation_score(self) -> float:
+        penalty = 0.1 * min(3, self._pushback_count)
+        return max(0.0, min(1.0, self._confidence - penalty))

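Review note: the belief update in `PersonaAuthor.respond` combines an exponential moving average (`lr = 0.3`), a 10% decay once the stagnation counter reaches 2, and a cap of `max_persuasion_rounds` pushbacks. The sketch below replays only that arithmetic, outside the class, to show how the pieces interact. It assumes the evidence score stays constant across rounds, which is a simplification; `rounds_to_accept` and the constant names are hypothetical and do not appear in `author.py`.

```python
from typing import Tuple

LEARNING_RATE = 0.3        # `lr` in respond()
STAGNATION_DECAY = 0.9     # multiplier applied once stagnation reaches 2
MAX_PERSUASION_ROUNDS = 5  # default `max_persuasion_rounds`


def rounds_to_accept(threshold: float, evidence_score: float = 1.0) -> Tuple[int, str]:
    """Replay the update rule with a constant evidence score.

    Returns (round in which the author accepts, reason), where reason is
    "confidence" or "round_cap".
    """
    confidence = 0.0
    last_score = 0.0
    stagnation = 0
    pushbacks = 0
    round_no = 0
    while True:
        round_no += 1
        # Stagnation resets only when evidence improves by more than 0.05.
        if evidence_score - last_score > 0.05:
            stagnation = 0
        else:
            stagnation += 1
        last_score = evidence_score
        # Exponential moving average of the evidence score.
        confidence = (1 - LEARNING_RATE) * confidence + LEARNING_RATE * evidence_score
        if stagnation >= 2:
            confidence *= STAGNATION_DECAY
        if confidence >= threshold:
            return round_no, "confidence"
        if pushbacks >= MAX_PERSUASION_ROUNDS:
            return round_no, "round_cap"
        pushbacks += 1


for persona, threshold in [("junior", 0.3), ("collaborative", 0.5), ("defensive", 0.7)]:
    print(persona, rounds_to_accept(threshold))
```

Under these assumptions, even a perfect and unchanging evidence score of 1.0 convinces the `junior` persona in round 1 and the `collaborative` persona in round 2, but the `defensive` persona only accepts in round 6 through the round cap: the stagnation decay holds confidence just under 0.7 until then. An agent facing a defensive author would therefore need its evidence score to keep improving between turns rather than repeat the same evidence.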
bugs.json CHANGED

@@ -1,127 +1,127 @@
-{
-  "easy": {
-    "null_check": {
-      "type": "ast",
-      "bug_type": "null_check",
-      "oracle_hint": "Add back the if-guard that was removed"
-    },
-    "simple_typo": {
-      "type": "ast",
-      "bug_type": "simple_typo",
-      "oracle_hint": "Fix the misspelled variable name"
-    },
-    "string_index": {
-      "type": "ast",
-      "bug_type": "string_index",
-      "oracle_hint": "Correct the index offset"
-    },
-    "default_value": {
-      "type": "ast",
-      "bug_type": "default_value",
-      "oracle_hint": "Restore dict.get() with proper default"
-    },
-    "empty_return": {
-      "type": "ast",
-      "bug_type": "empty_return",
-      "oracle_hint": "Remove the premature return None"
-    }
-  },
-  "medium": {
-    "off_by_one": {
-      "type": "ast",
-      "bug_type": "off_by_one"
-    },
-    "loop_skip": {
-      "type": "ast",
-      "bug_type": "loop_skip"
-    },
-    "sign_error": {
-      "type": "ast",
-      "bug_type": "sign_error"
-    },
-    "swap_args": {
-      "type": "ast",
-      "bug_type": "swap_args"
-    },
-    "uninitialised_var": {
-      "type": "ast",
-      "bug_type": "uninitialised_var"
-    }
-  },
-  "hard": {
-    "division_by_zero_empty": {
-      "type": "ast",
-      "bug_type": "division_by_zero_empty"
-    },
-    "division_by_zero_zero": {
-      "type": "ast",
-      "bug_type": "division_by_zero_zero"
-    },
-    "float_precision": {
-      "type": "ast",
-      "bug_type": "float_precision"
-    },
-    "abs_usage": {
-      "type": "ast",
-      "bug_type": "abs_usage"
-    },
-    "round_error": {
-      "type": "ast",
-      "bug_type": "round_error"
-    }
-  },
-  "harder": {
-    "missing_lock": {
-      "type": "template",
-      "buggy": "counter = 0\ndef increment():\n    global counter\n    counter += 1",
-      "oracle": "counter = 0\nimport threading\nlock = threading.Lock()\ndef increment():\n    global counter\n    with lock:\n        counter += 1"
-    },
-    "double_lock": {
-      "type": "template",
-      "buggy": "import threading\nlock = threading.Lock()\ndef do_work():\n    lock.acquire()\n    lock.acquire()\n    print('working')\n    lock.release()",
-      "oracle": "import threading\nlock = threading.Lock()\ndef do_work():\n    with lock:\n        print('working')"
-    },
-    "global_nonatomic": {
-      "type": "template",
-      "buggy": "count = 0\ndef add():\n    global count\n    count = count + 1",
-      "oracle": "count = 0\ndef add():\n    global count\n    count += 1"
-    },
-    "thread_safe_list": {
-      "type": "template",
-      "buggy": "import threading\nitems = []\ndef append_item(item):\n    items.append(item)",
-      "oracle": "import threading\nitems = []\nlock = threading.Lock()\ndef append_item(item):\n    with lock:\n        items.append(item)"
-    },
-    "volatile_read": {
-      "type": "template",
-      "buggy": "import threading\nstop = False\ndef worker():\n    while not stop:\n        pass",
-      "oracle": "import threading\nstop = False\nlock = threading.Lock()\ndef worker():\n    while True:\n        with lock:\n            if stop:\n                break"
-    }
-  },
-  "hardest": {
-    "deadlock_order": {
-      "type": "template",
-      "buggy": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock2:\n        with lock1:\n            pass",
-      "oracle": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock1:\n        with lock2:\n            pass"
-    },
-    "nested_lock_timeout": {
-      "type": "template",
-      "buggy": "import threading\nlock = threading.Lock()\ndef work():\n    lock.acquire()\n    # critical section\n    lock.release()",
-      "oracle": "import threading\nlock = threading.Lock()\ndef work():\n    if lock.acquire(timeout=1):\n        try:\n            # critical section\n        finally:\n            lock.release()"
-    },
-    "fork_join": {
-      "type": "template",
-      "buggy": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()",
-      "oracle": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()\nt.join()"
-    },
-    "mutex_release": {
-      "type": "template",
-      "buggy": "import threading\nlock = threading.Lock()\ndef thread_A():\n    lock.acquire()\n    lock.release()\ndef thread_B():\n    lock.release()",
-      "oracle": "import threading\nlock = threading.Lock()\ndef thread_A():\n    with lock:\n        pass\ndef thread_B():\n    with lock:\n        pass"
-    },
-    "race_on_init": {
-      "type": "template",
-      "buggy": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nprint(items)",
-      "oracle": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nt.join()\nprint(items)"
-    }
-  }
-}

+{
+  "easy": {
+    "null_check": {
+      "type": "ast",
+      "bug_type": "null_check",
+      "oracle_hint": "Add back the if-guard that was removed"
+    },
+    "simple_typo": {
+      "type": "ast",
+      "bug_type": "simple_typo",
+      "oracle_hint": "Fix the misspelled variable name"
+    },
+    "string_index": {
+      "type": "ast",
+      "bug_type": "string_index",
+      "oracle_hint": "Correct the index offset"
+    },
+    "default_value": {
+      "type": "ast",
+      "bug_type": "default_value",
+      "oracle_hint": "Restore dict.get() with proper default"
+    },
+    "empty_return": {
+      "type": "ast",
+      "bug_type": "empty_return",
+      "oracle_hint": "Remove the premature return None"
+    }
+  },
+  "medium": {
+    "off_by_one": {
+      "type": "ast",
+      "bug_type": "off_by_one"
+    },
+    "loop_skip": {
+      "type": "ast",
+      "bug_type": "loop_skip"
+    },
+    "sign_error": {
+      "type": "ast",
+      "bug_type": "sign_error"
+    },
+    "swap_args": {
+      "type": "ast",
+      "bug_type": "swap_args"
+    },
+    "uninitialised_var": {
+      "type": "ast",
+      "bug_type": "uninitialised_var"
+    }
+  },
+  "hard": {
+    "division_by_zero_empty": {
+      "type": "ast",
+      "bug_type": "division_by_zero_empty"
+    },
+    "division_by_zero_zero": {
+      "type": "ast",
+      "bug_type": "division_by_zero_zero"
+    },
+    "float_precision": {
+      "type": "ast",
+      "bug_type": "float_precision"
+    },
+    "abs_usage": {
+      "type": "ast",
+      "bug_type": "abs_usage"
+    },
+    "round_error": {
+      "type": "ast",
+      "bug_type": "round_error"
+    }
+  },
+  "harder": {
+    "missing_lock": {
+      "type": "template",
+      "buggy": "counter = 0\ndef increment():\n    global counter\n    counter += 1",
+      "oracle": "counter = 0\nimport threading\nlock = threading.Lock()\ndef increment():\n    global counter\n    with lock:\n        counter += 1"
+    },
+    "double_lock": {
+      "type": "template",
+      "buggy": "import threading\nlock = threading.Lock()\ndef do_work():\n    lock.acquire()\n    lock.acquire()\n    print('working')\n    lock.release()",
+      "oracle": "import threading\nlock = threading.Lock()\ndef do_work():\n    with lock:\n        print('working')"
+    },
+    "global_nonatomic": {
+      "type": "template",
+      "buggy": "count = 0\ndef add():\n    global count\n    count = count + 1",
+      "oracle": "import threading\ncount = 0\nlock = threading.Lock()\ndef add():\n    global count\n    with lock:\n        count += 1"
+    },
+    "thread_safe_list": {
+      "type": "template",
+      "buggy": "import threading\nitems = []\ndef append_item(item):\n    items.append(item)",
+      "oracle": "import threading\nitems = []\nlock = threading.Lock()\ndef append_item(item):\n    with lock:\n        items.append(item)"
+    },
+    "volatile_read": {
+      "type": "template",
+      "buggy": "import threading\nstop = False\ndef worker():\n    while not stop:\n        pass",
+      "oracle": "import threading\nstop = False\nlock = threading.Lock()\ndef worker():\n    while True:\n        with lock:\n            if stop:\n                break"
+    }
+  },
+  "hardest": {
+    "deadlock_order": {
+      "type": "template",
+      "buggy": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock2:\n        with lock1:\n            pass",
+      "oracle": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock1:\n        with lock2:\n            pass"
+    },
+    "nested_lock_timeout": {
+      "type": "template",
+      "buggy": "import threading\nlock = threading.Lock()\ndef work():\n    lock.acquire()\n    # critical section\n    lock.release()",
+      "oracle": "import threading\nlock = threading.Lock()\ndef work():\n    if lock.acquire(timeout=1):\n        try:\n            pass  # critical section\n        finally:\n            lock.release()"
+    },
+    "fork_join": {
+      "type": "template",
+      "buggy": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()",
+      "oracle": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()\nt.join()"
+    },
+    "mutex_release": {
+      "type": "template",
+      "buggy": "import threading\nlock = threading.Lock()\ndef thread_A():\n    lock.acquire()\n    lock.release()\ndef thread_B():\n    lock.release()",
+      "oracle": "import threading\nlock = threading.Lock()\ndef thread_A():\n    with lock:\n        pass\ndef thread_B():\n    with lock:\n        pass"
+    },
+    "race_on_init": {
+      "type": "template",
+      "buggy": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nprint(items)",
+      "oracle": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nt.join()\nprint(items)"
+    }
+  }
+}
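
Review note: the `template` entries in bugs.json carry Python source as JSON strings, so a snippet that does not parse (for example a `try:` whose body is only a comment) would only surface when the environment tries to run it. A minimal sketch of a pre-flight check follows. The helper name `find_broken_templates` and the `sample` data are hypothetical and not part of this commit; the sketch assumes only the `type` / `buggy` / `oracle` layout shown in the file above.

```python
def find_broken_templates(bugs: dict) -> list:
    """Return 'tier/bug_id/field' labels for template snippets that fail to compile."""
    broken = []
    for tier, entries in bugs.items():
        for bug_id, spec in entries.items():
            if spec.get("type") != "template":
                continue  # AST-based entries carry no source text to check
            for field in ("buggy", "oracle"):
                try:
                    # compile() only parses the snippet; it does not execute it.
                    compile(spec[field], f"<{tier}/{bug_id}/{field}>", "exec")
                except SyntaxError:
                    broken.append(f"{tier}/{bug_id}/{field}")
    return broken


# Hypothetical sample shaped like bugs.json: one valid entry, one broken one.
sample = {
    "hardest": {
        "fork_join": {
            "type": "template",
            "buggy": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()",
            "oracle": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()\nt.join()",
        },
        "comment_only_try": {
            "type": "template",
            "buggy": "x = 1",
            "oracle": "try:\n    # critical section\nfinally:\n    pass",
        },
    }
}

print(find_broken_templates(sample))
```

To check the real file, the same function could be called on the result of `json.load` for bugs.json, for instance from a unit test, so that an unparseable oracle fails fast instead of during an episode.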

client.py CHANGED

@@ -1,5 +1,5 @@
-# client.py – OpenEnv client entry point
-from environment import CodeReviewEnv
-# The OpenEnv framework will import this class as the environment.
 __all__ = ["CodeReviewEnv"]

+# client.py – OpenEnv client entry point
+from environment import CodeReviewEnv
+# The OpenEnv framework will import this class as the environment.
 __all__ = ["CodeReviewEnv"]

environment.py CHANGED

@@ -1,556 +1,613 @@
-# environment.py – FULLY CORRECTED RL Environment (TRUE Markov + Fixed Bugs)
-import sys
-import subprocess
-import tempfile
-import os
-import re
-from dataclasses import dataclass, field
-from typing import Tuple, Dict, Any, Optional, List
-from models import (
-    AnyAction, WriteComment, ProposeFix, Execute, Inspect,
-    RunLinter, RunTests, QueryDocs, Skip, Done, AskQuestion,
-    Observation, Reward, State
-)
-from redteam import RedTeam
-from test_runner import TestRunner
-from author import PersonaAuthor
-from rltool import ToolBox
-from rubrics import (
-    ToolUsageRubric,
-    TestDeltaRubric,
-    LintDeltaRubric,
-    TerminalSuccessRubric,
-    ExplorationRubric,
-    AntiHackingRubric,
-    StepPenaltyRubric,
-)
-# ======================================================================
-# FULLY MARKOV OBSERVATION (NOTHING HIDDEN)
-# ======================================================================
-@dataclass
-class EnhancedObservation:
-    code_snippet: str
-    last_tool_output: str
-    current_test_score: float
-    current_lint_score: float
-    negotiation_score: float
-    previous_test_score: float
-    previous_lint_score: float
-    author_confidence: float
-    author_threshold: float
-    step: int
-    max_steps: int
-    progress_ratio: float
-    tests_run: bool
-    linter_run: bool
-    docs_queried: bool
-    last_action_type: str
-    action_history: List[str]
-    done: bool
-    bug_description: str
-    comments_count: int
-    # default fields must be at the very end
-    author_response: str = ""
-# ======================================================================
-# HELPER FUNCTIONS
-# ======================================================================
-def execute_code(code: str, timeout_sec: int = 5) -> Tuple[bool, str, str]:
-    if not code.strip():
-        return False, "", "Error: Empty code"
-    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
-        f.write(code)
-        tmp_path = f.name
-    try:
-        result = subprocess.run(
-            [sys.executable, tmp_path],
-            capture_output=True,
-            text=True,
-            timeout=timeout_sec
-        )
-        success = (result.returncode == 0)
-        return success, result.stdout, result.stderr
-    except subprocess.TimeoutExpired:
-        return False, "", f"Timeout after {timeout_sec}s"
-    except Exception as e:
-        return False, "", f"Execution error: {str(e)}"
-    finally:
-        try:
-            os.unlink(tmp_path)
-        except:
-            pass
-# ======================================================================
-# ENHANCED CODE REVIEW ENVIRONMENT
-# ======================================================================
-@dataclass
-class CodeReviewEnv:
-    task: str = "easy"
-    max_steps: int = 10
-    step_penalty: float = 0.01
-    # Curriculum learning
-    auto_difficulty: bool = False
-    success_threshold: float = 0.7
-    # Reward shaping parameters
-    delta_weight: float = 0.3
-    tool_usage_bonus: float = 0.05
-    diversity_bonus: float = 0.03
-    _red_team: Optional[RedTeam] = field(init=False, default=None)
-    _author: Optional[PersonaAuthor] = field(init=False, default=None)
-    _current_code: str = field(init=False, default="")
-    _current_bug_id: str = field(init=False, default="")
-    _bug_description: str = field(init=False, default="")
-    _oracle_fix: str = field(init=False, default="")
-    _comments: list = field(init=False, default_factory=list)
-    _test_results: Optional[str] = field(init=False, default=None)
-    _lint_results: Optional[str] = field(init=False, default=None)
-    _doc_results: Optional[str] = field(init=False, default=None)
-    _step_count: int = field(init=False, default=0)
-    _done: bool = field(init=False, default=False)
-    # State tracking for dense rewards
-    _previous_test_score: float = field(init=False, default=0.0)
-    _previous_lint_score: float = field(init=False, default=0.0)
-    _current_test_score: float = field(init=False, default=0.0)
-    _current_lint_score: float = field(init=False, default=0.0)
-    # Tool usage tracking
-    _tests_run: bool = field(init=False, default=False)
-    _linter_run: bool = field(init=False, default=False)
-    _docs_queried: bool = field(init=False, default=False)
-    # Action history
-    _action_history: List[str] = field(init=False, default_factory=list)
-    _last_action_type: str = field(init=False, default="none")
-    # FIXED: Track CUMULATIVE episode reward
-    _episode_total_reward: float = field(init=False, default=0.0)
-    _episode_rewards: List[float] = field(init=False, default_factory=list)
-    _difficulty_level: int = field(init=False, default=0)
-    # ===================================================================
-    def __post_init__(self):
-        self.set_task(self.task)
-    # ===================================================================
-    def set_task(self, task: str):
-        if task not in ["easy", "medium", "hard", "harder", "hardest"]:
-            raise ValueError(f"Unknown task: {task}")
-        self.task = task
-        self._red_team = RedTeam(task)
-        self._author = PersonaAuthor()
-        self.rubrics = [
-            TestDeltaRubric(weight=self.delta_weight),
-            LintDeltaRubric(weight=self.delta_weight),
-            ToolUsageRubric(bonus=self.tool_usage_bonus),
-            TerminalSuccessRubric(),
-            ExplorationRubric(penalty=-0.05, bonus=self.diversity_bonus * 0.7),
-            AntiHackingRubric(),
-            StepPenaltyRubric(penalty=self.step_penalty),
-        ]
-        task_to_level = {
-            "easy": 0, "medium": 1, "hard": 2,
-            "harder": 3, "hardest": 4
-        }
-        self._difficulty_level = task_to_level[task]
-        self._reset_internal()
-    # ===================================================================
-    def _reset_internal(self):
-        self._step_count = 0                         # ← FIXED
-        self._comments = []
-        self._test_results = None
-        self._lint_results = None
-        self._doc_results = None
-        self._done = False
-        # Reset state tracking
-        self._previous_test_score = 0.0
-        self._previous_lint_score = 0.0
-        self._current_test_score = 0.0
-        self._current_lint_score = 0.0
-        self._tests_run = False
-        self._linter_run = False
-        self._docs_queried = False
-        self._action_history = []
-        self._last_action_type = "none"
-        # FIXED: Reset episode cumulative reward
-        self._episode_total_reward = 0.0
-        self._author.reset()
-        # Base tasks
-        if self.task == "easy":
-            original = "def get_user(id):\n    if id in users:\n        return users[id]"
-        elif self.task == "medium":
-            original = "def process_items(items):\n    for item in items:\n        print(item)"
-        elif self.task == "hard":
-            original = "def average(data):\n    if not data:\n        return 0\n    return sum(data) / len(data)"
-        elif self.task == "harder":
-            original = "counter = 0\ndef increment():\n    global counter\n    with lock:\n        counter += 1"
-        else:
-            original = "def safe_work():\n    with lock1:\n        with lock2:\n            do_work()"
-        buggy_code, bug_id, desc, oracle = self._red_team.inject_bug(original)
-        self._current_code = buggy_code
-        self._current_bug_id = bug_id
-        self._bug_description = desc
-        self._oracle_fix = oracle
-        self._comments.append(f"[RedTeam] {desc}")
-    # ===================================================================
-    def reset(self) -> EnhancedObservation:
-        """Reset with optional curriculum adjustment."""
-        if self.auto_difficulty and len(self._episode_rewards) > 0:
-            recent_performance = sum(self._episode_rewards[-5:]) / min(5, len(self._episode_rewards))
-            if recent_performance > self.success_threshold and self._difficulty_level < 4:
-                self._difficulty_level += 1
-                print(f"[Curriculum] Increasing difficulty to level {self._difficulty_level}")
-            elif recent_performance < 0.3 and self._difficulty_level > 0:
-                self._difficulty_level -= 1
-                print(f"[Curriculum] Decreasing difficulty to level {self._difficulty_level}")
-            level_to_task = {0: "easy", 1: "medium", 2: "hard", 3: "harder", 4: "hardest"}
-            self.task = level_to_task[self._difficulty_level]
-            self._red_team = RedTeam(self.task)
-        self._reset_internal()
-        return self._get_observation()
-    # ===================================================================
-    def _get_observation(self) -> EnhancedObservation:
-        """Return COMPLETE Markov state."""
-        # Compute author response: only after comment/question/fix does the author actually speak
-        if self._last_action_type in ("write_comment", "ask_question", "propose_fix"):
-            author_response = self._test_results or ""
-        else:
-            author_response = ""
-        return EnhancedObservation(
-            code_snippet=self._current_code,
-            last_tool_output=self._test_results or "",
-            author_response=author_response,          # ← now field exists
-            current_test_score=self._current_test_score,
-            current_lint_score=self._current_lint_score,
-            negotiation_score=self._author.get_negotiation_score(),
-            previous_test_score=self._previous_test_score,
-            previous_lint_score=self._previous_lint_score,
-            author_confidence=self._author._confidence,
-            author_threshold=self._author.thresholds.get(self._author.personality, 0.5),
-            step=self._step_count,
-            max_steps=self.max_steps,
-            progress_ratio=self._step_count / self.max_steps,
-            tests_run=self._tests_run,
-            linter_run=self._linter_run,
-            docs_queried=self._docs_queried,
-            last_action_type=self._last_action_type,
-            action_history=self._action_history[-5:],
-            done=self._done,
-            bug_description=self._bug_description,
-            comments_count=len(self._comments),
-        )
-    # ===================================================================
-    def _get_action_type(self, action: AnyAction) -> str:
-        """Extract action type as string."""
-        if isinstance(action, RunTests):
-            return "run_tests"
-        elif isinstance(action, RunLinter):
-            return "run_linter"
-        elif isinstance(action, QueryDocs):
-            return "query_docs"
-        elif isinstance(action, Execute):
-            return "execute"
-        elif isinstance(action, Inspect):
-            return "inspect"
-        elif isinstance(action, WriteComment):
-            return "write_comment"
-        elif isinstance(action, AskQuestion):
-            return "ask_question"
-        elif isinstance(action, ProposeFix):
-            return "propose_fix"
-        elif isinstance(action, Done):
-            return "done"
-        elif isinstance(action, Skip):
-            return "skip"
-        else:
-            return "unknown"
-    # ===================================================================
-    def step(self, action: AnyAction) -> Tuple[EnhancedObservation, Reward, bool, Dict[str, Any]]:
-        """
-        TRUE RL STEP with:
-        - Complete Markov observations (no hidden state)
-        - Dense intermediate rewards
-        - Delta-based credit assignment (no double-counting)
-        - Proper episode reward tracking
-        """
-        if self._done:
-            raise RuntimeError("Episode already finished")
-        # Store previous metrics for delta computation
-        self._previous_test_score = self._current_test_score
-        self._previous_lint_score = self._current_lint_score
-        base_reward = 0.0
-        action_type = self._get_action_type(action)
-        # Update action history
-        self._action_history.append(action_type)
-        self._last_action_type = action_type
-        # ==============================================================
-        # TOOL ACTIONS
-        # ==============================================================
-        if isinstance(action, Execute):
-            success, stdout, stderr = execute_code(self._current_code)
-            output = (stdout + stderr).strip() or "No output"
-            self._test_results = f"[Execute] {'Success' if success else 'Failed'}\n{output[:300]}"
-            base_reward = 0.001 if success else -0.05
-        elif isinstance(action, Inspect):
-            self._test_results = f"[Inspect]\n{self._current_code[:500]}"
-            base_reward = 0.001
-        elif isinstance(action, RunLinter):
-            lint_output = ToolBox.run_linter(self._current_code)
-            self._lint_results = lint_output[:500]
-            self._test_results = f"[Linter]\n{self._lint_results}"
-            self._current_lint_score = self._run_linter_score(self._current_code)
-            self._linter_run = True
-            base_reward = 0.002
-        elif isinstance(action, RunTests):
-            runner = TestRunner(self._current_bug_id)
-            score, output = runner.run_tests(self._current_code)
-            self._current_test_score = score
-            self._tests_run = True
-            self._test_results = f"[Tests] Score: {score:.2f}\n{output[:300]}"
-            base_reward = 0.002
-            if score > 0.8:
-                base_reward += 0.005
-        elif isinstance(action, QueryDocs):
-            doc = ToolBox.query_docs(action.query_topic)
-            self._doc_results = doc
-            self._test_results = f"[Docs]\n{doc[:400]}"
-            self._docs_queried = True
-            base_reward = 0.001
-        # ==============================================================
-        # COMMUNICATION ACTIONS
-        # ==============================================================
-        elif isinstance(action, WriteComment):
-            self._comments.append(f"Agent: {action.comment_text}")
-            response = self._author.respond(
-                agent_comment=action.comment_text,
-                test_results=self._test_results,
-                lint_results=self._lint_results,
-                doc_results=self._doc_results,
-                proposed_fix=None,
-                original_code=self._current_code
-            )
-            self._comments.append(f"Author: {response}")
-            self._test_results = f"[Comment] Author: {response[:200]}"
-            base_reward = 0.001
-        elif isinstance(action, AskQuestion):
-            self._comments.append(f"Agent: {action.question}")
-            response = self._author.respond(
-                agent_question=action.question,
-                test_results=self._test_results,
-                lint_results=self._lint_results,
-                doc_results=self._doc_results,
-                proposed_fix=None,
-                original_code=self._current_code                  # ← FIXED
-            )
-            self._comments.append(f"Author: {response}")
-            self._test_results = f"[Question] Author: {response[:200]}"
-            base_reward = 0.002
-        # ==============================================================
-        # FINAL FIX ACTION
-        # ==============================================================
-        elif isinstance(action, ProposeFix):
-            if not action.fix_code:
-                base_reward = -0.05
-                self._done = True
-            else:
-                # Save original code BEFORE overwriting (for author.respond)
-                original_buggy = self._current_code
-                self._current_code = action.fix_code
-                runner = TestRunner(self._current_bug_id)
-                test_score, test_output = runner.run_tests(self._current_code)
-                lint_score = self._run_linter_score(self._current_code)
-                negotiation_score = self._author.get_negotiation_score()
-                self._current_test_score = test_score
-                self._current_lint_score = lint_score
-                # Author gating – determines if the episode ends, reward is separate
-                threshold = self._author.thresholds.get(self._author.personality, 0.5)
-                if self._author._confidence < threshold:
-                    if self._step_count < self.max_steps:
-                        self._done = False
-                    else:
-                        self._done = True
-                else:
-                    self._done = True
-                # Get author's verbal feedback (pushback/acceptance)
-                author_feedback = self._author.respond(
-                    agent_comment=f"Proposed fix:\n{action.fix_code}",
-                    test_results=f"Score: {test_score:.2f}",
-                    lint_results=f"Score: {lint_score:.2f}",
-                    doc_results=self._doc_results,
-                    proposed_fix=action.fix_code,
-                    original_code=original_buggy   # now correctly the buggy code, not the fix
-                )
-                self._test_results = f"[Fix] Author: {author_feedback[:200]}"
-                self._comments.append(f"Author: {author_feedback}")
-                base_reward = 0.001   # rubrics provide the real signal
-        # ==============================================================
-        # TERMINATION ACTIONS
-        # ==============================================================
-        elif isinstance(action, Skip):
-            base_reward = -0.03
-            self._done = True
-        elif isinstance(action, Done):
-            if self._tests_run:
-                base_reward = self._current_test_score * 0.5 - 0.2
-            else:
-                base_reward = -0.04
-            self._done = True
-        else:
-            base_reward = -0.02
-            self._done = True
-        # ==============================================================
-        # STEP UPDATE (before rubric computation so info contains final step)
-        # ==============================================================
-        self._step_count += 1
-        if self._step_count >= self.max_steps:
-            self._done = True
-        # Get fresh observation (needed for rubrics that may read obs)
-        obs = self._get_observation()
-        # Prepare info dict (rubrics may need action_type and deltas)
-        info = {
-            "action_type": action_type,
-            "test_score": self._current_test_score,
-            "lint_score": self._current_lint_score,
-            "test_delta": self._current_test_score - self._previous_test_score,
-            "lint_delta": self._current_lint_score - self._previous_lint_score,
-            "base_reward": base_reward,
-        }
-        # ==============================================================
-        # COMPUTE FINAL REWARD USING RUBRICS
-        # ==============================================================
-        rubric_score = sum(r(self, action, obs, None, self._done, info) for r in self.rubrics)
-        final_reward = 0.4 * base_reward + rubric_score
-        final_reward = max(-1.0, min(1.0, final_reward))   # safety clip
-        # Track cumulative episode reward
-        self._episode_total_reward += final_reward
-        # Store episode total if done
-        if self._done:
-            self._episode_rewards.append(self._episode_total_reward)
-        # Complete info
-        info["final_reward"] = final_reward
-        info["episode_total"] = self._episode_total_reward
-        return obs, Reward(value=final_reward), self._done, info
-    # ===================================================================
-    def _run_linter_score(self, code: str) -> float:
-        """Run pylint and return normalized score [0, 1]."""
-        try:
-            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
-                f.write(code)
-                tmp_path = f.name
-            result = subprocess.run(
-                ['pylint', tmp_path, '--score=y', '--exit-zero'],
-                capture_output=True,
-                text=True,
-                timeout=5
-            )
-            match = re.search(r"rated at (\d+\.\d+)/10", result.stdout)
-            if match:
-                return float(match.group(1)) / 10.0
-            return 0.0
-        except:
-            return 0.0
-        finally:
-            try:
-                os.unlink(tmp_path)
-            except:
-                pass
-    # ===================================================================
-    def state(self) -> State:
-        """Legacy compatibility."""
-        return State(
-            pr_title="Code Review",
-            pr_description=self._bug_description,
-            code_snippet=self._current_code,
-            comments=self._comments.copy(),
-            test_results=self._test_results,
-            step=self._step_count,
-            done=self._done
-        )

+# environment.py – FULLY CORRECTED RL Environment (TRUE Markov + Fixed Bugs)
+import sys
+import subprocess
+import tempfile
+import os
+import re
+from dataclasses import dataclass, field
+from typing import Tuple, Dict, Any, Optional, List
+from models import (
+    AnyAction, WriteComment, ProposeFix, Execute, Inspect,
+    RunLinter, RunTests, QueryDocs, Skip, Done, AskQuestion,
+    Observation, Reward, State
+)
+from redteam import RedTeam
+from test_runner import TestRunner
+from author import PersonaAuthor
+from rltool import ToolBox
+from rubrics import (
+    ToolUsageRubric,
+    TestDeltaRubric,
+    LintDeltaRubric,
+    TerminalSuccessRubric,
+    ExplorationRubric,
+    AntiHackingRubric,
+    StepPenaltyRubric,
+)
+# ======================================================================
+# FULLY MARKOV OBSERVATION (NOTHING HIDDEN)
+# ======================================================================
+@dataclass
+class EnhancedObservation:
+    code_snippet: str
+    last_tool_output: str
+    current_test_score: float
+    current_lint_score: float
+    negotiation_score: float
+    previous_test_score: float
+    previous_lint_score: float
+    author_confidence: float
+    author_threshold: float
+    step: int
+    max_steps: int
+    progress_ratio: float
+    tests_run: bool
+    linter_run: bool
+    docs_queried: bool
+    last_action_type: str
+    action_history: List[str]
+    done: bool
+    bug_description: str
+    comments_count: int
+    # default fields must be at the very end
+    author_response: str = ""
+# ======================================================================
+# HELPER FUNCTIONS
+# ======================================================================
+def execute_code(code: str, timeout_sec: int = 5) -> Tuple[bool, str, str]:
+    if not code.strip():
+        return False, "", "Error: Empty code"
+    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
+        f.write(code)
+        tmp_path = f.name
+    try:
+        result = subprocess.run(
+            [sys.executable, tmp_path],
+            capture_output=True,
+            text=True,
+            timeout=timeout_sec
+        )
+        success = (result.returncode == 0)
+        return success, result.stdout, result.stderr
+    except subprocess.TimeoutExpired:
+        return False, "", f"Timeout after {timeout_sec}s"
+    except Exception as e:
+        return False, "", f"Execution error: {e}"
+    finally:
+        try:
+            os.unlink(tmp_path)
+        except OSError:
+            pass
+# ======================================================================
+# ENHANCED CODE REVIEW ENVIRONMENT
+# ======================================================================
+@dataclass
+class CodeReviewEnv:
+    task: str = "easy"
+    max_steps: int = 10
+    step_penalty: float = 0.01
+    reward_profile: str = "full"  # "full" or "core"
+    # Curriculum learning
+    auto_difficulty: bool = False
+    success_threshold: float = 0.7
+    # Reward shaping parameters
+    delta_weight: float = 0.3
+    tool_usage_bonus: float = 0.05
+    diversity_bonus: float = 0.03
+    _red_team: Optional[RedTeam] = field(init=False, default=None)
+    _author: Optional[PersonaAuthor] = field(init=False, default=None)
+    _current_code: str = field(init=False, default="")
+    _current_bug_id: str = field(init=False, default="")
+    _bug_description: str = field(init=False, default="")
+    _oracle_fix: str = field(init=False, default="")
+    _comments: list = field(init=False, default_factory=list)
+    _test_results: Optional[str] = field(init=False, default=None)
+    _lint_results: Optional[str] = field(init=False, default=None)
+    _doc_results: Optional[str] = field(init=False, default=None)
+    _step_count: int = field(init=False, default=0)
+    _done: bool = field(init=False, default=False)
+    # State tracking for dense rewards
+    _previous_test_score: float = field(init=False, default=0.0)
+    _previous_lint_score: float = field(init=False, default=0.0)
+    _current_test_score: float = field(init=False, default=0.0)
+    _current_lint_score: float = field(init=False, default=0.0)
+    # Tool usage tracking
+    _tests_run: bool = field(init=False, default=False)
+    _linter_run: bool = field(init=False, default=False)
+    _docs_queried: bool = field(init=False, default=False)
+    # Action history
+    _action_history: List[str] = field(init=False, default_factory=list)
+    _last_action_type: str = field(init=False, default="none")
+    _last_author_response: str = field(init=False, default="")
+    # FIXED: Track CUMULATIVE episode reward
+    _episode_total_reward: float = field(init=False, default=0.0)
+    _episode_rewards: List[float] = field(init=False, default_factory=list)
+    _difficulty_level: int = field(init=False, default=0)
+    # Bug-id bridge:
+    # RedTeam has fine-grained IDs, while TestRunner currently expects a
+    # smaller canonical set. Keep this mapping here so both modules can evolve
+    # independently without breaking evaluation.
+    _BUG_ID_CANONICAL_MAP = {
+        "division_by_zero_empty": "division_by_zero",
+        "division_by_zero_zero": "division_by_zero",
+        "sign_error": "wrong_operator",
+    }
+    # ===================================================================
+    def __post_init__(self):
+        self.set_task(self.task)
+    # ===================================================================
+    def _build_rubrics(self):
+        """
+        Build rubric stack from a named reward profile.
+        - full: richer shaping for exploration/tool-use behavior
+        - core: minimal stable signal for quick ablations/baselines
+        """
+        core_rubrics = [
+            TestDeltaRubric(weight=self.delta_weight),
+            LintDeltaRubric(weight=self.delta_weight),
+            TerminalSuccessRubric(),
+            StepPenaltyRubric(penalty=self.step_penalty),
+        ]
+        if self.reward_profile == "core":
+            return core_rubrics
+        if self.reward_profile == "full":
+            return [
+                *core_rubrics[:-1],  # step penalty appended at end for consistent ordering
+                ToolUsageRubric(bonus=self.tool_usage_bonus),
+                ExplorationRubric(penalty=-0.05, bonus=self.diversity_bonus * 0.7),
+                AntiHackingRubric(),
+                core_rubrics[-1],
+            ]
+        raise ValueError(f"Unknown reward_profile: {self.reward_profile}")
+    # ===================================================================
+    def set_task(self, task: str):
+        if task not in ["easy", "medium", "hard", "harder", "hardest"]:
+            raise ValueError(f"Unknown task: {task}")
+        self.task = task
+        # Use stochastic bug sampling across episodes; fixed seed here would
+        # repeatedly select the same bug and weaken training diversity.
+        self._red_team = RedTeam(task, seed=None)
+        self._author = PersonaAuthor()
+        self.rubrics = self._build_rubrics()
+        task_to_level = {
+            "easy": 0, "medium": 1, "hard": 2,
+            "harder": 3, "hardest": 4
+        }
+        self._difficulty_level = task_to_level[task]
+        self._reset_internal()
+    # ===================================================================
+    def _reset_internal(self):
+        self._step_count = 0                         # ← FIXED
+        self._comments = []
+        self._test_results = None
+        self._lint_results = None
+        self._doc_results = None
+        self._done = False
+        # Reset state tracking
+        self._previous_test_score = 0.0
+        self._previous_lint_score = 0.0
+        self._current_test_score = 0.0
+        self._current_lint_score = 0.0
+        self._tests_run = False
+        self._linter_run = False
+        self._docs_queried = False
+        self._action_history = []
+        self._last_action_type = "none"
+        self._last_author_response = ""
+        # FIXED: Reset episode cumulative reward
+        self._episode_total_reward = 0.0
+        self._author.reset()
+        # Base tasks
+        if self.task == "easy":
+            original = "def get_user(id):\n    if id in users:\n        return users[id]"
+        elif self.task == "medium":
+            original = "def process_items(items):\n    for item in items:\n        print(item)"
+        elif self.task == "hard":
+            original = "def average(data):\n    if not data:\n        return 0\n    return sum(data) / len(data)"
+        elif self.task == "harder":
+            original = "counter = 0\ndef increment():\n    global counter\n    with lock:\n        counter += 1"
+        else:
+            original = "def safe_work():\n    with lock1:\n        with lock2:\n            do_work()"
+        buggy_code, bug_id, desc, oracle = self._red_team.inject_bug(original)
+        self._current_code = buggy_code
+        self._current_bug_id = bug_id
+        self._bug_description = desc
+        self._oracle_fix = oracle
+        self._comments.append(f"[RedTeam] {desc}")
+    # ===================================================================
+    def reset(self) -> EnhancedObservation:
+        """Reset with optional curriculum adjustment."""
+        if self.auto_difficulty and len(self._episode_rewards) > 0:
+            recent_performance = sum(self._episode_rewards[-5:]) / min(5, len(self._episode_rewards))
+            if recent_performance > self.success_threshold and self._difficulty_level < 4:
+                self._difficulty_level += 1
+                print(f"[Curriculum] Increasing difficulty to level {self._difficulty_level}")
+            elif recent_performance < 0.3 and self._difficulty_level > 0:
+                self._difficulty_level -= 1
+                print(f"[Curriculum] Decreasing difficulty to level {self._difficulty_level}")
+            level_to_task = {0: "easy", 1: "medium", 2: "hard", 3: "harder", 4: "hardest"}
+            self.task = level_to_task[self._difficulty_level]
+            # Keep curriculum stochastic for better coverage within each level.
+            self._red_team = RedTeam(self.task, seed=None)
+        self._reset_internal()
+        return self._get_observation()
+    # ===================================================================
+    def _get_observation(self) -> EnhancedObservation:
+        """Return COMPLETE Markov state."""
+        # Keep the author's message separate from tool output.
+        # Using `_test_results` here can leak unrelated outputs (tests/linter/docs)
+        # and gives the policy a noisy signal for dialogue actions.
+        if self._last_action_type in ("write_comment", "ask_question", "propose_fix"):
+            author_response = self._last_author_response
+        else:
+            author_response = ""
+        return EnhancedObservation(
+            code_snippet=self._current_code,
+            last_tool_output=self._test_results or "",
+            author_response=author_response,          # ← now field exists
+            current_test_score=self._current_test_score,
+            current_lint_score=self._current_lint_score,
+            negotiation_score=self._author.get_negotiation_score(),
+            previous_test_score=self._previous_test_score,
+            previous_lint_score=self._previous_lint_score,
+            author_confidence=self._author._confidence,
+            author_threshold=self._author.thresholds.get(self._author.personality, 0.5),
+            step=self._step_count,
+            max_steps=self.max_steps,
+            # Guard against accidental `max_steps=0` configs.
+            progress_ratio=(self._step_count / self.max_steps) if self.max_steps > 0 else 1.0,
+            tests_run=self._tests_run,
+            linter_run=self._linter_run,
+            docs_queried=self._docs_queried,
+            last_action_type=self._last_action_type,
+            action_history=self._action_history[-5:],
+            done=self._done,
+            bug_description=self._bug_description,
+            comments_count=len(self._comments),
+        )
+    # ===================================================================
+    def _get_action_type(self, action: AnyAction) -> str:
+        """Extract action type as string."""
+        if isinstance(action, RunTests):
+            return "run_tests"
+        elif isinstance(action, RunLinter):
+            return "run_linter"
+        elif isinstance(action, QueryDocs):
+            return "query_docs"
+        elif isinstance(action, Execute):
+            return "execute"
+        elif isinstance(action, Inspect):
+            return "inspect"
+        elif isinstance(action, WriteComment):
+            return "write_comment"
+        elif isinstance(action, AskQuestion):
+            return "ask_question"
+        elif isinstance(action, ProposeFix):
+            return "propose_fix"
+        elif isinstance(action, Done):
+            return "done"
+        elif isinstance(action, Skip):
+            return "skip"
+        else:
+            return "unknown"
+    # ===================================================================
+    def _get_test_runner_bug_id(self) -> str:
+        """
+        Normalize RedTeam bug ids to the canonical ids understood by TestRunner.
+        Ids without an entry in the map are passed through unchanged.
+        """
+        return self._BUG_ID_CANONICAL_MAP.get(self._current_bug_id, self._current_bug_id)
+    # ===================================================================
+    def step(self, action: AnyAction) -> Tuple[EnhancedObservation, Reward, bool, Dict[str, Any]]:
+        """
+        TRUE RL STEP with:
+        - Complete Markov observations (no hidden state)
+        - Dense intermediate rewards
+        - Delta-based credit assignment (no double-counting)
+        - Proper episode reward tracking
+        """
+        if self._done:
+            raise RuntimeError("Episode already finished")
+        # Store previous metrics for delta computation
+        self._previous_test_score = self._current_test_score
+        self._previous_lint_score = self._current_lint_score
+        # Snapshot tool-usage flags BEFORE action mutates them.
+        # Rubrics use these to detect true "first-use" behavior.
+        prev_tests_run = self._tests_run
+        prev_linter_run = self._linter_run
+        prev_docs_queried = self._docs_queried
+        base_reward = 0.0
+        action_type = self._get_action_type(action)
+        # Update action history
+        self._action_history.append(action_type)
+        self._last_action_type = action_type
+        # ==============================================================
+        # TOOL ACTIONS
+        # ==============================================================
+        if isinstance(action, Execute):
+            success, stdout, stderr = execute_code(self._current_code)
+            output = (stdout + stderr).strip() or "No output"
+            self._test_results = f"[Execute] {'Success' if success else 'Failed'}\n{output[:300]}"
+            base_reward = 0.001 if success else -0.05
+        elif isinstance(action, Inspect):
+            self._test_results = f"[Inspect]\n{self._current_code[:500]}"
+            base_reward = 0.001
+        elif isinstance(action, RunLinter):
+            lint_output = ToolBox.run_linter(self._current_code)
+            self._lint_results = lint_output[:500]
+            self._test_results = f"[Linter]\n{self._lint_results}"
+            self._current_lint_score = self._run_linter_score(self._current_code)
+            self._linter_run = True
+            base_reward = 0.002
+        elif isinstance(action, RunTests):
+            runner = TestRunner(self._get_test_runner_bug_id())
+            score, output = runner.run_tests(self._current_code)
+            self._current_test_score = score
+            self._tests_run = True
+            self._test_results = f"[Tests] Score: {score:.2f}\n{output[:300]}"
+            base_reward = 0.002
+            if score > 0.8:
+                base_reward += 0.005
+        elif isinstance(action, QueryDocs):
+            # Normalize query to avoid rewarding empty/noisy requests.
+            query_topic = (action.query_topic or "").strip()
+            doc = ToolBox.query_docs(query_topic if query_topic else "general bug fixing")
+            self._doc_results = doc
+            self._test_results = f"[Docs]\n{doc[:400]}"
+            self._docs_queried = True
+            base_reward = 0.001
+        # ==============================================================
+        # COMMUNICATION ACTIONS
+        # ==============================================================
+        elif isinstance(action, WriteComment):
+            self._comments.append(f"Agent: {action.comment_text}")
+            response = self._author.respond(
+                agent_comment=action.comment_text,
+                test_results=self._test_results,
+                lint_results=self._lint_results,
+                doc_results=self._doc_results,
+                proposed_fix=None,
+                original_code=self._current_code
+            )
+            self._comments.append(f"Author: {response}")
+            self._last_author_response = response
+            self._test_results = f"[Comment] Author: {response[:200]}"
+            base_reward = 0.001
+        elif isinstance(action, AskQuestion):
+            self._comments.append(f"Agent: {action.question}")
+            response = self._author.respond(
+                agent_question=action.question,
+                test_results=self._test_results,
+                lint_results=self._lint_results,
+                doc_results=self._doc_results,
+                proposed_fix=None,
+                original_code=self._current_code                  # ← FIXED
+            )
+            self._comments.append(f"Author: {response}")
+            self._last_author_response = response
+            self._test_results = f"[Question] Author: {response[:200]}"
+            base_reward = 0.002
+        # ==============================================================
+        # FINAL FIX ACTION
+        # ==============================================================
+        elif isinstance(action, ProposeFix):
+            if not action.fix_code:
+                base_reward = -0.05
+                self._done = True
+            else:
+                # Save original code BEFORE overwriting (for author.respond)
+                original_buggy = self._current_code
+                self._current_code = action.fix_code
+                runner = TestRunner(self._get_test_runner_bug_id())
+                test_score, test_output = runner.run_tests(self._current_code)
+                lint_score = self._run_linter_score(self._current_code)
+                negotiation_score = self._author.get_negotiation_score()
+                self._current_test_score = test_score
+                self._current_lint_score = lint_score
+                # Author gating – determines if the episode ends, reward is separate
+                threshold = self._author.thresholds.get(self._author.personality, 0.5)
+                if self._author._confidence < threshold:
+                    if self._step_count < self.max_steps:
+                        self._done = False
+                    else:
+                        self._done = True
+                else:
+                    self._done = True
+                # Get author's verbal feedback (pushback/acceptance)
+                author_feedback = self._author.respond(
+                    agent_comment=f"Proposed fix:\n{action.fix_code}",
+                    test_results=f"Score: {test_score:.2f}",
+                    lint_results=f"Score: {lint_score:.2f}",
+                    doc_results=self._doc_results,
+                    proposed_fix=action.fix_code,
+                    original_code=original_buggy   # now correctly the buggy code, not the fix
+                )
+                self._test_results = f"[Fix] Author: {author_feedback[:200]}"
+                self._comments.append(f"Author: {author_feedback}")
+                self._last_author_response = author_feedback
+                base_reward = 0.001   # rubrics provide the real signal
+        # ==============================================================
+        # TERMINATION ACTIONS
+        # ==============================================================
+        elif isinstance(action, Skip):
+            base_reward = -0.03
+            self._done = True
+        elif isinstance(action, Done):
+            if self._tests_run:
+                base_reward = self._current_test_score * 0.5 - 0.2
+            else:
+                base_reward = -0.04
+            self._done = True
+        else:
+            base_reward = -0.02
+            self._done = True
+        # ==============================================================
+        # STEP UPDATE (before rubric computation so info contains final step)
+        # ==============================================================
+        self._step_count += 1
+        if self._step_count >= self.max_steps:
+            self._done = True
+        # Get fresh observation (needed for rubrics that may read obs)
+        obs = self._get_observation()
+        # Prepare info dict (rubrics may need action_type and deltas)
+        info = {
+            "action_type": action_type,
+            "test_score": self._current_test_score,
+            "lint_score": self._current_lint_score,
+            "test_delta": self._current_test_score - self._previous_test_score,
+            "lint_delta": self._current_lint_score - self._previous_lint_score,
+            "prev_tests_run": prev_tests_run,
+            "prev_linter_run": prev_linter_run,
+            "prev_docs_queried": prev_docs_queried,
+            "docs_query_len": len((action.query_topic or "").strip()) if isinstance(action, QueryDocs) else 0,
+            "base_reward": base_reward,
+        }
+        # ==============================================================
+        # COMPUTE FINAL REWARD USING RUBRICS
+        # ==============================================================
+        rubric_score = sum(r(self, action, obs, None, self._done, info) for r in self.rubrics)
+        final_reward = 0.4 * base_reward + rubric_score
+        final_reward = max(-1.0, min(1.0, final_reward))   # safety clip
+        # Track cumulative episode reward
+        self._episode_total_reward += final_reward
+        # Store episode total if done
+        if self._done:
+            self._episode_rewards.append(self._episode_total_reward)
+        # Complete info
+        info["final_reward"] = final_reward
+        info["episode_total"] = self._episode_total_reward
+        return obs, Reward(value=final_reward), self._done, info
+    # ===================================================================
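+    def _run_linter_score(self, code: str) -> float:
+        """Run pylint and return normalized score [0, 1]."""
+        # Initialised up front so the cleanup in `finally` never sees an unbound name.
+        tmp_path = None
+        try:
+            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
+                f.write(code)
+                tmp_path = f.name
+            result = subprocess.run(
+                ['pylint', tmp_path, '--score=y', '--exit-zero'],
+                capture_output=True,
+                text=True,
+                timeout=5
+            )
+            # Some pylint versions report negative ratings; accept a sign and clamp.
+            match = re.search(r"rated at (-?\d+(?:\.\d+)?)/10", result.stdout)
+            if match:
+                return max(0.0, min(1.0, float(match.group(1)) / 10.0))
+            return 0.0
+        except Exception:
+            # pylint missing, timeout, or temp-file failure: score conservatively.
+            return 0.0
+        finally:
+            if tmp_path is not None:
+                try:
+                    os.unlink(tmp_path)
+                except OSError:
+                    pass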
+    # ===================================================================
+    def state(self) -> State:
+        """Legacy compatibility."""
+        return State(
+            pr_title="Code Review",
+            pr_description=self._bug_description,
+            code_snippet=self._current_code,
+            comments=self._comments.copy(),
+            test_results=self._test_results,
+            step=self._step_count,
+            done=self._done
+        )
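
The curriculum rule inside `reset()` above can be read in isolation as a pure function. The following is a minimal standalone sketch of that logic, not code from `environment.py`: `next_difficulty`, `LEVEL_TO_TASK`, and the parameter names are hypothetical, while the values (promote above 0.7, demote below 0.3, window of 5 episodes, levels 0 to 4) are taken from the diff.

```python
from typing import List

# Same level-to-task mapping that reset() builds inline.
LEVEL_TO_TASK = {0: "easy", 1: "medium", 2: "hard", 3: "harder", 4: "hardest"}


def next_difficulty(
    episode_rewards: List[float],
    level: int,
    success_threshold: float = 0.7,
    failure_threshold: float = 0.3,
    window: int = 5,
    max_level: int = 4,
) -> int:
    """Return the difficulty level for the next episode (hypothetical helper)."""
    if not episode_rewards:
        return level  # no history yet: keep the current level
    recent = episode_rewards[-window:]
    recent_performance = sum(recent) / len(recent)
    if recent_performance > success_threshold and level < max_level:
        return level + 1  # promote, capped at max_level
    if recent_performance < failure_threshold and level > 0:
        return level - 1  # demote, floored at 0
    return level


if __name__ == "__main__":
    level = next_difficulty([0.9, 0.8, 0.9], level=0)
    print(level, LEVEL_TO_TASK[level])
```

Each entry in `_episode_rewards` is a cumulative episode total of per-step rewards clipped to [-1, 1], so the thresholds apply to whole-episode returns rather than per-step averages.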

grader.py CHANGED

@@ -1,148 +1,142 @@
-# grader.py – Production‑grade, continuous reward, exploit‑aware
-import ast
-import subprocess
-import tempfile
-import os
-import re
-import sys
-import json
-from dataclasses import dataclass
-from typing import Optional
-@dataclass
-class RigorousGrader:
-    bug_id: str
-    oracle_code: Optional[str] = None
-    def grade_fix(self, proposed_fix: str) -> float:
-        """Returns a smooth reward in [0,1]."""
-        # Syntax check
-        try:
-            ast.parse(proposed_fix)
-        except SyntaxError:
-            return 0.0
-        # Exploit detection (optional)
-        if self._is_exploit(proposed_fix):
-            return 0.0
-        # Continuous test score
-        test_score = self._run_continuous_tests(proposed_fix)
-        # Lint score
-        lint_score = self._get_lint_score(proposed_fix)
-        # Oracle similarity
-        oracle_score = self._ast_similarity(proposed_fix) if self.oracle_code else 0.0
-        # Weighted combination
-        final = (0.5 * test_score) + (0.3 * lint_score) + (0.2 * oracle_score)
-        return max(0.0, min(1.0, final))
-    def _run_continuous_tests(self, code: str) -> float:
-        """Proportion of passed test cases."""
-        test_cases = self._get_test_cases()
-        if not test_cases:
-            return 0.0
-        passed = 0
-        for test_input, expected in test_cases:
-            if self._run_single_test(code, test_input, expected):
-                passed += 1
-        return passed / len(test_cases)
-    def _get_test_cases(self) -> list:
-        """Define multiple test cases per bug type."""
-        if self.bug_id == "null_check":
-            return [
-                ({"users": {"alice": "Alice"}, "id": "bob"}, None),
-                ({"users": {"alice": "Alice"}, "id": "alice"}, "Alice"),
-            ]
-        elif self.bug_id == "off_by_one":
-            return [
-                ([1, 2, 3, 4], 4),
-                ([], 0),
-            ]
-        # Extend for other bugs …
-        return []
-    def _run_single_test(self, code: str, test_input, expected) -> bool:
-        """Execute the agent's code with test_input and compare to expected."""
-        harness = f"""
-import json
-{code}
-try:
-    result = fix({json.dumps(test_input)})
-    print(json.dumps({{"ok": True, "result": result}}))
-except Exception as e:
-    print(json.dumps({{"ok": False, "error": str(e)}}))
-"""
-        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
-            f.write(harness)
-            tmp_path = f.name
-        try:
-            result = subprocess.run(
-                [sys.executable, tmp_path],
-                capture_output=True, text=True, timeout=5
-            )
-            data = json.loads(result.stdout.strip())
-            if data.get("ok") and data["result"] == expected:
-                return True
-            return False
-        except (json.JSONDecodeError, subprocess.TimeoutExpired, Exception):
-            return False
-        finally:
-            try:
-                os.unlink(tmp_path)
-            except:
-                pass
-    def _is_exploit(self, code: str) -> bool:
-        """Detect hardcoded returns or trivial bypasses."""
-        lower = code.lower()
-        if "return 0" in lower and "if" not in lower:
-            return True
-        if code.strip() == "":
-            return True
-        return False
-    def _get_lint_score(self, code: str) -> float:
-        """Continuous lint score, fallback 0.0 on error."""
-        try:
-            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
-                f.write(code)
-                f.flush()
-                tmp_path = f.name
-            result = subprocess.run(
-                ['pylint', tmp_path, '--score=y', '--exit-zero'],
-                capture_output=True,
-                text=True,
-                timeout=5
-            )
-            match = re.search(r"rated at (\d+\.\d+)/10", result.stdout)
-            if match:
-                score = float(match.group(1)) / 10.0
-            else:
-                score = 0.0
-            return max(0.0, min(1.0, score))
-        except Exception:
-            return 0.0
-        finally:
-            try:
-                os.unlink(tmp_path)
-            except:
-                pass
-    def _ast_similarity(self, proposed_code: str) -> float:
-        """Structural similarity to oracle."""
-        if not self.oracle_code:
-            return 0.0
-        try:
-            tree_prop = ast.parse(proposed_code)
-            tree_oracle = ast.parse(self.oracle_code)
-            nodes_prop = [type(n) for n in ast.walk(tree_prop)]
-            nodes_oracle = [type(n) for n in ast.walk(tree_oracle)]
-            common = sum(1 for n in nodes_prop if n in nodes_oracle)
-            total = max(len(nodes_prop), len(nodes_oracle))
-            return common / total if total > 0 else 0.0
-        except:
             return 0.0

+# grader.py – Production‑grade, continuous reward, exploit‑aware; example of monolithic scoring
+import ast
+import subprocess
+import tempfile
+import os
+import re
+from dataclasses import dataclass
+from typing import Optional
+@dataclass
+class RigorousGrader:
+    bug_id: str
+    oracle_code: Optional[str] = None
+    def grade_fix(self, proposed_fix: str) -> float:
+        """
+        Returns a smooth reward in [0,1] based on:
+        - Syntax validity
+        - Proportion of tests passed (continuous, not binary)
+        - Lint quality (with conservative fallback)
+        - Structural similarity to oracle (anti‑gaming)
+        - Exploit detection (hardcoded outputs / no real change)
+        """
+        # 1. Syntax check (binary – non‑negotiable)
+        try:
+            ast.parse(proposed_fix)
+        except SyntaxError:
+            return 0.0   # hard zero, not negative (RL stable)
+        # 2. Exploit detection: trivial or hardcoded fixes
+        if self._is_exploit(proposed_fix):
+            return 0.0
+        # 3. Continuous test score (proportion of passed test cases)
+        test_score = self._run_continuous_tests(proposed_fix)
+        # 4. Lint score (continuous, fallback 0.0 not 0.5)
+        lint_score = self._get_lint_score(proposed_fix)
+        # 5. Oracle similarity (structural, not gameable)
+        oracle_score = self._ast_similarity(proposed_fix) if self.oracle_code else 0.0
+        # Weighted combination (all continuous)
+        final = (0.5 * test_score) + (0.3 * lint_score) + (0.2 * oracle_score)
+        return max(0.0, min(1.0, final))
+    def _run_continuous_tests(self, code: str) -> float:
+        """
+        Returns proportion of passed tests (0.0 to 1.0).
+        Uses multiple test cases per bug type.
+        """
+        test_cases = self._get_test_cases()
+        if not test_cases:
+            return 0.0
+        passed = 0
+        for test_input, expected in test_cases:
+            if self._run_single_test(code, test_input, expected):
+                passed += 1
+        return passed / len(test_cases)
+    def _get_test_cases(self) -> list:
+        """Define multiple test cases for each bug type."""
+        if self.bug_id == "null_check":
+            return [
+                ({"users": {"alice": "Alice"}, "id": "bob"}, None),  # should not crash
+                ({"users": {"alice": "Alice"}, "id": "alice"}, "Alice"),
+            ]
+        elif self.bug_id == "off_by_one":
+            return [
+                ([1,2,3,4], 4),   # should count all elements
+                ([], 0),
+            ]
+        # Add more for other bugs...
+        return []
+    def _run_single_test(self, code: str, test_input, expected) -> bool:
+        """Execute code with given input and compare output."""
+        # Simplified – in production, use a safe sandbox
+        try:
+            # Inject test harness (this is a placeholder)
+            exec_globals = {}
+            exec(code, exec_globals)
+            # Call the function (assume it's named appropriately)
+            # This is highly simplified; real implementation would need more care.
+            return True  # placeholder
+        except Exception:
+            return False
+    def _is_exploit(self, code: str) -> bool:
+        """Detect hardcoded returns or trivial bypasses."""
+        lower = code.lower()
+        # Hardcoded return for a specific input
+        if "return 0" in lower and "if" not in lower:
+            return True
+        # No change at all (same as original placeholder)
+        if code.strip() == "":
+            return True
+        return False
+    def _get_lint_score(self, code: str) -> float:
+        """Continuous lint score, fallback 0.0 on error."""
+        # Initialised up front so the cleanup in `finally` never sees an unbound name.
+        tmp_path = None
+        try:
+            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
+                f.write(code)
+                tmp_path = f.name
+            result = subprocess.run(
+                ['pylint', tmp_path, '--score=y', '--exit-zero'],
+                capture_output=True,
+                text=True,
+                timeout=5
+            )
+            # Some pylint versions report negative ratings; accept a sign and clamp below.
+            match = re.search(r"rated at (-?\d+(?:\.\d+)?)/10", result.stdout)
+            if match:
+                score = float(match.group(1)) / 10.0
+            else:
+                score = 0.0   # was 0.5 – now conservative
+            return max(0.0, min(1.0, score))
+        except Exception:
+            return 0.0
+        finally:
+            if tmp_path is not None:
+                try:
+                    os.unlink(tmp_path)
+                except OSError:
+                    pass
+    def _ast_similarity(self, proposed_code: str) -> float:
+        """Crude structural similarity to the oracle, based on shared AST node types."""
+        if not self.oracle_code:
+            return 0.0
+        try:
+            tree_prop = ast.parse(proposed_code)
+            tree_oracle = ast.parse(self.oracle_code)
+            # Count matching node types (crude but simple)
+            nodes_prop = [type(n) for n in ast.walk(tree_prop)]
+            nodes_oracle = [type(n) for n in ast.walk(tree_oracle)]
+            common = sum(1 for n in nodes_prop if n in nodes_oracle)
+            total = max(len(nodes_prop), len(nodes_oracle))
+            return common / total if total > 0 else 0.0
+        except Exception:
             return 0.0
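
The new `_run_single_test` above is a placeholder: it returns `True` whenever the submitted code executes, so the test component of the score does not yet depend on `expected`. The sketch below shows one way to make the comparison real, adapted from the subprocess harness in the removed version of this file. It assumes the submitted code defines a function named `fix` that takes one JSON-serializable argument; `run_single_test` here is a hypothetical free function, not the author's method. A child process with a timeout gives process isolation only and is not a security sandbox.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_single_test(code: str, test_input, expected, timeout_sec: int = 5) -> bool:
    """Run `code` in a child interpreter and compare fix(test_input) to expected.

    Assumption: the submitted code defines a callable named `fix`.
    """
    # repr() of the JSON text yields a Python string literal that is safe to embed.
    payload = repr(json.dumps(test_input))
    harness = (
        "import json\n"
        f"{code}\n"
        f"_arg = json.loads({payload})\n"
        "try:\n"
        "    print(json.dumps({'ok': True, 'result': fix(_arg)}))\n"
        "except Exception as exc:\n"
        "    print(json.dumps({'ok': False, 'error': str(exc)}))\n"
    )
    tmp_path = None
    try:
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".py", delete=False, encoding="utf-8"
        ) as f:
            f.write(harness)
            tmp_path = f.name
        proc = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True, text=True, timeout=timeout_sec,
        )
        lines = proc.stdout.strip().splitlines()
        if not lines:
            return False  # e.g. syntax error: nothing was printed to stdout
        data = json.loads(lines[-1])  # the harness report is the last line
        if not isinstance(data, dict):
            return False
        return bool(data.get("ok")) and data.get("result") == expected
    except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError):
        return False
    finally:
        if tmp_path is not None:
            try:
                os.unlink(tmp_path)
            except OSError:
                pass


if __name__ == "__main__":
    print(run_single_test("def fix(xs):\n    return len(xs)", [1, 2, 3, 4], 4))
```

Results cross the process boundary as JSON, so only JSON-representable return values can be compared; a fix that returns anything else is reported as a failure.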

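
`_ast_similarity` in the grader diff counts a proposed node as matching if its type occurs anywhere in the oracle, so repeating a statement that the oracle also contains can push the score to 1.0. One alternative is a multiset overlap of node types. The sketch below uses `collections.Counter` for that; `ast_type_overlap` is a hypothetical name, and this is a swapped-in variant for comparison, not the grader's current metric.

```python
import ast
from collections import Counter


def ast_type_overlap(proposed_code: str, oracle_code: str) -> float:
    """Multiset overlap of AST node types, in [0, 1] (hypothetical helper)."""
    try:
        prop = Counter(type(n).__name__ for n in ast.walk(ast.parse(proposed_code)))
        oracle = Counter(type(n).__name__ for n in ast.walk(ast.parse(oracle_code)))
    except (SyntaxError, ValueError):
        return 0.0  # unparseable input scores zero, as in the grader
    # Counter intersection keeps the per-type minimum count.
    common = sum((prop & oracle).values())
    total = max(sum(prop.values()), sum(oracle.values()))
    return common / total if total else 0.0


if __name__ == "__main__":
    oracle = "def f(xs):\n    return sum(xs) / len(xs) if xs else 0\n"
    print(ast_type_overlap(oracle, oracle))
    print(ast_type_overlap("x = 1\nx = 1\nx = 1\n", "x = 1\n"))
```

Like the original, this compares node types only and ignores names, constants, and tree shape, so it remains a coarse signal.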
models.py CHANGED

@@ -1,115 +1,88 @@
-# models.py – Typed Models (Discriminated Unions, POMDP Separation)
-from typing import Literal, Union, Annotated, Optional
-from dataclasses import dataclass
-from pydantic import BaseModel, Field, TypeAdapter, field_validator
-# ----------------------------------------------------------------------
-# Action classes (discriminated union – short names as used by agent)
-# ----------------------------------------------------------------------
-class Action(BaseModel):
-    action_type: Literal["comment", "skip", "done", "question",
-                         "fix", "execute", "inspect", "run_linter",
-                         "run_tests", "query_docs"]
-class WriteComment(Action):
-    action_type: Literal["comment"] = "comment"
-    comment_text: str = Field(..., min_length=1)
-class Skip(Action):
-    action_type: Literal["skip"] = "skip"
-class Done(Action):
-    action_type: Literal["done"] = "done"
-class AskQuestion(Action):
-    action_type: Literal["question"] = "question"
-    question: str = Field(..., min_length=1)
-class ProposeFix(Action):
-    action_type: Literal["fix"] = "fix"
-    fix_code: str = Field(..., min_length=1)
-    @field_validator('fix_code')
-    @classmethod
-    def not_empty(cls, v: str) -> str:
-        if not v.strip():
-            raise ValueError('fix_code cannot be empty')
-        return v
-class Execute(Action):
-    action_type: Literal["execute"] = "execute"
-class Inspect(Action):
-    action_type: Literal["inspect"] = "inspect"
-class RunLinter(Action):
-    action_type: Literal["run_linter"] = "run_linter"
-class RunTests(Action):
-    action_type: Literal["run_tests"] = "run_tests"
-class QueryDocs(Action):
-    action_type: Literal["query_docs"] = "query_docs"
-    query_topic: str = Field(..., min_length=1)
-# ----------------------------------------------------------------------
-# FREE FUNCTION – map agent action strings to typed actions
-# (to be called from training.py where AgentAction is available)
-# ----------------------------------------------------------------------
-def map_to_env(action_type: str, content: str) -> Action:
-    """Convert a parsed agent action (type + content) into a Pydantic Action."""
-    if action_type == "run_tests":
-        return RunTests()
-    elif action_type == "run_linter":
-        return RunLinter()
-    elif action_type == "inspect":
-        return Inspect()
-    elif action_type == "fix":
-        return ProposeFix(fix_code=content or "")
-    elif action_type == "comment":
-        return WriteComment(comment_text=content or "")
-    elif action_type == "question":
-        return AskQuestion(question=content or "")
-    elif action_type == "query_docs":
-        return QueryDocs(query_topic=content or "")
-    elif action_type == "done":
-        return Done()
-    else:
-        return Skip()
-# Discriminated union for one‑line polymorphic deserialization
-AnyAction = Annotated[
-    Union[WriteComment, Skip, Done, AskQuestion, ProposeFix,
-          Execute, Inspect, RunLinter, RunTests, QueryDocs],
-    Field(discriminator='action_type')
-]
-action_adapter = TypeAdapter(AnyAction)
-# ----------------------------------------------------------------------
-# Observation (POMDP – what the agent sees)
-# ----------------------------------------------------------------------
-@dataclass(slots=True)
-class Observation:
-    code_snippet: str
-    last_tool_output: str = ""
-    step: int = 0
-    done: bool = False
-# ----------------------------------------------------------------------
-# Reward (lightweight)
-# ----------------------------------------------------------------------
-@dataclass(slots=True)
-class Reward:
-    value: float
-# ----------------------------------------------------------------------
-# State (full environment state – not exposed to agent)
-# ----------------------------------------------------------------------
-@dataclass(slots=True)
-class State:
-    pr_title: str
-    pr_description: str
-    code_snippet: str
-    comments: list[str]
-    test_results: Optional[str]
-    step: int
     done: bool

+# models.py – Typed Models (Discriminated Unions, POMDP Separation)
+from typing import Literal, Union, Annotated, Optional
+from pydantic import BaseModel, Field, TypeAdapter, field_validator
+# ----------------------------------------------------------------------
+# Action classes (discriminated union)
+# ----------------------------------------------------------------------
+class Action(BaseModel):
+    action_type: Literal["write_comment", "skip", "done", "ask_question",
+                         "propose_fix", "execute", "inspect", "run_linter",
+                         "run_tests", "query_docs"]
+class WriteComment(Action):
+    action_type: Literal["write_comment"] = "write_comment"
+    comment_text: str = Field(..., min_length=1)
+class Skip(Action):
+    action_type: Literal["skip"] = "skip"
+class Done(Action):
+    action_type: Literal["done"] = "done"
+class AskQuestion(Action):
+    action_type: Literal["ask_question"] = "ask_question"
+    question: str = Field(..., min_length=1)
+class ProposeFix(Action):
+    action_type: Literal["propose_fix"] = "propose_fix"
+    fix_code: str = Field(..., min_length=1)
+    @field_validator('fix_code')
+    @classmethod
+    def not_empty(cls, v: str) -> str:
+        if not v.strip():
+            raise ValueError('fix_code cannot be empty')
+        return v
+class Execute(Action):
+    action_type: Literal["execute"] = "execute"
+class Inspect(Action):
+    action_type: Literal["inspect"] = "inspect"
+class RunLinter(Action):
+    action_type: Literal["run_linter"] = "run_linter"
+class RunTests(Action):
+    action_type: Literal["run_tests"] = "run_tests"
+class QueryDocs(Action):
+    action_type: Literal["query_docs"] = "query_docs"
+    query_topic: str = Field(..., min_length=1)
+# Discriminated union for one‑line polymorphic deserialization
+AnyAction = Annotated[
+    Union[WriteComment, Skip, Done, AskQuestion, ProposeFix,
+          Execute, Inspect, RunLinter, RunTests, QueryDocs],
+    Field(discriminator='action_type')
+]
+action_adapter = TypeAdapter(AnyAction)
+# ----------------------------------------------------------------------
+# Observation (POMDP – what the agent sees)
+# ----------------------------------------------------------------------
+class Observation(BaseModel):
+    # Base schema model used by API metadata endpoints.
+    # Keep this lightweight for compatibility with legacy callers.
+    code_snippet: str
+    last_tool_output: str = ""
+    step: int = 0
+    done: bool = False
+# ----------------------------------------------------------------------
+# Reward (lightweight)
+# ----------------------------------------------------------------------
+class Reward(BaseModel):
+    value: float
+# ----------------------------------------------------------------------
+# State (full environment state – not exposed to agent)
+# ----------------------------------------------------------------------
+class State(BaseModel):
+    pr_title: str
+    pr_description: str
+    code_snippet: str
+    comments: list[str]
+    test_results: Optional[str]
+    step: int
     done: bool

openenv.yaml CHANGED

@@ -1,135 +1,135 @@
-# openenv.yaml – Environment metadata for OpenEnv
-name: CodeReview-Professional-Workflow
-version: 1.0.0
-description: |
-  Multi‑turn code review environment for professional tasks.
-  Agent must inspect, test, lint, query docs, and negotiate with a simulated author
-  to fix injected bugs. Supports DPO training on full trajectories.
-author: yuvraj gupta
-license: MIT
-# ----------------------------------------------------------------------
-# Tasks (difficulty progression)
-# ----------------------------------------------------------------------
-tasks:
-  - id: easy
-    description: "Fix missing null check in a dictionary lookup"
-  - id: medium
-    description: "Improve loop efficiency (replace range(len) with direct iteration)"
-  - id: hard
-    description: "Handle division by zero in average calculation"
-  - id: harder
-    description: "Fix race condition by adding a lock"
-  - id: hardest
-    description: "Resolve potential deadlock by standardising lock order"
-# ----------------------------------------------------------------------
-# Observation space (complete Markov state – agent sees everything)
-# ----------------------------------------------------------------------
-observation_space:
-  type: object
-  properties:
-    code_snippet:
-      type: string
-      description: "Current code snippet (may contain injected bug)"
-    last_tool_output:
-      type: string
-      description: "Raw output from last tool (test runner, linter, etc.)"
-    author_response:
-      type: string
-      description: "Latest feedback from the simulated human developer"
-    current_test_score:
-      type: number
-      description: "Proportion of tests passed (0.0–1.0)"
-    current_lint_score:
-      type: number
-      description: "Normalised pylint score (0.0–1.0)"
-    negotiation_score:
-      type: number
-      description: "Author's confidence minus pushback penalty"
-    previous_test_score:
-      type: number
-      description: "Test score before the last action"
-    previous_lint_score:
-      type: number
-      description: "Lint score before the last action"
-    author_confidence:
-      type: number
-      description: "Internal belief of the author (0.0–1.0)"
-    author_threshold:
-      type: number
-      description: "Confidence threshold for this personality"
-    step:
-      type: integer
-      description: "Current step number"
-    max_steps:
-      type: integer
-      description: "Maximum steps allowed in the episode"
-    progress_ratio:
-      type: number
-      description: "step / max_steps"
-    tests_run:
-      type: boolean
-      description: "Whether the agent has run tests at least once"
-    linter_run:
-      type: boolean
-      description: "Whether the agent has run the linter at least once"
-    docs_queried:
-      type: boolean
-      description: "Whether the agent has queried documentation"
-    last_action_type:
-      type: string
-      description: "String name of the last executed action"
-    action_history:
-      type: array
-      items:
-        type: string
-      description: "Last 5 action types"
-    done:
-      type: boolean
-      description: "Whether the episode has finished"
-    bug_description:
-      type: string
-      description: "Short description of the injected bug"
-    comments_count:
-      type: integer
-      description: "Number of comments exchanged so far"
-# ----------------------------------------------------------------------
-# Action space (short names as produced by the agent)
-# ----------------------------------------------------------------------
-action_space:
-  type: object
-  properties:
-    action_type:
-      type: string
-      enum:
-        - comment
-        - skip
-        - done
-        - question
-        - fix
-        - execute
-        - inspect
-        - run_linter
-        - run_tests
-        - query_docs
-    comment_text:
-      type: string
-      description: "Required for comment"
-    question:
-      type: string
-      description: "Required for question"
-    fix_code:
-      type: string
-      description: "Required for fix"
-    query_topic:
-      type: string
-      description: "Required for query_docs"
-# ----------------------------------------------------------------------
-# (Optional) Server configuration – used by openenv serve
-# ----------------------------------------------------------------------
-server:
-  app: server.app:app
-  port: 7860

+# openenv.yaml – Environment metadata for OpenEnv
+name: CodeReview-Professional-Workflow
+version: 1.0.0
+description: |
+  Multi‑turn code review environment for professional tasks.
+  Agent must inspect, test, lint, query docs, and negotiate with a simulated author
+  to fix injected bugs. Supports DPO training on full trajectories.
+author: yuvraj gupta
+license: MIT
+# ----------------------------------------------------------------------
+# Tasks (difficulty progression)
+# ----------------------------------------------------------------------
+tasks:
+  - id: easy
+    description: "Fix missing null check in a dictionary lookup"
+  - id: medium
+    description: "Improve loop efficiency (replace range(len) with direct iteration)"
+  - id: hard
+    description: "Handle division by zero in average calculation"
+  - id: harder
+    description: "Fix race condition by adding a lock"
+  - id: hardest
+    description: "Resolve potential deadlock by standardising lock order"
+# ----------------------------------------------------------------------
+# Observation space (complete Markov state – agent sees everything)
+# ----------------------------------------------------------------------
+observation_space:
+  type: object
+  properties:
+    code_snippet:
+      type: string
+      description: "Current code snippet (may contain injected bug)"
+    last_tool_output:
+      type: string
+      description: "Raw output from last tool (test runner, linter, etc.)"
+    author_response:
+      type: string
+      description: "Latest feedback from the simulated human developer"
+    current_test_score:
+      type: number
+      description: "Proportion of tests passed (0.0–1.0)"
+    current_lint_score:
+      type: number
+      description: "Normalised pylint score (0.0–1.0)"
+    negotiation_score:
+      type: number
+      description: "Author's confidence minus pushback penalty"
+    previous_test_score:
+      type: number
+      description: "Test score before the last action"
+    previous_lint_score:
+      type: number
+      description: "Lint score before the last action"
+    author_confidence:
+      type: number
+      description: "Internal belief of the author (0.0–1.0)"
+    author_threshold:
+      type: number
+      description: "Confidence threshold for this personality"
+    step:
+      type: integer
+      description: "Current step number"
+    max_steps:
+      type: integer
+      description: "Maximum steps allowed in the episode"
+    progress_ratio:
+      type: number
+      description: "step / max_steps"
+    tests_run:
+      type: boolean
+      description: "Whether the agent has run tests at least once"
+    linter_run:
+      type: boolean
+      description: "Whether the agent has run the linter at least once"
+    docs_queried:
+      type: boolean
+      description: "Whether the agent has queried documentation"
+    last_action_type:
+      type: string
+      description: "String name of the last executed action"
+    action_history:
+      type: array
+      items:
+        type: string
+      description: "Last 5 action types"
+    done:
+      type: boolean
+      description: "Whether the episode has finished"
+    bug_description:
+      type: string
+      description: "Short description of the injected bug"
+    comments_count:
+      type: integer
+      description: "Number of comments exchanged so far"
+# ----------------------------------------------------------------------
+# Action space (short names as produced by the agent)
+# ----------------------------------------------------------------------
+action_space:
+  type: object
+  properties:
+    action_type:
+      type: string
+      enum:
+        - comment
+        - skip
+        - done
+        - question
+        - fix
+        - execute
+        - inspect
+        - run_linter
+        - run_tests
+        - query_docs
+    comment_text:
+      type: string
+      description: "Required for comment"
+    question:
+      type: string
+      description: "Required for question"
+    fix_code:
+      type: string
+      description: "Required for fix"
+    query_topic:
+      type: string
+      description: "Required for query_docs"
+# ----------------------------------------------------------------------
+# (Optional) Server configuration – used by openenv serve
+# ----------------------------------------------------------------------
+server:
+  app: server.app:app
+  port: 7860
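
The `action_space` enum above still lists the short names (`comment`, `question`, `fix`), while the updated `models.py` declares `write_comment`, `ask_question`, and `propose_fix` as `action_type` literals. A payload using a short name would presumably not match any member of the discriminated union. Below is a minimal sketch of a translation step, assuming one is wanted at the boundary. `SHORT_TO_MODEL_NAME` and `normalise_action` are hypothetical names, not part of the repository.

```python
from typing import Any, Dict

# Hypothetical alias table: short names from openenv.yaml mapped to the
# action_type literals declared in models.py. Names that are already
# identical in both files (skip, done, run_tests, ...) are omitted.
SHORT_TO_MODEL_NAME: Dict[str, str] = {
    "comment": "write_comment",
    "question": "ask_question",
    "fix": "propose_fix",
}


def normalise_action(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of payload whose action_type uses the models.py spelling."""
    normalised = dict(payload)  # shallow copy; the caller's dict is left untouched
    action_type = normalised.get("action_type")
    if isinstance(action_type, str) and action_type in SHORT_TO_MODEL_NAME:
        normalised["action_type"] = SHORT_TO_MODEL_NAME[action_type]
    return normalised


if __name__ == "__main__":
    print(normalise_action({"action_type": "fix", "fix_code": "return d.get(k)"}))
    print(normalise_action({"action_type": "run_tests"}))
```

The normalised dictionary could then be handed to `action_adapter.validate_python(...)` from `models.py`. Alternatively, the enum in `openenv.yaml` could be updated to the long names so that no translation is needed.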

pyproject.toml CHANGED

@@ -1,30 +1,30 @@
-[build-system]
-requires = ["setuptools>=61.0", "wheel"]
-build-backend = "setuptools.build_meta"
-[project]
-name = "code_review_professional"
-version = "1.0.0"
-description = "Multi‑turn code review environment with AST injection, DPO training, and author negotiation"
-authors = [{name = "yuvraj gupta", email = "yuvraj467229@gmail.com"}]
-license = {text = "MIT"}
-readme = "README.md"
-requires-python = ">=3.10"
-dependencies = [
-    "openenv-core>=0.2.0",
-    "fastapi>=0.115.0",
-    "uvicorn>=0.24.0",
-    "unsloth>=2025.3.1",
-    "trl>=0.15.0",
-    "accelerate>=1.2.0",
-    "pylint>=3.3.0",
-    "sentence-transformers>=3.3.0",
-    "datasets>=3.3.0",
-    "chromadb>=0.5.0",
-]
-[project.optional-dependencies]
-dev = ["pytest>=7.0", "black>=23.0", "isort>=5.0"]
-[tool.openenv]
 server = "server.app:app"

+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "code_review_professional"
+version = "1.0.0"
+description = "Multi‑turn code review environment with AST injection, DPO training, and author negotiation"
+authors = [{name = "yuvraj gupta", email = "yuvraj467229@gmail.com"}]
+license = {text = "MIT"}
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "openenv-core>=0.2.0",
+    "fastapi>=0.115.0",
+    "uvicorn>=0.24.0",
+    "unsloth>=2025.3.1",
+    "trl>=0.15.0",
+    "accelerate>=1.2.0",
+    "pylint>=3.3.0",
+    "sentence-transformers>=3.3.0",
+    "datasets>=3.3.0",
+    "chromadb>=0.5.0",
+]
+[project.optional-dependencies]
+dev = ["pytest>=7.0", "black>=23.0", "isort>=5.0"]
+[tool.openenv]
 server = "server.app:app"

redteam.py CHANGED

@@ -1,274 +1,274 @@
-# redteam.py – Task‑aware bug injection (25 bugs, 5 difficulty levels)
-import ast
-import random
-from dataclasses import dataclass, field
-from typing import Tuple, Optional, List, Dict
-# ----------------------------------------------------------------------
-# 1. AST Bug Injector (extended for all simple bugs)
-# ----------------------------------------------------------------------
-class ASTBugInjector(ast.NodeTransformer):
-    def __init__(self, bug_type: str):
-        super().__init__()
-        self.bug_type = bug_type
-        self.modified = False
-    # --- Easy: null_check, simple_typo, string_index, default_value, empty_return ---
-    def visit_If(self, node: ast.If):
-        # null_check: remove the if-guard
-        if self.bug_type == "null_check" and not self.modified:
-            if node.body and len(node.body) == 1:
-                self.modified = True
-                return node.body[0]
-        # division_by_zero_empty: remove the empty check
-        if self.bug_type == "division_by_zero_empty" and not self.modified:
-            # pattern: if not data: return 0  – we delete the entire if
-            if (isinstance(node.test, ast.UnaryOp) and
-                isinstance(node.test.op, ast.Not) and
-                isinstance(node.test.operand, ast.Name)):
-                self.modified = True
-                return None  # signal to remove this node from parent
-        return self.generic_visit(node)
-    def visit_Name(self, node: ast.Name):
-        if self.bug_type == "simple_typo" and not self.modified:
-            if node.id == "users":
-                self.modified = True
-                return ast.Name(id="usres", ctx=node.ctx)
-        return self.generic_visit(node)
-    def visit_Subscript(self, node: ast.Subscript):
-        if self.bug_type == "string_index" and not self.modified:
-            if isinstance(node.slice, ast.Index) and isinstance(node.slice.value, ast.Constant):
-                old_val = node.slice.value.value
-                if isinstance(old_val, int):
-                    self.modified = True
-                    node.slice = ast.Index(value=ast.Constant(value=old_val + 1))
-        return self.generic_visit(node)
-    def visit_Call(self, node: ast.Call):
-        # default_value: change dict.get(key) to dict[key] (no default)
-        if self.bug_type == "default_value" and not self.modified:
-            if (isinstance(node.func, ast.Attribute) and
-                node.func.attr == "get" and len(node.args) == 1):
-                self.modified = True
-                return ast.Subscript(
-                    value=node.func.value,
-                    slice=ast.Index(value=node.args[0]),
-                    ctx=node.ctx
-                )
-        # abs_usage: remove abs()
-        if self.bug_type == "abs_usage" and not self.modified:
-            if isinstance(node.func, ast.Name) and node.func.id == "abs":
-                self.modified = True
-                return node.args[0]
-        return self.generic_visit(node)
-    def visit_FunctionDef(self, node: ast.FunctionDef):
-        # empty_return: insert a premature return None
-        if self.bug_type == "empty_return" and not self.modified:
-            self.modified = True
-            node.body.insert(0, ast.Return(value=ast.Constant(value=None)))
-        return self.generic_visit(node)
-    # --- Medium: off_by_one, loop_skip, sign_error, swap_args, uninitialised_var ---
-    def visit_For(self, node: ast.For):
-        if (self.bug_type in ("off_by_one", "loop_skip")) and not self.modified:
-            if (isinstance(node.iter, ast.Call) and
-                isinstance(node.iter.func, ast.Name) and
-                node.iter.func.id == "range"):
-                if self.bug_type == "off_by_one":
-                    new_iter = ast.Call(
-                        func=ast.Name(id='range', ctx=ast.Load()),
-                        args=[
-                            ast.Constant(value=1),
-                            ast.BinOp(left=node.iter.args[0], op=ast.Sub(), right=ast.Constant(value=1))
-                        ],
-                        keywords=[]
-                    )
-                    node.iter = new_iter
-                    self.modified = True
-                elif self.bug_type == "loop_skip" and len(node.iter.args) == 1:
-                    new_iter = ast.Call(
-                        func=ast.Name(id='range', ctx=ast.Load()),
-                        args=[ast.BinOp(left=node.iter.args[0], op=ast.Sub(), right=ast.Constant(value=1))],
-                        keywords=[]
-                    )
-                    node.iter = new_iter
-                    self.modified = True
-        return self.generic_visit(node)
-    def visit_BinOp(self, node: ast.BinOp):
-        # sign_error: flip Add/Sub, wrong_operator: Add->Sub, float_precision: Div->FloorDiv
-        if not self.modified:
-            if self.bug_type in ("wrong_operator", "sign_error"):
-                if isinstance(node.op, ast.Add):
-                    node.op = ast.Sub()
-                    self.modified = True
-                elif isinstance(node.op, ast.Sub):
-                    node.op = ast.Add()
-                    self.modified = True
-            elif self.bug_type == "float_precision" and isinstance(node.op, ast.Div):
-                node.op = ast.FloorDiv()
-                self.modified = True
-        return self.generic_visit(node)
-    def visit_arguments(self, node: ast.arguments):
-        # swap_args: swap first two arguments of a function
-        if self.bug_type == "swap_args" and not self.modified and len(node.args) >= 2:
-            self.modified = True
-            node.args[0], node.args[1] = node.args[1], node.args[0]
-        return self.generic_visit(node)
-    def visit_Assign(self, node: ast.Assign):
-        # uninitialised_var: remove an assignment statement (replaced with Pass)
-        if self.bug_type == "uninitialised_var" and not self.modified:
-            self.modified = True
-            return ast.Pass()
-        return self.generic_visit(node)
-# ----------------------------------------------------------------------
-# 2. Bug database (25 bugs, categorized by difficulty)
-# ----------------------------------------------------------------------
-BUG_DB = {
-    "easy": {
-        "null_check":    {"type": "ast", "bug_type": "null_check"},
-        "simple_typo":   {"type": "ast", "bug_type": "simple_typo"},
-        "string_index":  {"type": "ast", "bug_type": "string_index"},
-        "default_value": {"type": "ast", "bug_type": "default_value"},
-        "empty_return":  {"type": "ast", "bug_type": "empty_return"},
-    },
-    "medium": {
-        "off_by_one":     {"type": "ast", "bug_type": "off_by_one"},
-        "loop_skip":      {"type": "ast", "bug_type": "loop_skip"},
-        "sign_error":     {"type": "ast", "bug_type": "sign_error"},
-        "swap_args":      {"type": "ast", "bug_type": "swap_args"},
-        "uninitialised_var": {"type": "ast", "bug_type": "uninitialised_var"},
-    },
-    "hard": {
-        "division_by_zero_empty": {"type": "ast", "bug_type": "division_by_zero_empty"},
-        "division_by_zero_zero":  {"type": "ast", "bug_type": "division_by_zero_empty"},  # same injector
-        "float_precision":        {"type": "ast", "bug_type": "float_precision"},
-        "abs_usage":              {"type": "ast", "bug_type": "abs_usage"},
-        "round_error":            {"type": "ast", "bug_type": "round_error"},  # can be extended
-    },
-    "harder": {
-        "missing_lock": {
-            "type": "template",
-            "buggy": "counter = 0\ndef increment():\n    global counter\n    counter += 1",
-            "oracle": "counter = 0\nimport threading\nlock = threading.Lock()\ndef increment():\n    global counter\n    with lock:\n        counter += 1",
-        },
-        "double_lock": {
-            "type": "template",
-            "buggy": "import threading\nlock = threading.Lock()\ndef do_work():\n    lock.acquire()\n    lock.acquire()\n    print('working')\n    lock.release()",
-            "oracle": "import threading\nlock = threading.Lock()\ndef do_work():\n    with lock:\n        print('working')",
-        },
-        "global_nonatomic": {
-            "type": "template",
-            "buggy": "count = 0\ndef add():\n    global count\n    count = count + 1",
-            "oracle": "count = 0\ndef add():\n    global count\n    count += 1",
-        },
-        "thread_safe_list": {
-            "type": "template",
-            "buggy": "import threading\nitems = []\ndef append_item(item):\n    items.append(item)",
-            "oracle": "import threading\nitems = []\nlock = threading.Lock()\ndef append_item(item):\n    with lock:\n        items.append(item)",
-        },
-        "volatile_read": {
-            "type": "template",
-            "buggy": "import threading\nstop = False\ndef worker():\n    while not stop:\n        pass",
-            "oracle": "import threading\nstop = False\nlock = threading.Lock()\ndef worker():\n    while True:\n        with lock:\n            if stop:\n                break",
-        },
-    },
-    "hardest": {
-        "deadlock_order": {
-            "type": "template",
-            "buggy": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock2:\n        with lock1:\n            pass",
-            "oracle": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock1:\n        with lock2:\n            pass",
-        },
-        "nested_lock_timeout": {
-            "type": "template",
-            "buggy": "import threading\nlock = threading.Lock()\ndef work():\n    lock.acquire()\n    # critical section\n    lock.release()",
-            "oracle": "import threading\nlock = threading.Lock()\ndef work():\n    if lock.acquire(timeout=1):\n        try:\n            # critical section\n        finally:\n            lock.release()",
-        },
-        "fork_join": {
-            "type": "template",
-            "buggy": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()",
-            "oracle": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()\nt.join()",
-        },
-        "mutex_release": {
-            "type": "template",
-            "buggy": "import threading\nlock = threading.Lock()\ndef thread_A():\n    lock.acquire()\n    lock.release()\ndef thread_B():\n    lock.release()",
-            "oracle": "import threading\nlock = threading.Lock()\ndef thread_A():\n    with lock:\n        pass\ndef thread_B():\n    with lock:\n        pass",
-        },
-        "race_on_init": {
-            "type": "template",
-            "buggy": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nprint(items)",
-            "oracle": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nt.join()\nprint(items)",
-        },
-    },
-}
-# ----------------------------------------------------------------------
-# 3. Derived helpers
-# ----------------------------------------------------------------------
-TASK_BUG_MAP = {level: list(bugs.keys()) for level, bugs in BUG_DB.items()}
-TEMPLATE_BUGS = {}
-for level, bugs in BUG_DB.items():
-    for bug_id, bug in bugs.items():
-        if bug["type"] == "template":
-            TEMPLATE_BUGS[bug_id] = (bug["buggy"], bug["oracle"])
-# ----------------------------------------------------------------------
-# 4. RedTeam Controller (task‑aware)
-# ----------------------------------------------------------------------
-@dataclass
-class RedTeam:
-    task: str
-    seed: Optional[int] = 42
-    noise_prob: float = 0.2
-    _random: random.Random = field(init=False)
-    def __post_init__(self):
-        self._random = random.Random(self.seed)
-    def inject_bug(self, original_code: str) -> Tuple[str, str, str, str]:
-        """
-        Returns: (buggy_code, bug_type, description, oracle_fix)
-        Selects a bug appropriate for the task difficulty.
-        """
-        bug_list = TASK_BUG_MAP.get(self.task, ["null_check"])
-        bug_type = self._random.choice(bug_list)
-        # Template bug: return hardcoded buggy + oracle
-        if bug_type in TEMPLATE_BUGS:
-            buggy_code, oracle_code = TEMPLATE_BUGS[bug_type]
-            description = f"Template bug: {bug_type}"
-            if self._random.random() < self.noise_prob:
-                buggy_code += "\n# TODO: refactor later"
-            return buggy_code, bug_type, description, oracle_code
-        # AST injection
-        try:
-            tree = ast.parse(original_code)
-        except SyntaxError:
-            return original_code, "parse_error", "Syntax error in original code", original_code
-        injector = ASTBugInjector(bug_type)
-        modified_tree = injector.visit(tree)
-        ast.fix_missing_locations(modified_tree)
-        if injector.modified:
-            buggy_code = ast.unparse(modified_tree)
-            oracle_fix = original_code
-            description = f"AST bug: {bug_type}"
-        else:
-            buggy_code = original_code
-            oracle_fix = original_code
-            bug_type = "no_op"
-            description = "No suitable code structure found for injection"
-        if self._random.random() < self.noise_prob:
-            buggy_code += "\n# TODO: refactor later"
-        return buggy_code, bug_type, description, oracle_fix

+# redteam.py – Task‑aware bug injection (25 bugs, 5 difficulty levels)
+import ast
+import random
+from dataclasses import dataclass, field
+from typing import Tuple, Optional, List, Dict
+# ----------------------------------------------------------------------
+# 1. AST Bug Injector (extended for all simple bugs)
+# ----------------------------------------------------------------------
+class ASTBugInjector(ast.NodeTransformer):
+    def __init__(self, bug_type: str):
+        super().__init__()
+        self.bug_type = bug_type
+        self.modified = False
+    # --- Easy: null_check, simple_typo, string_index, default_value, empty_return ---
+    def visit_If(self, node: ast.If):
+        # null_check: remove the if-guard
+        if self.bug_type == "null_check" and not self.modified:
+            if node.body and len(node.body) == 1:
+                self.modified = True
+                return node.body[0]
+        # division_by_zero_empty: remove the empty check
+        if self.bug_type == "division_by_zero_empty" and not self.modified:
+            # pattern: if not data: return 0  – we delete the entire if
+            if (isinstance(node.test, ast.UnaryOp) and
+                isinstance(node.test.op, ast.Not) and
+                isinstance(node.test.operand, ast.Name)):
+                self.modified = True
+                return None  # signal to remove this node from parent
+        return self.generic_visit(node)
+    def visit_Name(self, node: ast.Name):
+        if self.bug_type == "simple_typo" and not self.modified:
+            if node.id == "users":
+                self.modified = True
+                return ast.Name(id="usres", ctx=node.ctx)
+        return self.generic_visit(node)
+    def visit_Subscript(self, node: ast.Subscript):
+        if self.bug_type == "string_index" and not self.modified:
+            # Python 3.9+ stores the index expression directly in `slice`;
+            # ast.Index is deprecated and no longer produced by ast.parse.
+            if isinstance(node.slice, ast.Constant):
+                old_val = node.slice.value
+                if isinstance(old_val, int):
+                    self.modified = True
+                    node.slice = ast.Constant(value=old_val + 1)
+        return self.generic_visit(node)
+    def visit_Call(self, node: ast.Call):
+        # default_value: change dict.get(key) to dict[key] (no default)
+        if self.bug_type == "default_value" and not self.modified:
+            if (isinstance(node.func, ast.Attribute) and
+                node.func.attr == "get" and len(node.args) == 1):
+                self.modified = True
+                # ast.Call has no `ctx` attribute; the result is read, so use Load().
+                return ast.Subscript(
+                    value=node.func.value,
+                    slice=node.args[0],
+                    ctx=ast.Load()
+                )
+        # abs_usage: remove abs()
+        if self.bug_type == "abs_usage" and not self.modified:
+            if isinstance(node.func, ast.Name) and node.func.id == "abs":
+                self.modified = True
+                return node.args[0]
+        return self.generic_visit(node)
+    def visit_FunctionDef(self, node: ast.FunctionDef):
+        # empty_return: insert a premature return None
+        if self.bug_type == "empty_return" and not self.modified:
+            self.modified = True
+            node.body.insert(0, ast.Return(value=ast.Constant(value=None)))
+        return self.generic_visit(node)
+    # --- Medium: off_by_one, loop_skip, sign_error, swap_args, uninitialised_var ---
+    def visit_For(self, node: ast.For):
+        if (self.bug_type in ("off_by_one", "loop_skip")) and not self.modified:
+            if (isinstance(node.iter, ast.Call) and
+                isinstance(node.iter.func, ast.Name) and
+                node.iter.func.id == "range"):
+                if self.bug_type == "off_by_one":
+                    new_iter = ast.Call(
+                        func=ast.Name(id='range', ctx=ast.Load()),
+                        args=[
+                            ast.Constant(value=1),
+                            ast.BinOp(left=node.iter.args[0], op=ast.Sub(), right=ast.Constant(value=1))
+                        ],
+                        keywords=[]
+                    )
+                    node.iter = new_iter
+                    self.modified = True
+                elif self.bug_type == "loop_skip" and len(node.iter.args) == 1:
+                    new_iter = ast.Call(
+                        func=ast.Name(id='range', ctx=ast.Load()),
+                        args=[ast.BinOp(left=node.iter.args[0], op=ast.Sub(), right=ast.Constant(value=1))],
+                        keywords=[]
+                    )
+                    node.iter = new_iter
+                    self.modified = True
+        return self.generic_visit(node)
+    def visit_BinOp(self, node: ast.BinOp):
+        # sign_error / wrong_operator: flip Add<->Sub; float_precision: Div->FloorDiv
+        if not self.modified:
+            if self.bug_type in ("wrong_operator", "sign_error"):
+                if isinstance(node.op, ast.Add):
+                    node.op = ast.Sub()
+                    self.modified = True
+                elif isinstance(node.op, ast.Sub):
+                    node.op = ast.Add()
+                    self.modified = True
+            elif self.bug_type == "float_precision" and isinstance(node.op, ast.Div):
+                node.op = ast.FloorDiv()
+                self.modified = True
+        return self.generic_visit(node)
+    def visit_arguments(self, node: ast.arguments):
+        # swap_args: swap first two arguments of a function
+        if self.bug_type == "swap_args" and not self.modified and len(node.args) >= 2:
+            self.modified = True
+            node.args[0], node.args[1] = node.args[1], node.args[0]
+        return self.generic_visit(node)
+    def visit_Assign(self, node: ast.Assign):
+        # uninitialised_var: remove an assignment statement (replaced with Pass)
+        if self.bug_type == "uninitialised_var" and not self.modified:
+            self.modified = True
+            return ast.Pass()
+        return self.generic_visit(node)
+# ----------------------------------------------------------------------
+# 2. Bug database (25 bugs, categorized by difficulty)
+# ----------------------------------------------------------------------
+BUG_DB = {
+    "easy": {
+        "null_check":    {"type": "ast", "bug_type": "null_check"},
+        "simple_typo":   {"type": "ast", "bug_type": "simple_typo"},
+        "string_index":  {"type": "ast", "bug_type": "string_index"},
+        "default_value": {"type": "ast", "bug_type": "default_value"},
+        "empty_return":  {"type": "ast", "bug_type": "empty_return"},
+    },
+    "medium": {
+        "off_by_one":     {"type": "ast", "bug_type": "off_by_one"},
+        "loop_skip":      {"type": "ast", "bug_type": "loop_skip"},
+        "sign_error":     {"type": "ast", "bug_type": "sign_error"},
+        "swap_args":      {"type": "ast", "bug_type": "swap_args"},
+        "uninitialised_var": {"type": "ast", "bug_type": "uninitialised_var"},
+    },
+    "hard": {
+        "division_by_zero_empty": {"type": "ast", "bug_type": "division_by_zero_empty"},
+        "division_by_zero_zero":  {"type": "ast", "bug_type": "division_by_zero_empty"},  # same injector
+        "float_precision":        {"type": "ast", "bug_type": "float_precision"},
+        "abs_usage":              {"type": "ast", "bug_type": "abs_usage"},
+        "round_error":            {"type": "ast", "bug_type": "round_error"},  # no injector yet: always falls through to no_op
+    },
+    "harder": {
+        "missing_lock": {
+            "type": "template",
+            "buggy": "counter = 0\ndef increment():\n    global counter\n    counter += 1",
+            "oracle": "counter = 0\nimport threading\nlock = threading.Lock()\ndef increment():\n    global counter\n    with lock:\n        counter += 1",
+        },
+        "double_lock": {
+            "type": "template",
+            "buggy": "import threading\nlock = threading.Lock()\ndef do_work():\n    lock.acquire()\n    lock.acquire()\n    print('working')\n    lock.release()",
+            "oracle": "import threading\nlock = threading.Lock()\ndef do_work():\n    with lock:\n        print('working')",
+        },
+        "global_nonatomic": {
+            "type": "template",
+            "buggy": "count = 0\ndef add():\n    global count\n    count = count + 1",
+            "oracle": "import threading\ncount = 0\nlock = threading.Lock()\ndef add():\n    global count\n    with lock:\n        count += 1",
+        },
+        "thread_safe_list": {
+            "type": "template",
+            "buggy": "import threading\nitems = []\ndef append_item(item):\n    items.append(item)",
+            "oracle": "import threading\nitems = []\nlock = threading.Lock()\ndef append_item(item):\n    with lock:\n        items.append(item)",
+        },
+        "volatile_read": {
+            "type": "template",
+            "buggy": "import threading\nstop = False\ndef worker():\n    while not stop:\n        pass",
+            "oracle": "import threading\nstop = False\nlock = threading.Lock()\ndef worker():\n    while True:\n        with lock:\n            if stop:\n                break",
+        },
+    },
+    "hardest": {
+        "deadlock_order": {
+            "type": "template",
+            "buggy": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock2:\n        with lock1:\n            pass",
+            "oracle": "import threading\nlock1 = threading.Lock()\nlock2 = threading.Lock()\ndef thread1():\n    with lock1:\n        with lock2:\n            pass\ndef thread2():\n    with lock1:\n        with lock2:\n            pass",
+        },
+        "nested_lock_timeout": {
+            "type": "template",
+            "buggy": "import threading\nlock = threading.Lock()\ndef work():\n    lock.acquire()\n    # critical section\n    lock.release()",
+            "oracle": "import threading\nlock = threading.Lock()\ndef work():\n    if lock.acquire(timeout=1):\n        try:\n            pass  # critical section\n        finally:\n            lock.release()",
+        },
+        "fork_join": {
+            "type": "template",
+            "buggy": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()",
+            "oracle": "import threading\ndef worker():\n    pass\nt = threading.Thread(target=worker)\nt.start()\nt.join()",
+        },
+        "mutex_release": {
+            "type": "template",
+            "buggy": "import threading\nlock = threading.Lock()\ndef thread_A():\n    lock.acquire()\n    lock.release()\ndef thread_B():\n    lock.release()",
+            "oracle": "import threading\nlock = threading.Lock()\ndef thread_A():\n    with lock:\n        pass\ndef thread_B():\n    with lock:\n        pass",
+        },
+        "race_on_init": {
+            "type": "template",
+            "buggy": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nprint(items)",
+            "oracle": "import threading\nitems = []\ndef init():\n    global items\n    items = [1,2,3]\nt = threading.Thread(target=init)\nt.start()\nt.join()\nprint(items)",
+        },
+    },
+}
+# ----------------------------------------------------------------------
+# 3. Derived helpers
+# ----------------------------------------------------------------------
+TASK_BUG_MAP = {level: list(bugs.keys()) for level, bugs in BUG_DB.items()}
+TEMPLATE_BUGS = {}
+for level, bugs in BUG_DB.items():
+    for bug_id, bug in bugs.items():
+        if bug["type"] == "template":
+            TEMPLATE_BUGS[bug_id] = (bug["buggy"], bug["oracle"])
+# ----------------------------------------------------------------------
+# 4. RedTeam Controller (task‑aware)
+# ----------------------------------------------------------------------
+@dataclass
+class RedTeam:
+    task: str
+    seed: Optional[int] = 42
+    noise_prob: float = 0.2
+    _random: random.Random = field(init=False)
+    def __post_init__(self):
+        self._random = random.Random(self.seed)
+    def inject_bug(self, original_code: str) -> Tuple[str, str, str, str]:
+        """
+        Returns: (buggy_code, bug_type, description, oracle_fix)
+        Selects a bug appropriate for the task difficulty.
+        """
+        bug_list = TASK_BUG_MAP.get(self.task, ["null_check"])
+        bug_type = self._random.choice(bug_list)
+        # Template bug: return hardcoded buggy + oracle
+        if bug_type in TEMPLATE_BUGS:
+            buggy_code, oracle_code = TEMPLATE_BUGS[bug_type]
+            description = f"Template bug: {bug_type}"
+            if self._random.random() < self.noise_prob:
+                buggy_code += "\n# TODO: refactor later"
+            return buggy_code, bug_type, description, oracle_code
+        # AST injection
+        try:
+            tree = ast.parse(original_code)
+        except SyntaxError:
+            return original_code, "parse_error", "Syntax error in original code", original_code
+        injector = ASTBugInjector(bug_type)
+        modified_tree = injector.visit(tree)
+        ast.fix_missing_locations(modified_tree)
+        if injector.modified:
+            buggy_code = ast.unparse(modified_tree)
+            oracle_fix = original_code
+            description = f"AST bug: {bug_type}"
+        else:
+            buggy_code = original_code
+            oracle_fix = original_code
+            bug_type = "no_op"
+            description = "No suitable code structure found for injection"
+        if self._random.random() < self.noise_prob:
+            buggy_code += "\n# TODO: refactor later"
+        return buggy_code, bug_type, description, oracle_fix
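
The AST path of `inject_bug` above can be illustrated with a minimal standalone sketch. It assumes Python 3.9+ (for `ast.unparse`); `FlipAddSub` and `inject_sign_error` are hypothetical names used only for illustration and are not part of `redteam.py`. The sketch mirrors the `sign_error` branch: mutate the first matching node, record that a change happened, and keep the original source as the oracle fix.

```python
import ast


class FlipAddSub(ast.NodeTransformer):
    """Hypothetical injector: flips the first + or - it encounters."""

    def __init__(self):
        super().__init__()
        self.modified = False

    def visit_BinOp(self, node: ast.BinOp):
        if not self.modified:
            if isinstance(node.op, ast.Add):
                node.op = ast.Sub()
                self.modified = True
            elif isinstance(node.op, ast.Sub):
                node.op = ast.Add()
                self.modified = True
        return self.generic_visit(node)


def inject_sign_error(source: str):
    """Return (buggy_code, oracle_fix, changed) for the given source."""
    tree = ast.parse(source)
    injector = FlipAddSub()
    tree = injector.visit(tree)
    ast.fix_missing_locations(tree)
    if not injector.modified:
        # Nothing to mutate: same outcome as the "no_op" branch in the diff.
        return source, source, False
    return ast.unparse(tree), source, True


original = "def add(a, b):\n    return a + b"
buggy, oracle, changed = inject_sign_error(original)
print(buggy)
```

Because `ast.unparse` regenerates the source from the tree, comments and original formatting are lost in the buggy version, which is one reason the unmodified input is kept as the oracle.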

rubrics.py CHANGED Viewed

@@ -1,123 +1,136 @@
-# rubrics.py – Self-contained Rubrics (no external OpenEnv dependency)
-class Rubric:
-    """Minimal Rubric base – compatible with OpenEnv but self‑contained."""
-    def __call__(self, env, action, obs, reward, done, info):
-        return 0.0
-# --------------------------------------------------------------------------------
-# 1. TOOL‑USAGE BONUS
-# --------------------------------------------------------------------------------
-class ToolUsageRubric(Rubric):
-    def __init__(self, bonus: float = 0.05):
-        self.bonus = bonus
-    def __call__(self, env, action, obs, reward, done, info):
-        score = 0.0
-        action_type = info.get("action_type", "")
-        if action_type == "run_tests":
-            if not env._tests_run:
-                score += self.bonus
-            score += 0.015
-        elif action_type == "run_linter":
-            if not env._linter_run:
-                score += self.bonus
-            score += 0.015
-        elif action_type == "query_docs":
-            if not env._docs_queried:
-                score += self.bonus * 0.5
-        elif action_type == "ask_question" and env._step_count <= 3:
-            score += 0.02
-        return score
-# --------------------------------------------------------------------------------
-# 2. DELTA‑BASED REWARDS
-# --------------------------------------------------------------------------------
-class TestDeltaRubric(Rubric):
-    def __init__(self, weight: float = 0.3):
-        self.weight = weight
-    def __call__(self, env, action, obs, reward, done, info):
-        delta = env._current_test_score - env._previous_test_score
-        effective = self.weight
-        if info.get("action_type") == "propose_fix":
-            effective *= 0.4
-        return effective * delta
-class LintDeltaRubric(Rubric):
-    def __init__(self, weight: float = 0.3):
-        self.weight = weight
-    def __call__(self, env, action, obs, reward, done, info):
-        delta = env._current_lint_score - env._previous_lint_score
-        effective = self.weight * 0.5
-        if info.get("action_type") == "propose_fix":
-            effective *= 0.4
-        return effective * delta
-# --------------------------------------------------------------------------------
-# 3. TERMINAL SUCCESS BONUS
-# --------------------------------------------------------------------------------
-class TerminalSuccessRubric(Rubric):
-    def __call__(self, env, action, obs, reward, done, info):
-        if info.get("action_type") != "propose_fix":
-            return 0.0
-        score = 0.0
-        if env._current_test_score > 0.95:
-            score += 0.4
-        elif env._current_test_score > 0.85:
-            score += 0.2
-        return score
-# --------------------------------------------------------------------------------
-# 4. EXPLORATION & DIVERSITY
-# --------------------------------------------------------------------------------
-class ExplorationRubric(Rubric):
-    def __init__(self, penalty: float = -0.05, bonus: float = 0.021):
-        self.penalty = penalty
-        self.bonus = bonus
-    def __call__(self, env, action, obs, reward, done, info):
-        if len(env._action_history) < 3:
-            return 0.0
-        recent = env._action_history[-3:]
-        unique = len(set(recent))
-        if unique == 1:
-            return self.penalty
-        elif unique == 3:
-            return self.bonus
-        return 0.0
-# --------------------------------------------------------------------------------
-# 5. ANTI‑HACKING & CONSISTENCY
-# --------------------------------------------------------------------------------
-class AntiHackingRubric(Rubric):
-    def __call__(self, env, action, obs, reward, done, info):
-        if info.get("action_type") != "propose_fix":
-            return 0.0
-        score = 0.0
-        if not env._tests_run:
-            score -= 0.25
-        if env._step_count < 2:
-            score -= 0.1
-        if env._tests_run and env._linter_run:
-            score += 0.02
-        return score
-# --------------------------------------------------------------------------------
-# 6. STEP PENALTY
-# --------------------------------------------------------------------------------
-class StepPenaltyRubric(Rubric):
-    def __init__(self, penalty: float = -0.01):
-        self.penalty = penalty
-    def __call__(self, env, action, obs, reward, done, info):
-        return self.penalty

+# rubrics.py – Self-contained Rubrics (no external OpenEnv dependency)
+class Rubric:
+    """Minimal Rubric base – compatible with OpenEnv but self‑contained."""
+    def __call__(self, env, action, obs, reward, done, info):
+        return 0.0
+# --------------------------------------------------------------------------------
+# 1. TOOL‑USAGE BONUS
+# --------------------------------------------------------------------------------
+class ToolUsageRubric(Rubric):
+    def __init__(self, bonus: float = 0.05):
+        self.bonus = bonus
+    def __call__(self, env, action, obs, reward, done, info):
+        score = 0.0
+        action_type = info.get("action_type", "")
+        # Use pre-action flags from `info` so first-use bonuses are
+        # computed correctly even though env flags are mutated in-step.
+        prev_tests_run = info.get("prev_tests_run", env._tests_run)
+        prev_linter_run = info.get("prev_linter_run", env._linter_run)
+        prev_docs_queried = info.get("prev_docs_queried", env._docs_queried)
+        if action_type == "run_tests":
+            if not prev_tests_run:
+                score += self.bonus
+            score += 0.015
+        elif action_type == "run_linter":
+            if not prev_linter_run:
+                score += self.bonus
+            score += 0.015
+        elif action_type == "query_docs":
+            if not prev_docs_queried:
+                score += self.bonus * 0.5
+            # Encourage docs usage when it is likely useful:
+            # - early exploration phase
+            # - non-trivial query text
+            if env._step_count <= 4 and info.get("docs_query_len", 0) >= 8:
+                score += 0.01
+            # Discourage repeated docs calls after the first-use signal.
+            if prev_docs_queried:
+                score -= 0.01
+        elif action_type == "ask_question" and env._step_count <= 3:
+            score += 0.02
+        return score
+# --------------------------------------------------------------------------------
+# 2. DELTA‑BASED REWARDS
+# --------------------------------------------------------------------------------
+class TestDeltaRubric(Rubric):
+    def __init__(self, weight: float = 0.3):
+        self.weight = weight
+    def __call__(self, env, action, obs, reward, done, info):
+        delta = env._current_test_score - env._previous_test_score
+        effective = self.weight
+        if info.get("action_type") == "propose_fix":
+            effective *= 0.4
+        return effective * delta
+class LintDeltaRubric(Rubric):
+    def __init__(self, weight: float = 0.3):
+        self.weight = weight
+    def __call__(self, env, action, obs, reward, done, info):
+        delta = env._current_lint_score - env._previous_lint_score
+        effective = self.weight * 0.5
+        if info.get("action_type") == "propose_fix":
+            effective *= 0.4
+        return effective * delta
+# --------------------------------------------------------------------------------
+# 3. TERMINAL SUCCESS BONUS
+# --------------------------------------------------------------------------------
+class TerminalSuccessRubric(Rubric):
+    def __call__(self, env, action, obs, reward, done, info):
+        if info.get("action_type") != "propose_fix":
+            return 0.0
+        score = 0.0
+        if env._current_test_score > 0.95:
+            score += 0.4
+        elif env._current_test_score > 0.85:
+            score += 0.2
+        return score
+# --------------------------------------------------------------------------------
+# 4. EXPLORATION & DIVERSITY
+# --------------------------------------------------------------------------------
+class ExplorationRubric(Rubric):
+    def __init__(self, penalty: float = -0.05, bonus: float = 0.021):
+        self.penalty = penalty
+        self.bonus = bonus
+    def __call__(self, env, action, obs, reward, done, info):
+        if len(env._action_history) < 3:
+            return 0.0
+        recent = env._action_history[-3:]
+        unique = len(set(recent))
+        if unique == 1:
+            return self.penalty
+        elif unique == 3:
+            return self.bonus
+        return 0.0
+# --------------------------------------------------------------------------------
+# 5. ANTI‑HACKING & CONSISTENCY
+# --------------------------------------------------------------------------------
+class AntiHackingRubric(Rubric):
+    def __call__(self, env, action, obs, reward, done, info):
+        if info.get("action_type") != "propose_fix":
+            return 0.0
+        score = 0.0
+        if not env._tests_run:
+            score -= 0.25
+        if env._step_count < 2:
+            score -= 0.1
+        if env._tests_run and env._linter_run:
+            score += 0.02
+        return score
+# --------------------------------------------------------------------------------
+# 6. STEP PENALTY
+# --------------------------------------------------------------------------------
+class StepPenaltyRubric(Rubric):
+    def __init__(self, penalty: float = -0.01):
+        self.penalty = penalty
+    def __call__(self, env, action, obs, reward, done, info):
+        return self.penalty
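
The reason `ToolUsageRubric` now reads `prev_tests_run` from `info` can be shown with a small sketch. It assumes, as the comment in the diff states, that the environment sets its flags during the step and before the rubric runs; `first_use_bonus` is a hypothetical, simplified stand-in for the `run_tests` branch, and the environment is faked with `SimpleNamespace`.

```python
from types import SimpleNamespace


def first_use_bonus(env, info, bonus=0.05):
    """Simplified stand-in for the run_tests branch of ToolUsageRubric."""
    # Prefer the pre-action snapshot; fall back to the (already mutated) env flag.
    prev_tests_run = info.get("prev_tests_run", env._tests_run)
    score = 0.0
    if info.get("action_type") == "run_tests":
        if not prev_tests_run:
            score += bonus
        score += 0.015
    return score


# Assumed situation: the step has already set the flag when the rubric is called.
env = SimpleNamespace(_tests_run=True)

with_snapshot = first_use_bonus(
    env, {"action_type": "run_tests", "prev_tests_run": False}
)
without_snapshot = first_use_bonus(env, {"action_type": "run_tests"})
print(with_snapshot, without_snapshot)
```

With the snapshot, the first `run_tests` call earns the first-use bonus plus the flat 0.015. Without it, the rubric sees the flag already set and awards only the flat amount, which is the behavior the change avoids.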

test_runner.py CHANGED Viewed

@@ -1,181 +1,208 @@
-# test_runner.py – Full production version with continuous scoring, dynamic function detection, and randomised tests
-import subprocess
-import tempfile
-import os
-import json
-import ast
-import random
-import sys
-from typing import Tuple, List, Any, Optional
-from dataclasses import dataclass
-@dataclass
-class TestRunner:
-    bug_id: str
-    timeout_sec: int = 5
-    max_memory_mb: int = 256
-    fuzz_rounds: int = 3   # number of random test cases per bug
-    def run_tests(self, fix_code: str) -> Tuple[float, str]:
-        """
-        Returns (score, output_message) where score is proportion of passed tests (0.0–1.0).
-        """
-        # 1. Detect the function defined in the agent's code (dynamic)
-        func_name = self._get_defined_function_name(fix_code)
-        if not func_name:
-            return 0.0, "No function definition found in agent code"
-        # 2. Generate the test script (includes fixed + fuzzed test cases)
-        test_script = self._generate_test_script(fix_code, func_name)
-        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
-            f.write(test_script)
-            tmp_path = f.name
-        try:
-            # Resource limiting (Linux only; fallback otherwise)
-            try:
-                import resource
-                resource.setrlimit(resource.RLIMIT_AS, (self.max_memory_mb * 1024 * 1024, self.max_memory_mb * 1024 * 1024))
-            except Exception:
-                pass
-            result = subprocess.run(
-                [sys.executable, tmp_path],
-                capture_output=True,
-                text=True,
-                timeout=self.timeout_sec,
-                encoding='utf-8'
-            )
-            # Parse JSON output
-            try:
-                data = json.loads(result.stdout.strip())
-                passed = data.get("passed", 0)
-                total = data.get("total", 1)
-                score = passed / total if total > 0 else 0.0
-                return score, result.stdout.strip()
-            except json.JSONDecodeError:
-                # Fallback: look for "True" (legacy)
-                if "True" in result.stdout:
-                    return 1.0, result.stdout
-                return 0.0, result.stdout
-        except subprocess.TimeoutExpired:
-            return 0.0, "Test execution timed out"
-        except Exception as e:
-            return 0.0, f"Unexpected error: {str(e)}"
-        finally:
-            try:
-                os.unlink(tmp_path)
-            except:
-                pass
-    def _get_defined_function_name(self, code: str) -> Optional[str]:
-        """Extract the target function name from the code.
-           Looks for a function named 'fix' first; otherwise returns the first function found.
-        """
-        try:
-            tree = ast.parse(code)
-            first_func = None
-            for node in ast.walk(tree):
-                if isinstance(node, ast.FunctionDef):
-                    if node.name == "fix":
-                        return "fix"
-                    if first_func is None:
-                        first_func = node.name
-            return first_func   # fallback if no 'fix' function exists
-        except SyntaxError:
-            pass
-        return None
-    def _generate_test_script(self, fix_code: str, func_name: str) -> str:
-        """Generate a test script that runs fixed + fuzzed test cases and outputs JSON."""
-        test_cases = self._get_test_cases(func_name)
-        fuzzed_cases = self._generate_fuzzed_cases(func_name)
-        all_cases = test_cases + fuzzed_cases
-        lines = []
-        lines.append(fix_code)
-        lines.append("")
-        lines.append("import json")
-        lines.append("")
-        lines.append("def run_tests():")
-        lines.append(f"    test_cases = {json.dumps(all_cases)}")
-        lines.append("    passed = 0")
-        lines.append("    total = len(test_cases)")
-        lines.append("    for args, expected in test_cases:")
-        lines.append(f"        try:")
-        lines.append(f"            result = {func_name}(*args) if isinstance(args, list) else {func_name}(args)")
-        lines.append(f"            if result == expected:")
-        lines.append(f"                passed += 1")
-        lines.append(f"        except Exception:")
-        lines.append(f"            pass")
-        lines.append("    return {'passed': passed, 'total': total}")
-        lines.append("")
-        lines.append("if __name__ == '__main__':")
-        lines.append("    result = run_tests()")
-        lines.append("    print(json.dumps(result))")
-        return "\n".join(lines)
-    def _get_test_cases(self, func_name: str) -> List[Tuple[List[Any], Any]]:
-        """
-        Return a list of (arguments, expected_output) for the given bug_id.
-        Uses the actual function name (dynamic) for consistency.
-        """
-        if self.bug_id == "null_check":
-            return [
-                ([{"users": {"alice": "Alice"}, "id": "bob"}], None),   # missing key should not crash
-                ([{"users": {"alice": "Alice"}, "id": "alice"}], "Alice"),
-            ]
-        elif self.bug_id == "off_by_one":
-            return [
-                ([[1,2,3,4]], 4),
-                ([[]], 0),
-            ]
-        elif self.bug_id == "division_by_zero":
-            return [
-                ([[]], 0),
-                ([[1,2,3]], 2.0),
-            ]
-        elif self.bug_id == "wrong_operator":
-            return [
-                ([5,3], 8),
-                ([-1,1], 0),
-            ]
-        else:
-            # For missing_lock, deadlock_order, etc., return empty list (will be handled gracefully)
-            return []
-    def _generate_fuzzed_cases(self, func_name: str) -> List[Tuple[List[Any], Any]]:
-        """
-        Generate random test cases to prevent memorisation.
-        Only for bugs where meaningful fuzzing is possible.
-        """
-        cases = []
-        if self.bug_id == "null_check":
-            # Random users dictionary and random ids
-            for _ in range(self.fuzz_rounds):
-                users = {f"user_{i}": f"name_{i}" for i in range(random.randint(1, 5))}
-                # Pick existing or missing key
-                if random.random() > 0.5:
-                    key = random.choice(list(users.keys()))
-                    expected = users[key]
-                else:
-                    key = "missing_" + str(random.randint(100, 999))
-                    expected = None
-                cases.append(([{"users": users, "id": key}], expected))
-        elif self.bug_id == "off_by_one":
-            for _ in range(self.fuzz_rounds):
-                length = random.randint(0, 10)
-                arr = list(range(length))
-                cases.append(([arr], length))
-        elif self.bug_id == "division_by_zero":
-            for _ in range(self.fuzz_rounds):
-                length = random.randint(0, 10)
-                data = [random.randint(-100, 100) for _ in range(length)]
-                expected = sum(data)/length if length else 0
-                cases.append(([data], expected))
-        elif self.bug_id == "wrong_operator":
-            for _ in range(self.fuzz_rounds):
-                a = random.randint(-100, 100)
-                b = random.randint(-100, 100)
-                cases.append(([a, b], a + b))
-        return cases

+# test_runner.py – Full production version with continuous scoring, dynamic function detection, and randomised tests
+import subprocess
+import tempfile
+import os
+import json
+import ast
+import random
+import sys
+from typing import Tuple, List, Any, Optional
+from dataclasses import dataclass
+# Bridge fine-grained RedTeam ids to canonical TestRunner families.
+# This keeps evaluation stable even when bug generators use richer labels.
+BUG_ID_CANONICAL_MAP = {
+    # Easy-family bugs on `get_user`-style behavior.
+    "simple_typo": "null_check",
+    "default_value": "null_check",
+    "empty_return": "null_check",
+    # Medium arithmetic/control-flow aliases.
+    "loop_skip": "off_by_one",
+    "sign_error": "wrong_operator",
+    # Hard numeric-safety aliases.
+    "division_by_zero_empty": "division_by_zero",
+    "division_by_zero_zero": "division_by_zero",
+    "float_precision": "division_by_zero",
+    "abs_usage": "division_by_zero",
+    "round_error": "division_by_zero",
+}
+@dataclass
+class TestRunner:
+    bug_id: str
+    timeout_sec: int = 5
+    max_memory_mb: int = 256
+    fuzz_rounds: int = 3   # number of random test cases per bug
+    def run_tests(self, fix_code: str) -> Tuple[float, str]:
+        """
+        Returns (score, output_message) where score is proportion of passed tests (0.0–1.0).
+        """
+        # 1. Detect the function defined in the agent's code (dynamic)
+        func_name = self._get_defined_function_name(fix_code)
+        if not func_name:
+            return 0.0, "No function definition found in agent code"
+        # 2. Normalize bug id so broader RedTeam ids still hit meaningful tests.
+        canonical_bug_id = self._canonical_bug_id()
+        # 3. Generate the test script (includes fixed + fuzzed test cases)
+        test_script = self._generate_test_script(fix_code, func_name, canonical_bug_id)
+        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
+            f.write(test_script)
+            tmp_path = f.name
+        try:
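+            # Resource limiting (POSIX only; no limit is applied elsewhere).
+            # The limit is set in the child via preexec_fn so that the harness
+            # process itself is not capped.
+            limit_memory = None
+            try:
+                import resource
+                max_bytes = self.max_memory_mb * 1024 * 1024
+                def limit_memory():
+                    try:
+                        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
+                    except (ValueError, OSError):
+                        pass
+            except ImportError:
+                pass
+            result = subprocess.run(
+                [sys.executable, tmp_path],
+                capture_output=True,
+                text=True,
+                timeout=self.timeout_sec,
+                encoding='utf-8',
+                preexec_fn=limit_memory,
+            )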
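+            # Parse the JSON summary, which the generated script prints last;
+            # reading the final line tolerates agent code that prints on import.
+            stdout_lines = result.stdout.strip().splitlines()
+            try:
+                data = json.loads(stdout_lines[-1] if stdout_lines else "")
+                passed = data.get("passed", 0)
+                total = data.get("total", 1)
+                score = passed / total if total > 0 else 0.0
+                return score, result.stdout.strip()
+            except json.JSONDecodeError:
+                # Fallback: look for "True" (legacy). Note that any output
+                # containing "True" satisfies this substring check.
+                if "True" in result.stdout:
+                    return 1.0, result.stdout
+                return 0.0, result.stdout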
+        except subprocess.TimeoutExpired:
+            return 0.0, "Test execution timed out"
+        except Exception as e:
+            return 0.0, f"Unexpected error: {str(e)}"
+        finally:
+            try:
+                os.unlink(tmp_path)
+            except OSError:
+                pass
+    def _get_defined_function_name(self, code: str) -> Optional[str]:
+        """Extract the target function name from the code.
+           Looks for a function named 'fix' first; otherwise returns the first function found.
+        """
+        try:
+            tree = ast.parse(code)
+            first_func = None
+            for node in ast.walk(tree):
+                if isinstance(node, ast.FunctionDef):
+                    if node.name == "fix":
+                        return "fix"
+                    if first_func is None:
+                        first_func = node.name
+            return first_func   # fallback if no 'fix' function exists
+        except SyntaxError:
+            pass
+        return None
+    def _canonical_bug_id(self) -> str:
+        """Return canonical bug family used by this test harness."""
+        return BUG_ID_CANONICAL_MAP.get(self.bug_id, self.bug_id)
+    def _generate_test_script(self, fix_code: str, func_name: str, canonical_bug_id: str) -> str:
+        """Generate a test script that runs fixed + fuzzed test cases and outputs JSON."""
+        test_cases = self._get_test_cases(canonical_bug_id, func_name)
+        fuzzed_cases = self._generate_fuzzed_cases(canonical_bug_id, func_name)
+        all_cases = test_cases + fuzzed_cases
+        lines = []
+        lines.append(fix_code)
+        lines.append("")
+        lines.append("import json")
+        lines.append("")
+        lines.append("def run_tests():")
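+        # Embed the cases as a JSON string literal decoded at run time:
+        # pasting raw JSON would put `null` (from None) into Python source.
+        lines.append(f"    test_cases = json.loads({json.dumps(all_cases)!r})")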
+        lines.append("    passed = 0")
+        lines.append("    total = len(test_cases)")
+        lines.append("    for args, expected in test_cases:")
+        lines.append("        try:")
+        lines.append(f"            result = {func_name}(*args) if isinstance(args, list) else {func_name}(args)")
+        lines.append("            if result == expected:")
+        lines.append("                passed += 1")
+        lines.append("        except Exception:")
+        lines.append("            pass")
+        lines.append("    return {'passed': passed, 'total': total}")
+        lines.append("")
+        lines.append("if __name__ == '__main__':")
+        lines.append("    result = run_tests()")
+        lines.append("    print(json.dumps(result))")
+        return "\n".join(lines)
+    def _get_test_cases(self, canonical_bug_id: str, func_name: str) -> List[Tuple[List[Any], Any]]:
+        """
+        Return a list of (arguments, expected_output) for the given bug_id.
+        Uses the actual function name (dynamic) for consistency.
+        """
+        if canonical_bug_id == "null_check":
+            return [
+                ([{"users": {"alice": "Alice"}, "id": "bob"}], None),   # missing key should not crash
+                ([{"users": {"alice": "Alice"}, "id": "alice"}], "Alice"),
+            ]
+        elif canonical_bug_id == "off_by_one":
+            return [
+                ([[1,2,3,4]], 4),
+                ([[]], 0),
+            ]
+        elif canonical_bug_id == "division_by_zero":
+            return [
+                ([[]], 0),
+                ([[1,2,3]], 2.0),
+            ]
+        elif canonical_bug_id == "wrong_operator":
+            return [
+                ([5,3], 8),
+                ([-1,1], 0),
+            ]
+        else:
+            # For missing_lock, deadlock_order, etc., return an empty list;
+            # with no cases, total is 0 and run_tests() reports a score of 0.0.
+            return []
+    def _generate_fuzzed_cases(self, canonical_bug_id: str, func_name: str) -> List[Tuple[List[Any], Any]]:
+        """
+        Generate random test cases to prevent memorisation.
+        Only for bugs where meaningful fuzzing is possible.
+        """
+        cases = []
+        if canonical_bug_id == "null_check":
+            # Random users dictionary and random ids
+            for _ in range(self.fuzz_rounds):
+                users = {f"user_{i}": f"name_{i}" for i in range(random.randint(1, 5))}
+                # Pick existing or missing key
+                if random.random() > 0.5:
+                    key = random.choice(list(users.keys()))
+                    expected = users[key]
+                else:
+                    key = "missing_" + str(random.randint(100, 999))
+                    expected = None
+                cases.append(([{"users": users, "id": key}], expected))
+        elif canonical_bug_id == "off_by_one":
+            for _ in range(self.fuzz_rounds):
+                length = random.randint(0, 10)
+                arr = list(range(length))
+                cases.append(([arr], length))
+        elif canonical_bug_id == "division_by_zero":
+            for _ in range(self.fuzz_rounds):
+                length = random.randint(0, 10)
+                data = [random.randint(-100, 100) for _ in range(length)]
+                expected = sum(data)/length if length else 0
+                cases.append(([data], expected))
+        elif canonical_bug_id == "wrong_operator":
+            for _ in range(self.fuzz_rounds):
+                a = random.randint(-100, 100)
+                b = random.randint(-100, 100)
+                cases.append(([a, b], a + b))
+        return cases
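
Review note on `_generate_test_script`: the generated script receives its test cases as text, and JSON text is not always valid Python source, since `None` serializes to `null`. The sketch below contrasts two ways of embedding cases. The helper names `embed_naive`, `embed_safe`, and `run_snippet` are hypothetical and not part of test_runner.py; this is an illustration under those assumptions, not the harness itself.

```python
import json


def embed_naive(cases):
    """Paste JSON text directly into Python source (breaks on null/true/false)."""
    return f"test_cases = {json.dumps(cases)}"


def embed_safe(cases):
    """Embed the JSON text as a string literal and decode it at run time."""
    return f"import json\ntest_cases = json.loads({json.dumps(cases)!r})"


def run_snippet(source):
    """Execute a generated snippet and return the test_cases it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["test_cases"]


# One case expects None, mirroring the null_check family.
cases = [
    ([{"users": {"alice": "Alice"}, "id": "bob"}], None),
    ([[1, 2, 3]], 2.0),
]

try:
    run_snippet(embed_naive(cases))
    naive_ok = True
except NameError:
    # `null` is not a defined name in Python.
    naive_ok = False

decoded = run_snippet(embed_safe(cases))
```

Decoding at run time keeps `None` intact. Tuples come back as lists, which still unpack as `args, expected` in the test loop.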

training.py CHANGED Viewed

@@ -1,708 +1,792 @@
-# training.py
-import json
-import torch
-import torch.nn.functional as F
-from torch.optim import AdamW
-from dataclasses import dataclass
-from typing import List, Dict, Tuple, Optional
-import numpy as np
-import re
-import random
-from unsloth import FastLanguageModel
-from transformers import TrainingArguments
-from trl import SFTTrainer
-from datasets import Dataset
-# Import your environment and actions (unchanged)
-from environment import CodeReviewEnv
-from models import (
-    RunTests, RunLinter, Inspect,
-    ProposeFix, WriteComment, AskQuestion,
-    Done, Skip , QueryDocs
-)
-# ======================================================================
-# 1. ACTION PARSING (improved with fallback)
-# ======================================================================
-@dataclass
-class AgentAction:
-    action_type: str
-    content: Optional[str] = None
-def parse_action(output: str) -> AgentAction:
-    """Robust JSON parsing with regex fallback and keyword detection."""
-    # Try strict JSON first
-    try:
-        data = json.loads(output)
-        return AgentAction(
-            action_type=data.get("action_type", "").lower(),
-            content=data.get("content")
-        )
-    except:
-        pass
-    # Try to extract JSON from markdown blocks
-    json_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', output, re.DOTALL)
-    if json_match:
-        try:
-            data = json.loads(json_match.group(1))
-            return AgentAction(
-                action_type=data.get("action_type", "").lower(),
-                content=data.get("content")
-            )
-        except:
-            pass
-    # Try to find "action_type" field with regex
-    action_pattern = r'"action_type"\s*:\s*"(\w+)"'
-    match = re.search(action_pattern, output)
-    if match:
-        return AgentAction(action_type=match.group(1).lower())
-    # Keyword detection as last resort
-    output_lower = output.lower()
-    if "test" in output_lower:
-        return AgentAction("run_tests")
-    if "lint" in output_lower:
-        return AgentAction("run_linter")
-    if "inspect" in output_lower:
-        return AgentAction("inspect")
-    return AgentAction("invalid", output)
-def map_to_env(action: AgentAction):
-    if action.action_type == "run_tests":
-        return RunTests()
-    elif action.action_type == "run_linter":
-        return RunLinter()
-    elif action.action_type == "inspect":
-        return Inspect()
-    elif action.action_type == "fix":
-        return ProposeFix(fix_code=action.content or "")
-    elif action.action_type == "comment":
-        return WriteComment(comment_text=action.content or "")
-    elif action.action_type == "question":
-        return AskQuestion(question=action.content or "")
-    elif action.action_type == "query_docs":               # <-- new
-        return QueryDocs(query_topic=action.content or "")
-    elif action.action_type == "done":
-        return Done()
-    else:
-        return Skip()
-# ======================================================================
-# 2. MODEL SETUP (stabilised LoRA)
-# ======================================================================
-def load_model():
-    model, tokenizer = FastLanguageModel.from_pretrained(
-        model_name="unsloth/gemma-2-2b-it-bnb-4bit",
-        max_seq_length=2048,
-        load_in_4bit=True,
-    )
-    # FIXED: Lower rank (16), dropout=0 for stability
-    model = FastLanguageModel.get_peft_model(
-        model,
-        r=16,                     # was 64 → causes collapse
-        target_modules=[
-            "q_proj", "k_proj", "v_proj", "o_proj",
-            "gate_proj", "up_proj", "down_proj"
-        ],
-        lora_alpha=32,            # adjusted for r=16
-        lora_dropout=0.0,         # dropout can cause empty outputs
-    )
-    # Ensure tokenizer has correct chat template for Gemma-2
-    if tokenizer.chat_template is None:
-        tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}<start_of_turn>user\n{{ message['content'] }}<end_of_turn>\n<start_of_turn>model\n{% elif message['role'] == 'assistant' %}{{ message['content'] }}<end_of_turn>\n{% endif %}{% endfor %}"
-    return model, tokenizer
-# ======================================================================
-# 3. MODEL SANITY CHECK (new – ensures model can generate text)
-# ======================================================================
-def test_model_sanity(model, tokenizer) -> bool:
-    print("\n" + "="*60)
-    print("SANITY CHECK: Testing base model generation")
-    print("="*60)
-    test_prompt = "Hello, how are you?"
-    messages = [{"role": "user", "content": test_prompt}]
-    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-    inputs = tokenizer(formatted, return_tensors="pt").to("cuda")
-    with torch.no_grad():
-        outputs = model.generate(
-            **inputs,
-            max_new_tokens=30,
-            do_sample=True,
-            temperature=0.7,
-            min_new_tokens=1,
-            eos_token_id=tokenizer.eos_token_id,
-            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
-        )
-    generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
-    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
-    print(f"Prompt: {test_prompt}")
-    print(f"Response: {repr(response)}")
-    if len(response) == 0:
-        print("❌ Model produces empty output – cannot train.")
-        return False
-    print("✓ Model sanity check PASSED\n")
-    return True
-# ======================================================================
-# 4. SUPERVISED WARM-UP (teaches JSON output)
-# ======================================================================
-def supervised_warmup(model, tokenizer, n_examples=500, epochs=2):
-    print("\n" + "="*60)
-    print("SUPERVISED WARM-UP: Teaching JSON format")
-    print("="*60)
-    examples = []
-    action_templates = [
-        '{"action_type": "run_tests"}',
-        '{"action_type": "run_linter"}',
-        '{"action_type": "inspect"}',
-        '{"action_type": "fix", "content": "def corrected():\n    pass"}',
-        '{"action_type": "comment", "content": "This looks good."}',
-        '{"action_type": "question", "content": "Why is this variable used?"}',
-        '{"action_type": "done"}',
-    ]
-    for i in range(n_examples):
-        code = f"def example_{i}():\n    return {i % 10}"
-        last_outputs = [
-            "Tests passed: 2/3",
-            "Linter found 1 error",
-            "Inspection complete",
-            "No previous action",
-        ]
-        last_output = random.choice(last_outputs)
-        # Use same prompt structure as build_prompt
-        prompt = f"""You are an AI code review agent. Your goal is to convince a simulated human developer to accept your proposed fix and name your proposed fix function fix.
-The developer has a **{author_personality}** personality and will only accept if you provide solid evidence:
-- Tests pass (high pass ratio)
-- Lint is clean (zero errors)
-- Documentation or references are provided
-- Your reasoning is clear, uses words like "because" or "therefore", and is detailed (over 30 words if needed)
-Workflow:
-1. Use `inspect` to understand the code.
-2. Use `run_tests` and `run_linter` to gather evidence.
-3. Propose a fix (`fix`) and explain why it works (`comment` or `question`).
-4. If the developer pushes back, read their response carefully and address their specific concern.
-5. Once convinced, use `done` to finish.
-Code:
-{obs.code_snippet}
-Author says:
-{author_msg if author_msg else "(no response yet – start with inspection)"}
-Last tool output:
-{tool_output if tool_output else "(none)"}
-Available actions:
-run_tests, run_linter, inspect, fix, comment, question, done
-Respond ONLY in JSON:
-{{"action_type": "...", "content": "..."}}"""
-        action_json = random.choice(action_templates)
-        messages = [
-            {"role": "user", "content": prompt},
-            {"role": "assistant", "content": action_json}
-        ]
-        full_text = tokenizer.apply_chat_template(messages, tokenize=False)
-        examples.append({"text": full_text})
-    dataset = Dataset.from_list(examples)
-    trainer = SFTTrainer(
-        model=model,
-        tokenizer=tokenizer,
-        train_dataset=dataset,
-        dataset_text_field="text",
-        max_seq_length=512,
-        args=TrainingArguments(
-            output_dir="warmup_output",
-            num_train_epochs=epochs,
-            per_device_train_batch_size=4,
-            gradient_accumulation_steps=2,
-            learning_rate=2e-5,
-            logging_steps=50,
-            save_strategy="no",
-            fp16=True,
-        ),
-    )
-    print(f"Training on {n_examples} examples for {epochs} epochs...")
-    trainer.train()
-    print("✓ Warm-up complete\n")
-# ======================================================================
-# 5. ACTION GENERATION WITH LOGPROB TRACKING (fixed)
-# ======================================================================
-def generate_action_with_logprob(
-    prompt: str,
-    model,
-    tokenizer,
-    temperature: float = 0.0,   # changed: greedy by default for stability
-    max_retries: int = 2
-) -> Tuple[str, float]:
-    """Generate action using correct chat template, with fallback."""
-    messages = [{"role": "user", "content": prompt}]
-    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-    inputs = tokenizer(formatted, return_tensors="pt").to("cuda")
-    for attempt in range(max_retries):
-        with torch.no_grad():
-            outputs = model.generate(
-                **inputs,
-                max_new_tokens=128,
-                do_sample=(temperature > 0),
-                temperature=max(temperature, 0.01) if temperature > 0 else 1.0,
-                min_new_tokens=1,
-                return_dict_in_generate=True,
-                output_scores=True,
-            )
-        generated_ids = outputs.sequences[0][inputs['input_ids'].shape[1]:]
-        action_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
-        # Compute logprob
-        logprobs = []
-        for idx, token_id in enumerate(generated_ids):
-            if idx < len(outputs.scores):
-                token_logits = outputs.scores[idx][0]
-                token_logprob = F.log_softmax(token_logits, dim=-1)[token_id].item()
-                logprobs.append(token_logprob)
-        total_logprob = sum(logprobs) if logprobs else -100.0
-        # If empty, use fallback
-        if not action_text:
-            fallback_actions = [
-                '{"action_type": "run_tests"}',
-                '{"action_type": "run_linter"}',
-                '{"action_type": "inspect"}',
-                '{"action_type": "skip"}',
-            ]
-            action_text = random.choice(fallback_actions)
-            total_logprob = -50.0
-            print(f"[WARN] Empty generation → using fallback: {action_text}")
-            return action_text, total_logprob
-        # Validate JSON
-        try:
-            json.loads(action_text)
-            return action_text, total_logprob
-        except:
-            if attempt == max_retries - 1:
-                return '{"action_type":"skip"}', -100.0
-            continue
-    return '{"action_type":"skip"}', -100.0
-# ======================================================================
-# 6. PROMPT BUILDER (unchanged – exactly as you wrote)
-# ======================================================================
-def build_prompt(obs, history_lines: List[str]) -> str:
-    author_msg = getattr(obs, "author_response", "") or ""
-    tool_output = getattr(obs, "last_tool_output", "") or ""
-    # Personality hint (optional but helpful)
-    author_personality = getattr(obs, "author_personality", "defensive")  # e.g., from env
-    prompt = f"""You are an AI code review agent. Your goal is to convince a simulated human developer to accept your proposed fix and name your proposed fix function fix.
-The developer has a **{author_personality}** personality and will only accept if you provide solid evidence:
-- Tests pass (high pass ratio)
-- Lint is clean (zero errors)
-- Documentation or references are provided
-- Your reasoning is clear, uses words like "because" or "therefore", and is detailed (over 30 words if needed)
-Workflow:
-1. Use `inspect` to understand the code.
-2. Use `run_tests` and `run_linter` to gather evidence.
-3. Propose a fix (`fix`) and explain why it works (`comment` or `question`).
-4. If the developer pushes back, read their response carefully and address their specific concern.
-5. Once convinced, use `done` to finish.
-Code:
-{obs.code_snippet}
-Author says:
-{author_msg if author_msg else "(no response yet – start with inspection)"}
-Last tool output:
-{tool_output if tool_output else "(none)"}
-Available actions:
-run_tests, run_linter, inspect, fix, comment, question, done
-Respond ONLY in JSON:
-{{"action_type": "...", "content": "..."}}"""
-    if history_lines:
-        history = "\n".join(history_lines[-6:])
-        prompt += f"\n\nPrevious steps:\n{history}"
-    return prompt
-# ======================================================================
-# 7. TRAJECTORY STORAGE (unchanged)
-# ======================================================================
-@dataclass
-class Trajectory:
-    states: List[str]
-    actions: List[str]
-    rewards: List[float]
-    logprobs: List[float]
-    dones: List[bool]
-    def __len__(self):
-        return len(self.states)
-    def to_dict(self):
-        return {
-            "states": self.states,
-            "actions": self.actions,
-            "rewards": self.rewards,
-            "logprobs": self.logprobs,
-            "dones": self.dones,
-        }
-# ======================================================================
-# 8. ROLLOUT COLLECTION (uses fixed generate)
-# ======================================================================
-def collect_trajectory(
-    env: CodeReviewEnv,
-    model,
-    tokenizer,
-    max_steps: int = 10,
-    temperature: float = 0.0   # changed to greedy
-) -> Trajectory:
-    obs = env.reset()
-    history_lines = []
-    states = []
-    actions = []
-    rewards = []
-    logprobs = []
-    dones = []
-    for step in range(max_steps):
-        prompt = build_prompt(obs, history_lines)
-        states.append(prompt)
-        action_text, logprob = generate_action_with_logprob(
-            prompt, model, tokenizer, temperature
-        )
-        actions.append(action_text)
-        logprobs.append(logprob)
-        action = parse_action(action_text)
-        env_action = map_to_env(action)
-        next_obs, reward, done, _ = env.step(env_action)
-        rewards.append(reward.value)
-        dones.append(done)
-        history_lines.append(f"Agent: {action_text}")
-        history_lines.append(f"Env: {next_obs.last_tool_output}")
-        obs = next_obs
-        if done:
-            break
-    return Trajectory(states, actions, rewards, logprobs, dones)
-def collect_trajectories(
-    env: CodeReviewEnv,
-    model,
-    tokenizer,
-    n_trajectories: int,
-    max_steps: int = 10
-) -> List[Trajectory]:
-    trajectories = []
-    for i in range(n_trajectories):
-        traj = collect_trajectory(env, model, tokenizer, max_steps)
-        total_reward = sum(traj.rewards)
-        print(f"Trajectory {i+1}/{n_trajectories}: "
-              f"steps={len(traj)}, reward={total_reward:.3f}")
-        trajectories.append(traj)
-    return trajectories
-# ======================================================================
-# 9. ADVANTAGE ESTIMATION (unchanged)
-# ======================================================================
-def compute_returns_and_advantages(
-    rewards: List[float],
-    dones: List[bool],
-    gamma: float = 0.99,
-    standardize: bool = True
-) -> Tuple[List[float], List[float]]:
-    """
-    Computes discounted returns and normalised advantages (no critic).
-    Advantages = returns - mean(returns)  (or zero baseline).
-    """
-    n = len(rewards)
-    returns = [0.0] * n
-    running_return = 0.0
-    for t in reversed(range(n)):
-        if dones[t]:
-            running_return = 0.0
-        running_return = rewards[t] + gamma * running_return
-        returns[t] = running_return
-    if standardize:
-        advantages = np.array(returns) - np.mean(returns)
-        adv_std = np.std(advantages) + 1e-8
-        advantages = (advantages / adv_std).tolist()
-    else:
-        advantages = returns.copy()
-    return advantages, returns
-# ======================================================================
-# 10. COMPUTE NEW LOGPROBS (unchanged)
-# ======================================================================
-def compute_logprob(prompt: str, action: str, model, tokenizer) -> float:
-    messages = [{"role": "user", "content": prompt}]
-    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-    full_text = formatted + action
-    inputs = tokenizer(full_text, return_tensors="pt").to("cuda")
-    with torch.no_grad():
-        outputs = model(**inputs)
-        logits = outputs.logits
-    action_ids = tokenizer.encode(action, add_special_tokens=False)
-    prefix_ids = tokenizer.encode(formatted, add_special_tokens=False)
-    action_start = len(prefix_ids)
-    logprobs = []
-    for idx, token_id in enumerate(action_ids):
-        position = action_start + idx - 1
-        if 0 <= position < logits.shape[1]:
-            token_logits = logits[0, position]
-            token_logprob = F.log_softmax(token_logits, dim=-1)[token_id].item()
-            logprobs.append(token_logprob)
-    return sum(logprobs) if logprobs else -100.0
-# ======================================================================
-# 11. PPO UPDATE (unchanged except uses compute_logprob correctly)
-# ======================================================================
-def ppo_update(
-    trajectories: List[Trajectory],
-    model,
-    tokenizer,
-    optimizer,
-    n_epochs: int = 4,
-    clip_epsilon: float = 0.2,
-    entropy_coef: float = 0.01,
-    gamma: float = 0.99,
-) -> Dict[str, float]:
-    model.train()
-    all_states = []
-    all_actions = []
-    all_old_logprobs = []
-    all_advantages = []
-    all_returns = []
-    for traj in trajectories:
-        advantages, returns = compute_returns_and_advantages(
-            traj.rewards, traj.dones, gamma=gamma, standardize=True
-        )
-        all_states.extend(traj.states)
-        all_actions.extend(traj.actions)
-        all_old_logprobs.extend(traj.logprobs)
-        all_advantages.extend(advantages)
-        all_returns.extend(returns)
-    n_samples = len(all_states)
-    total_loss = 0.0
-    total_policy_loss = 0.0
-    total_entropy = 0.0
-    n_updates = 0
-    for epoch in range(n_epochs):
-        indices = np.random.permutation(n_samples)
-        for i in indices:
-            state = all_states[i]
-            action = all_actions[i]
-            old_logprob = all_old_logprobs[i]
-            advantage = all_advantages[i]
-            # Use the same chat template for PPO update
-            messages = [{"role": "user", "content": state}]
-            formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-            full_text = formatted + action
-            inputs = tokenizer(full_text, return_tensors="pt").to("cuda")
-            outputs = model(**inputs)
-            logits = outputs.logits
-            action_ids = tokenizer.encode(action, add_special_tokens=False)
-            prefix_ids = tokenizer.encode(formatted, add_special_tokens=False)
-            action_start = len(prefix_ids)
-            logprobs = []
-            entropy = 0.0
-            for idx, token_id in enumerate(action_ids):
-                position = action_start + idx - 1
-                if 0 <= position < logits.shape[1]:
-                    token_logits = logits[0, position]
-                    log_probs = F.log_softmax(token_logits, dim=-1)
-                    token_logprob = log_probs[token_id]
-                    logprobs.append(token_logprob)
-                    probs = F.softmax(token_logits, dim=-1)
-                    entropy += -(probs * log_probs).sum()
-            if not logprobs:
-                continue
-            new_logprob = sum(logprobs)
-            avg_entropy = entropy / len(logprobs) if logprobs else 0.0
-            ratio = torch.exp(new_logprob - old_logprob)
-            surr1 = ratio * advantage
-            surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantage
-            policy_loss = -torch.min(surr1, surr2)
-            loss = policy_loss - entropy_coef * avg_entropy
-            optimizer.zero_grad()
-            loss.backward()
-            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
-            optimizer.step()
-            total_loss += loss.item()
-            total_policy_loss += policy_loss.item()
-            total_entropy += avg_entropy.item()
-            n_updates += 1
-    return {
-        "loss": total_loss / n_updates if n_updates > 0 else 0.0,
-        "policy_loss": total_policy_loss / n_updates if n_updates > 0 else 0.0,
-        "entropy": total_entropy / n_updates if n_updates > 0 else 0.0,
-    }
-# ======================================================================
-# 12. EVALUATION (unchanged)
-# ======================================================================
-def evaluate_policy(
-    env: CodeReviewEnv,
-    model,
-    tokenizer,
-    n_episodes: int = 10,
-    max_steps: int = 10
-) -> Dict[str, float]:
-    model.eval()
-    total_rewards = []
-    episode_lengths = []
-    success_count = 0
-    for _ in range(n_episodes):
-        traj = collect_trajectory(env, model, tokenizer, max_steps, temperature=0.0)
-        total_reward = sum(traj.rewards)
-        total_rewards.append(total_reward)
-        episode_lengths.append(len(traj))
-        if total_reward > 0.5:
-            success_count += 1
-    return {
-        "avg_reward": np.mean(total_rewards),
-        "std_reward": np.std(total_rewards),
-        "avg_length": np.mean(episode_lengths),
-        "success_rate": success_count / n_episodes,
-    }
-# ======================================================================
-# 13. MAIN TRAINING LOOP (added sanity check and warm-up)
-# ======================================================================
-def train_ppo(
-    n_iterations: int = 50,
-    trajectories_per_iter: int = 10,
-    n_epochs: int = 4,
-    max_steps: int = 10,
-    learning_rate: float = 3e-5,
-    clip_epsilon: float = 0.2,
-    entropy_coef: float = 0.01,
-    gamma: float = 0.99,
-    eval_every: int = 5,
-):
-    print("Loading model...")
-    model, tokenizer = load_model()
-    # NEW: Sanity check before any training
-    if not test_model_sanity(model, tokenizer):
-        print("\n❌ Model sanity check failed – cannot proceed.")
-        return
-    # NEW: Supervised warm-up to teach JSON format
-    supervised_warmup(model, tokenizer, n_examples=500, epochs=2)
-    optimizer = AdamW(model.parameters(), lr=learning_rate)
-    env = CodeReviewEnv()
-    print(f"\n{'='*60}")
-    print(f"Starting PPO Training")
-    print(f"Iterations: {n_iterations}")
-    print(f"Trajectories per iteration: {trajectories_per_iter}")
-    print(f"PPO epochs: {n_epochs}")
-    print(f"{'='*60}\n")
-    for iteration in range(n_iterations):
-        print(f"\n--- Iteration {iteration + 1}/{n_iterations} ---")
-        print("Collecting trajectories...")
-        trajectories = collect_trajectories(
-            env, model, tokenizer, trajectories_per_iter, max_steps
-        )
-        avg_reward = np.mean([sum(t.rewards) for t in trajectories])
-        avg_length = np.mean([len(t) for t in trajectories])
-        print(f"Avg reward: {avg_reward:.3f}")
-        print(f"Avg length: {avg_length:.1f}")
-        print("Updating policy...")
-        metrics = ppo_update(
-            trajectories,
-            model,
-            tokenizer,
-            optimizer,
-            n_epochs=n_epochs,
-            clip_epsilon=clip_epsilon,
-            entropy_coef=entropy_coef,
-            gamma=gamma,
-        )
-        print(f"Loss: {metrics['loss']:.4f}")
-        print(f"Policy loss: {metrics['policy_loss']:.4f}")
-        print(f"Entropy: {metrics['entropy']:.4f}")
-        if (iteration + 1) % eval_every == 0:
-            print("\nEvaluating policy...")
-            eval_metrics = evaluate_policy(env, model, tokenizer, n_episodes=10)
-            print(f"Eval avg reward: {eval_metrics['avg_reward']:.3f} ± {eval_metrics['std_reward']:.3f}")
-            print(f"Eval success rate: {eval_metrics['success_rate']:.2%}")
-            print(f"Eval avg length: {eval_metrics['avg_length']:.1f}")
-    print("\n" + "="*60)
-    print("Training complete. Saving model...")
-    model.save_pretrained("ppo_final_model")
-    tokenizer.save_pretrained("ppo_final_model")
-    print("Model saved to ppo_final_model/")
-    print("="*60)
-# ======================================================================
-# 14. ENTRY POINT (unchanged)
-# ======================================================================
-if __name__ == "__main__":
-    train_ppo(
-        n_iterations=50,
-        trajectories_per_iter=10,
-        n_epochs=4,
-        max_steps=10,
-        learning_rate=3e-5,
-        clip_epsilon=0.2,
-        entropy_coef=0.01,
-        gamma=0.99,
-        eval_every=5,
-    )

+# training.py
+import json
+import os
+import torch
+import torch.nn.functional as F
+from torch.optim import AdamW
+from dataclasses import dataclass
+from typing import List, Dict, Tuple, Optional
+import numpy as np
+import re
+import random
+import matplotlib.pyplot as plt
+from unsloth import FastLanguageModel
+from transformers import TrainingArguments
+from trl import SFTTrainer
+from datasets import Dataset
+# Import your environment and actions (unchanged)
+from environment import CodeReviewEnv
+from redteam import BUG_DB
+from models import (
+    RunTests, RunLinter, Inspect,
+    ProposeFix, WriteComment, AskQuestion,
+    Done, Skip, QueryDocs
+)
+# ======================================================================
+# 1. ACTION PARSING (improved with fallback)
+# ======================================================================
+@dataclass
+class AgentAction:
+    action_type: str
+    content: Optional[str] = None
+def parse_action(output: str) -> AgentAction:
+    """Robust JSON parsing with regex fallback and keyword detection."""
+    # Try strict JSON first
+    try:
+        data = json.loads(output)
+        return AgentAction(
+            action_type=data.get("action_type", "").lower(),
+            content=data.get("content")
+        )
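+    except (json.JSONDecodeError, AttributeError, TypeError):
+        # Not valid JSON, or valid JSON that is not an object (e.g. a list).
+        pass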
+    # Try to extract JSON from markdown blocks
+    json_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', output, re.DOTALL)
+    if json_match:
+        try:
+            data = json.loads(json_match.group(1))
+            return AgentAction(
+                action_type=data.get("action_type", "").lower(),
+                content=data.get("content")
+            )
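+        except (json.JSONDecodeError, AttributeError, TypeError):
+            # The fenced block did not contain a usable JSON object.
+            pass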
+    # Try to find "action_type" field with regex
+    action_pattern = r'"action_type"\s*:\s*"(\w+)"'
+    match = re.search(action_pattern, output)
+    if match:
+        return AgentAction(action_type=match.group(1).lower())
+    # Keyword detection as last resort
+    output_lower = output.lower()
+    if "test" in output_lower:
+        return AgentAction("run_tests")
+    if "lint" in output_lower:
+        return AgentAction("run_linter")
+    if "inspect" in output_lower:
+        return AgentAction("inspect")
+    if "doc" in output_lower:  # also matches "docs" and "documentation"
+        # Bridge natural language mentions to rltool-backed retrieval action.
+        return AgentAction("query_docs", "bug fix guidance")
+    return AgentAction("invalid", output)
+def map_to_env(action: AgentAction):
+    if action.action_type == "run_tests":
+        return RunTests()
+    elif action.action_type == "run_linter":
+        return RunLinter()
+    elif action.action_type == "inspect":
+        return Inspect()
+    elif action.action_type == "fix":
+        return ProposeFix(fix_code=action.content or "")
+    elif action.action_type == "comment":
+        return WriteComment(comment_text=action.content or "")
+    elif action.action_type == "question":
+        return AskQuestion(question=action.content or "")
+    elif action.action_type == "query_docs":               # <-- new
+        return QueryDocs(query_topic=action.content or "")
+    elif action.action_type == "done":
+        return Done()
+    else:
+        return Skip()
+# ======================================================================
+# 2. MODEL SETUP (stabilised LoRA)
+# ======================================================================
+def load_model():
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name="unsloth/gemma-2-2b-it-bnb-4bit",
+        max_seq_length=2048,
+        load_in_4bit=True,
+    )
+    # FIXED: Lower rank (16), dropout=0 for stability
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=16,                     # was 64 → causes collapse
+        target_modules=[
+            "q_proj", "k_proj", "v_proj", "o_proj",
+            "gate_proj", "up_proj", "down_proj"
+        ],
+        lora_alpha=32,            # adjusted for r=16
+        lora_dropout=0.0,         # dropout can cause empty outputs
+    )
+    # Ensure tokenizer has correct chat template for Gemma-2
+    if tokenizer.chat_template is None:
+        tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}<start_of_turn>user\n{{ message['content'] }}<end_of_turn>\n<start_of_turn>model\n{% elif message['role'] == 'assistant' %}{{ message['content'] }}<end_of_turn>\n{% endif %}{% endfor %}"
+    return model, tokenizer
+# ======================================================================
+# 3. MODEL SANITY CHECK (new – ensures model can generate text)
+# ======================================================================
+def test_model_sanity(model, tokenizer) -> bool:
+    print("\n" + "="*60)
+    print("SANITY CHECK: Testing base model generation")
+    print("="*60)
+    test_prompt = "Hello, how are you?"
+    messages = [{"role": "user", "content": test_prompt}]
+    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    inputs = tokenizer(formatted, return_tensors="pt").to("cuda")
+    with torch.no_grad():
+        outputs = model.generate(
+            **inputs,
+            max_new_tokens=30,
+            do_sample=True,
+            temperature=0.7,
+            min_new_tokens=1,
+            eos_token_id=tokenizer.eos_token_id,
+            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
+        )
+    generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
+    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
+    print(f"Prompt: {test_prompt}")
+    print(f"Response: {repr(response)}")
+    if len(response) == 0:
+        print("❌ Model produces empty output – cannot train.")
+        return False
+    print("✓ Model sanity check PASSED\n")
+    return True
+# ======================================================================
+# 4. SUPERVISED WARM-UP (teaches JSON output)
+# ======================================================================
+def supervised_warmup(model, tokenizer, n_examples=500, epochs=2):
+    print("\n" + "="*60)
+    print("SUPERVISED WARM-UP: Teaching JSON format")
+    print("="*60)
+    examples = []
+    action_templates = [
+        '{"action_type": "run_tests"}',
+        '{"action_type": "run_linter"}',
+        '{"action_type": "inspect"}',
+        '{"action_type": "query_docs", "content": "python keyerror handling"}',
+        # The newline is written as \\n so the target string stays valid JSON.
+        '{"action_type": "fix", "content": "def corrected():\\n    pass"}',
+        '{"action_type": "comment", "content": "This looks good."}',
+        '{"action_type": "question", "content": "Why is this variable used?"}',
+        '{"action_type": "done"}',
+    ]
+    for i in range(n_examples):
+        code = f"def example_{i}():\n    return {i % 10}"
+        last_outputs = [
+            "Tests passed: 2/3",
+            "Linter found 1 error",
+            "Inspection complete",
+            "No previous action",
+        ]
+        last_output = random.choice(last_outputs)
+        # build_prompt reads these values from the observation; the warm-up has
+        # no environment, so synthetic stand-ins are defined here.
+        author_personality = "defensive"  # same default as build_prompt
+        author_msg = ""
+        tool_output = last_output
+        # Use same prompt structure as build_prompt
+        prompt = f"""You are an AI code review agent. Your goal is to convince a simulated human developer to accept your proposed fix and name your proposed fix function fix.
+The developer has a **{author_personality}** personality and will only accept if you provide solid evidence:
+- Tests pass (high pass ratio)
+- Lint is clean (zero errors)
+- Documentation or references are provided
+- Your reasoning is clear, uses words like "because" or "therefore", and is detailed (over 30 words if needed)
+Workflow:
+1. Use `inspect` to understand the code.
+2. Use `run_tests` and `run_linter` to gather evidence.
+3. Propose a fix (`fix`) and explain why it works (`comment` or `question`).
+4. If the developer pushes back, read their response carefully and address their specific concern.
+5. Once convinced, use `done` to finish.
+Code:
+{code}
+Author says:
+{author_msg if author_msg else "(no response yet – start with inspection)"}
+Last tool output:
+{tool_output if tool_output else "(none)"}
+Available actions:
+run_tests, run_linter, inspect, query_docs, fix, comment, question, done
+Respond ONLY in JSON:
+{{"action_type": "...", "content": "..."}}"""
+        action_json = random.choice(action_templates)
+        messages = [
+            {"role": "user", "content": prompt},
+            {"role": "assistant", "content": action_json}
+        ]
+        full_text = tokenizer.apply_chat_template(messages, tokenize=False)
+        examples.append({"text": full_text})
+    dataset = Dataset.from_list(examples)
+    trainer = SFTTrainer(
+        model=model,
+        tokenizer=tokenizer,
+        train_dataset=dataset,
+        dataset_text_field="text",
+        max_seq_length=512,
+        args=TrainingArguments(
+            output_dir="warmup_output",
+            num_train_epochs=epochs,
+            per_device_train_batch_size=4,
+            gradient_accumulation_steps=2,
+            learning_rate=2e-5,
+            logging_steps=50,
+            save_strategy="no",
+            fp16=True,
+        ),
+    )
+    print(f"Training on {n_examples} examples for {epochs} epochs...")
+    trainer.train()
+    print("✓ Warm-up complete\n")
+# ======================================================================
+# 5. ACTION GENERATION WITH LOGPROB TRACKING (fixed)
+# ======================================================================
+def generate_action_with_logprob(
+    prompt: str,
+    model,
+    tokenizer,
+    temperature: float = 0.0,   # changed: greedy by default for stability
+    max_retries: int = 2
+) -> Tuple[str, float]:
+    """Generate action using correct chat template, with fallback."""
+    messages = [{"role": "user", "content": prompt}]
+    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    inputs = tokenizer(formatted, return_tensors="pt").to("cuda")
+    for attempt in range(max_retries):
+        with torch.no_grad():
+            outputs = model.generate(
+                **inputs,
+                max_new_tokens=128,
+                do_sample=(temperature > 0),
+                temperature=max(temperature, 0.01) if temperature > 0 else 1.0,
+                min_new_tokens=1,
+                return_dict_in_generate=True,
+                output_scores=True,
+            )
+        generated_ids = outputs.sequences[0][inputs['input_ids'].shape[1]:]
+        action_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
+        # Compute logprob
+        logprobs = []
+        for idx, token_id in enumerate(generated_ids):
+            if idx < len(outputs.scores):
+                token_logits = outputs.scores[idx][0]
+                token_logprob = F.log_softmax(token_logits, dim=-1)[token_id].item()
+                logprobs.append(token_logprob)
+        total_logprob = sum(logprobs) if logprobs else -100.0
+        # If empty, use fallback
+        if not action_text:
+            fallback_actions = [
+                '{"action_type": "run_tests"}',
+                '{"action_type": "run_linter"}',
+                '{"action_type": "inspect"}',
+                '{"action_type": "skip"}',
+            ]
+            action_text = random.choice(fallback_actions)
+            total_logprob = -50.0
+            print(f"[WARN] Empty generation → using fallback: {action_text}")
+            return action_text, total_logprob
+        # Validate JSON
+        try:
+            json.loads(action_text)
+            return action_text, total_logprob
+        except json.JSONDecodeError:
+            if attempt == max_retries - 1:
+                return '{"action_type":"skip"}', -100.0
+            continue
+    return '{"action_type":"skip"}', -100.0
+# ======================================================================
+# 6. PROMPT BUILDER (unchanged – exactly as you wrote)
+# ======================================================================
+def build_prompt(obs, history_lines: List[str]) -> str:
+    author_msg = getattr(obs, "author_response", "") or ""
+    tool_output = getattr(obs, "last_tool_output", "") or ""
+    # Personality hint (optional but helpful)
+    author_personality = getattr(obs, "author_personality", "defensive")  # e.g., from env
+    prompt = f"""You are an AI code review agent. Your goal is to convince a simulated human developer to accept your proposed fix and name your proposed fix function fix.
+The developer has a **{author_personality}** personality and will only accept if you provide solid evidence:
+- Tests pass (high pass ratio)
+- Lint is clean (zero errors)
+- Documentation or references are provided
+- Your reasoning is clear, uses words like "because" or "therefore", and is detailed (over 30 words if needed)
+Workflow:
+1. Use `inspect` to understand the code.
+2. Use `run_tests` and `run_linter` to gather evidence.
+3. Use `query_docs` when you need references or language-specific guidance.
+4. Propose a fix (`fix`) and explain why it works (`comment` or `question`).
+5. If the developer pushes back, read their response carefully and address their specific concern.
+6. Once convinced, use `done` to finish.
+Code:
+{obs.code_snippet}
+Author says:
+{author_msg if author_msg else "(no response yet – start with inspection)"}
+Last tool output:
+{tool_output if tool_output else "(none)"}
+Available actions:
+run_tests, run_linter, inspect, query_docs, fix, comment, question, done
+Respond ONLY in JSON:
+{{"action_type": "...", "content": "..."}}"""
+    if history_lines:
+        history = "\n".join(history_lines[-6:])
+        prompt += f"\n\nPrevious steps:\n{history}"
+    return prompt
+# ======================================================================
+# 7. TRAJECTORY STORAGE (unchanged)
+# ======================================================================
+@dataclass
+class Trajectory:
+    states: List[str]
+    actions: List[str]
+    rewards: List[float]
+    logprobs: List[float]
+    dones: List[bool]
+    def __len__(self):
+        return len(self.states)
+    def to_dict(self):
+        return {
+            "states": self.states,
+            "actions": self.actions,
+            "rewards": self.rewards,
+            "logprobs": self.logprobs,
+            "dones": self.dones,
+        }
+# ======================================================================
+# 8. ROLLOUT COLLECTION (uses fixed generate)
+# ======================================================================
+def collect_trajectory(
+    env: CodeReviewEnv,
+    model,
+    tokenizer,
+    max_steps: int = 10,
+    temperature: float = 0.0   # changed to greedy
+) -> Trajectory:
+    obs = env.reset()
+    history_lines = []
+    states = []
+    actions = []
+    rewards = []
+    logprobs = []
+    dones = []
+    for step in range(max_steps):
+        prompt = build_prompt(obs, history_lines)
+        states.append(prompt)
+        action_text, logprob = generate_action_with_logprob(
+            prompt, model, tokenizer, temperature
+        )
+        actions.append(action_text)
+        logprobs.append(logprob)
+        action = parse_action(action_text)
+        env_action = map_to_env(action)
+        next_obs, reward, done, _ = env.step(env_action)
+        rewards.append(reward.value)
+        dones.append(done)
+        history_lines.append(f"Agent: {action_text}")
+        history_lines.append(f"Env: {next_obs.last_tool_output}")
+        obs = next_obs
+        if done:
+            break
+    return Trajectory(states, actions, rewards, logprobs, dones)
+def collect_trajectories(
+    env: CodeReviewEnv,
+    model,
+    tokenizer,
+    n_trajectories: int,
+    max_steps: int = 10,
+    task_levels: Optional[List[str]] = None,
+    task_weights: Optional[List[float]] = None,
+) -> List[Trajectory]:
+    # Link training to RedTeam's full bug distribution by sampling tasks
+    # per trajectory instead of training only on env default ("easy").
+    if task_levels is None:
+        task_levels = list(BUG_DB.keys())
+    if task_weights is not None and len(task_weights) != len(task_levels):
+        raise ValueError("task_weights must match task_levels length")
+    if task_weights is not None and sum(task_weights) <= 0:
+        raise ValueError("task_weights must have a positive total")
+    trajectories = []
+    for i in range(n_trajectories):
+        # Weighted sampling supports curriculum-style training schedules.
+        sampled_task = random.choices(task_levels, weights=task_weights, k=1)[0]
+        env.set_task(sampled_task)
+        traj = collect_trajectory(env, model, tokenizer, max_steps)
+        total_reward = sum(traj.rewards)
+        print(f"Trajectory {i+1}/{n_trajectories}: "
+              f"task={sampled_task}, steps={len(traj)}, reward={total_reward:.3f}")
+        trajectories.append(traj)
+    return trajectories
+# ======================================================================
+# 9. ADVANTAGE ESTIMATION (unchanged)
+# ======================================================================
+def compute_returns_and_advantages(
+    rewards: List[float],
+    dones: List[bool],
+    gamma: float = 0.99,
+    standardize: bool = True
+) -> Tuple[List[float], List[float]]:
+    """
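+    Compute discounted returns and advantages without a critic.
+    With standardize=True, advantages = (returns - mean(returns)) / std(returns);
+    otherwise the raw returns are used as advantages (zero baseline).
+    Returns the tuple (advantages, returns).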
+    """
+    n = len(rewards)
+    returns = [0.0] * n
+    running_return = 0.0
+    for t in reversed(range(n)):
+        if dones[t]:
+            running_return = 0.0
+        running_return = rewards[t] + gamma * running_return
+        returns[t] = running_return
+    if standardize:
+        advantages = np.array(returns) - np.mean(returns)
+        adv_std = np.std(advantages) + 1e-8
+        advantages = (advantages / adv_std).tolist()
+    else:
+        advantages = returns.copy()
+    return advantages, returns
+# ======================================================================
+# 10. COMPUTE NEW LOGPROBS (unchanged)
+# ======================================================================
+def compute_logprob(prompt: str, action: str, model, tokenizer) -> float:
+    messages = [{"role": "user", "content": prompt}]
+    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    full_text = formatted + action
+    inputs = tokenizer(full_text, return_tensors="pt").to("cuda")
+    with torch.no_grad():
+        outputs = model(**inputs)
+        logits = outputs.logits
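+    # Measure the prefix with the same tokenizer call used for full_text, so any
+    # special token the tokenizer prepends (e.g. BOS) is counted and the action
+    # offset is not shifted by one.
+    action_start = tokenizer(formatted, return_tensors="pt")["input_ids"].shape[1]
+    # Read the action tokens from the encoded full text so ids and positions agree.
+    action_ids = inputs["input_ids"][0, action_start:].tolist()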
+    logprobs = []
+    for idx, token_id in enumerate(action_ids):
+        position = action_start + idx - 1
+        if 0 <= position < logits.shape[1]:
+            token_logits = logits[0, position]
+            token_logprob = F.log_softmax(token_logits, dim=-1)[token_id].item()
+            logprobs.append(token_logprob)
+    return sum(logprobs) if logprobs else -100.0
+# ======================================================================
+# 11. PPO UPDATE (unchanged except uses compute_logprob correctly)
+# ======================================================================
+def ppo_update(
+    trajectories: List[Trajectory],
+    model,
+    tokenizer,
+    optimizer,
+    n_epochs: int = 4,
+    clip_epsilon: float = 0.2,
+    entropy_coef: float = 0.01,
+    gamma: float = 0.99,
+) -> Dict[str, float]:
+    model.train()
+    all_states = []
+    all_actions = []
+    all_old_logprobs = []
+    all_advantages = []
+    all_returns = []
+    for traj in trajectories:
+        advantages, returns = compute_returns_and_advantages(
+            traj.rewards, traj.dones, gamma=gamma, standardize=True
+        )
+        all_states.extend(traj.states)
+        all_actions.extend(traj.actions)
+        all_old_logprobs.extend(traj.logprobs)
+        all_advantages.extend(advantages)
+        all_returns.extend(returns)
+    n_samples = len(all_states)
+    total_loss = 0.0
+    total_policy_loss = 0.0
+    total_entropy = 0.0
+    n_updates = 0
+    for epoch in range(n_epochs):
+        indices = np.random.permutation(n_samples)
+        for i in indices:
+            state = all_states[i]
+            action = all_actions[i]
+            old_logprob = all_old_logprobs[i]
+            advantage = all_advantages[i]
+            # Use the same chat template for PPO update
+            messages = [{"role": "user", "content": state}]
+            formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+            full_text = formatted + action
+            inputs = tokenizer(full_text, return_tensors="pt").to("cuda")
+            outputs = model(**inputs)
+            logits = outputs.logits
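+            # Measure the prefix with the same tokenizer call used for full_text,
+            # so any special token the tokenizer prepends (e.g. BOS) is counted
+            # and the action offset is not shifted by one.
+            action_start = tokenizer(formatted, return_tensors="pt")["input_ids"].shape[1]
+            # Read the action tokens from the encoded full text so ids and positions agree.
+            action_ids = inputs["input_ids"][0, action_start:].tolist()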
+            logprobs = []
+            entropy = 0.0
+            for idx, token_id in enumerate(action_ids):
+                position = action_start + idx - 1
+                if 0 <= position < logits.shape[1]:
+                    token_logits = logits[0, position]
+                    log_probs = F.log_softmax(token_logits, dim=-1)
+                    token_logprob = log_probs[token_id]
+                    logprobs.append(token_logprob)
+                    probs = F.softmax(token_logits, dim=-1)
+                    entropy += -(probs * log_probs).sum()
+            if not logprobs:
+                continue
+            new_logprob = sum(logprobs)
+            avg_entropy = entropy / len(logprobs) if logprobs else 0.0
+            ratio = torch.exp(new_logprob - old_logprob)
+            surr1 = ratio * advantage
+            surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantage
+            policy_loss = -torch.min(surr1, surr2)
+            loss = policy_loss - entropy_coef * avg_entropy
+            optimizer.zero_grad()
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+            optimizer.step()
+            total_loss += loss.item()
+            total_policy_loss += policy_loss.item()
+            total_entropy += avg_entropy.item()
+            n_updates += 1
+    return {
+        "loss": total_loss / n_updates if n_updates > 0 else 0.0,
+        "policy_loss": total_policy_loss / n_updates if n_updates > 0 else 0.0,
+        "entropy": total_entropy / n_updates if n_updates > 0 else 0.0,
+    }
+# ======================================================================
+# 12. EVALUATION (unchanged)
+# ======================================================================
+def evaluate_policy(
+    env: CodeReviewEnv,
+    model,
+    tokenizer,
+    n_episodes: int = 10,
+    max_steps: int = 10
+) -> Dict[str, float]:
+    model.eval()
+    total_rewards = []
+    episode_lengths = []
+    success_count = 0
+    for _ in range(n_episodes):
+        traj = collect_trajectory(env, model, tokenizer, max_steps, temperature=0.0)
+        total_reward = sum(traj.rewards)
+        total_rewards.append(total_reward)
+        episode_lengths.append(len(traj))
+        if total_reward > 0.5:
+            success_count += 1
+    return {
+        "avg_reward": np.mean(total_rewards),
+        "std_reward": np.std(total_rewards),
+        "avg_length": np.mean(episode_lengths),
+        "success_rate": success_count / n_episodes,
+    }
+# ======================================================================
+# 13. MAIN TRAINING LOOP (added sanity check and warm-up)
+# ======================================================================
+def train_ppo(
+    n_iterations: int = 50,
+    trajectories_per_iter: int = 10,
+    n_epochs: int = 4,
+    max_steps: int = 10,
+    learning_rate: float = 3e-5,
+    clip_epsilon: float = 0.2,
+    entropy_coef: float = 0.01,
+    gamma: float = 0.99,
+    eval_every: int = 5,
+    task_levels: Optional[List[str]] = None,
+    curriculum_weighted_sampling: bool = True,
+    reward_profile: str = "full",
+):
+    print("Loading model...")
+    model, tokenizer = load_model()
+    # NEW: Sanity check before any training
+    if not test_model_sanity(model, tokenizer):
+        print("\n❌ Model sanity check failed – cannot proceed.")
+        return
+    # NEW: Supervised warm-up to teach JSON format
+    supervised_warmup(model, tokenizer, n_examples=500, epochs=2)
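+    # Only the LoRA adapter weights are trainable; skip the frozen base weights.
+    optimizer = AdamW(
+        [p for p in model.parameters() if p.requires_grad], lr=learning_rate
+    )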
+    env = CodeReviewEnv(reward_profile=reward_profile)
+    if task_levels is None:
+        task_levels = list(BUG_DB.keys())
+    print(f"\n{'='*60}")
+    print("Starting PPO Training")
+    print(f"Iterations: {n_iterations}")
+    print(f"Trajectories per iteration: {trajectories_per_iter}")
+    print(f"PPO epochs: {n_epochs}")
+    print(f"Reward profile: {reward_profile}")
+    print(f"{'='*60}\n")
+    reward_history: List[float] = []
+    loss_history: List[float] = []
+    for iteration in range(n_iterations):
+        print(f"\n--- Iteration {iteration + 1}/{n_iterations} ---")
+        # Optional weighted curriculum:
+        # start with easier tasks and smoothly ramp difficulty over training.
+        if curriculum_weighted_sampling:
+            progress = (iteration + 1) / max(n_iterations, 1)
+            easy_w = max(0.15, 0.55 - 0.40 * progress)
+            medium_w = max(0.15, 0.25 - 0.10 * progress)
+            hard_w = 0.10 + 0.05 * progress
+            harder_w = 0.05 + 0.20 * progress
+            hardest_w = 0.05 + 0.25 * progress
+            task_weight_map = {
+                "easy": easy_w,
+                "medium": medium_w,
+                "hard": hard_w,
+                "harder": harder_w,
+                "hardest": hardest_w,
+            }
+            task_weights = [task_weight_map.get(level, 1.0) for level in task_levels]
+        else:
+            task_weights = None
+        print("Collecting trajectories...")
+        trajectories = collect_trajectories(
+            env,
+            model,
+            tokenizer,
+            trajectories_per_iter,
+            max_steps,
+            task_levels=task_levels,
+            task_weights=task_weights,
+        )
+        avg_reward = np.mean([sum(t.rewards) for t in trajectories])
+        avg_length = np.mean([len(t) for t in trajectories])
+        reward_history.append(float(avg_reward))
+        print(f"Avg reward: {avg_reward:.3f}")
+        print(f"Avg length: {avg_length:.1f}")
+        print("Updating policy...")
+        metrics = ppo_update(
+            trajectories,
+            model,
+            tokenizer,
+            optimizer,
+            n_epochs=n_epochs,
+            clip_epsilon=clip_epsilon,
+            entropy_coef=entropy_coef,
+            gamma=gamma,
+        )
+        print(f"Loss: {metrics['loss']:.4f}")
+        print(f"Policy loss: {metrics['policy_loss']:.4f}")
+        print(f"Entropy: {metrics['entropy']:.4f}")
+        loss_history.append(float(metrics["loss"]))
+        if (iteration + 1) % eval_every == 0:
+            print("\nEvaluating policy...")
+            eval_metrics = evaluate_policy(env, model, tokenizer, n_episodes=10)
+            print(f"Eval avg reward: {eval_metrics['avg_reward']:.3f} ± {eval_metrics['std_reward']:.3f}")
+            print(f"Eval success rate: {eval_metrics['success_rate']:.2%}")
+            print(f"Eval avg length: {eval_metrics['avg_length']:.1f}")
+    print("\n" + "="*60)
+    print("Training complete. Saving model...")
+    model.save_pretrained("ppo_final_model")
+    tokenizer.save_pretrained("ppo_final_model")
+    print("Model saved to ppo_final_model/")
+    # Save training curves for quick before/after comparisons.
+    # These are intentionally simple line plots to avoid extra dependencies.
+    if reward_history:
+        plt.figure(figsize=(8, 4))
+        plt.plot(range(1, len(reward_history) + 1), reward_history, marker="o")
+        plt.title("Average Reward per Iteration")
+        plt.xlabel("Iteration")
+        plt.ylabel("Average Reward")
+        plt.grid(alpha=0.3)
+        plt.tight_layout()
+        plt.savefig("reward_curve.png", dpi=150)
+        plt.close()
+    if loss_history:
+        plt.figure(figsize=(8, 4))
+        plt.plot(range(1, len(loss_history) + 1), loss_history, marker="o", color="tab:red")
+        plt.title("Training Loss per Iteration")
+        plt.xlabel("Iteration")
+        plt.ylabel("Loss")
+        plt.grid(alpha=0.3)
+        plt.tight_layout()
+        plt.savefig("loss_curve.png", dpi=150)
+        plt.close()
+    if os.path.exists("reward_curve.png") and os.path.exists("loss_curve.png"):
+        print("Saved reward_curve.png and loss_curve.png")
+    print("="*60)
+# ======================================================================
+# 14. ENTRY POINT (unchanged)
+# ======================================================================
+if __name__ == "__main__":
+    train_ppo(
+        n_iterations=50,
+        trajectories_per_iter=10,
+        n_epochs=4,
+        max_steps=10,
+        learning_rate=3e-5,
+        clip_epsilon=0.2,
+        entropy_coef=0.01,
+        gamma=0.99,
+        eval_every=5,
+    )
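
The curriculum weights computed inside `train_ppo` above can be checked in isolation. The sketch below copies the five weight formulas into a standalone helper and prints the sampling weights at the first, middle, and last iteration of a 50-iteration run. The helper name `curriculum_weights` is hypothetical and not part of the commit; only the formulas come from the diff. Under these formulas the weights appear to sum to 1 across the whole progress range, so `random.choices` would treat them directly as sampling probabilities.

```python
from typing import Dict


def curriculum_weights(progress: float) -> Dict[str, float]:
    """Sampling weight per task level for training progress in [0, 1].

    Hypothetical helper: the formulas are copied from train_ppo, but the
    function itself does not exist in the commit.
    """
    return {
        "easy": max(0.15, 0.55 - 0.40 * progress),
        "medium": max(0.15, 0.25 - 0.10 * progress),
        "hard": 0.10 + 0.05 * progress,
        "harder": 0.05 + 0.20 * progress,
        "hardest": 0.05 + 0.25 * progress,
    }


if __name__ == "__main__":
    n_iterations = 50
    for iteration in (0, 24, 49):
        # Same progress definition as the training loop.
        progress = (iteration + 1) / max(n_iterations, 1)
        weights = curriculum_weights(progress)
        summary = ", ".join(f"{level}={w:.3f}" for level, w in weights.items())
        total = sum(weights.values())
        print(f"iteration {iteration + 1:2d}: {summary} (total={total:.3f})")
```

With these numbers, the share of `easy` tasks falls from roughly 54% at the first iteration to 15% at the last, while `hardest` rises from about 5.5% to 30%. Any task level missing from the map falls back to a weight of 1.0 in `train_ppo`, which would dominate the sampling, so custom `task_levels` are probably best limited to these five names.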