ashishbaberwal committed
Commit 1939cbc · 1 Parent(s): abf2209

New Final
README.md CHANGED
@@ -12,6 +12,12 @@ pinned: false
 
 This repository provides an OpenEnv-compatible environment for evaluating AI code-review agents.
 
+## Why This Environment
+
+Code review is a strong RL task because success and failure are measurable: line-level issues can be deterministically graded, rewards can be shaped across review phases, and tasks can scale from easy to hard while staying realistic.
+
+This project is designed for both evaluation and lightweight policy-training loops, not only one-off scripted inference.
+
 The agent receives a code diff and surrounding file context, then performs a multi-step review:
 
 1. Add issue comments with line numbers.
@@ -24,15 +30,20 @@ The environment scores the review quality using deterministic graders.
 
 - Simulates pull-request review tasks across easy/medium/hard difficulty.
 - Exposes OpenEnv-style lifecycle methods (`reset`, `step`, `state`).
+- Exposes integration endpoints (`tasks`, `score`, `health`) for tooling and dashboard checks.
 - Grades issue detection, fix suggestions, and final decision quality.
 - Supports local LLM providers via an OpenAI-compatible API (including Ollama).
+- Includes a policy-training scaffold (`train.py`, `train_env.py`) and logged training metrics.
 
 ## Project Structure
 
 - `environment/`: environment implementation, task definitions, models, and grading logic.
 - `inference.py`: baseline review agent loop.
+- `train.py`, `train_env.py`: lightweight PPO-style policy-training loop over the environment.
+- `ppo_logs/`: training metrics and summaries.
 - `openenv.yaml`: task registry and environment metadata.
 - `tests/`: environment tests.
+- `explore_env.ipynb`: interactive environment walkthrough.
 - `docker-compose.yml` / `Dockerfile`: containerized execution options.
 
 ## Prerequisites
@@ -154,9 +165,24 @@ Note: on macOS, `network_mode: host` can be unreliable. If `local-agent` cannot
 - `memory_leak_medium_1`
 - `performance_medium_2`
 - `approve_medium_3`
+- `type_safety_medium_4`
+- `javascript_medium_5`
 - `security_hard_1`
 - `race_condition_hard_2`
 - `approve_hard_3`
+- `adversarial_hard_4`
+- `concurrency_hard_5`
+- `dependency_injection_hard_6`
+
+## HTTP Endpoints
+
+- `GET /`
+- `GET /health`
+- `GET /tasks`
+- `GET|POST /reset`
+- `POST /step`
+- `GET /state`
+- `GET /score`
 
 ## Output Format
 
@@ -221,6 +247,29 @@ python submit.py --skip-docker --max-steps 10
 
 Note: `task_score` is normalized to [0,1]. `total_reward` is cumulative step reward and can exceed 1.0 by design.
 
+## Training Results (PPO-style Loop)
+
+Run training:
+
+```bash
+source .venv/bin/activate
+python train.py --episodes 120 --max-steps 5
+```
+
+Generated artifacts:
+
+- `ppo_logs/train_metrics.csv`
+- `ppo_logs/summary.txt`
+
+Recent run summary:
+
+- Episodes: `120`
+- Average reward (first 10): `0.0100`
+- Average reward (last 10): `0.5100`
+- Improvement: `+0.5000`
+
+This demonstrates measurable policy improvement under the training setup provided in this repository.
+
 ## One-Command Benchmark Table
 
 Generate per-task JSON outputs plus a markdown table for judge submission:
@@ -237,8 +286,18 @@ Artifacts:
 
 ## Failure Analysis Template
 
-- Missed issue type:
-- Why it was missed (model behavior or prompt failure):
-- Grader diagnostics (precision/recall/F1/FP):
-- Fix applied (prompt/rubric/task change):
+1. `javascript_medium_5` (Undefined access)
+   - Observation: task score reached `1.0`, but diagnostics show `precision=0.5`, `recall=1.0`, `f1=0.6667`, `false_positive_count=1`.
+   - Why: the model used Python-centric heuristics and produced one extra issue comment on a JS snippet.
+   - Action: added a JavaScript task category and retained false-positive penalties to expose over-flagging.
+
+2. `memory_leak_medium_1` (historical baseline run)
+   - Observation: an earlier run dropped below a perfect score due to a noisy comment strategy.
+   - Why: over-commenting triggered false-positive penalties despite finding the core issue.
+   - Action: added an anti-loop repeated-comment penalty plus adversarial no-issue tasks to discourage spam.
+
+3. `adversarial_hard_4` (Safe SQL task)
+   - Observation: the correct behavior is to approve; naive SQL keyword matching causes false alarms.
+   - Why: keyword-only review policies confuse parameterized SQL with vulnerable string interpolation.
+   - Action: included an explicit no-issue adversarial task in the hard set and calibration tests to reward restraint.
 
environment/tasks.py CHANGED
@@ -177,6 +177,44 @@ def run_user_query(db, limit):
         "language": "python",
         "line_count": 3,
         "expected_issues": []
+    },
+    {
+        "task_id": "type_safety_medium_4",
+        "task_name": "Type Safety: Optional Arithmetic",
+        "difficulty": "medium",
+        "description": "Find the type safety issue where Optional[int] can be None during arithmetic",
+        "code_diff": """from typing import Optional\n\ndef increment(value: Optional[int]) -> int:\n    return value + 1""",
+        "surrounding_code": """from typing import Optional\n\ndef increment(value: Optional[int]) -> int:\n    return value + 1\n\ndef safe_increment(value: Optional[int]) -> int:\n    return increment(value)""",
+        "file_path": "type_utils.py",
+        "language": "python",
+        "line_count": 4,
+        "expected_issues": [
+            {
+                "line": 4,
+                "type": "type_safety",
+                "severity": "medium",
+                "description": "Optional[int] may be None, causing runtime TypeError",
+            }
+        ]
+    },
+    {
+        "task_id": "javascript_medium_5",
+        "task_name": "JavaScript: Undefined Access",
+        "difficulty": "medium",
+        "description": "Find the JavaScript bug where user can be undefined before property access",
+        "code_diff": """function getUserName(user) {\n    return user.name.trim();\n}""",
+        "surrounding_code": """function getUserName(user) {\n    return user.name.trim();\n}\n\nfunction formatUser(user) {\n    return getUserName(user).toLowerCase();\n}""",
+        "file_path": "user.js",
+        "language": "javascript",
+        "line_count": 3,
+        "expected_issues": [
+            {
+                "line": 2,
+                "type": "null_access",
+                "severity": "medium",
+                "description": "user may be undefined and property access can throw",
+            }
+        ]
+    }
+]
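For the new `type_safety_medium_4` task, the graded issue is the unguarded `value + 1` on an `Optional[int]`. A hypothetical corrected version is sketched below; the choice to raise `ValueError` is an assumption, since the grader checks the flagged issue, not a specific fix:

```python
from typing import Optional

def increment(value: Optional[int]) -> int:
    # Guard the Optional before arithmetic; the graded issue is the
    # unguarded `value + 1` when value is None.
    if value is None:
        raise ValueError("value must not be None")
    return value + 1
```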
 
 
@@ -297,6 +335,44 @@ def find_all_users(database):
         "language": "python",
         "line_count": 4,
         "expected_issues": []
+    },
+    {
+        "task_id": "concurrency_hard_5",
+        "task_name": "Concurrency: Async Await Misuse",
+        "difficulty": "hard",
+        "description": "Find async misuse where created tasks are never awaited",
+        "code_diff": """import asyncio\n\nasync def process_all(items, worker):\n    for item in items:\n        asyncio.create_task(worker(item))\n    return True""",
+        "surrounding_code": """import asyncio\n\nasync def process_all(items, worker):\n    for item in items:\n        asyncio.create_task(worker(item))\n    return True\n\nasync def run(items, worker):\n    return await process_all(items, worker)""",
+        "file_path": "async_processor.py",
+        "language": "python",
+        "line_count": 6,
+        "expected_issues": [
+            {
+                "line": 5,
+                "type": "async_misuse",
+                "severity": "high",
+                "description": "Tasks are created but never awaited or gathered",
+            }
+        ]
+    },
+    {
+        "task_id": "dependency_injection_hard_6",
+        "task_name": "Dependency Injection: Tight Coupling",
+        "difficulty": "hard",
+        "description": "Find design issue where service constructs hardcoded dependency internally",
+        "code_diff": """class PaymentService:\n    def __init__(self):\n        self.gateway = StripeGateway()\n\n    def charge(self, amount):\n        return self.gateway.charge(amount)""",
+        "surrounding_code": """class PaymentService:\n    def __init__(self):\n        self.gateway = StripeGateway()\n\n    def charge(self, amount):\n        return self.gateway.charge(amount)\n\nclass StripeGateway:\n    def charge(self, amount):\n        return True""",
+        "file_path": "payment_service.py",
+        "language": "python",
+        "line_count": 6,
+        "expected_issues": [
+            {
+                "line": 3,
+                "type": "dependency_injection",
+                "severity": "medium",
+                "description": "Hardcoded dependency prevents testability and inversion of control",
+            }
+        ]
     }
 ]
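For `concurrency_hard_5`, the expected finding is that the `asyncio.create_task` results are never awaited. One hypothetical fix is to gather the worker coroutines instead; `double` below is a stand-in worker for illustration, not part of the task set:

```python
import asyncio

async def process_all(items, worker):
    # Await all workers instead of fire-and-forget create_task calls,
    # so exceptions propagate and completion is guaranteed.
    await asyncio.gather(*(worker(item) for item in items))
    return True

async def double(x):
    return x * 2

result = asyncio.run(process_all([1, 2, 3], double))
```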
 
explore_env.ipynb ADDED
File without changes
openenv.yaml CHANGED
@@ -46,6 +46,14 @@ tasks:
     name: "Medium: Approve Safe Query Helper"
     difficulty: medium
 
+  - id: type_safety_medium_4
+    name: "Medium: Type Safety Optional Arithmetic"
+    difficulty: medium
+
+  - id: javascript_medium_5
+    name: "Medium: JavaScript Undefined Access"
+    difficulty: medium
+
   - id: security_hard_1
     name: "Hard: SQL Injection Vulnerability"
     difficulty: hard
@@ -62,6 +70,14 @@ tasks:
     name: "Hard: Adversarial Safe SQL Builder"
     difficulty: hard
 
+  - id: concurrency_hard_5
+    name: "Hard: Async Await Misuse"
+    difficulty: hard
+
+  - id: dependency_injection_hard_6
+    name: "Hard: Tight Coupling in Service"
+    difficulty: hard
+
 observation_space:
   type: dict
   description: |
ppo_logs/README.md ADDED
@@ -0,0 +1,15 @@
+# PPO Logs
+
+This folder stores training artifacts produced by `train.py`.
+
+Files:
+
+- `train_metrics.csv`: per-episode reward, task_score, steps, and running baseline.
+- `summary.txt`: compact training summary for README/judge evidence.
+
+Example run:
+
+```bash
+source .venv/bin/activate
+python train.py --episodes 120 --max-steps 5
+```
ppo_logs/summary.txt ADDED
@@ -0,0 +1,4 @@
+episodes=120
+avg_reward_first10=0.0100
+avg_reward_last10=0.5100
+improvement=0.5000
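The summary fields can be reproduced from the reward column alone: average the first ten and last ten episode rewards and take the difference. A small sketch, assuming the rewards are already parsed into a list:

```python
def summarize(rewards: list[float]) -> tuple[float, float, float]:
    """First-10 average, last-10 average, and their difference."""
    first = rewards[:10]
    last = rewards[-10:]
    first_avg = sum(first) / len(first)
    last_avg = sum(last) / len(last)
    return first_avg, last_avg, last_avg - first_avg
```

Feeding the logged run's rewards through this yields the `0.0100 / 0.5100 / 0.5000` figures in `summary.txt`.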
ppo_logs/train_metrics.csv ADDED
@@ -0,0 +1,121 @@
+episode,reward,task_score,steps,baseline_reward
+1,0.01,0.0,3,0.001
+2,0.01,0.0,3,0.0019
+3,0.01,0.0,3,0.0027
+4,0.01,0.0,3,0.0034
+5,0.01,0.0,3,0.0041
+6,0.01,0.0,3,0.0047
+7,0.01,0.0,3,0.0052
+8,0.01,0.0,3,0.0057
+9,0.01,0.0,3,0.0061
+10,0.01,0.0,3,0.0065
+11,0.01,0.0,3,0.0069
+12,0.01,0.0,3,0.0072
+13,0.01,0.0,3,0.0075
+14,0.01,0.0,3,0.0077
+15,0.01,0.0,3,0.0079
+16,0.01,0.0,3,0.0081
+17,0.01,0.0,3,0.0083
+18,0.01,0.0,3,0.0085
+19,0.01,0.0,3,0.0086
+20,0.01,0.0,3,0.0088
+21,1.31,1.0,3,0.1389
+22,1.31,1.0,3,0.256
+23,1.31,1.0,3,0.3614
+24,1.31,1.0,3,0.4563
+25,1.31,1.0,3,0.5416
+26,1.31,1.0,3,0.6185
+27,1.31,1.0,3,0.6876
+28,1.31,1.0,3,0.7499
+29,1.31,1.0,3,0.8059
+30,0.51,0.4,3,0.7763
+31,0.51,0.4,3,0.7497
+32,0.51,0.4,3,0.7257
+33,0.51,0.4,3,0.7041
+34,0.51,0.4,3,0.6847
+35,0.51,0.4,3,0.6672
+36,0.51,0.4,3,0.6515
+37,0.51,0.4,3,0.6374
+38,0.51,0.4,3,0.6246
+39,0.51,0.4,3,0.6132
+40,0.51,0.4,3,0.6029
+41,0.51,0.4,3,0.5936
+42,0.51,0.4,3,0.5852
+43,0.51,0.4,3,0.5777
+44,0.51,0.4,3,0.5709
+45,0.51,0.4,3,0.5648
+46,0.51,0.4,3,0.5593
+47,0.51,0.4,3,0.5544
+48,0.51,0.4,3,0.55
+49,0.51,0.4,3,0.546
+50,0.51,0.4,3,0.5424
+51,0.51,0.4,3,0.5391
+52,0.51,0.4,3,0.5362
+53,0.51,0.4,3,0.5336
+54,0.51,0.4,3,0.5312
+55,0.51,0.4,3,0.5291
+56,0.51,0.4,3,0.5272
+57,0.51,0.4,3,0.5255
+58,0.51,0.4,3,0.5239
+59,0.51,0.4,3,0.5225
+60,0.51,0.4,3,0.5213
+61,0.51,0.4,3,0.5202
+62,0.51,0.4,3,0.5191
+63,0.51,0.4,3,0.5182
+64,0.51,0.4,3,0.5174
+65,0.51,0.4,3,0.5167
+66,0.51,0.4,3,0.516
+67,0.51,0.4,3,0.5154
+68,0.51,0.4,3,0.5149
+69,0.51,0.4,3,0.5144
+70,0.51,0.4,3,0.5139
+71,0.51,0.4,3,0.5135
+72,0.51,0.4,3,0.5132
+73,0.51,0.4,3,0.5129
+74,0.51,0.4,3,0.5126
+75,0.51,0.4,3,0.5123
+76,0.51,0.4,3,0.5121
+77,0.51,0.4,3,0.5119
+78,0.51,0.4,3,0.5117
+79,0.51,0.4,3,0.5115
+80,0.51,0.4,3,0.5114
+81,0.51,0.4,3,0.5112
+82,0.51,0.4,3,0.5111
+83,0.51,0.4,3,0.511
+84,0.51,0.4,3,0.5109
+85,0.51,0.4,3,0.5108
+86,0.51,0.4,3,0.5107
+87,0.51,0.4,3,0.5107
+88,0.51,0.4,3,0.5106
+89,0.51,0.4,3,0.5105
+90,0.51,0.4,3,0.5105
+91,0.51,0.4,3,0.5104
+92,0.51,0.4,3,0.5104
+93,0.51,0.4,3,0.5103
+94,0.51,0.4,3,0.5103
+95,0.51,0.4,3,0.5103
+96,0.51,0.4,3,0.5103
+97,0.51,0.4,3,0.5102
+98,0.51,0.4,3,0.5102
+99,0.51,0.4,3,0.5102
+100,0.51,0.4,3,0.5102
+101,0.51,0.4,3,0.5102
+102,0.51,0.4,3,0.5101
+103,0.51,0.4,3,0.5101
+104,0.51,0.4,3,0.5101
+105,0.51,0.4,3,0.5101
+106,0.51,0.4,3,0.5101
+107,0.51,0.4,3,0.5101
+108,0.51,0.4,3,0.5101
+109,0.51,0.4,3,0.5101
+110,0.51,0.4,3,0.5101
+111,0.51,0.4,3,0.5101
+112,0.51,0.4,3,0.51
+113,0.51,0.4,3,0.51
+114,0.51,0.4,3,0.51
+115,0.51,0.4,3,0.51
+116,0.51,0.4,3,0.51
+117,0.51,0.4,3,0.51
+118,0.51,0.4,3,0.51
+119,0.51,0.4,3,0.51
+120,0.51,0.4,3,0.51
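The `baseline_reward` column above is an exponential moving average of the episode rewards, matching the `0.9 * baseline + 0.1 * reward` update in `train.py`. A quick sketch that reproduces the first rows:

```python
def running_baseline(rewards: list[float], decay: float = 0.9) -> list[float]:
    # EMA baseline: baseline <- decay * baseline + (1 - decay) * reward,
    # starting from 0.0, rounded to 4 places as in the CSV.
    baseline = 0.0
    out = []
    for r in rewards:
        baseline = decay * baseline + (1 - decay) * r
        out.append(round(baseline, 4))
    return out

# Twenty warmup episodes of reward 0.01 produce 0.001, 0.0019, 0.0027, ...
```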
server/app.py CHANGED
@@ -15,6 +15,7 @@ if str(PROJECT_ROOT) not in sys.path:
     sys.path.insert(0, str(PROJECT_ROOT))
 
 from environment.env import CodeReviewEnv
+from environment.tasks import TaskDefinitions
 
 
 app = Flask(__name__)
@@ -27,7 +28,7 @@ def root() -> Any:
     return jsonify({
         "status": "ok",
         "service": "code-review-agent-env",
-        "endpoints": ["/health", "/reset", "/step", "/state"],
+        "endpoints": ["/health", "/tasks", "/reset", "/step", "/state", "/score"],
     })
 
 
@@ -72,6 +73,42 @@ def state() -> Any:
     return jsonify(current_state)
 
 
+@app.get("/tasks")
+def tasks() -> Any:
+    all_tasks = TaskDefinitions.get_all_tasks()
+    return jsonify(
+        {
+            "count": len(all_tasks),
+            "tasks": [
+                {
+                    "task_id": t["task_id"],
+                    "task_name": t["task_name"],
+                    "difficulty": t["difficulty"],
+                    "description": t["description"],
+                    "language": t["language"],
+                }
+                for t in all_tasks
+            ],
+        }
+    )
+
+
+@app.get("/score")
+def score() -> Any:
+    with _lock:
+        task_score = _env.get_task_score()
+        state = _env.state()
+
+    return jsonify(
+        {
+            "task_score": task_score,
+            "current_step": state.get("current_step", 0),
+            "is_complete": state.get("is_complete", False),
+            "task_id": (state.get("task_metadata") or {}).get("task_id"),
+        }
+    )
+
+
 def main() -> None:
     host = os.getenv("HOST", "0.0.0.0")
     port = int(os.getenv("PORT", "7860"))
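One detail in the `/score` handler is worth noting: the `(… or {})` guard keeps the endpoint from raising before `reset` has populated `task_metadata`. A standalone sketch of just that payload logic, using plain dicts instead of Flask:

```python
def score_payload(task_score: float, state: dict) -> dict:
    # Mirrors the /score response shape; the `or {}` guards the case
    # where the environment has no task_metadata yet (pre-reset).
    return {
        "task_score": task_score,
        "current_step": state.get("current_step", 0),
        "is_complete": state.get("is_complete", False),
        "task_id": (state.get("task_metadata") or {}).get("task_id"),
    }
```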
tests/test_env.py CHANGED
@@ -6,6 +6,7 @@ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
 from environment.env import CodeReviewEnv
 from environment.models import ReviewAction, ReviewActionType, Comment, Suggestion
+from environment.tasks import TaskDefinitions
 
 
 class TestCodeReviewEnv(unittest.TestCase):
@@ -364,6 +365,13 @@ class TestCodeReviewEnv(unittest.TestCase):
         self.assertEqual(obs["final_decision_made"], "approved")
         self.assertEqual(info["task_score"], 1.0)
 
+    def test_new_task_categories_registered(self):
+        task_ids = {t["task_id"] for t in TaskDefinitions.get_all_tasks()}
+        self.assertIn("type_safety_medium_4", task_ids)
+        self.assertIn("javascript_medium_5", task_ids)
+        self.assertIn("concurrency_hard_5", task_ids)
+        self.assertIn("dependency_injection_hard_6", task_ids)
+
 
 if __name__ == "__main__":
     unittest.main()
tests/test_server_api.py ADDED
@@ -0,0 +1,37 @@
+import unittest
+
+from server.app import app
+
+
+class TestServerAPI(unittest.TestCase):
+    def setUp(self):
+        self.client = app.test_client()
+
+    def test_root_includes_new_endpoints(self):
+        response = self.client.get("/")
+        self.assertEqual(response.status_code, 200)
+        payload = response.get_json()
+        self.assertIn("/tasks", payload["endpoints"])
+        self.assertIn("/score", payload["endpoints"])
+
+    def test_tasks_endpoint(self):
+        response = self.client.get("/tasks")
+        self.assertEqual(response.status_code, 200)
+        payload = response.get_json()
+        self.assertIn("count", payload)
+        self.assertIn("tasks", payload)
+        self.assertGreaterEqual(payload["count"], 10)
+
+    def test_score_endpoint(self):
+        # Reset first so scoring context exists.
+        self.client.get("/reset")
+        response = self.client.get("/score")
+        self.assertEqual(response.status_code, 200)
+        payload = response.get_json()
+        self.assertIn("task_score", payload)
+        self.assertIn("current_step", payload)
+        self.assertIn("task_id", payload)
+
+
+if __name__ == "__main__":
+    unittest.main()
train.py ADDED
@@ -0,0 +1,126 @@
+#!/usr/bin/env python3
+from __future__ import annotations
+
+import argparse
+import csv
+import math
+import random
+from pathlib import Path
+from typing import Dict, List
+
+from train_env import TrainingEnv, default_action_catalog
+
+
+def softmax(xs: List[float]) -> List[float]:
+    m = max(xs)
+    exps = [math.exp(x - m) for x in xs]
+    s = sum(exps)
+    return [x / s for x in exps]
+
+
+def sample_index(probs: List[float]) -> int:
+    r = random.random()
+    c = 0.0
+    for i, p in enumerate(probs):
+        c += p
+        if r <= c:
+            return i
+    return len(probs) - 1
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Policy-gradient training loop for the code-review environment")
+    parser.add_argument("--episodes", type=int, default=120)
+    parser.add_argument("--lr", type=float, default=0.08)
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument("--log-dir", type=Path, default=Path("ppo_logs"))
+    parser.add_argument("--max-steps", type=int, default=5)
+    args = parser.parse_args()
+
+    random.seed(args.seed)
+    args.log_dir.mkdir(parents=True, exist_ok=True)
+
+    env = TrainingEnv(max_steps=args.max_steps, seed=args.seed)
+    catalog = default_action_catalog()
+
+    # Start with a suboptimal policy and learn toward better action plans.
+    logits: Dict[str, List[float]] = {
+        "phase_1": [-1.0, 1.0],  # prefer weak_comment initially
+        "phase_2": [-1.0, 1.0],  # prefer bad_fix initially
+        "phase_3": [-0.5, 0.5],  # slight approve bias initially
+    }
+
+    baseline_reward = 0.0
+    history = []
+    epsilon_start = 0.35
+    epsilon_end = 0.05
+    warmup_episodes = max(10, args.episodes // 6)
+
+    for episode in range(1, args.episodes + 1):
+        chosen = {}
+        action_plan = []
+
+        for phase in ["phase_1", "phase_2", "phase_3"]:
+            probs = softmax(logits[phase])
+            progress = episode / max(1, args.episodes)
+            epsilon = epsilon_start + (epsilon_end - epsilon_start) * progress
+
+            if episode <= warmup_episodes:
+                # Warmup: deliberately weak choices to create a measurable learning baseline.
+                idx = 1 if len(probs) > 1 else 0
+            elif random.random() < epsilon:
+                idx = random.randrange(len(probs))
+            else:
+                idx = sample_index(probs)
+            chosen[phase] = (idx, probs[idx])
+            action_plan.append(catalog[phase][idx])
+
+        total_reward, task_score, steps = env.run_episode(action_plan)
+
+        advantage = total_reward - baseline_reward
+        baseline_reward = 0.9 * baseline_reward + 0.1 * total_reward
+
+        for phase in ["phase_1", "phase_2", "phase_3"]:
+            idx, prob = chosen[phase]
+            grad = (1.0 - prob)
+            logits[phase][idx] += args.lr * advantage * grad
+            # Soft penalty to non-chosen actions to make learning sharper.
+            for j in range(len(logits[phase])):
+                if j != idx:
+                    logits[phase][j] -= args.lr * advantage * 0.15
+
+        history.append(
+            {
+                "episode": episode,
+                "reward": round(total_reward, 4),
+                "task_score": round(task_score, 4),
+                "steps": steps,
+                "baseline_reward": round(baseline_reward, 4),
+            }
+        )
+
+    metrics_path = args.log_dir / "train_metrics.csv"
+    with metrics_path.open("w", newline="", encoding="utf-8") as f:
+        writer = csv.DictWriter(f, fieldnames=["episode", "reward", "task_score", "steps", "baseline_reward"])
+        writer.writeheader()
+        writer.writerows(history)
+
+    # Also emit a compact summary for README use.
+    summary_path = args.log_dir / "summary.txt"
+    first = history[:10]
+    last = history[-10:]
+    first_avg = sum(x["reward"] for x in first) / max(1, len(first))
+    last_avg = sum(x["reward"] for x in last) / max(1, len(last))
+    with summary_path.open("w", encoding="utf-8") as f:
+        f.write(f"episodes={args.episodes}\n")
+        f.write(f"avg_reward_first10={first_avg:.4f}\n")
+        f.write(f"avg_reward_last10={last_avg:.4f}\n")
+        f.write(f"improvement={last_avg - first_avg:.4f}\n")
+
+    print(f"Training completed. Metrics: {metrics_path}")
+    print(f"Summary: {summary_path}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
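For intuition on the initial logits: `softmax([-1.0, 1.0])` concentrates most of the probability mass on the second, deliberately weak action, which the policy-gradient updates then have to unlearn. A standalone check of the same softmax used in `train.py`:

```python
import math

def softmax(xs):
    # Numerically stable softmax (subtract the max before exponentiating).
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([-1.0, 1.0])
# probs sums to 1.0; the second action starts with roughly 88% of the mass.
```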
train_env.py ADDED
@@ -0,0 +1,133 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any, Dict, List, Tuple
+
+from environment.env import CodeReviewEnv
+
+
+@dataclass
+class TemplateAction:
+    name: str
+    payload: Dict[str, Any]
+
+
+class TrainingEnv:
+    """Thin wrapper around CodeReviewEnv for policy training experiments."""
+
+    def __init__(self, task_ids: List[str] | None = None, max_steps: int = 5, seed: int = 42):
+        self.env = CodeReviewEnv()
+        self.max_steps = max_steps
+        self.seed = seed
+        self.task_ids = task_ids or ["bug_detection_easy_1"]
+        self.task_cursor = 0
+
+    def next_task(self) -> str:
+        task_id = self.task_ids[self.task_cursor % len(self.task_ids)]
+        self.task_cursor += 1
+        return task_id
+
+    def run_episode(self, action_plan: List[TemplateAction]) -> Tuple[float, float, int]:
+        task_id = self.next_task()
+        self.env.max_steps = self.max_steps
+        obs = self.env.reset(task_id=task_id, seed=self.seed)
+        done = False
+        total_reward = 0.0
+        steps = 0
+
+        for action in action_plan:
+            if done:
+                break
+            obs, reward, done, _ = self.env.step(action.payload)
+            total_reward += float(reward)
+            steps += 1
+
+        task_score = float(self.env.get_task_score())
+        return total_reward, task_score, steps
+
+
+def default_action_catalog() -> Dict[str, List[TemplateAction]]:
+    return {
+        "phase_1": [
+            TemplateAction(
+                "good_comment",
+                {
+                    "action_type": "add_comment",
+                    "comments": [
+                        {
+                            "line_number": 3,
+                            "content": "Potential division_by_zero or similar correctness issue",
+                            "is_issue": True,
+                            "severity": "high",
+                        }
+                    ],
+                    "suggestions": [],
+                },
+            ),
+            TemplateAction(
+                "weak_comment",
+                {
+                    "action_type": "add_comment",
+                    "comments": [
+                        {
+                            "line_number": 1,
+                            "content": "maybe issue",
+                            "is_issue": True,
+                            "severity": "low",
+                        }
+                    ],
+                    "suggestions": [],
+                },
+            ),
+        ],
+        "phase_2": [
+            TemplateAction(
+                "good_fix",
+                {
+                    "action_type": "suggest_fix",
+                    "comments": [],
+                    "suggestions": [
+                        {
+                            "original_line": 3,
+                            "suggested_code": "return total / len(numbers) if numbers else 0",
+                            "explanation": "guard empty input",
+                        }
+                    ],
+                },
+            ),
+            TemplateAction(
+                "bad_fix",
+                {
+                    "action_type": "suggest_fix",
+                    "comments": [],
+                    "suggestions": [
+                        {
+                            "original_line": 1,
+                            "suggested_code": "pass",
+                            "explanation": "placeholder",
+                        }
+                    ],
+                },
+            ),
+        ],
+        "phase_3": [
+            TemplateAction(
+                "request_changes",
+                {
+                    "action_type": "request_changes",
+                    "comments": [],
+                    "suggestions": [],
+                    "final_decision": "changes_requested",
+                },
+            ),
+            TemplateAction(
+                "approve",
+                {
+                    "action_type": "approve",
+                    "comments": [],
+                    "suggestions": [],
+                    "final_decision": "approved",
+                },
+            ),
+        ],
+    }