h1manshu committed
Commit 0fb8bd2 · verified · 1 Parent(s): a158b53

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +168 -149
  2. client.py +7 -9
  3. dataset/dataset.json +6 -9
  4. inference.py +16 -8
  5. models.py +7 -3
  6. server/code_review_environment.py +64 -30
README.md CHANGED
@@ -13,43 +13,180 @@ tags:
13
 
14
  # Code Review Environment
15
 
16
- A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
17
 
18
  ## Quick Start
19
 
20
- The simplest way to use the Code Review environment is through the `CodeReviewEnv` class:
21
-
22
- ```python
23
- from code_review import CodeReviewAction, CodeReviewEnv
24
-
25
- try:
26
- # Create environment from Docker image
27
- code_reviewenv = CodeReviewEnv.from_docker_image("code_review-env:latest")
28
-
29
- # Reset
30
- result = code_reviewenv.reset()
31
- print(f"Reset: {result.observation.echoed_message}")
32
-
33
- # Send multiple messages
34
- messages = ["Hello, World!", "Testing echo", "Final message"]
35
-
36
- for msg in messages:
37
- result = code_reviewenv.step(CodeReviewAction(message=msg))
38
- print(f"Sent: '{msg}'")
39
- print(f" → Echoed: '{result.observation.echoed_message}'")
40
- print(f" → Length: {result.observation.message_length}")
41
- print(f" → Reward: {result.reward}")
42
 
43
- finally:
44
- # Always clean up
45
- code_reviewenv.close()
46
  ```
47
 
48
- That's it! The `CodeReviewEnv.from_docker_image()` method handles:
49
- - Starting the Docker container
50
- - Waiting for the server to be ready
51
- - Connecting to the environment
52
- - Container cleanup when you call `close()`
53
 
54
  ## Building the Docker Image
55
 
@@ -116,124 +253,6 @@ The deployed space includes:
116
  - **Health Check** at `/health` - Container health monitoring
117
  - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
118
 
119
- ## Environment Details
120
-
121
- ### Action
122
- **CodeReviewAction**: Contains a single field
123
- - `message` (str) - The message to echo back
124
-
125
- ### Observation
126
- **CodeReviewObservation**: Contains the echo response and metadata
127
- - `echoed_message` (str) - The message echoed back
128
- - `message_length` (int) - Length of the message
129
- - `reward` (float) - Reward based on message length (length × 0.1)
130
- - `done` (bool) - Always False for echo environment
131
- - `metadata` (dict) - Additional info like step count
132
-
133
- ### Reward
134
- The reward is calculated as: `message_length × 0.1`
135
- - "Hi" → reward: 0.2
136
- - "Hello, World!" → reward: 1.3
137
- - Empty message → reward: 0.0
138
-
139
- ## Advanced Usage
140
-
141
- ### Connecting to an Existing Server
142
-
143
- If you already have a Code Review environment server running, you can connect directly:
144
-
145
- ```python
146
- from code_review import CodeReviewEnv
147
-
148
- # Connect to existing server
149
- code_reviewenv = CodeReviewEnv(base_url="<ENV_HTTP_URL_HERE>")
150
-
151
- # Use as normal
152
- result = code_reviewenv.reset()
153
- result = code_reviewenv.step(CodeReviewAction(message="Hello!"))
154
- ```
155
-
156
- Note: When connecting to an existing server, `code_reviewenv.close()` will NOT stop the server.
157
-
158
- ### Using the Context Manager
159
-
160
- The client supports context manager usage for automatic connection management:
161
-
162
- ```python
163
- from code_review import CodeReviewAction, CodeReviewEnv
164
-
165
- # Connect with context manager (auto-connects and closes)
166
- with CodeReviewEnv(base_url="http://localhost:8000") as env:
167
- result = env.reset()
168
- print(f"Reset: {result.observation.echoed_message}")
169
- # Multiple steps with low latency
170
- for msg in ["Hello", "World", "!"]:
171
- result = env.step(CodeReviewAction(message=msg))
172
- print(f"Echoed: {result.observation.echoed_message}")
173
- ```
174
-
175
- The client uses WebSocket connections for:
176
- - **Lower latency**: No HTTP connection overhead per request
177
- - **Persistent session**: Server maintains your environment state
178
- - **Efficient for episodes**: Better for many sequential steps
179
-
180
- ### Concurrent WebSocket Sessions
181
-
182
- The server supports multiple concurrent WebSocket connections. To enable this,
183
- modify `server/app.py` to use factory mode:
184
-
185
- ```python
186
- # In server/app.py - use factory mode for concurrent sessions
187
- app = create_app(
188
- CodeReviewEnvironment, # Pass class, not instance
189
- CodeReviewAction,
190
- CodeReviewObservation,
191
- max_concurrent_envs=4, # Allow 4 concurrent sessions
192
- )
193
- ```
194
-
195
- Then multiple clients can connect simultaneously:
196
-
197
- ```python
198
- from code_review import CodeReviewAction, CodeReviewEnv
199
- from concurrent.futures import ThreadPoolExecutor
200
-
201
- def run_episode(client_id: int):
202
- with CodeReviewEnv(base_url="http://localhost:8000") as env:
203
- result = env.reset()
204
- for i in range(10):
205
- result = env.step(CodeReviewAction(message=f"Client {client_id}, step {i}"))
206
- return client_id, result.observation.message_length
207
-
208
- # Run 4 episodes concurrently
209
- with ThreadPoolExecutor(max_workers=4) as executor:
210
- results = list(executor.map(run_episode, range(4)))
211
- ```
212
-
213
- ## Development & Testing
214
-
215
- ### Direct Environment Testing
216
-
217
- Test the environment logic directly without starting the HTTP server:
218
-
219
- ```bash
220
- # From the server directory
221
- python3 server/code_review_environment.py
222
- ```
223
-
224
- This verifies that:
225
- - Environment resets correctly
226
- - Step executes actions properly
227
- - State tracking works
228
- - Rewards are calculated correctly
229
-
230
- ### Running Locally
231
-
232
- Run the server locally for development:
233
-
234
- ```bash
235
- uvicorn server.app:app --reload
236
- ```
237
 
238
  ## Project Structure
239
 
 
13
 
14
  # Code Review Environment
15
 
16
+ A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks.
17
+
18
+ ## Motivation
19
+
20
+ Code review is a high-stakes, multi-step reasoning task that requires an agent to:
21
+
22
+ - **Detect bugs and security vulnerabilities** from raw code diffs
23
+ - **Generate corrective code** that resolves identified issues
24
+ - **Make a final judgment** (approve/reject) backed by technical reasoning
25
+
26
+ Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in software development pipelines.
27
+
28
+ ## Environment Description
29
+
30
+ The agent receives a pull request observation at each step and must respond with a structured JSON action. The episode runs for up to `MAX_STEPS = 3` steps, following a prescribed workflow:
31
+
32
+ | Step | Expected Action | Purpose |
33
+ |------|----------------|---------|
34
+ | 1 | `comment` | Identify all issues in the diff |
35
+ | 2 | `suggest_fix` | Provide corrected code |
36
+ | 3 | `final_decision` | Approve or reject the PR |
37
+
38
+ Each step is independently scored, and the final episode score is the maximum score achieved across all steps.
39
+
40
+ ## Action Space
41
+
42
+ Actions are instances of `CodeReviewAction` and must be returned as JSON with the following fields:
43
+
44
+ ```json
45
+ {
46
+ "action_type": "comment | suggest_fix | final_decision",
47
+ "comment": "Detailed description of identified issues (>30 characters)",
48
+ "suggested_code": "Corrected code snippet, or null if not applicable",
49
+ "decision": "approve | reject | null"
50
+ }
51
+ ```
52
+
53
+ | Field | Type | Required | Description |
54
+ |-------|------|----------|-------------|
55
+ | `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
56
+ | `comment` | `str` | Recommended | Technical description of issues found |
57
+ | `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
58
+ | `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
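The field table above can be exercised with a small, self-contained check. This is a minimal sketch, not part of the repository: `validate_action` and the sample `step_two` payload are hypothetical names introduced for illustration, and the rules it enforces are only those stated in the table.

```python
import json

ALLOWED_ACTION_TYPES = {"comment", "suggest_fix", "final_decision"}
ALLOWED_DECISIONS = {"approve", "reject", None}


def validate_action(action: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty means well-formed)."""
    problems = []
    if action.get("action_type") not in ALLOWED_ACTION_TYPES:
        problems.append("action_type must be comment, suggest_fix, or final_decision")
    if action.get("decision") not in ALLOWED_DECISIONS:
        problems.append("decision must be approve, reject, or null")
    if action.get("action_type") == "suggest_fix" and not action.get("suggested_code"):
        problems.append("suggest_fix actions should carry suggested_code")
    if action.get("action_type") == "final_decision" and action.get("decision") is None:
        problems.append("final_decision actions should carry a decision")
    return problems


# A step-2 action for the division task described in this README.
step_two = {
    "action_type": "suggest_fix",
    "comment": "divide() raises ZeroDivisionError when b is 0; add a guard before dividing.",
    "suggested_code": "def divide(a, b):\n    if b == 0:\n        return None\n    return a / b",
    "decision": None,
}

# json.dumps escapes the newlines in suggested_code, so the payload is one line.
payload = json.dumps(step_two)
print(validate_action(json.loads(payload)))
```

Serializing with `json.dumps` rather than hand-writing the JSON keeps multi-line code in `suggested_code` valid, since line breaks are emitted as escaped `\n` sequences.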
59
+
60
+ ## Observation Space
61
+
62
+ Each step returns a `CodeReviewObservation` with the following fields:
63
+
64
+ | Field | Type | Description |
65
+ |-------|------|-------------|
66
+ | `pr` | `CodeReviewPullRequest` | The pull request under review |
67
+ | `pr.id` | `str` | Unique PR identifier |
68
+ | `pr.title` | `str` | Short title of the PR |
69
+ | `pr.description` | `str` | Brief description of intent |
70
+ | `pr.language` | `str` | Programming language (e.g. `python`) |
71
+ | `pr.diffs` | `List[CodeDiff]` | List of file diffs |
72
+ | `pr.diffs[].file_name` | `str` | Name of the changed file |
73
+ | `pr.diffs[].diff` | `str` | The actual code change |
74
+ | `previous_comments` | `List[str]` | Comments made in prior steps |
75
+ | `step_count` | `int` | Current step number |
76
+ | `max_steps` | `int` | Maximum steps per episode (default: 3) |
77
+
78
+ ## Scoring
79
+
80
+ Each action is scored across three components:
81
+
82
+ | Component | Weight | Method |
83
+ |-----------|--------|--------|
84
+ | Issue Detection | 40% | Fraction of ground-truth issues mentioned in `comment` |
85
+ | Fix Quality | 30% | Token overlap + sequence similarity between `suggested_code` and ground-truth fix |
86
+ | Decision Accuracy | 30% | Exact match with ground-truth `approve`/`reject`; partial credit (0.2) for wrong decision |
87
+
88
+ **Bonuses and penalties applied per step:**
89
+
90
+ - `+0.1` — comment length > 30 characters (encourages detail)
91
+ - `+0.1` — correct final decision reached in step ≤ 2 (encourages efficiency)
92
+ - `-0.1` — no comment provided on a non-decision step (penalizes lazy steps)
93
+ - `-0.05` — step count exceeds 3 (penalizes long trajectories)
94
+
95
+ The final episode score is the **maximum** `grade_action` score across all steps in the episode. Scores are clamped to `[0.0, 1.0]`.
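The weighting and clamping described above can be sketched as follows. This is an illustration under stated assumptions, not the environment's `grade_action`: the function names are hypothetical, each component score is assumed to already lie in `[0.0, 1.0]`, and how the real grader treats components that do not apply to a given step is not shown here.

```python
def combine_step_score(issue_score: float, fix_score: float,
                       decision_score: float, bonus: float = 0.0) -> float:
    """Hypothetical combiner using the 40/30/30 weights from the table above."""
    weighted = 0.4 * issue_score + 0.3 * fix_score + 0.3 * decision_score
    # Bonuses and penalties are added before clamping to [0.0, 1.0].
    return min(max(weighted + bonus, 0.0), 1.0)


def episode_score(step_scores: list[float]) -> float:
    """The episode score is the best single-step score (0.0 if there were no steps)."""
    return max(step_scores, default=0.0)


# All issues found, a half-matching fix, a wrong decision (0.2 partial credit),
# plus the +0.1 bonus for a detailed comment.
step = combine_step_score(1.0, 0.5, 0.2, bonus=0.1)
print(round(step, 2), episode_score([0.55, 0.72, 0.85]))
```

Because the episode score takes the maximum rather than the sum, one strong step determines the result, and a weak later step cannot lower it.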
96
+
97
+ ## Task Descriptions
98
+
99
+ The dataset contains tasks at three difficulty levels:
100
+
101
+ ### Easy
102
+
103
+ Straightforward single-file issues with an obvious fix.
104
+
105
+ | PR | Issue | Expected Decision |
106
+ |----|-------|------------------|
107
+ | Missing import | `datetime` used without import | reject |
108
+
109
+ **What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import.
110
+
111
+ ---
112
+
113
+ ### Medium
114
+
115
+ Logical or performance issues requiring understanding of Python semantics.
116
+
117
+ | PR | Issue | Expected Decision |
118
+ |----|-------|------------------|
119
+ | Division function | No guard against division by zero | reject |
120
+ | Inefficient loop | `range(len(arr))` pattern; can use `in` operator | approve |
121
+
122
+ **What the agent must do:** For the division task, add an `if b == 0: return None` guard. For the loop task, recognize it as a style/efficiency issue but not a correctness bug — the correct decision is **approve**.
123
+
124
+ ---
125
+
126
+ ### Hard
127
+
128
+ Security vulnerabilities, injection attacks, and cross-file null-handling bugs.
129
+
130
+ | PR | Issue | Expected Decision |
131
+ |----|-------|------------------|
132
+ | Authentication logic | Hardcoded plaintext password `admin123` | reject |
133
+ | SQL query | String concatenation exposes SQL injection | reject |
134
+ | Cross-file null bug | `get_user(None)` called without input validation | reject |
135
+
136
+ **What the agent must do:**
137
+ - For auth: detect the hardcoded secret and propose `bcrypt`-based password comparison.
138
+ - For SQL: detect string concatenation and replace with a parameterized query (`%s` placeholder + `cursor.execute`).
139
+ - For null bug: validate `id is not None` before the `db[id]` lookup, and fix the call site in `controller.py`.
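The SQL task above contrasts string concatenation with a parameterized query. The sketch below shows the difference using the standard-library `sqlite3` module with an in-memory database. It is illustrative only: the table, the function names, and the sample input are hypothetical, and `sqlite3` uses `?` placeholders where the dataset's ground-truth fix uses `%s`.

```python
import sqlite3

# Hypothetical in-memory table standing in for the PR's users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])


def find_user_unsafe(user_id: str):
    # String concatenation: the input becomes part of the SQL text itself.
    query = "SELECT name FROM users WHERE id = " + user_id
    return conn.execute(query).fetchall()


def find_user_safe(user_id: str):
    # Parameterized query: the input is bound as a value, never parsed as SQL.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()


malicious = "1 OR 1=1"
unsafe_rows = find_user_unsafe(malicious)  # the injected condition matches every row
safe_rows = find_user_safe(malicious)      # treated as one literal value, matches nothing
print(len(unsafe_rows), len(safe_rows))
```

The placeholder style depends on the database driver, so a reviewer should check which one the project's driver expects rather than assume `%s` or `?`.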
140
 
141
  ## Quick Start
142
+
143
+ Start the environment server, then run inference:
144
+
145
+ ```bash
146
+ # Terminal 1 — download code_review repo
147
+ git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
148
 
149
+ # Terminal 1 install packages
150
+ uv pip install -e .
151
 
152
+ # Terminal 1 — Run server locally
153
+ uv run server --host 0.0.0.0 --port 8000
154
+
155
+ # Terminal 2 — run the agent
156
+ uv run python inference.py
157
+ ```
158
+
159
+ The agent runs `NUM_EPISODES = 4` episodes (configurable), each capped at `MAX_STEPS = 3` steps, and logs every step:
160
+
161
  ```
162
+ [START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
163
+ [STEP] step=1 action=... reward=0.55 done=false error=null
164
+ [STEP] step=2 action=... reward=0.72 done=false error=null
165
+ [STEP] step=3 action=... reward=0.85 done=true error=null
166
+ [END] success=true steps=12 score=0.850 rewards=0.55,0.72,0.85,0.60
167
+ ```
168
+
169
+ ## Configuration
170
+
171
+ Key constants in `inference.py`:
172
+
173
+ | Constant | Default | Description |
174
+ |----------|---------|-------------|
175
+ | `MAX_STEPS` | `3` | Steps per episode |
176
+ | `NUM_EPISODES` | `4` | Number of PRs to review |
177
+ | `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
178
+ | `MAX_TOKENS` | `256` | Max tokens per LLM response |
179
+ | `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum normalized score to count as success |
180
+
181
+ ### Score Interpretation
182
+
183
+ | Score Range | Interpretation |
184
+ |-------------|---------------|
185
+ | 0.00 – 0.20 | Failing — agent cannot follow the JSON schema or identify basic issues |
186
+ | 0.20 – 0.50 | Partial — agent detects some issues but misses security vulnerabilities or gives wrong decisions |
187
+ | 0.50 – 0.75 | Competent — agent handles easy and medium tasks; struggles with hard security/null cases |
188
+ | 0.75 – 1.00 | Strong — agent reliably detects all issue types, generates correct fixes, and makes sound decisions |
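The bands above share their boundary values (for example, 0.20 appears in two rows). A minimal sketch of one way to resolve that, assuming a shared boundary belongs to the higher band; `interpret_score` is a hypothetical helper, not a function from this repository.

```python
def interpret_score(score: float) -> str:
    """Map a normalized score to its band; a shared boundary goes to the higher band."""
    if score < 0.20:
        return "Failing"
    if score < 0.50:
        return "Partial"
    if score < 0.75:
        return "Competent"
    return "Strong"


for value in (0.10, 0.20, 0.60, 0.85):
    print(value, interpret_score(value))
```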
189
 
 
 
 
 
 
190
 
191
  ## Building the Docker Image
192
 
 
253
  - **Health Check** at `/health` - Container health monitoring
254
  - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
255
 
256
 
257
  ## Project Structure
258
 
client.py CHANGED
@@ -12,12 +12,15 @@ from openenv.core import EnvClient
12
  from openenv.core.client_types import StepResult
13
  from openenv.core.env_server.types import State
14
 
15
- from .models import CodeReviewAction, CodeReviewObservation, CodeReviewReward , CodeReviewPullRequest
 
 
 
 
 
16
 
17
 
18
- class CodeReviewEnv(
19
- EnvClient[CodeReviewAction, CodeReviewObservation, State]
20
- ):
21
  """
22
  Client for the Code Review Environment.
23
 
@@ -90,19 +93,14 @@ class CodeReviewEnv(
90
  """
91
  # print("Payload ====== ", payload)
92
 
93
-
94
  obs_data = payload.get("observation") or {}
95
 
96
  if "observation" in obs_data: # nested case
97
  obs_data = obs_data["observation"]
98
 
99
-
100
-
101
-
102
  if not obs_data or "pr" not in obs_data:
103
  raise ValueError(f"Invalid observation payload: {payload}")
104
 
105
-
106
  pr_data = obs_data["pr"]
107
 
108
  observation = CodeReviewObservation(
 
12
  from openenv.core.client_types import StepResult
13
  from openenv.core.env_server.types import State
14
 
15
+ from .models import (
16
+ CodeReviewAction,
17
+ CodeReviewObservation,
18
+ CodeReviewReward,
19
+ CodeReviewPullRequest,
20
+ )
21
 
22
 
23
+ class CodeReviewEnv(EnvClient[CodeReviewAction, CodeReviewObservation, State]):
 
 
24
  """
25
  Client for the Code Review Environment.
26
 
 
93
  """
94
  # print("Payload ====== ", payload)
95
 
 
96
  obs_data = payload.get("observation") or {}
97
 
98
  if "observation" in obs_data: # nested case
99
  obs_data = obs_data["observation"]
100
 
 
 
 
101
  if not obs_data or "pr" not in obs_data:
102
  raise ValueError(f"Invalid observation payload: {payload}")
103
 
 
104
  pr_data = obs_data["pr"]
105
 
106
  observation = CodeReviewObservation(
dataset/dataset.json CHANGED
@@ -1,5 +1,4 @@
1
  [
2
-
3
  {
4
  "task_type": "easy",
5
  "pr": {
@@ -17,7 +16,7 @@
17
  "ground_truth": {
18
  "issues": ["missing import datetime"],
19
  "decision": "reject",
20
- "fix": "from datetime import datetime"
21
  }
22
  },
23
  {
@@ -37,7 +36,7 @@
37
  "ground_truth": {
38
  "issues": ["division by zero"],
39
  "decision": "reject",
40
- "fix": "if b == 0: return None"
41
  }
42
  },
43
  {
@@ -57,10 +56,9 @@
57
  "ground_truth": {
58
  "issues": ["inefficient loop"],
59
  "decision": "approve",
60
- "fix": "use 'if target in arr'"
61
  }
62
  },
63
-
64
  {
65
  "task_type": "hard",
66
  "pr": {
@@ -78,7 +76,7 @@
78
  "ground_truth": {
79
  "issues": ["hardcoded password", "security vulnerability"],
80
  "decision": "reject",
81
- "fix": "use hashed password comparison"
82
  }
83
  },
84
  {
@@ -98,7 +96,7 @@
98
  "ground_truth": {
99
  "issues": ["sql injection"],
100
  "decision": "reject",
101
- "fix": "use parameterized queries"
102
  }
103
  },
104
  {
@@ -122,8 +120,7 @@
122
  "ground_truth": {
123
  "issues": ["invalid input", "null handling"],
124
  "decision": "reject",
125
- "fix": "validate id before calling get_user"
126
  }
127
  }
128
-
129
  ]
 
1
  [
 
2
  {
3
  "task_type": "easy",
4
  "pr": {
 
16
  "ground_truth": {
17
  "issues": ["missing import datetime"],
18
  "decision": "reject",
19
+ "fix": "from datetime import datetime\nprint(datetime.now())"
20
  }
21
  },
22
  {
 
36
  "ground_truth": {
37
  "issues": ["division by zero"],
38
  "decision": "reject",
39
+ "fix": "def divide(a, b):\n if b == 0:\n return None\n return a / b"
40
  }
41
  },
42
  {
 
56
  "ground_truth": {
57
  "issues": ["inefficient loop"],
58
  "decision": "approve",
59
+ "fix": "return target in arr"
60
  }
61
  },
 
62
  {
63
  "task_type": "hard",
64
  "pr": {
 
76
  "ground_truth": {
77
  "issues": ["hardcoded password", "security vulnerability"],
78
  "decision": "reject",
79
+ "fix": "import bcrypt\n\ndef login(password, hashed_password):\n return bcrypt.checkpw(password.encode(), hashed_password)"
80
  }
81
  },
82
  {
 
96
  "ground_truth": {
97
  "issues": ["sql injection"],
98
  "decision": "reject",
99
+ "fix": "query = \"SELECT * FROM users WHERE id = %s\"\ncursor.execute(query, (user_id,))"
100
  }
101
  },
102
  {
 
120
  "ground_truth": {
121
  "issues": ["invalid input", "null handling"],
122
  "decision": "reject",
123
+ "fix": "def get_user(id):\n if id is None:\n raise ValueError('id must not be None')\n return db[id]\n\nuser = get_user(user_id)"
124
  }
125
  }
 
126
  ]
inference.py CHANGED
@@ -26,15 +26,15 @@ import asyncio
26
  from code_review import CodeReviewAction, CodeReviewObservation
27
  from code_review.client import CodeReviewEnv
28
 
29
- API_BASE_URL = "https://router.huggingface.co/v1"
30
  API_KEY = os.getenv("HF_TOKEN")
31
- MODEL_NAME = os.getenv("MODEL_NAME")
32
  TASK_NAME = "code_review"
33
  BENCHMARK = "code_review_benchmark"
34
  MAX_STEPS = 3
35
  TEMPERATURE = 0.2
36
- MAX_TOKENS = 150
37
- NUM_EPISODES = 6
38
  _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
39
  MAX_TOTAL_REWARD = NUM_EPISODES * MAX_STEPS * _MAX_REWARD_PER_STEP
40
  SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
@@ -68,6 +68,8 @@ Rules:
68
  - Mention every issue explicitly
69
  - Use precise technical language
70
  - Write detailed comments (>30 characters)
 
 
71
 
72
  Return ONLY JSON:
73
 
@@ -193,7 +195,7 @@ def parse_action(text: str) -> Dict[str, Any]:
193
  text = text.strip().replace("```json", "").replace("```", "")
194
 
195
  try:
196
- return json.loads(text)
197
  except Exception as e:
198
  print(e)
199
  return fallback_action()
@@ -235,7 +237,9 @@ async def run_episode(client, env):
235
  reward = result.reward
236
  done = result.done
237
 
238
- log_step(step=step, action=response_text, reward=reward.score, done=done, error=None)
 
 
239
  final_score = max(final_score, reward.score if reward else 0.0)
240
 
241
  return final_score
@@ -249,7 +253,6 @@ async def main():
249
 
250
  async with CodeReviewEnv(base_url="http://localhost:8000") as env:
251
  for i in range(NUM_EPISODES):
252
- print(f"\n===== Episode {i+1} =====", flush=True)
253
  env.task_index = i
254
 
255
  score = await run_episode(client, env)
@@ -260,7 +263,12 @@ async def main():
260
  total_score = sum(scores) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
261
  final_score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
262
  success = final_score >= SUCCESS_SCORE_THRESHOLD
263
- log_end(success=success, steps=NUM_EPISODES*MAX_STEPS, score=final_score, rewards=scores)
 
 
 
 
 
264
 
265
 
266
  if __name__ == "__main__":
 
26
  from code_review import CodeReviewAction, CodeReviewObservation
27
  from code_review.client import CodeReviewEnv
28
 
29
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
30
  API_KEY = os.getenv("HF_TOKEN")
31
+ MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.1-8B-Instruct"
32
  TASK_NAME = "code_review"
33
  BENCHMARK = "code_review_benchmark"
34
  MAX_STEPS = 3
35
  TEMPERATURE = 0.2
36
+ MAX_TOKENS = 256
37
+ NUM_EPISODES = 4
38
  _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
39
  MAX_TOTAL_REWARD = NUM_EPISODES * MAX_STEPS * _MAX_REWARD_PER_STEP
40
  SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
 
68
  - Mention every issue explicitly
69
  - Use precise technical language
70
  - Write detailed comments (>30 characters)
71
+ - All string values in the JSON must use \\n for newlines, never literal line breaks
72
+ - Return ONLY raw JSON — no markdown fences, no preamble
73
 
74
  Return ONLY JSON:
75
 
 
195
  text = text.strip().replace("```json", "").replace("```", "")
196
 
197
  try:
198
+ return json.loads(text, strict=False)
199
  except Exception as e:
200
  print(e)
201
  return fallback_action()
 
237
  reward = result.reward
238
  done = result.done
239
 
240
+ log_step(
241
+ step=step, action=response_text, reward=reward.score, done=done, error=None
242
+ )
243
  final_score = max(final_score, reward.score if reward else 0.0)
244
 
245
  return final_score
 
253
 
254
  async with CodeReviewEnv(base_url="http://localhost:8000") as env:
255
  for i in range(NUM_EPISODES):
 
256
  env.task_index = i
257
 
258
  score = await run_episode(client, env)
 
263
  total_score = sum(scores) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
264
  final_score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
265
  success = final_score >= SUCCESS_SCORE_THRESHOLD
266
+ log_end(
267
+ success=success,
268
+ steps=NUM_EPISODES * MAX_STEPS,
269
+ score=final_score,
270
+ rewards=scores,
271
+ )
272
 
273
 
274
  if __name__ == "__main__":
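The switch to `json.loads(text, strict=False)` in `parse_action` matters when a model emits a literal line break inside a JSON string, which is common for `suggested_code`. A small sketch of the difference, using only the standard library; the sample `raw` text is made up for illustration.

```python
import json

# The "\n" below is a real newline character inside the JSON string value,
# which strict JSON parsing rejects as an invalid control character.
raw = '{"action_type": "suggest_fix", "suggested_code": "if b == 0:\n    return None"}'

try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# strict=False allows control characters (including newlines) inside strings.
lenient = json.loads(raw, strict=False)

print(strict_ok)
print(lenient["suggested_code"])
```

The prompt rule asking the model to write `\\n` instead of literal line breaks addresses the same failure from the other side; `strict=False` is the fallback for when the model ignores that rule.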
models.py CHANGED
@@ -11,8 +11,9 @@ The code_review environment is a simple test environment that echoes back messag
11
  """
12
 
13
  from openenv.core.env_server.types import Action, Observation
14
- from pydantic import Field, BaseModel
15
- from typing import Optional, List , Any , Dict
 
16
 
17
  class CodeReviewAction(Action):
18
  """Action for the Code Review environment - just a message to echo."""
@@ -23,6 +24,7 @@ class CodeReviewAction(Action):
23
  suggested_code: Optional[str] = None
24
  decision: Optional[str] = None
25
 
 
26
  class CodeDiff(BaseModel):
27
  file_name: str
28
  diff: str
@@ -35,10 +37,11 @@ class CodeReviewPullRequest(BaseModel):
35
  diffs: List[CodeDiff]
36
  language: str
37
 
 
38
  class CodeReviewObservation(Observation):
39
  """Observation from the Code Review environment - the echoed message."""
40
 
41
- #echoed_message: str = Field(default="", description="The echoed message")
42
  pr: CodeReviewPullRequest
43
  previous_comments: List[str]
44
  step_count: int
@@ -49,6 +52,7 @@ class CodeReviewReward(BaseModel):
49
  score: float
50
  feedback: str
51
 
 
52
  class CodeReviewStepResponse(BaseModel):
53
  observation: CodeReviewObservation
54
  reward: CodeReviewReward
 
11
  """
12
 
13
  from openenv.core.env_server.types import Action, Observation
14
+ from pydantic import Field, BaseModel
15
+ from typing import Optional, List, Any, Dict
16
+
17
 
18
  class CodeReviewAction(Action):
19
  """Action for the Code Review environment - just a message to echo."""
 
24
  suggested_code: Optional[str] = None
25
  decision: Optional[str] = None
26
 
27
+
28
  class CodeDiff(BaseModel):
29
  file_name: str
30
  diff: str
 
37
  diffs: List[CodeDiff]
38
  language: str
39
 
40
+
41
  class CodeReviewObservation(Observation):
42
  """Observation from the Code Review environment - the echoed message."""
43
 
44
+ # echoed_message: str = Field(default="", description="The echoed message")
45
  pr: CodeReviewPullRequest
46
  previous_comments: List[str]
47
  step_count: int
 
52
  score: float
53
  feedback: str
54
 
55
+
56
  class CodeReviewStepResponse(BaseModel):
57
  observation: CodeReviewObservation
58
  reward: CodeReviewReward
server/code_review_environment.py CHANGED
@@ -22,7 +22,7 @@ try:
22
  CodeReviewObservation,
23
  CodeReviewReward,
24
  CodeReviewPullRequest,
25
- CodeReviewStepResponse
26
  )
27
  except ImportError:
28
  from models import (
@@ -34,9 +34,33 @@ except ImportError:
34
 
35
  import json
36
  from pathlib import Path
 
 
37
 
38
  dataset_path = Path(__file__).parent.parent / "dataset" / "dataset.json"
39
 
40
  class CodeReviewEnvironment(Environment):
41
  """
42
  A simple echo environment that echoes back messages.
@@ -96,7 +120,7 @@ class CodeReviewEnvironment(Environment):
96
  self.fix_attempted = False
97
 
98
  return CodeReviewObservation(
99
- #echoed_message="Code Review environment ready!",
100
  pr=self.pr,
101
  previous_comments=self.history,
102
  step_count=self.step_count,
@@ -150,7 +174,7 @@ class CodeReviewEnvironment(Environment):
150
  self.fix_attempted = True
151
 
152
  score = self.grade_action(action, self.gt)
153
- print(f"Step {self.step_count} - Score: {score:.4f}")
154
 
155
  bonus = 0.0
156
 
@@ -184,18 +208,15 @@ class CodeReviewEnvironment(Environment):
184
  # print(type(CodeReviewObservation))
185
  # print(type(CodeReviewReward))
186
 
187
- obs = CodeReviewObservation(
188
- pr=self.pr,
189
- previous_comments=[a.comment for a in self.history if a.comment],
190
- step_count=self.step_count,
191
- max_steps=self.max_steps,
192
- )
193
  # print("Obs == " , obs)
194
 
195
- rew = CodeReviewReward(
196
- score=score,
197
- feedback="graded"
198
- )
199
 
200
  # print("FINAL REWARD TYPE:", type(rew))
201
  # print("FINAL REWARD:", rew)
@@ -223,21 +244,20 @@ class CodeReviewEnvironment(Environment):
223
  return self._state
224
 
225
  def _invalid_step(self):
226
- rew = CodeReviewReward(score=0.0, feedback="invalid action")
227
- obs = CodeReviewObservation(
228
- echoed_message="Invalid action format. Please send a valid CodeReviewAction.",
229
- pr=self.pr,
230
- previous_comments=[a.comment for a in self.history if a.comment],
231
- step_count=self.step_count,
232
- max_steps=self.max_steps,
233
- )
234
  return CodeReviewStepResponse(
235
  observation=obs,
236
  reward=rew,
237
  done=True,
238
  info={"error": "invalid_action"},
239
  )
240
-
241
 
242
  def grade_action(self, action, ground_truth):
243
  score = 0.0
@@ -295,25 +315,40 @@ class CodeReviewEnvironment(Environment):
295
  # ==============================
296
  # FIX MATCH (FUZZY)
297
  # ==============================
298
- def score_fix(self, suggested_code, ground_truth):
299
  if not suggested_code:
300
  return 0.0
301
 
302
  expected_fix = self.normalize(ground_truth.get("fix", ""))
303
  suggested_code = self.normalize(suggested_code)
304
 
305
- # direct match
 
 
 
306
  if expected_fix in suggested_code:
307
  return 1.0
308
 
309
- # partial keyword match
310
- keywords = expected_fix.split()
311
- if not keywords:
 
 
 
 
 
 
312
  return 0.0
313
 
314
- matches = sum(1 for word in keywords if word in suggested_code)
 
 
 
 
 
315
 
316
- return matches / len(keywords)
 
317
 
318
  # ==============================
319
  # DECISION MATCH
@@ -335,4 +370,3 @@ class CodeReviewEnvironment(Environment):
335
 
336
  # Wrong decision → partial penalty (not negative)
337
  return 0.2
338
-
 
22
  CodeReviewObservation,
23
  CodeReviewReward,
24
  CodeReviewPullRequest,
25
+ CodeReviewStepResponse,
26
  )
27
  except ImportError:
28
  from models import (
 
34
 
35
  import json
36
  from pathlib import Path
37
+ import re
38
+ from difflib import SequenceMatcher
39
 
40
  dataset_path = Path(__file__).parent.parent / "dataset" / "dataset.json"
41
 
42
+ STOP_WORDS = {
43
+ "use",
44
+ "the",
45
+ "a",
46
+ "an",
47
+ "to",
48
+ "and",
49
+ "or",
50
+ "of",
51
+ "in",
52
+ "for",
53
+ "with",
54
+ "is",
55
+ "it",
56
+ "on",
57
+ "at",
58
+ "by",
59
+ "from",
60
+ "that",
61
+ }
62
+
63
+
64
  class CodeReviewEnvironment(Environment):
65
  """
66
  A simple echo environment that echoes back messages.
 
120
  self.fix_attempted = False
121
 
122
  return CodeReviewObservation(
123
+ # echoed_message="Code Review environment ready!",
124
  pr=self.pr,
125
  previous_comments=self.history,
126
  step_count=self.step_count,
 
174
  self.fix_attempted = True
175
 
176
  score = self.grade_action(action, self.gt)
177
+ # print(f"Step {self.step_count} - Score: {score:.4f}")
178
 
179
  bonus = 0.0
180
 
 
208
  # print(type(CodeReviewObservation))
209
  # print(type(CodeReviewReward))
210
 
211
+ obs = CodeReviewObservation(
212
+ pr=self.pr,
213
+ previous_comments=[a.comment for a in self.history if a.comment],
214
+ step_count=self.step_count,
215
+ max_steps=self.max_steps,
216
+ )
217
  # print("Obs == " , obs)
218
 
219
+ rew = CodeReviewReward(score=score, feedback="graded")
 
 
 
220
 
221
  # print("FINAL REWARD TYPE:", type(rew))
222
  # print("FINAL REWARD:", rew)
 
244
  return self._state
245
 
246
  def _invalid_step(self):
247
+ rew = CodeReviewReward(score=0.0, feedback="invalid action")
248
+ obs = CodeReviewObservation(
249
+ echoed_message="Invalid action format. Please send a valid CodeReviewAction.",
250
+ pr=self.pr,
251
+ previous_comments=[a.comment for a in self.history if a.comment],
252
+ step_count=self.step_count,
253
+ max_steps=self.max_steps,
254
+ )
255
  return CodeReviewStepResponse(
256
  observation=obs,
257
  reward=rew,
258
  done=True,
259
  info={"error": "invalid_action"},
260
  )
 
261
 
262
  def grade_action(self, action, ground_truth):
263
  score = 0.0
 
315
  # ==============================
316
  # FIX MATCH (FUZZY)
317
  # ==============================
318
+ def score_fix(self, suggested_code: str, ground_truth: dict) -> float:
319
  if not suggested_code:
320
  return 0.0
321
 
322
  expected_fix = self.normalize(ground_truth.get("fix", ""))
323
  suggested_code = self.normalize(suggested_code)
324
 
325
+ if not expected_fix:
326
+ return 0.0
327
+
328
+ # 1. Exact / substring match — full score
329
  if expected_fix in suggested_code:
330
  return 1.0
331
 
332
+ # 2. Token overlap ignoring stop words
333
+ def code_tokens(text: str) -> list[str]:
334
+ tokens = re.findall(r"[a-zA-Z_]\w*|\d+|[=<>!+\-*/]+", text)
335
+ return [t for t in tokens if t.lower() not in STOP_WORDS]
336
+
337
+ expected_tokens = code_tokens(expected_fix)
338
+ suggested_tokens = set(code_tokens(suggested_code))
339
+
340
+ if not expected_tokens:
341
  return 0.0
342
 
343
+ token_score = sum(1 for t in expected_tokens if t in suggested_tokens) / len(
344
+ expected_tokens
345
+ )
346
+
347
+ # 3. Sequence similarity as a secondary signal
348
+ seq_score = SequenceMatcher(None, expected_fix, suggested_code).ratio()
349
 
350
+ # Weighted: token overlap matters more than character similarity
351
+ return round(0.7 * token_score + 0.3 * seq_score, 4)
352
 
353
  # ==============================
354
  # DECISION MATCH
 
370
 
371
  # Wrong decision → partial penalty (not negative)
372
  return 0.2
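The revised `score_fix` can be tried outside the environment with a standalone sketch. The token regex, the stop-word set, and the 0.7/0.3 weighting are copied from the diff above; `normalize` is an assumption (lowercasing and collapsing whitespace), since the real method is not shown in this diff, and the function here takes the expected fix string directly rather than the ground-truth dict.

```python
import re
from difflib import SequenceMatcher

STOP_WORDS = {"use", "the", "a", "an", "to", "and", "or", "of", "in", "for",
              "with", "is", "it", "on", "at", "by", "from", "that"}


def normalize(text: str) -> str:
    # Assumption: the real normalize() is not shown; lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def code_tokens(text: str) -> list[str]:
    # Identifiers, integers, and operator runs; stop words are dropped.
    tokens = re.findall(r"[a-zA-Z_]\w*|\d+|[=<>!+\-*/]+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def score_fix(suggested_code: str, expected_fix: str) -> float:
    if not suggested_code:
        return 0.0
    expected = normalize(expected_fix)
    suggested = normalize(suggested_code)
    if not expected:
        return 0.0
    if expected in suggested:  # exact or substring match earns full score
        return 1.0
    expected_tokens = code_tokens(expected)
    if not expected_tokens:
        return 0.0
    suggested_tokens = set(code_tokens(suggested))
    token_score = sum(t in suggested_tokens for t in expected_tokens) / len(expected_tokens)
    seq_score = SequenceMatcher(None, expected, suggested).ratio()
    return round(0.7 * token_score + 0.3 * seq_score, 4)


expected = "def divide(a, b):\n    if b == 0:\n        return None\n    return a / b"
attempt = "def divide(a, b):\n    if b == 0:\n        raise ValueError('zero')\n    return a / b"
print(score_fix(attempt, expected))
```

One side effect worth noting: because `a`, `in`, and `from` are in `STOP_WORDS`, a variable named `a` and the keywords `in` and `from` are ignored by the token-overlap term, so fixes such as `return target in arr` are compared mostly on their remaining identifiers.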