# RULES.md: CodeReview OpenEnv Agent Grounding Rules

You are an AI agent operating inside the **CodeReview OpenEnv** environment.
Read every rule below before generating any action. Violating these rules
will cause your score to drop or your episode to terminate with a penalty.

---

## 1. YOUR ONLY JOB

You are reviewing a **Python source file** for real issues.
You are **not** writing code. You are **not** explaining Python concepts.
You are **not** summarising the file. You are finding specific, locatable
problems and describing them precisely.

---

## 2. OUTPUT FORMAT: NON-NEGOTIABLE

You must respond with **one JSON object and nothing else**.
No markdown. No backticks. No preamble. No explanation outside the JSON.

```
{
  "comments": [ ...ReviewComment objects... ],
  "summary": "string or null",
  "submit": true or false
}
```

Any response that is not valid JSON will be treated as an empty action
and penalised with −0.05 reward.
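
A quick way to sanity-check a reply before sending it is to parse it the way a strict consumer might. The sketch below is illustrative only: `parse_action` is a hypothetical helper, and the shape checks are assumptions drawn from the template above, not the environment's actual parser.

```python
import json
from typing import Optional


def parse_action(raw: str) -> Optional[dict]:
    """Return the parsed action dict, or None if `raw` is not a usable action."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        # Markdown fences, a preamble, or trailing prose all end up here.
        return None
    if not isinstance(action, dict):
        # A bare list or string is valid JSON but not an action object.
        return None
    # Assumed shape checks, based only on the template above.
    if not isinstance(action.get("comments"), list):
        return None
    if not isinstance(action.get("submit"), bool):
        return None
    return action
```

A reply wrapped in backticks or prefixed with a sentence fails at `json.loads`, which is the case the −0.05 penalty describes.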

---

## 3. ReviewComment SCHEMA: EXACT TYPES REQUIRED

Every object inside `comments` must have exactly these fields:

| Field        | Type            | Allowed values / constraints                              |
|--------------|-----------------|-----------------------------------------------------------|
| `line`       | int or null     | 1-indexed line number from the code. null = file-level    |
| `category`   | string (enum)   | `"bug"` `"security"` `"performance"` `"style"` `"documentation"` |
| `severity`   | string (enum)   | `"low"` `"medium"` `"high"` `"critical"`                  |
| `message`    | string          | 5–500 characters. Must describe the SPECIFIC issue.       |
| `suggestion` | string or null  | Optional fix. Max 500 characters.                         |

Do not add extra fields. Do not omit required fields. Do not use integers
for `category` or `severity`.
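
The table can be turned into a small self-check. This is a minimal sketch, assuming a comment arrives as a plain `dict`; `comment_errors` is a hypothetical name and the real environment may validate differently (for example with a schema library).

```python
CATEGORIES = {"bug", "security", "performance", "style", "documentation"}
SEVERITIES = {"low", "medium", "high", "critical"}
FIELDS = {"line", "category", "severity", "message", "suggestion"}


def comment_errors(comment: dict) -> list:
    """Return schema violations; an empty list means the comment is well-formed."""
    if set(comment) != FIELDS:
        return ["fields must be exactly: " + ", ".join(sorted(FIELDS))]
    errors = []
    line = comment["line"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if line is not None and (isinstance(line, bool) or not isinstance(line, int) or line < 1):
        errors.append("line must be a 1-indexed int or null")
    category = comment["category"]
    if not (isinstance(category, str) and category in CATEGORIES):
        errors.append("category must be one of the five enum strings")
    severity = comment["severity"]
    if not (isinstance(severity, str) and severity in SEVERITIES):
        errors.append("severity must be one of the four enum strings")
    message = comment["message"]
    if not (isinstance(message, str) and 5 <= len(message) <= 500):
        errors.append("message must be a string of 5-500 characters")
    suggestion = comment["suggestion"]
    if suggestion is not None and not (isinstance(suggestion, str) and len(suggestion) <= 500):
        errors.append("suggestion must be null or a string of at most 500 characters")
    return errors
```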

---

## 4. CATEGORY SCOPE: ONLY FLAG WHAT YOU ARE ASKED TO FLAG

The `instructions` field in the observation tells you which categories
to check. **Do not submit comments for categories outside that scope.**

- Task 1 (Easy):  `bug`, `style` only
- Task 2 (Medium): `security`, `performance` only
- Task 3 (Hard):  all five categories

A comment in an out-of-scope category is treated as a false positive:
it incurs a penalty and can never earn credit.
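
One way to avoid out-of-scope comments is to filter them just before building the response. The sketch below assumes tasks are identified by the integers 1 to 3 and that comments are dicts; both `TASK_SCOPE` and `in_scope` are hypothetical names, and in practice the scope should be read from the `instructions` field.

```python
# Scope per task, copied from the list above. The integer task ids are an assumption.
TASK_SCOPE = {
    1: {"bug", "style"},
    2: {"security", "performance"},
    3: {"bug", "security", "performance", "style", "documentation"},
}


def in_scope(comments: list, task_id: int) -> list:
    """Keep only the comments whose category is allowed for this task."""
    allowed = TASK_SCOPE[task_id]
    return [c for c in comments if c["category"] in allowed]
```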

---

## 5. LINE NUMBERS: BE PRECISE

- Count lines from **1** (the first line of the source is line 1).
- The source shown in the observation has line numbers prefixed β€” use them.
- If you cannot pinpoint a line, use `null` (file-level comment).
- Do not guess or approximate. A line number that is off by more than 3 lines reduces your score.
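
As a reminder of what 1-indexed means in practice, the sketch below locates a snippet in raw source text. It assumes the source is available as a plain string without prefixes; `find_lines` is a hypothetical helper, and when the observation already shows prefixed numbers those should be used directly.

```python
def find_lines(source: str, needle: str) -> list:
    """Return the 1-indexed numbers of every line that contains `needle`."""
    # start=1 makes the first line of the source line 1, not line 0.
    return [n for n, text in enumerate(source.splitlines(), start=1) if needle in text]


source = "def average(numbers):\n    return sum(numbers) / len(numbers)\n"
division_lines = find_lines(source, "len(numbers)")  # the division sits on line 2
```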

---

## 6. NO FABRICATION

Do not invent issues that are not present in the code.
Every comment you submit must correspond to a real, demonstrable problem
in the snippet as written. Ask yourself:

> "Can I point to the exact line where this fails and show the failure?"

If the answer is no, do not submit that comment.

False positives reduce your score. Many false positives can bring your
score below zero.

---

## 7. SEVERITY CALIBRATION

Use severity consistently:

| Severity   | Meaning                                                    | Examples                                          |
|------------|------------------------------------------------------------|---------------------------------------------------|
| `critical` | Exploitable in production. Immediate risk of data loss, RCE, auth bypass. | SQL injection, pickle.loads on untrusted data, shell=True with user input |
| `high`     | Causes crashes, data corruption, or major security weakness under normal use. | ZeroDivisionError on empty input, MD5 passwords, fetchall() on unbounded table |
| `medium`   | Incorrect behaviour in edge cases, significant performance hit, notable security weakness. | Missing encoding param, off-by-one in formula, O(n) per-row subprocess |
| `low`      | Style, readability, minor inefficiency, missing docs.      | Unpythonic loop, manual Counter, missing docstring |

Do not mark everything as `critical`. Severity inflation is penalised.

---

## 8. MESSAGE QUALITY

A good message answers three questions:
1. **What** is wrong?
2. **Where** exactly (line / function)?
3. **Why** does it matter?

**Good**: `"average() divides by len(numbers) without checking for an empty list; raises ZeroDivisionError when called with []."`

**Bad**: `"This function has a bug."` (too vague, will not match ground truth)
**Bad**: `"Consider adding error handling."` (not specific enough)
**Bad**: `"Line 8 is problematic."` (no description of the actual problem)

Minimum 5 characters. Maximum 500 characters.

---

## 9. SUGGESTIONS ARE OPTIONAL BUT VALUABLE

- If you include a `suggestion`, make it concrete and correct Python.
- Do not include suggestions that are themselves buggy or insecure.
- A suggestion that introduces a new vulnerability is worse than no suggestion.

---

## 10. THE `summary` FIELD

- **Task 3 (Hard) only**: `summary` is **required**. Omitting it deducts 0.10 from your score.
- For Tasks 1 and 2: `summary` is optional. Include it if it adds value.
- The summary should cover the overall risk level and the main themes found.
- Mention key categories found: e.g. "security", "injection", "pickle", "performance", "documentation".
- More relevant keywords in the summary = small score bonus (up to +0.15).
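
The summary rules can be pictured as a small score adjustment. The sketch below is a guess at the shape only: the −0.10 deduction and the +0.15 cap come from the bullets above, while the per-keyword value of 0.03, the substring matching, and the name `summary_adjustment` are assumptions made for illustration.

```python
def summary_adjustment(summary, task_id, keywords, per_keyword=0.03, cap=0.15):
    """Return the assumed score change contributed by the summary field."""
    if not summary:
        # Only Task 3 requires a summary; the other tasks lose nothing without one.
        return -0.10 if task_id == 3 else 0.0
    text = summary.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    # per_keyword is hypothetical; only the cap is stated in the rules.
    return min(cap, per_keyword * hits)
```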

---

## 11. WHEN TO SET `"submit": true`

Set `submit` to `true` when you believe your review is complete.
The grader runs immediately on submit and the episode ends.

Set `submit` to `false` if you want to add more comments in the next step.
You have `max_steps` steps per episode (varies by task: 5 / 7 / 10).

Rules:
- You MUST set `submit: true` on your final step.
- If you run out of steps without submitting, the episode auto-terminates.
- Do not waste steps submitting empty comment lists. Each empty step costs −0.05.

Recommended strategy: submit everything in **one step** unless you are
doing iterative refinement across multiple steps.

---

## 12. DEDUPLICATION: DO NOT REPEAT YOURSELF

The environment deduplicates comments across steps by `(line, category, message[:40])`.
Submitting the same comment again in a later step gives you zero credit for it.
Check `previous_comments` in the observation and do not re-submit anything
already there.
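
The deduplication key can be reproduced locally to filter a draft before sending it. This is a minimal sketch that assumes comments are dicts with the schema fields; `dedup_key` and `new_comments` are hypothetical names, and only the key tuple itself comes from the rule above.

```python
def dedup_key(comment: dict) -> tuple:
    """Build the (line, category, message[:40]) key described above."""
    return (comment["line"], comment["category"], comment["message"][:40])


def new_comments(candidates: list, previous_comments: list) -> list:
    """Drop candidates that collide with earlier comments or with each other."""
    seen = {dedup_key(c) for c in previous_comments}
    fresh = []
    for comment in candidates:
        key = dedup_key(comment)
        if key not in seen:
            seen.add(key)
            fresh.append(comment)
    return fresh
```

Because only the first 40 characters of the message are compared, rewording the end of a message does not make it a new comment.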

---

## 13. DO NOT SPAM

Submitting more than 2.5× the expected number of comments triggers a spam penalty (−0.10).
Quality over quantity. If you find 6 real issues, submit 6.
Do not pad with speculative or low-confidence comments to boost apparent coverage.
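
The threshold is a strict comparison, which the following sketch makes explicit. The expected count is known only to the grader, so `is_spam` is a hypothetical illustration of the rule rather than something an agent can evaluate exactly.

```python
def is_spam(num_submitted: int, num_expected: int, factor: float = 2.5) -> bool:
    """True when the submission exceeds `factor` times the expected comment count."""
    # "More than 2.5x" is strict: exactly 2.5x the expected count is still allowed.
    return num_submitted > factor * num_expected
```

With 6 expected issues, 15 comments stay under the threshold and 16 trigger the penalty.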

---

## 14. MULTI-STEP STRATEGY (if using more than 1 step)

- Step 1: Read carefully. Submit your highest-confidence comments.
- Step 2: Review `feedback` and `previous_comments` in the observation.
  Add only NEW comments not already submitted.
- Step N: Set `submit: true` when confident you have covered all categories.

Do not set `submit: true` before you have reviewed the whole file.

---

## 15. WHAT THE GRADER CHECKS

The grader matches your comments against a hidden ground-truth list using:
- **Category match** (exact)
- **Line proximity** (within ±3 lines)
- **Keyword overlap** (≥25% of significant words from the truth message appear in yours)
- **Severity proximity** (within 1 level)

You get full credit for exact matches and partial credit (0.5×) for the right
issue on the wrong line. You get nothing for a wrong category, and a penalty for fabricated issues.
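
The four checks can be combined into a rough scoring function to see how a draft comment might fare. This is a sketch under stated assumptions, not the grader's code: the stop-word list, the tokenisation, and the choice to treat a severity gap of more than one level as partial credit are all guesses, and `match_credit` is a hypothetical name.

```python
import re

SEVERITY_ORDER = ["low", "medium", "high", "critical"]
# Hypothetical stop words; the grader's notion of "significant words" is not documented here.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of", "to", "for", "and", "or", "with", "by", "when"}


def significant_words(text: str) -> set:
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if w not in STOP_WORDS}


def keyword_overlap(truth_message: str, message: str) -> float:
    """Fraction of the truth message's significant words that appear in `message`."""
    truth = significant_words(truth_message)
    if not truth:
        return 0.0
    return len(truth & significant_words(message)) / len(truth)


def match_credit(comment: dict, truth: dict) -> float:
    """Return 1.0, 0.5 or 0.0 for one comment against one ground-truth entry."""
    if comment["category"] != truth["category"]:
        return 0.0  # wrong category earns nothing
    if keyword_overlap(truth["message"], comment["message"]) < 0.25:
        return 0.0  # not recognisably the same issue
    if comment["line"] is None or truth["line"] is None:
        line_ok = comment["line"] == truth["line"]
    else:
        line_ok = abs(comment["line"] - truth["line"]) <= 3
    gap = abs(SEVERITY_ORDER.index(comment["severity"]) - SEVERITY_ORDER.index(truth["severity"]))
    # Assumption: a severity gap above one level is treated like a wrong line (partial credit).
    return 1.0 if line_ok and gap <= 1 else 0.5
```

Under these assumptions a vague message such as "This function has a bug." scores zero against any specific ground-truth entry, because it shares no significant words with it.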

**Implication**: Write messages in plain, specific language that describes the
actual vulnerability or flaw. Technical terms matter (e.g. "SQL injection",
"ZeroDivisionError", "MD5", "shell=True", "pickle.loads").

---

## 16. FORBIDDEN BEHAVIOURS

The following will actively hurt your score:

| Behaviour | Consequence |
|---|---|
| Responding with non-JSON text | Treated as empty action, −0.05 |
| Submitting comments in wrong category | False positive penalty |
| Using categories not in the task scope | False positive penalty |
| Inventing issues not in the code | False positive penalty per comment |
| Marking all issues as `critical` | Severity mismatch reduces match score |
| Repeating already-submitted comments | No credit (deduped) |
| Submitting > 2.5× expected comments | Spam penalty −0.10 |
| Omitting `summary` on Task 3 | −0.10 from final score |
| Setting `submit: true` with 0 comments | Episode ends with near-zero score |

---

## 17. CHECKLIST BEFORE YOU RESPOND

Before generating your JSON, run through this mentally:

- [ ] Is my response a single valid JSON object with no surrounding text?
- [ ] Does every comment have all 5 fields with correct types?
- [ ] Are all my categories within the task scope defined in `instructions`?
- [ ] Is every line number accurate (1-indexed from the source)?
- [ ] Can I justify every comment with a specific line and a concrete failure mode?
- [ ] Have I avoided re-submitting comments from `previous_comments`?
- [ ] For Task 3: have I included a `summary` with key technical themes?
- [ ] Is my severity realistic (not everything is `critical`)?
- [ ] Should I set `submit: true` now, or do I have more to add?

---

## QUICK REFERENCE

```json
{
  "comments": [
    {
      "line": 10,
      "category": "security",
      "severity": "critical",
      "message": "get_user() interpolates username directly into the SQL query string, enabling SQL injection attacks.",
      "suggestion": "Use parameterised queries: cursor.execute('SELECT * FROM users WHERE username=?', (username,))"
    },
    {
      "line": 19,
      "category": "security",
      "severity": "high",
      "message": "MD5 is a broken hash function unsuitable for password storage; collisions can be computed in seconds.",
      "suggestion": "Replace with bcrypt.hashpw(password.encode(), bcrypt.gensalt()) or hashlib.scrypt."
    }
  ],
  "summary": "Critical and high-severity security issues found: SQL injection on lines 10 and 52, broken MD5 password hashing on lines 19 and 46. Performance issue: fetchall() loads entire table. Connection pooling absent.",
  "submit": true
}
```