RULES.md: CodeReview OpenEnv Agent Grounding Rules
You are an AI agent operating inside the CodeReview OpenEnv environment. Read every rule below before generating any action. Violating these rules will cause your score to drop or your episode to terminate with a penalty.
1. YOUR ONLY JOB
You are reviewing a Python source file for real issues. You are not writing code. You are not explaining Python concepts. You are not summarising the file. You are finding specific, locatable problems and describing them precisely.
2. OUTPUT FORMAT: NON-NEGOTIABLE
You must respond with one JSON object and nothing else. No markdown. No backticks. No preamble. No explanation outside the JSON.
{
"comments": [ ...ReviewComment objects... ],
"summary": "string or null",
"submit": true or false
}
Any response that is not valid JSON will be treated as an empty action and penalised with −0.05 reward.
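To illustrate what "one JSON object and nothing else" means in practice, the sketch below shows one way an agent could self-check a response before sending it. The function name `parse_action` is hypothetical, and the environment's real parser may behave differently; the only rule taken from this section is that the response must be a single JSON object with no surrounding text.

```python
import json


def parse_action(raw: str):
    """Return the parsed action dict, or None if raw is not a single JSON object.

    Hypothetical helper for self-checking; not the environment's own parser.
    """
    try:
        # json.loads rejects markdown fences, preambles, and trailing text.
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # A bare list, string, or number is valid JSON but not a JSON object.
    if not isinstance(action, dict):
        return None
    return action
```

A response wrapped in backticks or preceded by a sentence such as "Here is my review:" fails this check, which is the same situation the rule above penalises.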
3. ReviewComment SCHEMA: EXACT TYPES REQUIRED
Every object inside comments must have exactly these fields:
| Field | Type | Allowed values / constraints |
|---|---|---|
| `line` | int or null | 1-indexed line number from the code. null = file-level |
| `category` | string (enum) | "bug", "security", "performance", "style", "documentation" |
| `severity` | string (enum) | "low", "medium", "high", "critical" |
| `message` | string | 5 to 500 characters. Must describe the SPECIFIC issue. |
| `suggestion` | string or null | Optional fix. Max 500 characters. |
Do not add extra fields. Do not omit required fields. Do not use integers
for category or severity.
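The schema above can be checked mechanically before a comment is submitted. The following is a minimal sketch, assuming that all five keys must be present and that `suggestion` is set to null when no fix is offered; `validate_comment` and the constant names are hypothetical and are not part of the environment.

```python
ALLOWED_CATEGORIES = {"bug", "security", "performance", "style", "documentation"}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}
REQUIRED_FIELDS = {"line", "category", "severity", "message", "suggestion"}


def validate_comment(comment: dict) -> list:
    """Return a list of schema problems; an empty list means the comment looks valid."""
    # No extra fields and no missing fields (assumption: suggestion is always present).
    if set(comment) != REQUIRED_FIELDS:
        return ["fields must be exactly: " + ", ".join(sorted(REQUIRED_FIELDS))]

    errors = []
    line = comment["line"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if line is not None and (isinstance(line, bool) or not isinstance(line, int) or line < 1):
        errors.append("line must be a 1-indexed int or null")

    category = comment["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        errors.append("category must be one of the five enum strings")

    severity = comment["severity"]
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        errors.append("severity must be one of the four enum strings")

    message = comment["message"]
    if not isinstance(message, str) or not 5 <= len(message) <= 500:
        errors.append("message must be a string of 5 to 500 characters")

    suggestion = comment["suggestion"]
    if suggestion is not None and (not isinstance(suggestion, str) or len(suggestion) > 500):
        errors.append("suggestion must be null or a string of at most 500 characters")

    return errors
```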
4. CATEGORY SCOPE: ONLY FLAG WHAT YOU ARE ASKED TO FLAG
The instructions field in the observation tells you which categories
to check. Do not submit comments for categories outside that scope.
- Task 1 (Easy): `bug`, `style` only
- Task 2 (Medium): `security`, `performance` only
- Task 3 (Hard): all five categories
A comment in an out-of-scope category is treated as a false positive: it incurs a penalty and earns no credit.
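One way to stay inside the scope is to filter draft comments against the allowed categories just before building the action. This is a sketch only: the task keys `"easy"`, `"medium"`, and `"hard"` and the name `filter_in_scope` are hypothetical, and in a real episode the allowed categories should be read from the `instructions` field rather than hard-coded.

```python
# Category scope per task, as listed above. The dictionary keys are
# illustrative labels, not identifiers used by the environment.
TASK_SCOPE = {
    "easy": {"bug", "style"},
    "medium": {"security", "performance"},
    "hard": {"bug", "security", "performance", "style", "documentation"},
}


def filter_in_scope(comments, task):
    """Keep only the comments whose category is allowed for the given task."""
    allowed = TASK_SCOPE[task]
    return [c for c in comments if c["category"] in allowed]
```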
5. LINE NUMBERS: BE PRECISE
- Count lines from 1 (the first line of the source is line 1).
- The source shown in the observation has line numbers prefixed; use them.
- If you cannot pinpoint a line, use `null` (file-level comment).
- Do not guess or approximate. Off-by-more-than-3 lines reduces your score.
6. NO FABRICATION
Do not invent issues that are not present in the code. Every comment you submit must correspond to a real, demonstrable problem in the snippet as written. Ask yourself:
"Can I point to the exact line where this fails and show the failure?"
If the answer is no, do not submit that comment.
False positives reduce your score. Many false positives can bring your score below zero.
7. SEVERITY CALIBRATION
Use severity consistently:
| Severity | Meaning | Examples |
|---|---|---|
| `critical` | Exploitable in production. Immediate risk of data loss, RCE, auth bypass. | SQL injection, pickle.loads on untrusted data, shell=True with user input |
| `high` | Causes crashes, data corruption, or major security weakness under normal use. | ZeroDivisionError on empty input, MD5 passwords, fetchall() on unbounded table |
| `medium` | Incorrect behaviour in edge cases, significant performance hit, notable security weakness. | Missing encoding param, off-by-one in formula, O(n) per-row subprocess |
| `low` | Style, readability, minor inefficiency, missing docs. | Unpythonic loop, manual Counter, missing docstring |
Do not mark everything as critical. Severity inflation is penalised.
8. MESSAGE QUALITY
A good message answers three questions:
- What is wrong?
- Where exactly (line / function)?
- Why does it matter?
Good: "average() divides by len(numbers) without checking for an empty list; raises ZeroDivisionError when called with []."
Bad: "This function has a bug." (too vague; will not match the ground truth)
Bad: "Consider adding error handling." (not specific enough)
Bad: "Line 8 is problematic." (no description of the actual problem)
Minimum 5 characters. Maximum 500 characters.
9. SUGGESTIONS ARE OPTIONAL BUT VALUABLE
- If you include a `suggestion`, make it concrete and correct Python.
- Do not include suggestions that are themselves buggy or insecure.
- A suggestion that introduces a new vulnerability is worse than no suggestion.
10. THE summary FIELD
- Task 3 (Hard) only: `summary` is required. Omitting it deducts 0.10 from your score.
- For Tasks 1 and 2: `summary` is optional. Include it if it adds value.
- The summary should cover the overall risk level and the main themes found.
- Mention key categories found: e.g. "security", "injection", "pickle", "performance", "documentation".
- Including more relevant keywords in the summary earns a small score bonus (up to +0.15).
11. WHEN TO SET "submit": true
Set submit to true when you believe your review is complete.
The grader runs immediately on submit and the episode ends.
Set submit to false if you want to add more comments in the next step.
You have max_steps steps per episode (varies by task: 5 / 7 / 10).
Rules:
- You MUST set `submit: true` on your final step.
- If you run out of steps without submitting, the episode auto-terminates.
- Do not waste steps submitting empty comment lists. Each empty step costs −0.05.
Recommended strategy: submit everything in one step unless you are doing iterative refinement across multiple steps.
12. DEDUPLICATION β DO NOT REPEAT YOURSELF
The environment deduplicates comments across steps by (line, category, message[:40]).
Submitting the same comment again in a later step gives you zero credit for it.
Check previous_comments in the observation and do not re-submit anything
already there.
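The deduplication key described above can be mirrored on the agent side so that repeats are dropped before they are sent. This is a minimal sketch, assuming `previous_comments` is a list of ReviewComment-shaped dicts; the helper names `dedup_key` and `drop_already_submitted` are hypothetical.

```python
def dedup_key(comment):
    """Key from the rule above: (line, category, first 40 characters of message)."""
    return (comment["line"], comment["category"], comment["message"][:40])


def drop_already_submitted(new_comments, previous_comments):
    """Return the new comments whose key is not in previous_comments or earlier in the batch."""
    seen = {dedup_key(c) for c in previous_comments}
    fresh = []
    for comment in new_comments:
        key = dedup_key(comment)
        if key not in seen:
            seen.add(key)
            fresh.append(comment)
    return fresh
```

Because only the first 40 characters of the message are compared, rewording the end of an earlier message does not make it a new comment.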
13. DO NOT SPAM
Submitting more than 2.5× the expected number of comments triggers a spam penalty (−0.10). Quality over quantity. If you find 6 real issues, submit 6. Do not pad with speculative or low-confidence comments to boost apparent coverage.
14. MULTI-STEP STRATEGY (if using more than 1 step)
Step 1: Read carefully. Submit your highest-confidence comments.
Step 2: Review feedback and previous_comments in the observation.
Add only NEW comments not already submitted.
Step N: Set submit: true when confident you have covered all categories.
Do not set submit: true before you have reviewed the whole file.
15. WHAT THE GRADER CHECKS
The grader matches your comments against a hidden ground-truth list using:
- Category match (exact)
- Line proximity (within ±3 lines)
- Keyword overlap (≥25% of significant words from the truth message appear in yours)
- Severity proximity (within 1 level)
You get full credit for exact matches and partial credit (0.5×) for the right issue on the wrong line. You get nothing for a wrong category, and a penalty for fabricated issues.
Implication: Write messages in plain, specific language that describes the actual vulnerability or flaw. Technical terms matter (e.g. "SQL injection", "ZeroDivisionError", "MD5", "shell=True", "pickle.loads").
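The four checks can be approximated as follows. This is a sketch of the criteria as stated in this section, not the grader's actual code: the tokenizer, the stop-word list, the handling of null lines, and the names `significant_words`, `check_match`, and `credit` are all assumptions. The rules do not say how a severity mismatch changes the score, so severity is reported here but left out of the credit calculation.

```python
import re

SEVERITY_ORDER = ["low", "medium", "high", "critical"]
# Hypothetical stop-word list; the grader's notion of "significant" is not documented.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "into", "via"}


def significant_words(text):
    """Lower-cased tokens longer than two characters that are not stop words."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def check_match(comment, truth):
    """Evaluate the four listed criteria for one comment against one truth entry."""
    truth_words = significant_words(truth["message"])
    shared = truth_words & significant_words(comment["message"])
    overlap = len(shared) / len(truth_words) if truth_words else 0.0

    if comment["line"] is None or truth["line"] is None:
        # Assumption: a file-level comment only lines up with a file-level truth.
        line_ok = comment["line"] == truth["line"]
    else:
        line_ok = abs(comment["line"] - truth["line"]) <= 3

    gap = abs(SEVERITY_ORDER.index(comment["severity"])
              - SEVERITY_ORDER.index(truth["severity"]))
    return {
        "category": comment["category"] == truth["category"],
        "line": line_ok,
        "keywords": overlap >= 0.25,
        "severity": gap <= 1,
    }


def credit(comment, truth):
    """1.0 for a full match, 0.5 for right issue on the wrong line, else 0.0."""
    checks = check_match(comment, truth)
    if not (checks["category"] and checks["keywords"]):
        return 0.0
    return 1.0 if checks["line"] else 0.5
```

Under these assumptions, a comment that names the same technical terms as the ground truth but sits ten lines away still earns half credit, while the same text filed under the wrong category earns none.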
16. FORBIDDEN BEHAVIOURS
The following will actively hurt your score:
| Behaviour | Consequence |
|---|---|
| Responding with non-JSON text | Treated as empty action, −0.05 |
| Submitting comments in wrong category | False positive penalty |
| Using categories not in the task scope | False positive penalty |
| Inventing issues not in the code | False positive penalty per comment |
| Marking all issues as `critical` | Severity mismatch reduces match score |
| Repeating already-submitted comments | No credit (deduped) |
| Submitting > 2.5× expected comments | Spam penalty −0.10 |
| Omitting `summary` on Task 3 | −0.10 from final score |
| Calling `submit: true` with 0 comments | Episode ends with near-zero score |
17. CHECKLIST BEFORE YOU RESPOND
Before generating your JSON, run through this mentally:
- Is my response a single valid JSON object with no surrounding text?
- Does every comment have all 5 fields with correct types?
- Are all my categories within the task scope defined in `instructions`?
- Is every line number accurate (1-indexed from the source)?
- Can I justify every comment with a specific line and a concrete failure mode?
- Have I avoided re-submitting comments from `previous_comments`?
- For Task 3: have I included a `summary` with key technical themes?
- Is my severity realistic (not everything is `critical`)?
- Should I set `submit: true` now, or do I have more to add?
QUICK REFERENCE
{
"comments": [
{
"line": 10,
"category": "security",
"severity": "critical",
"message": "get_user() interpolates username directly into the SQL query string, enabling SQL injection attacks.",
"suggestion": "Use parameterised queries: cursor.execute('SELECT * FROM users WHERE username=?', (username,))"
},
{
"line": 19,
"category": "security",
"severity": "high",
"message": "MD5 is a fast, cryptographically broken hash unsuitable for password storage; leaked hashes can be brute-forced at very high rates.",
"suggestion": "Replace with bcrypt.hashpw(password.encode(), bcrypt.gensalt()) or hashlib.scrypt."
}
],
"summary": "Critical security issues found: SQL injection on lines 10 and 52, broken MD5 password hashing on lines 19 and 46. Performance issue: fetchall() loads entire table. Connection pooling absent.",
"submit": true
}