
πŸ” CodeReview OpenEnv

An OpenEnv-compliant AI training environment that simulates professional Python code review. Agents learn to identify bugs, security vulnerabilities, performance bottlenecks, style issues, and documentation gaps – exactly as a senior engineer would in a real pull-request workflow.


Why Code Review?

Code review is one of the highest-leverage tasks in software engineering. It is:

  • Real-world: Every professional software team does it daily
  • Structured enough to grade: Issues have objectively correct or incorrect assessments
  • Rich in partial signal: An agent that spots 3/5 critical issues is measurably better than one that spots 1/5
  • Scalable in difficulty: Easy (bugs only) → Hard (all categories + written summary)

This makes it an ideal domain for training and evaluating LLM-based agents on multi-step reasoning and quality estimation tasks.


Environment Description

CodeReviewEnv
├── Task 1 – Easy    : Bug detection + Code style        (calculator.py, 31 lines)
├── Task 2 – Medium  : Security + Performance audit      (user_service.py, 55 lines)
└── Task 3 – Hard    : Full review, all 5 categories     (data_pipeline.py, 49 lines)

Each task presents a Python snippet containing intentional flaws. The agent submits ReviewComment objects across one or more steps, then finalises with submit=True. A deterministic grader scores the review against ground-truth issues.


Observation Space

What the agent sees on each step:

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step (0-indexed) |
| `snippet.file_name` | `str` | Logical file name (e.g. `auth.py`) |
| `snippet.source` | `str` | Full Python source code |
| `instructions` | `str` | Review scope, difficulty, and guidance |
| `previous_comments` | `list[ReviewComment]` | All comments submitted so far |
| `feedback` | `str \| None` | Env feedback on the last action |
| `done` | `bool` | Whether the episode has ended |

Action Space

What the agent submits on each step:

{
  "comments": [
    {
      "line": 10,
      "category": "security",
      "severity": "critical",
      "message": "SQL injection via string interpolation in query.",
      "suggestion": "Use parameterised queries: cursor.execute('...', (username,))"
    }
  ],
  "summary": "Overall review summary (required for task_3_hard)",
  "submit": true
}
| Field | Type | Values |
|-------|------|--------|
| `comments[].line` | `int \| null` | 1-indexed line number; `null` for file-level |
| `comments[].category` | enum | `bug`, `security`, `performance`, `style`, `documentation` |
| `comments[].severity` | enum | `low`, `medium`, `high`, `critical` |
| `comments[].message` | `str` | 5–500 chars |
| `comments[].suggestion` | `str \| null` | Optional fix suggestion |
| `summary` | `str \| null` | Required for `task_3_hard`, optional otherwise |
| `submit` | `bool` | `true` finalises the review and triggers the grader |

Reward Function

Rewards are shaped to provide signal over the full trajectory, not just on terminal submit.

Per-step (incremental) rewards

| Event | Reward |
|-------|--------|
| New valid comment added | +0.05 per comment (max +0.15) |
| Progress signal (grader score delta) | +0.5 × Δscore |
| Empty step (no new comments) | −0.05 |
| Spam (> 2.5× expected comments) | −0.10 |

On submit=True (terminal)

submit_reward = score × 0.8 + (0.2 if score ≥ threshold else −0.2)

Per-category penalties (applied to terminal grader score)

| Event | Penalty |
|-------|---------|
| False positive (fabricated issue) | −0.08 to −0.12 per comment |
| Missed CRITICAL security issue | −0.15 to −0.20 |
| Missed HIGH issue | −0.08 to −0.10 |
| No summary on task 3 | −0.10 |

All rewards are clipped to [−1.0, 1.0].
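
The shaping above can be sketched in a few lines. The constants come from the tables, but the real logic lives in `env/environment.py` and the graders, so treat this as an approximation rather than the actual implementation.

```python
# Sketch of the reward shaping described above; constants are taken
# from the reward tables, the exact implementation may differ.

def clip(r: float) -> float:
    """All rewards are clipped to [-1.0, 1.0]."""
    return max(-1.0, min(1.0, r))

def step_reward(new_comments: int, score_delta: float,
                total_comments: int, expected: int) -> float:
    r = min(new_comments, 3) * 0.05        # +0.05 per comment, max +0.15
    r += 0.5 * score_delta                 # progress signal
    if new_comments == 0:
        r -= 0.05                          # empty-step penalty
    if total_comments > 2.5 * expected:
        r -= 0.10                          # spam penalty
    return clip(r)

def submit_reward(score: float, threshold: float) -> float:
    # terminal reward on submit=True
    bonus = 0.2 if score >= threshold else -0.2
    return clip(score * 0.8 + bonus)
```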


Task Descriptions

Task 1 – Easy: Bug Detection & Style Review

File: calculator.py (31 lines) | Max steps: 5 | Pass threshold: 0.55

Covers basic utility functions: divide, average, celsius_to_fahrenheit, find_max, count_words.

Ground-truth issues (6):

  • divide() – no zero-division guard (HIGH bug)
  • average() – crashes on empty list (HIGH bug)
  • celsius_to_fahrenheit() – off-by-one (+31 vs +32) (MEDIUM bug)
  • find_max() – crashes on empty list (MEDIUM bug)
  • for i in range(len(lst)) – unpythonic iteration (LOW style)
  • Manual Counter reimplementation (LOW style)
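
calculator.py itself is not reproduced in this README. Purely as an illustration of the flaw patterns an agent must flag, two of the issues might look like this (hypothetical code, not the real snippet):

```python
# Illustrative only -- the actual calculator.py is not shown in this
# README. These reproduce two of the flaw patterns listed above,
# each paired with one plausible fix.

def divide(a, b):
    return a / b                 # HIGH bug: no zero-division guard

def divide_fixed(a, b):
    if b == 0:
        raise ValueError("b must be non-zero")
    return a / b

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 31        # MEDIUM bug: off-by-one (+31 vs +32)

def celsius_to_fahrenheit_fixed(c):
    return c * 9 / 5 + 32
```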

Task 2 – Medium: Security & Performance Audit

File: user_service.py (55 lines) | Max steps: 7 | Pass threshold: 0.60

A SQLite-backed user management service with authentication.

Ground-truth issues (6):

  • SQL injection in get_user() – f-string query (CRITICAL security)
  • MD5 password hashing in create_user() (CRITICAL security)
  • SQL injection in delete_user() (CRITICAL security)
  • MD5 reuse in authenticate() (HIGH security)
  • fetchall() on unbounded table (HIGH performance)
  • New DB connection per query, no pooling (MEDIUM performance)

Task 3 – Hard: Comprehensive Code Review

File: data_pipeline.py (49 lines) | Max steps: 10 | Pass threshold: 0.65

An analytics data pipeline with CSV loading, row transformation, caching, and stats.

Ground-truth issues (13 across all 5 categories):

  • subprocess.run(shell=True) with user input – OS command injection (CRITICAL security)
  • pickle.loads() on arbitrary cache data – RCE risk (CRITICAL security)
  • Pickling into module-level dict (HIGH security)
  • compute_stats() ZeroDivisionError on empty data (HIGH bug)
  • Missing "value" key → silent KeyError (MEDIUM bug)
  • open() without encoding (MEDIUM bug)
  • Two-pass iteration in compute_stats() (MEDIUM performance)
  • Subprocess per row instead of batching (MEDIUM performance)
  • str(stats) instead of JSON export (LOW style)
  • Module-level mutable global cache (LOW style)
  • load_data() missing docstring (LOW documentation)
  • process_row() missing docstring (LOW documentation)
  • Insufficient module-level docstring (LOW documentation)
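
data_pipeline.py is likewise not shown here. As a purely illustrative sketch of the two CRITICAL patterns and their safer counterparts (hypothetical code, not the real snippet):

```python
# Illustrative only -- data_pipeline.py is not reproduced here. These
# sketch safer alternatives to the two CRITICAL patterns listed above.
import json
import subprocess

def line_count_unsafe(path: str) -> None:
    # shell=True with user-controlled input => OS command injection:
    # a path like "x; rm -rf /" runs as shell commands
    subprocess.run(f"wc -l {path}", shell=True)

def line_count_safe(path: str) -> None:
    # argv list, no shell: the path is passed as a literal argument
    subprocess.run(["wc", "-l", path], check=True)

def cache_load_safe(raw: bytes) -> dict:
    # pickle.loads() on untrusted bytes can execute arbitrary code;
    # a data-only format such as JSON carries no code at all
    return json.loads(raw)
```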

A written summary is required (summary field); absence incurs a −0.10 score penalty.


Expected Baseline Scores (gpt-4o)

| Task | Score | Pass? | Notes |
|------|-------|-------|-------|
| `task_1_easy` | ~0.75 | ✅ | GPT-4o reliably spots `ZeroDivisionError` and off-by-one |
| `task_2_medium` | ~0.65 | ✅ | SQL injection found; MD5 usually flagged; perf issues partial |
| `task_3_hard` | ~0.55 | ❌ | Pickle RCE and shell injection found; docs often missed; ~0.55 falls short of the 0.65 threshold |

Setup & Usage

Option A – Docker (recommended)

# Build
docker build -t code-review-env .

# Run (port 7860)
docker run -p 7860:7860 code-review-env

# Test it
curl http://localhost:7860/health

Option B – Local Python

# Install dependencies
pip install -r requirements.txt

# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Open docs
open http://localhost:7860/docs

Run the test suite

pytest tests/ -v
# Expected: 25 passed

Run the baseline agent

export OPENAI_API_KEY=sk-...

# All tasks (direct mode – no server needed)
python baseline_agent.py

# Single task
python baseline_agent.py --task task_2_medium

# Against a running HTTP server
python baseline_agent.py --mode http --base-url http://localhost:7860

API Reference

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | HTML landing page |
| `/health` | GET | Health check |
| `/tasks` | GET | List all task specs |
| `/reset` | POST | Start or restart an episode |
| `/step` | POST | Submit an action |
| `/state` | GET | Get full serialisable state |
| `/docs` | GET | Interactive Swagger UI |

Example: Full episode via curl

# 1. Reset
curl -X POST http://localhost:7860/reset \
  -H 'Content-Type: application/json' \
  -d '{"task_id": "task_1_easy", "session_id": "demo"}'

# 2. Step
curl -X POST http://localhost:7860/step \
  -H 'Content-Type: application/json' \
  -d '{
    "session_id": "demo",
    "action": {
      "comments": [
        {
          "line": 2,
          "category": "bug",
          "severity": "high",
          "message": "divide() will raise ZeroDivisionError when b is 0.",
          "suggestion": "Guard with: if b == 0: raise ValueError"
        }
      ],
      "submit": true
    }
  }'

# 3. Check state
curl "http://localhost:7860/state?session_id=demo"

Project Structure

openenv-code-review/
├── app.py                  # FastAPI HTTP server
├── openenv.yaml            # OpenEnv spec metadata
├── Dockerfile              # Container definition
├── requirements.txt
├── baseline_agent.py       # gpt-4o baseline inference script
│
├── env/
│   ├── models.py           # Pydantic typed models (Observation, Action, Reward, …)
│   └── environment.py      # CodeReviewEnv – step() / reset() / state()
│
├── corpus/
│   └── snippets.py         # Python snippets with ground-truth issues
│
├── graders/
│   └── graders.py          # Task1Grader, Task2Grader, Task3Grader
│
└── tests/
    └── test_env.py         # 25-test pytest suite (all passing)

Deploying to Hugging Face Spaces

  1. Create a new Space with Docker SDK
  2. Push this repository to the Space
  3. Set OPENAI_API_KEY as a Space secret (only needed for baseline script)
  4. The Space will auto-build and expose port 7860

# README.md frontmatter for HF Spaces
---
title: CodeReview OpenEnv
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
  - openenv
  - code-review
  - ai-agent
  - evaluation
---

License

MIT