uvpatel7271's picture
envrionment setup
0695520
|
raw
history blame
4.64 kB
metadata
title: Python Code Review Environment
emoji: snake
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - code-review
  - python

python_code_review_env

python_code_review_env is a production-style OpenEnv environment that simulates a realistic Python code review workflow. An agent inspects broken code, edits it, runs tests, and submits a final solution against deterministic graders for syntax repair, bug fixing, and optimization/refactoring.

Environment design

  • Observation includes task instructions, current code, syntax errors, public test output, action history, and remaining attempts.
  • Action is structured as analyze_code, edit_code, run_tests, or submit_solution.
  • Reward is shaped and non-binary. The environment awards syntax progress, test progress, correctness, and quality improvements while penalizing invalid actions, timeouts, regressions, and unchanged edits.
  • State exposes the internal episode snapshot through /state.

Task set

  1. syntax_fix_invoice_totals (easy) Fix a syntax regression in an invoice normalization helper.
  2. bug_fix_session_windows (medium) Repair a session-collapsing bug using deterministic public and hidden tests.
  3. optimization_rank_active_users (hard) Refactor a slow ranking function and earn additional score from runtime improvement plus AST/style quality.

Action schema

{
  "action_type": "edit_code",
  "code": "def function(...):\n    ..."
}

Supported action_type values:

  • analyze_code
  • edit_code
  • run_tests
  • submit_solution

Observation schema

{
  "task_description": "...",
  "current_code": "...",
  "errors": "...",
  "test_results": "...",
  "history": []
}

The full observation also includes task_id, difficulty, task_kind, visible_tests, attempts_remaining, score, last_action_status, reward, done, and a structured reward_details breakdown.

Deterministic grading

  • Syntax tasks use compile() plus hidden behavioral checks.
  • Bug-fix tasks use deterministic function-call cases that behave like pytest assertions.
  • Optimization tasks combine correctness, runtime benchmarking, and AST/style quality scoring.
  • Infinite loops and long-running solutions are sandboxed with subprocess timeouts and receive penalties.
  • All scores are clamped to [0.0, 1.0].

Run locally

Install dependencies:

pip install .

Start the API server:

uvicorn server.app:app --host 0.0.0.0 --port 8000

Smoke-test the environment:

curl http://localhost:8000/health
curl http://localhost:8000/state

OpenEnv validation:

openenv validate

Docker build

The Docker image no longer depends on ghcr.io/meta-pytorch/openenv-base:latest, which removes the TLS handshake failure from the original build path.

docker build -t python-code-review-env -f server/Dockerfile .
docker run --rm -p 8000:8000 python-code-review-env

Expected health check:

curl http://localhost:8000/health

Hugging Face Spaces deployment

  1. Create a Docker Space.
  2. Push this repository content to the Space.
  3. Ensure port 8000 is exposed.
  4. Wait for the container to build.
  5. Verify /reset and /health return 200.

The image is CPU-friendly and designed for a small Hugging Face Space such as 2 vCPU / 8 GB RAM.

Inference baseline

inference.py uses an OpenAI-compatible client:

client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

Supported providers include:

  • Gemini through an OpenAI-compatible gateway
  • OpenRouter
  • Together AI
  • DeepSeek-compatible OpenAI endpoints

Run it with a free/open provider:

set API_BASE_URL=https://openrouter.ai/api/v1
set API_KEY=...
set MODEL=deepseek/deepseek-chat-v3-0324:free
python inference.py

If no credentials are supplied, the script falls back to a deterministic smoke-test policy that applies the reference fix for each task so the environment can still be validated end to end.

Example output:

Task 1 Score: 1.0
Task 2 Score: 1.0
Task 3 Score: 0.9
Final Score: 1.0

Project structure

python_env/
β”œβ”€β”€ client.py
β”œβ”€β”€ graders/
β”‚   β”œβ”€β”€ bug_fix.py
β”‚   β”œβ”€β”€ dispatch.py
β”‚   β”œβ”€β”€ optimization.py
β”‚   β”œβ”€β”€ shared.py
β”‚   └── syntax.py
β”œβ”€β”€ inference.py
β”œβ”€β”€ models.py
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ README.md
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py
β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”œβ”€β”€ env.py
β”‚   └── python_env_environment.py
└── tasks/
    └── catalog.py