title: Python Code Review Environment
emoji: snake
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
- code-review
- python
python_code_review_env
python_code_review_env is a production-style OpenEnv environment that simulates a realistic Python code review workflow. An agent inspects broken code, edits it, runs tests, and submits a final solution against deterministic graders for syntax repair, bug fixing, and optimization/refactoring.
Environment design
- Observation includes task instructions, current code, syntax errors, public test output, action history, and remaining attempts.
- Action is structured as analyze_code, edit_code, run_tests, or submit_solution.
- Reward is shaped and non-binary. The environment awards syntax progress, test progress, correctness, and quality improvements while penalizing invalid actions, timeouts, regressions, and unchanged edits.
- State exposes the internal episode snapshot through /state.
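To make the idea of a shaped, non-binary reward concrete, here is a minimal sketch. It is not the environment's actual reward function: the StepOutcome fields, the shaped_reward name, and every weight below are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical summary of what one action changed (not the real model)."""
    syntax_ok_before: bool
    syntax_ok_after: bool
    tests_passed_before: int
    tests_passed_after: int
    total_tests: int
    invalid_action: bool = False
    timed_out: bool = False
    code_unchanged: bool = False


def shaped_reward(outcome: StepOutcome) -> float:
    """Combine small bonuses and penalties into one step reward.

    All weights are illustrative assumptions, not values from this repository.
    """
    reward = 0.0
    if outcome.invalid_action:
        reward -= 0.10
    if outcome.timed_out:
        reward -= 0.20
    if outcome.code_unchanged:
        reward -= 0.05
    # Syntax progress: the code compiles now and did not before.
    if outcome.syntax_ok_after and not outcome.syntax_ok_before:
        reward += 0.20
    # Test progress: positive when more tests pass, negative on regressions.
    if outcome.total_tests > 0:
        delta = outcome.tests_passed_after - outcome.tests_passed_before
        reward += 0.50 * delta / outcome.total_tests
    return round(reward, 4)
```

Under these assumed weights, an edit that repairs the syntax and makes two of four public tests pass earns a positive reward, while an edit that breaks previously passing tests earns a negative one.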
Task set
- syntax_fix_invoice_totals (easy): Fix a syntax regression in an invoice normalization helper.
- bug_fix_session_windows (medium): Repair a session-collapsing bug using deterministic public and hidden tests.
- optimization_rank_active_users (hard): Refactor a slow ranking function and earn additional score from runtime improvement plus AST/style quality.
Action schema
{
"action_type": "edit_code",
"code": "def function(...):\n ..."
}
Supported action_type values:
- analyze_code
- edit_code
- run_tests
- submit_solution
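A client can validate an action before sending it. The sketch below builds the JSON payload shown above; build_action is a hypothetical helper, and the rule that only edit_code requires a code field is an assumption, since the schema does not say which other action types accept one.

```python
import json
from typing import Optional

# The four action types listed in this README.
ACTION_TYPES = ("analyze_code", "edit_code", "run_tests", "submit_solution")


def build_action(action_type: str, code: Optional[str] = None) -> dict:
    """Return an action payload, rejecting unknown action types."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unsupported action_type: {action_type!r}")
    payload = {"action_type": action_type}
    # Assumption: edit_code must carry code; other actions may omit it.
    if action_type == "edit_code" and not code:
        raise ValueError("edit_code requires a non-empty 'code' field")
    if code is not None:
        payload["code"] = code
    return payload


if __name__ == "__main__":
    print(json.dumps(build_action("edit_code", "def total(items):\n    return sum(items)\n")))
```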
Observation schema
{
"task_description": "...",
"current_code": "...",
"errors": "...",
"test_results": "...",
"history": []
}
The full observation also includes task_id, difficulty, task_kind, visible_tests, attempts_remaining, score, last_action_status, reward, done, and a structured reward_details breakdown.
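For agents that only need a few fields, the observation can be parsed into a small typed object. This is a sketch, not the model defined in models.py: ObservationSummary and parse_observation are hypothetical names, and the types of attempts_remaining, score, and done are assumed from their names.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class ObservationSummary:
    """Subset of the observation fields named in this README."""
    task_description: str = ""
    current_code: str = ""
    errors: str = ""
    test_results: str = ""
    history: List[Any] = field(default_factory=list)
    attempts_remaining: int = 0  # assumed to be an integer counter
    score: float = 0.0           # assumed to be a float in [0.0, 1.0]
    done: bool = False           # assumed to be a boolean flag


def parse_observation(payload: Dict[str, Any]) -> ObservationSummary:
    """Keep the known keys and ignore everything else in the payload."""
    known = {f.name for f in fields(ObservationSummary)}
    return ObservationSummary(**{k: v for k, v in payload.items() if k in known})
```

Ignoring unknown keys keeps the parser tolerant of additional fields such as reward_details, at the cost of silently dropping them.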
Deterministic grading
- Syntax tasks use compile() plus hidden behavioral checks.
- Bug-fix tasks use deterministic function-call cases that behave like pytest assertions.
- Optimization tasks combine correctness, runtime benchmarking, and AST/style quality scoring.
- Infinite loops and long-running solutions are sandboxed with subprocess timeouts and receive penalties.
- All scores are clamped to [0.0, 1.0].
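Three of the mechanisms named above (a compile() syntax check, a subprocess timeout, and score clamping) can be sketched with the standard library alone. This is an illustration, not the code in graders/: the function names and the default timeout are assumptions.

```python
import subprocess
import sys


def has_valid_syntax(source: str) -> bool:
    """Return True when the source compiles; nothing is executed."""
    try:
        compile(source, "<submission>", "exec")
    except (SyntaxError, ValueError):
        return False
    return True


def run_with_timeout(source: str, timeout_s: float = 2.0) -> str:
    """Run source in a child interpreter and report 'ok', 'error', or 'timeout'."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "error"


def clamp_score(raw: float) -> float:
    """Clamp a raw score into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, raw))
```

A subprocess timeout limits wall-clock time only. It does not restrict memory, file, or network access, so a real sandbox would need additional isolation.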
Run locally
Install dependencies:
pip install .
Start the API server:
uvicorn server.app:app --host 0.0.0.0 --port 8000
Smoke-test the environment:
curl http://localhost:8000/health
curl http://localhost:8000/state
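The same smoke test can be scripted in Python. The sketch below issues GET requests as the curl commands do; endpoint_status is a hypothetical helper, and the fetch parameter exists only so the function can be exercised without a running server.

```python
from typing import Callable, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def endpoint_status(
    base_url: str,
    path: str,
    fetch: Callable = urlopen,
    timeout_s: float = 5.0,
) -> Optional[int]:
    """Return the HTTP status of GET base_url + path, or None if unreachable."""
    url = base_url.rstrip("/") + path
    try:
        with fetch(url, timeout=timeout_s) as response:
            return response.status
    except HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return exc.code
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar problems.
        return None


if __name__ == "__main__":
    for path in ("/health", "/state"):
        print(path, endpoint_status("http://localhost:8000", path))
```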
OpenEnv validation:
openenv validate
Docker build
The Docker image no longer depends on ghcr.io/meta-pytorch/openenv-base:latest, which removes the TLS handshake failure from the original build path.
docker build -t python-code-review-env -f server/Dockerfile .
docker run --rm -p 8000:8000 python-code-review-env
Expected health check:
curl http://localhost:8000/health
Hugging Face Spaces deployment
- Create a Docker Space.
- Push this repository content to the Space.
- Ensure port 8000 is exposed.
- Wait for the container to build.
- Verify /reset and /health return 200.
The image is CPU-friendly and designed for a small Hugging Face Space such as 2 vCPU / 8 GB RAM.
Inference baseline
inference.py uses an OpenAI-compatible client:
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
Supported providers include:
- Gemini through an OpenAI-compatible gateway
- OpenRouter
- Together AI
- DeepSeek-compatible OpenAI endpoints
Run it with a free/open provider:
set API_BASE_URL=https://openrouter.ai/api/v1
set API_KEY=...
set MODEL=deepseek/deepseek-chat-v3-0324:free
python inference.py
If no credentials are supplied, the script falls back to a deterministic smoke-test policy that applies the reference fix for each task so the environment can still be validated end to end.
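The choice between the model-backed policy and the deterministic fallback could be made along these lines. This is a sketch rather than the logic in inference.py: select_policy and its return values are hypothetical, and treating a missing API_BASE_URL or API_KEY as "no credentials" is an assumption.

```python
import os
from typing import Mapping, Optional


def select_policy(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'llm' when credentials are present, else 'reference_fix'."""
    env = os.environ if env is None else env
    # Assumption: both variables must be set and non-empty to call a provider.
    has_credentials = bool(env.get("API_BASE_URL")) and bool(env.get("API_KEY"))
    return "llm" if has_credentials else "reference_fix"


if __name__ == "__main__":
    print("Selected policy:", select_policy())
```

Passing the environment as a mapping keeps the decision easy to test without modifying the real process environment.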
Example output:
Task 1 Score: 1.0
Task 2 Score: 1.0
Task 3 Score: 0.9
Final Score: 1.0
Project structure
python_env/
├── client.py
├── graders/
│   ├── bug_fix.py
│   ├── dispatch.py
│   ├── optimization.py
│   ├── shared.py
│   └── syntax.py
├── inference.py
├── models.py
├── openenv.yaml
├── README.md
├── server/
│   ├── app.py
│   ├── Dockerfile
│   ├── env.py
│   └── python_env_environment.py
└── tasks/
    └── catalog.py