| title: Python Code Review Environment | |
| emoji: snake | |
| colorFrom: yellow | |
| colorTo: blue | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| tags: | |
| - openenv | |
| - code-review | |
| - python | |
| # python_code_review_env | |
| `python_code_review_env` is a production-style OpenEnv environment that simulates a realistic Python code review workflow. An agent inspects broken code, edits it, runs tests, and submits a final solution against deterministic graders for syntax repair, bug fixing, and optimization/refactoring. | |
| ## Environment design | |
| - `Observation` includes task instructions, current code, syntax errors, public test output, action history, and remaining attempts. | |
| - `Action` is structured as `analyze_code`, `edit_code`, `run_tests`, or `submit_solution`. | |
| - `Reward` is shaped and non-binary. The environment awards syntax progress, test progress, correctness, and quality improvements while penalizing invalid actions, timeouts, regressions, and unchanged edits. | |
| - `State` exposes the internal episode snapshot through `/state`. | |
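The shaped reward described above can be illustrated with a minimal sketch. The `StepOutcome` fields, the `shaped_reward` function, and every weight below are hypothetical and chosen only for illustration; the environment's actual reward terms may differ.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical summary of what changed during one environment step."""
    syntax_ok_before: bool
    syntax_ok_after: bool
    tests_passed_before: int
    tests_passed_after: int
    total_tests: int
    invalid_action: bool = False
    timed_out: bool = False
    code_changed: bool = True


def shaped_reward(outcome: StepOutcome) -> float:
    """Combine progress bonuses and penalties into one non-binary reward.

    All weights are illustrative assumptions, not the environment's values.
    """
    reward = 0.0
    # Bonus for moving from broken syntax to code that compiles.
    if outcome.syntax_ok_after and not outcome.syntax_ok_before:
        reward += 0.2
    # Test progress: positive when more tests pass, negative on regressions.
    if outcome.total_tests:
        delta = outcome.tests_passed_after - outcome.tests_passed_before
        reward += 0.5 * delta / outcome.total_tests
    # Penalties for the failure modes listed in the design notes.
    if outcome.invalid_action:
        reward -= 0.1
    if outcome.timed_out:
        reward -= 0.2
    if not outcome.code_changed:
        reward -= 0.05
    return reward
```

Because test progress enters as a signed delta, a regression lowers the reward without needing a separate penalty term.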
| ## Task set | |
| 1. `syntax_fix_invoice_totals` (easy) | |
| Fix a syntax regression in an invoice normalization helper. | |
| 2. `bug_fix_session_windows` (medium) | |
| Repair a session-collapsing bug using deterministic public and hidden tests. | |
| 3. `optimization_rank_active_users` (hard) | |
| Refactor a slow ranking function and earn additional score from runtime improvement plus AST/style quality. | |
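For the hard task, one plausible way to blend correctness, runtime improvement, and AST quality is sketched below. The helper names, the loop-depth heuristic, and the weights are assumptions made for this example and are not taken from `graders/optimization.py`.

```python
import ast


def max_loop_depth(source: str) -> int:
    """Return the deepest nesting of for/while statements in the source.

    Comprehensions are not counted, since they are not ast.For nodes.
    """
    def depth(node: ast.AST, current: int) -> int:
        best = current
        for child in ast.iter_child_nodes(node):
            is_loop = isinstance(child, (ast.For, ast.While, ast.AsyncFor))
            best = max(best, depth(child, current + 1 if is_loop else current))
        return best

    return depth(ast.parse(source), 0)


def optimization_score(correctness: float, speedup: float, loop_depth: int) -> float:
    """Blend three signals into one score in [0.0, 1.0] (hypothetical weights)."""
    # Cap the speedup so an extreme benchmark result cannot dominate the score.
    runtime_part = min(max(speedup, 0.0), 2.0) / 2.0
    # Crude quality proxy: reward solutions that avoid nested loops.
    quality_part = 1.0 if loop_depth <= 1 else 0.5
    score = 0.6 * correctness + 0.25 * runtime_part + 0.15 * quality_part
    return max(0.0, min(1.0, score))
```

Weighting correctness most heavily keeps a fast but wrong refactor from outscoring a slow but correct one.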
| ## Action schema | |
| ```json | |
| { | |
| "action_type": "edit_code", | |
| "code": "def function(...):\n ..." | |
| } | |
| ``` | |
| Supported `action_type` values: | |
| - `analyze_code` | |
| - `edit_code` | |
| - `run_tests` | |
| - `submit_solution` | |
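A client might build and validate action payloads before sending them. The sketch below assumes that only `edit_code` carries a `code` field, which the schema example suggests but does not state; `build_action` is a hypothetical helper, not part of `client.py`.

```python
import json
from typing import Optional

# The four action types listed in the schema above.
ACTION_TYPES = {"analyze_code", "edit_code", "run_tests", "submit_solution"}


def build_action(action_type: str, code: Optional[str] = None) -> str:
    """Serialize an action to JSON, rejecting unsupported action types."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unsupported action_type: {action_type!r}")
    payload = {"action_type": action_type}
    if action_type == "edit_code":
        # Assumption: an edit without replacement code is never meaningful.
        if not code:
            raise ValueError("edit_code requires non-empty code")
        payload["code"] = code
    return json.dumps(payload)
```

Validating on the client side avoids spending an attempt on an action the environment would penalize as invalid.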
| ## Observation schema | |
| ```json | |
| { | |
| "task_description": "...", | |
| "current_code": "...", | |
| "errors": "...", | |
| "test_results": "...", | |
| "history": [] | |
| } | |
| ``` | |
| The full observation also includes `task_id`, `difficulty`, `task_kind`, `visible_tests`, `attempts_remaining`, `score`, `last_action_status`, `reward`, `done`, and a structured `reward_details` breakdown. | |
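An agent that only needs the five core fields could parse an observation along these lines. `ObservationSummary` and `parse_observation` are hypothetical names for this sketch; the real models live in `models.py` and may be shaped differently.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List

# The five fields shown in the schema example above.
CORE_FIELDS = ("task_description", "current_code", "errors", "test_results", "history")


@dataclass
class ObservationSummary:
    """Core observation fields, with safe defaults for anything missing."""
    task_description: str = ""
    current_code: str = ""
    errors: str = ""
    test_results: str = ""
    history: List[Any] = field(default_factory=list)


def parse_observation(raw: str) -> ObservationSummary:
    """Parse an observation JSON string, ignoring fields outside the core set."""
    data = json.loads(raw)
    return ObservationSummary(**{k: data[k] for k in CORE_FIELDS if k in data})
```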
| ## Deterministic grading | |
| - Syntax tasks use `compile()` plus hidden behavioral checks. | |
| - Bug-fix tasks use deterministic function-call cases that behave like pytest assertions. | |
| - Optimization tasks combine correctness, runtime benchmarking, and AST/style quality scoring. | |
| - Candidate code runs in a subprocess with a timeout, so infinite loops and long-running solutions are terminated and penalized. | |
| - All scores are clamped to `[0.0, 1.0]`. | |
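The grading building blocks listed above (a `compile()` check, a subprocess timeout, and score clamping) can be sketched with the standard library. The function names and the default timeout are assumptions for illustration and do not mirror the code in `graders/`.

```python
import subprocess
import sys
from typing import Tuple


def syntax_ok(source: str) -> bool:
    """Return True if the source compiles as a Python module."""
    try:
        compile(source, "<submission>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def run_with_timeout(source: str, timeout_s: float = 2.0) -> Tuple[bool, str]:
    """Run source in a child interpreter; report failure if it exceeds the timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return False, "timeout"
    return proc.returncode == 0, proc.stdout


def clamp_score(score: float) -> float:
    """Clamp a raw score into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, score))
```

Running candidates in a separate interpreter means an infinite loop cannot hang the grader itself.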
| ## Run locally | |
| Install dependencies: | |
| ```bash | |
| pip install . | |
| ``` | |
| Start the API server: | |
| ```bash | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| Smoke-test the environment: | |
| ```bash | |
| curl http://localhost:8000/health | |
| curl http://localhost:8000/state | |
| ``` | |
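The same smoke test can be scripted in Python using only the standard library. This sketch assumes the server is listening on `localhost:8000` as started above; `smoke_test` and its injectable `fetch` parameter are hypothetical conveniences, not part of the repository.

```python
import urllib.request
from typing import Callable, Dict, Optional, Sequence


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET request (urlopen raises on 4xx/5xx)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def smoke_test(
    base_url: str = "http://localhost:8000",
    paths: Sequence[str] = ("/health", "/state"),
    fetch: Optional[Callable[[str], int]] = None,
) -> Dict[str, int]:
    """Request each path and return a mapping of path to HTTP status."""
    fetch = fetch or fetch_status
    return {path: fetch(base_url.rstrip("/") + path) for path in paths}


if __name__ == "__main__":
    # Requires the API server from the previous step to be running.
    for path, status in smoke_test().items():
        print(path, status)
```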
| OpenEnv validation: | |
| ```bash | |
| openenv validate | |
| ``` | |
| ## Docker build | |
| The Docker image no longer depends on `ghcr.io/meta-pytorch/openenv-base:latest`, which removes the TLS handshake failure from the original build path. | |
| ```bash | |
| docker build -t python-code-review-env -f server/Dockerfile . | |
| docker run --rm -p 8000:8000 python-code-review-env | |
| ``` | |
| Expected health check: | |
| ```bash | |
| curl http://localhost:8000/health | |
| ``` | |
| ## Hugging Face Spaces deployment | |
| 1. Create a Docker Space. | |
| 2. Push this repository content to the Space. | |
| 3. Ensure port `8000` is exposed. | |
| 4. Wait for the container to build. | |
| 5. Verify `/reset` and `/health` return `200`. | |
| The image is CPU-friendly and designed for a small Hugging Face Space such as `2 vCPU / 8 GB RAM`. | |
| ## Inference baseline | |
| `inference.py` uses an OpenAI-compatible client: | |
| ```python | |
| client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) | |
| ``` | |
| Supported providers include: | |
| - Gemini through an OpenAI-compatible gateway | |
| - OpenRouter | |
| - Together AI | |
| - DeepSeek-compatible OpenAI endpoints | |
| Run it with a free/open provider: | |
| ```bash | |
| export API_BASE_URL=https://openrouter.ai/api/v1 | |
| export API_KEY=... | |
| export MODEL=deepseek/deepseek-chat-v3-0324:free | |
| python inference.py | |
| ``` | |
| If no credentials are supplied, the script falls back to a deterministic smoke-test policy that applies the reference fix for each task so the environment can still be validated end to end. | |
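The credential check behind that fallback could look roughly like the sketch below. It reads the three environment variables shown above; treating a missing `MODEL` as "no credentials" is an assumption, and `load_config` and `choose_policy` are hypothetical names rather than functions from `inference.py`.

```python
import os
from typing import Dict, Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Read provider settings from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("API_BASE_URL"),
        "api_key": env.get("API_KEY"),
        "model": env.get("MODEL"),
    }


def choose_policy(config: Mapping[str, Optional[str]]) -> str:
    """Pick the LLM policy only when every setting is present and non-empty."""
    if all(config.get(key) for key in ("base_url", "api_key", "model")):
        return "llm"
    # Without credentials, fall back to the deterministic reference-fix policy.
    return "reference_fix"
```

Passing a plain mapping instead of `os.environ` makes the selection logic easy to exercise without touching the real environment.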
| Example output: | |
| ```text | |
| Task 1 Score: 1.0 | |
| Task 2 Score: 1.0 | |
| Task 3 Score: 0.9 | |
| Final Score: 1.0 | |
| ``` | |
| ## Project structure | |
| ```text | |
| python_env/ | |
| ├── client.py | |
| ├── graders/ | |
| │   ├── bug_fix.py | |
| │   ├── dispatch.py | |
| │   ├── optimization.py | |
| │   ├── shared.py | |
| │   └── syntax.py | |
| ├── inference.py | |
| ├── models.py | |
| ├── openenv.yaml | |
| ├── README.md | |
| ├── server/ | |
| │   ├── app.py | |
| │   ├── Dockerfile | |
| │   ├── env.py | |
| │   └── python_env_environment.py | |
| └── tasks/ | |
|     └── catalog.py | |
| ``` | |