---
title: Python Code Review Environment
emoji: snake
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
- code-review
- python
---
# python_code_review_env
`python_code_review_env` is a production-style OpenEnv environment that simulates a realistic Python code review workflow. An agent inspects broken code, edits it, runs tests, and submits a final solution against deterministic graders for syntax repair, bug fixing, and optimization/refactoring.
## Environment design
- `Observation` includes task instructions, current code, syntax errors, public test output, action history, and remaining attempts.
- `Action` is structured as `analyze_code`, `edit_code`, `run_tests`, or `submit_solution`.
- `Reward` is shaped and non-binary. The environment awards syntax progress, test progress, correctness, and quality improvements while penalizing invalid actions, timeouts, regressions, and unchanged edits.
- `State` exposes the internal episode snapshot through `/state`.
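To make the shaped, non-binary reward concrete, here is a minimal sketch of how such a signal could be assembled from the components listed above. The `StepOutcome` fields, the `shaped_reward` function, and every weight are hypothetical and chosen only for illustration; the environment's real reward logic may weigh these terms differently.

```python
from dataclasses import dataclass

# Hypothetical weights, for illustration only (not the environment's real values).
SYNTAX_BONUS = 0.2
TEST_PROGRESS_WEIGHT = 0.5
INVALID_ACTION_PENALTY = 0.1
TIMEOUT_PENALTY = 0.2
UNCHANGED_EDIT_PENALTY = 0.05


@dataclass
class StepOutcome:
    """Hypothetical summary of what a single action changed."""
    syntax_fixed: bool = False
    tests_passed_before: int = 0
    tests_passed_after: int = 0
    total_tests: int = 1
    invalid_action: bool = False
    timed_out: bool = False
    code_unchanged: bool = False


def shaped_reward(outcome: StepOutcome) -> float:
    """Combine progress bonuses and penalties into one step reward."""
    reward = 0.0
    if outcome.syntax_fixed:
        reward += SYNTAX_BONUS
    # Test progress is signed, so a regression lowers the reward.
    delta = outcome.tests_passed_after - outcome.tests_passed_before
    reward += TEST_PROGRESS_WEIGHT * delta / max(outcome.total_tests, 1)
    if outcome.invalid_action:
        reward -= INVALID_ACTION_PENALTY
    if outcome.timed_out:
        reward -= TIMEOUT_PENALTY
    if outcome.code_unchanged:
        reward -= UNCHANGED_EDIT_PENALTY
    return round(reward, 4)
```

Because the test-progress term is signed, an edit that breaks previously passing tests is penalized without needing a separate regression rule.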
## Task set
1. `syntax_fix_invoice_totals` (easy)
Fix a syntax regression in an invoice normalization helper.
2. `bug_fix_session_windows` (medium)
Repair a session-collapsing bug using deterministic public and hidden tests.
3. `optimization_rank_active_users` (hard)
Refactor a slow ranking function and earn additional score from runtime improvement plus AST/style quality.
## Action schema
```json
{
  "action_type": "edit_code",
  "code": "def function(...):\n ..."
}
```
Supported `action_type` values:
- `analyze_code`
- `edit_code`
- `run_tests`
- `submit_solution`
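A client can validate an action before sending it. The sketch below builds the JSON payload shown above; the `ReviewAction` class is a hypothetical helper, and the rule that only `edit_code` requires a `code` field is an assumption rather than documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional

SUPPORTED_ACTION_TYPES = ("analyze_code", "edit_code", "run_tests", "submit_solution")


@dataclass
class ReviewAction:
    """Hypothetical client-side wrapper around the action schema."""
    action_type: str
    code: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in SUPPORTED_ACTION_TYPES:
            raise ValueError(f"unsupported action_type: {self.action_type!r}")
        # Assumption: only edit_code must carry replacement source code.
        if self.action_type == "edit_code" and not self.code:
            raise ValueError("edit_code requires a non-empty 'code' field")

    def to_payload(self) -> dict:
        """Return a dict matching the documented JSON shape."""
        payload = {"action_type": self.action_type}
        if self.code is not None:
            payload["code"] = self.code
        return payload


if __name__ == "__main__":
    action = ReviewAction("edit_code", "def total(items):\n    return sum(items)\n")
    print(json.dumps(action.to_payload(), indent=2))
```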
## Observation schema
```json
{
  "task_description": "...",
  "current_code": "...",
  "errors": "...",
  "test_results": "...",
  "history": []
}
```
The full observation also includes `task_id`, `difficulty`, `task_kind`, `visible_tests`, `attempts_remaining`, `score`, `last_action_status`, `reward`, `done`, and a structured `reward_details` breakdown.
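When debugging an agent loop, it can help to condense each observation into one log line. The following is a minimal sketch that reads a few of the fields named above from a decoded JSON dict; `summarize_observation` is a hypothetical helper, and the field types (numeric `score`, boolean `done`, string `errors`) are assumptions.

```python
from typing import Any, Mapping


def summarize_observation(obs: Mapping[str, Any]) -> str:
    """Condense an observation dict into a single human-readable line."""
    task = obs.get("task_id", "<unknown task>")
    difficulty = obs.get("difficulty", "?")
    attempts = obs.get("attempts_remaining", "?")
    # Assumption: score is numeric; a missing or null score is shown as 0.00.
    score = float(obs.get("score") or 0.0)
    has_errors = "yes" if obs.get("errors") else "no"
    status = "done" if obs.get("done") else "in progress"
    return (
        f"{task} [{difficulty}] score={score:.2f} "
        f"attempts_left={attempts} errors={has_errors} ({status})"
    )
```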
## Deterministic grading
- Syntax tasks use `compile()` plus hidden behavioral checks.
- Bug-fix tasks use deterministic function-call cases that behave like pytest assertions.
- Optimization tasks combine correctness, runtime benchmarking, and AST/style quality scoring.
- Infinite loops and long-running solutions are sandboxed with subprocess timeouts and receive penalties.
- All scores are clamped to `[0.0, 1.0]`.
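The building blocks named above (a `compile()` syntax check, a subprocess timeout, and score clamping) can be sketched with the standard library alone. The function names and the default timeout are hypothetical; the actual graders may be organized differently and apply stricter isolation than a plain subprocess.

```python
import subprocess
import sys


def syntax_ok(source: str) -> bool:
    """Return True if the source compiles; nothing is executed."""
    try:
        compile(source, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        return False
    return True


def run_with_timeout(source: str, timeout_s: float = 2.0) -> str:
    """Run source in a child interpreter and classify the result."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "error"


def clamp_score(raw: float) -> float:
    """Clamp a raw score into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, raw))
```

Running candidates in a separate interpreter means an infinite loop costs one timeout rather than hanging the server process.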
## Run locally
Install dependencies:
```bash
pip install .
```
Start the API server:
```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
Smoke-test the environment:
```bash
curl http://localhost:8000/health
curl http://localhost:8000/state
```
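The same smoke test can be scripted in Python, which is convenient in CI. This sketch uses only the standard library; `build_url` and `endpoint_status` are hypothetical helpers, and the base URL assumes the server was started on port 8000 as shown above.

```python
import urllib.error
import urllib.request
from typing import Optional


def build_url(base_url: str, path: str) -> str:
    """Join a base URL and a path without doubling the slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def endpoint_status(base_url: str, path: str, timeout_s: float = 5.0) -> Optional[int]:
    """Return the HTTP status for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(build_url(base_url, path), timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered with a non-2xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, or timeout


if __name__ == "__main__":
    for path in ("/health", "/state"):
        print(path, endpoint_status("http://localhost:8000", path))
```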
OpenEnv validation:
```bash
openenv validate
```
## Docker build
The Docker image no longer depends on `ghcr.io/meta-pytorch/openenv-base:latest`, which removes the TLS handshake failure from the original build path.
```bash
docker build -t python-code-review-env -f server/Dockerfile .
docker run --rm -p 8000:8000 python-code-review-env
```
Expected health check:
```bash
curl http://localhost:8000/health
```
## Hugging Face Spaces deployment
1. Create a Docker Space.
2. Push this repository content to the Space.
3. Ensure port `8000` is exposed.
4. Wait for the container to build.
5. Verify `/reset` and `/health` return `200`.
The image is CPU-friendly and designed for a small Hugging Face Space such as `2 vCPU / 8 GB RAM`.
## Inference baseline
`inference.py` uses an OpenAI-compatible client:
```python
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
```
Supported providers include:
- Gemini through an OpenAI-compatible gateway
- OpenRouter
- Together AI
- DeepSeek-compatible OpenAI endpoints
Run it with a free/open provider:
```bash
# bash/zsh syntax; in Windows cmd, use `set NAME=value` instead of `export`
export API_BASE_URL=https://openrouter.ai/api/v1
export API_KEY=...
export MODEL=deepseek/deepseek-chat-v3-0324:free
python inference.py
```
If no credentials are supplied, the script falls back to a deterministic smoke-test policy that applies the reference fix for each task so the environment can still be validated end to end.
Example output:
```text
Task 1 Score: 1.0
Task 2 Score: 1.0
Task 3 Score: 0.9
Final Score: 1.0
```
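The credential fallback described above can be pictured as a small decision made at startup. The sketch below is an assumption about how that choice might look, not the code in `inference.py`: `ProviderConfig`, `load_provider_config`, and `choose_policy` are hypothetical names, and treating all three variables as required is also an assumption.

```python
import os
from typing import Mapping, NamedTuple, Optional


class ProviderConfig(NamedTuple):
    """Hypothetical container for the three variables used above."""
    base_url: str
    api_key: str
    model: str


def load_provider_config(env: Mapping[str, str] = os.environ) -> Optional[ProviderConfig]:
    """Return a config when all variables are set, otherwise None."""
    base_url = env.get("API_BASE_URL", "").strip()
    api_key = env.get("API_KEY", "").strip()
    model = env.get("MODEL", "").strip()
    # Assumption: a missing value in any of the three triggers the fallback.
    if not (base_url and api_key and model):
        return None
    return ProviderConfig(base_url, api_key, model)


def choose_policy(env: Mapping[str, str] = os.environ) -> str:
    """Pick the LLM policy when credentials exist, else the reference fix."""
    return "llm" if load_provider_config(env) is not None else "reference_fix"
```

Passing the environment as a mapping keeps the decision easy to unit-test without mutating `os.environ`.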
## Project structure
```text
python_env/
├── client.py
├── graders/
│   ├── bug_fix.py
│   ├── dispatch.py
│   ├── optimization.py
│   ├── shared.py
│   └── syntax.py
├── inference.py
├── models.py
├── openenv.yaml
├── README.md
├── server/
│   ├── app.py
│   ├── Dockerfile
│   ├── env.py
│   └── python_env_environment.py
└── tasks/
    └── catalog.py
```