---
title: Python Code Review Environment
emoji: 🐍
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - code-review
  - python
---

# python_code_review_env

`python_code_review_env` is a production-style OpenEnv environment that simulates a realistic Python code review workflow. An agent inspects broken code, edits it, runs tests, and submits a final solution against deterministic graders for syntax repair, bug fixing, and optimization/refactoring.

## Environment design

- `Observation` includes task instructions, current code, syntax errors, public test output, action history, and remaining attempts.
- `Action` is structured as `analyze_code`, `edit_code`, `run_tests`, or `submit_solution`.
- `Reward` is shaped and non-binary. The environment rewards syntax progress, test progress, correctness, and quality improvements, and penalizes invalid actions, timeouts, regressions, and unchanged edits.
- `State` exposes the internal episode snapshot through `/state`.
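
The shaped reward described above can be sketched as a sum of small progress bonuses and penalties. This is only an illustration: `StepSignals`, `shaped_reward`, and every weight below are hypothetical and are not taken from the environment's implementation.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    """Hypothetical per-step signals; field names are illustrative only."""
    syntax_ok_before: bool
    syntax_ok_after: bool
    tests_passed_before: int
    tests_passed_after: int
    total_tests: int
    invalid_action: bool = False
    timed_out: bool = False
    code_unchanged: bool = False


def shaped_reward(s: StepSignals) -> float:
    """Combine progress bonuses and penalties into one non-binary reward."""
    reward = 0.0
    # Bonus when the code goes from not compiling to compiling.
    if s.syntax_ok_after and not s.syntax_ok_before:
        reward += 0.2  # assumed weight
    # Test progress is proportional to newly passing tests.
    # A regression (fewer passing tests) makes this term negative.
    if s.total_tests:
        delta = s.tests_passed_after - s.tests_passed_before
        reward += 0.5 * delta / s.total_tests  # assumed weight
    # Penalties for the failure modes listed above (assumed magnitudes).
    if s.invalid_action:
        reward -= 0.1
    if s.timed_out:
        reward -= 0.2
    if s.code_unchanged:
        reward -= 0.05
    return reward
```

With these assumed weights, a step that repairs the syntax and moves two of four tests from failing to passing earns a positive reward, while a step that breaks a passing test earns a negative one.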

## Task set

1. `syntax_fix_invoice_totals` (easy)
   Fix a syntax regression in an invoice normalization helper.
2. `bug_fix_session_windows` (medium)
   Repair a session-collapsing bug using deterministic public and hidden tests.
3. `optimization_rank_active_users` (hard)
   Refactor a slow ranking function and earn additional score from runtime improvement plus AST/style quality.

## Action schema

```json
{
  "action_type": "edit_code",
  "code": "def function(...):\n    ..."
}
```

Supported `action_type` values:

- `analyze_code`
- `edit_code`
- `run_tests`
- `submit_solution`
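
As a minimal sketch, a client could build and validate action payloads before sending them. `build_action` is a hypothetical helper, and the rule that only `edit_code` carries a `code` field is an assumption based on the example above, not a confirmed server contract.

```python
import json

# The four action types listed in this README.
ACTION_TYPES = {"analyze_code", "edit_code", "run_tests", "submit_solution"}


def build_action(action_type, code=None):
    """Return a JSON-serializable action dict; reject unknown action types."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unsupported action_type: {action_type!r}")
    payload = {"action_type": action_type}
    # Assumption: only edit_code needs a `code` field.
    if action_type == "edit_code":
        if not code:
            raise ValueError("edit_code requires non-empty code")
        payload["code"] = code
    return payload


edit_action = build_action(
    "edit_code", code="def total(items):\n    return sum(items)\n"
)
print(json.dumps(edit_action))
```

Validating on the client side keeps an agent from spending an attempt on an action the environment would reject as invalid.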

## Observation schema

```json
{
  "task_description": "...",
  "current_code": "...",
  "errors": "...",
  "test_results": "...",
  "history": []
}
```

The full observation also includes `task_id`, `difficulty`, `task_kind`, `visible_tests`, `attempts_remaining`, `score`, `last_action_status`, `reward`, `done`, and a structured `reward_details` breakdown.
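
One way to consume an observation on the client side is to pick out the five core fields and ignore the rest. The `ObservationView` class below is a hypothetical convenience wrapper, not the environment's own model; the field types and defaults are assumptions.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ObservationView:
    """Client-side view of the five core observation fields (assumed types)."""
    task_description: str = ""
    current_code: str = ""
    errors: str = ""
    test_results: str = ""
    history: list = field(default_factory=list)

    @classmethod
    def from_payload(cls, payload):
        """Build a view from a decoded JSON dict, dropping unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})


obs = ObservationView.from_payload(
    {"task_description": "Fix the helper.", "errors": "", "score": 0.0}
)
print(obs.task_description)
```

Dropping unknown keys means the extra fields listed above, such as `score` or `reward_details`, do not break the parser.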

## Deterministic grading

- Syntax tasks use `compile()` plus hidden behavioral checks.
- Bug-fix tasks use deterministic function-call cases that behave like pytest assertions.
- Optimization tasks combine correctness, runtime benchmarking, and AST/style quality scoring.
- Infinite loops and long-running solutions are sandboxed with subprocess timeouts and receive penalties.
- All scores are clamped to `[0.0, 1.0]`.
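
The building blocks named above (a `compile()` syntax check, a subprocess timeout, and score clamping) can be sketched with the standard library alone. The function names and the 2-second default are hypothetical; the actual graders may be structured differently.

```python
import subprocess
import sys


def syntax_error_message(source):
    """Return None if the source compiles, else a short error description."""
    try:
        compile(source, "<submission>", "exec")
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def run_with_timeout(source, timeout_s=2.0):
    """Run source in a fresh interpreter; return (finished, returncode)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so an
        # infinite loop cannot outlive the grader.
        return False, None
    return True, proc.returncode


def clamp_score(value):
    """Clamp a raw score into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, value))
```

Running the submission in a separate interpreter process, rather than calling `exec` in the grader itself, is what lets a timeout stop a runaway solution.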

## Run locally

Install dependencies:

```bash
pip install .
```

Start the API server:

```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

Smoke-test the environment:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/state
```
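
The same smoke test can be scripted in Python with only the standard library. `check_endpoints` is a hypothetical helper, not part of this repository; it issues plain `GET` requests and reports the status code for each path.

```python
import urllib.error
import urllib.request


def check_endpoints(base_url, paths=("/health", "/state"), timeout_s=5.0):
    """Return {path: HTTP status, or None if the server is unreachable}."""
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status.
            results[path] = exc.code
        except (urllib.error.URLError, OSError):
            # Connection refused, DNS failure, or timeout.
            results[path] = None
    return results
```

Calling `check_endpoints("http://localhost:8000")` while the server is running should report a status for both paths. A `None` value means the server could not be reached at all.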

OpenEnv validation:

```bash
openenv validate
```

## Docker build

The Docker image no longer depends on `ghcr.io/meta-pytorch/openenv-base:latest`, which removes the TLS handshake failure from the original build path.

```bash
docker build -t python-code-review-env -f server/Dockerfile .
docker run --rm -p 8000:8000 python-code-review-env
```

Check that the container is healthy:

```bash
curl http://localhost:8000/health
```

## Hugging Face Spaces deployment

1. Create a Docker Space.
2. Push this repository content to the Space.
3. Ensure port `8000` is exposed.
4. Wait for the container to build.
5. Verify `/reset` and `/health` return `200`.

The image is CPU-friendly and designed for a small Hugging Face Space such as `2 vCPU / 8 GB RAM`.

## Inference baseline

`inference.py` uses an OpenAI-compatible client:

```python
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
```

Supported providers include:

- Gemini through an OpenAI-compatible gateway
- OpenRouter
- Together AI
- OpenAI-compatible DeepSeek endpoints

Run it with a free/open provider:

```bash
export API_BASE_URL=https://openrouter.ai/api/v1
export API_KEY=...
export MODEL=deepseek/deepseek-chat-v3-0324:free
python inference.py
```

If no credentials are supplied, the script falls back to a deterministic smoke-test policy. That policy applies the reference fix for each task, so the environment can still be validated end to end.

Example output:

```text
Task 1 Score: 1.0
Task 2 Score: 1.0
Task 3 Score: 0.9
Final Score: 1.0
```

## Project structure

```text
python_env/
├── client.py
├── graders/
│   ├── bug_fix.py
│   ├── dispatch.py
│   ├── optimization.py
│   ├── shared.py
│   └── syntax.py
├── inference.py
├── models.py
├── openenv.yaml
├── README.md
├── server/
│   ├── app.py
│   ├── Dockerfile
│   ├── env.py
│   └── python_env_environment.py
└── tasks/
    └── catalog.py
```