---
title: LeanMigrate OpenEnv Environment
emoji: 🧠
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# LeanMigrate

LeanMigrate is a code-migration environment where an AI agent migrates real business logic one function at a time. **Lean 4 is the verifier.** The environment parses each submission with Python `ast` (Python targets) or tree-sitter (TypeScript targets) or a Rust/cargo harness (Rust targets), builds an internal verification IR in the `Candidate` namespace, and checks it against the formal Lean specification using `native_decide`-powered sample theorems.

---

## Tasks

Eight tasks across four language pairs and three difficulty levels:

| Task                  | Migration                    | Difficulty | Functions | Max Steps | Proof? |
| --------------------- | ---------------------------- | ---------- | --------- | --------- | ------ |
| `rbac_auth`           | Python → TypeScript          | Easy       | 3         | 15        | No     |
| `pricing_engine`      | JavaScript → Python          | Medium     | 5         | 25        | No     |
| `path_canonicalizer`  | C → Rust                     | Medium     | 4         | 25        | Yes    |
| `expression_eval`     | C → Rust                     | Medium     | 3         | 25        | Yes    |
| `payment_saga`        | JavaScript → Python          | Hard       | 4         | 40        | Yes    |
| `lru_cache`           | C++ → Rust                   | Hard       | 4         | 40        | Yes    |
| `shortest_path`       | C++ → Rust                   | Hard       | 3         | 40        | Yes    |
| `interval_scheduler`  | C++ → Rust                   | Hard       | 3         | 40        | Yes    |

Proof obligations are required for `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, and `interval_scheduler`. These tasks require a formal Lean proof in addition to working code.

---

## Environment

The environment exposes the standard OpenEnv-style server and client interfaces.

### Task Selection

Run one task per episode. Choose the task in one of two ways:

* Set `TASK_ID` before launching the environment or running inference.
* Pass `task_id` as a keyword argument when resetting the environment: `env.reset(task_id="pricing_engine")`.

Valid task IDs: `rbac_auth`, `pricing_engine`, `payment_saga`, `path_canonicalizer`, `expression_eval`, `lru_cache`, `shortest_path`, `interval_scheduler`.

### Actions

`LeanMigrateAction` is a typed Pydantic action model with four variants:

| Action         | Required fields                        | Effect                                         |
| -------------- | -------------------------------------- | ---------------------------------------------- |
| `inspect`      | `function_name`                        | Returns source fragment + Lean spec for one fn |
| `analyze_deps` | `function_name`                        | Returns dependency graph and migration order   |
| `run_tests`    | `function_name`, `candidate_code`      | Executes candidate against task sample cases   |
| `submit`       | `function_name`, `target_code` or `lean_proof` | Runs IR + Lean verification pipeline   |

**Field roles:**

* `candidate_code` — working implementation to test locally before submission (`run_tests` only).
* `target_code` — final implementation submitted for Lean verification.
* `lean_proof` — required only for proof tasks (e.g. `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, `interval_scheduler`).

### Observation

`LeanMigrateObservation` contains:

* Episode metadata (`episode_id`, `task_id`, `episode_step`, `max_steps`)
* Task metadata (`source_language`, `target_language`, `source_files`)
* Verification progress (`verified`, `remaining`, `failing`, `progress`)
* Feedback from the previous step (`last_action_type`, `last_action_feedback`, `last_step_reward`)
* Rich grading breakdown (`reward_details`) with `tests_passed`, `tests_total`, `proof_compiled`, `lean_error`

### Rewards

Rewards are shaped to give signal across the full trajectory:

| Action         | Reward                                                             |
| -------------- | ------------------------------------------------------------------ |
| `inspect`      | +0.05 shaping reward                                               |
| `analyze_deps` | +0.05 shaping reward                                               |
| `run_tests`    | +0.10 on all cases passing; −0.01 × failures on partial/full fail  |
| `submit` pass  | +1.0 / total\_function\_count                                      |
| `submit` fail  | −0.05 on IR rejection or Lean rejection                            |

Episode score (`progress`) is `verified_count / total_function_count`, clamped to `[0.01, 0.99]` for validator compatibility. Step rewards in the stdout log are clamped to `[0.0, 1.0]`.

---

## Setup

```bash
uv sync

# Build Lean 4 artifacts (required before running tests)
cd lean && lake build

# Run all tests
uv run pytest -q
```

**RBAC TypeScript tests** require a Node runtime:

```bash
npm install -g tsx
uv run pytest tests/test_env_grader.py -q
```

**Rust task tests** require the Rust toolchain:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

---

## Inference Script

Run the structured inference script against any task:

```bash
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=hf_... \
TASK_ID=rbac_auth \
uv run python inference.py
```

`inference.py` uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required `[START]` / `[STEP]` / `[END]` format:

```
[START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct
[STEP]  step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null
[STEP]  step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null
[STEP]  step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null
...
[END]   success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,...
```

When `TASK_ID` is unset, inference runs all 8 tasks sequentially.

## LLM Playthrough

Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output:

```bash
uv run python scripts/play_task.py play --all --trace-dir traces-v2
```

`scripts/play_task.py` is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use `--quiet` when you only want the per-task summary.

---

## Baseline

Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass `--tasks` to cover the full 8-task suite:

```bash
uv run python baseline/run_baseline.py --tasks \
  rbac_auth pricing_engine payment_saga \
  path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler
```

| Task                  | Score   |
| --------------------- | ------- |
| `rbac_auth`           | `0.990` |
| `pricing_engine`      | `0.990` |
| `payment_saga`        | `0.990` |
| `path_canonicalizer`  | `0.990` |
| `expression_eval`     | `0.990` |
| `lru_cache`           | `0.990` |
| `shortest_path`       | `0.990` |
| `interval_scheduler`  | `0.990` |
| **overall**           | `0.990` |

Scores clamp at 0.990 by design (the open-unit clamp keeps scores below 1.0 for validator compatibility).

---

## Full Playthrough

Step-by-step replay for all tasks using bundled canonical solutions:

```bash
uv run python scripts/play_all_tasks.py
```

Export the Lean business-logic mirror for a function:

```bash
uv run python scripts/export_lean_business_logic.py \
  --task-id pricing_engine \
  --function-name finalPrice \
  --source-file tasks/pricing/source.js
```

---

## Docker Build

```bash
uv run openenv build
docker run --rm -p 8000:8000 openenv-lean_migrate:latest
```

The Docker image:
* Builds Lean 4 artifacts with `lake build`
* Installs the Python package with `uv sync`
* Includes `nodejs` + `tsx` for TypeScript task tests
* Includes the Rust toolchain for Rust task tests

---

## OpenEnv Push

```bash
openenv push
```

OpenEnv uses `lean_migrate.server.app:app` as the FastAPI entry point and `lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment` as the environment implementation.

---

## Architecture

### Verification Pipeline

When an agent submits code, it goes through a two-stage pipeline:

1. **Verification IR** (`env/verification_ir.py`): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in the `Candidate` namespace with `native_decide`-powered sample checks.

2. **Lean Backend** (`lean_backend/`): The generated Lean code is fed to a backend that compiles it against the formal spec.
   - `StdinBackend` (default): Invokes `lean --stdin`, requires `cd lean && lake build` first.
   - `KiminaBackend`: HTTP backend at `KIMINA_URL` (default `http://localhost:12332`). Select with `LEAN_BACKEND=kimina`.
    - `AeneasBackend`: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select with `LEAN_BACKEND=aeneas`. It is optional and not part of the default grading path.

### Key Data Flow for `submit`

```
SubmitAction.target_code
  → build_verification_ir()           # env/verification_ir.py
      → _parse_provenance()           # validates syntax, extracts arity
      → run_candidate_tests()         # env/grader.py — runs via subprocess
      → _render_lean_mirror()         # generates Lean code in Candidate namespace
  → backend.verify()                  # lean_backend/
      → Lean compiler output
  → EpisodeState updates verified/failing sets
```

### Project Layout

```
lean/           Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...)
env/            Task registry, state machine, grader, verification IR
lean_backend/   Lean verification backends (stdin, kimina, aeneas)
tasks/          Source bundles shown to the agent (rbac, pricing, saga, path, ...)
baseline/       Deterministic baseline runner
server/         FastAPI app and OpenEnv environment wrapper
tests/          Smoke tests and Lean backend benchmark
scripts/        Helper scripts (playthrough, export, validation)
```

### Environment Variables

| Variable                               | Purpose                           | Default                        |
| -------------------------------------- | --------------------------------- | ------------------------------ |
| `TASK_ID`                              | Which task to run                 | `rbac_auth`                    |
| `LEAN_BACKEND`                         | `stdin`, `kimina`, or `aeneas`    | `stdin`                        |
| `KIMINA_URL`                           | Kimina server URL                 | `http://localhost:12332`       |
| `LEAN_BIN` / `LEAN_CWD` / `LEAN_PATH` | Lean binary paths                 | elan defaults                  |
| `API_BASE_URL`                         | LLM API endpoint for inference.py | `https://router.huggingface.co/v1` |
| `MODEL_NAME`                           | LLM model for inference.py        | `Qwen/Qwen2.5-72B-Instruct`   |
| `HF_TOKEN`                             | Auth token for inference.py       | —                              |