--- title: LeanMigrate OpenEnv Environment emoji: 🧠 colorFrom: green colorTo: pink sdk: docker pinned: false app_port: 8000 base_path: /web tags: - openenv --- # LeanMigrate LeanMigrate is a code-migration environment where an AI agent migrates real business logic one function at a time. **Lean 4 is the verifier.** The environment parses each submission with Python `ast` (Python targets) or tree-sitter (TypeScript targets) or a Rust/cargo harness (Rust targets), builds an internal verification IR in the `Candidate` namespace, and checks it against the formal Lean specification using `native_decide`-powered sample theorems. --- ## Tasks Eight tasks across four language pairs and three difficulty levels: | Task | Migration | Difficulty | Functions | Max Steps | Proof? | | --------------------- | ---------------------------- | ---------- | --------- | --------- | ------ | | `rbac_auth` | Python → TypeScript | Easy | 3 | 15 | No | | `pricing_engine` | JavaScript → Python | Medium | 5 | 25 | No | | `path_canonicalizer` | C → Rust | Medium | 4 | 25 | Yes | | `expression_eval` | C → Rust | Medium | 3 | 25 | Yes | | `payment_saga` | JavaScript → Python | Hard | 4 | 40 | Yes | | `lru_cache` | C++ → Rust | Hard | 4 | 40 | Yes | | `shortest_path` | C++ → Rust | Hard | 3 | 40 | Yes | | `interval_scheduler` | C++ → Rust | Hard | 3 | 40 | Yes | Proof obligations are required for `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, and `interval_scheduler`. These tasks require a formal Lean proof in addition to working code. --- ## Environment The environment exposes the standard OpenEnv-style server and client interfaces. ### Task Selection Run one task per episode. Choose the task in one of two ways: * Set `TASK_ID` before launching the environment or running inference. * Pass `task_id` as a keyword argument when resetting the environment: `env.reset(task_id="pricing_engine")`. Valid task IDs: `rbac_auth`, `pricing_engine`, `payment_saga`, `path_canonicalizer`, `expression_eval`, `lru_cache`, `shortest_path`, `interval_scheduler`. ### Actions `LeanMigrateAction` is a typed Pydantic action model with four variants: | Action | Required fields | Effect | | -------------- | -------------------------------------- | ---------------------------------------------- | | `inspect` | `function_name` | Returns source fragment + Lean spec for one fn | | `analyze_deps` | `function_name` | Returns dependency graph and migration order | | `run_tests` | `function_name`, `candidate_code` | Executes candidate against task sample cases | | `submit` | `function_name`, `target_code` or `lean_proof` | Runs IR + Lean verification pipeline | **Field roles:** * `candidate_code` — working implementation to test locally before submission (`run_tests` only). * `target_code` — final implementation submitted for Lean verification. * `lean_proof` — required only for proof tasks (e.g. `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, `interval_scheduler`). ### Observation `LeanMigrateObservation` contains: * Episode metadata (`episode_id`, `task_id`, `episode_step`, `max_steps`) * Task metadata (`source_language`, `target_language`, `source_files`) * Verification progress (`verified`, `remaining`, `failing`, `progress`) * Feedback from the previous step (`last_action_type`, `last_action_feedback`, `last_step_reward`) * Rich grading breakdown (`reward_details`) with `tests_passed`, `tests_total`, `proof_compiled`, `lean_error` ### Rewards Rewards are shaped to give signal across the full trajectory: | Action | Reward | | -------------- | ------------------------------------------------------------------ | | `inspect` | +0.05 shaping reward | | `analyze_deps` | +0.05 shaping reward | | `run_tests` | +0.10 on all cases passing; −0.01 × failures on partial/full fail | | `submit` pass | +1.0 / total\_function\_count | | `submit` fail | −0.05 on IR rejection or Lean rejection | Episode score (`progress`) is `verified_count / total_function_count`, clamped to `[0.01, 0.99]` for validator compatibility. Step rewards in the stdout log are clamped to `[0.0, 1.0]`. --- ## Setup ```bash uv sync # Build Lean 4 artifacts (required before running tests) cd lean && lake build # Run all tests uv run pytest -q ``` **RBAC TypeScript tests** require a Node runtime: ```bash npm install -g tsx uv run pytest tests/test_env_grader.py -q ``` **Rust task tests** require the Rust toolchain: ```bash curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh ``` --- ## Inference Script Run the structured inference script against any task: ```bash API_BASE_URL=https://router.huggingface.co/v1 \ MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \ HF_TOKEN=hf_... \ TASK_ID=rbac_auth \ uv run python inference.py ``` `inference.py` uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required `[START]` / `[STEP]` / `[END]` format: ``` [START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct [STEP] step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null [STEP] step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null [STEP] step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null ... [END] success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,... ``` When `TASK_ID` is unset, inference runs all 8 tasks sequentially. ## LLM Playthrough Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output: ```bash uv run python scripts/play_task.py play --all --trace-dir traces-v2 ``` `scripts/play_task.py` is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use `--quiet` when you only want the per-task summary. --- ## Baseline Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass `--tasks` to cover the full 8-task suite: ```bash uv run python baseline/run_baseline.py --tasks \ rbac_auth pricing_engine payment_saga \ path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler ``` | Task | Score | | --------------------- | ------- | | `rbac_auth` | `0.990` | | `pricing_engine` | `0.990` | | `payment_saga` | `0.990` | | `path_canonicalizer` | `0.990` | | `expression_eval` | `0.990` | | `lru_cache` | `0.990` | | `shortest_path` | `0.990` | | `interval_scheduler` | `0.990` | | **overall** | `0.990` | Scores clamp at 0.990 by design (the open-unit clamp keeps scores below 1.0 for validator compatibility). --- ## Full Playthrough Step-by-step replay for all tasks using bundled canonical solutions: ```bash uv run python scripts/play_all_tasks.py ``` Export the Lean business-logic mirror for a function: ```bash uv run python scripts/export_lean_business_logic.py \ --task-id pricing_engine \ --function-name finalPrice \ --source-file tasks/pricing/source.js ``` --- ## Docker Build ```bash uv run openenv build docker run --rm -p 8000:8000 openenv-lean_migrate:latest ``` The Docker image: * Builds Lean 4 artifacts with `lake build` * Installs the Python package with `uv sync` * Includes `nodejs` + `tsx` for TypeScript task tests * Includes the Rust toolchain for Rust task tests --- ## OpenEnv Push ```bash openenv push ``` OpenEnv uses `lean_migrate.server.app:app` as the FastAPI entry point and `lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment` as the environment implementation. --- ## Architecture ### Verification Pipeline When an agent submits code, it goes through a two-stage pipeline: 1. **Verification IR** (`env/verification_ir.py`): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in the `Candidate` namespace with `native_decide`-powered sample checks. 2. **Lean Backend** (`lean_backend/`): The generated Lean code is fed to a backend that compiles it against the formal spec. - `StdinBackend` (default): Invokes `lean --stdin`, requires `cd lean && lake build` first. - `KiminaBackend`: HTTP backend at `KIMINA_URL` (default `http://localhost:12332`). Select with `LEAN_BACKEND=kimina`. - `AeneasBackend`: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select with `LEAN_BACKEND=aeneas`. It is optional and not part of the default grading path. ### Key Data Flow for `submit` ``` SubmitAction.target_code → build_verification_ir() # env/verification_ir.py → _parse_provenance() # validates syntax, extracts arity → run_candidate_tests() # env/grader.py — runs via subprocess → _render_lean_mirror() # generates Lean code in Candidate namespace → backend.verify() # lean_backend/ → Lean compiler output → EpisodeState updates verified/failing sets ``` ### Project Layout ``` lean/ Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...) env/ Task registry, state machine, grader, verification IR lean_backend/ Lean verification backends (stdin, kimina, aeneas) tasks/ Source bundles shown to the agent (rbac, pricing, saga, path, ...) baseline/ Deterministic baseline runner server/ FastAPI app and OpenEnv environment wrapper tests/ Smoke tests and Lean backend benchmark scripts/ Helper scripts (playthrough, export, validation) ``` ### Environment Variables | Variable | Purpose | Default | | -------------------------------------- | --------------------------------- | ------------------------------ | | `TASK_ID` | Which task to run | `rbac_auth` | | `LEAN_BACKEND` | `stdin`, `kimina`, or `aeneas` | `stdin` | | `KIMINA_URL` | Kimina server URL | `http://localhost:12332` | | `LEAN_BIN` / `LEAN_CWD` / `LEAN_PATH` | Lean binary paths | elan defaults | | `API_BASE_URL` | LLM API endpoint for inference.py | `https://router.huggingface.co/v1` | | `MODEL_NAME` | LLM model for inference.py | `Qwen/Qwen2.5-72B-Instruct` | | `HF_TOKEN` | Auth token for inference.py | — |