lean-migrate / README.md
Hrushi's picture
Upload folder using huggingface_hub
bf9c466 verified
---
title: LeanMigrate OpenEnv Environment
emoji: 🧠
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# LeanMigrate
LeanMigrate is a code-migration environment where an AI agent migrates real business logic one function at a time. **Lean 4 is the verifier.** The environment parses each submission with Python `ast` (Python targets) or tree-sitter (TypeScript targets) or a Rust/cargo harness (Rust targets), builds an internal verification IR in the `Candidate` namespace, and checks it against the formal Lean specification using `native_decide`-powered sample theorems.
---
## Tasks
Eight tasks across four language pairs and three difficulty levels:
| Task | Migration | Difficulty | Functions | Max Steps | Proof? |
| --------------------- | ---------------------------- | ---------- | --------- | --------- | ------ |
| `rbac_auth` | Python β†’ TypeScript | Easy | 3 | 15 | No |
| `pricing_engine` | JavaScript β†’ Python | Medium | 5 | 25 | No |
| `path_canonicalizer` | C β†’ Rust | Medium | 4 | 25 | Yes |
| `expression_eval` | C β†’ Rust | Medium | 3 | 25 | Yes |
| `payment_saga` | JavaScript β†’ Python | Hard | 4 | 40 | Yes |
| `lru_cache` | C++ β†’ Rust | Hard | 4 | 40 | Yes |
| `shortest_path` | C++ β†’ Rust | Hard | 3 | 40 | Yes |
| `interval_scheduler` | C++ β†’ Rust | Hard | 3 | 40 | Yes |
Proof obligations are required for `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, and `interval_scheduler`. These tasks require a formal Lean proof in addition to working code.
---
## Environment
The environment exposes the standard OpenEnv-style server and client interfaces.
### Task Selection
Run one task per episode. Choose the task in one of two ways:
* Set `TASK_ID` before launching the environment or running inference.
* Pass `task_id` as a keyword argument when resetting the environment: `env.reset(task_id="pricing_engine")`.
Valid task IDs: `rbac_auth`, `pricing_engine`, `payment_saga`, `path_canonicalizer`, `expression_eval`, `lru_cache`, `shortest_path`, `interval_scheduler`.
### Actions
`LeanMigrateAction` is a typed Pydantic action model with four variants:
| Action | Required fields | Effect |
| -------------- | -------------------------------------- | ---------------------------------------------- |
| `inspect` | `function_name` | Returns source fragment + Lean spec for one fn |
| `analyze_deps` | `function_name` | Returns dependency graph and migration order |
| `run_tests` | `function_name`, `candidate_code` | Executes candidate against task sample cases |
| `submit` | `function_name`, `target_code` or `lean_proof` | Runs IR + Lean verification pipeline |
**Field roles:**
* `candidate_code` β€” working implementation to test locally before submission (`run_tests` only).
* `target_code` β€” final implementation submitted for Lean verification.
* `lean_proof` β€” required only for proof tasks (e.g. `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, `interval_scheduler`).
### Observation
`LeanMigrateObservation` contains:
* Episode metadata (`episode_id`, `task_id`, `episode_step`, `max_steps`)
* Task metadata (`source_language`, `target_language`, `source_files`)
* Verification progress (`verified`, `remaining`, `failing`, `progress`)
* Feedback from the previous step (`last_action_type`, `last_action_feedback`, `last_step_reward`)
* Rich grading breakdown (`reward_details`) with `tests_passed`, `tests_total`, `proof_compiled`, `lean_error`
### Rewards
Rewards are shaped to give signal across the full trajectory:
| Action | Reward |
| -------------- | ------------------------------------------------------------------ |
| `inspect` | +0.05 shaping reward |
| `analyze_deps` | +0.05 shaping reward |
| `run_tests` | +0.10 on all cases passing; βˆ’0.01 Γ— failures on partial/full fail |
| `submit` pass | +1.0 / total\_function\_count |
| `submit` fail | βˆ’0.05 on IR rejection or Lean rejection |
Episode score (`progress`) is `verified_count / total_function_count`, clamped to `[0.01, 0.99]` for validator compatibility. Step rewards in the stdout log are clamped to `[0.0, 1.0]`.
---
## Setup
```bash
uv sync
# Build Lean 4 artifacts (required before running tests)
cd lean && lake build
# Run all tests
uv run pytest -q
```
**RBAC TypeScript tests** require a Node runtime:
```bash
npm install -g tsx
uv run pytest tests/test_env_grader.py -q
```
**Rust task tests** require the Rust toolchain:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
---
## Inference Script
Run the structured inference script against any task:
```bash
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=hf_... \
TASK_ID=rbac_auth \
uv run python inference.py
```
`inference.py` uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required `[START]` / `[STEP]` / `[END]` format:
```
[START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null
[STEP] step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null
[STEP] step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null
...
[END] success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,...
```
When `TASK_ID` is unset, inference runs all 8 tasks sequentially.
## LLM Playthrough
Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output:
```bash
uv run python scripts/play_task.py play --all --trace-dir traces-v2
```
`scripts/play_task.py` is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use `--quiet` when you only want the per-task summary.
---
## Baseline
Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass `--tasks` to cover the full 8-task suite:
```bash
uv run python baseline/run_baseline.py --tasks \
rbac_auth pricing_engine payment_saga \
path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler
```
| Task | Score |
| --------------------- | ------- |
| `rbac_auth` | `0.990` |
| `pricing_engine` | `0.990` |
| `payment_saga` | `0.990` |
| `path_canonicalizer` | `0.990` |
| `expression_eval` | `0.990` |
| `lru_cache` | `0.990` |
| `shortest_path` | `0.990` |
| `interval_scheduler` | `0.990` |
| **overall** | `0.990` |
Scores clamp at 0.990 by design (the open-unit clamp keeps scores below 1.0 for validator compatibility).
---
## Full Playthrough
Step-by-step replay for all tasks using bundled canonical solutions:
```bash
uv run python scripts/play_all_tasks.py
```
Export the Lean business-logic mirror for a function:
```bash
uv run python scripts/export_lean_business_logic.py \
--task-id pricing_engine \
--function-name finalPrice \
--source-file tasks/pricing/source.js
```
---
## Docker Build
```bash
uv run openenv build
docker run --rm -p 8000:8000 openenv-lean_migrate:latest
```
The Docker image:
* Builds Lean 4 artifacts with `lake build`
* Installs the Python package with `uv sync`
* Includes `nodejs` + `tsx` for TypeScript task tests
* Includes the Rust toolchain for Rust task tests
---
## OpenEnv Push
```bash
openenv push
```
OpenEnv uses `lean_migrate.server.app:app` as the FastAPI entry point and `lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment` as the environment implementation.
---
## Architecture
### Verification Pipeline
When an agent submits code, it goes through a two-stage pipeline:
1. **Verification IR** (`env/verification_ir.py`): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in the `Candidate` namespace with `native_decide`-powered sample checks.
2. **Lean Backend** (`lean_backend/`): The generated Lean code is fed to a backend that compiles it against the formal spec.
- `StdinBackend` (default): Invokes `lean --stdin`, requires `cd lean && lake build` first.
- `KiminaBackend`: HTTP backend at `KIMINA_URL` (default `http://localhost:12332`). Select with `LEAN_BACKEND=kimina`.
- `AeneasBackend`: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select with `LEAN_BACKEND=aeneas`. It is optional and not part of the default grading path.
### Key Data Flow for `submit`
```
SubmitAction.target_code
β†’ build_verification_ir() # env/verification_ir.py
β†’ _parse_provenance() # validates syntax, extracts arity
β†’ run_candidate_tests() # env/grader.py β€” runs via subprocess
β†’ _render_lean_mirror() # generates Lean code in Candidate namespace
β†’ backend.verify() # lean_backend/
β†’ Lean compiler output
β†’ EpisodeState updates verified/failing sets
```
### Project Layout
```
lean/ Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...)
env/ Task registry, state machine, grader, verification IR
lean_backend/ Lean verification backends (stdin, kimina, aeneas)
tasks/ Source bundles shown to the agent (rbac, pricing, saga, path, ...)
baseline/ Deterministic baseline runner
server/ FastAPI app and OpenEnv environment wrapper
tests/ Smoke tests and Lean backend benchmark
scripts/ Helper scripts (playthrough, export, validation)
```
### Environment Variables
| Variable | Purpose | Default |
| -------------------------------------- | --------------------------------- | ------------------------------ |
| `TASK_ID` | Which task to run | `rbac_auth` |
| `LEAN_BACKEND` | `stdin`, `kimina`, or `aeneas` | `stdin` |
| `KIMINA_URL` | Kimina server URL | `http://localhost:12332` |
| `LEAN_BIN` / `LEAN_CWD` / `LEAN_PATH` | Lean binary paths | elan defaults |
| `API_BASE_URL` | LLM API endpoint for inference.py | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | LLM model for inference.py | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | Auth token for inference.py | β€” |