Spaces:
Sleeping
Sleeping
| title: LeanMigrate OpenEnv Environment | |
| emoji: π§ | |
| colorFrom: green | |
| colorTo: pink | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| # LeanMigrate | |
| LeanMigrate is a code-migration environment where an AI agent migrates real business logic one function at a time. **Lean 4 is the verifier.** The environment parses each submission with Python `ast` (Python targets) or tree-sitter (TypeScript targets) or a Rust/cargo harness (Rust targets), builds an internal verification IR in the `Candidate` namespace, and checks it against the formal Lean specification using `native_decide`-powered sample theorems. | |
| --- | |
| ## Tasks | |
| Eight tasks across four language pairs and three difficulty levels: | |
| | Task | Migration | Difficulty | Functions | Max Steps | Proof? | | |
| | --------------------- | ---------------------------- | ---------- | --------- | --------- | ------ | | |
| | `rbac_auth` | Python β TypeScript | Easy | 3 | 15 | No | | |
| | `pricing_engine` | JavaScript β Python | Medium | 5 | 25 | No | | |
| | `path_canonicalizer` | C β Rust | Medium | 4 | 25 | Yes | | |
| | `expression_eval` | C β Rust | Medium | 3 | 25 | Yes | | |
| | `payment_saga` | JavaScript β Python | Hard | 4 | 40 | Yes | | |
| | `lru_cache` | C++ β Rust | Hard | 4 | 40 | Yes | | |
| | `shortest_path` | C++ β Rust | Hard | 3 | 40 | Yes | | |
| | `interval_scheduler` | C++ β Rust | Hard | 3 | 40 | Yes | | |
| Proof obligations are required for `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, and `interval_scheduler`. These tasks require a formal Lean proof in addition to working code. | |
| --- | |
| ## Environment | |
| The environment exposes the standard OpenEnv-style server and client interfaces. | |
| ### Task Selection | |
| Run one task per episode. Choose the task in one of two ways: | |
| * Set `TASK_ID` before launching the environment or running inference. | |
| * Pass `task_id` as a keyword argument when resetting the environment: `env.reset(task_id="pricing_engine")`. | |
| Valid task IDs: `rbac_auth`, `pricing_engine`, `payment_saga`, `path_canonicalizer`, `expression_eval`, `lru_cache`, `shortest_path`, `interval_scheduler`. | |
| ### Actions | |
| `LeanMigrateAction` is a typed Pydantic action model with four variants: | |
| | Action | Required fields | Effect | | |
| | -------------- | -------------------------------------- | ---------------------------------------------- | | |
| | `inspect` | `function_name` | Returns source fragment + Lean spec for one fn | | |
| | `analyze_deps` | `function_name` | Returns dependency graph and migration order | | |
| | `run_tests` | `function_name`, `candidate_code` | Executes candidate against task sample cases | | |
| | `submit` | `function_name`, `target_code` or `lean_proof` | Runs IR + Lean verification pipeline | | |
| **Field roles:** | |
| * `candidate_code` β working implementation to test locally before submission (`run_tests` only). | |
| * `target_code` β final implementation submitted for Lean verification. | |
| * `lean_proof` β required only for proof tasks (e.g. `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, `interval_scheduler`). | |
| ### Observation | |
| `LeanMigrateObservation` contains: | |
| * Episode metadata (`episode_id`, `task_id`, `episode_step`, `max_steps`) | |
| * Task metadata (`source_language`, `target_language`, `source_files`) | |
| * Verification progress (`verified`, `remaining`, `failing`, `progress`) | |
| * Feedback from the previous step (`last_action_type`, `last_action_feedback`, `last_step_reward`) | |
| * Rich grading breakdown (`reward_details`) with `tests_passed`, `tests_total`, `proof_compiled`, `lean_error` | |
| ### Rewards | |
| Rewards are shaped to give signal across the full trajectory: | |
| | Action | Reward | | |
| | -------------- | ------------------------------------------------------------------ | | |
| | `inspect` | +0.05 shaping reward | | |
| | `analyze_deps` | +0.05 shaping reward | | |
| | `run_tests` | +0.10 on all cases passing; β0.01 Γ failures on partial/full fail | | |
| | `submit` pass | +1.0 / total\_function\_count | | |
| | `submit` fail | β0.05 on IR rejection or Lean rejection | | |
| Episode score (`progress`) is `verified_count / total_function_count`, clamped to `[0.01, 0.99]` for validator compatibility. Step rewards in the stdout log are clamped to `[0.0, 1.0]`. | |
| --- | |
| ## Setup | |
| ```bash | |
| uv sync | |
| # Build Lean 4 artifacts (required before running tests) | |
| cd lean && lake build | |
| # Run all tests | |
| uv run pytest -q | |
| ``` | |
| **RBAC TypeScript tests** require a Node runtime: | |
| ```bash | |
| npm install -g tsx | |
| uv run pytest tests/test_env_grader.py -q | |
| ``` | |
| **Rust task tests** require the Rust toolchain: | |
| ```bash | |
| curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh | |
| ``` | |
| --- | |
| ## Inference Script | |
| Run the structured inference script against any task: | |
| ```bash | |
| API_BASE_URL=https://router.huggingface.co/v1 \ | |
| MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \ | |
| HF_TOKEN=hf_... \ | |
| TASK_ID=rbac_auth \ | |
| uv run python inference.py | |
| ``` | |
| `inference.py` uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required `[START]` / `[STEP]` / `[END]` format: | |
| ``` | |
| [START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct | |
| [STEP] step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null | |
| [STEP] step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null | |
| [STEP] step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null | |
| ... | |
| [END] success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,... | |
| ``` | |
| When `TASK_ID` is unset, inference runs all 8 tasks sequentially. | |
| ## LLM Playthrough | |
| Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output: | |
| ```bash | |
| uv run python scripts/play_task.py play --all --trace-dir traces-v2 | |
| ``` | |
| `scripts/play_task.py` is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use `--quiet` when you only want the per-task summary. | |
| --- | |
| ## Baseline | |
| Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass `--tasks` to cover the full 8-task suite: | |
| ```bash | |
| uv run python baseline/run_baseline.py --tasks \ | |
| rbac_auth pricing_engine payment_saga \ | |
| path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler | |
| ``` | |
| | Task | Score | | |
| | --------------------- | ------- | | |
| | `rbac_auth` | `0.990` | | |
| | `pricing_engine` | `0.990` | | |
| | `payment_saga` | `0.990` | | |
| | `path_canonicalizer` | `0.990` | | |
| | `expression_eval` | `0.990` | | |
| | `lru_cache` | `0.990` | | |
| | `shortest_path` | `0.990` | | |
| | `interval_scheduler` | `0.990` | | |
| | **overall** | `0.990` | | |
| Scores clamp at 0.990 by design (the open-unit clamp keeps scores below 1.0 for validator compatibility). | |
| --- | |
| ## Full Playthrough | |
| Step-by-step replay for all tasks using bundled canonical solutions: | |
| ```bash | |
| uv run python scripts/play_all_tasks.py | |
| ``` | |
| Export the Lean business-logic mirror for a function: | |
| ```bash | |
| uv run python scripts/export_lean_business_logic.py \ | |
| --task-id pricing_engine \ | |
| --function-name finalPrice \ | |
| --source-file tasks/pricing/source.js | |
| ``` | |
| --- | |
| ## Docker Build | |
| ```bash | |
| uv run openenv build | |
| docker run --rm -p 8000:8000 openenv-lean_migrate:latest | |
| ``` | |
| The Docker image: | |
| * Builds Lean 4 artifacts with `lake build` | |
| * Installs the Python package with `uv sync` | |
| * Includes `nodejs` + `tsx` for TypeScript task tests | |
| * Includes the Rust toolchain for Rust task tests | |
| --- | |
| ## OpenEnv Push | |
| ```bash | |
| openenv push | |
| ``` | |
| OpenEnv uses `lean_migrate.server.app:app` as the FastAPI entry point and `lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment` as the environment implementation. | |
| --- | |
| ## Architecture | |
| ### Verification Pipeline | |
| When an agent submits code, it goes through a two-stage pipeline: | |
| 1. **Verification IR** (`env/verification_ir.py`): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in the `Candidate` namespace with `native_decide`-powered sample checks. | |
| 2. **Lean Backend** (`lean_backend/`): The generated Lean code is fed to a backend that compiles it against the formal spec. | |
| - `StdinBackend` (default): Invokes `lean --stdin`, requires `cd lean && lake build` first. | |
| - `KiminaBackend`: HTTP backend at `KIMINA_URL` (default `http://localhost:12332`). Select with `LEAN_BACKEND=kimina`. | |
| - `AeneasBackend`: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select with `LEAN_BACKEND=aeneas`. It is optional and not part of the default grading path. | |
| ### Key Data Flow for `submit` | |
| ``` | |
| SubmitAction.target_code | |
| β build_verification_ir() # env/verification_ir.py | |
| β _parse_provenance() # validates syntax, extracts arity | |
| β run_candidate_tests() # env/grader.py β runs via subprocess | |
| β _render_lean_mirror() # generates Lean code in Candidate namespace | |
| β backend.verify() # lean_backend/ | |
| β Lean compiler output | |
| β EpisodeState updates verified/failing sets | |
| ``` | |
| ### Project Layout | |
| ``` | |
| lean/ Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...) | |
| env/ Task registry, state machine, grader, verification IR | |
| lean_backend/ Lean verification backends (stdin, kimina, aeneas) | |
| tasks/ Source bundles shown to the agent (rbac, pricing, saga, path, ...) | |
| baseline/ Deterministic baseline runner | |
| server/ FastAPI app and OpenEnv environment wrapper | |
| tests/ Smoke tests and Lean backend benchmark | |
| scripts/ Helper scripts (playthrough, export, validation) | |
| ``` | |
| ### Environment Variables | |
| | Variable | Purpose | Default | | |
| | -------------------------------------- | --------------------------------- | ------------------------------ | | |
| | `TASK_ID` | Which task to run | `rbac_auth` | | |
| | `LEAN_BACKEND` | `stdin`, `kimina`, or `aeneas` | `stdin` | | |
| | `KIMINA_URL` | Kimina server URL | `http://localhost:12332` | | |
| | `LEAN_BIN` / `LEAN_CWD` / `LEAN_PATH` | Lean binary paths | elan defaults | | |
| | `API_BASE_URL` | LLM API endpoint for inference.py | `https://router.huggingface.co/v1` | | |
| | `MODEL_NAME` | LLM model for inference.py | `Qwen/Qwen2.5-72B-Instruct` | | |
| | `HF_TOKEN` | Auth token for inference.py | β | | |