Spaces:
Sleeping
title: LeanMigrate OpenEnv Environment
emoji: π§
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
LeanMigrate
LeanMigrate is a code-migration environment where an AI agent migrates real business logic one function at a time. Lean 4 is the verifier. The environment parses each submission with Python ast (Python targets) or tree-sitter (TypeScript targets) or a Rust/cargo harness (Rust targets), builds an internal verification IR in the Candidate namespace, and checks it against the formal Lean specification using native_decide-powered sample theorems.
Tasks
Eight tasks across four language pairs and three difficulty levels:
| Task | Migration | Difficulty | Functions | Max Steps | Proof? |
|---|---|---|---|---|---|
rbac_auth |
Python β TypeScript | Easy | 3 | 15 | No |
pricing_engine |
JavaScript β Python | Medium | 5 | 25 | No |
path_canonicalizer |
C β Rust | Medium | 4 | 25 | Yes |
expression_eval |
C β Rust | Medium | 3 | 25 | Yes |
payment_saga |
JavaScript β Python | Hard | 4 | 40 | Yes |
lru_cache |
C++ β Rust | Hard | 4 | 40 | Yes |
shortest_path |
C++ β Rust | Hard | 3 | 40 | Yes |
interval_scheduler |
C++ β Rust | Hard | 3 | 40 | Yes |
Proof obligations are required for path_canonicalizer, expression_eval, payment_saga, lru_cache, shortest_path, and interval_scheduler. These tasks require a formal Lean proof in addition to working code.
Environment
The environment exposes the standard OpenEnv-style server and client interfaces.
Task Selection
Run one task per episode. Choose the task in one of two ways:
- Set
TASK_IDbefore launching the environment or running inference. - Pass
task_idas a keyword argument when resetting the environment:env.reset(task_id="pricing_engine").
Valid task IDs: rbac_auth, pricing_engine, payment_saga, path_canonicalizer, expression_eval, lru_cache, shortest_path, interval_scheduler.
Actions
LeanMigrateAction is a typed Pydantic action model with four variants:
| Action | Required fields | Effect |
|---|---|---|
inspect |
function_name |
Returns source fragment + Lean spec for one fn |
analyze_deps |
function_name |
Returns dependency graph and migration order |
run_tests |
function_name, candidate_code |
Executes candidate against task sample cases |
submit |
function_name, target_code or lean_proof |
Runs IR + Lean verification pipeline |
Field roles:
candidate_codeβ working implementation to test locally before submission (run_testsonly).target_codeβ final implementation submitted for Lean verification.lean_proofβ required only for proof tasks (e.g.path_canonicalizer,expression_eval,payment_saga,lru_cache,shortest_path,interval_scheduler).
Observation
LeanMigrateObservation contains:
- Episode metadata (
episode_id,task_id,episode_step,max_steps) - Task metadata (
source_language,target_language,source_files) - Verification progress (
verified,remaining,failing,progress) - Feedback from the previous step (
last_action_type,last_action_feedback,last_step_reward) - Rich grading breakdown (
reward_details) withtests_passed,tests_total,proof_compiled,lean_error
Rewards
Rewards are shaped to give signal across the full trajectory:
| Action | Reward |
|---|---|
inspect |
+0.05 shaping reward |
analyze_deps |
+0.05 shaping reward |
run_tests |
+0.10 on all cases passing; β0.01 Γ failures on partial/full fail |
submit pass |
+1.0 / total_function_count |
submit fail |
β0.05 on IR rejection or Lean rejection |
Episode score (progress) is verified_count / total_function_count, clamped to [0.01, 0.99] for validator compatibility. Step rewards in the stdout log are clamped to [0.0, 1.0].
Setup
uv sync
# Build Lean 4 artifacts (required before running tests)
cd lean && lake build
# Run all tests
uv run pytest -q
RBAC TypeScript tests require a Node runtime:
npm install -g tsx
uv run pytest tests/test_env_grader.py -q
Rust task tests require the Rust toolchain:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Inference Script
Run the structured inference script against any task:
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=hf_... \
TASK_ID=rbac_auth \
uv run python inference.py
inference.py uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required [START] / [STEP] / [END] format:
[START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null
[STEP] step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null
[STEP] step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null
...
[END] success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,...
When TASK_ID is unset, inference runs all 8 tasks sequentially.
LLM Playthrough
Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output:
uv run python scripts/play_task.py play --all --trace-dir traces-v2
scripts/play_task.py is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use --quiet when you only want the per-task summary.
Baseline
Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass --tasks to cover the full 8-task suite:
uv run python baseline/run_baseline.py --tasks \
rbac_auth pricing_engine payment_saga \
path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler
| Task | Score |
|---|---|
rbac_auth |
0.990 |
pricing_engine |
0.990 |
payment_saga |
0.990 |
path_canonicalizer |
0.990 |
expression_eval |
0.990 |
lru_cache |
0.990 |
shortest_path |
0.990 |
interval_scheduler |
0.990 |
| overall | 0.990 |
Scores clamp at 0.990 by design (the open-unit clamp keeps scores below 1.0 for validator compatibility).
Full Playthrough
Step-by-step replay for all tasks using bundled canonical solutions:
uv run python scripts/play_all_tasks.py
Export the Lean business-logic mirror for a function:
uv run python scripts/export_lean_business_logic.py \
--task-id pricing_engine \
--function-name finalPrice \
--source-file tasks/pricing/source.js
Docker Build
uv run openenv build
docker run --rm -p 8000:8000 openenv-lean_migrate:latest
The Docker image:
- Builds Lean 4 artifacts with
lake build - Installs the Python package with
uv sync - Includes
nodejs+tsxfor TypeScript task tests - Includes the Rust toolchain for Rust task tests
OpenEnv Push
openenv push
OpenEnv uses lean_migrate.server.app:app as the FastAPI entry point and lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment as the environment implementation.
Architecture
Verification Pipeline
When an agent submits code, it goes through a two-stage pipeline:
Verification IR (
env/verification_ir.py): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in theCandidatenamespace withnative_decide-powered sample checks.Lean Backend (
lean_backend/): The generated Lean code is fed to a backend that compiles it against the formal spec.StdinBackend(default): Invokeslean --stdin, requirescd lean && lake buildfirst.KiminaBackend: HTTP backend atKIMINA_URL(defaulthttp://localhost:12332). Select withLEAN_BACKEND=kimina.AeneasBackend: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select withLEAN_BACKEND=aeneas. It is optional and not part of the default grading path.
Key Data Flow for submit
SubmitAction.target_code
β build_verification_ir() # env/verification_ir.py
β _parse_provenance() # validates syntax, extracts arity
β run_candidate_tests() # env/grader.py β runs via subprocess
β _render_lean_mirror() # generates Lean code in Candidate namespace
β backend.verify() # lean_backend/
β Lean compiler output
β EpisodeState updates verified/failing sets
Project Layout
lean/ Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...)
env/ Task registry, state machine, grader, verification IR
lean_backend/ Lean verification backends (stdin, kimina, aeneas)
tasks/ Source bundles shown to the agent (rbac, pricing, saga, path, ...)
baseline/ Deterministic baseline runner
server/ FastAPI app and OpenEnv environment wrapper
tests/ Smoke tests and Lean backend benchmark
scripts/ Helper scripts (playthrough, export, validation)
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
TASK_ID |
Which task to run | rbac_auth |
LEAN_BACKEND |
stdin, kimina, or aeneas |
stdin |
KIMINA_URL |
Kimina server URL | http://localhost:12332 |
LEAN_BIN / LEAN_CWD / LEAN_PATH |
Lean binary paths | elan defaults |
API_BASE_URL |
LLM API endpoint for inference.py | https://router.huggingface.co/v1 |
MODEL_NAME |
LLM model for inference.py | Qwen/Qwen2.5-72B-Instruct |
HF_TOKEN |
Auth token for inference.py | β |