lean-migrate / README.md
Hrushi's picture
Upload folder using huggingface_hub
bf9c466 verified
metadata
title: LeanMigrate OpenEnv Environment
emoji: 🧠
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

LeanMigrate

LeanMigrate is a code-migration environment where an AI agent migrates real business logic one function at a time. Lean 4 is the verifier. The environment parses each submission with Python ast (Python targets) or tree-sitter (TypeScript targets) or a Rust/cargo harness (Rust targets), builds an internal verification IR in the Candidate namespace, and checks it against the formal Lean specification using native_decide-powered sample theorems.


Tasks

Eight tasks across four language pairs and three difficulty levels:

Task Migration Difficulty Functions Max Steps Proof?
rbac_auth Python β†’ TypeScript Easy 3 15 No
pricing_engine JavaScript β†’ Python Medium 5 25 No
path_canonicalizer C β†’ Rust Medium 4 25 Yes
expression_eval C β†’ Rust Medium 3 25 Yes
payment_saga JavaScript β†’ Python Hard 4 40 Yes
lru_cache C++ β†’ Rust Hard 4 40 Yes
shortest_path C++ β†’ Rust Hard 3 40 Yes
interval_scheduler C++ β†’ Rust Hard 3 40 Yes

Proof obligations are required for path_canonicalizer, expression_eval, payment_saga, lru_cache, shortest_path, and interval_scheduler. These tasks require a formal Lean proof in addition to working code.


Environment

The environment exposes the standard OpenEnv-style server and client interfaces.

Task Selection

Run one task per episode. Choose the task in one of two ways:

  • Set TASK_ID before launching the environment or running inference.
  • Pass task_id as a keyword argument when resetting the environment: env.reset(task_id="pricing_engine").

Valid task IDs: rbac_auth, pricing_engine, payment_saga, path_canonicalizer, expression_eval, lru_cache, shortest_path, interval_scheduler.

Actions

LeanMigrateAction is a typed Pydantic action model with four variants:

Action Required fields Effect
inspect function_name Returns source fragment + Lean spec for one fn
analyze_deps function_name Returns dependency graph and migration order
run_tests function_name, candidate_code Executes candidate against task sample cases
submit function_name, target_code or lean_proof Runs IR + Lean verification pipeline

Field roles:

  • candidate_code β€” working implementation to test locally before submission (run_tests only).
  • target_code β€” final implementation submitted for Lean verification.
  • lean_proof β€” required only for proof tasks (e.g. path_canonicalizer, expression_eval, payment_saga, lru_cache, shortest_path, interval_scheduler).

Observation

LeanMigrateObservation contains:

  • Episode metadata (episode_id, task_id, episode_step, max_steps)
  • Task metadata (source_language, target_language, source_files)
  • Verification progress (verified, remaining, failing, progress)
  • Feedback from the previous step (last_action_type, last_action_feedback, last_step_reward)
  • Rich grading breakdown (reward_details) with tests_passed, tests_total, proof_compiled, lean_error

Rewards

Rewards are shaped to give signal across the full trajectory:

Action Reward
inspect +0.05 shaping reward
analyze_deps +0.05 shaping reward
run_tests +0.10 on all cases passing; βˆ’0.01 Γ— failures on partial/full fail
submit pass +1.0 / total_function_count
submit fail βˆ’0.05 on IR rejection or Lean rejection

Episode score (progress) is verified_count / total_function_count, clamped to [0.01, 0.99] for validator compatibility. Step rewards in the stdout log are clamped to [0.0, 1.0].


Setup

uv sync

# Build Lean 4 artifacts (required before running tests)
cd lean && lake build

# Run all tests
uv run pytest -q

RBAC TypeScript tests require a Node runtime:

npm install -g tsx
uv run pytest tests/test_env_grader.py -q

Rust task tests require the Rust toolchain:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Inference Script

Run the structured inference script against any task:

API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=hf_... \
TASK_ID=rbac_auth \
uv run python inference.py

inference.py uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required [START] / [STEP] / [END] format:

[START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct
[STEP]  step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null
[STEP]  step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null
[STEP]  step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null
...
[END]   success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,...

When TASK_ID is unset, inference runs all 8 tasks sequentially.

LLM Playthrough

Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output:

uv run python scripts/play_task.py play --all --trace-dir traces-v2

scripts/play_task.py is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use --quiet when you only want the per-task summary.


Baseline

Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass --tasks to cover the full 8-task suite:

uv run python baseline/run_baseline.py --tasks \
  rbac_auth pricing_engine payment_saga \
  path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler
Task Score
rbac_auth 0.990
pricing_engine 0.990
payment_saga 0.990
path_canonicalizer 0.990
expression_eval 0.990
lru_cache 0.990
shortest_path 0.990
interval_scheduler 0.990
overall 0.990

Scores clamp at 0.990 by design (the open-unit clamp keeps scores below 1.0 for validator compatibility).


Full Playthrough

Step-by-step replay for all tasks using bundled canonical solutions:

uv run python scripts/play_all_tasks.py

Export the Lean business-logic mirror for a function:

uv run python scripts/export_lean_business_logic.py \
  --task-id pricing_engine \
  --function-name finalPrice \
  --source-file tasks/pricing/source.js

Docker Build

uv run openenv build
docker run --rm -p 8000:8000 openenv-lean_migrate:latest

The Docker image:

  • Builds Lean 4 artifacts with lake build
  • Installs the Python package with uv sync
  • Includes nodejs + tsx for TypeScript task tests
  • Includes the Rust toolchain for Rust task tests

OpenEnv Push

openenv push

OpenEnv uses lean_migrate.server.app:app as the FastAPI entry point and lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment as the environment implementation.


Architecture

Verification Pipeline

When an agent submits code, it goes through a two-stage pipeline:

  1. Verification IR (env/verification_ir.py): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in the Candidate namespace with native_decide-powered sample checks.

  2. Lean Backend (lean_backend/): The generated Lean code is fed to a backend that compiles it against the formal spec.

    • StdinBackend (default): Invokes lean --stdin, requires cd lean && lake build first.
    • KiminaBackend: HTTP backend at KIMINA_URL (default http://localhost:12332). Select with LEAN_BACKEND=kimina.
    • AeneasBackend: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select with LEAN_BACKEND=aeneas. It is optional and not part of the default grading path.

Key Data Flow for submit

SubmitAction.target_code
  β†’ build_verification_ir()           # env/verification_ir.py
      β†’ _parse_provenance()           # validates syntax, extracts arity
      β†’ run_candidate_tests()         # env/grader.py β€” runs via subprocess
      β†’ _render_lean_mirror()         # generates Lean code in Candidate namespace
  β†’ backend.verify()                  # lean_backend/
      β†’ Lean compiler output
  β†’ EpisodeState updates verified/failing sets

Project Layout

lean/           Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...)
env/            Task registry, state machine, grader, verification IR
lean_backend/   Lean verification backends (stdin, kimina, aeneas)
tasks/          Source bundles shown to the agent (rbac, pricing, saga, path, ...)
baseline/       Deterministic baseline runner
server/         FastAPI app and OpenEnv environment wrapper
tests/          Smoke tests and Lean backend benchmark
scripts/        Helper scripts (playthrough, export, validation)

Environment Variables

Variable Purpose Default
TASK_ID Which task to run rbac_auth
LEAN_BACKEND stdin, kimina, or aeneas stdin
KIMINA_URL Kimina server URL http://localhost:12332
LEAN_BIN / LEAN_CWD / LEAN_PATH Lean binary paths elan defaults
API_BASE_URL LLM API endpoint for inference.py https://router.huggingface.co/v1
MODEL_NAME LLM model for inference.py Qwen/Qwen2.5-72B-Instruct
HF_TOKEN Auth token for inference.py β€”