Spaces:

Hrushi
/

lean-migrate

Sleeping

App Files Files Community

lean-migrate / README.md

Hrushi

Upload folder using huggingface_hub

bf9c466 verified about 2 months ago

preview code

raw

history blame contribute delete

11.6 kB

metadata

title: LeanMigrate OpenEnv Environment
emoji: 🧠
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

LeanMigrate

LeanMigrate is a code-migration environment where an AI agent migrates real business logic one function at a time. Lean 4 is the verifier. The environment parses each submission with Python ast (Python targets) or tree-sitter (TypeScript targets) or a Rust/cargo harness (Rust targets), builds an internal verification IR in the Candidate namespace, and checks it against the formal Lean specification using native_decide-powered sample theorems.

Tasks

Eight tasks across four language pairs and three difficulty levels:

Task	Migration	Difficulty	Functions	Max Steps	Proof?
`rbac_auth`	Python → TypeScript	Easy	3	15	No
`pricing_engine`	JavaScript → Python	Medium	5	25	No
`path_canonicalizer`	C → Rust	Medium	4	25	Yes
`expression_eval`	C → Rust	Medium	3	25	Yes
`payment_saga`	JavaScript → Python	Hard	4	40	Yes
`lru_cache`	C++ → Rust	Hard	4	40	Yes
`shortest_path`	C++ → Rust	Hard	3	40	Yes
`interval_scheduler`	C++ → Rust	Hard	3	40	Yes

Proof obligations are required for path_canonicalizer, expression_eval, payment_saga, lru_cache, shortest_path, and interval_scheduler. These tasks require a formal Lean proof in addition to working code.

Environment

The environment exposes the standard OpenEnv-style server and client interfaces.

Task Selection

Run one task per episode. Choose the task in one of two ways:

Set TASK_ID before launching the environment or running inference.
Pass task_id as a keyword argument when resetting the environment: env.reset(task_id="pricing_engine").

Valid task IDs: rbac_auth, pricing_engine, payment_saga, path_canonicalizer, expression_eval, lru_cache, shortest_path, interval_scheduler.

Actions

LeanMigrateAction is a typed Pydantic action model with four variants:

Action	Required fields	Effect
`inspect`	`function_name`	Returns source fragment + Lean spec for one fn
`analyze_deps`	`function_name`	Returns dependency graph and migration order
`run_tests`	`function_name`, `candidate_code`	Executes candidate against task sample cases
`submit`	`function_name`, `target_code` or `lean_proof`	Runs IR + Lean verification pipeline

Field roles:

candidate_code — working implementation to test locally before submission (run_tests only).
target_code — final implementation submitted for Lean verification.
lean_proof — required only for proof tasks (e.g. path_canonicalizer, expression_eval, payment_saga, lru_cache, shortest_path, interval_scheduler).

Observation

LeanMigrateObservation contains:

Episode metadata (episode_id, task_id, episode_step, max_steps)
Task metadata (source_language, target_language, source_files)
Verification progress (verified, remaining, failing, progress)
Feedback from the previous step (last_action_type, last_action_feedback, last_step_reward)
Rich grading breakdown (reward_details) with tests_passed, tests_total, proof_compiled, lean_error

Rewards

Rewards are shaped to give signal across the full trajectory:

Action	Reward
`inspect`	+0.05 shaping reward
`analyze_deps`	+0.05 shaping reward
`run_tests`	+0.10 on all cases passing; −0.01 × failures on partial/full fail
`submit` pass	+1.0 / total_function_count
`submit` fail	−0.05 on IR rejection or Lean rejection

Episode score (progress) is verified_count / total_function_count, clamped to [0.01, 0.99] for validator compatibility. Step rewards in the stdout log are clamped to [0.0, 1.0].

Setup

uv sync

# Build Lean 4 artifacts (required before running tests)
cd lean && lake build

# Run all tests
uv run pytest -q

RBAC TypeScript tests require a Node runtime:

npm install -g tsx
uv run pytest tests/test_env_grader.py -q

Rust task tests require the Rust toolchain:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Inference Script

Run the structured inference script against any task:

API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=hf_... \
TASK_ID=rbac_auth \
uv run python inference.py

inference.py uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required [START] / [STEP] / [END] format:

[START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct
[STEP]  step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null
[STEP]  step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null
[STEP]  step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null
...
[END]   success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,...

When TASK_ID is unset, inference runs all 8 tasks sequentially.

LLM Playthrough

Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output:

uv run python scripts/play_task.py play --all --trace-dir traces-v2

scripts/play_task.py is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use --quiet when you only want the per-task summary.

Baseline

Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass --tasks to cover the full 8-task suite:

uv run python baseline/run_baseline.py --tasks \
  rbac_auth pricing_engine payment_saga \
  path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler

Task	Score
`rbac_auth`	`0.990`
`pricing_engine`	`0.990`
`payment_saga`	`0.990`
`path_canonicalizer`	`0.990`
`expression_eval`	`0.990`
`lru_cache`	`0.990`
`shortest_path`	`0.990`
`interval_scheduler`	`0.990`
overall	`0.990`

Scores clamp at 0.990 by design (the open-unit clamp keeps scores below 1.0 for validator compatibility).

Full Playthrough

Step-by-step replay for all tasks using bundled canonical solutions:

uv run python scripts/play_all_tasks.py

Export the Lean business-logic mirror for a function:

uv run python scripts/export_lean_business_logic.py \
  --task-id pricing_engine \
  --function-name finalPrice \
  --source-file tasks/pricing/source.js

Docker Build

uv run openenv build
docker run --rm -p 8000:8000 openenv-lean_migrate:latest

The Docker image:

Builds Lean 4 artifacts with lake build
Installs the Python package with uv sync
Includes nodejs + tsx for TypeScript task tests
Includes the Rust toolchain for Rust task tests

OpenEnv Push

openenv push

OpenEnv uses lean_migrate.server.app:app as the FastAPI entry point and lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment as the environment implementation.

Architecture

Verification Pipeline

When an agent submits code, it goes through a two-stage pipeline:

Verification IR (env/verification_ir.py): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in the Candidate namespace with native_decide-powered sample checks.
Lean Backend (lean_backend/): The generated Lean code is fed to a backend that compiles it against the formal spec.
- StdinBackend (default): Invokes lean --stdin, requires cd lean && lake build first.
- KiminaBackend: HTTP backend at KIMINA_URL (default http://localhost:12332). Select with LEAN_BACKEND=kimina.
- AeneasBackend: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select with LEAN_BACKEND=aeneas. It is optional and not part of the default grading path.

Key Data Flow for `submit`

SubmitAction.target_code
  → build_verification_ir()           # env/verification_ir.py
      → _parse_provenance()           # validates syntax, extracts arity
      → run_candidate_tests()         # env/grader.py — runs via subprocess
      → _render_lean_mirror()         # generates Lean code in Candidate namespace
  → backend.verify()                  # lean_backend/
      → Lean compiler output
  → EpisodeState updates verified/failing sets

Project Layout

lean/           Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...)
env/            Task registry, state machine, grader, verification IR
lean_backend/   Lean verification backends (stdin, kimina, aeneas)
tasks/          Source bundles shown to the agent (rbac, pricing, saga, path, ...)
baseline/       Deterministic baseline runner
server/         FastAPI app and OpenEnv environment wrapper
tests/          Smoke tests and Lean backend benchmark
scripts/        Helper scripts (playthrough, export, validation)

Environment Variables

Variable	Purpose	Default
`TASK_ID`	Which task to run	`rbac_auth`
`LEAN_BACKEND`	`stdin`, `kimina`, or `aeneas`	`stdin`
`KIMINA_URL`	Kimina server URL	`http://localhost:12332`
`LEAN_BIN` / `LEAN_CWD` / `LEAN_PATH`	Lean binary paths	elan defaults
`API_BASE_URL`	LLM API endpoint for inference.py	`https://router.huggingface.co/v1`
`MODEL_NAME`	LLM model for inference.py	`Qwen/Qwen2.5-72B-Instruct`
`HF_TOKEN`	Auth token for inference.py	—