	---
	title: LeanMigrate OpenEnv Environment
	emoji: 🧠
	colorFrom: green
	colorTo: pink
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv
	---

	# LeanMigrate

	LeanMigrate is a code-migration environment in which an AI agent migrates real business logic one function at a time, with Lean 4 as the verifier. The environment parses each submission with Python `ast` (Python targets), tree-sitter (TypeScript targets), or a Rust/cargo harness (Rust targets), builds an internal verification IR in the `Candidate` namespace, and checks it against the formal Lean specification using sample theorems discharged by `native_decide`.

	---

	## Tasks

	Eight tasks across four language pairs and three difficulty levels:

	| Task | Migration | Difficulty | Functions | Max Steps | Proof? |
	| --------------------- | ---------------------------- | ---------- | --------- | --------- | ------ |
	| `rbac_auth` | Python → TypeScript | Easy | 3 | 15 | No |
	| `pricing_engine` | JavaScript → Python | Medium | 5 | 25 | No |
	| `path_canonicalizer` | C → Rust | Medium | 4 | 25 | Yes |
	| `expression_eval` | C → Rust | Medium | 3 | 25 | Yes |
	| `payment_saga` | JavaScript → Python | Hard | 4 | 40 | Yes |
	| `lru_cache` | C++ → Rust | Hard | 4 | 40 | Yes |
	| `shortest_path` | C++ → Rust | Hard | 3 | 40 | Yes |
	| `interval_scheduler` | C++ → Rust | Hard | 3 | 40 | Yes |

	The six tasks marked "Yes" in the Proof? column (`path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, and `interval_scheduler`) require a formal Lean proof in addition to working code.

	---

	## Environment

	The environment exposes the standard OpenEnv-style server and client interfaces.

	### Task Selection

	Run one task per episode. Choose the task in one of two ways:

	* Set `TASK_ID` before launching the environment or running inference.
	* Pass `task_id` as a keyword argument when resetting the environment: `env.reset(task_id="pricing_engine")`.

	Valid task IDs: `rbac_auth`, `pricing_engine`, `payment_saga`, `path_canonicalizer`, `expression_eval`, `lru_cache`, `shortest_path`, `interval_scheduler`.
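
The two selection routes can be combined in a small resolver. The sketch below is illustrative only: `resolve_task_id` is a hypothetical helper, and the precedence shown (explicit `task_id` first, then `TASK_ID`, then the documented default `rbac_auth`) is an assumption rather than confirmed environment behaviour.

```python
import os

# Task IDs copied from the list above.
VALID_TASK_IDS = (
    "rbac_auth", "pricing_engine", "payment_saga", "path_canonicalizer",
    "expression_eval", "lru_cache", "shortest_path", "interval_scheduler",
)


def resolve_task_id(task_id=None, environ=None):
    """Pick a task: explicit argument, then TASK_ID, then the default.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    chosen = task_id or environ.get("TASK_ID") or "rbac_auth"
    if chosen not in VALID_TASK_IDS:
        raise ValueError(f"unknown task_id {chosen!r}")
    return chosen
```

Validating the ID up front turns a typo in `TASK_ID` into an immediate error instead of a failure partway through an episode.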

	### Actions

	`LeanMigrateAction` is a typed Pydantic action model with four variants:

	| Action | Required fields | Effect |
	| -------------- | -------------------------------------- | ---------------------------------------------- |
	| `inspect` | `function_name` | Returns the source fragment and Lean spec for one function |
	| `analyze_deps` | `function_name` | Returns the dependency graph and migration order |
	| `run_tests` | `function_name`, `candidate_code` | Executes the candidate against the task's sample cases |
	| `submit` | `function_name`, `target_code` or `lean_proof` | Runs the IR + Lean verification pipeline |

	Field roles:

	* `candidate_code` — working implementation to test locally before submission (`run_tests` only).
	* `target_code` — final implementation submitted for Lean verification.
	* `lean_proof` — required only for proof tasks (e.g. `path_canonicalizer`, `expression_eval`, `payment_saga`, `lru_cache`, `shortest_path`, `interval_scheduler`).
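
The required-field rules in the table can be expressed as a small validator. This is a minimal sketch that uses a standard-library dataclass as a stand-in for the real Pydantic model; `ActionSketch` and `missing_fields` are hypothetical names, and the real `LeanMigrateAction` may validate differently.

```python
from dataclasses import dataclass
from typing import Optional

# Required fields per action type, taken from the table above.
REQUIRED = {
    "inspect": ("function_name",),
    "analyze_deps": ("function_name",),
    "run_tests": ("function_name", "candidate_code"),
    "submit": ("function_name",),  # plus target_code or lean_proof, checked below
}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for LeanMigrateAction (not the real model)."""

    type: str
    function_name: Optional[str] = None
    candidate_code: Optional[str] = None
    target_code: Optional[str] = None
    lean_proof: Optional[str] = None

    def missing_fields(self):
        """Return the names of required fields that are absent or empty."""
        if self.type not in REQUIRED:
            raise ValueError(f"unknown action type: {self.type!r}")
        missing = [name for name in REQUIRED[self.type] if not getattr(self, name)]
        if self.type == "submit" and not (self.target_code or self.lean_proof):
            missing.append("target_code or lean_proof")
        return missing
```

For example, `ActionSketch("run_tests", function_name="findRole").missing_fields()` returns `["candidate_code"]`.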

	### Observation

	`LeanMigrateObservation` contains:

	* Episode metadata (`episode_id`, `task_id`, `episode_step`, `max_steps`)
	* Task metadata (`source_language`, `target_language`, `source_files`)
	* Verification progress (`verified`, `remaining`, `failing`, `progress`)
	* Feedback from the previous step (`last_action_type`, `last_action_feedback`, `last_step_reward`)
	* Rich grading breakdown (`reward_details`) with `tests_passed`, `tests_total`, `proof_compiled`, `lean_error`

	### Rewards

	Rewards are shaped to give signal across the full trajectory:

	| Action | Reward |
	| -------------- | ------------------------------------------------------------------ |
	| `inspect` | +0.05 shaping reward |
	| `analyze_deps` | +0.05 shaping reward |
	| `run_tests` | +0.10 on all cases passing; −0.01 × failures on partial/full fail |
	| `submit` pass | +1.0 / `total_function_count` |
	| `submit` fail | −0.05 on IR rejection or Lean rejection |

	Episode score (`progress`) is `verified_count / total_function_count`, clamped to `[0.01, 0.99]` for validator compatibility. Step rewards in the stdout log are clamped to `[0.0, 1.0]`.
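
The reward arithmetic above can be sketched in a few lines. The function names here (`step_reward`, `episode_score`, `logged_reward`) are hypothetical and chosen for illustration; only the numeric rules come from the table and the clamping description.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def step_reward(action, total_functions, passed=True, failures=0):
    """Raw shaped reward for one step, following the table above."""
    if action in ("inspect", "analyze_deps"):
        return 0.05
    if action == "run_tests":
        return 0.10 if failures == 0 else -0.01 * failures
    if action == "submit":
        return 1.0 / total_functions if passed else -0.05
    raise ValueError(f"unknown action: {action!r}")


def episode_score(verified_count, total_functions):
    """Episode progress, clamped to [0.01, 0.99]."""
    return clamp(verified_count / total_functions, 0.01, 0.99)


def logged_reward(raw_reward):
    """Step reward as written to the stdout log, clamped to [0.0, 1.0]."""
    return clamp(raw_reward, 0.0, 1.0)
```

For `rbac_auth` (3 functions), a passing `submit` yields 1/3, and verifying all three functions gives a clamped score of 0.99. A failed `submit` has a raw reward of −0.05, which the log clamp reports as 0.0.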

	---

	## Setup

	```bash
	uv sync

	# Build Lean 4 artifacts (required before running tests)
	cd lean && lake build

	# Run all tests
	uv run pytest -q
	```

	RBAC TypeScript tests require a Node runtime:

	```bash
	npm install -g tsx
	uv run pytest tests/test_env_grader.py -q
	```

	Rust task tests require the Rust toolchain:

	```bash
	curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
	```
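
Before running the test suite, it can help to confirm that the command-line tools mentioned in this section are on `PATH`. The following is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper and is not part of the repository.

```python
import shutil

# Tools named in this README and what they are needed for.
TOOLS = {
    "uv": "Python environment and test runner",
    "lake": "Lean 4 build (lake build)",
    "tsx": "TypeScript (RBAC) task tests",
    "cargo": "Rust task tests",
}


def missing_tools(which=shutil.which):
    """Return the tools from TOOLS that cannot be found on PATH."""
    return [name for name in TOOLS if which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"missing: {name} ({TOOLS[name]})")
```

The `which` parameter exists only so the lookup can be swapped out in tests.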

	---

	## Inference Script

	Run the structured inference script against any task:

	```bash
	API_BASE_URL=https://router.huggingface.co/v1 \
	MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
	HF_TOKEN=hf_... \
	TASK_ID=rbac_auth \
	uv run python inference.py
	```

	`inference.py` uses the OpenAI client (pointing at the HuggingFace router by default) and emits structured stdout in the required `[START]` / `[STEP]` / `[END]` format:

	```
	[START] task=rbac_auth env=lean_migrate model=Qwen/Qwen2.5-72B-Instruct
	[STEP] step=1 action={"type":"inspect","function_name":"findRole"} reward=0.05 done=false error=null
	[STEP] step=2 action={"type":"run_tests","function_name":"findRole"} reward=0.10 done=false error=null
	[STEP] step=3 action={"type":"submit","function_name":"findRole"} reward=0.33 done=false error=null
	...
	[END] success=true steps=9 score=0.990 rewards=0.05,0.10,0.33,...
	```

	When `TASK_ID` is unset, inference runs all 8 tasks sequentially.
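
For post-processing runs, the `[STEP]` lines can be parsed back into structured records. This sketch assumes the exact field order shown in the sample log and that `action` is a single-line JSON object; `parse_step` is a hypothetical helper, not part of `inference.py`.

```python
import json
import re

# Pattern for one [STEP] line, assuming the field order shown in the sample log.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step(line):
    """Parse one [STEP] log line into a dict, or return None if it does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }
```

Lines that are not `[STEP]` records, such as `[START]` and `[END]`, return `None` and can be filtered out by the caller.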

	## LLM Playthrough

	Run the richer LLM-driven task runner across the full suite when you want step-by-step traces and agent debugging output:

	```bash
	uv run python scripts/play_task.py play --all --trace-dir traces-v2
	```

	`scripts/play_task.py` is best-effort and uses the configured model, so it can still fail on hard tasks if the model guesses the wrong API shape or proof. Use `--quiet` when you only want the per-task summary.

	---

	## Baseline

	Run the deterministic baseline (canonical Lean fragments). The script defaults to the original 3-task baseline, and you can pass `--tasks` to cover the full 8-task suite:

	```bash
	uv run python baseline/run_baseline.py --tasks \
	rbac_auth pricing_engine payment_saga \
	path_canonicalizer expression_eval lru_cache shortest_path interval_scheduler
	```

	| Task | Score |
	| --------------------- | ------- |
	| `rbac_auth` | `0.990` |
	| `pricing_engine` | `0.990` |
	| `payment_saga` | `0.990` |
	| `path_canonicalizer` | `0.990` |
	| `expression_eval` | `0.990` |
	| `lru_cache` | `0.990` |
	| `shortest_path` | `0.990` |
	| `interval_scheduler` | `0.990` |
	| overall | `0.990` |

	Scores top out at 0.990 by design: `progress` is clamped to `[0.01, 0.99]`, which keeps every score strictly inside the open interval (0, 1) for validator compatibility.

	---

	## Full Playthrough

	Step-by-step replay for all tasks using bundled canonical solutions:

	```bash
	uv run python scripts/play_all_tasks.py
	```

	Export the Lean business-logic mirror for a function:

	```bash
	uv run python scripts/export_lean_business_logic.py \
	--task-id pricing_engine \
	--function-name finalPrice \
	--source-file tasks/pricing/source.js
	```

	---

	## Docker Build

	```bash
	uv run openenv build
	docker run --rm -p 8000:8000 openenv-lean_migrate:latest
	```

	The Docker image:
	* Builds Lean 4 artifacts with `lake build`
	* Installs the Python package with `uv sync`
	* Includes `nodejs` + `tsx` for TypeScript task tests
	* Includes the Rust toolchain for Rust task tests

	---

	## OpenEnv Push

	```bash
	openenv push
	```

	OpenEnv uses `lean_migrate.server.app:app` as the FastAPI entry point and `lean_migrate.server.lean_migrate_environment:LeanMigrateEnvironment` as the environment implementation.

	---

	## Architecture

	### Verification Pipeline

	When an agent submits code, it goes through a two-stage pipeline:

	1. Verification IR (`env/verification_ir.py`): Parses submitted code (Python AST / TypeScript tree-sitter / Rust cargo), runs sample cases against the oracle, computes a behavior digest, and generates a Lean mirror file in the `Candidate` namespace with `native_decide`-powered sample checks.

	2. Lean Backend (`lean_backend/`): The generated Lean code is passed to a backend that compiles it against the formal spec. Three backends are available:
	   - `StdinBackend` (default): Invokes `lean --stdin`; requires `cd lean && lake build` first.
	   - `KiminaBackend`: HTTP backend at `KIMINA_URL` (default `http://localhost:12332`). Select with `LEAN_BACKEND=kimina`.
	   - `AeneasBackend`: Experimental backend that uses Aeneas to extract Lean proofs from Rust code. Select with `LEAN_BACKEND=aeneas`. It is optional and not part of the default grading path.
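
Backend selection by environment variable could look roughly like the sketch below. It only resolves settings and does not construct the real backend classes; `backend_settings` is a hypothetical helper, and rejecting unknown values with an error is an assumption.

```python
import os

# Backend names accepted by LEAN_BACKEND, as listed above.
KNOWN_BACKENDS = ("stdin", "kimina", "aeneas")


def backend_settings(environ=None):
    """Resolve the Lean backend name (and Kimina URL) from the environment."""
    environ = os.environ if environ is None else environ
    name = environ.get("LEAN_BACKEND", "stdin")
    if name not in KNOWN_BACKENDS:
        raise ValueError(f"LEAN_BACKEND must be one of {KNOWN_BACKENDS}, got {name!r}")
    settings = {"backend": name}
    if name == "kimina":
        # Documented default for the Kimina HTTP backend.
        settings["url"] = environ.get("KIMINA_URL", "http://localhost:12332")
    return settings
```

With no variables set, this resolves to the `stdin` backend, matching the documented default.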

	### Key Data Flow for `submit`

	```
	SubmitAction.target_code
	  → build_verification_ir()      # env/verification_ir.py
	      → _parse_provenance()      # validates syntax, extracts arity
	      → run_candidate_tests()    # env/grader.py, runs via subprocess
	      → _render_lean_mirror()    # generates Lean code in Candidate namespace
	  → backend.verify()             # lean_backend/
	      → Lean compiler output
	  → EpisodeState updates verified/failing sets
	```
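
To make the mirror-generation step more concrete, the sketch below renders one `native_decide` check per observed sample inside a `Candidate` namespace. It is not the project's `_render_lean_mirror()`: the function name `render_mirror_sketch`, the theorem naming scheme, and the `Spec.double` example are all assumptions, and the real generated file may be structured quite differently.

```python
def render_mirror_sketch(function_name, spec_function, samples):
    """Render Lean text with one native_decide check per sample.

    `samples` is a list of (argument_text, observed_output_text) pairs,
    already formatted as Lean terms. All names are illustrative.
    """
    lines = ["namespace Candidate", ""]
    for index, (args, observed) in enumerate(samples):
        lines.append(
            f"theorem {function_name}_sample_{index} : "
            f"{spec_function} {args} = {observed} := by native_decide"
        )
    lines += ["", "end Candidate"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical spec function `Spec.double` with two observed samples.
    print(render_mirror_sketch("double", "Spec.double", [("2", "4"), ("0", "0")]))
```

Each generated theorem states that the spec, applied to a sample input, equals the output observed from the candidate. If the candidate's behaviour diverges from the spec on any sample, the corresponding theorem fails to compile and the backend reports a Lean error.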

	### Project Layout

	```
	lean/          Lean 4 specifications (AuthSpec, PricingSpec, SagaSpec, PathSpec, ...)
	env/           Task registry, state machine, grader, verification IR
	lean_backend/  Lean verification backends (stdin, kimina, aeneas)
	tasks/         Source bundles shown to the agent (rbac, pricing, saga, path, ...)
	baseline/      Deterministic baseline runner
	server/        FastAPI app and OpenEnv environment wrapper
	tests/         Smoke tests and Lean backend benchmark
	scripts/       Helper scripts (playthrough, export, validation)
	```

	### Environment Variables

	| Variable | Purpose | Default |
	| -------------------------------------- | --------------------------------- | ------------------------------ |
	| `TASK_ID` | Which task to run | `rbac_auth` |
	| `LEAN_BACKEND` | `stdin`, `kimina`, or `aeneas` | `stdin` |
	| `KIMINA_URL` | Kimina server URL | `http://localhost:12332` |
	| `LEAN_BIN` / `LEAN_CWD` / `LEAN_PATH` | Lean binary paths | elan defaults |
	| `API_BASE_URL` | LLM API endpoint for inference.py | `https://router.huggingface.co/v1` |
	| `MODEL_NAME` | LLM model for inference.py | `Qwen/Qwen2.5-72B-Instruct` |
	| `HF_TOKEN` | Auth token for inference.py | (none) |