	# LeanMigrate Task Reference

	LeanMigrate models a real engineering workflow: a developer must migrate production business logic from one language to another, then prove correctness against a Lean 4 formal specification. Each task is structured as a small function dependency graph. The agent must migrate each function in dependency order, run sample tests, and pass Lean verification before moving to the next.

	---

	## Overview

	| # | Task ID | Migration | Difficulty | Functions | Proof? |
	| -- | --------------------- | -------------------- | ---------- | --------- | ------ |
	| 1 | `rbac_auth` | Python → TypeScript | Easy | 3 | No |
	| 2 | `pricing_engine` | JavaScript → Python | Medium | 5 | No |
	| 3 | `path_canonicalizer` | C → Rust | Medium | 4 | Yes |
	| 4 | `expression_eval` | C → Rust | Medium | 3 | Yes |
	| 5 | `payment_saga` | JavaScript → Python | Hard | 4 | Yes |
	| 6 | `lru_cache` | C++ → Rust | Hard | 4 | Yes |
	| 7 | `shortest_path` | C++ → Rust | Hard | 3 | Yes |
	| 8 | `interval_scheduler` | C++ → Rust | Hard | 3 | Yes |

	---

	## Task 1 — RBAC Authorization Engine (`rbac_auth`)

	Difficulty: Easy | Migration: Python → TypeScript | Max Steps: 15

	### What It Is

	A role-based access control (RBAC) engine that models permissions as a directed inheritance graph. Roles can inherit permissions from parent roles, enabling hierarchical access policies (e.g. `admin` inherits from `editor` which inherits from `viewer`).

	### Why It Matters

	RBAC is the access control backbone of most enterprise software. Migrating it correctly and proving its behavior is critical: a bug in permission checking is a security vulnerability.

	### Functions to Migrate

	| Function | Description | Depends On |
	| -------------------- | ----------------------------------------------------------------- | ------------------------------------ |
	| `findRole` | Look up a role by name from the role list | — |
	| `hasDirectPermission` | Check if a role has a specific resource/action permission directly | — |
	| `canAccess` | Check permission with role inheritance traversal, depth limit 5 | `findRole`, `hasDirectPermission` |

	### What Makes It Interesting

	The inheritance traversal is bounded (depth ≤ 5) to prevent cycles from causing infinite loops. The Lean spec expresses this via a `depth : Nat` countdown parameter. Agents must replicate this exact bounding behavior in TypeScript, using `function` declarations (not arrow functions) so tree-sitter can extract them for the verification IR.

	### Lean Spec Excerpt

	```lean
	def canAccess (roles : RoleMap) (roleName resource action : String) (depth : Nat := 5) : Bool :=
	  match depth, findRole roles roleName with
	  | 0, _ => false
	  | _, none => false
	  | n + 1, some role =>
	    hasDirectPermission role resource action ||
	    role.inherits.any (fun parentName =>
	      canAccess roles parentName resource action n)
	```
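
The depth countdown in the spec can be mirrored in the Python source language as a rough reference model. This is only a sketch: the `RoleMap` shape (role name mapped to a set of `(resource, action)` pairs plus a list of parent names) and the name `can_access` are assumptions made for illustration, not the task's actual data model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical role shape: name -> (direct permissions, parent role names).
Role = Tuple[Set[Tuple[str, str]], List[str]]
RoleMap = Dict[str, Role]


def can_access(roles: RoleMap, role_name: str, resource: str, action: str,
               depth: int = 5) -> bool:
    """Bounded inheritance check that counts `depth` down like the Lean spec."""
    if depth == 0:
        return False  # depth budget exhausted (the `0, _` case)
    role = roles.get(role_name)
    if role is None:
        return False  # unknown role (the `_, none` case)
    permissions, parents = role
    if (resource, action) in permissions:
        return True
    # Recurse into each parent with one less unit of depth budget.
    return any(can_access(roles, parent, resource, action, depth - 1)
               for parent in parents)


roles: RoleMap = {
    "viewer": ({("doc", "read")}, []),
    "editor": ({("doc", "write")}, ["viewer"]),
    "admin": (set(), ["editor"]),
    "loop_a": (set(), ["loop_b"]),  # deliberate inheritance cycle
    "loop_b": (set(), ["loop_a"]),
}
```

With this sample data, `admin` reaches `viewer`'s read permission through two inheritance hops, while the `loop_a`/`loop_b` cycle terminates with `False` once the depth budget runs out.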

	### Scoring

	* 3 functions × (1/3) reward each on successful Lean verification
	* No proof obligation — functional correctness only

	---

	## Task 2 — E-Commerce Pricing Engine (`pricing_engine`)

	Difficulty: Medium | Migration: JavaScript → Python | Max Steps: 25

	### What It Is

	A JavaScript pricing engine that computes the final price of an e-commerce order in integer cents (no floating point). The engine handles line item subtotals, coupon discounts capped at 50% of subtotal, loyalty point discounts capped at 10% of subtotal, and region-based tax rates in basis points.

	### Why It Matters

	Pricing bugs cost real money. Integer arithmetic and correct discount-capping semantics are notoriously tricky to migrate across languages. The Lean spec pins the exact integer semantics so the agent cannot accidentally introduce floating-point rounding.

	### Functions to Migrate

	| Function | Description | Depends On |
	| --------------- | --------------------------------------------------------- | --------------------------------------------- |
	| `subtotal` | Sum of unit price × quantity for all line items | — |
	| `taxRateBps` | Tax rate in basis points for a given region ID | — |
	| `couponDiscount` | Coupon discount capped at 50% of subtotal | `subtotal` |
	| `loyaltyDiscount` | Loyalty points discount capped at 10% of subtotal | `subtotal` |
	| `finalPrice` | Final price = subtotal − discounts + tax | `subtotal`, `couponDiscount`, `loyaltyDiscount`, `taxRateBps` |

	### What Makes It Interesting

	The discount cap logic uses integer division (`//` in Python), not `round()` or `ceil()`. Agents that reach for floating-point will fail Lean verification. The dependency graph is shallow rather than strictly linear: `subtotal` feeds both discount functions, `taxRateBps` is independent, and `finalPrice` depends on all four. This gives a natural migration order that the `analyze_deps` action surfaces.

	### Lean Spec Excerpt

	```lean
	def couponDiscount (order : Order) : Int :=
	  let sub := subtotal order
	  let rawDiscount := order.coupons.foldl (fun total coupon =>
	    total + (sub * coupon.discountPercent / 100)) 0
	  min rawDiscount (sub / 2)
	```
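
A Python rendering of the excerpt might look like the sketch below. The order representation (line items as `(unit_price_cents, quantity)` pairs and coupons as integer percentages) is an assumption made for illustration, and the sketch assumes non-negative amounts, where Python's floor division agrees with the other common integer rounding conventions.

```python
from typing import List, Tuple

# Hypothetical order shape: line items as (unit_price_cents, quantity).
LineItem = Tuple[int, int]


def subtotal(items: List[LineItem]) -> int:
    """Sum of unit price times quantity, in integer cents."""
    return sum(price * qty for price, qty in items)


def coupon_discount(items: List[LineItem], coupon_percents: List[int]) -> int:
    """Sum per-coupon integer discounts, then cap at 50% of the subtotal."""
    sub = subtotal(items)
    raw = 0
    for percent in coupon_percents:
        raw += sub * percent // 100  # integer division per coupon, no floats
    return min(raw, sub // 2)        # cap at half the subtotal


items = [(1999, 1), (500, 3)]  # subtotal is 3499 cents
```

For this sample order a single 10% coupon yields 349 cents, whereas a floating-point variant such as `round(3499 * 0.10)` gives 350. Two 30% coupons add up to 2098 cents and are capped at 1749.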

	### Scoring

	* 5 functions × (1/5) reward each on successful Lean verification
	* No proof obligation — functional correctness only

	---

	## Task 3 — Path Canonicalizer (`path_canonicalizer`)

	Difficulty: Medium | Migration: C → Rust | Max Steps: 25

	### What It Is

	A C path normalization library migrated to Rust. Paths are represented as lists of string segments (not raw strings). The normalizer removes `.` segments and resolves `..` clamped at the root — you cannot go above `/`.

	### Why It Matters

	Path canonicalization is a security-critical operation. Incorrect `..` resolution leads to path traversal vulnerabilities. The task requires a Lean proof of idempotency: normalizing an already-normalized path produces the same path.

	### Functions to Migrate

	| Function | Description | Depends On |
	| ------------------ | -------------------------------------------------------- | --------------- |
	| `normalizePath` | Remove `.`, resolve `..` clamped at root | — |
	| `joinPaths` | Join two segment lists: normalize(base ++ rel) | `normalizePath` |
	| `pathDepth` | Count the number of segments in a path | — |
	| `canonicalizeProof` | Lean proof that `normalize(normalize(p)) = normalize(p)` | `normalizePath` |

	### What Makes It Interesting

	This is the first C → Rust task. Agents must write Rust with `serde_json` deserialization. The proof obligation (idempotency) is the first formal proof task — agents must reason about the list-based normalization algorithm in Lean rather than just implementing it.

	### Lean Spec Excerpt

	```lean
	def normalizePath (p : PathSpec.Path) : PathSpec.Path := normalizeAux [] p

	-- Proof obligation: idempotency
	-- ∀ p, normalizePath (normalizePath p) = normalizePath p
	```
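
The excerpt does not show `normalizeAux`, so the following Python sketch is one plausible reading of the behavior described above (drop `.`, pop on `..`, ignore `..` at the root). The function names and the stack-based approach are assumptions, not the spec's definition.

```python
from typing import List


def normalize_path(segments: List[str]) -> List[str]:
    """Drop '.' segments and resolve '..' without ever going above the root."""
    out: List[str] = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if out:
                out.pop()  # step up one level; at the root this is a no-op
            continue
        out.append(seg)
    return out


def join_paths(base: List[str], rel: List[str]) -> List[str]:
    """Concatenate the segment lists, then normalize the result."""
    return normalize_path(base + rel)


samples = [["a", ".", "b", "..", "c"], ["..", "..", "a"], [], ["x", "..", ".."]]
```

Checking `normalize_path(normalize_path(p)) == normalize_path(p)` over `samples` exercises idempotency on a few inputs only; the Lean proof obligation covers every path.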

	### Scoring

	* 3 runtime functions × (1/4) reward each on Lean verification
	* 1 proof × (1/4) reward on proof acceptance
	* Proof obligation: `weight_proof = 0.2`

	---

	## Task 4 — Expression Evaluator (`expression_eval`)

	Difficulty: Medium | Migration: C → Rust | Max Steps: 25

	### What It Is

	A C recursive-descent expression evaluator migrated to Rust. Expressions are binary trees with literal leaves and binary operator nodes (`+`, `-`, `*`, `/`). Division by zero propagates as `None` (Rust) / `none` (Lean).

	### Why It Matters

	Safe arithmetic evaluation is a common primitive in configuration languages, rule engines, and calculators. The Lean spec formalizes the `None`-propagation contract so the agent cannot silently swallow divide-by-zero.

	### Functions to Migrate

	| Function | Description | Depends On |
	| --------------- | ---------------------------------------------------------- | ----------- |
	| `evalBinOp` | Evaluate a binary operation; return `None` for `/ 0` | — |
	| `evalExpr` | Recursively evaluate an expression tree; propagate `None` | `evalBinOp` |
	| `divisionProof` | Lean proof that `evalBinOp .Div a 0 = none` for all `a` | `evalBinOp` |

	### What Makes It Interesting

	The proof obligation is compact and tractable (`by simp [evalBinOp]`), making it a good first proof for agents exploring the proof mechanics. The expression tree is passed as a `serde_json::Value` from the test harness, requiring careful recursive deserialization.

	### Lean Spec Excerpt

	```lean
	def evalBinOp (op : ExprSpec.Op) (a b : Int) : Option Int :=
	  match op with
	  | .Add => some (a + b)
	  | .Sub => some (a - b)
	  | .Mul => some (a * b)
	  | .Div => if b = 0 then none else some (a / b)

	-- Proof obligation: ∀ a, evalBinOp .Div a 0 = none
	```
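
The `None`-propagation contract can be illustrated with a small Python model. The tree encoding here (an `int` literal or an `(op, left, right)` tuple) is hypothetical, and the use of floor division for `/` is an assumption: the rounding of negative quotients should be checked against the Lean spec and the Rust target.

```python
from typing import Optional, Union

# Hypothetical encoding: an int literal, or a tuple (op, left, right).
Expr = Union[int, tuple]


def eval_bin_op(op: str, a: int, b: int) -> Optional[int]:
    """Evaluate one binary operation; division by zero yields None."""
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        return None if b == 0 else a // b  # rounding mode is an assumption
    raise ValueError(f"unknown operator: {op!r}")


def eval_expr(expr: Expr) -> Optional[int]:
    """Recursively evaluate a tree; a None in any subtree propagates upward."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    lhs = eval_expr(left)
    rhs = eval_expr(right)
    if lhs is None or rhs is None:
        return None
    return eval_bin_op(op, lhs, rhs)
```

For example, `eval_expr(("+", 1, ("/", 4, 0)))` returns `None` because the failed division in the right subtree is never swallowed by the enclosing addition.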

	### Scoring

	* 2 runtime functions × (1/3) reward each
	* 1 proof × (1/3) reward on proof acceptance
	* Proof obligation: `weight_proof = 0.2`

	---

	## Task 5 — Payment Saga State Machine (`payment_saga`)

	Difficulty: Hard | Migration: JavaScript → Python | Max Steps: 40

	### What It Is

	A JavaScript Express.js payment saga migrated to Python. A saga is a compensating transaction pattern: it progresses through states (`Idle → Reserved → Authorized → Captured → Settled`) and supports rollback transitions (`CompensateCapture → CompensateAuthorize → CompensateReserve → Compensated`). The agent must also prove the no-double-charge theorem: the happy path reaches `Settled`, and `isCharged` returns `true` for that final state.

	### Why It Matters

	Payment systems are among the most heavily regulated and correctness-critical software. The no-double-charge theorem is not just a unit test: it is a machine-checked guarantee about the happy path, namely that the saga ends in `Settled` and that this final state counts as charged.

	### Functions to Migrate

	| Function | Description | Depends On |
	| --------------------- | ---------------------------------------------------- | ---------------------------------- |
	| `transition` | Pure state transition for one event | — |
	| `runSaga` | Fold a sequence of events into the final saga state | `transition` |
	| `isCharged` | Return `true` for `Captured` or `Settled` | — |
	| `no_double_charge_proof` | Lean proof of happy-path no-double-charge theorem | `transition`, `runSaga`, `isCharged` |

	### What Makes It Interesting

	This is the first proof obligation in the hard tier. The proof must reference `happyPath` and the pre-proved lemma `happy_path_settles` from the `SagaSpec` module. Agents that try to `native_decide` the entire proof will hit the timeout; the intended approach uses `constructor`, `exact`, `rw`, and `simp [isCharged]`.

	### Lean Spec Excerpt

	```lean
	theorem no_double_charge :
	    runSaga happyPath = .Settled ∧ isCharged (runSaga happyPath) = true := by
	  constructor
	  · exact happy_path_settles
	  · rw [happy_path_settles]
	    simp [isCharged]
	```
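
A Python sketch of the runtime side could fold events through a transition table. The event names (`reserve`, `authorize`, `capture`, `settle`) and the rule that an unrecognized event leaves the state unchanged are assumptions, since the document does not list the spec's events. The compensation transitions are omitted for brevity.

```python
from functools import reduce
from typing import Dict, Iterable, Tuple

# Hypothetical happy-path transition table: (state, event) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Idle", "reserve"): "Reserved",
    ("Reserved", "authorize"): "Authorized",
    ("Authorized", "capture"): "Captured",
    ("Captured", "settle"): "Settled",
}


def transition(state: str, event: str) -> str:
    """Pure single-step transition; unknown pairs keep the state (assumed)."""
    return TRANSITIONS.get((state, event), state)


def run_saga(events: Iterable[str], start: str = "Idle") -> str:
    """Fold the event sequence into the final saga state."""
    return reduce(transition, events, start)


def is_charged(state: str) -> bool:
    """True only for the two states in which money has been taken."""
    return state in ("Captured", "Settled")


HAPPY_PATH = ["reserve", "authorize", "capture", "settle"]
```

Running `run_saga(HAPPY_PATH)` ends in `"Settled"`, which `is_charged` reports as charged; this is the runtime analogue of the two conjuncts in the theorem.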

	### Scoring

	* 3 runtime functions × (1/4) reward each
	* 1 proof × (1/4) reward on proof acceptance
	* Proof obligation: `weight_proof = 0.3`

	---

	## Task 6 — LRU Cache (`lru_cache`)

	Difficulty: Hard | Migration: C++ → Rust | Max Steps: 40

	### What It Is

	A C++ LRU cache (backed by `std::list` + `unordered_map`) migrated to a Rust `Vec<(u64, u64)>` where the front is MRU. The agent must implement eviction, put (insert/update with MRU promotion), get (lookup with MRU promotion), and prove the capacity bound.

	### Why It Matters

	LRU caches are ubiquitous in systems programming. The Lean spec pins the capacity invariant formally, ensuring the implementation never silently exceeds its capacity — a property that is hard to test exhaustively but trivial to prove once the algorithm is correct.

	### Functions to Migrate

	| Function | Description | Depends On |
	| ------------ | ---------------------------------------------------------------- | ---------- |
	| `lruEvict` | Trim cache to `cap` entries (keep front = MRU, drop LRU tail) | — |
	| `lruPut` | Insert/update key-val; move to MRU front; evict if over capacity | `lruEvict` |
	| `lruGet` | Lookup key; on hit, move to MRU front; return `(value, cache)` | — |
	| `lruPutProof` | Lean proof that `(lruPut cache cap k v).length ≤ cap` | `lruPut` |

	### What Makes It Interesting

	The Rust representation (`Vec<(u64,u64)>`) differs structurally from the C++ `std::list + unordered_map`. Agents must re-derive the algorithm from the Lean spec, not just translate the C++ line by line. The proof obligation is an inequality bound, requiring `omega` or `simp` with list length lemmas.

	### Lean Spec Excerpt

	```lean
	def lruPut (cache : LruSpec.CacheState) (cap : Nat) (key val : Nat) : LruSpec.CacheState :=
	  ((key, val) :: cache.filter (fun p => p.1 ≠ key)).take cap

	-- Proof obligation: ∀ cache cap k v, (lruPut cache cap k v).length ≤ cap
	```
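
The list-based model translates almost directly into Python, which makes the capacity bound easy to observe. In the sketch below `lru_put` follows the excerpt, while the miss behavior of `lru_get` (return `None` and the unchanged cache) is an assumption because the spec's `lruGet` is not shown.

```python
from typing import List, Optional, Tuple

Cache = List[Tuple[int, int]]  # front of the list is most recently used


def lru_evict(cache: Cache, cap: int) -> Cache:
    """Keep the first `cap` entries and drop the least recently used tail."""
    return cache[:cap]


def lru_put(cache: Cache, cap: int, key: int, val: int) -> Cache:
    """Remove any old entry for `key`, push the new pair to the front, trim."""
    rest = [pair for pair in cache if pair[0] != key]
    return lru_evict([(key, val)] + rest, cap)


def lru_get(cache: Cache, key: int) -> Tuple[Optional[int], Cache]:
    """On a hit, return the value and the cache with the entry moved to front."""
    for k, v in cache:
        if k == key:
            rest = [pair for pair in cache if pair[0] != key]
            return v, [(k, v)] + rest
    return None, cache  # miss behavior is an assumption


cache: Cache = []
for k, v in [(1, 10), (2, 20), (3, 30)]:
    cache = lru_put(cache, 2, k, v)
```

After the three inserts with capacity 2, `cache` holds `[(3, 30), (2, 20)]`: key 1 was the least recently used entry and has been evicted.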

	### Scoring

	* 3 runtime functions × (1/4) reward each
	* 1 proof × (1/4) reward on proof acceptance
	* Proof obligation: `weight_proof = 0.2`

	---

	## Task 7 — Shortest Path / Dijkstra (`shortest_path`)

	Difficulty: Hard | Migration: C++ → Rust | Max Steps: 40

	### What It Is

	A C++ Dijkstra (Bellman-Ford relaxation loop) migrated to Rust. Input is a list of weighted directed edges `(from, to, weight)`. Output is a `Vec<Option<u64>>` of length `n` where `None` means unreachable. The agent must also prove that the self-distance `shortestDist graph n src src = some 0` when `src < n`.

	### Why It Matters

	Shortest-path algorithms appear in routing, logistics, game AI, and network analysis. The Lean spec uses a Bellman-Ford style relaxation (not a priority queue), making the formal model tractable while the Rust implementation can use any correct strategy.

	### Functions to Migrate

	| Function | Description | Depends On |
	| -------------- | -------------------------------------------------------------------- | ---------- |
	| `dijkstra` | Single-source shortest paths; returns `Vec<Option<u64>>` of length n | — |
	| `shortestDist` | Shortest distance from `src` to `dst`, or `None` if unreachable | `dijkstra` |
	| `relaxProof` | Lean proof that `shortestDist graph n src src = some 0` | `shortestDist` |

	### What Makes It Interesting

	The Lean spec models Dijkstra as iterated Bellman-Ford relaxation — `n-1` passes over all edges. Agents writing Rust can use a proper priority queue for efficiency, but the formal model in Lean is the ground truth for correctness. The proof requires reasoning about the initial distance vector and the `src < n` guard.

	### Lean Spec Excerpt

	```lean
	def dijkstra (graph : ShortestPathSpec.Graph) (n src : Nat) : List (Option Nat) :=
	  let init := (List.range n).map (fun i => if i == src then some 0 else none)
	  (List.range (n - 1)).foldl (fun d _ => relaxPass d graph) init

	-- Proof obligation: ∀ graph n src, src < n → shortestDist graph n src src = some 0
	```
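
The relaxation model can be sketched in Python as follows. The spec's `relaxPass` is not shown, so the version here (relax every edge once, skipping edges whose endpoints fall outside `0..n-1`, and reusing distances updated earlier in the same pass) is an assumption; weights are taken to be non-negative.

```python
from typing import List, Optional, Tuple

Edge = Tuple[int, int, int]  # (from, to, weight), weight assumed >= 0


def relax_pass(dist: List[Optional[int]], edges: List[Edge]) -> List[Optional[int]]:
    """Relax every edge once and return the updated distance vector."""
    new = list(dist)
    n = len(new)
    for u, v, w in edges:
        if u >= n or v >= n or new[u] is None:
            continue  # out-of-range endpoint or unreachable source of the edge
        if new[v] is None or new[u] + w < new[v]:
            new[v] = new[u] + w
    return new


def dijkstra(edges: List[Edge], n: int, src: int) -> List[Optional[int]]:
    """n - 1 relaxation passes starting from distance 0 at `src`."""
    dist: List[Optional[int]] = [0 if i == src else None for i in range(n)]
    for _ in range(n - 1):  # empty range when n <= 1
        dist = relax_pass(dist, edges)
    return dist


def shortest_dist(edges: List[Edge], n: int, src: int, dst: int) -> Optional[int]:
    """Distance from src to dst, or None if unreachable or out of range."""
    dist = dijkstra(edges, n, src)
    return dist[dst] if dst < n else None


edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)]
```

On this sample graph with `n = 5`, the self-distance is 0 for every valid source. That matches the proof obligation on these inputs, although only the Lean proof establishes it in general.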

	### Scoring

	* 2 runtime functions × (1/3) reward each
	* 1 proof × (1/3) reward on proof acceptance
	* Proof obligation: `weight_proof = 0.2`

	---

	## Task 8 — Interval Scheduler (`interval_scheduler`)

	Difficulty: Hard | Migration: C++ → Rust | Max Steps: 40

	### What It Is

	A C++ interval scheduling library migrated to Rust. Two algorithms: (1) merge overlapping intervals (sort by start, scan and merge), and (2) greedy activity selection (sort by end time, pick non-overlapping intervals to maximize count). The agent must also prove the merge output is sorted and non-overlapping.

	### Why It Matters

	Interval scheduling is the foundation of calendar systems, resource allocators, and database range scans. The merge-then-prove pattern is a canonical example of verifying a sorting + merging algorithm — a class of problems where off-by-one errors are common and hard to catch with finite tests.

	### Functions to Migrate

	| Function | Description | Depends On |
	| ---------------- | ---------------------------------------------------------------------- | ---------- |
	| `mergeIntervals` | Sort by start, merge overlapping/touching intervals | — |
	| `maxSchedule` | Greedy activity selection: sort by end time, pick non-overlapping | — |
	| `mergeProof` | Lean proof that `mergeIntervals` output is sorted + non-overlapping | `mergeIntervals` |

	### What Makes It Interesting

	The two runtime functions are independent of each other, so the agent can migrate them in either order. The proof obligation (`isSorted ∧ noOverlap`) is a conjunction that requires two separate sub-proofs, making it the most structurally complex proof in the suite. Agents must use the `constructor` tactic and prove each branch using the Lean definitions of `isSorted` and `noOverlap` from `IntervalSpec`.

	### Lean Spec Excerpt

	```lean
	def mergeIntervals (ivs : List IntervalSpec.Interval) : List IntervalSpec.Interval :=
	  (sortByStart ivs).foldl (fun acc iv =>
	    match acc.getLast? with
	    | none => [iv]
	    | some (ls, le) =>
	      if iv.1 ≤ le then dropLastInterval acc ++ [(ls, max le iv.2)]
	      else acc ++ [iv]) []

	-- Proof obligation: ∀ ivs, isSorted (mergeIntervals ivs) ∧ noOverlap (mergeIntervals ivs)
	```
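
The fold in the excerpt corresponds to the familiar sort-and-scan merge. The Python sketch below follows that structure; the two checker functions are guesses at what `isSorted` and `noOverlap` mean (non-decreasing starts, and each interval ending strictly before the next begins), since their Lean definitions are not shown.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end)


def merge_intervals(ivs: List[Interval]) -> List[Interval]:
    """Sort by start, then extend the last interval or append a new one."""
    acc: List[Interval] = []
    for start, end in sorted(ivs, key=lambda iv: iv[0]):
        if acc and start <= acc[-1][1]:
            last_start, last_end = acc[-1]
            acc[-1] = (last_start, max(last_end, end))  # overlap or touch
        else:
            acc.append((start, end))
    return acc


def is_sorted(ivs: List[Interval]) -> bool:
    """Assumed meaning of isSorted: starts are non-decreasing."""
    return all(a[0] <= b[0] for a, b in zip(ivs, ivs[1:]))


def no_overlap(ivs: List[Interval]) -> bool:
    """Assumed meaning of noOverlap: each interval ends before the next starts."""
    return all(a[1] < b[0] for a, b in zip(ivs, ivs[1:]))


merged = merge_intervals([(1, 3), (2, 6), (8, 10), (10, 12), (15, 18)])
```

Here `merged` is `[(1, 6), (8, 12), (15, 18)]`. The touching pair `(8, 10)` and `(10, 12)` is merged because the comparison is `<=`, mirroring `iv.1 ≤ le` in the spec.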

	### Scoring

	* 2 runtime functions × (1/3) reward each
	* 1 proof × (1/3) reward on proof acceptance
	* Proof obligation: `weight_proof = 0.2`

	---

	## Reward Design

	### Philosophy

	Every action provides signal — there is no sparse-reward problem:

	1. Exploration signal: `inspect` and `analyze_deps` give small positive rewards (+0.05) so agents learn to gather information before attempting implementation.

	2. Functional signal: `run_tests` gives +0.10 on full pass, with proportional penalties on partial failure (−0.01 per failing case). This guides the agent toward correct behavior before spending Lean compilation time.

	3. Verification signal: `submit` gives a large positive reward (`1 / n_functions`) when Lean accepts the code, and a small penalty (−0.05) on rejection. The penalty is small enough not to discourage reasonable attempts, but nonzero to penalize spamming.

	4. Proof signal: Proof tasks split the reward between functional implementation and the proof, with explicit weight fields (`weight_functional`, `weight_property`, `weight_proof`) configurable per task.
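
The per-action signals listed above can be summarized in a small Python function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the partial-failure penalty is read as −0.01 multiplied by the number of failing cases, and the per-task weight fields are not modeled.

```python
def step_reward(action: str, *, n_functions: int = 1,
                failing_cases: int = 0, lean_accepted: bool = False) -> float:
    """Raw (unclamped) reward for one action, using the values quoted above."""
    if action in ("inspect", "analyze_deps"):
        return 0.05  # exploration signal
    if action == "run_tests":
        if failing_cases == 0:
            return 0.10  # full pass
        return -0.01 * failing_cases  # assumed reading of the penalty
    if action == "submit":
        # Verification signal: an equal share per function, small penalty on reject.
        return 1.0 / n_functions if lean_accepted else -0.05
    raise ValueError(f"unknown action: {action!r}")
```

For a four-function task, an accepted `submit` is worth 0.25, so one verified function outweighs several exploratory actions.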

	### Score Clamping

	Internal episode scores are clamped to `[0.01, 0.99]` — this is the "open unit" convention from OpenEnv that keeps scores away from the extremes, which some validators interpret specially. Step reward logs clamp to `[0.0, 1.0]`.
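
The two clamping ranges amount to the following Python helpers. The function names are hypothetical; only the bounds come from the text above.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def clamp_episode_score(score: float) -> float:
    """Episode scores stay inside [0.01, 0.99], away from exact 0 and 1."""
    return clamp(score, 0.01, 0.99)


def clamp_step_reward(reward: float) -> float:
    """Logged step rewards stay inside [0.0, 1.0]."""
    return clamp(reward, 0.0, 1.0)
```

Under this reading a fully solved episode is reported as 0.99 rather than 1.0, and a −0.05 rejection penalty appears as 0.0 in the step reward log.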

	### Episode Termination

	An episode ends when:
	* All functions are verified (`progress = 1.0`), or
	* The step limit is reached (`step_count ≥ max_steps`)

	The step limit scales with difficulty: 15 steps for easy tasks, 25 for medium, 40 for hard.
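
The termination rule can be expressed as a single predicate. The sketch below uses hypothetical names and assumes progress is the fraction of verified functions.

```python
MAX_STEPS = {"easy": 15, "medium": 25, "hard": 40}  # limits quoted above


def episode_done(verified: int, total: int, step_count: int,
                 difficulty: str) -> bool:
    """True when every function is verified or the step budget is spent."""
    progress = verified / total  # assumes total > 0
    return progress >= 1.0 or step_count >= MAX_STEPS[difficulty]
```

For instance, a medium task with 3 of 5 functions verified keeps running at step 24 and stops at step 25.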

	---

	## Design Notes for Evaluators

	Why Lean 4 as the verifier? Lean 4 provides machine-checked guarantees that go beyond unit tests. A submission that passes sample cases but violates the formal spec is rejected. This makes it impossible to "cheat" with a hardcoded lookup table that only works on known inputs.

	Why multiple language pairs? Cross-language migration is harder than same-language refactoring. Each language pair exercises different aspects of the agent's capability: type system awareness, idiomatic pattern translation, and error handling conventions.

	Why incremental function-level migration? Submitting the entire file at once would give the agent no partial-credit signal and would make Lean errors much harder to diagnose. Function-level granularity mirrors real code review and makes the reward landscape smooth.

	Difficulty calibration:
	- Easy: one language pair (Python → TypeScript), no proofs, short dependency chains, small functions
	- Medium: two language pairs (JavaScript → Python and C → Rust), proofs on the C → Rust tasks only, medium chains
	- Hard: C++ → Rust with complex data structures, plus the JavaScript → Python payment saga; mandatory proofs, long chains