Spaces:
Sleeping
LeanMigrate Task Reference
LeanMigrate models a real engineering workflow: a developer must migrate production business logic from one language to another, then prove correctness against a Lean 4 formal specification. Each task is structured as a small function dependency graph. The agent must migrate each function in dependency order, run sample tests, and pass Lean verification before moving to the next.
Overview
| # | Task ID | Migration | Difficulty | Functions | Proof? |
|---|---|---|---|---|---|
| 1 | rbac_auth |
Python β TypeScript | Easy | 3 | No |
| 2 | pricing_engine |
JavaScript β Python | Medium | 5 | No |
| 3 | path_canonicalizer |
C β Rust | Medium | 4 | Yes |
| 4 | expression_eval |
C β Rust | Medium | 3 | Yes |
| 5 | payment_saga |
JavaScript β Python | Hard | 4 | Yes |
| 6 | lru_cache |
C++ β Rust | Hard | 4 | Yes |
| 7 | shortest_path |
C++ β Rust | Hard | 3 | Yes |
| 8 | interval_scheduler |
C++ β Rust | Hard | 3 | Yes |
Task 1 β RBAC Authorization Engine (rbac_auth)
Difficulty: Easy | Migration: Python β TypeScript | Max Steps: 15
What It Is
A role-based access control (RBAC) engine that models permissions as a directed inheritance graph. Roles can inherit permissions from parent roles, enabling hierarchical access policies (e.g. admin inherits from editor which inherits from viewer).
Why It Matters
RBAC is the access control backbone of most enterprise software. Migrating it correctly and proving its behavior is critical: a bug in permission checking is a security vulnerability.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
findRole |
Look up a role by name from the role list | β |
hasDirectPermission |
Check if a role has a specific resource/action permission directly | β |
canAccess |
Check permission with role inheritance traversal, depth limit 5 | findRole, hasDirectPermission |
What Makes It Interesting
The inheritance traversal is bounded (depth β€ 5) to prevent cycles from causing infinite loops. The Lean spec expresses this via a depth : Nat countdown parameter. Agents must replicate this exact bounding behavior in TypeScript, using function declarations (not arrow functions) so tree-sitter can extract them for the verification IR.
Lean Spec Excerpt
def canAccess (roles : RoleMap) (roleName resource action : String) (depth : Nat := 5) : Bool :=
match depth, findRole roles roleName with
| 0, _ => false
| _, none => false
| n + 1, some role =>
hasDirectPermission role resource action ||
role.inherits.any (fun parentName =>
canAccess roles parentName resource action n)
Scoring
- 3 functions Γ (1/3) reward each on successful Lean verification
- No proof obligation β functional correctness only
Task 2 β E-Commerce Pricing Engine (pricing_engine)
Difficulty: Medium | Migration: JavaScript β Python | Max Steps: 25
What It Is
A JavaScript pricing engine that computes the final price of an e-commerce order in integer cents (no floating point). The engine handles line item subtotals, coupon discounts capped at 50% of subtotal, loyalty point discounts capped at 10% of subtotal, and region-based tax rates in basis points.
Why It Matters
Pricing bugs cost real money. Integer arithmetic and correct discount-capping semantics are notoriously tricky to migrate across languages. The Lean spec pins the exact integer semantics so the agent cannot accidentally introduce floating-point rounding.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
subtotal |
Sum of unit price Γ quantity for all line items | β |
taxRateBps |
Tax rate in basis points for a given region ID | β |
couponDiscount |
Coupon discount capped at 50% of subtotal | subtotal |
loyaltyDiscount |
Loyalty points discount capped at 10% of subtotal | subtotal |
finalPrice |
Final price = subtotal β discounts + tax | subtotal, couponDiscount, loyaltyDiscount, taxRateBps |
What Makes It Interesting
The discount cap logic uses integer division (//), not round() or ceil(). Agents that reach for floating-point will fail Lean verification. The dependency chain is linear (subtotal β discounts β tax β final price), giving a natural migration order that the analyze_deps action surfaces.
Lean Spec Excerpt
def couponDiscount (order : Order) : Int :=
let sub := subtotal order
let rawDiscount := order.coupons.foldl (fun total coupon =>
total + (sub * coupon.discountPercent / 100)) 0
min rawDiscount (sub / 2)
Scoring
- 5 functions Γ (1/5) reward each on successful Lean verification
- No proof obligation β functional correctness only
Task 3 β Path Canonicalizer (path_canonicalizer)
Difficulty: Medium | Migration: C β Rust | Max Steps: 25
What It Is
A C path normalization library migrated to Rust. Paths are represented as lists of string segments (not raw strings). The normalizer removes . segments and resolves .. clamped at the root β you cannot go above /.
Why It Matters
Path canonicalization is a security-critical operation. Incorrect .. resolution leads to path traversal vulnerabilities. The Lean spec formally proves idempotency: normalizing an already-normalized path produces the same path.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
normalizePath |
Remove ., resolve .. clamped at root |
β |
joinPaths |
Join two segment lists: normalize(base ++ rel) | normalizePath |
pathDepth |
Count the number of segments in a path | β |
canonicalizeProof |
Lean proof that normalize(normalize(p)) = normalize(p) |
normalizePath |
What Makes It Interesting
This is the first C β Rust task. Agents must write Rust with serde_json deserialization. The proof obligation (idempotency) is the first formal proof task β agents must reason about the list-based normalization algorithm in Lean rather than just implementing it.
Lean Spec Excerpt
def normalizePath (p : PathSpec.Path) : PathSpec.Path := normalizeAux [] p
-- Proof obligation: idempotency
-- β p, normalizePath (normalizePath p) = normalizePath p
Scoring
- 3 runtime functions Γ (1/4) reward each on Lean verification
- 1 proof Γ (1/4) reward on proof acceptance
- Proof obligation:
weight_proof = 0.2
Task 4 β Expression Evaluator (expression_eval)
Difficulty: Medium | Migration: C β Rust | Max Steps: 25
What It Is
A C recursive-descent expression evaluator migrated to Rust. Expressions are binary trees with literal leaves and binary operator nodes (+, -, *, /). Division by zero propagates as None (Rust) / none (Lean).
Why It Matters
Safe arithmetic evaluation is a common primitive in configuration languages, rule engines, and calculators. The Lean spec formalizes the None-propagation contract so the agent cannot silently swallow divide-by-zero.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
evalBinOp |
Evaluate a binary operation; return None for / 0 |
β |
evalExpr |
Recursively evaluate an expression tree; propagate None |
evalBinOp |
divisionProof |
Lean proof that evalBinOp .Div a 0 = none for all a |
evalBinOp |
What Makes It Interesting
The proof obligation is compact and tractable (by simp [evalBinOp]), making it a good first proof for agents exploring the proof mechanics. The expression tree is passed as a serde_json::Value from the test harness, requiring careful recursive deserialization.
Lean Spec Excerpt
def evalBinOp (op : ExprSpec.Op) (a b : Int) : Option Int :=
match op with
| .Add => some (a + b)
| .Sub => some (a - b)
| .Mul => some (a * b)
| .Div => if b = 0 then none else some (a / b)
-- Proof obligation: β a, evalBinOp .Div a 0 = none
Scoring
- 2 runtime functions Γ (1/3) reward each
- 1 proof Γ (1/3) reward on proof acceptance
- Proof obligation:
weight_proof = 0.2
Task 5 β Payment Saga State Machine (payment_saga)
Difficulty: Hard | Migration: JavaScript β Python | Max Steps: 40
What It Is
A JavaScript Express.js payment saga migrated to Python. A saga is a compensating transaction pattern: it progresses through states (Idle β Reserved β Authorized β Captured β Settled) and supports rollback transitions (CompensateCapture β CompensateAuthorize β CompensateReserve β Compensated). The agent must also prove the no-double-charge theorem: the happy path reaches Settled and isCharged returns true exactly once.
Why It Matters
Payment systems are among the most heavily regulated and correctness-critical software. Proving no-double-charge is not just a unit test β it is a machine-checked guarantee that the state machine cannot reach a charged state more than once via the happy path.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
transition |
Pure state transition for one event | β |
runSaga |
Fold a sequence of events into the final saga state | transition |
isCharged |
Return true for Captured or Settled |
β |
no_double_charge_proof |
Lean proof of happy-path no-double-charge theorem | transition, runSaga, isCharged |
What Makes It Interesting
This is the hardest proof in the easy/medium tier. The proof must reference happyPath and pre-proved lemmas (happy_path_settles) from the SagaSpec module. Agents that try to native_decide the entire proof will hit timeout; the intended approach uses constructor, exact, and simp [isCharged].
Lean Spec Excerpt
theorem no_double_charge :
runSaga happyPath = .Settled β§ isCharged (runSaga happyPath) = true := by
constructor
Β· exact happy_path_settles
Β· rw [happy_path_settles]
simp [isCharged]
Scoring
- 3 runtime functions Γ (1/4) reward each
- 1 proof Γ (1/4) reward on proof acceptance
- Proof obligation:
weight_proof = 0.3
Task 6 β LRU Cache (lru_cache)
Difficulty: Hard | Migration: C++ β Rust | Max Steps: 40
What It Is
A C++ LRU cache (backed by std::list + unordered_map) migrated to a Rust Vec<(u64, u64)> where the front is MRU. The agent must implement eviction, put (insert/update with MRU promotion), get (lookup with MRU promotion), and prove the capacity bound.
Why It Matters
LRU caches are ubiquitous in systems programming. The Lean spec pins the capacity invariant formally, ensuring the implementation never silently exceeds its capacity β a property that is hard to test exhaustively but trivial to prove once the algorithm is correct.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
lruEvict |
Trim cache to cap entries (keep front = MRU, drop LRU tail) |
β |
lruPut |
Insert/update key-val; move to MRU front; evict if over capacity | lruEvict |
lruGet |
Lookup key; on hit, move to MRU front; return (value, cache) |
β |
lruPutProof |
Lean proof that (lruPut cache cap k v).length β€ cap |
lruPut |
What Makes It Interesting
The Rust representation (Vec<(u64,u64)>) differs structurally from the C++ std::list + unordered_map. Agents must re-derive the algorithm from the Lean spec, not just translate the C++ line by line. The proof obligation is an inequality bound, requiring omega or simp with list length lemmas.
Lean Spec Excerpt
def lruPut (cache : LruSpec.CacheState) (cap : Nat) (key val : Nat) : LruSpec.CacheState :=
((key, val) :: cache.filter (fun p => p.1 β key)).take cap
-- Proof obligation: β cache cap k v, (lruPut cache cap k v).length β€ cap
Scoring
- 3 runtime functions Γ (1/4) reward each
- 1 proof Γ (1/4) reward on proof acceptance
- Proof obligation:
weight_proof = 0.2
Task 7 β Shortest Path / Dijkstra (shortest_path)
Difficulty: Hard | Migration: C++ β Rust | Max Steps: 40
What It Is
A C++ Dijkstra (Bellman-Ford relaxation loop) migrated to Rust. Input is a list of weighted directed edges (from, to, weight). Output is a Vec<Option<u64>> of length n where None means unreachable. The agent must also prove that the self-distance shortestDist graph n src src = some 0 when src < n.
Why It Matters
Shortest-path algorithms appear in routing, logistics, game AI, and network analysis. The Lean spec uses a Bellman-Ford style relaxation (not a priority queue), making the formal model tractable while the Rust implementation can use any correct strategy.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
dijkstra |
Single-source shortest paths; returns Vec<Option<u64>> of length n |
β |
shortestDist |
Shortest distance from src to dst, or None if unreachable |
dijkstra |
relaxProof |
Lean proof that shortestDist graph n src src = some 0 |
shortestDist |
What Makes It Interesting
The Lean spec models Dijkstra as iterated Bellman-Ford relaxation β n-1 passes over all edges. Agents writing Rust can use a proper priority queue for efficiency, but the formal model in Lean is the ground truth for correctness. The proof requires reasoning about the initial distance vector and the src < n guard.
Lean Spec Excerpt
def dijkstra (graph : ShortestPathSpec.Graph) (n src : Nat) : List (Option Nat) :=
let init := (List.range n).map (fun i => if i == src then some 0 else none)
(List.range (n - 1)).foldl (fun d _ => relaxPass d graph) init
-- Proof obligation: β graph n src, src < n β shortestDist graph n src src = some 0
Scoring
- 2 runtime functions Γ (1/3) reward each
- 1 proof Γ (1/3) reward on proof acceptance
- Proof obligation:
weight_proof = 0.2
Task 8 β Interval Scheduler (interval_scheduler)
Difficulty: Hard | Migration: C++ β Rust | Max Steps: 40
What It Is
A C++ interval scheduling library migrated to Rust. Two algorithms: (1) merge overlapping intervals (sort by start, scan and merge), and (2) greedy activity selection (sort by end time, pick non-overlapping intervals to maximize count). The agent must also prove the merge output is sorted and non-overlapping.
Why It Matters
Interval scheduling is the foundation of calendar systems, resource allocators, and database range scans. The merge-then-prove pattern is a canonical example of verifying a sorting + merging algorithm β a class of problems where off-by-one errors are common and hard to catch with finite tests.
Functions to Migrate
| Function | Description | Depends On |
|---|---|---|
mergeIntervals |
Sort by start, merge overlapping/touching intervals | β |
maxSchedule |
Greedy activity selection: sort by end time, pick non-overlapping | β |
mergeProof |
Lean proof that mergeIntervals output is sorted + non-overlapping |
mergeIntervals |
What Makes It Interesting
Both functions are independent, which means the agent can migrate them in either order. The proof obligation (isSorted β§ noOverlap) is a conjunction that requires two separate sub-proofs, making it the most structurally complex proof in the suite. Agents must use the constructor tactic and prove each branch using the Lean definitions of isSorted and noOverlap from IntervalSpec.
Lean Spec Excerpt
def mergeIntervals (ivs : List IntervalSpec.Interval) : List IntervalSpec.Interval :=
(sortByStart ivs).foldl (fun acc iv =>
match acc.getLast? with
| none => [iv]
| some (ls, le) =>
if iv.1 β€ le then dropLastInterval acc ++ [(ls, max le iv.2)]
else acc ++ [iv]) []
-- Proof obligation: β ivs, isSorted (mergeIntervals ivs) β§ noOverlap (mergeIntervals ivs)
Scoring
- 2 runtime functions Γ (1/3) reward each
- 1 proof Γ (1/3) reward on proof acceptance
- Proof obligation:
weight_proof = 0.2
Reward Design
Philosophy
Every action provides signal β there is no sparse-reward problem:
Exploration signal:
inspectandanalyze_depsgive small positive rewards (+0.05) so agents learn to gather information before attempting implementation.Functional signal:
run_testsgives +0.10 on full pass, with proportional penalties on partial failure (β0.01 per failing case). This guides the agent toward correct behavior before spending Lean compilation time.Verification signal:
submitgives a large positive reward (1 / n_functions) when Lean accepts the code, and a small penalty (β0.05) on rejection. The penalty is small enough not to discourage reasonable attempts, but nonzero to penalize spamming.Proof signal: Proof tasks split the reward between functional implementation and the proof, with explicit weight fields (
weight_functional,weight_property,weight_proof) configurable per task.
Score Clamping
Internal episode scores are clamped to [0.01, 0.99] β this is the "open unit" convention from OpenEnv that keeps scores away from the extremes, which some validators interpret specially. Step reward logs clamp to [0.0, 1.0].
Episode Termination
An episode ends when:
- All functions are verified (
progress = 1.0), or - The step limit is reached (
step_count β₯ max_steps)
The step limit scales with difficulty: 15 steps for easy tasks, 25 for medium, 40 for hard.
Design Notes for Evaluators
Why Lean 4 as the verifier? Lean 4 provides machine-checked guarantees that go beyond unit tests. A submission that passes sample cases but violates the formal spec is rejected. This makes it impossible to "cheat" with a hardcoded lookup table that only works on known inputs.
Why multiple language pairs? Cross-language migration is harder than same-language refactoring. Each language pair exercises different aspects of the agent's capability: type system awareness, idiomatic pattern translation, and error handling conventions.
Why incremental function-level migration? Submitting the entire file at once would give the agent no partial-credit signal and would make Lean errors much harder to diagnose. Function-level granularity mirrors real code review and makes the reward landscape smooth.
Difficulty calibration:
- Easy: one language pair, no proofs, short dependency chains, small functions
- Medium: two language pairs (including CβRust), optional proofs, medium chains
- Hard: C++βRust with complex data structures, mandatory proofs, long chains