lean-migrate / tasks.md
Hrushi's picture
Upload folder using huggingface_hub
bf9c466 verified

LeanMigrate Task Reference

LeanMigrate models a real engineering workflow: a developer must migrate production business logic from one language to another, then prove correctness against a Lean 4 formal specification. Each task is structured as a small function dependency graph. The agent must migrate each function in dependency order, run sample tests, and pass Lean verification before moving to the next.


Overview

# Task ID Migration Difficulty Functions Proof?
1 rbac_auth Python β†’ TypeScript Easy 3 No
2 pricing_engine JavaScript β†’ Python Medium 5 No
3 path_canonicalizer C β†’ Rust Medium 4 Yes
4 expression_eval C β†’ Rust Medium 3 Yes
5 payment_saga JavaScript β†’ Python Hard 4 Yes
6 lru_cache C++ β†’ Rust Hard 4 Yes
7 shortest_path C++ β†’ Rust Hard 3 Yes
8 interval_scheduler C++ β†’ Rust Hard 3 Yes

Task 1 β€” RBAC Authorization Engine (rbac_auth)

Difficulty: Easy | Migration: Python β†’ TypeScript | Max Steps: 15

What It Is

A role-based access control (RBAC) engine that models permissions as a directed inheritance graph. Roles can inherit permissions from parent roles, enabling hierarchical access policies (e.g. admin inherits from editor which inherits from viewer).

Why It Matters

RBAC is the access control backbone of most enterprise software. Migrating it correctly and proving its behavior is critical: a bug in permission checking is a security vulnerability.

Functions to Migrate

Function Description Depends On
findRole Look up a role by name from the role list β€”
hasDirectPermission Check if a role has a specific resource/action permission directly β€”
canAccess Check permission with role inheritance traversal, depth limit 5 findRole, hasDirectPermission

What Makes It Interesting

The inheritance traversal is bounded (depth ≀ 5) to prevent cycles from causing infinite loops. The Lean spec expresses this via a depth : Nat countdown parameter. Agents must replicate this exact bounding behavior in TypeScript, using function declarations (not arrow functions) so tree-sitter can extract them for the verification IR.

Lean Spec Excerpt

def canAccess (roles : RoleMap) (roleName resource action : String) (depth : Nat := 5) : Bool :=
  match depth, findRole roles roleName with
  | 0, _ => false
  | _, none => false
  | n + 1, some role =>
    hasDirectPermission role resource action ||
    role.inherits.any (fun parentName =>
      canAccess roles parentName resource action n)

Scoring

  • 3 functions Γ— (1/3) reward each on successful Lean verification
  • No proof obligation β€” functional correctness only

Task 2 β€” E-Commerce Pricing Engine (pricing_engine)

Difficulty: Medium | Migration: JavaScript β†’ Python | Max Steps: 25

What It Is

A JavaScript pricing engine that computes the final price of an e-commerce order in integer cents (no floating point). The engine handles line item subtotals, coupon discounts capped at 50% of subtotal, loyalty point discounts capped at 10% of subtotal, and region-based tax rates in basis points.

Why It Matters

Pricing bugs cost real money. Integer arithmetic and correct discount-capping semantics are notoriously tricky to migrate across languages. The Lean spec pins the exact integer semantics so the agent cannot accidentally introduce floating-point rounding.

Functions to Migrate

Function Description Depends On
subtotal Sum of unit price Γ— quantity for all line items β€”
taxRateBps Tax rate in basis points for a given region ID β€”
couponDiscount Coupon discount capped at 50% of subtotal subtotal
loyaltyDiscount Loyalty points discount capped at 10% of subtotal subtotal
finalPrice Final price = subtotal βˆ’ discounts + tax subtotal, couponDiscount, loyaltyDiscount, taxRateBps

What Makes It Interesting

The discount cap logic uses integer division (//), not round() or ceil(). Agents that reach for floating-point will fail Lean verification. The dependency chain is linear (subtotal β†’ discounts β†’ tax β†’ final price), giving a natural migration order that the analyze_deps action surfaces.

Lean Spec Excerpt

def couponDiscount (order : Order) : Int :=
  let sub := subtotal order
  let rawDiscount := order.coupons.foldl (fun total coupon =>
    total + (sub * coupon.discountPercent / 100)) 0
  min rawDiscount (sub / 2)

Scoring

  • 5 functions Γ— (1/5) reward each on successful Lean verification
  • No proof obligation β€” functional correctness only

Task 3 β€” Path Canonicalizer (path_canonicalizer)

Difficulty: Medium | Migration: C β†’ Rust | Max Steps: 25

What It Is

A C path normalization library migrated to Rust. Paths are represented as lists of string segments (not raw strings). The normalizer removes . segments and resolves .. clamped at the root β€” you cannot go above /.

Why It Matters

Path canonicalization is a security-critical operation. Incorrect .. resolution leads to path traversal vulnerabilities. The Lean spec formally proves idempotency: normalizing an already-normalized path produces the same path.

Functions to Migrate

Function Description Depends On
normalizePath Remove ., resolve .. clamped at root β€”
joinPaths Join two segment lists: normalize(base ++ rel) normalizePath
pathDepth Count the number of segments in a path β€”
canonicalizeProof Lean proof that normalize(normalize(p)) = normalize(p) normalizePath

What Makes It Interesting

This is the first C β†’ Rust task. Agents must write Rust with serde_json deserialization. The proof obligation (idempotency) is the first formal proof task β€” agents must reason about the list-based normalization algorithm in Lean rather than just implementing it.

Lean Spec Excerpt

def normalizePath (p : PathSpec.Path) : PathSpec.Path := normalizeAux [] p

-- Proof obligation: idempotency
-- βˆ€ p, normalizePath (normalizePath p) = normalizePath p

Scoring

  • 3 runtime functions Γ— (1/4) reward each on Lean verification
  • 1 proof Γ— (1/4) reward on proof acceptance
  • Proof obligation: weight_proof = 0.2

Task 4 β€” Expression Evaluator (expression_eval)

Difficulty: Medium | Migration: C β†’ Rust | Max Steps: 25

What It Is

A C recursive-descent expression evaluator migrated to Rust. Expressions are binary trees with literal leaves and binary operator nodes (+, -, *, /). Division by zero propagates as None (Rust) / none (Lean).

Why It Matters

Safe arithmetic evaluation is a common primitive in configuration languages, rule engines, and calculators. The Lean spec formalizes the None-propagation contract so the agent cannot silently swallow divide-by-zero.

Functions to Migrate

Function Description Depends On
evalBinOp Evaluate a binary operation; return None for / 0 β€”
evalExpr Recursively evaluate an expression tree; propagate None evalBinOp
divisionProof Lean proof that evalBinOp .Div a 0 = none for all a evalBinOp

What Makes It Interesting

The proof obligation is compact and tractable (by simp [evalBinOp]), making it a good first proof for agents exploring the proof mechanics. The expression tree is passed as a serde_json::Value from the test harness, requiring careful recursive deserialization.

Lean Spec Excerpt

def evalBinOp (op : ExprSpec.Op) (a b : Int) : Option Int :=
  match op with
  | .Add => some (a + b)
  | .Sub => some (a - b)
  | .Mul => some (a * b)
  | .Div => if b = 0 then none else some (a / b)

-- Proof obligation: βˆ€ a, evalBinOp .Div a 0 = none

Scoring

  • 2 runtime functions Γ— (1/3) reward each
  • 1 proof Γ— (1/3) reward on proof acceptance
  • Proof obligation: weight_proof = 0.2

Task 5 β€” Payment Saga State Machine (payment_saga)

Difficulty: Hard | Migration: JavaScript β†’ Python | Max Steps: 40

What It Is

A JavaScript Express.js payment saga migrated to Python. A saga is a compensating transaction pattern: it progresses through states (Idle β†’ Reserved β†’ Authorized β†’ Captured β†’ Settled) and supports rollback transitions (CompensateCapture β†’ CompensateAuthorize β†’ CompensateReserve β†’ Compensated). The agent must also prove the no-double-charge theorem: the happy path reaches Settled and isCharged returns true exactly once.

Why It Matters

Payment systems are among the most heavily regulated and correctness-critical software. Proving no-double-charge is not just a unit test β€” it is a machine-checked guarantee that the state machine cannot reach a charged state more than once via the happy path.

Functions to Migrate

Function Description Depends On
transition Pure state transition for one event β€”
runSaga Fold a sequence of events into the final saga state transition
isCharged Return true for Captured or Settled β€”
no_double_charge_proof Lean proof of happy-path no-double-charge theorem transition, runSaga, isCharged

What Makes It Interesting

This is the hardest proof in the easy/medium tier. The proof must reference happyPath and pre-proved lemmas (happy_path_settles) from the SagaSpec module. Agents that try to native_decide the entire proof will hit timeout; the intended approach uses constructor, exact, and simp [isCharged].

Lean Spec Excerpt

theorem no_double_charge :
    runSaga happyPath = .Settled ∧ isCharged (runSaga happyPath) = true := by
  constructor
  Β· exact happy_path_settles
  Β· rw [happy_path_settles]
    simp [isCharged]

Scoring

  • 3 runtime functions Γ— (1/4) reward each
  • 1 proof Γ— (1/4) reward on proof acceptance
  • Proof obligation: weight_proof = 0.3

Task 6 β€” LRU Cache (lru_cache)

Difficulty: Hard | Migration: C++ β†’ Rust | Max Steps: 40

What It Is

A C++ LRU cache (backed by std::list + unordered_map) migrated to a Rust Vec<(u64, u64)> where the front is MRU. The agent must implement eviction, put (insert/update with MRU promotion), get (lookup with MRU promotion), and prove the capacity bound.

Why It Matters

LRU caches are ubiquitous in systems programming. The Lean spec pins the capacity invariant formally, ensuring the implementation never silently exceeds its capacity β€” a property that is hard to test exhaustively but trivial to prove once the algorithm is correct.

Functions to Migrate

Function Description Depends On
lruEvict Trim cache to cap entries (keep front = MRU, drop LRU tail) β€”
lruPut Insert/update key-val; move to MRU front; evict if over capacity lruEvict
lruGet Lookup key; on hit, move to MRU front; return (value, cache) β€”
lruPutProof Lean proof that (lruPut cache cap k v).length ≀ cap lruPut

What Makes It Interesting

The Rust representation (Vec<(u64,u64)>) differs structurally from the C++ std::list + unordered_map. Agents must re-derive the algorithm from the Lean spec, not just translate the C++ line by line. The proof obligation is an inequality bound, requiring omega or simp with list length lemmas.

Lean Spec Excerpt

def lruPut (cache : LruSpec.CacheState) (cap : Nat) (key val : Nat) : LruSpec.CacheState :=
  ((key, val) :: cache.filter (fun p => p.1 β‰  key)).take cap

-- Proof obligation: βˆ€ cache cap k v, (lruPut cache cap k v).length ≀ cap

Scoring

  • 3 runtime functions Γ— (1/4) reward each
  • 1 proof Γ— (1/4) reward on proof acceptance
  • Proof obligation: weight_proof = 0.2

Task 7 β€” Shortest Path / Dijkstra (shortest_path)

Difficulty: Hard | Migration: C++ β†’ Rust | Max Steps: 40

What It Is

A C++ Dijkstra (Bellman-Ford relaxation loop) migrated to Rust. Input is a list of weighted directed edges (from, to, weight). Output is a Vec<Option<u64>> of length n where None means unreachable. The agent must also prove that the self-distance shortestDist graph n src src = some 0 when src < n.

Why It Matters

Shortest-path algorithms appear in routing, logistics, game AI, and network analysis. The Lean spec uses a Bellman-Ford style relaxation (not a priority queue), making the formal model tractable while the Rust implementation can use any correct strategy.

Functions to Migrate

Function Description Depends On
dijkstra Single-source shortest paths; returns Vec<Option<u64>> of length n β€”
shortestDist Shortest distance from src to dst, or None if unreachable dijkstra
relaxProof Lean proof that shortestDist graph n src src = some 0 shortestDist

What Makes It Interesting

The Lean spec models Dijkstra as iterated Bellman-Ford relaxation β€” n-1 passes over all edges. Agents writing Rust can use a proper priority queue for efficiency, but the formal model in Lean is the ground truth for correctness. The proof requires reasoning about the initial distance vector and the src < n guard.

Lean Spec Excerpt

def dijkstra (graph : ShortestPathSpec.Graph) (n src : Nat) : List (Option Nat) :=
  let init := (List.range n).map (fun i => if i == src then some 0 else none)
  (List.range (n - 1)).foldl (fun d _ => relaxPass d graph) init

-- Proof obligation: βˆ€ graph n src, src < n β†’ shortestDist graph n src src = some 0

Scoring

  • 2 runtime functions Γ— (1/3) reward each
  • 1 proof Γ— (1/3) reward on proof acceptance
  • Proof obligation: weight_proof = 0.2

Task 8 β€” Interval Scheduler (interval_scheduler)

Difficulty: Hard | Migration: C++ β†’ Rust | Max Steps: 40

What It Is

A C++ interval scheduling library migrated to Rust. Two algorithms: (1) merge overlapping intervals (sort by start, scan and merge), and (2) greedy activity selection (sort by end time, pick non-overlapping intervals to maximize count). The agent must also prove the merge output is sorted and non-overlapping.

Why It Matters

Interval scheduling is the foundation of calendar systems, resource allocators, and database range scans. The merge-then-prove pattern is a canonical example of verifying a sorting + merging algorithm β€” a class of problems where off-by-one errors are common and hard to catch with finite tests.

Functions to Migrate

Function Description Depends On
mergeIntervals Sort by start, merge overlapping/touching intervals β€”
maxSchedule Greedy activity selection: sort by end time, pick non-overlapping β€”
mergeProof Lean proof that mergeIntervals output is sorted + non-overlapping mergeIntervals

What Makes It Interesting

Both functions are independent, which means the agent can migrate them in either order. The proof obligation (isSorted ∧ noOverlap) is a conjunction that requires two separate sub-proofs, making it the most structurally complex proof in the suite. Agents must use the constructor tactic and prove each branch using the Lean definitions of isSorted and noOverlap from IntervalSpec.

Lean Spec Excerpt

def mergeIntervals (ivs : List IntervalSpec.Interval) : List IntervalSpec.Interval :=
  (sortByStart ivs).foldl (fun acc iv =>
    match acc.getLast? with
    | none => [iv]
    | some (ls, le) =>
      if iv.1 ≀ le then dropLastInterval acc ++ [(ls, max le iv.2)]
      else acc ++ [iv]) []

-- Proof obligation: βˆ€ ivs, isSorted (mergeIntervals ivs) ∧ noOverlap (mergeIntervals ivs)

Scoring

  • 2 runtime functions Γ— (1/3) reward each
  • 1 proof Γ— (1/3) reward on proof acceptance
  • Proof obligation: weight_proof = 0.2

Reward Design

Philosophy

Every action provides signal β€” there is no sparse-reward problem:

  1. Exploration signal: inspect and analyze_deps give small positive rewards (+0.05) so agents learn to gather information before attempting implementation.

  2. Functional signal: run_tests gives +0.10 on full pass, with proportional penalties on partial failure (βˆ’0.01 per failing case). This guides the agent toward correct behavior before spending Lean compilation time.

  3. Verification signal: submit gives a large positive reward (1 / n_functions) when Lean accepts the code, and a small penalty (βˆ’0.05) on rejection. The penalty is small enough not to discourage reasonable attempts, but nonzero to penalize spamming.

  4. Proof signal: Proof tasks split the reward between functional implementation and the proof, with explicit weight fields (weight_functional, weight_property, weight_proof) configurable per task.

Score Clamping

Internal episode scores are clamped to [0.01, 0.99] β€” this is the "open unit" convention from OpenEnv that keeps scores away from the extremes, which some validators interpret specially. Step reward logs clamp to [0.0, 1.0].

Episode Termination

An episode ends when:

  • All functions are verified (progress = 1.0), or
  • The step limit is reached (step_count β‰₯ max_steps)

The step limit scales with difficulty: 15 steps for easy tasks, 25 for medium, 40 for hard.


Design Notes for Evaluators

Why Lean 4 as the verifier? Lean 4 provides machine-checked guarantees that go beyond unit tests. A submission that passes sample cases but violates the formal spec is rejected. This makes it impossible to "cheat" with a hardcoded lookup table that only works on known inputs.

Why multiple language pairs? Cross-language migration is harder than same-language refactoring. Each language pair exercises different aspects of the agent's capability: type system awareness, idiomatic pattern translation, and error handling conventions.

Why incremental function-level migration? Submitting the entire file at once would give the agent no partial-credit signal and would make Lean errors much harder to diagnose. Function-level granularity mirrors real code review and makes the reward landscape smooth.

Difficulty calibration:

  • Easy: one language pair, no proofs, short dependency chains, small functions
  • Medium: two language pairs (including Cβ†’Rust), optional proofs, medium chains
  • Hard: C++β†’Rust with complex data structures, mandatory proofs, long chains