Spaces:

Hrushi
/

lean-migrate

Sleeping

App Files Files Community

lean-migrate / tasks.md

Hrushi

Upload folder using huggingface_hub

bf9c466 verified about 2 months ago

preview code

raw

history blame contribute delete

21.2 kB

LeanMigrate Task Reference

LeanMigrate models a real engineering workflow: a developer must migrate production business logic from one language to another, then prove correctness against a Lean 4 formal specification. Each task is structured as a small function dependency graph. The agent must migrate each function in dependency order, run sample tests, and pass Lean verification before moving to the next.

Overview

#	Task ID	Migration	Difficulty	Functions	Proof?
1	`rbac_auth`	Python → TypeScript	Easy	3	No
2	`pricing_engine`	JavaScript → Python	Medium	5	No
3	`path_canonicalizer`	C → Rust	Medium	4	Yes
4	`expression_eval`	C → Rust	Medium	3	Yes
5	`payment_saga`	JavaScript → Python	Hard	4	Yes
6	`lru_cache`	C++ → Rust	Hard	4	Yes
7	`shortest_path`	C++ → Rust	Hard	3	Yes
8	`interval_scheduler`	C++ → Rust	Hard	3	Yes

Task 1 — RBAC Authorization Engine (`rbac_auth`)

Difficulty: Easy | Migration: Python → TypeScript | Max Steps: 15

What It Is

A role-based access control (RBAC) engine that models permissions as a directed inheritance graph. Roles can inherit permissions from parent roles, enabling hierarchical access policies (e.g. admin inherits from editor which inherits from viewer).

Why It Matters

RBAC is the access control backbone of most enterprise software. Migrating it correctly and proving its behavior is critical: a bug in permission checking is a security vulnerability.

Functions to Migrate

Function	Description	Depends On
`findRole`	Look up a role by name from the role list	—
`hasDirectPermission`	Check if a role has a specific resource/action permission directly	—
`canAccess`	Check permission with role inheritance traversal, depth limit 5	`findRole`, `hasDirectPermission`

What Makes It Interesting

The inheritance traversal is bounded (depth ≤ 5) to prevent cycles from causing infinite loops. The Lean spec expresses this via a depth : Nat countdown parameter. Agents must replicate this exact bounding behavior in TypeScript, using function declarations (not arrow functions) so tree-sitter can extract them for the verification IR.

Lean Spec Excerpt

def canAccess (roles : RoleMap) (roleName resource action : String) (depth : Nat := 5) : Bool :=
  match depth, findRole roles roleName with
  | 0, _ => false
  | _, none => false
  | n + 1, some role =>
    hasDirectPermission role resource action ||
    role.inherits.any (fun parentName =>
      canAccess roles parentName resource action n)

Scoring

3 functions × (1/3) reward each on successful Lean verification
No proof obligation — functional correctness only

Task 2 — E-Commerce Pricing Engine (`pricing_engine`)

Difficulty: Medium | Migration: JavaScript → Python | Max Steps: 25

What It Is

A JavaScript pricing engine that computes the final price of an e-commerce order in integer cents (no floating point). The engine handles line item subtotals, coupon discounts capped at 50% of subtotal, loyalty point discounts capped at 10% of subtotal, and region-based tax rates in basis points.

Why It Matters

Pricing bugs cost real money. Integer arithmetic and correct discount-capping semantics are notoriously tricky to migrate across languages. The Lean spec pins the exact integer semantics so the agent cannot accidentally introduce floating-point rounding.

Functions to Migrate

Function	Description	Depends On
`subtotal`	Sum of unit price × quantity for all line items	—
`taxRateBps`	Tax rate in basis points for a given region ID	—
`couponDiscount`	Coupon discount capped at 50% of subtotal	`subtotal`
`loyaltyDiscount`	Loyalty points discount capped at 10% of subtotal	`subtotal`
`finalPrice`	Final price = subtotal − discounts + tax	`subtotal`, `couponDiscount`, `loyaltyDiscount`, `taxRateBps`

What Makes It Interesting

The discount cap logic uses integer division (//), not round() or ceil(). Agents that reach for floating-point will fail Lean verification. The dependency chain is linear (subtotal → discounts → tax → final price), giving a natural migration order that the analyze_deps action surfaces.

Lean Spec Excerpt

def couponDiscount (order : Order) : Int :=
  let sub := subtotal order
  let rawDiscount := order.coupons.foldl (fun total coupon =>
    total + (sub * coupon.discountPercent / 100)) 0
  min rawDiscount (sub / 2)

Scoring

5 functions × (1/5) reward each on successful Lean verification
No proof obligation — functional correctness only

Task 3 — Path Canonicalizer (`path_canonicalizer`)

Difficulty: Medium | Migration: C → Rust | Max Steps: 25

What It Is

A C path normalization library migrated to Rust. Paths are represented as lists of string segments (not raw strings). The normalizer removes . segments and resolves .. clamped at the root — you cannot go above /.

Why It Matters

Path canonicalization is a security-critical operation. Incorrect .. resolution leads to path traversal vulnerabilities. The Lean spec formally proves idempotency: normalizing an already-normalized path produces the same path.

Functions to Migrate

Function	Description	Depends On
`normalizePath`	Remove `.`, resolve `..` clamped at root	—
`joinPaths`	Join two segment lists: normalize(base ++ rel)	`normalizePath`
`pathDepth`	Count the number of segments in a path	—
`canonicalizeProof`	Lean proof that `normalize(normalize(p)) = normalize(p)`	`normalizePath`

What Makes It Interesting

This is the first C → Rust task. Agents must write Rust with serde_json deserialization. The proof obligation (idempotency) is the first formal proof task — agents must reason about the list-based normalization algorithm in Lean rather than just implementing it.

Lean Spec Excerpt

def normalizePath (p : PathSpec.Path) : PathSpec.Path := normalizeAux [] p

-- Proof obligation: idempotency
-- ∀ p, normalizePath (normalizePath p) = normalizePath p

Scoring

3 runtime functions × (1/4) reward each on Lean verification
1 proof × (1/4) reward on proof acceptance
Proof obligation: weight_proof = 0.2

Task 4 — Expression Evaluator (`expression_eval`)

Difficulty: Medium | Migration: C → Rust | Max Steps: 25

What It Is

A C recursive-descent expression evaluator migrated to Rust. Expressions are binary trees with literal leaves and binary operator nodes (+, -, *, /). Division by zero propagates as None (Rust) / none (Lean).

Why It Matters

Safe arithmetic evaluation is a common primitive in configuration languages, rule engines, and calculators. The Lean spec formalizes the None-propagation contract so the agent cannot silently swallow divide-by-zero.

Functions to Migrate

Function	Description	Depends On
`evalBinOp`	Evaluate a binary operation; return `None` for `/ 0`	—
`evalExpr`	Recursively evaluate an expression tree; propagate `None`	`evalBinOp`
`divisionProof`	Lean proof that `evalBinOp .Div a 0 = none` for all `a`	`evalBinOp`

What Makes It Interesting

The proof obligation is compact and tractable (by simp [evalBinOp]), making it a good first proof for agents exploring the proof mechanics. The expression tree is passed as a serde_json::Value from the test harness, requiring careful recursive deserialization.

Lean Spec Excerpt

def evalBinOp (op : ExprSpec.Op) (a b : Int) : Option Int :=
  match op with
  | .Add => some (a + b)
  | .Sub => some (a - b)
  | .Mul => some (a * b)
  | .Div => if b = 0 then none else some (a / b)

-- Proof obligation: ∀ a, evalBinOp .Div a 0 = none

Scoring

2 runtime functions × (1/3) reward each
1 proof × (1/3) reward on proof acceptance
Proof obligation: weight_proof = 0.2

Task 5 — Payment Saga State Machine (`payment_saga`)

Difficulty: Hard | Migration: JavaScript → Python | Max Steps: 40

What It Is

A JavaScript Express.js payment saga migrated to Python. A saga is a compensating transaction pattern: it progresses through states (Idle → Reserved → Authorized → Captured → Settled) and supports rollback transitions (CompensateCapture → CompensateAuthorize → CompensateReserve → Compensated). The agent must also prove the no-double-charge theorem: the happy path reaches Settled and isCharged returns true exactly once.

Why It Matters

Payment systems are among the most heavily regulated and correctness-critical software. Proving no-double-charge is not just a unit test — it is a machine-checked guarantee that the state machine cannot reach a charged state more than once via the happy path.

Functions to Migrate

Function	Description	Depends On
`transition`	Pure state transition for one event	—
`runSaga`	Fold a sequence of events into the final saga state	`transition`
`isCharged`	Return `true` for `Captured` or `Settled`	—
`no_double_charge_proof`	Lean proof of happy-path no-double-charge theorem	`transition`, `runSaga`, `isCharged`

What Makes It Interesting

This is the hardest proof in the easy/medium tier. The proof must reference happyPath and pre-proved lemmas (happy_path_settles) from the SagaSpec module. Agents that try to native_decide the entire proof will hit timeout; the intended approach uses constructor, exact, and simp [isCharged].

Lean Spec Excerpt

theorem no_double_charge :
    runSaga happyPath = .Settled ∧ isCharged (runSaga happyPath) = true := by
  constructor
  · exact happy_path_settles
  · rw [happy_path_settles]
    simp [isCharged]

Scoring

3 runtime functions × (1/4) reward each
1 proof × (1/4) reward on proof acceptance
Proof obligation: weight_proof = 0.3

Task 6 — LRU Cache (`lru_cache`)

Difficulty: Hard | Migration: C++ → Rust | Max Steps: 40

What It Is

A C++ LRU cache (backed by std::list + unordered_map) migrated to a Rust Vec<(u64, u64)> where the front is MRU. The agent must implement eviction, put (insert/update with MRU promotion), get (lookup with MRU promotion), and prove the capacity bound.

Why It Matters

LRU caches are ubiquitous in systems programming. The Lean spec pins the capacity invariant formally, ensuring the implementation never silently exceeds its capacity — a property that is hard to test exhaustively but trivial to prove once the algorithm is correct.

Functions to Migrate

Function	Description	Depends On
`lruEvict`	Trim cache to `cap` entries (keep front = MRU, drop LRU tail)	—
`lruPut`	Insert/update key-val; move to MRU front; evict if over capacity	`lruEvict`
`lruGet`	Lookup key; on hit, move to MRU front; return `(value, cache)`	—
`lruPutProof`	Lean proof that `(lruPut cache cap k v).length ≤ cap`	`lruPut`

What Makes It Interesting

The Rust representation (Vec<(u64,u64)>) differs structurally from the C++ std::list + unordered_map. Agents must re-derive the algorithm from the Lean spec, not just translate the C++ line by line. The proof obligation is an inequality bound, requiring omega or simp with list length lemmas.

Lean Spec Excerpt

def lruPut (cache : LruSpec.CacheState) (cap : Nat) (key val : Nat) : LruSpec.CacheState :=
  ((key, val) :: cache.filter (fun p => p.1 ≠ key)).take cap

-- Proof obligation: ∀ cache cap k v, (lruPut cache cap k v).length ≤ cap

Scoring

3 runtime functions × (1/4) reward each
1 proof × (1/4) reward on proof acceptance
Proof obligation: weight_proof = 0.2

Task 7 — Shortest Path / Dijkstra (`shortest_path`)

Difficulty: Hard | Migration: C++ → Rust | Max Steps: 40

What It Is

A C++ Dijkstra (Bellman-Ford relaxation loop) migrated to Rust. Input is a list of weighted directed edges (from, to, weight). Output is a Vec<Option<u64>> of length n where None means unreachable. The agent must also prove that the self-distance shortestDist graph n src src = some 0 when src < n.

Why It Matters

Shortest-path algorithms appear in routing, logistics, game AI, and network analysis. The Lean spec uses a Bellman-Ford style relaxation (not a priority queue), making the formal model tractable while the Rust implementation can use any correct strategy.

Functions to Migrate

Function	Description	Depends On
`dijkstra`	Single-source shortest paths; returns `Vec<Option<u64>>` of length n	—
`shortestDist`	Shortest distance from `src` to `dst`, or `None` if unreachable	`dijkstra`
`relaxProof`	Lean proof that `shortestDist graph n src src = some 0`	`shortestDist`

What Makes It Interesting

The Lean spec models Dijkstra as iterated Bellman-Ford relaxation — n-1 passes over all edges. Agents writing Rust can use a proper priority queue for efficiency, but the formal model in Lean is the ground truth for correctness. The proof requires reasoning about the initial distance vector and the src < n guard.

Lean Spec Excerpt

def dijkstra (graph : ShortestPathSpec.Graph) (n src : Nat) : List (Option Nat) :=
  let init := (List.range n).map (fun i => if i == src then some 0 else none)
  (List.range (n - 1)).foldl (fun d _ => relaxPass d graph) init

-- Proof obligation: ∀ graph n src, src < n → shortestDist graph n src src = some 0

Scoring

2 runtime functions × (1/3) reward each
1 proof × (1/3) reward on proof acceptance
Proof obligation: weight_proof = 0.2

Task 8 — Interval Scheduler (`interval_scheduler`)

Difficulty: Hard | Migration: C++ → Rust | Max Steps: 40

What It Is

A C++ interval scheduling library migrated to Rust. Two algorithms: (1) merge overlapping intervals (sort by start, scan and merge), and (2) greedy activity selection (sort by end time, pick non-overlapping intervals to maximize count). The agent must also prove the merge output is sorted and non-overlapping.

Why It Matters

Interval scheduling is the foundation of calendar systems, resource allocators, and database range scans. The merge-then-prove pattern is a canonical example of verifying a sorting + merging algorithm — a class of problems where off-by-one errors are common and hard to catch with finite tests.

Functions to Migrate

Function	Description	Depends On
`mergeIntervals`	Sort by start, merge overlapping/touching intervals	—
`maxSchedule`	Greedy activity selection: sort by end time, pick non-overlapping	—
`mergeProof`	Lean proof that `mergeIntervals` output is sorted + non-overlapping	`mergeIntervals`

What Makes It Interesting

Both functions are independent, which means the agent can migrate them in either order. The proof obligation (isSorted ∧ noOverlap) is a conjunction that requires two separate sub-proofs, making it the most structurally complex proof in the suite. Agents must use the constructor tactic and prove each branch using the Lean definitions of isSorted and noOverlap from IntervalSpec.

Lean Spec Excerpt

def mergeIntervals (ivs : List IntervalSpec.Interval) : List IntervalSpec.Interval :=
  (sortByStart ivs).foldl (fun acc iv =>
    match acc.getLast? with
    | none => [iv]
    | some (ls, le) =>
      if iv.1 ≤ le then dropLastInterval acc ++ [(ls, max le iv.2)]
      else acc ++ [iv]) []

-- Proof obligation: ∀ ivs, isSorted (mergeIntervals ivs) ∧ noOverlap (mergeIntervals ivs)

Scoring

2 runtime functions × (1/3) reward each
1 proof × (1/3) reward on proof acceptance
Proof obligation: weight_proof = 0.2

Reward Design

Philosophy

Every action provides signal — there is no sparse-reward problem:

Exploration signal: inspect and analyze_deps give small positive rewards (+0.05) so agents learn to gather information before attempting implementation.
Functional signal: run_tests gives +0.10 on full pass, with proportional penalties on partial failure (−0.01 per failing case). This guides the agent toward correct behavior before spending Lean compilation time.
Verification signal: submit gives a large positive reward (1 / n_functions) when Lean accepts the code, and a small penalty (−0.05) on rejection. The penalty is small enough not to discourage reasonable attempts, but nonzero to penalize spamming.
Proof signal: Proof tasks split the reward between functional implementation and the proof, with explicit weight fields (weight_functional, weight_property, weight_proof) configurable per task.

Score Clamping

Internal episode scores are clamped to [0.01, 0.99] — this is the "open unit" convention from OpenEnv that keeps scores away from the extremes, which some validators interpret specially. Step reward logs clamp to [0.0, 1.0].

Episode Termination

An episode ends when:

All functions are verified (progress = 1.0), or
The step limit is reached (step_count ≥ max_steps)

The step limit scales with difficulty: 15 steps for easy tasks, 25 for medium, 40 for hard.

Design Notes for Evaluators

Why Lean 4 as the verifier? Lean 4 provides machine-checked guarantees that go beyond unit tests. A submission that passes sample cases but violates the formal spec is rejected. This makes it impossible to "cheat" with a hardcoded lookup table that only works on known inputs.

Why multiple language pairs? Cross-language migration is harder than same-language refactoring. Each language pair exercises different aspects of the agent's capability: type system awareness, idiomatic pattern translation, and error handling conventions.

Why incremental function-level migration? Submitting the entire file at once would give the agent no partial-credit signal and would make Lean errors much harder to diagnose. Function-level granularity mirrors real code review and makes the reward landscape smooth.

Difficulty calibration:

Easy: one language pair, no proofs, short dependency chains, small functions
Medium: two language pairs (including C→Rust), optional proofs, medium chains
Hard: C++→Rust with complex data structures, mandatory proofs, long chains

LeanMigrate Task Reference

Overview

Task 1 — RBAC Authorization Engine (rbac_auth)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Task 2 — E-Commerce Pricing Engine (pricing_engine)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Task 3 — Path Canonicalizer (path_canonicalizer)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Task 4 — Expression Evaluator (expression_eval)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Task 5 — Payment Saga State Machine (payment_saga)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Task 6 — LRU Cache (lru_cache)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Task 7 — Shortest Path / Dijkstra (shortest_path)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Task 8 — Interval Scheduler (interval_scheduler)

What It Is

Why It Matters

Functions to Migrate

What Makes It Interesting

Lean Spec Excerpt

Scoring

Reward Design

Philosophy

Score Clamping

Episode Termination

Design Notes for Evaluators

Task 1 — RBAC Authorization Engine (`rbac_auth`)

Task 2 — E-Commerce Pricing Engine (`pricing_engine`)

Task 3 — Path Canonicalizer (`path_canonicalizer`)

Task 4 — Expression Evaluator (`expression_eval`)

Task 5 — Payment Saga State Machine (`payment_saga`)

Task 6 — LRU Cache (`lru_cache`)

Task 7 — Shortest Path / Dijkstra (`shortest_path`)

Task 8 — Interval Scheduler (`interval_scheduler`)