Spaces:
Sleeping
Sleeping
| # LeanMigrate Task Reference | |
| LeanMigrate models a real engineering workflow: a developer must migrate production business logic from one language to another, then prove correctness against a Lean 4 formal specification. Each task is structured as a small function dependency graph. The agent must migrate each function in dependency order, run sample tests, and pass Lean verification before moving to the next. | |
| --- | |
| ## Overview | |
| | # | Task ID | Migration | Difficulty | Functions | Proof? | | |
| | -- | --------------------- | -------------------- | ---------- | --------- | ------ | | |
| | 1 | `rbac_auth` | Python β TypeScript | Easy | 3 | No | | |
| | 2 | `pricing_engine` | JavaScript β Python | Medium | 5 | No | | |
| | 3 | `path_canonicalizer` | C β Rust | Medium | 4 | Yes | | |
| | 4 | `expression_eval` | C β Rust | Medium | 3 | Yes | | |
| | 5 | `payment_saga` | JavaScript β Python | Hard | 4 | Yes | | |
| | 6 | `lru_cache` | C++ β Rust | Hard | 4 | Yes | | |
| | 7 | `shortest_path` | C++ β Rust | Hard | 3 | Yes | | |
| | 8 | `interval_scheduler` | C++ β Rust | Hard | 3 | Yes | | |
| --- | |
| ## Task 1 β RBAC Authorization Engine (`rbac_auth`) | |
| **Difficulty:** Easy | **Migration:** Python β TypeScript | **Max Steps:** 15 | |
| ### What It Is | |
| A role-based access control (RBAC) engine that models permissions as a directed inheritance graph. Roles can inherit permissions from parent roles, enabling hierarchical access policies (e.g. `admin` inherits from `editor` which inherits from `viewer`). | |
| ### Why It Matters | |
| RBAC is the access control backbone of most enterprise software. Migrating it correctly and proving its behavior is critical: a bug in permission checking is a security vulnerability. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | -------------------- | ----------------------------------------------------------------- | ------------------------------------ | | |
| | `findRole` | Look up a role by name from the role list | β | | |
| | `hasDirectPermission`| Check if a role has a specific resource/action permission directly | β | | |
| | `canAccess` | Check permission with role inheritance traversal, depth limit 5 | `findRole`, `hasDirectPermission` | | |
| ### What Makes It Interesting | |
| The inheritance traversal is bounded (depth β€ 5) to prevent cycles from causing infinite loops. The Lean spec expresses this via a `depth : Nat` countdown parameter. Agents must replicate this exact bounding behavior in TypeScript, using `function` declarations (not arrow functions) so tree-sitter can extract them for the verification IR. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| def canAccess (roles : RoleMap) (roleName resource action : String) (depth : Nat := 5) : Bool := | |
| match depth, findRole roles roleName with | |
| | 0, _ => false | |
| | _, none => false | |
| | n + 1, some role => | |
| hasDirectPermission role resource action || | |
| role.inherits.any (fun parentName => | |
| canAccess roles parentName resource action n) | |
| ``` | |
| ### Scoring | |
| * 3 functions Γ (1/3) reward each on successful Lean verification | |
| * No proof obligation β functional correctness only | |
| --- | |
| ## Task 2 β E-Commerce Pricing Engine (`pricing_engine`) | |
| **Difficulty:** Medium | **Migration:** JavaScript β Python | **Max Steps:** 25 | |
| ### What It Is | |
| A JavaScript pricing engine that computes the final price of an e-commerce order in **integer cents** (no floating point). The engine handles line item subtotals, coupon discounts capped at 50% of subtotal, loyalty point discounts capped at 10% of subtotal, and region-based tax rates in basis points. | |
| ### Why It Matters | |
| Pricing bugs cost real money. Integer arithmetic and correct discount-capping semantics are notoriously tricky to migrate across languages. The Lean spec pins the exact integer semantics so the agent cannot accidentally introduce floating-point rounding. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | --------------- | --------------------------------------------------------- | --------------------------------------------- | | |
| | `subtotal` | Sum of unit price Γ quantity for all line items | β | | |
| | `taxRateBps` | Tax rate in basis points for a given region ID | β | | |
| | `couponDiscount`| Coupon discount capped at 50% of subtotal | `subtotal` | | |
| | `loyaltyDiscount`| Loyalty points discount capped at 10% of subtotal | `subtotal` | | |
| | `finalPrice` | Final price = subtotal β discounts + tax | `subtotal`, `couponDiscount`, `loyaltyDiscount`, `taxRateBps` | | |
| ### What Makes It Interesting | |
| The discount cap logic uses integer division (`//`), not `round()` or `ceil()`. Agents that reach for floating-point will fail Lean verification. The dependency chain is linear (subtotal β discounts β tax β final price), giving a natural migration order that the `analyze_deps` action surfaces. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| def couponDiscount (order : Order) : Int := | |
| let sub := subtotal order | |
| let rawDiscount := order.coupons.foldl (fun total coupon => | |
| total + (sub * coupon.discountPercent / 100)) 0 | |
| min rawDiscount (sub / 2) | |
| ``` | |
| ### Scoring | |
| * 5 functions Γ (1/5) reward each on successful Lean verification | |
| * No proof obligation β functional correctness only | |
| --- | |
| ## Task 3 β Path Canonicalizer (`path_canonicalizer`) | |
| **Difficulty:** Medium | **Migration:** C β Rust | **Max Steps:** 25 | |
| ### What It Is | |
| A C path normalization library migrated to Rust. Paths are represented as lists of string segments (not raw strings). The normalizer removes `.` segments and resolves `..` clamped at the root β you cannot go above `/`. | |
| ### Why It Matters | |
| Path canonicalization is a security-critical operation. Incorrect `..` resolution leads to path traversal vulnerabilities. The Lean spec formally proves idempotency: normalizing an already-normalized path produces the same path. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | ------------------ | -------------------------------------------------------- | --------------- | | |
| | `normalizePath` | Remove `.`, resolve `..` clamped at root | β | | |
| | `joinPaths` | Join two segment lists: normalize(base ++ rel) | `normalizePath` | | |
| | `pathDepth` | Count the number of segments in a path | β | | |
| | `canonicalizeProof`| **Lean proof** that `normalize(normalize(p)) = normalize(p)` | `normalizePath` | | |
| ### What Makes It Interesting | |
| This is the first C β Rust task. Agents must write Rust with `serde_json` deserialization. The proof obligation (idempotency) is the first formal proof task β agents must reason about the list-based normalization algorithm in Lean rather than just implementing it. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| def normalizePath (p : PathSpec.Path) : PathSpec.Path := normalizeAux [] p | |
| -- Proof obligation: idempotency | |
| -- β p, normalizePath (normalizePath p) = normalizePath p | |
| ``` | |
| ### Scoring | |
| * 3 runtime functions Γ (1/4) reward each on Lean verification | |
| * 1 proof Γ (1/4) reward on proof acceptance | |
| * Proof obligation: `weight_proof = 0.2` | |
| --- | |
| ## Task 4 β Expression Evaluator (`expression_eval`) | |
| **Difficulty:** Medium | **Migration:** C β Rust | **Max Steps:** 25 | |
| ### What It Is | |
| A C recursive-descent expression evaluator migrated to Rust. Expressions are binary trees with literal leaves and binary operator nodes (`+`, `-`, `*`, `/`). Division by zero propagates as `None` (Rust) / `none` (Lean). | |
| ### Why It Matters | |
| Safe arithmetic evaluation is a common primitive in configuration languages, rule engines, and calculators. The Lean spec formalizes the `None`-propagation contract so the agent cannot silently swallow divide-by-zero. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | --------------- | ---------------------------------------------------------- | ----------- | | |
| | `evalBinOp` | Evaluate a binary operation; return `None` for `/ 0` | β | | |
| | `evalExpr` | Recursively evaluate an expression tree; propagate `None` | `evalBinOp` | | |
| | `divisionProof` | **Lean proof** that `evalBinOp .Div a 0 = none` for all `a` | `evalBinOp` | | |
| ### What Makes It Interesting | |
| The proof obligation is compact and tractable (`by simp [evalBinOp]`), making it a good first proof for agents exploring the proof mechanics. The expression tree is passed as a `serde_json::Value` from the test harness, requiring careful recursive deserialization. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| def evalBinOp (op : ExprSpec.Op) (a b : Int) : Option Int := | |
| match op with | |
| | .Add => some (a + b) | |
| | .Sub => some (a - b) | |
| | .Mul => some (a * b) | |
| | .Div => if b = 0 then none else some (a / b) | |
| -- Proof obligation: β a, evalBinOp .Div a 0 = none | |
| ``` | |
| ### Scoring | |
| * 2 runtime functions Γ (1/3) reward each | |
| * 1 proof Γ (1/3) reward on proof acceptance | |
| * Proof obligation: `weight_proof = 0.2` | |
| --- | |
| ## Task 5 β Payment Saga State Machine (`payment_saga`) | |
| **Difficulty:** Hard | **Migration:** JavaScript β Python | **Max Steps:** 40 | |
| ### What It Is | |
| A JavaScript Express.js payment saga migrated to Python. A saga is a compensating transaction pattern: it progresses through states (`Idle β Reserved β Authorized β Captured β Settled`) and supports rollback transitions (`CompensateCapture β CompensateAuthorize β CompensateReserve β Compensated`). The agent must also prove the **no-double-charge theorem**: the happy path reaches `Settled` and `isCharged` returns `true` exactly once. | |
| ### Why It Matters | |
| Payment systems are among the most heavily regulated and correctness-critical software. Proving no-double-charge is not just a unit test β it is a machine-checked guarantee that the state machine cannot reach a charged state more than once via the happy path. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | --------------------- | ---------------------------------------------------- | ---------------------------------- | | |
| | `transition` | Pure state transition for one event | β | | |
| | `runSaga` | Fold a sequence of events into the final saga state | `transition` | | |
| | `isCharged` | Return `true` for `Captured` or `Settled` | β | | |
| | `no_double_charge_proof` | **Lean proof** of happy-path no-double-charge theorem | `transition`, `runSaga`, `isCharged` | | |
| ### What Makes It Interesting | |
| This is the hardest proof in the easy/medium tier. The proof must reference `happyPath` and pre-proved lemmas (`happy_path_settles`) from the `SagaSpec` module. Agents that try to `native_decide` the entire proof will hit timeout; the intended approach uses `constructor`, `exact`, and `simp [isCharged]`. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| theorem no_double_charge : | |
| runSaga happyPath = .Settled β§ isCharged (runSaga happyPath) = true := by | |
| constructor | |
| Β· exact happy_path_settles | |
| Β· rw [happy_path_settles] | |
| simp [isCharged] | |
| ``` | |
| ### Scoring | |
| * 3 runtime functions Γ (1/4) reward each | |
| * 1 proof Γ (1/4) reward on proof acceptance | |
| * Proof obligation: `weight_proof = 0.3` | |
| --- | |
| ## Task 6 β LRU Cache (`lru_cache`) | |
| **Difficulty:** Hard | **Migration:** C++ β Rust | **Max Steps:** 40 | |
| ### What It Is | |
| A C++ LRU cache (backed by `std::list` + `unordered_map`) migrated to a Rust `Vec<(u64, u64)>` where the **front is MRU**. The agent must implement eviction, put (insert/update with MRU promotion), get (lookup with MRU promotion), and prove the capacity bound. | |
| ### Why It Matters | |
| LRU caches are ubiquitous in systems programming. The Lean spec pins the capacity invariant formally, ensuring the implementation never silently exceeds its capacity β a property that is hard to test exhaustively but trivial to prove once the algorithm is correct. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | ------------ | ---------------------------------------------------------------- | ---------- | | |
| | `lruEvict` | Trim cache to `cap` entries (keep front = MRU, drop LRU tail) | β | | |
| | `lruPut` | Insert/update key-val; move to MRU front; evict if over capacity | `lruEvict` | | |
| | `lruGet` | Lookup key; on hit, move to MRU front; return `(value, cache)` | β | | |
| | `lruPutProof`| **Lean proof** that `(lruPut cache cap k v).length β€ cap` | `lruPut` | | |
| ### What Makes It Interesting | |
| The Rust representation (`Vec<(u64,u64)>`) differs structurally from the C++ `std::list + unordered_map`. Agents must re-derive the algorithm from the Lean spec, not just translate the C++ line by line. The proof obligation is an inequality bound, requiring `omega` or `simp` with list length lemmas. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| def lruPut (cache : LruSpec.CacheState) (cap : Nat) (key val : Nat) : LruSpec.CacheState := | |
| ((key, val) :: cache.filter (fun p => p.1 β key)).take cap | |
| -- Proof obligation: β cache cap k v, (lruPut cache cap k v).length β€ cap | |
| ``` | |
| ### Scoring | |
| * 3 runtime functions Γ (1/4) reward each | |
| * 1 proof Γ (1/4) reward on proof acceptance | |
| * Proof obligation: `weight_proof = 0.2` | |
| --- | |
| ## Task 7 β Shortest Path / Dijkstra (`shortest_path`) | |
| **Difficulty:** Hard | **Migration:** C++ β Rust | **Max Steps:** 40 | |
| ### What It Is | |
| A C++ Dijkstra (Bellman-Ford relaxation loop) migrated to Rust. Input is a list of weighted directed edges `(from, to, weight)`. Output is a `Vec<Option<u64>>` of length `n` where `None` means unreachable. The agent must also prove that the self-distance `shortestDist graph n src src = some 0` when `src < n`. | |
| ### Why It Matters | |
| Shortest-path algorithms appear in routing, logistics, game AI, and network analysis. The Lean spec uses a Bellman-Ford style relaxation (not a priority queue), making the formal model tractable while the Rust implementation can use any correct strategy. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | -------------- | -------------------------------------------------------------------- | ---------- | | |
| | `dijkstra` | Single-source shortest paths; returns `Vec<Option<u64>>` of length n | β | | |
| | `shortestDist` | Shortest distance from `src` to `dst`, or `None` if unreachable | `dijkstra` | | |
| | `relaxProof` | **Lean proof** that `shortestDist graph n src src = some 0` | `shortestDist` | | |
| ### What Makes It Interesting | |
| The Lean spec models Dijkstra as iterated Bellman-Ford relaxation β `n-1` passes over all edges. Agents writing Rust can use a proper priority queue for efficiency, but the formal model in Lean is the ground truth for correctness. The proof requires reasoning about the initial distance vector and the `src < n` guard. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| def dijkstra (graph : ShortestPathSpec.Graph) (n src : Nat) : List (Option Nat) := | |
| let init := (List.range n).map (fun i => if i == src then some 0 else none) | |
| (List.range (n - 1)).foldl (fun d _ => relaxPass d graph) init | |
| -- Proof obligation: β graph n src, src < n β shortestDist graph n src src = some 0 | |
| ``` | |
| ### Scoring | |
| * 2 runtime functions Γ (1/3) reward each | |
| * 1 proof Γ (1/3) reward on proof acceptance | |
| * Proof obligation: `weight_proof = 0.2` | |
| --- | |
| ## Task 8 β Interval Scheduler (`interval_scheduler`) | |
| **Difficulty:** Hard | **Migration:** C++ β Rust | **Max Steps:** 40 | |
| ### What It Is | |
| A C++ interval scheduling library migrated to Rust. Two algorithms: (1) **merge overlapping intervals** (sort by start, scan and merge), and (2) **greedy activity selection** (sort by end time, pick non-overlapping intervals to maximize count). The agent must also prove the merge output is sorted and non-overlapping. | |
| ### Why It Matters | |
| Interval scheduling is the foundation of calendar systems, resource allocators, and database range scans. The merge-then-prove pattern is a canonical example of verifying a sorting + merging algorithm β a class of problems where off-by-one errors are common and hard to catch with finite tests. | |
| ### Functions to Migrate | |
| | Function | Description | Depends On | | |
| | ---------------- | ---------------------------------------------------------------------- | ---------- | | |
| | `mergeIntervals` | Sort by start, merge overlapping/touching intervals | β | | |
| | `maxSchedule` | Greedy activity selection: sort by end time, pick non-overlapping | β | | |
| | `mergeProof` | **Lean proof** that `mergeIntervals` output is sorted + non-overlapping | `mergeIntervals` | | |
| ### What Makes It Interesting | |
| Both functions are independent, which means the agent can migrate them in either order. The proof obligation (`isSorted β§ noOverlap`) is a conjunction that requires two separate sub-proofs, making it the most structurally complex proof in the suite. Agents must use the `constructor` tactic and prove each branch using the Lean definitions of `isSorted` and `noOverlap` from `IntervalSpec`. | |
| ### Lean Spec Excerpt | |
| ```lean | |
| def mergeIntervals (ivs : List IntervalSpec.Interval) : List IntervalSpec.Interval := | |
| (sortByStart ivs).foldl (fun acc iv => | |
| match acc.getLast? with | |
| | none => [iv] | |
| | some (ls, le) => | |
| if iv.1 β€ le then dropLastInterval acc ++ [(ls, max le iv.2)] | |
| else acc ++ [iv]) [] | |
| -- Proof obligation: β ivs, isSorted (mergeIntervals ivs) β§ noOverlap (mergeIntervals ivs) | |
| ``` | |
| ### Scoring | |
| * 2 runtime functions Γ (1/3) reward each | |
| * 1 proof Γ (1/3) reward on proof acceptance | |
| * Proof obligation: `weight_proof = 0.2` | |
| --- | |
| ## Reward Design | |
| ### Philosophy | |
| Every action provides signal β there is no sparse-reward problem: | |
| 1. **Exploration signal**: `inspect` and `analyze_deps` give small positive rewards (+0.05) so agents learn to gather information before attempting implementation. | |
| 2. **Functional signal**: `run_tests` gives +0.10 on full pass, with proportional penalties on partial failure (β0.01 per failing case). This guides the agent toward correct behavior before spending Lean compilation time. | |
| 3. **Verification signal**: `submit` gives a large positive reward (`1 / n_functions`) when Lean accepts the code, and a small penalty (β0.05) on rejection. The penalty is small enough not to discourage reasonable attempts, but nonzero to penalize spamming. | |
| 4. **Proof signal**: Proof tasks split the reward between functional implementation and the proof, with explicit weight fields (`weight_functional`, `weight_property`, `weight_proof`) configurable per task. | |
| ### Score Clamping | |
| Internal episode scores are clamped to `[0.01, 0.99]` β this is the "open unit" convention from OpenEnv that keeps scores away from the extremes, which some validators interpret specially. Step reward logs clamp to `[0.0, 1.0]`. | |
| ### Episode Termination | |
| An episode ends when: | |
| * All functions are verified (`progress = 1.0`), **or** | |
| * The step limit is reached (`step_count β₯ max_steps`) | |
| The step limit scales with difficulty: 15 steps for easy tasks, 25 for medium, 40 for hard. | |
| --- | |
| ## Design Notes for Evaluators | |
| **Why Lean 4 as the verifier?** Lean 4 provides machine-checked guarantees that go beyond unit tests. A submission that passes sample cases but violates the formal spec is rejected. This makes it impossible to "cheat" with a hardcoded lookup table that only works on known inputs. | |
| **Why multiple language pairs?** Cross-language migration is harder than same-language refactoring. Each language pair exercises different aspects of the agent's capability: type system awareness, idiomatic pattern translation, and error handling conventions. | |
| **Why incremental function-level migration?** Submitting the entire file at once would give the agent no partial-credit signal and would make Lean errors much harder to diagnose. Function-level granularity mirrors real code review and makes the reward landscape smooth. | |
| **Difficulty calibration:** | |
| - Easy: one language pair, no proofs, short dependency chains, small functions | |
| - Medium: two language pairs (including CβRust), optional proofs, medium chains | |
| - Hard: C++βRust with complex data structures, mandatory proofs, long chains | |