lean-migrate / tasks.md
Hrushi's picture
Upload folder using huggingface_hub
bf9c466 verified
# LeanMigrate Task Reference
LeanMigrate models a real engineering workflow: a developer must migrate production business logic from one language to another, then prove correctness against a Lean 4 formal specification. Each task is structured as a small function dependency graph. The agent must migrate each function in dependency order, run sample tests, and pass Lean verification before moving to the next.
---
## Overview
| # | Task ID | Migration | Difficulty | Functions | Proof? |
| -- | --------------------- | -------------------- | ---------- | --------- | ------ |
| 1 | `rbac_auth` | Python β†’ TypeScript | Easy | 3 | No |
| 2 | `pricing_engine` | JavaScript β†’ Python | Medium | 5 | No |
| 3 | `path_canonicalizer` | C β†’ Rust | Medium | 4 | Yes |
| 4 | `expression_eval` | C β†’ Rust | Medium | 3 | Yes |
| 5 | `payment_saga` | JavaScript β†’ Python | Hard | 4 | Yes |
| 6 | `lru_cache` | C++ β†’ Rust | Hard | 4 | Yes |
| 7 | `shortest_path` | C++ β†’ Rust | Hard | 3 | Yes |
| 8 | `interval_scheduler` | C++ β†’ Rust | Hard | 3 | Yes |
---
## Task 1 β€” RBAC Authorization Engine (`rbac_auth`)
**Difficulty:** Easy | **Migration:** Python β†’ TypeScript | **Max Steps:** 15
### What It Is
A role-based access control (RBAC) engine that models permissions as a directed inheritance graph. Roles can inherit permissions from parent roles, enabling hierarchical access policies (e.g. `admin` inherits from `editor` which inherits from `viewer`).
### Why It Matters
RBAC is the access control backbone of most enterprise software. Migrating it correctly and proving its behavior is critical: a bug in permission checking is a security vulnerability.
### Functions to Migrate
| Function | Description | Depends On |
| -------------------- | ----------------------------------------------------------------- | ------------------------------------ |
| `findRole` | Look up a role by name from the role list | β€” |
| `hasDirectPermission`| Check if a role has a specific resource/action permission directly | β€” |
| `canAccess` | Check permission with role inheritance traversal, depth limit 5 | `findRole`, `hasDirectPermission` |
### What Makes It Interesting
The inheritance traversal is bounded (depth ≀ 5) to prevent cycles from causing infinite loops. The Lean spec expresses this via a `depth : Nat` countdown parameter. Agents must replicate this exact bounding behavior in TypeScript, using `function` declarations (not arrow functions) so tree-sitter can extract them for the verification IR.
### Lean Spec Excerpt
```lean
def canAccess (roles : RoleMap) (roleName resource action : String) (depth : Nat := 5) : Bool :=
match depth, findRole roles roleName with
| 0, _ => false
| _, none => false
| n + 1, some role =>
hasDirectPermission role resource action ||
role.inherits.any (fun parentName =>
canAccess roles parentName resource action n)
```
### Scoring
* 3 functions Γ— (1/3) reward each on successful Lean verification
* No proof obligation β€” functional correctness only
---
## Task 2 β€” E-Commerce Pricing Engine (`pricing_engine`)
**Difficulty:** Medium | **Migration:** JavaScript β†’ Python | **Max Steps:** 25
### What It Is
A JavaScript pricing engine that computes the final price of an e-commerce order in **integer cents** (no floating point). The engine handles line item subtotals, coupon discounts capped at 50% of subtotal, loyalty point discounts capped at 10% of subtotal, and region-based tax rates in basis points.
### Why It Matters
Pricing bugs cost real money. Integer arithmetic and correct discount-capping semantics are notoriously tricky to migrate across languages. The Lean spec pins the exact integer semantics so the agent cannot accidentally introduce floating-point rounding.
### Functions to Migrate
| Function | Description | Depends On |
| --------------- | --------------------------------------------------------- | --------------------------------------------- |
| `subtotal` | Sum of unit price Γ— quantity for all line items | β€” |
| `taxRateBps` | Tax rate in basis points for a given region ID | β€” |
| `couponDiscount`| Coupon discount capped at 50% of subtotal | `subtotal` |
| `loyaltyDiscount`| Loyalty points discount capped at 10% of subtotal | `subtotal` |
| `finalPrice` | Final price = subtotal βˆ’ discounts + tax | `subtotal`, `couponDiscount`, `loyaltyDiscount`, `taxRateBps` |
### What Makes It Interesting
The discount cap logic uses integer division (`//`), not `round()` or `ceil()`. Agents that reach for floating-point will fail Lean verification. The dependency chain is linear (subtotal β†’ discounts β†’ tax β†’ final price), giving a natural migration order that the `analyze_deps` action surfaces.
### Lean Spec Excerpt
```lean
def couponDiscount (order : Order) : Int :=
let sub := subtotal order
let rawDiscount := order.coupons.foldl (fun total coupon =>
total + (sub * coupon.discountPercent / 100)) 0
min rawDiscount (sub / 2)
```
### Scoring
* 5 functions Γ— (1/5) reward each on successful Lean verification
* No proof obligation β€” functional correctness only
---
## Task 3 β€” Path Canonicalizer (`path_canonicalizer`)
**Difficulty:** Medium | **Migration:** C β†’ Rust | **Max Steps:** 25
### What It Is
A C path normalization library migrated to Rust. Paths are represented as lists of string segments (not raw strings). The normalizer removes `.` segments and resolves `..` clamped at the root β€” you cannot go above `/`.
### Why It Matters
Path canonicalization is a security-critical operation. Incorrect `..` resolution leads to path traversal vulnerabilities. The Lean spec formally proves idempotency: normalizing an already-normalized path produces the same path.
### Functions to Migrate
| Function | Description | Depends On |
| ------------------ | -------------------------------------------------------- | --------------- |
| `normalizePath` | Remove `.`, resolve `..` clamped at root | β€” |
| `joinPaths` | Join two segment lists: normalize(base ++ rel) | `normalizePath` |
| `pathDepth` | Count the number of segments in a path | β€” |
| `canonicalizeProof`| **Lean proof** that `normalize(normalize(p)) = normalize(p)` | `normalizePath` |
### What Makes It Interesting
This is the first C β†’ Rust task. Agents must write Rust with `serde_json` deserialization. The proof obligation (idempotency) is the first formal proof task β€” agents must reason about the list-based normalization algorithm in Lean rather than just implementing it.
### Lean Spec Excerpt
```lean
def normalizePath (p : PathSpec.Path) : PathSpec.Path := normalizeAux [] p
-- Proof obligation: idempotency
-- βˆ€ p, normalizePath (normalizePath p) = normalizePath p
```
### Scoring
* 3 runtime functions Γ— (1/4) reward each on Lean verification
* 1 proof Γ— (1/4) reward on proof acceptance
* Proof obligation: `weight_proof = 0.2`
---
## Task 4 β€” Expression Evaluator (`expression_eval`)
**Difficulty:** Medium | **Migration:** C β†’ Rust | **Max Steps:** 25
### What It Is
A C recursive-descent expression evaluator migrated to Rust. Expressions are binary trees with literal leaves and binary operator nodes (`+`, `-`, `*`, `/`). Division by zero propagates as `None` (Rust) / `none` (Lean).
### Why It Matters
Safe arithmetic evaluation is a common primitive in configuration languages, rule engines, and calculators. The Lean spec formalizes the `None`-propagation contract so the agent cannot silently swallow divide-by-zero.
### Functions to Migrate
| Function | Description | Depends On |
| --------------- | ---------------------------------------------------------- | ----------- |
| `evalBinOp` | Evaluate a binary operation; return `None` for `/ 0` | β€” |
| `evalExpr` | Recursively evaluate an expression tree; propagate `None` | `evalBinOp` |
| `divisionProof` | **Lean proof** that `evalBinOp .Div a 0 = none` for all `a` | `evalBinOp` |
### What Makes It Interesting
The proof obligation is compact and tractable (`by simp [evalBinOp]`), making it a good first proof for agents exploring the proof mechanics. The expression tree is passed as a `serde_json::Value` from the test harness, requiring careful recursive deserialization.
### Lean Spec Excerpt
```lean
def evalBinOp (op : ExprSpec.Op) (a b : Int) : Option Int :=
match op with
| .Add => some (a + b)
| .Sub => some (a - b)
| .Mul => some (a * b)
| .Div => if b = 0 then none else some (a / b)
-- Proof obligation: βˆ€ a, evalBinOp .Div a 0 = none
```
### Scoring
* 2 runtime functions Γ— (1/3) reward each
* 1 proof Γ— (1/3) reward on proof acceptance
* Proof obligation: `weight_proof = 0.2`
---
## Task 5 β€” Payment Saga State Machine (`payment_saga`)
**Difficulty:** Hard | **Migration:** JavaScript β†’ Python | **Max Steps:** 40
### What It Is
A JavaScript Express.js payment saga migrated to Python. A saga is a compensating transaction pattern: it progresses through states (`Idle β†’ Reserved β†’ Authorized β†’ Captured β†’ Settled`) and supports rollback transitions (`CompensateCapture β†’ CompensateAuthorize β†’ CompensateReserve β†’ Compensated`). The agent must also prove the **no-double-charge theorem**: the happy path reaches `Settled` and `isCharged` returns `true` exactly once.
### Why It Matters
Payment systems are among the most heavily regulated and correctness-critical software. Proving no-double-charge is not just a unit test β€” it is a machine-checked guarantee that the state machine cannot reach a charged state more than once via the happy path.
### Functions to Migrate
| Function | Description | Depends On |
| --------------------- | ---------------------------------------------------- | ---------------------------------- |
| `transition` | Pure state transition for one event | β€” |
| `runSaga` | Fold a sequence of events into the final saga state | `transition` |
| `isCharged` | Return `true` for `Captured` or `Settled` | β€” |
| `no_double_charge_proof` | **Lean proof** of happy-path no-double-charge theorem | `transition`, `runSaga`, `isCharged` |
### What Makes It Interesting
This is the hardest proof in the easy/medium tier. The proof must reference `happyPath` and pre-proved lemmas (`happy_path_settles`) from the `SagaSpec` module. Agents that try to `native_decide` the entire proof will hit timeout; the intended approach uses `constructor`, `exact`, and `simp [isCharged]`.
### Lean Spec Excerpt
```lean
theorem no_double_charge :
runSaga happyPath = .Settled ∧ isCharged (runSaga happyPath) = true := by
constructor
Β· exact happy_path_settles
Β· rw [happy_path_settles]
simp [isCharged]
```
### Scoring
* 3 runtime functions Γ— (1/4) reward each
* 1 proof Γ— (1/4) reward on proof acceptance
* Proof obligation: `weight_proof = 0.3`
---
## Task 6 β€” LRU Cache (`lru_cache`)
**Difficulty:** Hard | **Migration:** C++ β†’ Rust | **Max Steps:** 40
### What It Is
A C++ LRU cache (backed by `std::list` + `unordered_map`) migrated to a Rust `Vec<(u64, u64)>` where the **front is MRU**. The agent must implement eviction, put (insert/update with MRU promotion), get (lookup with MRU promotion), and prove the capacity bound.
### Why It Matters
LRU caches are ubiquitous in systems programming. The Lean spec pins the capacity invariant formally, ensuring the implementation never silently exceeds its capacity β€” a property that is hard to test exhaustively but trivial to prove once the algorithm is correct.
### Functions to Migrate
| Function | Description | Depends On |
| ------------ | ---------------------------------------------------------------- | ---------- |
| `lruEvict` | Trim cache to `cap` entries (keep front = MRU, drop LRU tail) | β€” |
| `lruPut` | Insert/update key-val; move to MRU front; evict if over capacity | `lruEvict` |
| `lruGet` | Lookup key; on hit, move to MRU front; return `(value, cache)` | β€” |
| `lruPutProof`| **Lean proof** that `(lruPut cache cap k v).length ≀ cap` | `lruPut` |
### What Makes It Interesting
The Rust representation (`Vec<(u64,u64)>`) differs structurally from the C++ `std::list + unordered_map`. Agents must re-derive the algorithm from the Lean spec, not just translate the C++ line by line. The proof obligation is an inequality bound, requiring `omega` or `simp` with list length lemmas.
### Lean Spec Excerpt
```lean
def lruPut (cache : LruSpec.CacheState) (cap : Nat) (key val : Nat) : LruSpec.CacheState :=
((key, val) :: cache.filter (fun p => p.1 β‰  key)).take cap
-- Proof obligation: βˆ€ cache cap k v, (lruPut cache cap k v).length ≀ cap
```
### Scoring
* 3 runtime functions Γ— (1/4) reward each
* 1 proof Γ— (1/4) reward on proof acceptance
* Proof obligation: `weight_proof = 0.2`
---
## Task 7 β€” Shortest Path / Dijkstra (`shortest_path`)
**Difficulty:** Hard | **Migration:** C++ β†’ Rust | **Max Steps:** 40
### What It Is
A C++ Dijkstra (Bellman-Ford relaxation loop) migrated to Rust. Input is a list of weighted directed edges `(from, to, weight)`. Output is a `Vec<Option<u64>>` of length `n` where `None` means unreachable. The agent must also prove that the self-distance `shortestDist graph n src src = some 0` when `src < n`.
### Why It Matters
Shortest-path algorithms appear in routing, logistics, game AI, and network analysis. The Lean spec uses a Bellman-Ford style relaxation (not a priority queue), making the formal model tractable while the Rust implementation can use any correct strategy.
### Functions to Migrate
| Function | Description | Depends On |
| -------------- | -------------------------------------------------------------------- | ---------- |
| `dijkstra` | Single-source shortest paths; returns `Vec<Option<u64>>` of length n | β€” |
| `shortestDist` | Shortest distance from `src` to `dst`, or `None` if unreachable | `dijkstra` |
| `relaxProof` | **Lean proof** that `shortestDist graph n src src = some 0` | `shortestDist` |
### What Makes It Interesting
The Lean spec models Dijkstra as iterated Bellman-Ford relaxation β€” `n-1` passes over all edges. Agents writing Rust can use a proper priority queue for efficiency, but the formal model in Lean is the ground truth for correctness. The proof requires reasoning about the initial distance vector and the `src < n` guard.
### Lean Spec Excerpt
```lean
def dijkstra (graph : ShortestPathSpec.Graph) (n src : Nat) : List (Option Nat) :=
let init := (List.range n).map (fun i => if i == src then some 0 else none)
(List.range (n - 1)).foldl (fun d _ => relaxPass d graph) init
-- Proof obligation: βˆ€ graph n src, src < n β†’ shortestDist graph n src src = some 0
```
### Scoring
* 2 runtime functions Γ— (1/3) reward each
* 1 proof Γ— (1/3) reward on proof acceptance
* Proof obligation: `weight_proof = 0.2`
---
## Task 8 β€” Interval Scheduler (`interval_scheduler`)
**Difficulty:** Hard | **Migration:** C++ β†’ Rust | **Max Steps:** 40
### What It Is
A C++ interval scheduling library migrated to Rust. Two algorithms: (1) **merge overlapping intervals** (sort by start, scan and merge), and (2) **greedy activity selection** (sort by end time, pick non-overlapping intervals to maximize count). The agent must also prove the merge output is sorted and non-overlapping.
### Why It Matters
Interval scheduling is the foundation of calendar systems, resource allocators, and database range scans. The merge-then-prove pattern is a canonical example of verifying a sorting + merging algorithm β€” a class of problems where off-by-one errors are common and hard to catch with finite tests.
### Functions to Migrate
| Function | Description | Depends On |
| ---------------- | ---------------------------------------------------------------------- | ---------- |
| `mergeIntervals` | Sort by start, merge overlapping/touching intervals | β€” |
| `maxSchedule` | Greedy activity selection: sort by end time, pick non-overlapping | β€” |
| `mergeProof` | **Lean proof** that `mergeIntervals` output is sorted + non-overlapping | `mergeIntervals` |
### What Makes It Interesting
Both functions are independent, which means the agent can migrate them in either order. The proof obligation (`isSorted ∧ noOverlap`) is a conjunction that requires two separate sub-proofs, making it the most structurally complex proof in the suite. Agents must use the `constructor` tactic and prove each branch using the Lean definitions of `isSorted` and `noOverlap` from `IntervalSpec`.
### Lean Spec Excerpt
```lean
def mergeIntervals (ivs : List IntervalSpec.Interval) : List IntervalSpec.Interval :=
(sortByStart ivs).foldl (fun acc iv =>
match acc.getLast? with
| none => [iv]
| some (ls, le) =>
if iv.1 ≀ le then dropLastInterval acc ++ [(ls, max le iv.2)]
else acc ++ [iv]) []
-- Proof obligation: βˆ€ ivs, isSorted (mergeIntervals ivs) ∧ noOverlap (mergeIntervals ivs)
```
### Scoring
* 2 runtime functions Γ— (1/3) reward each
* 1 proof Γ— (1/3) reward on proof acceptance
* Proof obligation: `weight_proof = 0.2`
---
## Reward Design
### Philosophy
Every action provides signal β€” there is no sparse-reward problem:
1. **Exploration signal**: `inspect` and `analyze_deps` give small positive rewards (+0.05) so agents learn to gather information before attempting implementation.
2. **Functional signal**: `run_tests` gives +0.10 on full pass, with proportional penalties on partial failure (βˆ’0.01 per failing case). This guides the agent toward correct behavior before spending Lean compilation time.
3. **Verification signal**: `submit` gives a large positive reward (`1 / n_functions`) when Lean accepts the code, and a small penalty (βˆ’0.05) on rejection. The penalty is small enough not to discourage reasonable attempts, but nonzero to penalize spamming.
4. **Proof signal**: Proof tasks split the reward between functional implementation and the proof, with explicit weight fields (`weight_functional`, `weight_property`, `weight_proof`) configurable per task.
### Score Clamping
Internal episode scores are clamped to `[0.01, 0.99]` β€” this is the "open unit" convention from OpenEnv that keeps scores away from the extremes, which some validators interpret specially. Step reward logs clamp to `[0.0, 1.0]`.
### Episode Termination
An episode ends when:
* All functions are verified (`progress = 1.0`), **or**
* The step limit is reached (`step_count β‰₯ max_steps`)
The step limit scales with difficulty: 15 steps for easy tasks, 25 for medium, 40 for hard.
---
## Design Notes for Evaluators
**Why Lean 4 as the verifier?** Lean 4 provides machine-checked guarantees that go beyond unit tests. A submission that passes sample cases but violates the formal spec is rejected. This makes it impossible to "cheat" with a hardcoded lookup table that only works on known inputs.
**Why multiple language pairs?** Cross-language migration is harder than same-language refactoring. Each language pair exercises different aspects of the agent's capability: type system awareness, idiomatic pattern translation, and error handling conventions.
**Why incremental function-level migration?** Submitting the entire file at once would give the agent no partial-credit signal and would make Lean errors much harder to diagnose. Function-level granularity mirrors real code review and makes the reward landscape smooth.
**Difficulty calibration:**
- Easy: one language pair, no proofs, short dependency chains, small functions
- Medium: two language pairs (including C→Rust), optional proofs, medium chains
- Hard: C++β†’Rust with complex data structures, mandatory proofs, long chains