XllentAI/modular_arithmetic

@@ -19,9 +19,10 @@ metrics:
 A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
 2^2048) on the public benchmark — tiers 1-5 = 100%, tier 6 = 98%, tiers 7-8 = 100%,
 tier 9 = 99%, **tier 10 = 100%** — so `highest_tier_above_90 = 10` (the maximum),
-overall_accuracy **0.997**. Every cell is the same **carry-aware TCN** (~30M params total, 0.13 GB),
-so its capability comes from *learning one algorithmic step* rather than memorising finite
-multiplication tables, and it verifiably generalises to primes never seen in training.
 ## The idea
@@ -47,25 +48,26 @@ The single-step function is **piecewise linear** (`2t + bit*b`, then subtract 0,
 `2p`), which is why it generalises across primes where the full bilinear map does not:
 held-out-prime validation accuracy tracks training accuracy throughout (no memorisation gap).
-## Eight cells, routed by prime size
-The recurrence is exact only if the state is wide enough to hold the residue, so the cell is
-trained per bit-width. The model ships eight and routes each problem to the narrowest cell
-whose state holds its prime:
-| Cell | Primes | Tiers | Architecture | Params | Public benchmark |
 |---|---|---|---|---|---|
-| 16-bit | `< 2^16` | 1-3 | carry-aware TCN, 6 blocks, dil 1..8 | ~2.4M | tiers 1-3 = 1.00 |
-| 32-bit | `< 2^32` | 4 | carry-aware TCN, 8 blocks, dil 1..16 | ~3.2M | tier 4 = 1.00 |
-| 64-bit | `< 2^64` | 5 | carry-aware TCN, 8 blocks, dil 1..32 | ~3.2M | tier 5 = 0.99 |
-| 128-bit | `< 2^128` | 6 | carry-aware TCN, 10 blocks, dil 1..64 | ~3.9M | tier 6 = 0.97 |
-| 256-bit | `< 2^256` | 7 | carry-aware TCN, 12 blocks, dil 1..128 | ~4.7M | tier 7 = 0.98 |
-| 512-bit | `< 2^512` | 8 | carry-aware TCN, 14 blocks, dil 1..256 | ~5.5M | tier 8 = 0.98 |
-| 1024-bit | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
-| 2048-bit | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |
-For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]` fallback without
-invoking the network.
 ## The carry-aware TCN (every tier)
@@ -134,12 +136,17 @@ position. Combined with gradient accumulation (effective batch ~26k) and the wor
 loss, this took tier 9 from **0.73 -> 0.99**, even across prime widths (held-out value-uniform
 validation 0.99; per-width 1015-1024 all ~0.99).
 ```bash
-python horner_rnn/train.py --stage1-minutes 50                  # 16-bit cell -> weights16.pt
-python exploration/train_horner32.py --minutes 120              # 32-bit cell -> weights32.pt
-python exploration/train_horner_tcn.py --bits 64  --blocks 8  --max-dil 32  --lo-bits 62  # tier 5
-python exploration/train_horner_tcn.py --bits 256 --blocks 12 --max-dil 128 --lo-bits 251 # tier 7
-python exploration/train_horner_tcn.py --bits 512 --blocks 14 --max-dil 256 --accum 2     # tier 8
 ```
 The **1024-bit (tier-9) cell is a multi-stage curriculum**, not a single run — the carry

 A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
 2^2048) on the public benchmark — tiers 1-5 = 100%, tier 6 = 98%, tiers 7-8 = 100%,
 tier 9 = 99%, **tier 10 = 100%** — so `highest_tier_above_90 = 10` (the maximum),
+overall_accuracy **0.997**. Every cell is the same **carry-aware TCN** (~21M params total across
+five weight-sets, 0.13 GB), so its capability comes from *learning one algorithmic step* rather
+than memorising finite multiplication tables, and it verifiably generalises to primes never seen
+in training.
 ## The idea
 `2p`), which is why it generalises across primes where the full bilinear map does not:
 held-out-prime validation accuracy tracks training accuracy throughout (no memorisation gap).
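
The piecewise-linear step described above can be sketched in plain Python. This is an illustrative reference only, not the shipped network: the names `horner_step` and `modmul` are hypothetical, and the sketch assumes `0 <= t, b < p`, so that `2t + bit*b` stays below `3p` and a single subtraction of `0`, `p`, or `2p` reduces it.

```python
def horner_step(t: int, bit: int, b: int, p: int) -> int:
    """One bit-level Horner step: (2t + bit*b) mod p, without using %."""
    z = 2 * t + bit * b      # with 0 <= t, b < p this is at most 3p - 3
    if z >= 2 * p:           # quotient 2: subtract 2p
        return z - 2 * p
    if z >= p:               # quotient 1: subtract p
        return z - p
    return z                 # quotient 0: subtract nothing


def modmul(a: int, b: int, p: int, bits: int) -> int:
    """Compute a*b mod p by feeding the bits of `a`, MSB first, into the step."""
    t = 0
    for i in range(bits - 1, -1, -1):
        t = horner_step(t, (a >> i) & 1, b, p)
    return t


if __name__ == "__main__":
    a, b, p = 12345, 54321, 65521
    print(modmul(a, b, p, bits=16) == (a * b) % p)
```

Only three branches exist regardless of `p`, which is the property the text credits for generalisation across primes.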
+## Five weight-sets, routed by prime size
+The recurrence is exact only if the state is wide enough to hold the residue, so each cell is
+trained per bit-width — but because the dilated convolution is weight-shared across bit-positions
+and the carry/borrow rule is position-invariant, **one shared weight-set serves the four mid
+widths 64/128/256/512** (run at each prime's native width). The model therefore ships **five
+weight-sets** and routes each problem to the narrowest cell whose state holds its prime:
+| Weight file | Primes | Tiers | Architecture | Params | Public benchmark |
 |---|---|---|---|---|---|
+| `weights16.pt` | `< 2^16` | 1-3 | carry-aware TCN, 6 blocks, dil 1..8 | ~2.4M | tiers 1-3 = 1.00 |
+| `weights32.pt` | `< 2^32` | 4 | carry-aware TCN, 8 blocks, dil 1..16 | ~3.2M | tier 4 = 1.00 |
+| `weights_shared_64_512.pt` | `< 2^512` | 5-8 | carry-aware TCN, 14 blocks, dil 1..256 — **one shared set**, run at native width | ~5.5M | tier 5 = 1.00, tier 6 = 0.98, tier 7 = 1.00, tier 8 = 1.00 |
+| `weights1024.pt` | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
+| `weights2048.pt` | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |
+The four separate mid-width cells that the shared set replaces (tiers 5-8 at 0.99 / 0.97 / 0.98 / 0.98,
+~17M params combined) are collapsed into that single set, which matches or beats each of them at
+~5.5M params, bringing the total to **~21M params, 0.13 GB**. For `p >= 2^2048` (outside all
+regimes) the model emits the honest `[0]` fallback without invoking the network.
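
The routing rule in the table above could look roughly like the following sketch. The file names and bit limits are taken from the table; the function name `route`, the return shape, and the reading of "native width" as the narrowest of 64/128/256/512 bits that holds the prime are assumptions made for illustration, not the repo's actual code.

```python
from typing import Optional, Tuple

# (exclusive upper bound in bits, weight file), narrowest first, as in the table.
CELLS = [
    (16, "weights16.pt"),
    (32, "weights32.pt"),
    (512, "weights_shared_64_512.pt"),
    (1024, "weights1024.pt"),
    (2048, "weights2048.pt"),
]
SHARED_FILE = "weights_shared_64_512.pt"
SHARED_WIDTHS = (64, 128, 256, 512)  # assumed set of "native" widths


def route(p: int) -> Optional[Tuple[str, int]]:
    """Return (weight_file, state_width_bits), or None for the `[0]` fallback."""
    for limit_bits, weight_file in CELLS:
        if p < (1 << limit_bits):
            width = limit_bits
            if weight_file == SHARED_FILE:
                # Assumption: run the shared set at the narrowest mid width.
                width = min(w for w in SHARED_WIDTHS if p < (1 << w))
            return weight_file, width
    return None  # p >= 2^2048: outside all regimes, the network is not invoked


if __name__ == "__main__":
    for prime in (65521, (1 << 61) - 1, (1 << 127) - 1):
        print(prime.bit_length(), route(prime))
```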
 ## The carry-aware TCN (every tier)
 loss, this took tier 9 from **0.73 -> 0.99**, even across prime widths (held-out value-uniform
 validation 0.99; per-width 1015-1024 all ~0.99).
+The training scripts live in the companion research repo (not shipped in this model repo); the
+commands below document *how the weights were obtained* (the provenance the rules ask for):
 ```bash
+# small-prime cells, width-matched (bit-length-uniform over the cell's whole range + value-uniform)
+python exploration/train_horner_tcn.py --bits 16 --lo-bits 2  --bitlen-frac 0.65 --bitlen-lo 2   # -> weights16.pt  (tiers 1-3)
+python exploration/train_horner_tcn.py --bits 32 --lo-bits 17 --bitlen-frac 0.6  --bitlen-lo 17  # -> weights32.pt  (tier 4)
+# ONE shared mid-width set for tiers 5-8: warm-start from the dedicated 512-bit cell, then
+# fine-tune on a {64,128,256,512}-bit mix (the carry rule is width-portable) -> weights_shared_64_512.pt
+python exploration/train_unified.py --warm --init-from weights512.pt --widths 64,128,256,512
 ```
 The **1024-bit (tier-9) cell is a multi-stage curriculum**, not a single run — the carry

train.py DELETED

@@ -1,221 +0,0 @@
-"""Train the horner_rnn transition cell (bit-level Horner step) + chain fine-tuning.
-Stage 1: train cell f(t, bit, b, p) = (2t + bit*b) mod p (quotients {0,1,2},
-easier than base-4's {0..6}) with grad clipping, EMA, hard-boundary mining.
-Stage 2 (optional, default off): fine-tune end-to-end through the 16-step
-chain with a straight-through estimator on the quantized state, loss on every
-step's ground-truth intermediate. In practice this was destructive at lr2=5e-5
-(chain val collapsed); the shipped weights come from stage 1 alone, which
-reaches chain val ~0.998 on held-out primes. Kept for further experimentation
-at lower learning rates.
-"""
-from __future__ import annotations
-import argparse
-import time
-import sys
-from pathlib import Path
-import torch
-import torch.nn as nn
-# Import the shared architecture from the sibling model.py.
-HERE = Path(__file__).resolve().parent
-sys.path.insert(0, str(HERE))
-from model import HornerCell, BITS, _to_bits as to_bits  # noqa: E402
-def sieve_primes(limit: int) -> list[int]:
-    is_p = bytearray([1]) * limit
-    is_p[0] = is_p[1] = 0
-    for i in range(2, int(limit ** 0.5) + 1):
-        if is_p[i]:
-            is_p[i * i :: i] = bytearray(len(is_p[i * i :: i]))
-    return [i for i in range(2, limit) if is_p[i]]
-def sample_batch(primes_t, n, device, hard_frac=0.5):
-    p = primes_t[torch.randint(len(primes_t), (n,), device=device)]
-    b = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
-    bit = torch.randint(0, 2, (n,), device=device)
-    n_hard = int(n * hard_frac)
-    t = torch.empty(n, dtype=torch.long, device=device)
-    t[n_hard:] = (torch.rand(n - n_hard, device=device) * p[n_hard:]).long()
-    if n_hard:
-        ph, bh, bith = p[:n_hard], b[:n_hard], bit[:n_hard]
-        q = torch.randint(0, 3, (n_hard,), device=device)
-        delta = torch.randint(-2, 3, (n_hard,), device=device)
-        th = (q * ph + delta - bith * bh) >> 1
-        t[:n_hard] = th.clamp(min=0) % ph
-    z = (2 * t + bit * b) % p
-    return t, bit, b, p, z
-@torch.no_grad()
-def exact_rate(model, primes_t, device, n=200_000, bs=65536) -> float:
-    ok = 0
-    for i in range(0, n, bs):
-        m = min(bs, n - i)
-        t, bit, b, p, z = sample_batch(primes_t, m, device, hard_frac=0.0)
-        logits = model(to_bits(t), bit.float().unsqueeze(1), to_bits(b), to_bits(p))
-        ok += ((logits > 0).long() == to_bits(z).long()).all(dim=1).sum().item()
-    return ok / n
-@torch.no_grad()
-def chain_exact_rate(model, primes_t, device, n=20_000) -> float:
-    p = primes_t[torch.randint(len(primes_t), (n,), device=device)]
-    a = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
-    b = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
-    truth = (a * b) % p
-    bb, pb = to_bits(b), to_bits(p)
-    tb = torch.zeros(n, BITS, device=device)
-    for i in range(BITS - 1, -1, -1):
-        bit = ((a >> i) & 1).float().unsqueeze(1)
-        tb = (model(tb, bit, bb, pb) > 0).float()
-    pred = (tb.long() * (1 << torch.arange(BITS, device=device))).sum(dim=1)
-    return (pred == truth).float().mean().item()
-def chain_finetune_batch(model, primes_t, n, device, loss_fn):
-    """One end-to-end pass: STE state, per-step CE against true intermediates."""
-    p = primes_t[torch.randint(len(primes_t), (n,), device=device)]
-    a = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
-    b = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
-    bb, pb = to_bits(b), to_bits(p)
-    tb = torch.zeros(n, BITS, device=device)
-    t_true = torch.zeros_like(a)
-    loss = torch.zeros((), device=device)
-    for i in range(BITS - 1, -1, -1):
-        bit_i = (a >> i) & 1
-        t_true = (2 * t_true + bit_i * b) % p
-        logits = model(tb, bit_i.float().unsqueeze(1), bb, pb)
-        loss = loss + loss_fn(logits, to_bits(t_true))
-        hard = (logits > 0).float()
-        soft = torch.sigmoid(logits)
-        tb = hard + (soft - soft.detach())  # straight-through
-    return loss / BITS
-def main() -> int:
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--stage1-minutes", type=float, default=50.0)
-    ap.add_argument("--stage2-minutes", type=float, default=0.0)
-    ap.add_argument("--batch", type=int, default=32768)
-    ap.add_argument("--chain-batch", type=int, default=4096)
-    ap.add_argument("--lr", type=float, default=3e-4)
-    ap.add_argument("--lr2", type=float, default=5e-5)
-    ap.add_argument("--width", type=int, default=4096)
-    ap.add_argument("--depth", type=int, default=4)
-    ap.add_argument("--init", type=str, default="")
-    ap.add_argument("--out", type=str, default=str(HERE / "weights16.pt"))
-    args = ap.parse_args()
-    device = torch.device("cuda")
-    torch.manual_seed(0)
-    small = sieve_primes(256)
-    primes = [p for p in sieve_primes(1 << 16) if p >= 256]
-    g = torch.Generator().manual_seed(1)
-    perm = torch.randperm(len(primes), generator=g).tolist()
-    val_primes = torch.tensor([primes[i] for i in perm[: len(primes) // 10]], device=device)
-    train_primes = torch.tensor(
-        small + [primes[i] for i in perm[len(primes) // 10 :]], device=device
-    )
-    print(f"train primes {len(train_primes)}, val primes {len(val_primes)}")
-    model = HornerCell(args.width, args.depth).to(device)
-    if args.init:
-        ckpt = torch.load(args.init, map_location=device, weights_only=True)
-        model.load_state_dict(ckpt["state_dict"])
-        print(f"initialised from {args.init}")
-    ema = HornerCell(args.width, args.depth).to(device)
-    ema.load_state_dict(model.state_dict())
-    for q in ema.parameters():
-        q.requires_grad_(False)
-    print(f"params: {sum(t.numel() for t in model.parameters()):,}")
-    loss_fn = nn.BCEWithLogitsLoss()
-    EMA_DECAY = 0.999
-    def update_ema():
-        with torch.no_grad():
-            for q, w in zip(ema.parameters(), model.parameters()):
-                q.lerp_(w, 1 - EMA_DECAY)
-    best_chain = -1.0
-    def save_if_best(tag):
-        nonlocal best_chain
-        ch = chain_exact_rate(ema, val_primes, device)
-        if ch > best_chain:
-            best_chain = ch
-            torch.save({"state_dict": ema.state_dict(), "config": ema.config}, args.out)
-        return ch
-    # ----- Stage 1: cell training -----
-    if args.stage1_minutes > 0:
-        opt = torch.optim.AdamW(model.parameters(), lr=args.lr, weight_decay=1e-5)
-        total_steps = int(args.stage1_minutes * 60 * 16)
-        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=args.lr * 0.02)
-        deadline = time.monotonic() + args.stage1_minutes * 60
-        start = time.monotonic()
-        step = 0
-        while time.monotonic() < deadline:
-            t, bit, b, p, z = sample_batch(train_primes, args.batch, device)
-            logits = model(to_bits(t), bit.float().unsqueeze(1), to_bits(b), to_bits(p))
-            loss = loss_fn(logits, to_bits(z))
-            opt.zero_grad()
-            loss.backward()
-            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
-            opt.step()
-            if step < total_steps:
-                sched.step()
-            update_ema()
-            step += 1
-            if step % 1000 == 0:
-                va = exact_rate(ema, val_primes, device, n=100_000)
-                ch = save_if_best("s1")
-                print(
-                    f"S1 step {step:6d} | loss {loss.item():.5f} | ema cell val {va:.5f} "
-                    f"| ema CHAIN val {ch:.4f} | {time.monotonic()-start:.0f}s",
-                    flush=True,
-                )
-    # ----- Stage 2: end-to-end chain fine-tuning (STE) -----
-    if args.stage2_minutes > 0:
-        opt = torch.optim.AdamW(model.parameters(), lr=args.lr2, weight_decay=1e-5)
-        total_steps = int(args.stage2_minutes * 60 * 3)
-        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=args.lr2 * 0.1)
-        deadline = time.monotonic() + args.stage2_minutes * 60
-        start = time.monotonic()
-        step = 0
-        while time.monotonic() < deadline:
-            loss = chain_finetune_batch(model, train_primes, args.chain_batch, device, loss_fn)
-            opt.zero_grad()
-            loss.backward()
-            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
-            opt.step()
-            if step < total_steps:
-                sched.step()
-            update_ema()
-            step += 1
-            if step % 200 == 0:
-                va = exact_rate(ema, val_primes, device, n=100_000)
-                ch = save_if_best("s2")
-                print(
-                    f"S2 step {step:6d} | loss {loss.item():.5f} | ema cell val {va:.5f} "
-                    f"| ema CHAIN val {ch:.4f} | {time.monotonic()-start:.0f}s",
-                    flush=True,
-                )
-    va = exact_rate(ema, val_primes, device, n=500_000)
-    ch = chain_exact_rate(ema, val_primes, device, n=50_000)
-    print(f"FINAL ema cell val {va:.6f} | chain val {ch:.4f} | best chain {best_chain:.4f}")
-    return 0
-if __name__ == "__main__":
-    raise SystemExit(main())
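
For readers who want the gist of the "hard-boundary mining" in the deleted `sample_batch` without PyTorch, here is a scalar sketch of its hard branch. It mirrors the tensor arithmetic above (quotient `q` in `{0, 1, 2}`, offset `delta` in `[-2, 2]`), but the function name `hard_boundary_state` and the use of Python's `random` module are choices made for this illustration.

```python
import random


def hard_boundary_state(p: int, b: int, bit: int, rng: random.Random) -> int:
    """Pick a state t so that 2t + bit*b tends to land near a multiple of p."""
    q = rng.randrange(3)          # target quotient: 0, 1 or 2
    delta = rng.randint(-2, 2)    # small offset around q*p (both ends inclusive)
    t = (q * p + delta - bit * b) >> 1
    # Same clean-up as the original: clamp negatives to 0, then wrap into [0, p).
    return max(t, 0) % p


if __name__ == "__main__":
    rng = random.Random(0)
    p, b = 65521, 40000
    for _ in range(5):
        bit = rng.randrange(2)
        t = hard_boundary_state(p, b, bit, rng)
        z = 2 * t + bit * b
        # Distance from z to the nearest of the three decision points 0, p, 2p.
        print(t, min(abs(z - k * p) for k in range(3)))
```

When the clamp and the wrap do not trigger, `2t + bit*b` equals `q*p + delta` or `q*p + delta - 1`, so the sample sits right at the point where the step must choose between subtracting `0`, `p`, or `2p`. With the default `hard_frac=0.5`, half of each training batch is drawn this way.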