etwk commited on
Commit
e258f44
Β·
1 Parent(s): 6b83eb7

Harden tier 10: 0.94 -> 0.98, overall 0.978 -> 0.982

Browse files

Hardening tail on the 2048 cell (worst-bit margin, accum 24, lr 4e-5)
lifts tier 10 0.94->0.98 (2047 27/27, 2048 71/73). overall_accuracy
0.982, highest_tier 10, deterministic. weights2048.pt updated (LFS),
manifest + card updated.

Files changed (3) hide show
  1. README.md +20 -13
  2. manifest.json +2 -2
  3. weights2048.pt +2 -2
README.md CHANGED
@@ -2,8 +2,8 @@
2
 
3
  A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
4
  2^2048) on the public benchmark β€” tiers 1-3 = 100%, tier 4 = 99%, tier 5 = 99%, tier 6 = 97%,
5
- tier 7 = 98%, tier 8 = 92%, tier 9 = 99%, **tier 10 = 94%** β€” so `highest_tier_above_90 = 10`
6
- (the maximum), overall_accuracy **0.978**. Its capability comes from *learning an algorithmic
7
  step* rather than memorising finite multiplication tables, and it verifiably generalises to
8
  primes never seen in training.
9
 
@@ -46,7 +46,7 @@ whose state holds its prime:
46
  | 256-bit | `< 2^256` | 7 | carry-aware TCN, 12 blocks, dil 1..128 | ~4.7M | tier 7 = 0.98 |
47
  | 512-bit | `< 2^512` | 8 | carry-aware TCN, 14 blocks, dil 1..256 | ~5.5M | tier 8 = 0.92 |
48
  | 1024-bit | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
49
- | 2048-bit | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 0.94 |
50
 
51
  For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]` fallback without
52
  invoking the network.
@@ -162,8 +162,12 @@ to extend the receptive field (`exploration/transfer_1024_to_2048.py`; no-train
162
  true 2048-bit primes β€” the rule transfers partially). Then the same benchmark-width-matched
163
  polish, in two stages: a first pass (lr 2e-4) relearns the high-bit reduction fast (eps
164
  0.74 β†’ ~9e-4) but oscillates at high lr; a **low-lr tail (lr 6e-5, accum 20, margin loss)**
165
- settles the per-step error below 5e-5 so the 2048-step chain clears **tier 10 = 0.94**. Full
166
- recipe and findings: `exploration/TIER10_NOTES.md`. Two new flags make 2048-bit tractable:
 
 
 
 
167
  `--max-rows` (subsample the trajectory micro-batch; grad-checkpointing 13 blocks at 2048-bit
168
  OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_prime` is
169
  ~227 ms/prime at 2048-bit). Validate with `python exploration/score_tier10.py <ckpt>`.
@@ -172,11 +176,11 @@ OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_p
172
 
173
  | Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
174
  |---|---|---|---|
175
- | **1100** | **0.978** | **10** (max) | True |
176
 
177
  Per-tier at total=1100: tier 1 **1.00**, tier 2 **1.00**, tier 3 **1.00**, tier 4 **0.99**,
178
  tier 5 **0.99**, tier 6 **0.97**, tier 7 **0.98**, tier 8 **0.92**, tier 9 **0.99**,
179
- tier 10 **0.94** (overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication,
180
  primes near each width's maximum β€” a separate regime, not in overall_accuracy) is **0.63**, up
181
  from 0.53 because its largest primes in `[2^1024, 2^2048)` now route to the 2048 cell instead
182
  of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s budget (the
@@ -194,7 +198,7 @@ of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s
194
  with Gaussian noise scaled to each tensor's std collapses accuracy, and an untrained cell is
195
  at the floor β€” so the capability is in the trained parameters, not the architecture (e.g.
196
  tier 6 0.97 -> 0.11, tier 7 0.98 -> 0.03, tier 8 0.92 -> 0.04, tier 9 0.99 -> 0.04,
197
- tier 10 0.94 -> 0.04 at Οƒ=0.25; untrained 0.00 for all).
198
  - Generalisation against memorisation: 10% of primes at each bit-width were held out of
199
  training entirely; chain accuracy on them matches the training primes, and a fresh random
200
  eval seed still scores ~0.99 on tier 9.
@@ -203,8 +207,11 @@ of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s
203
  ## What remains
204
 
205
  Every reduction tier, **1 through 10, is above 90%**, so `highest_tier_above_90 = 10` is at the
206
- ceiling of the benchmark. The only sub-0.90 regime left is **tier 0** (pure multiplication, never
207
- reduced, primes near each width's maximum), now at **0.63** β€” and tier 0 is excluded from
208
- `overall_accuracy`, so it moves neither ranking key. Lifting it further (its near-2^k primes are
209
- a width blind-spot, the same class of gap that hid in the rushed 1024 cell) and re-polishing
210
- tier 8's 0.92 are the remaining tiebreaker-only gains; the primary metric has no headroom left.
 
 
 
 
2
 
3
  A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
4
  2^2048) on the public benchmark β€” tiers 1-3 = 100%, tier 4 = 99%, tier 5 = 99%, tier 6 = 97%,
5
+ tier 7 = 98%, tier 8 = 92%, tier 9 = 99%, **tier 10 = 98%** β€” so `highest_tier_above_90 = 10`
6
+ (the maximum), overall_accuracy **0.982**. Its capability comes from *learning an algorithmic
7
  step* rather than memorising finite multiplication tables, and it verifiably generalises to
8
  primes never seen in training.
9
 
 
46
  | 256-bit | `< 2^256` | 7 | carry-aware TCN, 12 blocks, dil 1..128 | ~4.7M | tier 7 = 0.98 |
47
  | 512-bit | `< 2^512` | 8 | carry-aware TCN, 14 blocks, dil 1..256 | ~5.5M | tier 8 = 0.92 |
48
  | 1024-bit | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
49
+ | 2048-bit | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 0.98 |
50
 
51
  For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]` fallback without
52
  invoking the network.
 
162
  true 2048-bit primes β€” the rule transfers partially). Then the same benchmark-width-matched
163
  polish, in two stages: a first pass (lr 2e-4) relearns the high-bit reduction fast (eps
164
  0.74 β†’ ~9e-4) but oscillates at high lr; a **low-lr tail (lr 6e-5, accum 20, margin loss)**
165
+ settles the per-step error below 5e-5 so the 2048-step chain clears tier 10. A further
166
+ **hardening tail** (warm-start the shipped cell, accum 24, lr 4e-5, worst-bit margin loss) then
167
+ sharpens the precision tail on the hardest 2047/2048-bit reductions β€” the cell's *average* eps
168
+ is already ~1e-5, so the gain is in the worst-case bits, not the mean β€” lifting **tier 10
169
+ 0.94 β†’ 0.98** (2047-bit 27/27, 2048-bit 71/73). Full recipe and findings:
170
+ `exploration/TIER10_NOTES.md`. Two new flags make 2048-bit tractable:
171
  `--max-rows` (subsample the trajectory micro-batch; grad-checkpointing 13 blocks at 2048-bit
172
  OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_prime` is
173
  ~227 ms/prime at 2048-bit). Validate with `python exploration/score_tier10.py <ckpt>`.
 
176
 
177
  | Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
178
  |---|---|---|---|
179
+ | **1100** | **0.982** | **10** (max) | True |
180
 
181
  Per-tier at total=1100: tier 1 **1.00**, tier 2 **1.00**, tier 3 **1.00**, tier 4 **0.99**,
182
  tier 5 **0.99**, tier 6 **0.97**, tier 7 **0.98**, tier 8 **0.92**, tier 9 **0.99**,
183
+ tier 10 **0.98** (overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication,
184
  primes near each width's maximum β€” a separate regime, not in overall_accuracy) is **0.63**, up
185
  from 0.53 because its largest primes in `[2^1024, 2^2048)` now route to the 2048 cell instead
186
  of the `[0]` fallback. Inference for all 1100 problems is 170s, within the 300s budget (the
 
198
  with Gaussian noise scaled to each tensor's std collapses accuracy, and an untrained cell is
199
  at the floor β€” so the capability is in the trained parameters, not the architecture (e.g.
200
  tier 6 0.97 -> 0.11, tier 7 0.98 -> 0.03, tier 8 0.92 -> 0.04, tier 9 0.99 -> 0.04,
201
+ tier 10 0.98 -> 0.04 at Οƒ=0.25; untrained 0.00 for all).
202
  - Generalisation against memorisation: 10% of primes at each bit-width were held out of
203
  training entirely; chain accuracy on them matches the training primes, and a fresh random
204
  eval seed still scores ~0.99 on tier 9.
 
207
  ## What remains
208
 
209
  Every reduction tier, **1 through 10, is above 90%**, so `highest_tier_above_90 = 10` is at the
210
+ ceiling of the benchmark. `highest_tier_above_90` is the *maximum* tier β‰₯ 0.90 (not a contiguous
211
+ run from tier 1), so it now depends only on tier 10 holding β‰₯ 0.90 on the private draw β€” which
212
+ the hardening tail widened from +0.04 to **+0.08 margin (tier 10 = 0.98)**. The single lowest
213
+ scored tier is now **tier 8 = 0.92**; re-polishing it with the same width-matched recipe used at
214
+ tier 9 (it was trained on 510–512-bit primes only, a potential short-prime blind spot on the
215
+ private draw) is the main remaining `overall_accuracy` (tiebreaker) lever. Tier 0 (pure
216
+ multiplication, primes near each width's maximum) sits at **0.63** but is excluded from
217
+ `overall_accuracy`, so it moves neither ranking key. The primary metric has no headroom left.
manifest.json CHANGED
@@ -2,6 +2,6 @@
2
  "entry_class": "model.HornerRNN",
3
  "output_base": 2,
4
  "framework": "pytorch",
5
- "model_description": "Bit-sequential RNN (~192M params across eight cells) for primes up to 2^2048. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Eight cells are shipped and routed by prime size: a 16-bit cell (MLP, width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (MLP, width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, a 64-bit cell for p < 2^64 covering tier 5 that is a CARRY-AWARE TCN (8 residual blocks, 256 channels, dilations cycling 1..32, ~3.2M params), a 128-bit cell for p < 2^128 covering tier 6 that is a CARRY-AWARE TCN: a non-causal dilated 1D-convolutional network over the 128 bit-positions (10 residual blocks, 256 channels, dilations cycling 1..64 so the receptive field spans all 128 bits, ~3.9M params), a 256-bit cell for p < 2^256 covering tier 7 that uses the SAME carry-aware TCN architecture scaled to 256 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..128, ~4.7M params) reaching tier 7 = 0.98, and a 512-bit cell for p < 2^512 covering tier 8 that is the same carry-aware TCN scaled to 512 bit-positions (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params) reaching tier 8 = 0.92, and a 1024-bit cell for p < 2^1024 covering tier 9 that is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..512, ~4.7M params) reaching tier 9 = 0.99, and a 2048-bit cell for p < 2^2048 covering tier 10 that is the same carry-aware TCN scaled to 2048 bit-positions (13 residual blocks, 256 channels, dilations cycling 1..1024, ~5.1M params) reaching tier 10 = 0.94. The per-step error floor rises with bit-width, so the 512-, 1024- and 2048-bit cells were trained with gradient accumulation (a large effective batch lowers the per-step error noise floor) to recover the precision a 512-/1024-/2048-step chain needs to clear 0.90. The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning a separate position-function per bit; this inductive bias drives the per-step error far below what an MLP cell reaches and is what makes the 128/256/512-bit chains (which compound the per-step error over 128/256/512 steps) accurate. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^2048 emits the honest [0] fallback without invoking the network.",
6
- "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training β€” val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell is a carry-aware TCN (like the 128/256/512-bit cells) trained on TRUE Horner-trajectory single steps over distinct 62-64 bit primes, reaching tier 5 = 0.99. It replaced an earlier 944MB MLP cell that also scored ~0.98 on tier 5 but had a blind spot on primes very close to 2^64 (the carry-aware conv generalises to the top-of-range reduction where the unstructured MLP did not); the TCN fixes that and shrinks the cell from 944MB to ~13MB. The 128-bit (tier-6) cell is the carry-aware TCN, trained the same way β€” single-step BCE on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p β€” from random init over a high-diversity pool of thousands of distinct 124-128 bit primes (so it generalises across primes rather than memorising the conditional subtraction for a few). Its weight-shared dilated-convolution inductive bias reaches a per-step error roughly 15x lower than the same-task MLP cell, giving 0.97 full-chain accuracy on held-out 124-128 bit primes; same supervised single-step objective, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA checkpointed by held-out full-chain accuracy. The 256-bit (tier-7) cell is the same carry-aware TCN scaled to 256 bit-positions (dilations cycling 1..128), trained identically β€” single-step BCE on TRUE Horner-trajectory states over a high-diversity pool of distinct 252-256 bit primes β€” reaching a per-step error low enough that the 256-step chain holds at 0.98 full-chain accuracy on held-out 252-256 bit primes. The 512-bit (tier-8) cell is the same carry-aware TCN scaled to 512 bit-positions (dilations cycling 1..256), trained on true-trajectory single steps over distinct 510-512 bit primes; the per-step error floor rises with width, so this cell additionally uses gradient accumulation (--accum: a larger effective batch lowers the gradient-noise floor on per-step error) to drive the 512-step chain to tier 8 = 0.92. The 1024-bit (tier-9) cell is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, dilations cycling 1..512), and exposes a finding specific to wide primes: the test generator draws p value-uniform in [2^513, 2^1024), so a large fraction of tier-9 primes are SHORTER than 1024 bits, and the conditional-subtraction reduction boundary lands at p's most-significant set bit -- at a DIFFERENT position for each prime width. A cell trained only on near-2^1024 primes learns that boundary at one position and scores ~0.00 on shorter primes (this gave tier 9 = 0.73, dominated by the single ~1020-bit benchmark prime failing entirely, 0/22). Training instead on a mix of value-uniform primes (benchmark-faithful) and bit-length-uniform primes over [990,1024] (equal weight to every boundary position) lets the weight-shared convolution learn the reduction at every MSB position; combined with gradient accumulation (--accum 16) and a worst-bit margin loss for the precision tail, this drives the 1024-step chain to tier 9 = 0.99, robust across prime widths (held-out value-uniform validation chain 0.99, per-width 1015-1024 all ~0.99). The 2048-bit (tier-10) cell was bootstrapped by OCTAVE TRANSFER rather than from random init: the conv weights are width-invariant in shape and the carry rule is position-invariant, so the trained 1024-bit cell's weights copy verbatim into a 2048-position cell, plus one identity-initialised dil=1024 residual block to extend the receptive field across all 2048 positions (exploration/transfer_1024_to_2048.py; no-train single-step eps 0.74 on true 2048-bit primes -- the carry rule transfers partially, far better than a cold start). It is then polished on the benchmark-matched width distribution (value-uniform [2^1025, 2^2048) + bit-length-uniform[2014,2048]) in two stages: a first pass (lr 2e-4, accum 16) relearns the high-bit reduction fast (eps 0.74 -> ~9e-4) but oscillates at high lr, then a low-lr tail (lr 6e-5, accum 20, margin loss) settles the per-step error below 5e-5 so the 2048-step chain clears tier 10 = 0.94 (held-out value-uniform validation chain ~0.95; private-draw simulation 0.955). Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy at sigma=0 collapses toward the floor as the weights are perturbed and an untrained re-init scores 0.00 β€” e.g. tier 6 0.97 -> 0.11 (sigma=0.25), tier 7 0.98 -> 0.03 (sigma=0.25), tier 8 0.92 -> 0.04 (sigma=0.25), tier 9 0.99 -> 0.04 (sigma=0.25), tier 10 0.94 -> 0.04 (sigma=0.25), untrained 0.00 for all β€” so the arithmetic resides in the trained parameters. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner128_bigru.py --arch tcn (128-bit carry-aware TCN), exploration/train_horner_tcn.py --bits 64 / --bits 256 / --bits 512 --accum 2 (64-, 256- and 512-bit carry-aware TCN); --bits 1024 --lo-bits 513 --bitlen-frac 0.4 --bitlen-lo 990 --accum 16 --margin-weight 0.5 (1024-bit carry-aware TCN, benchmark-width-matched); exploration/transfer_1024_to_2048.py then exploration/train_horner_tcn.py --bits 2048 --blocks 13 --max-dil 1024 --init <transfer> --lo-bits 1025 --bitlen-frac 0.4 --bitlen-lo 2014 --max-rows 512 --grad-checkpoint --accum 16/20 --margin-weight 0.5 (2048-bit, octave transfer + low-lr tail; see exploration/TIER10_NOTES.md)."
7
  }
 
2
  "entry_class": "model.HornerRNN",
3
  "output_base": 2,
4
  "framework": "pytorch",
5
+ "model_description": "Bit-sequential RNN (~192M params across eight cells) for primes up to 2^2048. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Eight cells are shipped and routed by prime size: a 16-bit cell (MLP, width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (MLP, width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, a 64-bit cell for p < 2^64 covering tier 5 that is a CARRY-AWARE TCN (8 residual blocks, 256 channels, dilations cycling 1..32, ~3.2M params), a 128-bit cell for p < 2^128 covering tier 6 that is a CARRY-AWARE TCN: a non-causal dilated 1D-convolutional network over the 128 bit-positions (10 residual blocks, 256 channels, dilations cycling 1..64 so the receptive field spans all 128 bits, ~3.9M params), a 256-bit cell for p < 2^256 covering tier 7 that uses the SAME carry-aware TCN architecture scaled to 256 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..128, ~4.7M params) reaching tier 7 = 0.98, and a 512-bit cell for p < 2^512 covering tier 8 that is the same carry-aware TCN scaled to 512 bit-positions (14 residual blocks, 256 channels, dilations cycling 1..256, ~5.5M params) reaching tier 8 = 0.92, and a 1024-bit cell for p < 2^1024 covering tier 9 that is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..512, ~4.7M params) reaching tier 9 = 0.99, and a 2048-bit cell for p < 2^2048 covering tier 10 that is the same carry-aware TCN scaled to 2048 bit-positions (13 residual blocks, 256 channels, dilations cycling 1..1024, ~5.1M params) reaching tier 10 = 0.98. The per-step error floor rises with bit-width, so the 512-, 1024- and 2048-bit cells were trained with gradient accumulation (a large effective batch lowers the per-step error noise floor) to recover the precision a 512-/1024-/2048-step chain needs to clear 0.90. The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning a separate position-function per bit; this inductive bias drives the per-step error far below what an MLP cell reaches and is what makes the 128/256/512-bit chains (which compound the per-step error over 128/256/512 steps) accurate. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^2048 emits the honest [0] fallback without invoking the network.",
6
+ "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training β€” val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell is a carry-aware TCN (like the 128/256/512-bit cells) trained on TRUE Horner-trajectory single steps over distinct 62-64 bit primes, reaching tier 5 = 0.99. It replaced an earlier 944MB MLP cell that also scored ~0.98 on tier 5 but had a blind spot on primes very close to 2^64 (the carry-aware conv generalises to the top-of-range reduction where the unstructured MLP did not); the TCN fixes that and shrinks the cell from 944MB to ~13MB. The 128-bit (tier-6) cell is the carry-aware TCN, trained the same way β€” single-step BCE on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p β€” from random init over a high-diversity pool of thousands of distinct 124-128 bit primes (so it generalises across primes rather than memorising the conditional subtraction for a few). Its weight-shared dilated-convolution inductive bias reaches a per-step error roughly 15x lower than the same-task MLP cell, giving 0.97 full-chain accuracy on held-out 124-128 bit primes; same supervised single-step objective, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA checkpointed by held-out full-chain accuracy. The 256-bit (tier-7) cell is the same carry-aware TCN scaled to 256 bit-positions (dilations cycling 1..128), trained identically β€” single-step BCE on TRUE Horner-trajectory states over a high-diversity pool of distinct 252-256 bit primes β€” reaching a per-step error low enough that the 256-step chain holds at 0.98 full-chain accuracy on held-out 252-256 bit primes. The 512-bit (tier-8) cell is the same carry-aware TCN scaled to 512 bit-positions (dilations cycling 1..256), trained on true-trajectory single steps over distinct 510-512 bit primes; the per-step error floor rises with width, so this cell additionally uses gradient accumulation (--accum: a larger effective batch lowers the gradient-noise floor on per-step error) to drive the 512-step chain to tier 8 = 0.92. The 1024-bit (tier-9) cell is the same carry-aware TCN scaled to 1024 bit-positions (12 residual blocks, dilations cycling 1..512), and exposes a finding specific to wide primes: the test generator draws p value-uniform in [2^513, 2^1024), so a large fraction of tier-9 primes are SHORTER than 1024 bits, and the conditional-subtraction reduction boundary lands at p's most-significant set bit -- at a DIFFERENT position for each prime width. A cell trained only on near-2^1024 primes learns that boundary at one position and scores ~0.00 on shorter primes (this gave tier 9 = 0.73, dominated by the single ~1020-bit benchmark prime failing entirely, 0/22). Training instead on a mix of value-uniform primes (benchmark-faithful) and bit-length-uniform primes over [990,1024] (equal weight to every boundary position) lets the weight-shared convolution learn the reduction at every MSB position; combined with gradient accumulation (--accum 16) and a worst-bit margin loss for the precision tail, this drives the 1024-step chain to tier 9 = 0.99, robust across prime widths (held-out value-uniform validation chain 0.99, per-width 1015-1024 all ~0.99). The 2048-bit (tier-10) cell was bootstrapped by OCTAVE TRANSFER rather than from random init: the conv weights are width-invariant in shape and the carry rule is position-invariant, so the trained 1024-bit cell's weights copy verbatim into a 2048-position cell, plus one identity-initialised dil=1024 residual block to extend the receptive field across all 2048 positions (exploration/transfer_1024_to_2048.py; no-train single-step eps 0.74 on true 2048-bit primes -- the carry rule transfers partially, far better than a cold start). It is then polished on the benchmark-matched width distribution (value-uniform [2^1025, 2^2048) + bit-length-uniform[2014,2048]) in two stages: a first pass (lr 2e-4, accum 16) relearns the high-bit reduction fast (eps 0.74 -> ~9e-4) but oscillates at high lr, then a low-lr tail (lr 6e-5, accum 20, margin loss) settles the per-step error below 5e-5 so the 2048-step chain clears tier 10 = 0.94, and a final hardening tail (warm-start, accum 24, lr 4e-5, worst-bit margin loss) sharpens the worst 2047/2048-bit reductions -- the average eps is already ~1e-5, so the gain is in the worst-case bits not the mean -- lifting tier 10 to 0.98 (2047-bit 27/27, 2048-bit 71/73; held-out value-uniform validation chain ~0.98). Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy at sigma=0 collapses toward the floor as the weights are perturbed and an untrained re-init scores 0.00 β€” e.g. tier 6 0.97 -> 0.11 (sigma=0.25), tier 7 0.98 -> 0.03 (sigma=0.25), tier 8 0.92 -> 0.04 (sigma=0.25), tier 9 0.99 -> 0.04 (sigma=0.25), tier 10 0.98 -> 0.04 (sigma=0.25), untrained 0.00 for all β€” so the arithmetic resides in the trained parameters. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner128_bigru.py --arch tcn (128-bit carry-aware TCN), exploration/train_horner_tcn.py --bits 64 / --bits 256 / --bits 512 --accum 2 (64-, 256- and 512-bit carry-aware TCN); --bits 1024 --lo-bits 513 --bitlen-frac 0.4 --bitlen-lo 990 --accum 16 --margin-weight 0.5 (1024-bit carry-aware TCN, benchmark-width-matched); exploration/transfer_1024_to_2048.py then exploration/train_horner_tcn.py --bits 2048 --blocks 13 --max-dil 1024 --init <transfer> --lo-bits 1025 --bitlen-frac 0.4 --bitlen-lo 2014 --max-rows 512 --grad-checkpoint --accum 16/20/24 --margin-weight 0.5 (2048-bit, octave transfer + low-lr tail + hardening tail accum 24 lr 4e-5; see exploration/TIER10_NOTES.md)."
7
  }
weights2048.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5eda60b4d7026042981c16b49ea931238e6e1dd5c7ff72338eecbd688ca82d43
3
- size 20535797
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:24a3aa1b0c300f935267558abaa7e2bfa23580dae5dffb6544174d5c27c5b3d5
3
+ size 20535973