File size: 8,961 Bytes
93db0e9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f704813
93db0e9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
# Evaluation reference

This document records **how `horner_rnn` is evaluated, how to reproduce the score, and how the
result behaves across different evaluation seeds and prime ranges** — i.e. how far the public
`1.000` generalises. It complements `README.md` (which documents how the weights were obtained).

All numbers here are reproducible from this repo plus the official challenge harness
(`modchallenge`); the per-tier sampling facts are read directly from the harness source cited
inline.

---

## 1. What the harness scores (the eight gates)

`modchallenge evaluate` runs a fixed pipeline
(`src/modchallenge/evaluation/pipeline.py:evaluate_local`). Every gate below must pass; the two
ranking keys are produced only at the end.

| # | Gate | This submission | Bound / spec |
|---|---|---|---|
| 1 | **Manifest validation** | `entry_class=model.HornerRNN`, `output_base=2` | well-formed `manifest.json` |
| 2 | **Artifact size** | **0.04 GB** | ≤ 20 GB (`EvalConfig.max_artifact_bytes`) |
| 3 | **Static analysis / compliance** (`security.static_check`) | 0 findings → *passed* | no hand-coded arithmetic on `p` |
| 4 | **Test-set generation** | 1100 problems = 100 × 11 tiers (0–10) | `total_problems` |
| 5 | **Model load** | ~2 s | must import + load |
| 6 | **Preprocess isolation** (`check_preprocess_isolation`) | passes — hooks are stateless identities | per-argument, no cross-leak |
| 7 | **Determinism** (`check_determinism`, 10 end-to-end re-runs) | `deterministic: true` | required to be ranked |
| 8 | **Inference within budget** | 173.6 s, all 11 tiers completed | ≤ 300 s wall (`timeout_seconds`) |

A tier that does not *finish* within the 300 s budget is scored **0** for that tier
(`run_inference` discards partial tiers) — so latency is a correctness gate, not just a
performance note (see §6).

**Ranking keys** (`evaluation/results.py`):
- `highest_tier_above_90` — the **maximum** scored tier (id > 0) with accuracy ≥ 0.90. Not a
  contiguous run; it depends only on the single highest tier clearing 0.90.
- `overall_accuracy` — mean accuracy over **completed scored tiers 1–10**. Tier 0 is excluded
  from both keys.

---

## 2. How each tier samples its range

Private evaluation uses `EvalConfig` (`config.py`), which draws **5 distinct primes per tier**
(`primes_per_tier = 5`) and **4 edge cases** (`a=0, b=0, a=1, b=1`). The public benchmark uses
the same structure with a fixed seed. So each tier's 100 problems are:

> 4 edge cases  +  96 problems spread over **5 distinct primes** (≈ 19 operand-pairs/prime).

A consequence worth stating plainly: **one weak prime ≈ 20 % of a tier.** This is why
robustness has to be measured by *resampling the 5 primes across seeds*, not by reading a single
seed (§5).

| Tier | Prime range `[2^min, 2^max)` | Operand range `a,b ∈ [0, 2^k)` |
|---|---|---|
| 1 | fixed primes {2,3,5,7} | 2³² |
| 2 | 2⁴ … 2⁸ | 2⁴⁸ |
| 3 | 2⁹ … 2¹⁶ | 2⁶⁴ |
| 4 | 2¹⁷ … 2³² | 2⁹⁶ |
| 5 | 2³³ … 2⁶⁴ | 2¹²⁸ |
| 6 | 2⁶⁵ … 2¹²⁸ | 2²⁵⁶ |
| 7 | 2¹²⁹ … 2²⁵⁶ | 2⁵¹² |
| 8 | 2²⁵⁷ … 2⁵¹² | 2¹⁰²⁴ |
| 9 | 2⁵¹³ … 2¹⁰²⁴ | 2²⁰⁴⁸ |
| 10 | 2¹⁰²⁵ … 2²⁰⁴⁸ | 2⁴⁰⁹⁶ |

Primes are drawn **value-uniform** (`randrange(2^min, 2^max)` then `nextprime`), which
concentrates mass at the top of each tier's bit-range. The weights are trained to match that
distribution (see README, "Width-robustness audit").

Tier 0 is a separate **pure-multiplication** diagnostic (`p` chosen so `a·b < p`, i.e. no
reduction); it is **excluded from both ranking keys** and so does not affect the score.

---

## 3. Reproducing the deterministic public score

The public benchmark seed is the hex of `b'modchallenge-public-benchmark-v1'`. The CLI parses
`--seed` as `bytes.fromhex(...)`, and an **empty `--seed` means a random draw** — so the explicit
seed is required for the reproducible number.

```bash
PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"
```

### Full public-seed result

```
overall_accuracy      = 1.0
highest_tier_above_90 = 10        (the maximum tier)
deterministic         = true
artifact size         = 0.04 GB
```

| Tier | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | 0.70\* | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

\* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.

Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at **173.6 s** —
the 2048-step tier-10 scan is essentially the entire cost.

---

## 4. What is and isn't guaranteed

- **No formal guarantee of exact arithmetic.** The cell is a *learned* approximation of the
  Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
- **Generalisation is structural, not memorised.** One shared cell runs the same
  width-independent reduction circuit at every value, so a different prime/operand in the same
  tier is the same operation on different numbers — not out-of-distribution. Held-out-prime
  validation tracks training accuracy (no memorisation gap).
- **The ranked outcome is robust** (measured in §5): `highest_tier_above_90 = 10` holds with very
  high probability across seeds; `overall_accuracy` stays ≥ 0.997. What is *not* guaranteed is the
  cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.

---

## 5. Seed / range robustness (the generalisation evidence)

The public `1.000` is **one draw**. To test generalisation, the scoring harness was run on five
**different** secret seeds (each draws a fresh set of 5 primes/tier + operands across every
range) — faithful private-eval simulations, since the private eval also uses `primes_per_tier = 5`.

| Seed (hex) | t1–t7 | t8 | t9 | t10 | **overall** | **htop** | det |
|---|---|---|---|---|---|---|---|
| `…public…` | 1.00 | 1.00 | 1.00 | 1.00 | **1.0000** | **10** | ✓ |
| `1111…` | 1.00 | 0.99 | 1.00 | 0.98 | 0.9970 | **10** | ✓ |
| `2222…` | 1.00 | 0.99 | 1.00 | 1.00 | 0.9990 | **10** | ✓ |
| `deadbeef…` | 1.00 | 0.97 | 1.00 | 1.00 | 0.9970 | **10** | ✓ |
| `cafef00d…` | 1.00 | 1.00 | 0.99 | 0.99 | 0.9980 | **10** | ✓ |
| `a5a5…` | 1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | **10** | ✓ |

Reproduce any row with `modchallenge evaluate horner_rnn --total 1100 --seed <hex>`.

**Reading of the evidence:**
- **Primary key invariant:** `highest_tier_above_90 = 10` on 6/6 seeds. The worst *any* scored
  tier reached was **0.97** — never near the 0.90 threshold.
- **Secondary key in a tight band:** overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private
  seed will most likely read ~0.997–0.999, not a literal 1.000.
- **All variation is confined to tiers 8–10** (257–2048-bit primes). Tiers 1–7 are perfectly
  stable across every seed.

This matches the larger faithful 5-prime bootstrap on the shipped weights
(`diag_5prime_boot.py` in the research repo): `P(tier < 0.90) ≈ 0.000 %` for tiers 1–9 and
≈ 0.002 % for tier 10; `E[tier10] ≈ 0.991`, worst observed near-max tier-10 prime ≈ 0.875. A
40k-draw width sweep (`audit_width_robustness.py`, research repo) finds **no accuracy "knee"** anywhere in the
samplable range — the residual misses are rare per-`(a,b)` reduction-boundary events scattered
≈ uniformly, in the deep tail only.

---

## 6. Timing under the official clock

The 173.6 s above is **GPU** timing (batched `predict_digits_batch`). The budget is **300 s total**
for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is *not*
about correctness: if the official runner is **CPU-only**, the tier-10 scan can exceed the budget
and time out — which would zero the timed-out tiers and drop the primary key. Confirm the
runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget
before relying on the GPU timing. The *correctness* result (§3, §5) is independent of this.

---

## 7. Compliance, in one line each

(Full argument in `README.md` → "Compliance split" / "Status under the rules".)

- Preprocess hooks are pass-through identities — no cross-argument leakage (gate 6).
- `predict_digits` reduces only `a % p`, `b % p` (two-operand normalisation, allowed) and never
  forms the three-argument modular product directly.
- No add/multiply/compare-against-`p` is hand-coded; the forward pass is tokenise → learned cell
  → quantise → readout.
- **Principle 2, measured:** perturbing trained weights collapses accuracy to the untrained
  floor (`exploration/compliance_perturb.py`) — the arithmetic lives in the parameters.
- Passes `modchallenge check`; deterministic.