File size: 8,961 Bytes
93db0e9 f704813 93db0e9 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 | # Evaluation reference
This document records **how `horner_rnn` is evaluated, how to reproduce the score, and how the
result behaves across different evaluation seeds and prime ranges** — i.e. how far the public
`1.000` generalises. It complements `README.md` (which documents how the weights were obtained).
All numbers here are reproducible from this repo plus the official challenge harness
(`modchallenge`); the per-tier sampling facts are read directly from the harness source cited
inline.
---
## 1. What the harness scores (the eight gates)
`modchallenge evaluate` runs a fixed pipeline
(`src/modchallenge/evaluation/pipeline.py:evaluate_local`). Every gate below must pass; the two
ranking keys are produced only at the end.
| # | Gate | This submission | Bound / spec |
|---|---|---|---|
| 1 | **Manifest validation** | `entry_class=model.HornerRNN`, `output_base=2` | well-formed `manifest.json` |
| 2 | **Artifact size** | **0.04 GB** | ≤ 20 GB (`EvalConfig.max_artifact_bytes`) |
| 3 | **Static analysis / compliance** (`security.static_check`) | 0 findings → *passed* | no hand-coded arithmetic on `p` |
| 4 | **Test-set generation** | 1100 problems = 100 × 11 tiers (0–10) | `total_problems` |
| 5 | **Model load** | ~2 s | must import + load |
| 6 | **Preprocess isolation** (`check_preprocess_isolation`) | passes — hooks are stateless identities | per-argument, no cross-leak |
| 7 | **Determinism** (`check_determinism`, 10 end-to-end re-runs) | `deterministic: true` | required to be ranked |
| 8 | **Inference within budget** | 173.6 s, all 11 tiers completed | ≤ 300 s wall (`timeout_seconds`) |
A tier that does not *finish* within the 300 s budget is scored **0** for that tier
(`run_inference` discards partial tiers) — so latency is a correctness gate, not just a
performance note (see §6).
**Ranking keys** (`evaluation/results.py`):
- `highest_tier_above_90` — the **maximum** scored tier (id > 0) with accuracy ≥ 0.90. Not a
contiguous run; it depends only on the single highest tier clearing 0.90.
- `overall_accuracy` — mean accuracy over **completed scored tiers 1–10**. Tier 0 is excluded
from both keys.
---
## 2. How each tier samples its range
Private evaluation uses `EvalConfig` (`config.py`), which draws **5 distinct primes per tier**
(`primes_per_tier = 5`) and **4 edge cases** (`a=0, b=0, a=1, b=1`). The public benchmark uses
the same structure with a fixed seed. So each tier's 100 problems are:
> 4 edge cases + 96 problems spread over **5 distinct primes** (≈ 19 operand-pairs/prime).
A consequence worth stating plainly: **one weak prime ≈ 20 % of a tier.** This is why
robustness has to be measured by *resampling the 5 primes across seeds*, not by reading a single
seed (§5).
| Tier | Prime range `[2^min, 2^max)` | Operand range `a,b ∈ [0, 2^k)` |
|---|---|---|
| 1 | fixed primes {2,3,5,7} | 2³² |
| 2 | 2⁴ … 2⁸ | 2⁴⁸ |
| 3 | 2⁹ … 2¹⁶ | 2⁶⁴ |
| 4 | 2¹⁷ … 2³² | 2⁹⁶ |
| 5 | 2³³ … 2⁶⁴ | 2¹²⁸ |
| 6 | 2⁶⁵ … 2¹²⁸ | 2²⁵⁶ |
| 7 | 2¹²⁹ … 2²⁵⁶ | 2⁵¹² |
| 8 | 2²⁵⁷ … 2⁵¹² | 2¹⁰²⁴ |
| 9 | 2⁵¹³ … 2¹⁰²⁴ | 2²⁰⁴⁸ |
| 10 | 2¹⁰²⁵ … 2²⁰⁴⁸ | 2⁴⁰⁹⁶ |
Primes are drawn **value-uniform** (`randrange(2^min, 2^max)` then `nextprime`), which
concentrates mass at the top of each tier's bit-range. The weights are trained to match that
distribution (see README, "Width-robustness audit").
Tier 0 is a separate **pure-multiplication** diagnostic (`p` chosen so `a·b < p`, i.e. no
reduction); it is **excluded from both ranking keys** and so does not affect the score.
---
## 3. Reproducing the deterministic public score
The public benchmark seed is the hex of `b'modchallenge-public-benchmark-v1'`. The CLI parses
`--seed` as `bytes.fromhex(...)`, and an **empty `--seed` means a random draw** — so the explicit
seed is required for the reproducible number.
```bash
PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"
```
### Full public-seed result
```
overall_accuracy = 1.0
highest_tier_above_90 = 10 (the maximum tier)
deterministic = true
artifact size = 0.04 GB
```
| Tier | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | 0.70\* | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
\* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.
Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at **173.6 s** —
the 2048-step tier-10 scan is essentially the entire cost.
---
## 4. What is and isn't guaranteed
- **No formal guarantee of exact arithmetic.** The cell is a *learned* approximation of the
Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
- **Generalisation is structural, not memorised.** One shared cell runs the same
width-independent reduction circuit at every value, so a different prime/operand in the same
tier is the same operation on different numbers — not out-of-distribution. Held-out-prime
validation tracks training accuracy (no memorisation gap).
- **The ranked outcome is robust** (measured in §5): `highest_tier_above_90 = 10` holds with very
high probability across seeds; `overall_accuracy` stays ≥ 0.997. What is *not* guaranteed is the
cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.
---
## 5. Seed / range robustness (the generalisation evidence)
The public `1.000` is **one draw**. To test generalisation, the scoring harness was run on five
**different** secret seeds (each draws a fresh set of 5 primes/tier + operands across every
range) — faithful private-eval simulations, since the private eval also uses `primes_per_tier = 5`.
| Seed (hex) | t1–t7 | t8 | t9 | t10 | **overall** | **htop** | det |
|---|---|---|---|---|---|---|---|
| `…public…` | 1.00 | 1.00 | 1.00 | 1.00 | **1.0000** | **10** | ✓ |
| `1111…` | 1.00 | 0.99 | 1.00 | 0.98 | 0.9970 | **10** | ✓ |
| `2222…` | 1.00 | 0.99 | 1.00 | 1.00 | 0.9990 | **10** | ✓ |
| `deadbeef…` | 1.00 | 0.97 | 1.00 | 1.00 | 0.9970 | **10** | ✓ |
| `cafef00d…` | 1.00 | 1.00 | 0.99 | 0.99 | 0.9980 | **10** | ✓ |
| `a5a5…` | 1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | **10** | ✓ |
Reproduce any row with `modchallenge evaluate horner_rnn --total 1100 --seed <hex>`.
**Reading of the evidence:**
- **Primary key invariant:** `highest_tier_above_90 = 10` on 6/6 seeds. The worst *any* scored
tier reached was **0.97** — never near the 0.90 threshold.
- **Secondary key in a tight band:** overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private
seed will most likely read ~0.997–0.999, not a literal 1.000.
- **All variation is confined to tiers 8–10** (257–2048-bit primes). Tiers 1–7 are perfectly
stable across every seed.
This matches the larger faithful 5-prime bootstrap on the shipped weights
(`diag_5prime_boot.py` in the research repo): `P(tier < 0.90) ≈ 0.000 %` for tiers 1–9 and
≈ 0.002 % for tier 10; `E[tier10] ≈ 0.991`, worst observed near-max tier-10 prime ≈ 0.875. A
40k-draw width sweep (`audit_width_robustness.py`, research repo) finds **no accuracy "knee"** anywhere in the
samplable range — the residual misses are rare per-`(a,b)` reduction-boundary events scattered
≈ uniformly, in the deep tail only.
---
## 6. Timing under the official clock
The 173.6 s above is **GPU** timing (batched `predict_digits_batch`). The budget is **300 s total**
for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is *not*
about correctness: if the official runner is **CPU-only**, the tier-10 scan can exceed the budget
and time out — which would zero the timed-out tiers and drop the primary key. Confirm the
runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget
before relying on the GPU timing. The *correctness* result (§3, §5) is independent of this.
---
## 7. Compliance, in one line each
(Full argument in `README.md` → "Compliance split" / "Status under the rules".)
- Preprocess hooks are pass-through identities — no cross-argument leakage (gate 6).
- `predict_digits` reduces only `a % p`, `b % p` (two-operand normalisation, allowed) and never
forms the three-argument modular product directly.
- No add/multiply/compare-against-`p` is hand-coded; the forward pass is tokenise → learned cell
→ quantise → readout.
- **Principle 2, measured:** perturbing trained weights collapses accuracy to the untrained
floor (`exploration/compliance_perturb.py`) — the arithmetic lives in the parameters.
- Passes `modchallenge check`; deterministic.
|