Evaluation reference
This document records how horner_rnn is evaluated, how to reproduce the score, and how the
result behaves across different evaluation seeds and prime ranges — i.e. how far the public
1.000 generalises. It complements README.md (which documents how the weights were obtained).
All numbers here are reproducible from this repo plus the official challenge harness
(modchallenge); the per-tier sampling facts are read directly from the harness source cited
inline.
1. What the harness scores (the eight gates)
modchallenge evaluate runs a fixed pipeline
(src/modchallenge/evaluation/pipeline.py:evaluate_local). Every gate below must pass; the two
ranking keys are produced only at the end.
| # | Gate | This submission | Bound / spec |
|---|---|---|---|
| 1 | Manifest validation | entry_class=model.HornerRNN, output_base=2 |
well-formed manifest.json |
| 2 | Artifact size | 0.04 GB | ≤ 20 GB (EvalConfig.max_artifact_bytes) |
| 3 | Static analysis / compliance (security.static_check) |
0 findings → passed | no hand-coded arithmetic on p |
| 4 | Test-set generation | 1100 problems = 100 × 11 tiers (0–10) | total_problems |
| 5 | Model load | ~2 s | must import + load |
| 6 | Preprocess isolation (check_preprocess_isolation) |
passes — hooks are stateless identities | per-argument, no cross-leak |
| 7 | Determinism (check_determinism, 10 end-to-end re-runs) |
deterministic: true |
required to be ranked |
| 8 | Inference within budget | 173.6 s, all 11 tiers completed | ≤ 300 s wall (timeout_seconds) |
A tier that does not finish within the 300 s budget is scored 0 for that tier
(run_inference discards partial tiers) — so latency is a correctness gate, not just a
performance note (see §6).
Ranking keys (evaluation/results.py):
highest_tier_above_90— the maximum scored tier (id > 0) with accuracy ≥ 0.90. Not a contiguous run; it depends only on the single highest tier clearing 0.90.overall_accuracy— mean accuracy over completed scored tiers 1–10. Tier 0 is excluded from both keys.
2. How each tier samples its range
Private evaluation uses EvalConfig (config.py), which draws 5 distinct primes per tier
(primes_per_tier = 5) and 4 edge cases (a=0, b=0, a=1, b=1). The public benchmark uses
the same structure with a fixed seed. So each tier's 100 problems are:
4 edge cases + 96 problems spread over 5 distinct primes (≈ 19 operand-pairs/prime).
A consequence worth stating plainly: one weak prime ≈ 20 % of a tier. This is why robustness has to be measured by resampling the 5 primes across seeds, not by reading a single seed (§5).
| Tier | Prime range [2^min, 2^max) |
Operand range a,b ∈ [0, 2^k) |
|---|---|---|
| 1 | fixed primes {2,3,5,7} | 2³² |
| 2 | 2⁴ … 2⁸ | 2⁴⁸ |
| 3 | 2⁹ … 2¹⁶ | 2⁶⁴ |
| 4 | 2¹⁷ … 2³² | 2⁹⁶ |
| 5 | 2³³ … 2⁶⁴ | 2¹²⁸ |
| 6 | 2⁶⁵ … 2¹²⁸ | 2²⁵⁶ |
| 7 | 2¹²⁹ … 2²⁵⁶ | 2⁵¹² |
| 8 | 2²⁵⁷ … 2⁵¹² | 2¹⁰²⁴ |
| 9 | 2⁵¹³ … 2¹⁰²⁴ | 2²⁰⁴⁸ |
| 10 | 2¹⁰²⁵ … 2²⁰⁴⁸ | 2⁴⁰⁹⁶ |
Primes are drawn value-uniform (randrange(2^min, 2^max) then nextprime), which
concentrates mass at the top of each tier's bit-range. The weights are trained to match that
distribution (see README, "Width-robustness audit").
Tier 0 is a separate pure-multiplication diagnostic (p chosen so a·b < p, i.e. no
reduction); it is excluded from both ranking keys and so does not affect the score.
3. Reproducing the deterministic public score
The public benchmark seed is the hex of b'modchallenge-public-benchmark-v1'. The CLI parses
--seed as bytes.fromhex(...), and an empty --seed means a random draw — so the explicit
seed is required for the reproducible number.
PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"
Full public-seed result
overall_accuracy = 1.0
highest_tier_above_90 = 10 (the maximum tier)
deterministic = true
artifact size = 0.04 GB
| Tier | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | 0.70* | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.
Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at 173.6 s — the 2048-step tier-10 scan is essentially the entire cost.
4. What is and isn't guaranteed
- No formal guarantee of exact arithmetic. The cell is a learned approximation of the Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
- Generalisation is structural, not memorised. One shared cell runs the same width-independent reduction circuit at every value, so a different prime/operand in the same tier is the same operation on different numbers — not out-of-distribution. Held-out-prime validation tracks training accuracy (no memorisation gap).
- The ranked outcome is robust (measured in §5):
highest_tier_above_90 = 10holds with very high probability across seeds;overall_accuracystays ≥ 0.997. What is not guaranteed is the cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.
5. Seed / range robustness (the generalisation evidence)
The public 1.000 is one draw. To test generalisation, the scoring harness was run on five
different secret seeds (each draws a fresh set of 5 primes/tier + operands across every
range) — faithful private-eval simulations, since the private eval also uses primes_per_tier = 5.
| Seed (hex) | t1–t7 | t8 | t9 | t10 | overall | htop | det |
|---|---|---|---|---|---|---|---|
…public… |
1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | 10 | ✓ |
1111… |
1.00 | 0.99 | 1.00 | 0.98 | 0.9970 | 10 | ✓ |
2222… |
1.00 | 0.99 | 1.00 | 1.00 | 0.9990 | 10 | ✓ |
deadbeef… |
1.00 | 0.97 | 1.00 | 1.00 | 0.9970 | 10 | ✓ |
cafef00d… |
1.00 | 1.00 | 0.99 | 0.99 | 0.9980 | 10 | ✓ |
a5a5… |
1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | 10 | ✓ |
Reproduce any row with modchallenge evaluate horner_rnn --total 1100 --seed <hex>.
Reading of the evidence:
- Primary key invariant:
highest_tier_above_90 = 10on 6/6 seeds. The worst any scored tier reached was 0.97 — never near the 0.90 threshold. - Secondary key in a tight band: overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private seed will most likely read ~0.997–0.999, not a literal 1.000.
- All variation is confined to tiers 8–10 (257–2048-bit primes). Tiers 1–7 are perfectly stable across every seed.
This matches the larger faithful 5-prime bootstrap on the shipped weights
(diag_5prime_boot.py in the research repo): P(tier < 0.90) ≈ 0.000 % for tiers 1–9 and
≈ 0.002 % for tier 10; E[tier10] ≈ 0.991, worst observed near-max tier-10 prime ≈ 0.875. A
40k-draw width sweep (audit_width_robustness.py, research repo) finds no accuracy "knee" anywhere in the
samplable range — the residual misses are rare per-(a,b) reduction-boundary events scattered
≈ uniformly, in the deep tail only.
6. Timing under the official clock
The 173.6 s above is GPU timing (batched predict_digits_batch). The budget is 300 s total
for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is not
about correctness: if the official runner is CPU-only, the tier-10 scan can exceed the budget
and time out — which would zero the timed-out tiers and drop the primary key. Confirm the
runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget
before relying on the GPU timing. The correctness result (§3, §5) is independent of this.
7. Compliance, in one line each
(Full argument in README.md → "Compliance split" / "Status under the rules".)
- Preprocess hooks are pass-through identities — no cross-argument leakage (gate 6).
predict_digitsreduces onlya % p,b % p(two-operand normalisation, allowed) and never forms the three-argument modular product directly.- No add/multiply/compare-against-
pis hand-coded; the forward pass is tokenise → learned cell → quantise → readout. - Principle 2, measured: perturbing trained weights collapses accuracy to the untrained
floor (
exploration/compliance_perturb.py) — the arithmetic lives in the parameters. - Passes
modchallenge check; deterministic.