modular_arithmetic / EVALUATION.md
etwk
Docs: companion-repo note for provenance recipes, qualify dead links, neutral phrasing; gitignore .claude/
f704813
|
Raw
History Blame Contribute Delete
8.96 kB

Evaluation reference

This document records how horner_rnn is evaluated, how to reproduce the score, and how the result behaves across different evaluation seeds and prime ranges — i.e. how far the public 1.000 generalises. It complements README.md (which documents how the weights were obtained).

All numbers here are reproducible from this repo plus the official challenge harness (modchallenge); the per-tier sampling facts are read directly from the harness source cited inline.


1. What the harness scores (the eight gates)

modchallenge evaluate runs a fixed pipeline (src/modchallenge/evaluation/pipeline.py:evaluate_local). Every gate below must pass; the two ranking keys are produced only at the end.

# Gate This submission Bound / spec
1 Manifest validation entry_class=model.HornerRNN, output_base=2 well-formed manifest.json
2 Artifact size 0.04 GB ≤ 20 GB (EvalConfig.max_artifact_bytes)
3 Static analysis / compliance (security.static_check) 0 findings → passed no hand-coded arithmetic on p
4 Test-set generation 1100 problems = 100 × 11 tiers (0–10) total_problems
5 Model load ~2 s must import + load
6 Preprocess isolation (check_preprocess_isolation) passes — hooks are stateless identities per-argument, no cross-leak
7 Determinism (check_determinism, 10 end-to-end re-runs) deterministic: true required to be ranked
8 Inference within budget 173.6 s, all 11 tiers completed ≤ 300 s wall (timeout_seconds)

A tier that does not finish within the 300 s budget is scored 0 for that tier (run_inference discards partial tiers) — so latency is a correctness gate, not just a performance note (see §6).

Ranking keys (evaluation/results.py):

  • highest_tier_above_90 — the maximum scored tier (id > 0) with accuracy ≥ 0.90. Not a contiguous run; it depends only on the single highest tier clearing 0.90.
  • overall_accuracy — mean accuracy over completed scored tiers 1–10. Tier 0 is excluded from both keys.

2. How each tier samples its range

Private evaluation uses EvalConfig (config.py), which draws 5 distinct primes per tier (primes_per_tier = 5) and 4 edge cases (a=0, b=0, a=1, b=1). The public benchmark uses the same structure with a fixed seed. So each tier's 100 problems are:

4 edge cases + 96 problems spread over 5 distinct primes (≈ 19 operand-pairs/prime).

A consequence worth stating plainly: one weak prime ≈ 20 % of a tier. This is why robustness has to be measured by resampling the 5 primes across seeds, not by reading a single seed (§5).

Tier Prime range [2^min, 2^max) Operand range a,b ∈ [0, 2^k)
1 fixed primes {2,3,5,7} 2³²
2 2⁴ … 2⁸ 2⁴⁸
3 2⁹ … 2¹⁶ 2⁶⁴
4 2¹⁷ … 2³² 2⁹⁶
5 2³³ … 2⁶⁴ 2¹²⁸
6 2⁶⁵ … 2¹²⁸ 2²⁵⁶
7 2¹²⁹ … 2²⁵⁶ 2⁵¹²
8 2²⁵⁷ … 2⁵¹² 2¹⁰²⁴
9 2⁵¹³ … 2¹⁰²⁴ 2²⁰⁴⁸
10 2¹⁰²⁵ … 2²⁰⁴⁸ 2⁴⁰⁹⁶

Primes are drawn value-uniform (randrange(2^min, 2^max) then nextprime), which concentrates mass at the top of each tier's bit-range. The weights are trained to match that distribution (see README, "Width-robustness audit").

Tier 0 is a separate pure-multiplication diagnostic (p chosen so a·b < p, i.e. no reduction); it is excluded from both ranking keys and so does not affect the score.


3. Reproducing the deterministic public score

The public benchmark seed is the hex of b'modchallenge-public-benchmark-v1'. The CLI parses --seed as bytes.fromhex(...), and an empty --seed means a random draw — so the explicit seed is required for the reproducible number.

PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"

Full public-seed result

overall_accuracy      = 1.0
highest_tier_above_90 = 10        (the maximum tier)
deterministic         = true
artifact size         = 0.04 GB
Tier 0 1 2 3 4 5 6 7 8 9 10
accuracy 0.70* 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.

Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at 173.6 s — the 2048-step tier-10 scan is essentially the entire cost.


4. What is and isn't guaranteed

  • No formal guarantee of exact arithmetic. The cell is a learned approximation of the Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
  • Generalisation is structural, not memorised. One shared cell runs the same width-independent reduction circuit at every value, so a different prime/operand in the same tier is the same operation on different numbers — not out-of-distribution. Held-out-prime validation tracks training accuracy (no memorisation gap).
  • The ranked outcome is robust (measured in §5): highest_tier_above_90 = 10 holds with very high probability across seeds; overall_accuracy stays ≥ 0.997. What is not guaranteed is the cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.

5. Seed / range robustness (the generalisation evidence)

The public 1.000 is one draw. To test generalisation, the scoring harness was run on five different secret seeds (each draws a fresh set of 5 primes/tier + operands across every range) — faithful private-eval simulations, since the private eval also uses primes_per_tier = 5.

Seed (hex) t1–t7 t8 t9 t10 overall htop det
…public… 1.00 1.00 1.00 1.00 1.0000 10
1111… 1.00 0.99 1.00 0.98 0.9970 10
2222… 1.00 0.99 1.00 1.00 0.9990 10
deadbeef… 1.00 0.97 1.00 1.00 0.9970 10
cafef00d… 1.00 1.00 0.99 0.99 0.9980 10
a5a5… 1.00 1.00 1.00 1.00 1.0000 10

Reproduce any row with modchallenge evaluate horner_rnn --total 1100 --seed <hex>.

Reading of the evidence:

  • Primary key invariant: highest_tier_above_90 = 10 on 6/6 seeds. The worst any scored tier reached was 0.97 — never near the 0.90 threshold.
  • Secondary key in a tight band: overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private seed will most likely read ~0.997–0.999, not a literal 1.000.
  • All variation is confined to tiers 8–10 (257–2048-bit primes). Tiers 1–7 are perfectly stable across every seed.

This matches the larger faithful 5-prime bootstrap on the shipped weights (diag_5prime_boot.py in the research repo): P(tier < 0.90) ≈ 0.000 % for tiers 1–9 and ≈ 0.002 % for tier 10; E[tier10] ≈ 0.991, worst observed near-max tier-10 prime ≈ 0.875. A 40k-draw width sweep (audit_width_robustness.py, research repo) finds no accuracy "knee" anywhere in the samplable range — the residual misses are rare per-(a,b) reduction-boundary events scattered ≈ uniformly, in the deep tail only.


6. Timing under the official clock

The 173.6 s above is GPU timing (batched predict_digits_batch). The budget is 300 s total for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is not about correctness: if the official runner is CPU-only, the tier-10 scan can exceed the budget and time out — which would zero the timed-out tiers and drop the primary key. Confirm the runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget before relying on the GPU timing. The correctness result (§3, §5) is independent of this.


7. Compliance, in one line each

(Full argument in README.md → "Compliance split" / "Status under the rules".)

  • Preprocess hooks are pass-through identities — no cross-argument leakage (gate 6).
  • predict_digits reduces only a % p, b % p (two-operand normalisation, allowed) and never forms the three-argument modular product directly.
  • No add/multiply/compare-against-p is hand-coded; the forward pass is tokenise → learned cell → quantise → readout.
  • Principle 2, measured: perturbing trained weights collapses accuracy to the untrained floor (exploration/compliance_perturb.py) — the arithmetic lives in the parameters.
  • Passes modchallenge check; deterministic.