Evaluation reference

This document records how horner_rnn is evaluated, how to reproduce the score, and how the result behaves across different evaluation seeds and prime ranges — i.e. how far the public 1.000 generalises. It complements README.md (which documents how the weights were obtained).

All numbers here are reproducible from this repo plus the official challenge harness (modchallenge); the per-tier sampling facts are read directly from the harness source cited inline.

1. What the harness scores (the eight gates)

modchallenge evaluate runs a fixed pipeline (src/modchallenge/evaluation/pipeline.py:evaluate_local). Every gate below must pass; the two ranking keys are produced only at the end.

| # | Gate | This submission | Bound / spec |
|---|---|---|---|
| 1 | Manifest validation | `entry_class=model.HornerRNN`, `output_base=2` | well-formed `manifest.json` |
| 2 | Artifact size | 0.04 GB | ≤ 20 GB (`EvalConfig.max_artifact_bytes`) |
| 3 | Static analysis / compliance (`security.static_check`) | 0 findings → passed | no hand-coded arithmetic on `p` |
| 4 | Test-set generation | 1100 problems = 100 × 11 tiers (0–10) | `total_problems` |
| 5 | Model load | ~2 s | must import + load |
| 6 | Preprocess isolation (`check_preprocess_isolation`) | passes; hooks are stateless identities | per-argument, no cross-leak |
| 7 | Determinism (`check_determinism`, 10 end-to-end re-runs) | `deterministic: true` | required to be ranked |
| 8 | Inference within budget | 173.6 s, all 11 tiers completed | ≤ 300 s wall (`timeout_seconds`) |

A tier that does not finish within the 300 s budget is scored 0 for that tier (run_inference discards partial tiers) — so latency is a correctness gate, not just a performance note (see §6).
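
The timeout rule can be illustrated with a minimal sketch. This is not the harness's `run_inference`; the function name `score_with_budget` and the example finish times are hypothetical, and the only behaviour taken from the text is that a tier finishing after the budget is scored 0.

```python
def score_with_budget(tier_accuracy, finish_seconds, budget_seconds=300.0):
    """Return per-tier scores, zeroing any tier that finished after the budget.

    tier_accuracy:  {tier_id: accuracy the tier would have received}
    finish_seconds: {tier_id: cumulative wall time when the tier finished}
    """
    scored = {}
    for tier, accuracy in tier_accuracy.items():
        finished_at = finish_seconds.get(tier)
        on_time = finished_at is not None and finished_at <= budget_seconds
        scored[tier] = accuracy if on_time else 0.0
    return scored


# Hypothetical run in which tier 10 finishes at 320 s, past the 300 s budget.
accuracy = {8: 1.0, 9: 1.0, 10: 1.0}
finish = {8: 23.0, 9: 53.0, 10: 320.0}
result = score_with_budget(accuracy, finish)
print(result)
```

In this hypothetical run tier 10 is perfectly accurate but still contributes 0, which is why the timing in §6 matters for the ranked result.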

Ranking keys (evaluation/results.py):

- highest_tier_above_90: the maximum scored tier (id > 0) with accuracy ≥ 0.90. Not a contiguous run; it depends only on the single highest tier clearing 0.90.
- overall_accuracy: mean accuracy over completed scored tiers 1–10. Tier 0 is excluded from both keys.
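
The two keys can be sketched as follows. This is an illustration of the definitions above, not the code in evaluation/results.py; the function name `ranking_keys` is hypothetical, and returning 0 when no tier clears the threshold is an assumption.

```python
def ranking_keys(tier_accuracy, threshold=0.90):
    """Compute (highest_tier_above_90, overall_accuracy) from {tier_id: accuracy}.

    Tier 0 is dropped from both keys. Returning 0 when no scored tier clears
    the threshold is an assumption of this sketch.
    """
    scored = {tier: acc for tier, acc in tier_accuracy.items() if tier > 0}
    passing = [tier for tier, acc in scored.items() if acc >= threshold]
    highest = max(passing) if passing else 0
    overall = sum(scored.values()) / len(scored)
    return highest, overall


# Public-seed shape: tier 0 at 0.70 (unscored), tiers 1-10 at 1.00.
public = {0: 0.70, **{tier: 1.00 for tier in range(1, 11)}}
print(ranking_keys(public))

# Non-contiguity: a weak tier 5 lowers the mean but not the primary key.
dipped = dict(public)
dipped[5] = 0.50
print(ranking_keys(dipped))
```

The second call shows the non-contiguous behaviour: the primary key stays at 10 because tier 10 itself clears 0.90, while the mean drops.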

2. How each tier samples its range

Private evaluation uses EvalConfig (config.py), which draws 5 distinct primes per tier (primes_per_tier = 5) and 4 edge cases (a=0, b=0, a=1, b=1). The public benchmark uses the same structure with a fixed seed. So each tier's 100 problems are:

4 edge cases + 96 problems spread over 5 distinct primes (≈ 19 operand-pairs/prime).

A consequence worth stating plainly: one weak prime ≈ 20 % of a tier. This is why robustness has to be measured by resampling the 5 primes across seeds, not by reading a single seed (§5).
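
The arithmetic behind that figure, as a small sketch. It assumes the 96 non-edge problems are split evenly across the 5 primes and that every other problem in the tier is answered correctly; the function names are hypothetical.

```python
TIER_PROBLEMS = 100
EDGE_CASES = 4
PRIMES_PER_TIER = 5


def per_prime_share():
    """Fraction of a tier's problems attributable to one prime (even split assumed)."""
    return (TIER_PROBLEMS - EDGE_CASES) / PRIMES_PER_TIER / TIER_PROBLEMS


def tier_accuracy_with_one_weak_prime(weak_prime_accuracy):
    """Tier accuracy if one prime is weak and everything else is correct."""
    return 1.0 - per_prime_share() * (1.0 - weak_prime_accuracy)


share = per_prime_share()
print(round(share, 3))                                    # 0.192
print(round(tier_accuracy_with_one_weak_prime(0.5), 3))   # 0.904
print(round(tier_accuracy_with_one_weak_prime(0.0), 3))   # 0.808
```

Under these assumptions a single prime that fails completely is enough to pull a tier below 0.90, while a prime at 50 % accuracy leaves the tier just above it.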

| Tier | Prime range `[2^min, 2^max)` | Operand range `a,b ∈ [0, 2^k)` |
|---|---|---|
| 1 | fixed primes {2,3,5,7} | 2³² |
| 2 | 2⁴ … 2⁸ | 2⁴⁸ |
| 3 | 2⁹ … 2¹⁶ | 2⁶⁴ |
| 4 | 2¹⁷ … 2³² | 2⁹⁶ |
| 5 | 2³³ … 2⁶⁴ | 2¹²⁸ |
| 6 | 2⁶⁵ … 2¹²⁸ | 2²⁵⁶ |
| 7 | 2¹²⁹ … 2²⁵⁶ | 2⁵¹² |
| 8 | 2²⁵⁷ … 2⁵¹² | 2¹⁰²⁴ |
| 9 | 2⁵¹³ … 2¹⁰²⁴ | 2²⁰⁴⁸ |
| 10 | 2¹⁰²⁵ … 2²⁰⁴⁸ | 2⁴⁰⁹⁶ |

Primes are drawn value-uniform (randrange(2^min, 2^max) then nextprime), which concentrates mass at the top of each tier's bit-range. The weights are trained to match that distribution (see README, "Width-robustness audit").
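
To see why value-uniform sampling concentrates at the top of the bit-range, the share of integers in `[2^min, 2^max)` that have the full `max`-bit length can be computed exactly. The sketch below ignores the small shift introduced by the `nextprime` step, and the function name is hypothetical.

```python
from fractions import Fraction


def top_bitlength_fraction(min_exp, max_exp):
    """Exact share of integers in [2**min_exp, 2**max_exp) whose bit length is max_exp.

    Requires min_exp < max_exp. The nextprime step is ignored here.
    """
    total = 2**max_exp - 2**min_exp
    top = 2**max_exp - 2**(max_exp - 1)  # integers in [2**(max_exp-1), 2**max_exp)
    return Fraction(top, total)


tier3 = top_bitlength_fraction(9, 16)
tier10 = top_bitlength_fraction(1025, 2048)
print(tier3, float(tier3))
print(float(tier10))
```

For every tier with a sampled range, slightly more than half of the uniform draws land in the widest bit length, so a tier's 5 primes are mostly near-maximum width.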

Tier 0 is a separate pure-multiplication diagnostic (p chosen so a·b < p, i.e. no reduction); it is excluded from both ranking keys and so does not affect the score.

3. Reproducing the deterministic public score

The public benchmark seed is the hex of b'modchallenge-public-benchmark-v1'. The CLI parses --seed as bytes.fromhex(...), and an empty --seed means a random draw — so the explicit seed is required for the reproducible number.

PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"
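
The seed encoding can be checked in isolation with the standard library. This sketch only mirrors the `bytes.fromhex(...)` parsing described above; it does not call the harness.

```python
seed_text = b"modchallenge-public-benchmark-v1"

# Hex string to pass as --seed, and the bytes the CLI would recover from it.
seed_hex = seed_text.hex()
seed_bytes = bytes.fromhex(seed_hex)

print(seed_hex)
print(len(seed_bytes))  # 32
```

The round trip returns the original 32-byte string, so the hex value and the byte literal identify the same seed.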

Full public-seed result

overall_accuracy      = 1.0
highest_tier_above_90 = 10        (the maximum tier)
deterministic         = true
artifact size         = 0.04 GB

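| Tier | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | 0.70* | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |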

* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.

Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at 173.6 s. The 2048-step tier-10 scan therefore accounts for about 120 s, roughly 70 % of the total.

4. What is and isn't guaranteed

- No formal guarantee of exact arithmetic. The cell is a learned approximation of the Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
- Generalisation is structural, not memorised. One shared cell runs the same width-independent reduction circuit at every value, so a different prime/operand in the same tier is the same operation on different numbers, not an out-of-distribution input. Held-out-prime validation tracks training accuracy (no memorisation gap).
- The ranked outcome is robust (measured in §5): highest_tier_above_90 = 10 held on all six seeds tested, and overall_accuracy stayed ≥ 0.997. What is not guaranteed is the cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.

5. Seed / range robustness (the generalisation evidence)

The public 1.000 is a single draw. To test generalisation, the scoring harness was also run on five other seeds, each of which draws a fresh set of 5 primes per tier plus operands across every range. Because the private eval also uses primes_per_tier = 5, these runs are faithful simulations of it.

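| Seed (hex) | t1–t7 | t8 | t9 | t10 | overall | htop | det |
|---|---|---|---|---|---|---|---|
| `…public…` | 1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | 10 | ✓ |
| `1111…` | 1.00 | 0.99 | 1.00 | 0.98 | 0.9970 | 10 | ✓ |
| `2222…` | 1.00 | 0.99 | 1.00 | 1.00 | 0.9990 | 10 | ✓ |
| `deadbeef…` | 1.00 | 0.97 | 1.00 | 1.00 | 0.9970 | 10 | ✓ |
| `cafef00d…` | 1.00 | 1.00 | 0.99 | 0.99 | 0.9980 | 10 | ✓ |
| `a5a5…` | 1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | 10 | ✓ |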

Reproduce any row with modchallenge evaluate horner_rnn --total 1100 --seed <hex>.

Reading of the evidence:

- Primary key invariant: highest_tier_above_90 = 10 on 6/6 seeds. The worst any scored tier reached was 0.97, never near the 0.90 threshold.
- Secondary key in a tight band: overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private seed will most likely read ~0.997–0.999, not a literal 1.000.
- All variation is confined to tiers 8–10 (257–2048-bit primes). Tiers 1–7 are perfectly stable across every seed.
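
The summary figures above can be recomputed from the table rows. The sketch below copies the per-tier values from the table (seed labels are kept truncated exactly as printed) and treats the t1–t7 column as seven tiers at the same accuracy.

```python
# (t1-t7, t8, t9, t10) per seed, copied from the table above.
rows = {
    "…public…": (1.00, 1.00, 1.00, 1.00),
    "1111…": (1.00, 0.99, 1.00, 0.98),
    "2222…": (1.00, 0.99, 1.00, 1.00),
    "deadbeef…": (1.00, 0.97, 1.00, 1.00),
    "cafef00d…": (1.00, 1.00, 0.99, 0.99),
    "a5a5…": (1.00, 1.00, 1.00, 1.00),
}


def overall(t1_to_t7, t8, t9, t10):
    """Mean accuracy over scored tiers 1-10."""
    return (7 * t1_to_t7 + t8 + t9 + t10) / 10


overalls = {seed: round(overall(*tiers), 4) for seed, tiers in rows.items()}
mean_overall = round(sum(overalls.values()) / len(overalls), 4)
worst_tier = min(min(tiers) for tiers in rows.values())

print(overalls)
print(mean_overall, worst_tier)
```

The recomputed per-seed means match the `overall` column, and the mean and worst tier match the figures quoted in the list.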

This matches the larger faithful 5-prime bootstrap on the shipped weights (diag_5prime_boot.py in the research repo): P(tier < 0.90) ≈ 0.000 % for tiers 1–9 and ≈ 0.002 % for tier 10; E[tier10] ≈ 0.991, worst observed near-max tier-10 prime ≈ 0.875. A 40k-draw width sweep (audit_width_robustness.py, research repo) finds no accuracy "knee" anywhere in the samplable range — the residual misses are rare per-(a,b) reduction-boundary events scattered ≈ uniformly, in the deep tail only.

6. Timing under the official clock

The 173.6 s above is GPU timing (batched predict_digits_batch). The budget is 300 s total for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is not about correctness: if the official runner is CPU-only, the tier-10 scan can exceed the budget and time out — which would zero the timed-out tiers and drop the primary key. Confirm the runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget before relying on the GPU timing. The correctness result (§3, §5) is independent of this.
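
A rough way to size the risk is to ask how much slower than the GPU run the official runner can be before tiers start timing out. The sketch below assumes a uniform slowdown factor applied to the cumulative GPU times from §3, which is a simplification; the function name is hypothetical, and tiers 0–8 are treated as a group because only their combined finish time (~23 s) is reported.

```python
BUDGET_SECONDS = 300.0

# Cumulative GPU finish times from §3 (the tier 8 and 9 values are approximate).
GPU_FINISH_SECONDS = {8: 23.0, 9: 53.0, 10: 173.6}


def highest_tier_completed(slowdown, finish=GPU_FINISH_SECONDS, budget=BUDGET_SECONDS):
    """Highest tier that still finishes in budget under a uniform slowdown, else None."""
    done = [tier for tier, seconds in finish.items() if seconds * slowdown <= budget]
    return max(done) if done else None


max_slowdown_for_tier_10 = BUDGET_SECONDS / GPU_FINISH_SECONDS[10]
print(round(max_slowdown_for_tier_10, 2))  # 1.73
print(highest_tier_completed(2.0))         # 9
```

Under this assumption the headroom is about 1.73×: a runner twice as slow as the GPU run would already lose tier 10 and with it the primary key.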

7. Compliance, in one line each

(Full argument in README.md → "Compliance split" / "Status under the rules".)

- Preprocess hooks are pass-through identities, so there is no cross-argument leakage (gate 6).
- predict_digits reduces only a % p, b % p (two-operand normalisation, allowed) and never forms the three-argument modular product directly.
- No add/multiply/compare-against-p is hand-coded; the forward pass is tokenise → learned cell → quantise → readout.
- Principle 2, measured: perturbing trained weights collapses accuracy to the untrained floor (exploration/compliance_perturb.py), so the arithmetic lives in the parameters.
- Passes modchallenge check; deterministic.