etwk

Docs: companion-repo note for provenance recipes, qualify dead links, neutral phrasing; gitignore .claude/

f704813 4 days ago

8.96 kB

	# Evaluation reference

	This document records **how `horner_rnn` is evaluated, how to reproduce the score, and how the
	result behaves across different evaluation seeds and prime ranges** — i.e. how far the public
	`1.000` generalises. It complements `README.md` (which documents how the weights were obtained).

	All numbers here are reproducible from this repo plus the official challenge harness
	(`modchallenge`); the per-tier sampling facts are read directly from the harness source cited
	inline.

	---

	## 1. What the harness scores (the eight gates)

	`modchallenge evaluate` runs a fixed pipeline
	(`src/modchallenge/evaluation/pipeline.py:evaluate_local`). Every gate below must pass; the two
	ranking keys are produced only at the end.

	\| # \| Gate \| This submission \| Bound / spec \|
	\|---\|---\|---\|---\|
	\| 1 \| Manifest validation \| `entry_class=model.HornerRNN`, `output_base=2` \| well-formed `manifest.json` \|
	\| 2 \| Artifact size \| 0.04 GB \| ≤ 20 GB (`EvalConfig.max_artifact_bytes`) \|
	\| 3 \| Static analysis / compliance (`security.static_check`) \| 0 findings → passed \| no hand-coded arithmetic on `p` \|
	\| 4 \| Test-set generation \| 1100 problems = 100 × 11 tiers (0–10) \| `total_problems` \|
	\| 5 \| Model load \| ~2 s \| must import + load \|
	\| 6 \| Preprocess isolation (`check_preprocess_isolation`) \| passes — hooks are stateless identities \| per-argument, no cross-leak \|
	\| 7 \| Determinism (`check_determinism`, 10 end-to-end re-runs) \| `deterministic: true` \| required to be ranked \|
	\| 8 \| Inference within budget \| 173.6 s, all 11 tiers completed \| ≤ 300 s wall (`timeout_seconds`) \|

	A tier that does not finish within the 300 s budget is scored 0 for that tier
	(`run_inference` discards partial tiers) — so latency is a correctness gate, not just a
	performance note (see §6).

	Ranking keys (`evaluation/results.py`):
	- `highest_tier_above_90` — the maximum scored tier (id > 0) with accuracy ≥ 0.90. Not a
	contiguous run; it depends only on the single highest tier clearing 0.90.
	- `overall_accuracy` — mean accuracy over completed scored tiers 1–10. Tier 0 is excluded
	from both keys.

	---

	## 2. How each tier samples its range

	Private evaluation uses `EvalConfig` (`config.py`), which draws 5 distinct primes per tier
	(`primes_per_tier = 5`) and 4 edge cases (`a=0, b=0, a=1, b=1`). The public benchmark uses
	the same structure with a fixed seed. So each tier's 100 problems are:

	> 4 edge cases + 96 problems spread over 5 distinct primes (≈ 19 operand-pairs/prime).

	A consequence worth stating plainly: one weak prime ≈ 20 % of a tier. This is why
	robustness has to be measured by resampling the 5 primes across seeds, not by reading a single
	seed (§5).

	\| Tier \| Prime range `[2^min, 2^max)` \| Operand range `a,b ∈ [0, 2^k)` \|
	\|---\|---\|---\|
	\| 1 \| fixed primes {2,3,5,7} \| 2³² \|
	\| 2 \| 2⁴ … 2⁸ \| 2⁴⁸ \|
	\| 3 \| 2⁹ … 2¹⁶ \| 2⁶⁴ \|
	\| 4 \| 2¹⁷ … 2³² \| 2⁹⁶ \|
	\| 5 \| 2³³ … 2⁶⁴ \| 2¹²⁸ \|
	\| 6 \| 2⁶⁵ … 2¹²⁸ \| 2²⁵⁶ \|
	\| 7 \| 2¹²⁹ … 2²⁵⁶ \| 2⁵¹² \|
	\| 8 \| 2²⁵⁷ … 2⁵¹² \| 2¹⁰²⁴ \|
	\| 9 \| 2⁵¹³ … 2¹⁰²⁴ \| 2²⁰⁴⁸ \|
	\| 10 \| 2¹⁰²⁵ … 2²⁰⁴⁸ \| 2⁴⁰⁹⁶ \|

	Primes are drawn value-uniform (`randrange(2^min, 2^max)` then `nextprime`), which
	concentrates mass at the top of each tier's bit-range. The weights are trained to match that
	distribution (see README, "Width-robustness audit").

	Tier 0 is a separate pure-multiplication diagnostic (`p` chosen so `a·b < p`, i.e. no
	reduction); it is excluded from both ranking keys and so does not affect the score.

	---

	## 3. Reproducing the deterministic public score

	The public benchmark seed is the hex of `b'modchallenge-public-benchmark-v1'`. The CLI parses
	`--seed` as `bytes.fromhex(...)`, and an empty `--seed` means a random draw — so the explicit
	seed is required for the reproducible number.

	```bash
	PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
	# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
	modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"
	```

	### Full public-seed result

	```
	overall_accuracy = 1.0
	highest_tier_above_90 = 10 (the maximum tier)
	deterministic = true
	artifact size = 0.04 GB
	```

	\| Tier \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \| 8 \| 9 \| 10 \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| accuracy \| 0.70\* \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.00 \|

	\* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.

	Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at 173.6 s —
	the 2048-step tier-10 scan is essentially the entire cost.

	---

	## 4. What is and isn't guaranteed

	- No formal guarantee of exact arithmetic. The cell is a learned approximation of the
	Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
	- Generalisation is structural, not memorised. One shared cell runs the same
	width-independent reduction circuit at every value, so a different prime/operand in the same
	tier is the same operation on different numbers — not out-of-distribution. Held-out-prime
	validation tracks training accuracy (no memorisation gap).
	- The ranked outcome is robust (measured in §5): `highest_tier_above_90 = 10` holds with very
	high probability across seeds; `overall_accuracy` stays ≥ 0.997. What is not guaranteed is the
	cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.

	---

	## 5. Seed / range robustness (the generalisation evidence)

	The public `1.000` is one draw. To test generalisation, the scoring harness was run on five
	different secret seeds (each draws a fresh set of 5 primes/tier + operands across every
	range) — faithful private-eval simulations, since the private eval also uses `primes_per_tier = 5`.

	\| Seed (hex) \| t1–t7 \| t8 \| t9 \| t10 \| overall \| htop \| det \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| `…public…` \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.0000 \| 10 \| ✓ \|
	\| `1111…` \| 1.00 \| 0.99 \| 1.00 \| 0.98 \| 0.9970 \| 10 \| ✓ \|
	\| `2222…` \| 1.00 \| 0.99 \| 1.00 \| 1.00 \| 0.9990 \| 10 \| ✓ \|
	\| `deadbeef…` \| 1.00 \| 0.97 \| 1.00 \| 1.00 \| 0.9970 \| 10 \| ✓ \|
	\| `cafef00d…` \| 1.00 \| 1.00 \| 0.99 \| 0.99 \| 0.9980 \| 10 \| ✓ \|
	\| `a5a5…` \| 1.00 \| 1.00 \| 1.00 \| 1.00 \| 1.0000 \| 10 \| ✓ \|

	Reproduce any row with `modchallenge evaluate horner_rnn --total 1100 --seed <hex>`.

	Reading of the evidence:
	- Primary key invariant: `highest_tier_above_90 = 10` on 6/6 seeds. The worst any scored
	tier reached was 0.97 — never near the 0.90 threshold.
	- Secondary key in a tight band: overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private
	seed will most likely read ~0.997–0.999, not a literal 1.000.
	- All variation is confined to tiers 8–10 (257–2048-bit primes). Tiers 1–7 are perfectly
	stable across every seed.

	This matches the larger faithful 5-prime bootstrap on the shipped weights
	(`diag_5prime_boot.py` in the research repo): `P(tier < 0.90) ≈ 0.000 %` for tiers 1–9 and
	≈ 0.002 % for tier 10; `E[tier10] ≈ 0.991`, worst observed near-max tier-10 prime ≈ 0.875. A
	40k-draw width sweep (`audit_width_robustness.py`, research repo) finds no accuracy "knee" anywhere in the
	samplable range — the residual misses are rare per-`(a,b)` reduction-boundary events scattered
	≈ uniformly, in the deep tail only.

	---

	## 6. Timing under the official clock

	The 173.6 s above is GPU timing (batched `predict_digits_batch`). The budget is 300 s total
	for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is not
	about correctness: if the official runner is CPU-only, the tier-10 scan can exceed the budget
	and time out — which would zero the timed-out tiers and drop the primary key. Confirm the
	runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget
	before relying on the GPU timing. The correctness result (§3, §5) is independent of this.

	---

	## 7. Compliance, in one line each

	(Full argument in `README.md` → "Compliance split" / "Status under the rules".)

	- Preprocess hooks are pass-through identities — no cross-argument leakage (gate 6).
	- `predict_digits` reduces only `a % p`, `b % p` (two-operand normalisation, allowed) and never
	forms the three-argument modular product directly.
	- No add/multiply/compare-against-`p` is hand-coded; the forward pass is tokenise → learned cell
	→ quantise → readout.
	- Principle 2, measured: perturbing trained weights collapses accuracy to the untrained
	floor (`exploration/compliance_perturb.py`) — the arithmetic lives in the parameters.
	- Passes `modchallenge check`; deterministic.