# Evaluation reference

This document records **how `horner_rnn` is evaluated, how to reproduce the score, and how the
result behaves across different evaluation seeds and prime ranges**, i.e. how far the public
`1.000` generalises. It complements `README.md` (which documents how the weights were obtained).
All numbers here are reproducible from this repo plus the official challenge harness
(`modchallenge`); the per-tier sampling facts are read directly from the harness source cited
inline.

---

## 1. What the harness scores (the eight gates)

`modchallenge evaluate` runs a fixed pipeline
(`src/modchallenge/evaluation/pipeline.py:evaluate_local`). Every gate below must pass; the two
ranking keys are produced only at the end.

| # | Gate | This submission | Bound / spec |
|---|---|---|---|
| 1 | **Manifest validation** | `entry_class=model.HornerRNN`, `output_base=2` | well-formed `manifest.json` |
| 2 | **Artifact size** | **0.04 GB** | ≤ 20 GB (`EvalConfig.max_artifact_bytes`) |
| 3 | **Static analysis / compliance** (`security.static_check`) | 0 findings → *passed* | no hand-coded arithmetic on `p` |
| 4 | **Test-set generation** | 1100 problems = 100 × 11 tiers (0–10) | `total_problems` |
| 5 | **Model load** | ~2 s | must import + load |
| 6 | **Preprocess isolation** (`check_preprocess_isolation`) | passes: hooks are stateless identities | per-argument, no cross-leak |
| 7 | **Determinism** (`check_determinism`, 10 end-to-end re-runs) | `deterministic: true` | required to be ranked |
| 8 | **Inference within budget** | 173.6 s, all 11 tiers completed | ≤ 300 s wall (`timeout_seconds`) |

A tier that does not *finish* within the 300 s budget is scored **0** for that tier
(`run_inference` discards partial tiers), so latency is a correctness gate, not just a
performance note (see §6).

**Ranking keys** (`evaluation/results.py`):

- `highest_tier_above_90`: the **maximum** scored tier (id > 0) with accuracy ≥ 0.90. It is not a
  contiguous run: lower tiers may fall below 0.90 without affecting it, because it depends only
  on the single highest tier clearing 0.90.
- `overall_accuracy`: mean accuracy over **completed scored tiers 1–10**. Tier 0 is excluded
  from both keys.

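
As a minimal sketch of the two definitions above (not the harness code in `evaluation/results.py`), both keys can be computed from a per-tier accuracy map. The function name `ranking_keys` and the fallback value `0` when no tier clears the threshold are assumptions made for illustration.

```python
def ranking_keys(tier_accuracy, threshold=0.90):
    """Return (highest_tier_above_90, overall_accuracy).

    tier_accuracy maps tier id -> accuracy for the tiers that completed.
    Tier 0 is dropped from both keys, as described in the text.
    """
    scored = {t: acc for t, acc in tier_accuracy.items() if t > 0}
    passing = [t for t, acc in scored.items() if acc >= threshold]
    # Assumption: report 0 when no scored tier reaches the threshold.
    highest = max(passing) if passing else 0
    overall = sum(scored.values()) / len(scored) if scored else 0.0
    return highest, overall


# A dip in a lower tier does not change the primary key:
# only the highest tier at or above the threshold matters.
example = {0: 0.70, **{t: 1.00 for t in range(1, 11)}}
example[8] = 0.80
print(ranking_keys(example))
```

The example deliberately places a weak tier 8 below a passing tier 10 to show that the primary key is a maximum, not the length of an unbroken run.
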
---

## 2. How each tier samples its range

Private evaluation uses `EvalConfig` (`config.py`), which draws **5 distinct primes per tier**
(`primes_per_tier = 5`) and **4 edge cases** (`a=0, b=0, a=1, b=1`). The public benchmark uses
the same structure with a fixed seed. So each tier's 100 problems are:

> 4 edge cases + 96 problems spread over **5 distinct primes** (≈ 19 operand-pairs/prime).

A consequence worth stating plainly: **one weak prime ≈ 20 % of a tier.** This is why
robustness has to be measured by *resampling the 5 primes across seeds*, not by reading a single
seed (§5).

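
The 20 % figure follows from simple arithmetic. The sketch below assumes the 96 sampled problems are split evenly across the 5 primes (the harness's exact split is not restated here) and that everything except one weak prime is answered correctly; all names are illustrative.

```python
PROBLEMS_PER_TIER = 100
EDGE_CASES = 4
PRIMES_PER_TIER = 5

sampled = PROBLEMS_PER_TIER - EDGE_CASES        # 96 non-edge problems
per_prime = sampled / PRIMES_PER_TIER           # 19.2 operand pairs per prime
share = per_prime / PROBLEMS_PER_TIER           # 0.192 of the tier per prime


def tier_accuracy_with_weak_prime(weak_prime_accuracy):
    """Tier accuracy if one prime scores `weak_prime_accuracy` and all else is perfect."""
    return 1.0 - share * (1.0 - weak_prime_accuracy)


# Accuracy on the single weak prime below which the tier drops under 0.90.
break_even = 1.0 - (1.0 - 0.90) / share

print(per_prime, share, tier_accuracy_with_weak_prime(0.5), break_even)
```

Under these assumptions a single prime has to fall to roughly half accuracy before it alone pushes a tier below 0.90, which is why a one-seed reading says little about the tail.
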
| Tier | Prime range `[2^min, 2^max)` | Operand bound `2^k` (`a, b ∈ [0, 2^k)`) |
|---|---|---|
| 1 | fixed primes {2,3,5,7} | 2³² |
| 2 | 2⁴ … 2⁸ | 2⁴⁸ |
| 3 | 2⁹ … 2¹⁶ | 2⁶⁴ |
| 4 | 2¹⁷ … 2³² | 2⁹⁶ |
| 5 | 2³³ … 2⁶⁴ | 2¹²⁸ |
| 6 | 2⁶⁵ … 2¹²⁸ | 2²⁵⁶ |
| 7 | 2¹²⁹ … 2²⁵⁶ | 2⁵¹² |
| 8 | 2²⁵⁷ … 2⁵¹² | 2¹⁰²⁴ |
| 9 | 2⁵¹³ … 2¹⁰²⁴ | 2²⁰⁴⁸ |
| 10 | 2¹⁰²⁵ … 2²⁰⁴⁸ | 2⁴⁰⁹⁶ |

Primes are drawn **value-uniform** (`randrange(2^min, 2^max)` then `nextprime`), which
concentrates mass at the top of each tier's bit-range. The weights are trained to match that
distribution (see README, "Width-robustness audit").

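
To make the "mass at the top of the bit-range" point concrete, here is a small sketch of value-uniform prime sampling. It is not the harness code: `is_prime` and `next_prime` are trial-division stand-ins that are only practical for the small tiers, and treating `nextprime` as "smallest prime strictly greater than the draw" is an assumption.

```python
import random


def is_prime(n):
    """Trial division; a stand-in that is only usable for small n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n):
    """Smallest prime strictly greater than n (assumed semantics)."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def draw_prime(min_bits, max_bits, rng):
    """Value-uniform draw in [2^min_bits, 2^max_bits), then step to the next prime."""
    return next_prime(rng.randrange(2**min_bits, 2**max_bits))


def top_bit_fraction(min_bits, max_bits):
    """Share of integers in [2^min_bits, 2^max_bits) lying in the top octave [2^(max_bits-1), 2^max_bits)."""
    lo, hi = 2**min_bits, 2**max_bits
    return (hi - 2 ** (max_bits - 1)) / (hi - lo)


def empirical_top_bit_share(min_bits, max_bits, draws=2000, seed=0):
    """Fraction of sampled primes with at least max_bits bits."""
    rng = random.Random(seed)
    hits = sum(
        draw_prime(min_bits, max_bits, rng).bit_length() >= max_bits
        for _ in range(draws)
    )
    return hits / draws


# Tier 3 range from the table: [2^9, 2^16).
print(top_bit_fraction(9, 16), empirical_top_bit_share(9, 16))
```

Because half of any interval `[2^min, 2^max)` lies in its top octave, slightly more than half of the draws land at the largest bit width, and about three quarters land within the top two widths.
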
Tier 0 is a separate **pure-multiplication** diagnostic (`p` chosen so `a·b < p`, i.e. no
reduction); it is **excluded from both ranking keys** and so does not affect the score.

---

## 3. Reproducing the deterministic public score

The public benchmark seed is the hex of `b'modchallenge-public-benchmark-v1'`. The CLI parses
`--seed` as `bytes.fromhex(...)`, and an **empty `--seed` means a random draw**, so the explicit
seed is required for the reproducible number.

```bash
PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"
```

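
As a quick sanity check before a long run, the hex string can be round-tripped in Python using the same `bytes.fromhex` call the text attributes to the CLI. The variable names are illustrative; only the label and the hex value come from this document.

```python
PUBLIC_LABEL = b"modchallenge-public-benchmark-v1"
PUBLIC_SEED_HEX = "6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631"

# Encoding the label must give exactly the hex passed to --seed ...
assert PUBLIC_LABEL.hex() == PUBLIC_SEED_HEX

# ... and decoding the hex must give the label back.
seed_bytes = bytes.fromhex(PUBLIC_SEED_HEX)
assert seed_bytes == PUBLIC_LABEL

print(len(seed_bytes), "seed bytes")
```

A typo in the hex string either raises `ValueError` in `bytes.fromhex` or silently selects a different test set, so checking the round trip first is cheap insurance.
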
### Full public-seed result

```
overall_accuracy = 1.0
highest_tier_above_90 = 10 (the maximum tier)
deterministic = true
artifact size = 0.04 GB
```

| Tier | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | 0.70\* | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

\* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.

Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at **173.6 s**.
The 2048-step tier-10 scan alone takes about 120 s, roughly 70 % of the total.

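
The cumulative readings above can be turned into per-stage costs and budget headroom with a few lines of arithmetic. This is only a worked calculation on the quoted numbers (the ~23 s and ~53 s readings are approximate), and the names are illustrative.

```python
BUDGET_S = 300.0

# Cumulative GPU wall time at the end of each stage, as quoted in the text.
cumulative_s = [("tiers 0-8", 23.0), ("tier 9", 53.0), ("tier 10", 173.6)]

increments = {}
previous = 0.0
for stage, elapsed in cumulative_s:
    increments[stage] = elapsed - previous
    previous = elapsed

total_s = cumulative_s[-1][1]
tier10_share = increments["tier 10"] / total_s   # fraction of the run spent in tier 10
headroom_s = BUDGET_S - total_s                  # seconds left under the 300 s budget
max_slowdown = BUDGET_S / total_s                # uniform slowdown the budget can absorb

print(increments, round(tier10_share, 3), round(headroom_s, 1), round(max_slowdown, 2))
```

The last figure is the practical one: a runner less than about 1.7 times slower than the measured GPU run would still fit inside the budget, assuming the slowdown is uniform across tiers.
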
---

## 4. What is and isn't guaranteed

- **No formal guarantee of exact arithmetic.** The cell is a *learned* approximation of the
  Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
- **Generalisation is structural, not memorised.** One shared cell runs the same
  width-independent reduction circuit at every value, so a different prime/operand in the same
  tier is the same operation on different numbers, not out-of-distribution. Held-out-prime
  validation tracks training accuracy (no memorisation gap).
- **The ranked outcome is robust** (measured in §5): `highest_tier_above_90 = 10` holds with very
  high probability across seeds; `overall_accuracy` stays ≥ 0.997. What is *not* guaranteed is the
  cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.

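
For readers unfamiliar with the term, the sketch below is the textbook binary Horner recurrence for `a·b mod p`, written as exact integer code. It is an assumption that this is the step the learned cell approximates (the cell's internal representation is not described in this document), and it is emphatically not part of the submission: hand-coded arithmetic like this is what the compliance gate forbids. It is useful only as an independent reference for checking model outputs.

```python
def horner_modmul(a, b, p):
    """Exact (a * b) % p via a most-significant-bit-first Horner scan over b.

    One step per bit of b: acc <- (2 * acc + bit * a) mod p.
    """
    a %= p
    b %= p
    acc = 0
    steps = 0
    for bit in bin(b)[2:]:          # binary digits of b, most significant first
        acc = (2 * acc + (a if bit == "1" else 0)) % p
        steps += 1
    return acc, steps


# Reference check against Python's built-in big-integer arithmetic.
p = 2**61 - 1                       # a known Mersenne prime, chosen only for the example
a, b = 123456789123456789, 987654321987654321
result, steps = horner_modmul(a, b, p)
assert result == (a * b) % p
print(result, steps)
```

The step count equals the bit length of the reduced operand, so under this reading a 2048-bit prime implies a scan of up to 2048 sequential steps.
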
---

## 5. Seed / range robustness (the generalisation evidence)

The public `1.000` is **one draw**. To test generalisation, the scoring harness was run on five
**different** secret seeds (each draws a fresh set of 5 primes/tier + operands across every
range). These are faithful private-eval simulations, since the private eval also uses
`primes_per_tier = 5`. The first row repeats the public seed for comparison.

| Seed (hex) | Tiers 1–7 | Tier 8 | Tier 9 | Tier 10 | **Overall accuracy** | **Highest tier ≥ 0.90** | Deterministic |
|---|---|---|---|---|---|---|---|
| `…public…` | 1.00 | 1.00 | 1.00 | 1.00 | **1.0000** | **10** | ✓ |
| `1111…` | 1.00 | 0.99 | 1.00 | 0.98 | 0.9970 | **10** | ✓ |
| `2222…` | 1.00 | 0.99 | 1.00 | 1.00 | 0.9990 | **10** | ✓ |
| `deadbeef…` | 1.00 | 0.97 | 1.00 | 1.00 | 0.9970 | **10** | ✓ |
| `cafef00d…` | 1.00 | 1.00 | 0.99 | 0.99 | 0.9980 | **10** | ✓ |
| `a5a5…` | 1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | **10** | ✓ |

Reproduce any row with `modchallenge evaluate horner_rnn --total 1100 --seed <hex>`.

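
The overall column can be cross-checked from the per-tier columns, since `overall_accuracy` is the mean over scored tiers 1–10. The sketch below copies the table values by hand, keeps the seed labels abbreviated exactly as printed, and reads the "tiers 1–7" column as the same accuracy on each of those seven tiers.

```python
# (tiers 1-7, tier 8, tier 9, tier 10), copied from the table above.
rows = {
    "…public…": (1.00, 1.00, 1.00, 1.00),
    "1111…": (1.00, 0.99, 1.00, 0.98),
    "2222…": (1.00, 0.99, 1.00, 1.00),
    "deadbeef…": (1.00, 0.97, 1.00, 1.00),
    "cafef00d…": (1.00, 1.00, 0.99, 0.99),
    "a5a5…": (1.00, 1.00, 1.00, 1.00),
}


def overall(t1_to_7, t8, t9, t10):
    """Mean over the ten scored tiers."""
    return (7 * t1_to_7 + t8 + t9 + t10) / 10


overalls = {seed: overall(*accs) for seed, accs in rows.items()}
mean_overall = sum(overalls.values()) / len(overalls)
worst_tier = min(min(accs) for accs in rows.values())

for seed, value in overalls.items():
    print(f"{seed:<12} {value:.4f}")
print(f"mean={mean_overall:.4f} worst_tier={worst_tier:.2f}")
```
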
**Reading of the evidence:**

- **Primary key invariant:** `highest_tier_above_90 = 10` on 6/6 seeds. The worst *any* scored
  tier reached was **0.97**, never near the 0.90 threshold.
- **Secondary key in a tight band:** overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private
  seed will most likely read ~0.997–0.999, not a literal 1.000.
- **All variation is confined to tiers 8–10** (257–2048-bit primes). Tiers 1–7 are perfectly
  stable across every seed.

This matches the larger faithful 5-prime bootstrap on the shipped weights
(`diag_5prime_boot.py` in the research repo): `P(tier < 0.90) ≈ 0.000 %` for tiers 1–9 and
≈ 0.002 % for tier 10; `E[tier10] ≈ 0.991`, worst observed near-max tier-10 prime ≈ 0.875. A
40k-draw width sweep (`audit_width_robustness.py`, research repo) finds **no accuracy "knee"**
anywhere in the samplable range; the residual misses are rare per-`(a,b)` reduction-boundary
events scattered ≈ uniformly, in the deep tail only.

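
The idea behind a 5-prime bootstrap can be sketched as follows. This is not `diag_5prime_boot.py`: the function name, the even split of the 96 sampled problems, the assumption that the 4 edge cases are always correct, and above all the synthetic per-prime accuracies in the example are illustrative stand-ins, not measured data.

```python
import random


def bootstrap_p_below(per_prime_accuracy, threshold=0.90, draws=10_000,
                      primes_per_tier=5, edge_cases=4, problems=100, seed=0):
    """Estimate P(tier accuracy < threshold) by resampling primes_per_tier primes.

    per_prime_accuracy: pool of per-prime accuracies for one tier.
    Assumes edge cases are always correct and the remaining problems
    are split evenly over the sampled primes.
    """
    rng = random.Random(seed)
    sampled = problems - edge_cases
    below = 0
    for _ in range(draws):
        picks = rng.sample(per_prime_accuracy, primes_per_tier)  # distinct primes
        tier_acc = (edge_cases + sampled * sum(picks) / primes_per_tier) / problems
        if tier_acc < threshold:
            below += 1
    return below / draws


# Synthetic pool for illustration only: 99 perfect primes and one at 0.40.
synthetic_pool = [1.0] * 99 + [0.40]
print(bootstrap_p_below(synthetic_pool))
```

In the synthetic pool the tier only fails when the single weak prime is among the five drawn, so the estimate sits near 5/100; with a measured pool the same loop yields the kind of tail probability quoted above.
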
---

## 6. Timing under the official clock

The 173.6 s above is **GPU** timing (batched `predict_digits_batch`). The budget is **300 s total**
for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is *not*
about correctness: if the official runner is **CPU-only**, the tier-10 scan can exceed the budget
and time out, which would zero the timed-out tiers and drop the primary key. Confirm the
runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget
before relying on the GPU timing. The *correctness* result (§3, §5) is independent of this.

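
A dress rehearsal can be as simple as timing the evaluation command from the outside. The wrapper below is a hypothetical helper, not part of `modchallenge`; it measures whole-process wall time (including model load), so it is a conservative stand-in for the harness's own inference clock. The optional `env` argument is there because hiding GPUs via an empty `CUDA_VISIBLE_DEVICES` is a common way to force a CPU run, assuming the model runs on a CUDA-based framework.

```python
import os
import subprocess
import sys
import time


def run_with_budget(cmd, budget_s=300.0, extra_env=None):
    """Run cmd, killing it after budget_s seconds.

    Returns (elapsed_seconds, return_code); return_code is None on timeout.
    """
    env = dict(os.environ, **(extra_env or {}))
    start = time.monotonic()
    try:
        return_code = subprocess.run(cmd, timeout=budget_s, env=env).returncode
    except subprocess.TimeoutExpired:
        return_code = None
    return time.monotonic() - start, return_code


# Intended use (not executed here; PUBLIC_SEED is the hex string from §3):
#   run_with_budget(
#       ["modchallenge", "evaluate", "horner_rnn", "--total", "1100", "--seed", PUBLIC_SEED],
#       extra_env={"CUDA_VISIBLE_DEVICES": ""},
#   )

# Smoke test of the wrapper itself with a trivial command.
elapsed, code = run_with_budget([sys.executable, "-c", "pass"], budget_s=60.0)
print(f"{elapsed:.2f}s, return code {code}")
```
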
---

## 7. Compliance, in one line each

(Full argument in `README.md` → "Compliance split" / "Status under the rules".)

- Preprocess hooks are pass-through identities, so there is no cross-argument leakage (gate 6).
- `predict_digits` reduces only `a % p`, `b % p` (two-operand normalisation, allowed) and never
  forms the three-argument modular product directly.
- No add/multiply/compare-against-`p` is hand-coded; the forward pass is tokenise → learned cell
  → quantise → readout.
- **Principle 2, measured:** perturbing trained weights collapses accuracy to the untrained
  floor (`exploration/compliance_perturb.py`), i.e. the arithmetic lives in the parameters.
- Passes `modchallenge check`; deterministic.