etwk commited on
Commit ·
93db0e9
1
Parent(s): 8ed8c45
Add EVALUATION.md: harness gates, per-tier sampling, seed-robustness table
Browse files- EVALUATION.md +178 -0
- README.md +4 -1
EVALUATION.md
ADDED
|
@@ -0,0 +1,178 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Evaluation reference
|
| 2 |
+
|
| 3 |
+
This document records **how `horner_rnn` is evaluated, how to reproduce the score, and how the
|
| 4 |
+
result behaves across different evaluation seeds and prime ranges** — i.e. how far the public
|
| 5 |
+
`1.000` generalises. It complements `README.md` (which documents how the weights were obtained).
|
| 6 |
+
|
| 7 |
+
All numbers here are reproducible from this repo plus the official challenge harness
|
| 8 |
+
(`modchallenge`); the per-tier sampling facts are read directly from the harness source cited
|
| 9 |
+
inline.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## 1. What the harness scores (the eight gates)
|
| 14 |
+
|
| 15 |
+
`modchallenge evaluate` runs a fixed pipeline
|
| 16 |
+
(`src/modchallenge/evaluation/pipeline.py:evaluate_local`). Every gate below must pass; the two
|
| 17 |
+
ranking keys are produced only at the end.
|
| 18 |
+
|
| 19 |
+
| # | Gate | This submission | Bound / spec |
|
| 20 |
+
|---|---|---|---|
|
| 21 |
+
| 1 | **Manifest validation** | `entry_class=model.HornerRNN`, `output_base=2` | well-formed `manifest.json` |
|
| 22 |
+
| 2 | **Artifact size** | **0.04 GB** | ≤ 20 GB (`EvalConfig.max_artifact_bytes`) |
|
| 23 |
+
| 3 | **Static analysis / compliance** (`security.static_check`) | 0 findings → *passed* | no hand-coded arithmetic on `p` |
|
| 24 |
+
| 4 | **Test-set generation** | 1100 problems = 100 × 11 tiers (0–10) | `total_problems` |
|
| 25 |
+
| 5 | **Model load** | ~2 s | must import + load |
|
| 26 |
+
| 6 | **Preprocess isolation** (`check_preprocess_isolation`) | passes — hooks are stateless identities | per-argument, no cross-leak |
|
| 27 |
+
| 7 | **Determinism** (`check_determinism`, 10 end-to-end re-runs) | `deterministic: true` | required to be ranked |
|
| 28 |
+
| 8 | **Inference within budget** | 173.6 s, all 11 tiers completed | ≤ 300 s wall (`timeout_seconds`) |
|
| 29 |
+
|
| 30 |
+
A tier that does not *finish* within the 300 s budget is scored **0** for that tier
|
| 31 |
+
(`run_inference` discards partial tiers) — so latency is a correctness gate, not just a
|
| 32 |
+
performance note (see §6).
|
| 33 |
+
|
| 34 |
+
**Ranking keys** (`evaluation/results.py`):
|
| 35 |
+
- `highest_tier_above_90` — the **maximum** scored tier (id > 0) with accuracy ≥ 0.90. Not a
|
| 36 |
+
contiguous run; it depends only on the single highest tier clearing 0.90.
|
| 37 |
+
- `overall_accuracy` — mean accuracy over **completed scored tiers 1–10**. Tier 0 is excluded
|
| 38 |
+
from both keys.
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## 2. How each tier samples its range
|
| 43 |
+
|
| 44 |
+
Private evaluation uses `EvalConfig` (`config.py`), which draws **5 distinct primes per tier**
|
| 45 |
+
(`primes_per_tier = 5`) and **4 edge cases** (`a=0, b=0, a=1, b=1`). The public benchmark uses
|
| 46 |
+
the same structure with a fixed seed. So each tier's 100 problems are:
|
| 47 |
+
|
| 48 |
+
> 4 edge cases + 96 problems spread over **5 distinct primes** (≈ 19 operand-pairs/prime).
|
| 49 |
+
|
| 50 |
+
A consequence worth stating plainly: **one weak prime ≈ 20 % of a tier.** This is why
|
| 51 |
+
robustness has to be measured by *resampling the 5 primes across seeds*, not by reading a single
|
| 52 |
+
seed (§5).
|
| 53 |
+
|
| 54 |
+
| Tier | Prime range `[2^min, 2^max)` | Operand range `a,b ∈ [0, 2^k)` |
|
| 55 |
+
|---|---|---|
|
| 56 |
+
| 1 | fixed primes {2,3,5,7} | 2³² |
|
| 57 |
+
| 2 | 2⁴ … 2⁸ | 2⁴⁸ |
|
| 58 |
+
| 3 | 2⁹ … 2¹⁶ | 2⁶⁴ |
|
| 59 |
+
| 4 | 2¹⁷ … 2³² | 2⁹⁶ |
|
| 60 |
+
| 5 | 2³³ … 2⁶⁴ | 2¹²⁸ |
|
| 61 |
+
| 6 | 2⁶⁵ … 2¹²⁸ | 2²⁵⁶ |
|
| 62 |
+
| 7 | 2¹²⁹ … 2²⁵⁶ | 2⁵¹² |
|
| 63 |
+
| 8 | 2²⁵⁷ … 2⁵¹² | 2¹⁰²⁴ |
|
| 64 |
+
| 9 | 2⁵¹³ … 2¹⁰²⁴ | 2²⁰⁴⁸ |
|
| 65 |
+
| 10 | 2¹⁰²⁵ … 2²⁰⁴⁸ | 2⁴⁰⁹⁶ |
|
| 66 |
+
|
| 67 |
+
Primes are drawn **value-uniform** (`randrange(2^min, 2^max)` then `nextprime`), which
|
| 68 |
+
concentrates mass at the top of each tier's bit-range. The weights are trained to match that
|
| 69 |
+
distribution (see README, "Width-robustness audit").
|
| 70 |
+
|
| 71 |
+
Tier 0 is a separate **pure-multiplication** diagnostic (`p` chosen so `a·b < p`, i.e. no
|
| 72 |
+
reduction); it is **excluded from both ranking keys** and so does not affect the score.
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
## 3. Reproducing the deterministic public score
|
| 77 |
+
|
| 78 |
+
The public benchmark seed is the hex of `b'modchallenge-public-benchmark-v1'`. The CLI parses
|
| 79 |
+
`--seed` as `bytes.fromhex(...)`, and an **empty `--seed` means a random draw** — so the explicit
|
| 80 |
+
seed is required for the reproducible number.
|
| 81 |
+
|
| 82 |
+
```bash
|
| 83 |
+
PUBLIC_SEED=$(python -c "print(b'modchallenge-public-benchmark-v1'.hex())")
|
| 84 |
+
# = 6d6f646368616c6c656e67652d7075626c69632d62656e63686d61726b2d7631
|
| 85 |
+
modchallenge evaluate horner_rnn --total 1100 --seed "$PUBLIC_SEED"
|
| 86 |
+
```
|
| 87 |
+
|
| 88 |
+
### Full public-seed result
|
| 89 |
+
|
| 90 |
+
```
|
| 91 |
+
overall_accuracy = 1.0
|
| 92 |
+
highest_tier_above_90 = 10 (the maximum tier)
|
| 93 |
+
deterministic = true
|
| 94 |
+
artifact size = 0.04 GB
|
| 95 |
+
```
|
| 96 |
+
|
| 97 |
+
| Tier | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|
| 98 |
+
|---|---|---|---|---|---|---|---|---|---|---|---|
|
| 99 |
+
| accuracy | 0.70\* | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
|
| 100 |
+
|
| 101 |
+
\* tier 0 is the unscored pure-multiplication diagnostic; it enters neither ranking key.
|
| 102 |
+
|
| 103 |
+
Cumulative wall time (GPU): tiers 0–8 finish in ~23 s, tier 9 at ~53 s, tier 10 at **173.6 s** —
|
| 104 |
+
the 2048-step tier-10 scan is essentially the entire cost.
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
|
| 108 |
+
## 4. What is and isn't guaranteed
|
| 109 |
+
|
| 110 |
+
- **No formal guarantee of exact arithmetic.** The cell is a *learned* approximation of the
|
| 111 |
+
Horner step, not certified modular arithmetic; there is no proof it is 100 % on every input.
|
| 112 |
+
- **Generalisation is structural, not memorised.** One shared cell runs the same
|
| 113 |
+
width-independent reduction circuit at every value, so a different prime/operand in the same
|
| 114 |
+
tier is the same operation on different numbers — not out-of-distribution. Held-out-prime
|
| 115 |
+
validation tracks training accuracy (no memorisation gap).
|
| 116 |
+
- **The ranked outcome is robust** (measured in §5): `highest_tier_above_90 = 10` holds with very
|
| 117 |
+
high probability across seeds; `overall_accuracy` stays ≥ 0.997. What is *not* guaranteed is the
|
| 118 |
+
cosmetic gap between ~0.997 and a literal 1.000 on the secondary key.
|
| 119 |
+
|
| 120 |
+
---
|
| 121 |
+
|
| 122 |
+
## 5. Seed / range robustness (the generalisation evidence)
|
| 123 |
+
|
| 124 |
+
The public `1.000` is **one draw**. To test generalisation, the scoring harness was run on five
|
| 125 |
+
**different** secret seeds (each draws a fresh set of 5 primes/tier + operands across every
|
| 126 |
+
range) — faithful private-eval simulations, since the private eval also uses `primes_per_tier = 5`.
|
| 127 |
+
|
| 128 |
+
| Seed (hex) | t1–t7 | t8 | t9 | t10 | **overall** | **htop** | det |
|
| 129 |
+
|---|---|---|---|---|---|---|---|
|
| 130 |
+
| `…public…` | 1.00 | 1.00 | 1.00 | 1.00 | **1.0000** | **10** | ✓ |
|
| 131 |
+
| `1111…` | 1.00 | 0.99 | 1.00 | 0.98 | 0.9970 | **10** | ✓ |
|
| 132 |
+
| `2222…` | 1.00 | 0.99 | 1.00 | 1.00 | 0.9990 | **10** | ✓ |
|
| 133 |
+
| `deadbeef…` | 1.00 | 0.97 | 1.00 | 1.00 | 0.9970 | **10** | ✓ |
|
| 134 |
+
| `cafef00d…` | 1.00 | 1.00 | 0.99 | 0.99 | 0.9980 | **10** | ✓ |
|
| 135 |
+
| `a5a5…` | 1.00 | 1.00 | 1.00 | 1.00 | 1.0000 | **10** | ✓ |
|
| 136 |
+
|
| 137 |
+
Reproduce any row with `modchallenge evaluate horner_rnn --total 1100 --seed <hex>`.
|
| 138 |
+
|
| 139 |
+
**Reading of the evidence:**
|
| 140 |
+
- **Primary key invariant:** `highest_tier_above_90 = 10` on 6/6 seeds. The worst *any* scored
|
| 141 |
+
tier reached was **0.97** — never near the 0.90 threshold.
|
| 142 |
+
- **Secondary key in a tight band:** overall 0.9970 – 1.0000, mean ≈ 0.9985. A random private
|
| 143 |
+
seed will most likely read ~0.997–0.999, not a literal 1.000.
|
| 144 |
+
- **All variation is confined to tiers 8–10** (257–2048-bit primes). Tiers 1–7 are perfectly
|
| 145 |
+
stable across every seed.
|
| 146 |
+
|
| 147 |
+
This matches the larger faithful 5-prime bootstrap on the shipped weights
|
| 148 |
+
(`diag_5prime_boot.py` in the research repo): `P(tier < 0.90) ≈ 0.000 %` for tiers 1–9 and
|
| 149 |
+
≈ 0.002 % for tier 10; `E[tier10] ≈ 0.991`, worst observed near-max tier-10 prime ≈ 0.875. A
|
| 150 |
+
40k-draw width sweep (`audit_width_robustness.py`) finds **no accuracy "knee"** anywhere in the
|
| 151 |
+
samplable range — the residual misses are rare per-`(a,b)` reduction-boundary events scattered
|
| 152 |
+
≈ uniformly, in the deep tail only.
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
|
| 156 |
+
## 6. Timing under the official clock
|
| 157 |
+
|
| 158 |
+
The 173.6 s above is **GPU** timing (batched `predict_digits_batch`). The budget is **300 s total**
|
| 159 |
+
for all 1100 problems, and tier 10's 2048-step scan dominates. The one delivery risk that is *not*
|
| 160 |
+
about correctness: if the official runner is **CPU-only**, the tier-10 scan can exceed the budget
|
| 161 |
+
and time out — which would zero the timed-out tiers and drop the primary key. Confirm the
|
| 162 |
+
runner's hardware (GPU vs CPU) and, if CPU, do a dress-rehearsal run against the 300 s budget
|
| 163 |
+
before relying on the GPU timing. The *correctness* result (§3, §5) is independent of this.
|
| 164 |
+
|
| 165 |
+
---
|
| 166 |
+
|
| 167 |
+
## 7. Compliance, in one line each
|
| 168 |
+
|
| 169 |
+
(Full argument in `README.md` → "Compliance split" / "Status under the rules".)
|
| 170 |
+
|
| 171 |
+
- Preprocess hooks are pass-through identities — no cross-argument leakage (gate 6).
|
| 172 |
+
- `predict_digits` reduces only `a % p`, `b % p` (two-operand normalisation, allowed) and never
|
| 173 |
+
forms the three-argument modular product directly.
|
| 174 |
+
- No add/multiply/compare-against-`p` is hand-coded; the forward pass is tokenise → learned cell
|
| 175 |
+
→ quantise → readout.
|
| 176 |
+
- **Principle 2, measured:** perturbing trained weights collapses accuracy to the untrained
|
| 177 |
+
floor (`exploration/compliance_perturb.py`) — the arithmetic lives in the parameters.
|
| 178 |
+
- Passes `modchallenge check`; deterministic.
|
README.md
CHANGED
|
@@ -267,7 +267,10 @@ P(acc<0.90) 0.002% / worst-prime 0.933 (primary key held). Public benchmark: tie
|
|
| 267 |
| **1100** | **1.000** | **10** (max) | True |
|
| 268 |
|
| 269 |
Per-tier at total=1100: tiers 1–10 all **1.00**
|
| 270 |
-
(overall_accuracy is the mean over tiers 1-10).
|
|
|
|
|
|
|
|
|
|
| 271 |
width's maximum — a separate regime, not in overall_accuracy) is **0.70** on this fixed public
|
| 272 |
seed. Inference for all 1100 problems is ~174s **on GPU** (the 2048-step tier-10 scan is the bulk),
|
| 273 |
within the 300s budget; the rules' evaluation guidance assumes GPU batching via
|
|
|
|
| 267 |
| **1100** | **1.000** | **10** (max) | True |
|
| 268 |
|
| 269 |
Per-tier at total=1100: tiers 1–10 all **1.00**
|
| 270 |
+
(overall_accuracy is the mean over tiers 1-10). See **[`EVALUATION.md`](EVALUATION.md)** for the
|
| 271 |
+
full evaluation reference: the harness gates, the per-tier sampling structure, the exact
|
| 272 |
+
reproduction command, and a six-seed robustness table showing `highest_tier_above_90 = 10` holds
|
| 273 |
+
across different secret seeds. Tier 0 (pure multiplication, primes near each
|
| 274 |
width's maximum — a separate regime, not in overall_accuracy) is **0.70** on this fixed public
|
| 275 |
seed. Inference for all 1100 problems is ~174s **on GPU** (the 2048-step tier-10 scan is the bulk),
|
| 276 |
within the 300s budget; the rules' evaluation guidance assumes GPU batching via
|