johnsonchromia committed on
Commit ccb509a · verified · 1 Parent(s): 5d7b0db

Drop report-md link from card

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -50,8 +50,7 @@ GPQA-Diamond and BBH macro — the lm-eval-harness "release" suite at
   `--limit 200` — both land **within stderr of base**: E4B's larger capacity
   absorbs the SFT shift cleanly. The −3.3 pt MMLU dip on the limit-100 fast
   pass is at the edge of that suite's resolution and is not corroborated by
-  the release pass. Full report — methodology, per-subtask BBH breakdown,
-  what we did not measure — in [`release-benchmark-report.md`](release-benchmark-report.md).
+  the release pass.
 
   **vs Unbound E2B (current ship):** +8 pp useful-compliance, −3 pp
   hallucination, **~5× the GSM8K math score**, cleaner KL (3.25 vs 3.76).