johnsonchromia committed on
Commit 5cbfb31 · verified · 1 parent: 0ce9060

Add release-suite benchmarks (GPQA-Diamond, BBH); fix Ollama pull command

Files changed (1)
  1. README.md +11 -3
README.md CHANGED
@@ -42,8 +42,16 @@ are at [`evalengine/unbound-e4b-GGUF`](https://huggingface.co/evalengine/unbound
  | TruthfulQA mc2 (`--limit 100`) | 0.439 | 0.486 | +4.7 pt |
  | MMLU (`--limit 100`, 61 subtasks avg) | ~0.425 | 0.392 | −3.3 pt |
  | GSM8K (flexible-extract, `--limit 100`) | 0.74 (limit 200) | 0.58 | regression mostly limit-noise |
  | KL divergence vs base | 0 | 3.25 | (SFT-expected) |

  **vs Unbound E2B (current ship):** +8 pp useful-compliance, −3 pp
  hallucination, **~5× the GSM8K math score**, cleaner KL (3.25 vs 3.76).
  Refusal rate is essentially the same (~2.7%).
@@ -58,9 +66,9 @@ Refusal rate is essentially the same (~2.7%).
  ## Use

  ```bash
- # on-device (GGUF)
- ollama pull hf.co/evalengine/unbound-e4b-GGUF
- ollama run hf.co/evalengine/unbound-e4b-GGUF
  ```

  ```python
 
  | TruthfulQA mc2 (`--limit 100`) | 0.439 | 0.486 | +4.7 pt |
  | MMLU (`--limit 100`, 61 subtasks avg) | ~0.425 | 0.392 | −3.3 pt |
  | GSM8K (flexible-extract, `--limit 100`) | 0.74 (limit 200) | 0.58 | regression mostly limit-noise |
+ | GPQA-Diamond (`--limit 200`) | 25.25% | 25.76% | +0.5 pt (within stderr) |
+ | BBH macro (24 tasks, `--limit 200`) | 54.26% | 53.45% | −0.8 pt (within stderr) |
  | KL divergence vs base | 0 | 3.25 | (SFT-expected) |

+ GPQA-Diamond and BBH macro — the lm-eval-harness "release" suite at
+ `--limit 200` — both land **within stderr of base**: E4B's larger capacity
+ absorbs the SFT shift cleanly. The −3.3 pt MMLU dip on the limit-100 fast
+ pass is at the edge of that suite's resolution and is not corroborated by
+ the release pass.
+
  **vs Unbound E2B (current ship):** +8 pp useful-compliance, −3 pp
  hallucination, **~5× the GSM8K math score**, cleaner KL (3.25 vs 3.76).
  Refusal rate is essentially the same (~2.7%).
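
The "within stderr" notes on the GPQA-Diamond and BBH rows can be sanity-checked with a rough binomial estimate. The sketch below is not the harness's own stderr computation: it assumes independent items and treats accuracy as a binomial proportion. The helper names are hypothetical, and the item count `n = 198` is an assumption inferred from the reported figures (25.25% and 25.76% match 50/198 and 51/198, which suggests `--limit 200` covered the whole GPQA-Diamond set).

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent items."""
    return math.sqrt(acc * (1.0 - acc) / n)


def within_stderr(base: float, tuned: float, n: int) -> bool:
    """True if |tuned - base| does not exceed the combined stderr of both runs."""
    combined = math.hypot(binomial_stderr(base, n), binomial_stderr(tuned, n))
    return abs(tuned - base) <= combined


# GPQA-Diamond row from the table; n = 198 is an assumed item count.
base_acc, tuned_acc, n_items = 0.2525, 0.2576, 198
print(f"stderr per run: {binomial_stderr(base_acc, n_items):.3f}")
print(f"within stderr:  {within_stderr(base_acc, tuned_acc, n_items)}")
```

Under these assumptions the per-run standard error is roughly 3 points, so a 0.5 pt gap sits well inside it. The same helper does not apply directly to the BBH macro or MMLU averages, which aggregate many subtasks.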
 
  ## Use

  ```bash
+ # on-device (Ollama Registry — single-file Q4_K_M, identity-grounded Modelfile)
+ ollama pull evalengine/unbound-e4b
+ ollama run evalengine/unbound-e4b
  ```

  ```python