Update README.md
README.md
CHANGED
@@ -16,12 +16,10 @@ This is a merge of pre-trained language models created using [mergekit](https://
 
 # First benchmarks
 
-**Interpretation:** Significant gains on language understanding & pragmatic reasoning (ARC-C/E, Wino, BoolQ, HellaSwag, TriviaQA) with stability on other skills. Math/code are not the optimization target; GSM8K stays essentially stable relative to the BitNet 1.58-bit baseline.
+**Interpretation:** Significant gains on language understanding & pragmatic reasoning (ARC-C/E, Wino, BoolQ, HellaSwag, TriviaQA) with stability on other skills. Math/code are not the optimization target; GSM8K stays essentially stable relative to the BitNet 1.58-bit baseline (58,38).
 All scores are reported in comparison with the original Microsoft BitNet b1.58 BF16 model.
 Evaluations were performed using LM Eval Harness, all results are fully reproducible.
 
-**ARC-Challenge:** 51.62 (First-ever ≥50 score for a model in the 2B category, i.e., >1.5B and <2.5B params)
-
 | Benchmark (metric) | microsoft/bitnet-b1.58-2B-4T-bf16 | bitnet-dpo-merged-modelstock7 |
 |------------------------------------|-----------------------------------|--------------------------------|
 | arc_challenge 0 shot | 47.95 | **51.62** |
@@ -39,6 +37,7 @@ Evaluations were performed using LM Eval Harness, all results are fully reproduc
 | mmlu 5 shot acc | 52.96 | **53.39** |
 | commonsense_qa 10 shot acc | **71.17** | 70.76 |
 
+**ARC-Challenge:** 51.62 (First-ever ≥50 score for a model in the 2B category, i.e., >1.5B and <2.5B params)
 
 | Model | arc_challenge (0 shot) |
 |----------------------------------------------------|------------------------|