Update README.md
README.md
CHANGED
@@ -69,34 +69,21 @@ trl chat --model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct --device cpu

 In this section, we report the evaluation results of SmolLM2. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.

-##
-
-|
-
-|
-| ARC (
-|
-|
-|
-|
-|
-|
-|
-
-
-## Instruction model
-
-| Metric | SmolLM2-135M-Instruct | SmolLM-135M-Instruct |
-|:-----------------------------|:---------------------:|:--------------------:|
-| IFEval (Average prompt/inst) | **29.9** | 17.2 |
-| MT-Bench | **19.8** | 16.8 |
-| HellaSwag | **40.9** | 38.9 |
-| ARC (Average) | **37.3** | 33.9 |
-| PIQA | **66.3** | 64.0 |
-| MMLU (cloze) | **29.3** | 28.3 |
-| BBH (3-shot) | **28.2** | 25.2 |
-| GSM8K (5-shot) | 1.4 | 1.4 |
-
+## Instruction model Vs. Humanized model
+
+| Metric | SmolLM2-135M-Instruct | SmolLM2-135M-Humanized |
+|:-----------------------------|:---------------------:|:----------------------:|
+| MMLU | **23.1** | 23.1 |
+| ARC (Easy) | **54.3** | 50.2 |
+| ARC (Challenge) | **26.1** | 25.3 |
+| HellaSwag | **43.0** | 41.6 |
+| PIQA | **67.2** | 66.2 |
+| WinoGrande | **52.5** | 52.2 |
+| TriviaQA | **0.3** | 0.1 |
+| GSM8K | 0.2 | **0.5** |
+| OpenBookQA | **32.6** | 32.0 |
+| CommonSenseQA | **4.8** | 2.2 |
+| QuAC (F1) | **14.1** | 11.0 |


 ## Limitations
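
To make the added comparison table easier to check, the following is a minimal sketch that recomputes the per-metric gap between the two columns. The scores are copied from the added rows of the diff; the names `SCORES`, `score_deltas`, and `winners` are hypothetical helpers introduced for illustration and are not part of the README or of lighteval.

```python
# Scores copied from the added "Instruction model Vs. Humanized model" table.
# Each entry maps a metric to (SmolLM2-135M-Instruct, SmolLM2-135M-Humanized).
SCORES = {
    "MMLU": (23.1, 23.1),
    "ARC (Easy)": (54.3, 50.2),
    "ARC (Challenge)": (26.1, 25.3),
    "HellaSwag": (43.0, 41.6),
    "PIQA": (67.2, 66.2),
    "WinoGrande": (52.5, 52.2),
    "TriviaQA": (0.3, 0.1),
    "GSM8K": (0.2, 0.5),
    "OpenBookQA": (32.6, 32.0),
    "CommonSenseQA": (4.8, 2.2),
    "QuAC (F1)": (14.1, 11.0),
}


def score_deltas(scores):
    """Return {metric: humanized - instruct}, rounded to one decimal place."""
    return {
        metric: round(humanized - instruct, 1)
        for metric, (instruct, humanized) in scores.items()
    }


def winners(scores):
    """Label each metric with the higher-scoring column, or 'tie' if equal."""
    labels = {}
    for metric, (instruct, humanized) in scores.items():
        if instruct > humanized:
            labels[metric] = "Instruct"
        elif humanized > instruct:
            labels[metric] = "Humanized"
        else:
            labels[metric] = "tie"
    return labels


if __name__ == "__main__":
    deltas = score_deltas(SCORES)
    labels = winners(SCORES)
    for metric in SCORES:
        print(f"{metric:<16} {deltas[metric]:+5.1f}  {labels[metric]}")
```

A check like this flags rows where the bolding in the table does not match a strict "higher score" rule, for example a row where both columns report the same value.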