Here are the evaluation results for DCLM-Baseline-7B on various tasks:

| Task | Score |
|------|-------|
| MMLU (zero-shot) | 0.5766 |
| MMLU (few-shot) | 0.6372 |
| HellaSwag (zero-shot) | 0.7987 |
| HellaSwag | 0.8043 |
| Jeopardy | 0.4745 |
| TriviaQA | 0.5270 |
| GSM8K (CoT) | 0.0250 |
| AGI Eval SAT Math (CoT) | 0.0136 |
| AQuA (CoT) | 0.0490 |
| SVAMP (CoT) | 0.4900 |
| BigBench QA Wikidata | 0.7120 |
| ARC Easy | 0.8220 |
| ARC Challenge | 0.5990 |
| BigBench Misconceptions | 0.6986 |
| COPA | 0.8500 |
| SIQA | 0.8291 |
| CommonsenseQA | 0.8018 |
| PIQA | 0.8128 |
| OpenBookQA | 0.4540 |
| BigBench Novel Concepts | 0.7188 |
| BigBench Strange Stories | 0.7586 |
| BigBench Strategy QA | 0.6173 |
| LAMBADA | 0.8220 |
| Winograd | 0.8828 |
| Winogrande | 0.7269 |
| BigBench Conlang Translation | 0.0244 |
| BigBench Language Identification | 0.5219 |
| BigBench Conceptual Combinations | 0.6990 |
| BigBench Elementary Math QA | 0.3431 |
| BigBench Dyck Languages | 0.4930 |
| AGI Eval LSAT AR | 0.2435 |
| BigBench CS Algorithms | 0.6121 |
| BigBench Logical Deduction | 0.3620 |
| BigBench Operators | 0.4857 |
| BigBench Repeat Copy Logic | 0.4063 |
| Simple Arithmetic (no spaces) | 0.2940 |
| Simple Arithmetic (with spaces) | 0.3110 |
| MathQA | 0.3098 |
| LogiQA | 0.4132 |
| PubMedQA | 0.7060 |
| SQuAD | 0.5856 |
| AGI Eval LSAT RC | 0.6716 |
| AGI Eval LSAT LR | 0.5392 |
| CoQA | 0.4074 |
| BigBench Understanding Fables | 0.6825 |
| BoolQ | 0.8343 |
| AGI Eval SAT EN | 0.7670 |
| Winogender MC (Female) | 0.6000 |
| Winogender MC (Male) | 0.5500 |
| Enterprise PII Classification | 0.7676 |
| BBQ | 0.6912 |
| GPQA Main | 0.2612 |
| GPQA Diamond | 0.2475 |

Note: All scores are presented as decimal values between 0 and 1, representing the proportion of correct answers or the model's performance on each task.
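Since the table reports raw decimal scores, a minimal sketch of converting a handful of them to display-friendly percentages (the `scores` dict below is an illustrative subset of the table, not part of any released API):

```python
# A few scores from the table above, keyed by task name (illustrative subset).
scores = {
    "MMLU (zero-shot)": 0.5766,
    "ARC Easy": 0.8220,
    "GSM8K (CoT)": 0.0250,
}

# Convert each decimal score in [0, 1] to a percentage string for display.
as_percent = {task: f"{value * 100:.2f}%" for task, value in scores.items()}

for task, pct in as_percent.items():
    print(f"{task}: {pct}")  # e.g. "MMLU (zero-shot): 57.66%"
```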

## Limitations and Biases