Update README.md
README.md CHANGED

@@ -65,61 +65,62 @@ Our data encompasses examples of a length up to 16384 tokens, further enhancing
## Evaluation

We ran all benchmarks using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) with `--apply_chat_template`.
-For comparison, we ran the same benchmarks on the base model as well, in the same environment and with the same parameters.
+For comparison, we ran the same benchmarks on the base model and Llama-3.1-8B-Instruct as well, in the same environment and with the same parameters.

### English Benchmarks

-| Benchmark | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
-| --- | --- | --- |
-| hellaswag (acc_norm) | 71.9% | **77.6%** |
-| winogrande (acc) | 69.8% | **75.2%** |
-| openbookqa (acc_norm) | 45.8% | **47.0%** |
-| commonsense_qa (acc) | 74.4% | **75.4%** |
-| truthfulqa_mc1 (acc) | 39.66% | **41.5%** |
-| mmlu (acc) | 64.9% | **66.5%** |
-| triviaqa (exact_match) | 12.3% | **23.99%** |
-| agieval (acc) | 36.6% | **39.1%** |
-| arc_challenge (acc_norm) | 52.5% | **54.4%** |
-| arc_easy (acc_norm) | 74.1% | **76.0%** |
-| piqa (acc_norm) | 78.9% | **81.5%** |
-| leaderboard_bbh (acc_norm) | 49.1% | **53.0%** |
-| leaderboard_gpqa (acc_norm) | **30.6%** | 29.4% |
-| leaderboard_ifeval (inst_level_loose_acc) | 72.8% | **75.1%** |
-| leaderboard_mmlu_pro (acc) | **35.1%** | 33.67% |
-| leaderboard_musr (acc_norm) | 39.3% | **40.2%** |
+| Benchmark | Llama-3.1-8B-Instruct | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
+| --- | --- | --- | --- |
+| hellaswag (acc_norm) | 72.6% | 71.9% | **77.6%** |
+| winogrande (acc) | 68.0% | 69.8% | **75.2%** |
+| openbookqa (acc_norm) | **49.0%** | 45.8% | 47.0% |
+| commonsense_qa (acc) | 64.9% | 74.4% | **75.4%** |
+| truthfulqa_mc1 (acc) | 40.4% | 39.66% | **41.5%** |
+| mmlu (acc) | 63.2% | 64.9% | **66.5%** |
+| triviaqa (exact_match) | 5.3% | 12.3% | **23.99%** |
+| agieval (acc) | 36.3% | 36.6% | **39.1%** |
+| arc_challenge (acc_norm) | 54.1% | 52.5% | **54.4%** |
+| arc_easy (acc_norm) | 75.7% | 74.1% | **76.0%** |
+| piqa (acc_norm) | 79.6% | 78.9% | **81.5%** |
+| leaderboard_bbh (acc_norm) | 37.4% | 49.1% | **53.0%** |
+| leaderboard_gpqa (acc_norm) | 28.5% | **30.6%** | 29.4% |
+| leaderboard_ifeval (inst_level_loose_acc) | **84.7%** | 72.8% | 75.1% |
+| leaderboard_mmlu_pro (acc) | 16.2% | **35.1%** | 33.67% |
+| leaderboard_musr (acc_norm) | 38.8% | 39.3% | **40.2%** |

### Multilingual Benchmarks

-| Benchmark | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
-| --- | --- | --- |
-| global_mmlu_full (acc) | | |
-| - de | 55.8% | **57.5%** |
-| - en | 63.1% | **63.8%** |
-| - es | 58.1% | **58.9%** |
-| - fr | 56.3% | **58.1%** |
-| - it | 58.1% | **59.6%** |
-| - ja | 50.0% | **51.0%** |
-| - pt | 43.5% | **55.7%** |
-| - ru | 54.9% | **55.0%** |
-| - zh | 52.2% | **55.6%** |
-| arc_challenge_mt (acc_norm) | | |
-| - de | 42.6% | **46.8%** |
-| - es | 45.6% | **47.3%** |
-| - it | 44.3% | **46.7%** |
-| - pt | 42.3% | **46.8%** |
-| xnli (acc) | | |
-| - de | **47.6%** | 47.1% |
-| - en | 57.3% | **57.8%** |
-| - es | 45.0% | **47.0%** |
-| - fr | 38.5% | **40.0%** |
-| - ru | **41.8%** | 38.6% |
-| - zh | **36.3%** | 36.1% |
-| xquad (f1) | | |
-| - de | 22.7% | **35.6%** |
-| - en | 21.8% | **29.9%** |
-| - es | 17.6% | **29.6%** |
-| - ru | 24.6% | **37.3%** |
-| - zh | 10.0% | **16.7%** |
+| Benchmark | Llama-3.1-8B-Instruct | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
+| --- | --- | --- | --- |
+| global_mmlu_full (acc) | | | |
+| - de | 48.2% | 55.8% | **57.5%** |
+| - en | 60.0% | 63.1% | **63.8%** |
+| - es | 54.7% | 58.1% | **58.9%** |
+| - fr | 48.3% | 56.3% | **58.1%** |
+| - it | 51.0% | 58.1% | **59.6%** |
+| - ja | 47.4% | 50.0% | **51.0%** |
+| - pt | 23.0% | 43.5% | **55.7%** |
+| - ru | 41.4% | 54.9% | **55.0%** |
+| - zh | 49.7% | 52.2% | **55.6%** |
+| arc_challenge_mt (acc_norm) | | | |
+| - de | 39.9% | 42.6% | **46.8%** |
+| - es | 42.8% | 45.6% | **47.3%** |
+| - it | 43.9% | 44.3% | **46.7%** |
+| - pt | 41.9% | 42.3% | **46.8%** |
+| xnli (acc) | | | |
+| - de | **48.1%** | 47.6% | 47.1% |
+| - en | 52.4% | 57.3% | **57.8%** |
+| - es | 46.3% | 45.0% | **47.0%** |
+| - fr | **51.6%** | 38.5% | 40.0% |
+| - ru | **48.1%** | 41.8% | 38.6% |
+| - zh | **40.3%** | 36.3% | 36.1% |
+| xquad (f1) | | | |
+| - de | 30.4% | 22.7% | **35.6%** |
+| - en | **35.0%** | 21.8% | 29.9% |
+| - es | **31.2%** | 17.6% | 29.6% |
+| - ru | **39.6%** | 24.6% | 37.3% |
+| - zh | **28.8%** | 10.0% | 16.7% |
+

## Model Card Authors [optional]
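
The evaluation setup named in the diff (lm-eval with `--apply_chat_template`) corresponds to a command along the lines of the sketch below. This is a minimal illustration, not the exact command behind these numbers: the `hf` backend, model ID, task list, dtype, and batch size are assumptions; only lm-eval and `--apply_chat_template` come from the card itself.

```bash
# Sketch of an lm-eval run with chat templating enabled.
# Assumptions: Hugging Face backend, bfloat16 weights, and an
# illustrative subset of the tasks from the tables above.
pip install lm-eval

lm_eval \
  --model hf \
  --model_args pretrained=mistralai/Mistral-Nemo-Instruct-2407,dtype=bfloat16 \
  --tasks hellaswag,winogrande,mmlu,arc_challenge,arc_easy,piqa \
  --apply_chat_template \
  --batch_size auto \
  --output_path results/
```

Swapping the `pretrained=` value for each of the three models while holding everything else fixed matches the card's claim of running all models in the same environment with the same parameters.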