LenDigLearn committed on
Commit 79a76dc · verified · 1 Parent(s): a3a4e3f

Update README.md

Files changed (1):
  1. README.md +50 -49
README.md CHANGED
@@ -65,61 +65,62 @@ Our data encompasses examples of a length up to 16384 tokens, further enhancing
  ## Evaluation

  We ran all benchmarks using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) with `--apply_chat_template`.
- For comparison, we performed the same benchmarks on the base model as well, in the exact same environment with the same parameters.
+ For comparison, we ran the same benchmarks on the base model and on Llama-3.1-8B-Instruct, in the same environment and with the same parameters.

  ### English Benchmarks

- | Benchmark | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
- | --- | --- | --- |
- | hellaswag (acc_norm) | 71.9% | **77.6%** |
- | winogrande (acc) | 69.8% | **75.2%** |
- | openbookqa (acc_norm) | 45.8% | **47.0%** |
- | commonsense_qa (acc) | 74.4% | **75.4%** |
- | truthfulqa_mc1 (acc) | 39.66% | **41.5%** |
- | mmlu (acc) | 64.9% | **66.5%** |
- | triviaqa (exact_match) | 12.3% | **23.99%** |
- | agieval (acc) | 36.6% | **39.1%** |
- | arc_challenge (acc_norm) | 52.5% | **54.4%** |
- | arc_easy (acc_norm) | 74.1% | **76.0%** |
- | piqa (acc_norm) | 78.9% | **81.5%** |
- | leaderboard_bbh (acc_norm) | 49.1% | **53.0%** |
- | leaderboard_gpqa (acc_norm) | **30.6%** | 29.4% |
- | leaderboard_ifeval (inst_level_loose_acc) | 72.8% | **75.1%** |
- | leaderboard_mmlu_pro (acc) | **35.1%** | 33.67% |
- | leaderboard_musr (acc_norm) | 39.3% | **40.2%** |
+ | Benchmark | Llama-3.1-8B-Instruct | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
+ | --- | --- | --- | --- |
+ | hellaswag (acc_norm) | 72.6% | 71.9% | **77.6%** |
+ | winogrande (acc) | 68.0% | 69.8% | **75.2%** |
+ | openbookqa (acc_norm) | **49.0%** | 45.8% | 47.0% |
+ | commonsense_qa (acc) | 64.9% | 74.4% | **75.4%** |
+ | truthfulqa_mc1 (acc) | 40.4% | 39.66% | **41.5%** |
+ | mmlu (acc) | 63.2% | 64.9% | **66.5%** |
+ | triviaqa (exact_match) | 5.3% | 12.3% | **23.99%** |
+ | agieval (acc) | 36.3% | 36.6% | **39.1%** |
+ | arc_challenge (acc_norm) | 54.1% | 52.5% | **54.4%** |
+ | arc_easy (acc_norm) | 75.7% | 74.1% | **76.0%** |
+ | piqa (acc_norm) | 79.6% | 78.9% | **81.5%** |
+ | leaderboard_bbh (acc_norm) | 37.4% | 49.1% | **53.0%** |
+ | leaderboard_gpqa (acc_norm) | 28.5% | **30.6%** | 29.4% |
+ | leaderboard_ifeval (inst_level_loose_acc) | **84.7%** | 72.8% | 75.1% |
+ | leaderboard_mmlu_pro (acc) | 16.2% | **35.1%** | 33.67% |
+ | leaderboard_musr (acc_norm) | 38.8% | 39.3% | **40.2%** |

  ### Multilingual Benchmarks

- | Benchmark | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
- | --- | --- | --- |
- | global_mmlu_full (acc) | | |
- | * de | 55.8% | **57.5%** |
- | * en | 63.1% | **63.8%** |
- | * es | 58.1% | **58.9%** |
- | * fr | 56.3% | **58.1%** |
- | * it | 58.1% | **59.6%** |
- | * ja | 50.0% | **51.0%** |
- | * pt | 43.5% | **55.7%** |
- | * ru | 54.9% | **55.0%** |
- | * zh | 52.2% | **55.6%** |
- | arc_challenge_mt (acc_norm) | | |
- | * de | 42.6% | **46.8%** |
- | * es | 45.6% | **47.3%** |
- | * it | 44.3% | **46.7%** |
- | * pt | 42.3% | **46.8%** |
- | xnli (acc) | | |
- | * de | **47.6%** | 47.1% |
- | * en | 57.3% | **57.8%** |
- | * es | 45.0% | **47.0%** |
- | * fr | 38.5% | **40.0%** |
- | * ru | **41.8%** | 38.6% |
- | * zh | **36.3%** | 36.1% |
- | xquad (f1) | | |
- | * de | 22.7% | **35.6%** |
- | * en | 21.8% | **29.9%** |
- | * es | 17.6% | **29.6%** |
- | * ru | 24.6% | **37.3%** |
- | * zh | 10.0% | **16.7%** |
+ | Benchmark | Llama-3.1-8B-Instruct | Mistral-Nemo-Instruct-2407 | educa-ai-nemo-dpo |
+ | --- | --- | --- | --- |
+ | global_mmlu_full (acc) | | | |
+ | - de | 48.2% | 55.8% | **57.5%** |
+ | - en | 60.0% | 63.1% | **63.8%** |
+ | - es | 54.7% | 58.1% | **58.9%** |
+ | - fr | 48.3% | 56.3% | **58.1%** |
+ | - it | 51.0% | 58.1% | **59.6%** |
+ | - ja | 47.4% | 50.0% | **51.0%** |
+ | - pt | 23.0% | 43.5% | **55.7%** |
+ | - ru | 41.4% | 54.9% | **55.0%** |
+ | - zh | 49.7% | 52.2% | **55.6%** |
+ | arc_challenge_mt (acc_norm) | | | |
+ | - de | 39.9% | 42.6% | **46.8%** |
+ | - es | 42.8% | 45.6% | **47.3%** |
+ | - it | 43.9% | 44.3% | **46.7%** |
+ | - pt | 41.9% | 42.3% | **46.8%** |
+ | xnli (acc) | | | |
+ | - de | **48.1%** | 47.6% | 47.1% |
+ | - en | 52.4% | 57.3% | **57.8%** |
+ | - es | 46.3% | 45.0% | **47.0%** |
+ | - fr | **51.6%** | 38.5% | 40.0% |
+ | - ru | **48.1%** | 41.8% | 38.6% |
+ | - zh | **40.3%** | 36.3% | 36.1% |
+ | xquad (f1) | | | |
+ | - de | 30.4% | 22.7% | **35.6%** |
+ | - en | **35.0%** | 21.8% | 29.9% |
+ | - es | **31.2%** | 17.6% | 29.6% |
+ | - ru | **39.6%** | 24.6% | 37.3% |
+ | - zh | **28.8%** | 10.0% | 16.7% |
+

  ## Model Card Authors [optional]
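
For anyone wanting to reproduce the evaluation described above, here is a minimal sketch using lm-eval's Python API (`simple_evaluate`), which mirrors the `--apply_chat_template` CLI flag the README mentions. The model id, task subset, and batch size are illustrative assumptions; the commit does not record the exact task configuration or generation settings.

```python
# Minimal reproduction sketch (assumptions noted inline) -- not the
# authors' exact setup, which is unspecified beyond the chat-template flag.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Base model shown here; swap in Llama-3.1-8B-Instruct or the DPO
    # model to fill the other comparison columns.
    model_args="pretrained=mistralai/Mistral-Nemo-Instruct-2407",
    tasks=["hellaswag", "winogrande", "arc_challenge"],  # illustrative subset
    apply_chat_template=True,  # equivalent to the `--apply_chat_template` flag
    batch_size=8,              # assumption; any workable batch size is fine
)

# results["results"] maps each task name to its metric dict, e.g. the
# acc_norm values reported in the tables above.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same run from the command line would look roughly like `lm_eval --model hf --model_args pretrained=<model> --tasks hellaswag,winogrande,arc_challenge --apply_chat_template`.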