satyamt committed
Commit ae8fdcc · verified · Parent: 81be9eb

Update README.md

Files changed (1): README.md (+63 −62)
README.md CHANGED
@@ -18,9 +18,10 @@ tags:
  # Evaluations

  ## Open LLM Leaderboard
- | Model | ARC |HellaSwag| MMLU |TruthfulQA|Winogrande|GSM8K|
- |---------------------------------------------------|----:|--------:|--------------------------|---------:|---------:|----:|
- |[MT7Bi](https://huggingface.co/Technoculture/MT7Bi)|50.94| 73.24|Error: File does not exist| 43.04| 72.06|22.52|
+ | Model | ARC |HellaSwag|TruthfulQA|Winogrande|GSM8K|
+ |---------------------------------------------------|----:|--------:|---------:|---------:|----:|
+ |[MT7Bi-sft (epoch 4)](https://huggingface.co/Technoculture/MT7Bi-sft)|54.1| 75.11| 43.08| 72.14|15.54|
+ |[MT7Bi-sft (epoch 1)](https://huggingface.co/Technoculture/MT7Bi)|50.94| 73.24| 43.04| 72.06|22.52|

  ### Model Evaluation Benchmark

@@ -62,100 +63,100 @@ tags:
  | ------------------ | -------- | --------- | ---- | ---------- | ---------- | -------- |
  | Orca-2-7b | **78.4** | 76.1 | 53.7 | **52.4** | **74.2** | **47.2** |
  | LLAMA-2-7b | 43.2 | **77.1** | 44.4 | 38.7 | 69.5 | 16 |
- | MT7Bi (1 epoch) | 50.94 | 73.24 | - | 43.04 | 72.06 | 22.52 |
+ | MT7Bi-sft | 54.1 | 75.11 | - | 43.08 | 72.14 | 15.54 |

- ### ARC: 50.94%
+ ### ARC: 54.1%
  | Task |Version| Metric | Value | |Stderr|
- |-------------|-------|--------------------|-------------|---|------|
- |arc_challenge|Yaml |acc,none | 0.48| | |
+ |-------------|------:|--------------------|-------------|---|------|
+ |arc_challenge| 1|acc,none | 0.51| | |
  | | |acc_stderr,none | 0.01| | |
- | | |acc_norm,none | 0.51| | |
+ | | |acc_norm,none | 0.54| | |
  | | |acc_norm_stderr,none| 0.01| | |
  | | |alias |arc_challenge| | |

- ### HellaSwag: 73.24%
+ ### HellaSwag: 75.11%
  | Task |Version| Metric | Value | |Stderr|
- |---------|-------|--------------------|---------|---|------|
- |hellaswag|Yaml |acc,none | 0.54| | |
+ |---------|------:|--------------------|---------|---|------|
+ |hellaswag| 1|acc,none | 0.57| | |
  | | |acc_stderr,none | 0| | |
- | | |acc_norm,none | 0.73| | |
+ | | |acc_norm,none | 0.75| | |
  | | |acc_norm_stderr,none| 0| | |
  | | |alias |hellaswag| | |

- ### TruthfulQA: 43.04%
+ ### TruthfulQA: 43.08%
  | Task |Version| Metric | Value | |Stderr|
  |--------------|-------|-----------------------|-----------------|---|------|
- |truthfulqa |N/A |bleu_max,none | 16.17| | |
- | | |bleu_max_stderr,none | 0.38| | |
- | | |bleu_acc,none | 0.36| | |
+ |truthfulqa |N/A |bleu_max,none | 18.31| | |
+ | | |bleu_max_stderr,none | 0.46| | |
+ | | |bleu_acc,none | 0.39| | |
  | | |bleu_acc_stderr,none | 0| | |
- | | |bleu_diff,none | -2.78| | |
- | | |bleu_diff_stderr,none | 0.26| | |
- | | |rouge1_max,none | 39.99| | |
- | | |rouge1_max_stderr,none | 0.64| | |
- | | |rouge1_acc,none | 0.36| | |
+ | | |bleu_diff,none | -1.63| | |
+ | | |bleu_diff_stderr,none | 0.39| | |
+ | | |rouge1_max,none | 41.99| | |
+ | | |rouge1_max_stderr,none | 0.71| | |
+ | | |rouge1_acc,none | 0.39| | |
  | | |rouge1_acc_stderr,none | 0| | |
- | | |rouge1_diff,none | -4.19| | |
- | | |rouge1_diff_stderr,none| 0.45| | |
- | | |rouge2_max,none | 24.52| | |
- | | |rouge2_max_stderr,none | 0.68| | |
- | | |rouge2_acc,none | 0.29| | |
+ | | |rouge1_diff,none | -2.88| | |
+ | | |rouge1_diff_stderr,none| 0.66| | |
+ | | |rouge2_max,none | 27.42| | |
+ | | |rouge2_max_stderr,none | 0.80| | |
+ | | |rouge2_acc,none | 0.32| | |
  | | |rouge2_acc_stderr,none | 0| | |
- | | |rouge2_diff,none | -4.90| | |
- | | |rouge2_diff_stderr,none| 0.55| | |
- | | |rougeL_max,none | 36.52| | |
- | | |rougeL_max_stderr,none | 0.64| | |
- | | |rougeL_acc,none | 0.33| | |
+ | | |rouge2_diff,none | -3.11| | |
+ | | |rouge2_diff_stderr,none| 0.78| | |
+ | | |rougeL_max,none | 38.81| | |
+ | | |rougeL_max_stderr,none | 0.71| | |
+ | | |rougeL_acc,none | 0.38| | |
  | | |rougeL_acc_stderr,none | 0| | |
- | | |rougeL_diff,none | -4.56| | |
- | | |rougeL_diff_stderr,none| 0.45| | |
+ | | |rougeL_diff,none | -3.01| | |
+ | | |rougeL_diff_stderr,none| 0.66| | |
  | | |acc,none | 0.33| | |
  | | |acc_stderr,none | 0.05| | |
  | | |alias |truthfulqa | | |
- |truthfulqa_gen|Yaml |bleu_max,none | 16.17| | |
- | | |bleu_max_stderr,none | 0.61| | |
- | | |bleu_acc,none | 0.36| | |
+ |truthfulqa_gen| 3|bleu_max,none | 18.31| | |
+ | | |bleu_max_stderr,none | 0.68| | |
+ | | |bleu_acc,none | 0.39| | |
  | | |bleu_acc_stderr,none | 0.02| | |
- | | |bleu_diff,none | -2.78| | |
- | | |bleu_diff_stderr,none | 0.51| | |
- | | |rouge1_max,none | 39.99| | |
- | | |rouge1_max_stderr,none | 0.80| | |
- | | |rouge1_acc,none | 0.36| | |
+ | | |bleu_diff,none | -1.63| | |
+ | | |bleu_diff_stderr,none | 0.62| | |
+ | | |rouge1_max,none | 41.99| | |
+ | | |rouge1_max_stderr,none | 0.84| | |
+ | | |rouge1_acc,none | 0.39| | |
  | | |rouge1_acc_stderr,none | 0.02| | |
- | | |rouge1_diff,none | -4.19| | |
- | | |rouge1_diff_stderr,none| 0.67| | |
- | | |rouge2_max,none | 24.52| | |
- | | |rouge2_max_stderr,none | 0.83| | |
- | | |rouge2_acc,none | 0.29| | |
+ | | |rouge1_diff,none | -2.88| | |
+ | | |rouge1_diff_stderr,none| 0.81| | |
+ | | |rouge2_max,none | 27.42| | |
+ | | |rouge2_max_stderr,none | 0.89| | |
+ | | |rouge2_acc,none | 0.32| | |
  | | |rouge2_acc_stderr,none | 0.02| | |
- | | |rouge2_diff,none | -4.90| | |
- | | |rouge2_diff_stderr,none| 0.74| | |
- | | |rougeL_max,none | 36.52| | |
- | | |rougeL_max_stderr,none | 0.80| | |
- | | |rougeL_acc,none | 0.33| | |
+ | | |rouge2_diff,none | -3.11| | |
+ | | |rouge2_diff_stderr,none| 0.88| | |
+ | | |rougeL_max,none | 38.81| | |
+ | | |rougeL_max_stderr,none | 0.84| | |
+ | | |rougeL_acc,none | 0.38| | |
  | | |rougeL_acc_stderr,none | 0.02| | |
- | | |rougeL_diff,none | -4.56| | |
- | | |rougeL_diff_stderr,none| 0.67| | |
+ | | |rougeL_diff,none | -3.01| | |
+ | | |rougeL_diff_stderr,none| 0.82| | |
  | | |alias | - truthfulqa_gen| | |
- |truthfulqa_mc1|Yaml |acc,none | 0.28| | |
+ |truthfulqa_mc1| 2|acc,none | 0.28| | |
  | | |acc_stderr,none | 0.02| | |
  | | |alias | - truthfulqa_mc1| | |
- |truthfulqa_mc2|Yaml |acc,none | 0.43| | |
+ |truthfulqa_mc2| 2|acc,none | 0.43| | |
  | | |acc_stderr,none | 0.01| | |
  | | |alias | - truthfulqa_mc2| | |

- ### Winogrande: 72.06%
+ ### Winogrande: 72.14%
  | Task |Version| Metric | Value | |Stderr|
- |----------|-------|---------------|----------|---|------|
- |winogrande|Yaml |acc,none | 0.72| | |
+ |----------|------:|---------------|----------|---|------|
+ |winogrande| 1|acc,none | 0.72| | |
  | | |acc_stderr,none| 0.01| | |
  | | |alias |winogrande| | |

- ### GSM8K: 22.52%
+ ### GSM8K: 15.54%
  |Task |Version| Metric |Value| |Stderr|
- |-----|-------|-----------------------------|-----|---|------|
- |gsm8k|Yaml |exact_match,get-answer | 0.23| | |
+ |-----|------:|-----------------------------|-----|---|------|
+ |gsm8k| 2|exact_match,get-answer | 0.16| | |
  | | |exact_match_stderr,get-answer| 0.01| | |
  | | |alias |gsm8k| | |

- Elapsed time: 03:56:55
+ Elapsed time: 04:06:36
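The per-task tables in this diff follow the output format of EleutherAI's lm-evaluation-harness (metric names such as `acc,none`, per-task `Version` numbers, and the Task/Metric/Value/Stderr layout). As a minimal sketch of how numbers in this format could be regenerated, assuming lm-evaluation-harness v0.4.x and that the checkpoint loads as a standard Hugging Face causal LM; the dtype, batch size, and task selection below are illustrative assumptions, not the recorded configuration of this run:

```python
# Sketch only: reproduces tables in the format shown above. The exact settings
# (few-shot counts, dtype, batch size) used for this commit's run are not
# recorded here and are assumptions.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Technoculture/MT7Bi-sft,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa", "winogrande", "gsm8k"],
    batch_size=8,  # assumption: tune to available GPU memory
)

# Renders the same Task | Version | Metric | Value | Stderr layout.
print(make_table(results))
```

Note that few-shot settings are not pinned above; the Open LLM Leaderboard uses task-specific few-shot counts (e.g. 25-shot ARC, 5-shot GSM8K), so absolute scores will differ unless the same configuration is reproduced.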
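For reference, a minimal sketch of loading the checkpoints compared in the leaderboard table; the repo IDs come from the table itself, while the dtype, device placement, and prompt are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID from the leaderboard table above; use "Technoculture/MT7Bi"
# for the epoch-1 checkpoint.
repo_id = "Technoculture/MT7Bi-sft"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # assumption: half precision for single-GPU use
    device_map="auto",          # requires the accelerate package
)

prompt = "Explain the difference between accuracy and normalized accuracy."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```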