nm-research committed
Commit 1aba6ea · verified · 1 Parent(s): da1d425

Update README.md

Files changed (1)
  1. README.md +84 -143
README.md CHANGED
@@ -90,170 +90,104 @@ This model was created by applying [LLM Compressor with calibration samples from
 
 ## Evaluation
 
- This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).
 <table>
 <thead>
 <tr>
 <th>Category</th>
 <th>Metric</th>
 <th>DeepSeek-R1-Distill-Qwen-32B</th>
- <th>DeepSeek-R1-Distill-Qwen-32B-NVFP4</th>
- <th>Recovery (%)</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td rowspan="7"><b>OpenLLM V1</b></td>
- <td>ARC Challenge</td>
- <td>67.66</td>
- <td>64.25</td>
- <td>94.94%</td>
 </tr>
 <tr>
- <td>GSM8K</td>
- <td>83.02</td>
- <td>84.84</td>
- <td>102.19%</td>
 </tr>
 <tr>
- <td>Hellaswag</td>
- <td>83.79</td>
- <td>83.28</td>
- <td>99.39%</td>
 </tr>
 <tr>
- <td>MMLU</td>
- <td>81.25</td>
- <td>80.79</td>
- <td>99.43%</td>
 </tr>
 <tr>
- <td>TruthfulQA-mc2</td>
- <td>58.37</td>
- <td>57.50</td>
- <td>98.51%</td>
 </tr>
 <tr>
- <td>Winogrande</td>
- <td>75.77</td>
- <td>76.40</td>
- <td>100.83%</td>
 </tr>
 <tr>
 <td><b>Average</b></td>
- <td><b>74.98</b></td>
- <td><b>74.51</b></td>
- <td><b>99.38%</b></td>
- </tr>
- <tr>
- <td rowspan="7"><b>OpenLLM V2</b></td>
- <td>MMLU-Pro</td>
- <td></td>
- <td></td>
- <td>%</td>
- </tr>
- <tr>
- <td>IFEval</td>
- <td></td>
- <td></td>
- <td>%</td>
- </tr>
- <tr>
- <td>BBH</td>
- <td></td>
- <td></td>
- <td>%</td>
- </tr>
- <tr>
- <td>Math-Hard</td>
- <td></td>
- <td></td>
- <td>%</td>
- </tr>
- <tr>
- <td>GPQA</td>
- <td></td>
- <td></td>
- <td>%</td>
- </tr>
- <tr>
- <td>MuSR</td>
- <td></td>
- <td></td>
- <td>%</td>
- </tr>
- <tr>
- <td><b>Average</b></td>
- <td><b></b></td>
- <td><b></b></td>
- <td><b>%</b></td>
 </tr>
 <tr>
 <td rowspan="4"><b>Reasoning</b></td>
- <td>Math 500</td>
- <td>95.09</td>
- <td>95.60</td>
- <td>100.54%</td>
- </tr>
- <tr>
- <td>GPQA (diamond)</td>
- <td>64.05</td>
- <td>61.11</td>
- <td>95.41%</td>
- </tr>
- <tr>
- <td>AIME25</td>
- <td>69.75 (AIME24)</td>
- <td>53.33</td>
- <td>76.45%</td>
- </tr>
- <tr>
- <td>LCB: Code Generation</td>
- <td>–</td>
- <td>54.29</td>
- <td>–</td>
 </tr>
 <tr>
- <td rowspan="6"><b>Coding</b></td>
- <td>HumanEval Instruct pass@1</td>
- <td>–</td>
- <td>–</td>
- <td>–</td>
 </tr>
 <tr>
- <td>HumanEval 64 Instruct pass@2</td>
- <td>–</td>
- <td>–</td>
- <td>–</td>
 </tr>
 <tr>
- <td>HumanEval 64 Instruct pass@8</td>
- <td>–</td>
- <td>–</td>
- <td>–</td>
- </tr>
- <tr>
- <td>HumanEval 64 Instruct pass@16</td>
- <td>–</td>
- <td>–</td>
- <td>–</td>
- </tr>
- <tr>
- <td>HumanEval 64 Instruct pass@32</td>
- <td>–</td>
- <td>–</td>
- <td>–</td>
 </tr>
 <tr>
- <td>HumanEval 64 Instruct pass@64</td>
- <td>–</td>
- <td>–</td>
- <td>–</td>
 </tr>
 </tbody>
 </table>
 
 
 ### Reproduction
 
 The results were obtained using the following commands:
@@ -273,34 +207,41 @@ lm_eval \
 ```
 
 
- #### OpenLLM v2
- ```
- lm_eval \
- --model vllm \
- --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=15000,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
- --apply_chat_template \
- --fewshot_as_multiturn \
- --tasks leaderboard \
- --batch_size auto
- ```
 
- #### HumanEval and HumanEval_64
 ```
 lm_eval \
 --model vllm \
 --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
 --apply_chat_template \
 --fewshot_as_multiturn \
- --tasks humaneval_instruct \
 --batch_size auto
 
 
- lm_eval \
- --model vllm \
- --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
- --apply_chat_template \
- --fewshot_as_multiturn \
- --tasks humaneval_64_instruct \
- --batch_size auto
 ```
 </details>
 
@@ -90,170 +90,104 @@ This model was created by applying [LLM Compressor with calibration samples from
 
 ## Evaluation
 
+ This model was evaluated on the well-known OpenLLM v1 and HumanEval_64 benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness). The reasoning evaluations were conducted using [lighteval](https://github.com/neuralmagic/lighteval).
+
+ ### Accuracy
+
 <table>
 <thead>
 <tr>
 <th>Category</th>
 <th>Metric</th>
 <th>DeepSeek-R1-Distill-Qwen-32B</th>
+ <th>DeepSeek-R1-Distill-Qwen-32B NVFP4</th>
+ <th>Recovery (%)</th>
 </tr>
 </thead>
 <tbody>
+ <!-- OpenLLM V1 -->
 <tr>
 <td rowspan="7"><b>OpenLLM V1</b></td>
+ <td>arc_challenge</td>
+ <td>63.48</td>
+ <td>62.12</td>
+ <td>97.86</td>
 </tr>
 <tr>
+ <td>gsm8k</td>
+ <td>86.88</td>
+ <td>88.32</td>
+ <td>101.66</td>
 </tr>
 <tr>
+ <td>hellaswag</td>
+ <td>83.51</td>
+ <td>82.38</td>
+ <td>98.65</td>
 </tr>
 <tr>
+ <td>mmlu</td>
+ <td>80.97</td>
+ <td>80.42</td>
+ <td>99.32</td>
 </tr>
 <tr>
+ <td>truthfulqa_mc2</td>
+ <td>56.82</td>
+ <td>55.75</td>
+ <td>98.12</td>
 </tr>
 <tr>
+ <td>winogrande</td>
+ <td>75.93</td>
+ <td>75.14</td>
+ <td>98.96</td>
 </tr>
 <tr>
 <td><b>Average</b></td>
+ <td><b>74.60</b></td>
+ <td><b>74.02</b></td>
+ <td><b>99.23</b></td>
 </tr>
+ <!-- Reasoning -->
 <tr>
 <td rowspan="4"><b>Reasoning</b></td>
+ <td>AIME24 (0-shot)</td>
+ <td>72.41</td>
+ <td>62.07</td>
+ <td>85.69</td>
 </tr>
 <tr>
+ <td>AIME25 (0-shot)</td>
+ <td>58.62</td>
+ <td>62.07</td>
+ <td>105.89</td>
 </tr>
 <tr>
+ <td>GPQA (Diamond, 0-shot)</td>
+ <td>68.02</td>
+ <td>65.48</td>
+ <td>96.27</td>
 </tr>
 <tr>
+ <td><b>Average</b></td>
+ <td><b>66.35</b></td>
+ <td><b>63.21</b></td>
+ <td><b>95.95</b></td>
 </tr>
+ <!-- Coding -->
 <tr>
+ <td><b>Coding</b></td>
+ <td>HumanEval_64 pass@2</td>
+ <td>90.00</td>
+ <td>89.32</td>
+ <td>99.24</td>
 </tr>
 </tbody>
 </table>
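
The Recovery column is simply the quantized model's score expressed as a percentage of the baseline's score. A minimal sketch of that arithmetic in Python, using the arc_challenge row above as input (illustrative only, not part of the README):

```python
def recovery(baseline: float, quantized: float) -> float:
    # Recovery (%) = quantized score / baseline score * 100.
    return 100.0 * quantized / baseline

# arc_challenge row from the table: baseline 63.48, NVFP4 62.12.
print(f"{recovery(63.48, 62.12):.2f}")  # -> 97.86
```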
 
 
+
 ### Reproduction
 
 The results were obtained using the following commands:
@@ -273,34 +207,41 @@ lm_eval \
 ```
 
 
+ #### HumanEval_64
 
 ```
 lm_eval \
 --model vllm \
 --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
 --apply_chat_template \
 --fewshot_as_multiturn \
+ --tasks humaneval_64_instruct \
 --batch_size auto
+ ```
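
HumanEval_64 draws 64 completions per problem, so the reported pass@2 is the unbiased estimator from the Codex paper (Chen et al., 2021), pass@k = 1 − C(n−c, k) / C(n, k), rather than a literal two-sample run. A minimal sketch of that estimator (not code from the harness itself; the sample counts below are hypothetical):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: probability that at least one of k samples
    # drawn (without replacement) from n total passes, given that
    # c of the n samples pass the unit tests.
    if n - c < k:
        return 1.0  # fewer than k failures: every draw of k contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 64 samples generated, 40 of them pass.
print(f"{pass_at_k(64, 40, 2):.4f}")  # -> 0.8631
```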
 
+ #### LightEval
+ ```
+ # --- model_args.yaml ---
+ cat > model_args.yaml <<'YAML'
+ model_parameters:
+   model_name: "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
+   dtype: auto
+   gpu_memory_utilization: 0.9
+   tensor_parallel_size: 2
+   max_model_length: 40960
+   generation_parameters:
+     seed: 42
+     temperature: 0.6
+     top_k: 20
+     top_p: 0.95
+     min_p: 0.0
+     max_new_tokens: 32768
+ YAML
+
+ lighteval vllm model_args.yaml \
+ "lighteval|aime24|0,lighteval|aime25|0,lighteval|gpqa:diamond|0" \
+ --max-samples -1 \
+ --output-dir out_dir
 
 ```
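
For a quick sanity check outside lighteval, the same generation settings can be replayed through vLLM's offline API. A minimal sketch under that assumption (the prompt is hypothetical; the values mirror model_args.yaml above):

```python
from vllm import LLM, SamplingParams

# Mirror generation_parameters from model_args.yaml.
params = SamplingParams(
    seed=42,
    temperature=0.6,
    top_k=20,
    top_p=0.95,
    min_p=0.0,
    max_tokens=32768,  # lighteval's max_new_tokens
)

# Mirror model_parameters from model_args.yaml.
llm = LLM(
    model="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",
    dtype="auto",
    tensor_parallel_size=2,
    max_model_len=40960,
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["How many positive integers divide 99?"], params)
print(outputs[0].outputs[0].text)
```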
  </details>