nm-research committed on
Commit 3ee1f4f · verified · 1 Parent(s): 37e8b49

Update README.md

Files changed (1):
  1. README.md +145 -34
README.md CHANGED
@@ -42,7 +42,7 @@ from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 
 max_model_len, tp_size = 4096, 1
-model_name = "neuralmagic-ent/granite-3.1-8b-instruct-FP8-dynamic"
+model_name = "neuralmagic/granite-3.1-8b-instruct-FP8-dynamic"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
 sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
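The hunk above only renames the checkpoint (the repository moved from `neuralmagic-ent` to `neuralmagic`). For context, the README snippet this hunk sits in goes on to build a chat prompt and generate; a minimal continuation sketch assuming vLLM's standard `generate` API (the example message is illustrative, not part of the commit):

```python
# Continuation of the snippet above (assumed context, not shown in this hunk).
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```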
@@ -65,6 +65,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 
 This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
 
+<details>
+<summary>Model Creation Code</summary>
 
 ```bash
 python quantize.py --model_id ibm-granite/granite-3.1-8b-instruct --save_path "output_dir/"
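The body of `quantize.py` is elided between this hunk and the next. For reference, a minimal FP8-dynamic recipe with llm-compressor typically looks like the sketch below; this is a hypothetical reconstruction under llm-compressor's documented `FP8_DYNAMIC` scheme, not the committed script:

```python
# Hypothetical sketch of a quantize.py core; the committed script is elided.
# FP8_DYNAMIC uses static per-channel weight scales and dynamic per-token
# activation scales, so no calibration dataset is needed.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.1-8b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize all Linear layers to FP8, keeping the output head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("output_dir/", save_compressed=True)
tokenizer.save_pretrained("output_dir/")
```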
@@ -110,16 +112,20 @@ def main():
 if __name__ == "__main__":
     main()
 ```
+</details>
 
 ## Evaluation
 
-The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
 
+<details>
+<summary>Evaluation Commands</summary>
+
 OpenLLM Leaderboard V1:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic-ent/granite-3.1-8b-instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --model_args pretrained="neuralmagic/granite-3.1-8b-instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
   --tasks openllm \
   --write_out \
   --batch_size auto \
@@ -127,11 +133,23 @@ lm_eval \
   --show_config
 ```
 
+OpenLLM Leaderboard V2:
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/granite-3.1-8b-instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --tasks leaderboard \
+  --write_out \
+  --batch_size auto \
+  --output_path output_dir \
+  --show_config
+```
+
 #### HumanEval
 ##### Generation
 ```
 python3 codegen/generate.py \
-  --model neuralmagic-ent/granite-3.1-8b-instruct-FP8-dynamic \
+  --model neuralmagic/granite-3.1-8b-instruct-FP8-dynamic \
   --bs 16 \
   --temperature 0.2 \
   --n_samples 50 \
@@ -141,45 +159,128 @@ python3 codegen/generate.py \
 ##### Sanitization
 ```
 python3 evalplus/sanitize.py \
-  humaneval/neuralmagic-ent--granite-3.1-8b-instruct-FP8-dynamic_vllm_temp_0.2
+  humaneval/neuralmagic--granite-3.1-8b-instruct-FP8-dynamic_vllm_temp_0.2
 ```
 ##### Evaluation
 ```
 evalplus.evaluate \
   --dataset humaneval \
-  --samples humaneval/neuralmagic-ent--granite-3.1-8b-instruct-FP8-dynamic_vllm_temp_0.2-sanitized
+  --samples humaneval/neuralmagic--granite-3.1-8b-instruct-FP8-dynamic_vllm_temp_0.2-sanitized
 ```
+</details>
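With `--n_samples 50` generations per task at temperature 0.2, pass@1 is estimated rather than measured from a single sample. The standard unbiased estimator (Chen et al., 2021), which EvalPlus-style harnesses commonly use, reduces to the mean fraction of passing samples at k=1; a small sketch (the sample counts are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=50 samples per task, pass@1 is just the fraction that pass:
print(pass_at_k(50, 35, 1))  # 0.7 (= 35/50)
```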
 
 ### Accuracy
 
-#### OpenLLM Leaderboard V1 evaluation scores
-
-| Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-FP8-dynamic |
-|-----------------------------------------|:-----------------------------------:|:---------------------------------------------------:|
-| ARC-Challenge (Acc-Norm, 25-shot) | 66.81 | 66.81 |
-| GSM8K (Strict-Match, 5-shot) | 64.52 | 66.64 |
-| HellaSwag (Acc-Norm, 10-shot) | 84.18 | 84.16 |
-| MMLU (Acc, 5-shot) | 65.52 | 65.36 |
-| TruthfulQA (MC2, 0-shot) | 60.57 | 60.52 |
-| Winogrande (Acc, 5-shot) | 80.19 | 79.95 |
-| **Average Score** | **70.30** | **70.57** |
-| **Recovery** | **100.00** | **100.39** |
-
-| Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-FP8-dynamic |
-|-----------------------------------------|:-----------------------------------:|:---------------------------------------------------:|
-| IFEval (Inst Level Strict Acc, 0-shot) | 74.10 | 73.62 |
-| BBH (Acc-Norm, 3-shot) | 53.19 | 53.26 |
-| Math-Hard (Exact-Match, 4-shot) | 14.77 | 16.79 |
-| GPQA (Acc-Norm, 0-shot) | 31.76 | 32.58 |
-| MUSR (Acc-Norm, 0-shot) | 46.01 | 47.34 |
-| MMLU-Pro (Acc, 5-shot) | 35.81 | 35.72 |
-| **Average Score** | **42.61** | **43.22** |
-| **Recovery** | **100.00** | **101.43** |
-
-#### HumanEval pass@1 scores
-| Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-FP8-dynamic |
-|-----------------------------------------|:-----------------------------------:|:---------------------------------------------------:|
-| HumanEval Pass@1 | 71.00 | 69.90 |
+<table>
+  <thead>
+    <tr>
+      <th>Category</th>
+      <th>Metric</th>
+      <th>ibm-granite/granite-3.1-8b-instruct</th>
+      <th>neuralmagic/granite-3.1-8b-instruct-FP8-dynamic</th>
+      <th>Recovery (%)</th>
+    </tr>
+  </thead>
+  <tbody>
+    <!-- OpenLLM Leaderboard V1 -->
+    <tr>
+      <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
+      <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+      <td>66.81</td>
+      <td>66.81</td>
+      <td>100.00</td>
+    </tr>
+    <tr>
+      <td>GSM8K (Strict-Match, 5-shot)</td>
+      <td>64.52</td>
+      <td>66.64</td>
+      <td>103.29</td>
+    </tr>
+    <tr>
+      <td>HellaSwag (Acc-Norm, 10-shot)</td>
+      <td>84.18</td>
+      <td>84.16</td>
+      <td>99.98</td>
+    </tr>
+    <tr>
+      <td>MMLU (Acc, 5-shot)</td>
+      <td>65.52</td>
+      <td>65.36</td>
+      <td>99.76</td>
+    </tr>
+    <tr>
+      <td>TruthfulQA (MC2, 0-shot)</td>
+      <td>60.57</td>
+      <td>60.52</td>
+      <td>99.92</td>
+    </tr>
+    <tr>
+      <td>Winogrande (Acc, 5-shot)</td>
+      <td>80.19</td>
+      <td>79.95</td>
+      <td>99.70</td>
+    </tr>
+    <tr>
+      <td><b>Average Score</b></td>
+      <td><b>70.30</b></td>
+      <td><b>70.57</b></td>
+      <td><b>100.39</b></td>
+    </tr>
+    <!-- OpenLLM Leaderboard V2 -->
+    <tr>
+      <td rowspan="7"><b>OpenLLM Leaderboard V2</b></td>
+      <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
+      <td>74.10</td>
+      <td>73.62</td>
+      <td>99.35</td>
+    </tr>
+    <tr>
+      <td>BBH (Acc-Norm, 3-shot)</td>
+      <td>53.19</td>
+      <td>53.26</td>
+      <td>100.13</td>
+    </tr>
+    <tr>
+      <td>Math-Hard (Exact-Match, 4-shot)</td>
+      <td>14.77</td>
+      <td>16.79</td>
+      <td>113.66</td>
+    </tr>
+    <tr>
+      <td>GPQA (Acc-Norm, 0-shot)</td>
+      <td>31.76</td>
+      <td>32.58</td>
+      <td>102.58</td>
+    </tr>
+    <tr>
+      <td>MUSR (Acc-Norm, 0-shot)</td>
+      <td>46.01</td>
+      <td>47.34</td>
+      <td>102.89</td>
+    </tr>
+    <tr>
+      <td>MMLU-Pro (Acc, 5-shot)</td>
+      <td>35.81</td>
+      <td>35.72</td>
+      <td>99.75</td>
+    </tr>
+    <tr>
+      <td><b>Average Score</b></td>
+      <td><b>42.61</b></td>
+      <td><b>43.22</b></td>
+      <td><b>101.43</b></td>
+    </tr>
+    <!-- HumanEval -->
+    <tr>
+      <td><b>HumanEval</b></td>
+      <td>HumanEval Pass@1</td>
+      <td>71.00</td>
+      <td>69.90</td>
+      <td>98.45</td>
+    </tr>
+  </tbody>
+</table>
+
 
 
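For reference, the Recovery column in the new table is the quantized model's score as a percentage of the baseline's, which is why values above 100 appear where the FP8 model happens to score higher:

```python
# Recovery (%) = quantized score / baseline score * 100; e.g. for GSM8K:
print(round(66.64 / 64.52 * 100, 2))  # 103.29, matching the table
```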
 ## Inference Performance
@@ -188,6 +289,16 @@ evalplus.evaluate \
 This model achieves up to 1.5x speedup in single-stream deployment and up to 1.1x speedup in multi-stream asynchronous deployment on L40 GPUs.
 The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1 and [GuideLLM](https://github.com/neuralmagic/guidellm).
 
+<details>
+<summary>Benchmarking Command</summary>
+
+```
+guidellm --model neuralmagic/granite-3.1-8b-instruct-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
+```
+
+</details>
+
+
 ### Single-stream performance (measured with vLLM version 0.6.6.post1)
 <table>
 <tr>
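The GuideLLM command above targets `http://localhost:8000/v1`, so it assumes an OpenAI-compatible vLLM server is already serving the model. A typical way to start one (an assumption; the commit itself does not include a serving command):

```bash
vllm serve neuralmagic/granite-3.1-8b-instruct-FP8-dynamic --max-model-len 4096
```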
 