nm-research commited on
Commit
3a4d074
·
verified ·
1 Parent(s): cf03315

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +144 -37
README.md CHANGED
@@ -43,7 +43,7 @@ from transformers import AutoTokenizer
43
  from vllm import LLM, SamplingParams
44
 
45
  max_model_len, tp_size = 4096, 1
46
- model_name = "neuralmagic-ent/granite-3.1-8b-instruct-quantized.w8a8"
47
  tokenizer = AutoTokenizer.from_pretrained(model_name)
48
  llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
49
  sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
@@ -65,7 +65,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
65
  ## Creation
66
 
67
  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
68
-
 
69
 
70
  ```bash
71
  python quantize.py --model_path ibm-granite/granite-3.1-8b-instruct --quant_path "output_dir/granite-3.1-8b-instruct-quantized.w8a8" --calib_size 3072 --dampening_frac 0.1 --observer mse
@@ -151,16 +152,20 @@ oneshot(
151
  model.save_pretrained(quant_path, save_compressed=True)
152
  tokenizer.save_pretrained(quant_path)
153
  ```
 
154
 
155
  ## Evaluation
156
 
157
- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
158
 
 
 
 
159
  OpenLLM Leaderboard V1:
160
  ```
161
  lm_eval \
162
  --model vllm \
163
- --model_args pretrained="neuralmagic-ent/granite-3.1-8b-instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
164
  --tasks openllm \
165
  --write_out \
166
  --batch_size auto \
@@ -168,11 +173,23 @@ lm_eval \
168
  --show_config
169
  ```
170
 
 
 
 
 
 
 
 
 
 
 
 
 
171
  #### HumanEval
172
  ##### Generation
173
  ```
174
  python3 codegen/generate.py \
175
- --model neuralmagic-ent/granite-3.1-8b-instruct-quantized.w8a8 \
176
  --bs 16 \
177
  --temperature 0.2 \
178
  --n_samples 50 \
@@ -182,47 +199,128 @@ python3 codegen/generate.py \
182
  ##### Sanitization
183
  ```
184
  python3 evalplus/sanitize.py \
185
- humaneval/neuralmagic-ent--granite-3.1-8b-instruct-quantized.w8a8_vllm_temp_0.2
186
  ```
187
  ##### Evaluation
188
  ```
189
  evalplus.evaluate \
190
  --dataset humaneval \
191
- --samples humaneval/neuralmagic-ent--granite-3.1-8b-instruct-quantized.w8a8_vllm_temp_0.2-sanitized
192
  ```
 
193
 
194
  ### Accuracy
195
 
196
- #### OpenLLM Leaderboard V1 evaluation scores
197
-
198
- | Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-quantized.w8a8 |
199
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
200
- | ARC-Challenge (Acc-Norm, 25-shot) | 66.81 | 67.06 |
201
- | GSM8K (Strict-Match, 5-shot) | 64.52 | 65.66 |
202
- | HellaSwag (Acc-Norm, 10-shot) | 84.18 | 83.93 |
203
- | MMLU (Acc, 5-shot) | 65.52 | 65.03 |
204
- | TruthfulQA (MC2, 0-shot) | 60.57 | 60.02 |
205
- | Winogrande (Acc, 5-shot) | 80.19 | 79.87 |
206
- | **Average Score** | **70.30** | **70.26** |
207
- | **Recovery** | **100.00** | **99.95** |
208
-
209
- #### OpenLLM Leaderboard V2 evaluation scores
210
-
211
- | Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-quantized.w8a8 |
212
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
213
- | IFEval (Inst Level Strict Acc, 0-shot)| 74.01 | 73.50 |
214
- | BBH (Acc-Norm, 3-shot) | 53.19 | 52.59 |
215
- | Math-Hard (Exact-Match, 4-shot) | 14.77 | 15.73 |
216
- | GPQA (Acc-Norm, 0-shot) | 31.76 | 30.62 |
217
- | MUSR (Acc-Norm, 0-shot) | 46.01 | 44.30 |
218
- | MMLU-Pro (Acc, 5-shot) | 35.81 | 35.41 |
219
- | **Average Score** | **42.61** | **42.03** |
220
- | **Recovery** | **100.00** | **98.64** |
221
-
222
- #### HumanEval pass@1 scores
223
- | Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-quantized.w8a8 |
224
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
225
- | HumanEval Pass@1 | 71.00 | 70.50 |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
226
 
227
 
228
  ## Inference Performance
@@ -231,6 +329,15 @@ evalplus.evaluate \
231
  This model achieves up to 1.6x speedup in single-stream deployment and up to 1.7x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario.
232
  The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1, and [GuideLLM](https://github.com/neuralmagic/guidellm).
233
 
 
 
 
 
 
 
 
 
 
234
  ### Single-stream performance (measured with vLLM version 0.6.6.post1)
235
  <table>
236
  <tr>
 
43
  from vllm import LLM, SamplingParams
44
 
45
  max_model_len, tp_size = 4096, 1
46
+ model_name = "neuralmagic/granite-3.1-8b-instruct-quantized.w8a8"
47
  tokenizer = AutoTokenizer.from_pretrained(model_name)
48
  llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
49
  sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
 
65
  ## Creation
66
 
67
  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
68
+ <details>
69
+ <summary>Model Creation Code</summary>
70
 
71
  ```bash
72
  python quantize.py --model_path ibm-granite/granite-3.1-8b-instruct --quant_path "output_dir/granite-3.1-8b-instruct-quantized.w8a8" --calib_size 3072 --dampening_frac 0.1 --observer mse
 
152
  model.save_pretrained(quant_path, save_compressed=True)
153
  tokenizer.save_pretrained(quant_path)
154
  ```
155
+ </details>
156
 
157
  ## Evaluation
158
 
159
+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
160
 
161
+ <details>
162
+ <summary>Evaluation Commands</summary>
163
+
164
  OpenLLM Leaderboard V1:
165
  ```
166
  lm_eval \
167
  --model vllm \
168
+ --model_args pretrained="neuralmagic/granite-3.1-8b-instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
169
  --tasks openllm \
170
  --write_out \
171
  --batch_size auto \
 
173
  --show_config
174
  ```
175
 
176
+ OpenLLM Leaderboard V2:
177
+ ```
178
+ lm_eval \
179
+ --model vllm \
180
+ --model_args pretrained="neuralmagic/granite-3.1-8b-instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
181
+ --tasks leaderboard \
182
+ --write_out \
183
+ --batch_size auto \
184
+ --output_path output_dir \
185
+ --show_config
186
+ ```
187
+
188
  #### HumanEval
189
  ##### Generation
190
  ```
191
  python3 codegen/generate.py \
192
+ --model neuralmagic/granite-3.1-8b-instruct-quantized.w8a8 \
193
  --bs 16 \
194
  --temperature 0.2 \
195
  --n_samples 50 \
 
199
  ##### Sanitization
200
  ```
201
  python3 evalplus/sanitize.py \
202
+ humaneval/neuralmagic--granite-3.1-8b-instruct-quantized.w8a8_vllm_temp_0.2
203
  ```
204
  ##### Evaluation
205
  ```
206
  evalplus.evaluate \
207
  --dataset humaneval \
208
+ --samples humaneval/neuralmagic--granite-3.1-8b-instruct-quantized.w8a8_vllm_temp_0.2-sanitized
209
  ```
210
+ </details>
211
 
212
  ### Accuracy
213
 
214
+ <table>
215
+ <thead>
216
+ <tr>
217
+ <th>Category</th>
218
+ <th>Metric</th>
219
+ <th>ibm-granite/granite-3.1-8b-instruct</th>
220
+ <th>neuralmagic/granite-3.1-8b-instruct-quantized.w8a8</th>
221
+ <th>Recovery (%)</th>
222
+ </tr>
223
+ </thead>
224
+ <tbody>
225
+ <!-- OpenLLM Leaderboard V1 -->
226
+ <tr>
227
+ <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
228
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
229
+ <td>66.81</td>
230
+ <td>67.06</td>
231
+ <td>100.37</td>
232
+ </tr>
233
+ <tr>
234
+ <td>GSM8K (Strict-Match, 5-shot)</td>
235
+ <td>64.52</td>
236
+ <td>65.66</td>
237
+ <td>101.77</td>
238
+ </tr>
239
+ <tr>
240
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
241
+ <td>84.18</td>
242
+ <td>83.93</td>
243
+ <td>99.70</td>
244
+ </tr>
245
+ <tr>
246
+ <td>MMLU (Acc, 5-shot)</td>
247
+ <td>65.52</td>
248
+ <td>65.03</td>
249
+ <td>99.25</td>
250
+ </tr>
251
+ <tr>
252
+ <td>TruthfulQA (MC2, 0-shot)</td>
253
+ <td>60.57</td>
254
+ <td>60.02</td>
255
+ <td>99.09</td>
256
+ </tr>
257
+ <tr>
258
+ <td>Winogrande (Acc, 5-shot)</td>
259
+ <td>80.19</td>
260
+ <td>79.87</td>
261
+ <td>99.60</td>
262
+ </tr>
263
+ <tr>
264
+ <td><b>Average Score</b></td>
265
+ <td><b>70.30</b></td>
266
+ <td><b>70.26</b></td>
267
+ <td><b>99.95</b></td>
268
+ </tr>
269
+ <!-- OpenLLM Leaderboard V2 -->
270
+ <tr>
271
+ <td rowspan="7"><b>OpenLLM Leaderboard V2</b></td>
272
+ <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
273
+ <td>74.01</td>
274
+ <td>73.50</td>
275
+ <td>99.31</td>
276
+ </tr>
277
+ <tr>
278
+ <td>BBH (Acc-Norm, 3-shot)</td>
279
+ <td>53.19</td>
280
+ <td>52.59</td>
281
+ <td>98.87</td>
282
+ </tr>
283
+ <tr>
284
+ <td>Math-Hard (Exact-Match, 4-shot)</td>
285
+ <td>14.77</td>
286
+ <td>15.73</td>
287
+ <td>106.50</td>
288
+ </tr>
289
+ <tr>
290
+ <td>GPQA (Acc-Norm, 0-shot)</td>
291
+ <td>31.76</td>
292
+ <td>30.62</td>
293
+ <td>96.40</td>
294
+ </tr>
295
+ <tr>
296
+ <td>MUSR (Acc-Norm, 0-shot)</td>
297
+ <td>46.01</td>
298
+ <td>44.30</td>
299
+ <td>96.28</td>
300
+ </tr>
301
+ <tr>
302
+ <td>MMLU-Pro (Acc, 5-shot)</td>
303
+ <td>35.81</td>
304
+ <td>35.41</td>
305
+ <td>98.88</td>
306
+ </tr>
307
+ <tr>
308
+ <td><b>Average Score</b></td>
309
+ <td><b>42.61</b></td>
310
+ <td><b>42.03</b></td>
311
+ <td><b>98.64</b></td>
312
+ </tr>
313
+ <!-- HumanEval -->
314
+ <tr>
315
+ <td rowspan="2"><b>HumanEval</b></td>
316
+ <td>HumanEval Pass@1</td>
317
+ <td>71.00</td>
318
+ <td>70.50</td>
319
+ <td><b>99.30</b></td>
320
+ </tr>
321
+ </tbody>
322
+ </table>
323
+
324
 
325
 
326
  ## Inference Performance
 
329
  This model achieves up to 1.6x speedup in single-stream deployment and up to 1.7x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario.
330
  The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1, and [GuideLLM](https://github.com/neuralmagic/guidellm).
331
 
332
+ <details>
333
+ <summary>Benchmarking Command</summary>
334
+
335
+ ```
336
+ guidellm --model neuralmagic/granite-3.1-8b-instruct-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max seconds 360 --backend aiohttp_server
337
+ ```
338
+
339
+ </details>
340
+
341
  ### Single-stream performance (measured with vLLM version 0.6.6.post1)
342
  <table>
343
  <tr>