nm-research committed · verified
Commit 0cd5367 · 1 Parent(s): 6268ee6

Update README.md

Files changed (1):
  1. README.md (+77 -23)

README.md CHANGED
@@ -43,7 +43,7 @@ from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams

 max_model_len, tp_size = 4096, 1
-model_name = "neuralmagic-ent/granite-3.1-2b-base-quantized.w8a8"
+model_name = "neuralmagic/granite-3.1-2b-base-quantized.w8a8"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
 sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
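For context, the README's deployment snippet continues past the diff context shown in this hunk. A minimal sketch of how the loaded model would then be used with vLLM follows; the example prompt is an assumption, not part of the commit:

```python
# Hypothetical continuation of the deployment snippet above.
prompts = ["The key benefit of INT8 weight and activation quantization is"]
outputs = llm.generate(prompts, sampling_params)  # one RequestOutput per prompt
print(outputs[0].outputs[0].text)
```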
@@ -66,6 +66,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

 This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

+<details>
+<summary>Model Creation Code</summary>

 ```bash
 python quantize.py --model_path ibm-granite/granite-3.1-2b-base --quant_path "output_dir/granite-3.1-2b-base-quantized.w8a8" --calib_size 1024 --dampening_frac 0.01 --observer mse
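The `quantize.py` entry point above is reproduced in full further down the README; its `oneshot(...)` call and save steps are visible in the next hunk. As a rough, non-authoritative sketch of the core llm-compressor flow such a script wraps, where the calibration dataset and modifier settings are assumptions rather than the repository's exact recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_path = "ibm-granite/granite-3.1-2b-base"
quant_path = "output_dir/granite-3.1-2b-base-quantized.w8a8"

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# W8A8: INT8 weights and activations on Linear layers; lm_head stays unquantized.
# dampening_frac mirrors the --dampening_frac 0.01 flag above.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.01)

oneshot(
    model=model,
    dataset="open_platypus",       # assumed calibration dataset
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=1024,  # mirrors --calib_size 1024
)

model.save_pretrained(quant_path, save_compressed=True)
tokenizer.save_pretrained(quant_path)
```

The `--observer mse` flag would map to the modifier's quantization observer configuration, omitted here for brevity.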
@@ -151,16 +153,20 @@ oneshot(
 model.save_pretrained(quant_path, save_compressed=True)
 tokenizer.save_pretrained(quant_path)
 ```
+</details>

 ## Evaluation

-The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+
+<details>
+<summary>Evaluation Commands</summary>

 OpenLLM Leaderboard V1:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic-ent/granite-3.1-2b-base-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --model_args pretrained="neuralmagic/granite-3.1-2b-base-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
   --tasks openllm \
   --write_out \
   --batch_size auto \
@@ -172,7 +178,7 @@ lm_eval \
 ##### Generation
 ```
 python3 codegen/generate.py \
-  --model neuralmagic-ent/granite-3.1-2b-base-quantized.w8a8 \
+  --model neuralmagic/granite-3.1-2b-base-quantized.w8a8 \
   --bs 16 \
   --temperature 0.2 \
   --n_samples 50 \
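Generation produces 50 completions per task (`--n_samples 50`), while the table below reports pass@1. The score is presumably computed with the unbiased pass@k estimator from the Codex paper, the standard for HumanEval; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c,k)/C(n,k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=50 samples, a task where 15 samples pass contributes 0.30 to pass@1.
print(pass_at_k(50, 15, 1))  # 0.3
```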
@@ -182,33 +188,81 @@ python3 codegen/generate.py \
 ##### Sanitization
 ```
 python3 evalplus/sanitize.py \
-    humaneval/neuralmagic-ent--granite-3.1-2b-base-quantized.w8a8_vllm_temp_0.2
+    humaneval/neuralmagic--granite-3.1-2b-base-quantized.w8a8_vllm_temp_0.2
 ```
 ##### Evaluation
 ```
 evalplus.evaluate \
   --dataset humaneval \
-  --samples humaneval/neuralmagic-ent--granite-3.1-2b-base-quantized.w8a8_vllm_temp_0.2-sanitized
+  --samples humaneval/neuralmagic--granite-3.1-2b-base-quantized.w8a8_vllm_temp_0.2-sanitized
 ```
+</details>

 ### Accuracy

-#### OpenLLM Leaderboard V1 evaluation scores
-
-| Metric | ibm-granite/granite-3.1-2b-base | neuralmagic-ent/granite-3.1-2b-base-quantized.w8a8 |
-|-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| ARC-Challenge (Acc-Norm, 25-shot) | 53.75 | 54.01 |
-| GSM8K (Strict-Match, 5-shot) | 47.84 | 46.55 |
-| HellaSwag (Acc-Norm, 10-shot) | 77.94 | 77.94 |
-| MMLU (Acc, 5-shot) | 52.88 | 52.34 |
-| TruthfulQA (MC2, 0-shot) | 39.04 | 38.12 |
-| Winogrande (Acc, 5-shot) | 74.43 | 74.35 |
-| **Average Score** | **57.65** | **57.22** |
-| **Recovery** | **100.00** | **99.26** |
-
-#### HumanEval pass@1 scores
-| Metric | ibm-granite/granite-3.1-2b-base | neuralmagic-ent/granite-3.1-2b-base-quantized.w8a8 |
-|-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| HumanEval Pass@1 | 30.00 | 29.6 |
+<table>
+  <thead>
+    <tr>
+      <th>Category</th>
+      <th>Metric</th>
+      <th>ibm-granite/granite-3.1-2b-base</th>
+      <th>neuralmagic-ent/granite-3.1-2b-base-quantized.w8a8</th>
+      <th>Recovery (%)</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
+      <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+      <td>53.75</td>
+      <td>54.01</td>
+      <td>100.48</td>
+    </tr>
+    <tr>
+      <td>GSM8K (Strict-Match, 5-shot)</td>
+      <td>47.84</td>
+      <td>46.55</td>
+      <td>97.30</td>
+    </tr>
+    <tr>
+      <td>HellaSwag (Acc-Norm, 10-shot)</td>
+      <td>77.94</td>
+      <td>77.94</td>
+      <td>100.00</td>
+    </tr>
+    <tr>
+      <td>MMLU (Acc, 5-shot)</td>
+      <td>52.88</td>
+      <td>52.34</td>
+      <td>98.98</td>
+    </tr>
+    <tr>
+      <td>TruthfulQA (MC2, 0-shot)</td>
+      <td>39.04</td>
+      <td>38.12</td>
+      <td>97.64</td>
+    </tr>
+    <tr>
+      <td>Winogrande (Acc, 5-shot)</td>
+      <td>74.43</td>
+      <td>74.35</td>
+      <td>99.89</td>
+    </tr>
+    <tr>
+      <td><b>Average Score</b></td>
+      <td><b>57.65</b></td>
+      <td><b>57.22</b></td>
+      <td><b>99.26</b></td>
+    </tr>
+    <tr>
+      <td rowspan="2"><b>HumanEval</b></td>
+      <td>HumanEval Pass@1</td>
+      <td>30.00</td>
+      <td>29.60</td>
+      <td><b>98.67</b></td>
+    </tr>
+  </tbody>
+</table>
+
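The Recovery (%) column introduced by this commit is simply the quantized score expressed as a percentage of the baseline score. A quick sanity check against the table's numbers:

```python
# Recovery = 100 * quantized / baseline, checked per row of the new table.
rows = {
    "ARC-Challenge": (53.75, 54.01),
    "GSM8K": (47.84, 46.55),
    "HellaSwag": (77.94, 77.94),
    "MMLU": (52.88, 52.34),
    "TruthfulQA": (39.04, 38.12),
    "Winogrande": (74.43, 74.35),
    "HumanEval Pass@1": (30.00, 29.60),
}
for name, (base, quant) in rows.items():
    print(f"{name}: {100 * quant / base:.2f}")  # e.g. ARC-Challenge: 100.48

# The Average Score recovery (99.26) appears to be computed on the unrounded
# averages rather than as the mean of the per-row recoveries.
```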