nm-research committed (verified)
Commit c677fdb · 1 Parent(s): f9b4aef

Update README.md

Files changed (1)
  1. README.md +76 -23
README.md CHANGED
@@ -43,7 +43,7 @@ from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic-ent/granite-3.1-2b-base-quantized.w4a16"
+ model_name = "neuralmagic/granite-3.1-2b-base-quantized.w4a16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
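
The deployment snippet above is cut off by the diff window; for orientation, a minimal sketch of the generation step it leads into (the prompt string is illustrative, not part of this commit):

```python
# Continuation sketch — reuses `llm` and `sampling_params` from the snippet above.
# The prompt is a placeholder; the README's actual prompt is outside this diff window.
prompt = "Write a short story about a robot learning to paint:"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```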
@@ -66,6 +66,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

+ <details>
+ <summary>Model Creation Code</summary>

```bash
python quantize.py --model_path ibm-granite/granite-3.1-2b-base --quant_path "output_dir/granite-3.1-2b-base-quantized.w4a16" --calib_size 1024 --dampening_frac 0.01 --observer mse
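
quantize.py lives in the model repository and wires these flags into an llm-compressor GPTQ recipe. A hedged sketch of what an equivalent W4A16 recipe looks like (the script itself, partially visible in the next hunk, is the source of truth):

```python
# Hedged sketch of an equivalent W4A16 GPTQ recipe with llm-compressor.
# --calib_size 1024 and --observer mse configure calibration elsewhere in the script.
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",      # quantize all Linear layers...
    ignore=["lm_head"],    # ...except the output head
    scheme="W4A16",        # 4-bit weights, 16-bit activations
    dampening_frac=0.01,   # matches --dampening_frac 0.01
)
```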
@@ -144,16 +146,20 @@ oneshot(
model.save_pretrained(quant_path, save_compressed=True)
tokenizer.save_pretrained(quant_path)
```
+ </details>

## Evaluation

- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+
+ <details>
+ <summary>Evaluation Commands</summary>

OpenLLM Leaderboard V1:
```
lm_eval \
--model vllm \
- --model_args pretrained="neuralmagic-ent/granite-3.1-2b-base-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --model_args pretrained="neuralmagic/granite-3.1-2b-base-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
--tasks openllm \
--write_out \
--batch_size auto \
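
The same run can also be driven from Python; a sketch assuming lm-eval's `simple_evaluate` entry point (the CLI invocation above is what was actually used):

```python
# Sketch: programmatic equivalent of the lm_eval command above
# (assumption: lm-eval's simple_evaluate API; the CLI is authoritative).
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/granite-3.1-2b-base-quantized.w4a16,"
        "dtype=auto,add_bos_token=True,max_model_len=4096,"
        "tensor_parallel_size=1,gpu_memory_utilization=0.8,"
        "enable_chunked_prefill=True,trust_remote_code=True"
    ),
    tasks=["openllm"],
    batch_size="auto",
)
```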
@@ -165,7 +171,7 @@ lm_eval \
##### Generation
```
python3 codegen/generate.py \
- --model neuralmagic-ent/granite-3.1-2b-base-quantized.w4a16 \
+ --model neuralmagic/granite-3.1-2b-base-quantized.w4a16 \
--bs 16 \
--temperature 0.2 \
--n_samples 50 \
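
With `--n_samples 50` per task at temperature 0.2, pass@1 is estimated from repeated samples; a sketch of the standard unbiased estimator (Chen et al., 2021) that this style of evaluation is based on:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k) (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 50 samples for a task (--n_samples 50), 15 of which pass the tests.
print(round(pass_at_k(50, 15, 1), 4))  # 0.3 — that task's pass@1 estimate
```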
@@ -175,32 +181,79 @@ python3 codegen/generate.py \
##### Sanitization
```
python3 evalplus/sanitize.py \
- humaneval/neuralmagic-ent--granite-3.1-2b-base-quantized.w4a16_vllm_temp_0.2
+ humaneval/neuralmagic--granite-3.1-2b-base-quantized.w4a16_vllm_temp_0.2
```
##### Evaluation
```
evalplus.evaluate \
--dataset humaneval \
- --samples humaneval/neuralmagic-ent--granite-3.1-2b-base-quantized.w4a16_vllm_temp_0.2-sanitized
+ --samples humaneval/neuralmagic--granite-3.1-2b-base-quantized.w4a16_vllm_temp_0.2-sanitized
```
+ </details>

### Accuracy

- #### OpenLLM Leaderboard V1 evaluation scores
-
- | Metric | ibm-granite/granite-3.1-2b-base | neuralmagic-ent/granite-3.1-2b-base-quantized.w4a16 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | ARC-Challenge (Acc-Norm, 25-shot) | 53.75 | 51.96 |
- | GSM8K (Strict-Match, 5-shot) | 47.84 | 42.53 |
- | HellaSwag (Acc-Norm, 10-shot) | 77.94 | 75.38 |
- | MMLU (Acc, 5-shot) | 52.88 | 51.09 |
- | TruthfulQA (MC2, 0-shot) | 39.04 | 41.35 |
- | Winogrande (Acc, 5-shot) | 74.43 | 74.27 |
- | **Average Score** | **57.65** | **56.10** |
- | **Recovery** | **100.00** | **97.31** |
-
- #### HumanEval pass@1 scores
-
- | Metric | ibm-granite/granite-3.1-2b-base | neuralmagic-ent/granite-3.1-2b-base-quantized.w4a16 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | HumanEval Pass@1 | 30.00 | 0.298 |
+ <table>
+   <thead>
+     <tr>
+       <th>Category</th>
+       <th>Metric</th>
+       <th>ibm-granite/granite-3.1-2b-base</th>
+       <th>neuralmagic/granite-3.1-2b-base-quantized.w4a16</th>
+       <th>Recovery (%)</th>
+     </tr>
+   </thead>
+   <tbody>
+     <tr>
+       <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
+       <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+       <td>53.75</td>
+       <td>51.96</td>
+       <td>96.67</td>
+     </tr>
+     <tr>
+       <td>GSM8K (Strict-Match, 5-shot)</td>
+       <td>47.84</td>
+       <td>42.53</td>
+       <td>88.89</td>
+     </tr>
+     <tr>
+       <td>HellaSwag (Acc-Norm, 10-shot)</td>
+       <td>77.94</td>
+       <td>75.38</td>
+       <td>96.71</td>
+     </tr>
+     <tr>
+       <td>MMLU (Acc, 5-shot)</td>
+       <td>52.88</td>
+       <td>51.09</td>
+       <td>96.61</td>
+     </tr>
+     <tr>
+       <td>TruthfulQA (MC2, 0-shot)</td>
+       <td>39.04</td>
+       <td>41.35</td>
+       <td>105.93</td>
+     </tr>
+     <tr>
+       <td>Winogrande (Acc, 5-shot)</td>
+       <td>74.43</td>
+       <td>74.27</td>
+       <td>99.78</td>
+     </tr>
+     <tr>
+       <td><b>Average Score</b></td>
+       <td><b>57.65</b></td>
+       <td><b>56.10</b></td>
+       <td><b>97.31</b></td>
+     </tr>
+     <tr>
+       <td><b>HumanEval</b></td>
+       <td>HumanEval Pass@1</td>
+       <td>30.00</td>
+       <td>29.80</td>
+       <td>99.33</td>
+     </tr>
+   </tbody>
+ </table>
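
Note: the Recovery (%) column reports the quantized model's score as a percentage of the baseline's, e.g. for the V1 average: 56.10 / 57.65 × 100 ≈ 97.31.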