alexmarques committed (verified)
Commit e54f985 · 1 Parent(s): b271778

Update README.md

Files changed (1)
  1. README.md +213 -19
README.md CHANGED
@@ -17,29 +17,223 @@ pipeline_tag: text-generation
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
- It achieves an average score of 68.69% on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 68.54%.
 
- ## Model Optimizations
 
 This model was obtained by quantizing the weights of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to INT8 data type.
- Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT8 and floating point representations of the quantized weights.
- [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization.
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 
 ## Evaluation
 
- The model was evaluated with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
-
- ## Accuracy
-
- ### Open LLM Leaderboard evaluation scores
- | | [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | Meta-Llama-3-8B-Instruct-quantized.w8a16<br>(this model) |
- | :------------------: | :----------------------: | :------------------------------------------------: |
- | arc-c<br>25-shot | 62.63% | 61.52% |
- | hellaswag<br>10-shot | 78.81% | 78.69% |
- | mmlu<br>5-shot | 66.54% | 66.55% |
- | truthfulqa<br>0-shot | 52.49% | 52.60% |
- | winogrande<br>5-shot | 76.48% | 76.01% |
- | gsm8k<br>5-shot | 75.21% | 75.89% |
- | **Average<br>Accuracy** | **68.69%** | **68.54%** |
- | **Recovery** | **100%** | **99.78%** |
 
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
+ It achieves an average score of 68.54% on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 68.69%.
 
+ ### Model Optimizations
 
 This model was obtained by quantizing the weights of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to INT8 data type.
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 
+ Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied: a linear scaling per output dimension maps between the INT8 and floating-point representations of the quantized weights.
+ [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization, with a 1% damping factor and 256 calibration sequences of 8,192 random tokens each.
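+ 
+ As a hedged illustration of the scheme described above (not the exact code used to produce this checkpoint), the PyTorch sketch below quantizes a weight matrix with one symmetric INT8 scale per output channel and maps it back to floating point:
+ 
+ ```python
+ import torch
+ 
+ def quantize_per_channel_int8(weight: torch.Tensor):
+     # One scale per output channel (row); symmetric, so no zero point is needed
+     scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
+     int8_weight = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
+     return int8_weight, scales
+ 
+ def dequantize(int8_weight: torch.Tensor, scales: torch.Tensor):
+     # The linear per-output-dimension scaling between INT8 and floating point
+     return int8_weight.to(torch.float32) * scales
+ 
+ weight = torch.randn(4096, 14336)  # shaped like a transformer linear layer
+ q, s = quantize_per_channel_int8(weight)
+ print((weight - dequantize(q, s)).abs().max())  # worst-case rounding error
+ ```
+ 
+ A recipe along the following lines could reproduce the quantization step with AutoGPTQ; the argument names match AutoGPTQ's `BaseQuantizeConfig`, but the exact script behind this model is an assumption:
+ 
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+ 
+ model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
+ 
+ # 8-bit, per-channel (group_size=-1), symmetric, 1% damping (assumed settings)
+ quantize_config = BaseQuantizeConfig(bits=8, group_size=-1, sym=True, damp_percent=0.01)
+ 
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
+ 
+ # 256 calibration sequences of 8,192 random tokens each
+ examples = [
+     {"input_ids": torch.randint(0, tokenizer.vocab_size, (1, 8192)),
+      "attention_mask": torch.ones(1, 8192, dtype=torch.long)}
+     for _ in range(256)
+ ]
+ 
+ model.quantize(examples)
+ model.save_quantized("Meta-Llama-3-8B-Instruct-quantized.w8a16")
+ ```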
+
+
+ ## Usage and Creation
+
+ - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
+
+ ### Use with transformers
+
+ This model is supported by Transformers through its integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
+ The following examples show how the model can be used as part of a Transformers pipeline or via the `generate()` function.
+
+ #### Transformers pipeline
+
+ ```python
+ import transformers
+ import torch
+
+ model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
+
+ pipeline = transformers.pipeline(
+     "text-generation",
+     model=model_id,
+     model_kwargs={"torch_dtype": "auto"},
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ terminators = [
+     pipeline.tokenizer.eos_token_id,
+     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
+ ]
+
+ outputs = pipeline(
+     messages,
+     max_new_tokens=256,
+     eos_token_id=terminators,
+     do_sample=True,
+     temperature=0.6,
+     top_p=0.9,
+ )
+ print(outputs[0]["generated_text"][-1])
+ ```
+
+ #### Transformers AutoModelForCausalLM
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ input_ids = tokenizer.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     return_tensors="pt"
+ ).to(model.device)
+
+ terminators = [
+     tokenizer.eos_token_id,
+     tokenizer.convert_tokens_to_ids("<|eot_id|>")
+ ]
+
+ outputs = model.generate(
+     input_ids,
+     max_new_tokens=256,
+     eos_token_id=terminators,
+     do_sample=True,
+     temperature=0.6,
+     top_p=0.9,
+ )
+ response = outputs[0][input_ids.shape[-1]:]
+ print(tokenizer.decode(response, skip_special_tokens=True))
+ ```
+
+ ### vLLM Deployment
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
+
+ sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+
+ llm = LLM(model=model_id)
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving; see the [documentation](https://docs.vllm.ai/en/latest/) for more details. A client-side example is sketched below.
+
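+ As an illustration (the server command, host, and port below are assumptions for a typical local deployment), a model served this way can be queried with the standard OpenAI client:
+
+ ```python
+ # Assumes a server was started with something like:
+ #   python -m vllm.entrypoints.openai.api_server \
+ #     --model neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16",
+     messages=[
+         {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+         {"role": "user", "content": "Who are you?"},
+     ],
+     temperature=0.6,
+     top_p=0.9,
+ )
+ print(response.choices[0].message.content)
+ ```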
 
 ## Evaluation
 
+ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
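+
+ A hedged sketch of how such an evaluation can be reproduced through the harness's Python API is shown below; the task list and arguments are assumptions (the leaderboard uses task-specific few-shot counts, e.g. 25 for ARC Challenge and 10 for Hellaswag, so per-task runs may be required):
+
+ ```python
+ import lm_eval
+
+ # OpenLLM (version 1) tasks, evaluated through the vLLM backend
+ results = lm_eval.simple_evaluate(
+     model="vllm",
+     model_args="pretrained=neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16,dtype=auto",
+     tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande", "gsm8k"],
+     batch_size="auto",
+ )
+ print(results["results"])
+ ```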
+
+ ### Accuracy
+
+ #### Open LLM Leaderboard evaluation scores
+ <table>
+   <tr>
+     <td><strong>Benchmark</strong></td>
+     <td><strong>Meta-Llama-3-8B-Instruct</strong></td>
+     <td><strong>Meta-Llama-3-8B-Instruct-quantized.w8a16<br>(this model)</strong></td>
+     <td><strong>Recovery</strong></td>
+   </tr>
+   <tr>
+     <td>MMLU (5-shot)</td>
+     <td>66.54</td>
+     <td>66.55</td>
+     <td>100.0%</td>
+   </tr>
+   <tr>
+     <td>ARC Challenge (25-shot)</td>
+     <td>62.63</td>
+     <td>61.52</td>
+     <td>98.2%</td>
+   </tr>
+   <tr>
+     <td>GSM-8K (5-shot, strict-match)</td>
+     <td>75.21</td>
+     <td>75.89</td>
+     <td>100.9%</td>
+   </tr>
+   <tr>
+     <td>Hellaswag (10-shot)</td>
+     <td>78.81</td>
+     <td>78.69</td>
+     <td>99.8%</td>
+   </tr>
+   <tr>
+     <td>Winogrande (5-shot)</td>
+     <td>76.48</td>
+     <td>76.01</td>
+     <td>99.4%</td>
+   </tr>
+   <tr>
+     <td>TruthfulQA (0-shot)</td>
+     <td>52.49</td>
+     <td>52.60</td>
+     <td>100.2%</td>
+   </tr>
+   <tr>
+     <td><strong>Average</strong></td>
+     <td><strong>68.69</strong></td>
+     <td><strong>68.54</strong></td>
+     <td><strong>99.8%</strong></td>
+   </tr>
+ </table>