---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- chat
- neuralmagic
- llmcompressor
- int8
---

# Qwen2.5-7B-Instruct-quantized.w8a8

## Model Overview
- **Model Architecture:** Qwen2
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 10/09/2024
- **Version:** 1.0
- **License(s):** [apache-2.0](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
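
To make the scheme concrete, the sketch below (a simplified illustration, not the llm-compressor implementation) shows how symmetric per-channel weight quantization and symmetric dynamic per-token activation quantization map a tensor to INT8: each weight row gets one static scale computed offline, while each activation row (token) gets a fresh scale at runtime.

```python
import torch

def quantize_weight_per_channel(w: torch.Tensor):
    # Symmetric static per-channel: one scale per output channel (row),
    # computed once from the calibrated weights and then fixed.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def quantize_activation_per_token(x: torch.Tensor):
    # Symmetric dynamic per-token: one scale per token (row),
    # recomputed on the fly for every forward pass.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
x = torch.randn(8, 4096)
qw, sw = quantize_weight_per_channel(w)
qx, sx = quantize_activation_per_token(x)
# Dequantizing (q * scale) recovers the tensor up to rounding error.
print((qw.float() * sw - w).abs().max())
```

Since each INT8 value occupies one byte instead of two, the weights of a 7B-parameter model shrink from roughly 15 GB to roughly 7.5 GB, which is where the approximately 50% figures above come from.
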
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8"
number_gpus = 1
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Render the chat template into a plain-text prompt, appending the assistant
# turn so the model responds to the message instead of continuing it.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
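
As a minimal sketch (assuming a local server started with `vllm serve RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8`, which listens on port 8000 by default), the OpenAI-compatible endpoint can be queried with the standard OpenAI Python client:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder
# unless the server was started with one configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```
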

## Creation

<details>
  <summary>Creation details</summary>
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load model
model_stub = "Qwen/Qwen2.5-7B-Instruct"
model_name = model_stub.split("/")[-1]

num_samples = 512
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Render the calibration conversations into plain text with the chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        ignore=["lm_head"],
        sequential_targets=["Qwen2DecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
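
After the run completes, the saved checkpoint can be smoke-tested by reloading it (a quick sketch; vLLM reads the compressed-tensors format natively):

```python
from vllm import LLM, SamplingParams

# Load the freshly saved quantized checkpoint from the local save_path.
llm = LLM(model="Qwen2.5-7B-Instruct-quantized.w8a8")
out = llm.generate(["Hello!"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```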
</details>

## Evaluation

The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/387bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 387bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-7B-Instruct-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

### Accuracy

#### Open LLM Leaderboard evaluation scores
| Benchmark | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct-quantized.w8a8<br>(this model) | Recovery |
| :---- | :----: | :----: | :----: |
| MMLU (5-shot) | 74.24 | 73.87 | 99.5% |
| ARC Challenge (25-shot) | 63.40 | 63.23 | 99.7% |
| GSM-8K (5-shot, strict-match) | 80.36 | 80.74 | 100.5% |
| Hellaswag (10-shot) | 81.52 | 81.06 | 99.4% |
| Winogrande (5-shot) | 74.66 | 74.82 | 100.2% |
| TruthfulQA (0-shot, mc2) | 64.76 | 64.58 | 99.7% |
| **Average** | **73.16** | **73.05** | **99.4%** |
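
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score, so values near 100% indicate negligible accuracy loss. For example, for MMLU:

```python
# Recovery = quantized score / baseline score
print(f"{73.87 / 74.24:.1%}")  # 99.5%
```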