Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +241 -229
README.md CHANGED
@@ -1,229 +1,241 @@
- ---
- license: apache-2.0
- license_link: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE
- language:
- - en
- pipeline_tag: text-generation
- base_model: Qwen/Qwen2.5-7B-Instruct
- tags:
- - chat
- - neuralmagic
- - llmcompressor
- - int4
- ---
-
- # Qwen2.5-7B-Instruct-quantized.w4a16
-
- ## Model Overview
- - **Model Architecture:** Qwen2
- - **Input:** Text
- - **Output:** Text
- - **Model Optimizations:**
-   - **Weight quantization:** INT4
- - **Intended Use Cases:** Commercial and research use in multiple languages. Like [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
- - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- - **Release Date:** 04/16/2025
- - **Version:** 1.0
- - **License(s):** [apache-2.0](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE)
- - **Model Developers:** Neural Magic
-
- ### Model Optimizations
-
- This model was obtained by quantizing the weights of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) to the INT4 data type.
- This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.
-
- Only the weights of the linear operators within transformer blocks are quantized.
- Weights are quantized using a symmetric per-group scheme, with group size 128.
- The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
-
- ## Deployment
-
- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
-
- ```python
- from vllm import LLM, SamplingParams
- from transformers import AutoTokenizer
-
- model_id = "RedHatAI/Qwen2.5-7B-Instruct-quantized.w4a16"
- number_gpus = 1
- max_model_len = 8192
-
- sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
-
- messages = [
-     {"role": "user", "content": "Give me a short introduction to large language models."},
- ]
-
- # Render the chat template and append the assistant generation prompt
- prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
-
- llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)
-
- outputs = llm.generate(prompts, sampling_params)
-
- generated_text = outputs[0].outputs[0].text
- print(generated_text)
- ```
-
- vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
-
- ## Creation
-
- <details>
- <summary>Creation details</summary>
-
- This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from llmcompressor.modifiers.quantization import GPTQModifier
- from llmcompressor.transformers import oneshot
- from datasets import load_dataset
-
- # Load model
- model_stub = "Qwen/Qwen2.5-7B-Instruct"
- model_name = model_stub.split("/")[-1]
-
- num_samples = 3072
- max_seq_len = 8192
-
- tokenizer = AutoTokenizer.from_pretrained(model_stub)
-
- model = AutoModelForCausalLM.from_pretrained(
-     model_stub,
-     device_map="auto",
-     torch_dtype="auto",
- )
-
- # Render calibration chats with the model's own chat template
- def preprocess_fn(example):
-     return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
-
- ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
- ds = ds.map(preprocess_fn)
-
- # Configure the quantization algorithm and scheme
- recipe = GPTQModifier(
-     targets="Linear",
-     scheme="W4A16",
-     ignore=["lm_head"],
-     sequential_targets=["Qwen2DecoderLayer"],
-     dampening_frac=0.2,
- )
-
- # Apply quantization
- oneshot(
-     model=model,
-     dataset=ds,
-     recipe=recipe,
-     max_seq_length=max_seq_len,
-     num_calibration_samples=num_samples,
- )
-
- # Save to disk in compressed-tensors format
- save_path = model_name + "-quantized.w4a16"
- model.save_pretrained(save_path)
- tokenizer.save_pretrained(save_path)
- print(f"Model and tokenizer saved to: {save_path}")
- ```
-
- </details>
-
- ## Evaluation
-
- The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/387bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 387bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
- ```
- lm_eval \
-   --model vllm \
-   --model_args pretrained="neuralmagic/Qwen2.5-7B-Instruct-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
-   --apply_chat_template \
-   --fewshot_as_multiturn \
-   --tasks openllm \
-   --batch_size auto
- ```
-
- ### Accuracy
-
- #### Open LLM Leaderboard evaluation scores
- <table>
-   <tr>
-     <th>Benchmark</th>
-     <th>Qwen2.5-7B-Instruct</th>
-     <th>Qwen2.5-7B-Instruct-quantized.w4a16<br>(this model)</th>
-     <th>Recovery</th>
-   </tr>
-   <tr>
-     <td>MMLU (5-shot)</td>
-     <td>74.24</td>
-     <td>73.19</td>
-     <td>98.6%</td>
-   </tr>
-   <tr>
-     <td>ARC Challenge (25-shot)</td>
-     <td>63.40</td>
-     <td>63.23</td>
-     <td>99.7%</td>
-   </tr>
-   <tr>
-     <td>GSM-8K (5-shot, strict-match)</td>
-     <td>80.36</td>
-     <td>80.59</td>
-     <td>100.3%</td>
-   </tr>
-   <tr>
-     <td>Hellaswag (10-shot)</td>
-     <td>81.52</td>
-     <td>80.65</td>
-     <td>98.9%</td>
-   </tr>
-   <tr>
-     <td>Winogrande (5-shot)</td>
-     <td>74.66</td>
-     <td>74.19</td>
-     <td>99.4%</td>
-   </tr>
-   <tr>
-     <td>TruthfulQA (0-shot, mc2)</td>
-     <td>64.76</td>
-     <td>64.27</td>
-     <td>99.3%</td>
-   </tr>
-   <tr>
-     <td><strong>Average</strong></td>
-     <td><strong>73.16</strong></td>
-     <td><strong>72.69</strong></td>
-     <td><strong>98.6%</strong></td>
-   </tr>
- </table>
-
+ ---
+ license: apache-2.0
+ license_link: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ pipeline_tag: text-generation
+ base_model: Qwen/Qwen2.5-7B-Instruct
+ tags:
+ - chat
+ - neuralmagic
+ - llmcompressor
+ - int4
+ ---
+
+ # Qwen2.5-7B-Instruct-quantized.w4a16
+
+ ## Model Overview
+ - **Model Architecture:** Qwen2
+ - **Input:** Text
+ - **Output:** Text
+ - **Model Optimizations:**
+   - **Weight quantization:** INT4
+ - **Intended Use Cases:** Commercial and research use in multiple languages. Like [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
+ - **Release Date:** 04/16/2025
+ - **Version:** 1.0
+ - **License(s):** [apache-2.0](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE)
+ - **Model Developers:** Neural Magic
+
+ ### Model Optimizations
+
+ This model was obtained by quantizing the weights of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) to the INT4 data type.
+ This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.
+
+ Only the weights of the linear operators within transformer blocks are quantized.
+ Weights are quantized using a symmetric per-group scheme, with group size 128.
+ The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
+
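+ As a rough illustration of this storage scheme only (not the GPTQ algorithm itself, which additionally uses approximate second-order information to compensate quantization error), the toy snippet below quantizes a weight matrix with one scale per group of 128 values. The `fake_quantize_w4a16` name and the absmax/7 scale convention are illustrative assumptions, not llm-compressor code.
+
+ ```python
+ import torch
+
+ def fake_quantize_w4a16(weight: torch.Tensor, group_size: int = 128) -> torch.Tensor:
+     """Toy symmetric per-group INT4 quantize/dequantize round trip (illustrative only)."""
+     rows, cols = weight.shape  # assumes cols is divisible by group_size
+     w = weight.float().reshape(rows, cols // group_size, group_size)
+     # One scale per group of 128 weights; the group's max magnitude maps to level 7
+     scales = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
+     q = torch.clamp(torch.round(w / scales), -8, 7)  # 4-bit signed integer levels
+     return (q * scales).reshape(rows, cols).to(weight.dtype)
+
+ w = torch.randn(128, 512, dtype=torch.float16)
+ w_hat = fake_quantize_w4a16(w)
+ print(f"mean round-trip error: {(w - w_hat).abs().mean().item():.5f}")
+ ```
+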
+ ## Deployment
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/Qwen2.5-7B-Instruct-quantized.w4a16"
+ number_gpus = 1
+ max_model_len = 8192
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ messages = [
+     {"role": "user", "content": "Give me a short introduction to large language models."},
+ ]
+
+ # Render the chat template and append the assistant generation prompt
+ prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
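+ A minimal client sketch follows; the `vllm serve` invocation, the localhost URL and port 8000, and the `openai` client usage are illustrative assumptions, not part of this card:
+
+ ```python
+ # Start an OpenAI-compatible server first, e.g. from a shell:
+ #   vllm serve RedHatAI/Qwen2.5-7B-Instruct-quantized.w4a16 --max-model-len 8192
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ completion = client.chat.completions.create(
+     model="RedHatAI/Qwen2.5-7B-Instruct-quantized.w4a16",
+     messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
+     max_tokens=256,
+ )
+ print(completion.choices[0].message.content)
+ ```
+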
+ ## Creation
+
+ <details>
+ <summary>Creation details</summary>
+
+ This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from llmcompressor.modifiers.quantization import GPTQModifier
+ from llmcompressor.transformers import oneshot
+ from datasets import load_dataset
+
+ # Load model
+ model_stub = "Qwen/Qwen2.5-7B-Instruct"
+ model_name = model_stub.split("/")[-1]
+
+ num_samples = 3072
+ max_seq_len = 8192
+
+ tokenizer = AutoTokenizer.from_pretrained(model_stub)
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_stub,
+     device_map="auto",
+     torch_dtype="auto",
+ )
+
+ # Render calibration chats with the model's own chat template
+ def preprocess_fn(example):
+     return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
+
+ ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ ds = ds.map(preprocess_fn)
+
+ # Configure the quantization algorithm and scheme
+ recipe = GPTQModifier(
+     targets="Linear",
+     scheme="W4A16",
+     ignore=["lm_head"],
+     sequential_targets=["Qwen2DecoderLayer"],
+     dampening_frac=0.2,
+ )
+
+ # Apply quantization
+ oneshot(
+     model=model,
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=max_seq_len,
+     num_calibration_samples=num_samples,
+ )
+
+ # Save to disk in compressed-tensors format
+ save_path = model_name + "-quantized.w4a16"
+ model.save_pretrained(save_path)
+ tokenizer.save_pretrained(save_path)
+ print(f"Model and tokenizer saved to: {save_path}")
+ ```
+
+ </details>
+
+ ## Evaluation
+
+ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/387bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 387bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Qwen2.5-7B-Instruct-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
+   --apply_chat_template \
+   --fewshot_as_multiturn \
+   --tasks openllm \
+   --batch_size auto
+ ```
+
+ ### Accuracy
+
+ #### Open LLM Leaderboard evaluation scores
+ <table>
+   <tr>
+     <th>Benchmark</th>
+     <th>Qwen2.5-7B-Instruct</th>
+     <th>Qwen2.5-7B-Instruct-quantized.w4a16<br>(this model)</th>
+     <th>Recovery</th>
+   </tr>
+   <tr>
+     <td>MMLU (5-shot)</td>
+     <td>74.24</td>
+     <td>73.19</td>
+     <td>98.6%</td>
+   </tr>
+   <tr>
+     <td>ARC Challenge (25-shot)</td>
+     <td>63.40</td>
+     <td>63.23</td>
+     <td>99.7%</td>
+   </tr>
+   <tr>
+     <td>GSM-8K (5-shot, strict-match)</td>
+     <td>80.36</td>
+     <td>80.59</td>
+     <td>100.3%</td>
+   </tr>
+   <tr>
+     <td>Hellaswag (10-shot)</td>
+     <td>81.52</td>
+     <td>80.65</td>
+     <td>98.9%</td>
+   </tr>
+   <tr>
+     <td>Winogrande (5-shot)</td>
+     <td>74.66</td>
+     <td>74.19</td>
+     <td>99.4%</td>
+   </tr>
+   <tr>
+     <td>TruthfulQA (0-shot, mc2)</td>
+     <td>64.76</td>
+     <td>64.27</td>
+     <td>99.3%</td>
+   </tr>
+   <tr>
+     <td><strong>Average</strong></td>
+     <td><strong>73.16</strong></td>
+     <td><strong>72.69</strong></td>
+     <td><strong>98.6%</strong></td>
+   </tr>
+ </table>
+
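+ Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score, i.e. recovery = 100 × quantized / baseline. A quick check against the first row of the table:
+
+ ```python
+ # MMLU (5-shot) values from the table above
+ baseline, quantized = 74.24, 73.19
+ print(f"{100 * quantized / baseline:.1f}%")  # -> 98.6%
+ ```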