nm-research committed on
Commit fc920d8 · verified · 1 Parent(s): a9ee80e

Update README.md

Files changed (1): README.md (+263, −3)
---
tags:
- w4a16
- vllm
language:
- en
- zh
pipeline_tag: text-generation
base_model: zai-org/GLM-4.6
---

# GLM-4.6-quantized.w4a16

## Model Overview
- **Model Architecture:** zai-org/GLM-4.6
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16 (not quantized)
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English and Chinese.
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [zai-org/GLM-4.6](https://huggingface.co/zai-org/GLM-4.6).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights of [zai-org/GLM-4.6](https://huggingface.co/zai-org/GLM-4.6) to the INT4 data type, ready for inference with vLLM>=0.11.0.

Only the weights of the linear operators within the transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
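To build intuition for the W4A16 scheme, here is a minimal sketch of symmetric group-wise INT4 weight quantization. This is an illustration only: the actual model was produced with GPTQ via LLM Compressor (see the Creation section), which chooses the quantized values more carefully; the group size and weights below are made-up examples.

```python
GROUP_SIZE = 4  # real schemes typically use larger groups (e.g. 128); small here for clarity

def quantize_group(weights):
    """Symmetric round-to-nearest INT4 quantization of one group of weights."""
    scale = max(abs(w) for w in weights) / 7  # symmetric INT4 range is [-8, 7]
    if scale == 0:
        return [0] * len(weights), 0.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate FP weights from INT4 codes and the group scale."""
    return [v * scale for v in q]

weights = [0.12, -0.40, 0.33, 0.05]  # one illustrative group
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
print(q)         # 4-bit integer codes
print(restored)  # approximate reconstruction of the original weights
```

Each group stores only the 4-bit codes plus one scale, which is where the ~4x weight-memory reduction over 16-bit comes from; activations stay in 16-bit floating point at inference time.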

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/GLM-4.6-quantized.w4a16"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

## Creation

This model was created by applying a script similar to [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/glm4_7_example.py), as presented in the code snippet below.

<details>

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "zai-org/GLM-4.6"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme with explicit parameters.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
    ],
)

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    pipeline="sequential",
    sequential_targets=["Glm4MoeDecoderLayer"],
    trust_remote_code_model=True,
)

SAVE_DIR = "./" + MODEL_ID.rstrip("/").split("/")[-1] + "-quantized.w4a16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

</details>
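The recipe's `ignore` list mixes an exact module name (`lm_head`) with a regex: in LLM Compressor recipes, entries prefixed with `re:` are treated as regular expressions, so `re:.*mlp.gate$` keeps the MoE router projections unquantized. A rough sketch of that matching logic follows; the module names are hypothetical examples and this is not the library's actual implementation.

```python
import re

# Patterns from the recipe: "lm_head" matches by exact name; the "re:"
# prefix marks a regular expression (here, any module ending in "mlp.gate").
ignore = ["lm_head", r"re:.*mlp.gate$"]

def is_ignored(module_name: str) -> bool:
    """Sketch of ignore-list matching (illustrative, not library code)."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False

# Hypothetical module names for illustration:
print(is_ignored("lm_head"))                       # True  (exact match)
print(is_ignored("model.layers.3.mlp.gate"))       # True  (router, skipped)
print(is_ignored("model.layers.3.mlp.gate_proj"))  # False (quantized)
```

Skipping the router and the output head is a common MoE quantization choice: both are small relative to the expert weights but sensitive to precision loss.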

## Evaluation

This model was evaluated on well-known text benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness). The reasoning evals were done using [lighteval](https://github.com/neuralmagic/lighteval).

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>zai-org/GLM-4.6-FP8</th>
      <th>RedHatAI/GLM-4.6-quantized.w4a16 (this model)</th>
      <th>Recovery</th>
    </tr>
  </thead>
  <tbody>
    <!-- OpenLLM V1 -->
    <tr>
      <td rowspan="2"><b>Leaderboard</b></td>
      <td>MMLU Pro</td>
      <td>50.65%</td>
      <td>53.22%</td>
      <td>105.07%</td>
    </tr>
    <tr>
      <td>IFEVAL</td>
      <td>91.97%</td>
      <td>92.21%</td>
      <td>100.26%</td>
    </tr>
    <tr>
      <td rowspan="3"><b>Reasoning</b></td>
      <td>AIME25</td>
      <td>96.67%</td>
      <td>90.00%</td>
      <td>93.10%</td>
    </tr>
    <tr>
      <td>Math-500 (0-shot)</td>
      <td>88.80%</td>
      <td>88.00%</td>
      <td>99.10%</td>
    </tr>
    <tr>
      <td>GPQA (Diamond, 0-shot)</td>
      <td>81.82%</td>
      <td>80.30%</td>
      <td>98.14%</td>
    </tr>
  </tbody>
</table>
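The Recovery column is simply the quantized model's score expressed as a percentage of the baseline score, rounded to two decimals. A short sketch reproducing it from the table's numbers:

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return round(100 * quantized / baseline, 2)

# (baseline, quantized) pairs from the accuracy table above
scores = {
    "MMLU Pro": (50.65, 53.22),
    "IFEVAL":   (91.97, 92.21),
    "AIME25":   (96.67, 90.00),
    "Math-500": (88.80, 88.00),
    "GPQA":     (81.82, 80.30),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.2f}%")
```

Values above 100% (as for MMLU Pro here) mean the quantized model scored higher than the baseline on that benchmark, which can happen within normal evaluation noise.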

### Reproduction

The results were obtained using the following commands:

<details>

#### Leaderboard

```
lm_eval --model local-chat-completions \
  --tasks mmlu_pro \
  --model_args "model=RedHatAI/GLM-4.6-quantized.w4a16,max_length=90000,base_url=http://0.0.0.0:3758/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path ./ \
  --seed 42 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,max_gen_toks=64000"

lm_eval --model local-chat-completions \
  --tasks leaderboard_ifeval \
  --model_args "model=RedHatAI/GLM-4.6-quantized.w4a16,max_length=90000,base_url=http://0.0.0.0:3758/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path ./ \
  --seed 42 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,max_gen_toks=64000"
```

#### Reasoning

`litellm_config.yaml`:

```yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/redhatai-glm-4.6-W4A16"
  base_url: "http://0.0.0.0:3759/v1"
  api_key: ""
  timeout: 3600
  concurrent_requests: 128
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 131072
    top_p: 0.95
    seed: 0
```

```
lighteval endpoint litellm litellm_config.yaml \
  "aime25|0,math_500|0,gpqa:diamond|0" \
  --output-dir ./ \
  --save-details
```

</details>