nm-research committed 9ca1a51 (verified) · Parent: 2719663

Update README.md

Files changed (1): README.md (+220 −3)
README.md: the previous front matter (`license: mit`) was replaced with the model card below.
---
tags:
- fp8
- vllm
language:
- en
- zh
pipeline_tag: text-generation
base_model: zai-org/GLM-4.6
---

# GLM-4.6-FP8-dynamic

## Model Overview
- **Model Architecture:** zai-org/GLM-4.6
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [zai-org/GLM-4.6](https://huggingface.co/zai-org/GLM-4.6).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [zai-org/GLM-4.6](https://huggingface.co/zai-org/GLM-4.6) to the FP8 data type, ready for inference with vLLM >= 0.11.0.

Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).

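As a mental model, dynamic FP8 (e4m3) activation quantization computes a scale from each tensor at runtime, rounds values onto the coarse FP8 grid, and dequantizes on the fly. The sketch below is illustrative only, not the LLM Compressor implementation: e4m3 rounding is approximated with a 4-bit-mantissa grid, and 448 is the largest finite e4m3 value.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def round_to_e4m3_grid(v: float) -> float:
    """Approximate e4m3 rounding: keep 1 implicit + 3 explicit mantissa bits."""
    if v == 0.0:
        return 0.0
    _, e = math.frexp(v)      # v = m * 2**e with 0.5 <= |m| < 1
    step = 2.0 ** (e - 4)     # spacing of the 4-bit-mantissa grid at this exponent
    return round(v / step) * step

def fp8_dynamic_quant_dequant(xs: list[float]) -> list[float]:
    """Dynamic quantization: the scale is derived from the tensor itself at runtime."""
    scale = max(abs(x) for x in xs) / FP8_E4M3_MAX
    return [round_to_e4m3_grid(x / scale) * scale for x in xs]

activations = [0.1, -1.5, 3.3, 4.0]
recovered = fp8_dynamic_quant_dequant(activations)
```

Because the scale tracks each tensor's own maximum, no calibration data is needed, which is why the recipe below runs data-free.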
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/GLM-4.6-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

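For example, the checkpoint can be served with the `vllm serve` CLI; the tensor-parallel size below mirrors the Python example above, and the default port 8000 is assumed:

```shell
# Launch an OpenAI-compatible server for this checkpoint (requires GPUs).
vllm serve RedHatAI/GLM-4.6-FP8-dynamic --tensor-parallel-size 4

# Query it via the standard chat-completions endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/GLM-4.6-FP8-dynamic", "messages": [{"role": "user", "content": "Who are you?"}]}'
```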
## Creation

This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor) with a data-free FP8 dynamic recipe, as presented in the code snippet below.

<details>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "zai-org/GLM-4.6"

# Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True, device_map=None
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Configure the quantization algorithm and scheme.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization.
# FP8_DYNAMIC uses data-free quantization, so no calibration dataset is needed.
oneshot(model=model, recipe=recipe, trust_remote_code_model=True)

# Save to disk in compressed-tensors format.
SAVE_DIR = "./" + MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
</details>

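After saving, the quantization scheme is recorded in the checkpoint's `config.json` under `quantization_config`. The fragment below is an illustrative sketch of reading such a config back; the field names follow the compressed-tensors convention, and the values are assumptions, not copied from the actual checkpoint:

```python
# Illustrative quantization_config fragment (assumed values, not the real checkpoint).
config = {
    "quantization_config": {
        "format": "float-quantized",
        "config_groups": {
            "group_0": {
                "targets": ["Linear"],
                "weights": {"num_bits": 8, "type": "float", "dynamic": False},
                "input_activations": {"num_bits": 8, "type": "float", "dynamic": True},
            }
        },
        "ignore": ["lm_head"],
    }
}

def summarize(cfg: dict) -> str:
    """Summarize the weight/activation scheme from a quantization_config dict."""
    q = cfg["quantization_config"]
    g = q["config_groups"]["group_0"]
    w, a = g["weights"], g["input_activations"]
    return (f"W{w['num_bits']}A{a['num_bits']} float, "
            f"dynamic activations: {a['dynamic']}, ignored: {q['ignore']}")

print(summarize(config))
```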
## Evaluation

This model was evaluated on well-known text benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness). The reasoning evals were done using [lighteval](https://github.com/neuralmagic/lighteval).

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>zai-org/GLM-4.6-FP8</th>
      <th>RedHatAI/GLM-4.6-FP8-dynamic (this model)</th>
      <th>Recovery</th>
    </tr>
  </thead>
  <tbody>
    <!-- OpenLLM V1 -->
    <tr>
      <td rowspan="2"><b>Leaderboard</b></td>
      <td>MMLU Pro</td>
      <td>50.65%</td>
      <td>50.25%</td>
      <td>99.21%</td>
    </tr>
    <tr>
      <td>IFEVAL</td>
      <td>91.97%</td>
      <td>92.69%</td>
      <td>100.78%</td>
    </tr>
    <tr>
      <td rowspan="3"><b>Reasoning</b></td>
      <td>AIME25</td>
      <td>96.67%</td>
      <td>93.33%</td>
      <td>96.54%</td>
    </tr>
    <tr>
      <td>Math-500 (0-shot)</td>
      <td>88.80%</td>
      <td>90.40%</td>
      <td>101.80%</td>
    </tr>
    <tr>
      <td>GPQA (Diamond, 0-shot)</td>
      <td>81.82%</td>
      <td>77.78%</td>
      <td>95.06%</td>
    </tr>
  </tbody>
</table>
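The Recovery column is simply the quantized model's score expressed as a percentage of the baseline score:

```python
def recovery(quantized: float, baseline: float) -> float:
    """Recovery = quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline

# MMLU Pro row from the table above: 50.25 (quantized) vs 50.65 (baseline).
print(f"{recovery(50.25, 50.65):.2f}%")  # → 99.21%
```

Values above 100% (e.g. IFEVAL, Math-500) mean the quantized model scored slightly higher than the baseline on that benchmark, which is within normal run-to-run variance.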
### Reproduction

The results were obtained using the following commands:

<details>

#### Leaderboard

```
lm_eval --model local-chat-completions \
  --tasks mmlu_pro \
  --model_args "model=RedHatAI/GLM-4.6-FP8-dynamic,max_length=90000,base_url=http://0.0.0.0:3758/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path ./ \
  --seed 42 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,max_gen_toks=64000"

lm_eval --model local-chat-completions \
  --tasks leaderboard_ifeval \
  --model_args "model=RedHatAI/GLM-4.6-FP8-dynamic,max_length=90000,base_url=http://0.0.0.0:3758/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path ./ \
  --seed 42 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,max_gen_toks=64000"
```

#### Reasoning

litellm_config.yaml:

```
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/redhatai-glm-4.6-FP8-dynamic"
  base_url: "http://0.0.0.0:3759/v1"
  api_key: ""
  timeout: 3600
  concurrent_requests: 128
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 131072
    top_p: 0.95
    seed: 0
```

```
lighteval endpoint litellm litellm_config.yaml \
  "aime25|0,math_500|0,gpqa:diamond|0" \
  --output-dir ./ \
  --save-details
```

</details>