jiaxwang committed on
Commit
526f038
·
verified ·
1 Parent(s): 00cf3a8

Update README.md

  1. README.md +121 -2
README.md CHANGED
@@ -4,6 +4,125 @@ base_model:
4
  - zai-org/GLM-4.7
5
  ---
6
 
7
- # Disclaimer
8
 
9
- This model is provided for experimental purposes only. Its accuracy, stability, and suitability for deployment are not guaranteed. Users are advised to independently evaluate the model before any practical or production use.
4
  - zai-org/GLM-4.7
5
  ---
6
 
7
+ # Model Overview
8
 
9
+ - **Model Architecture:** GLM-4.7
10
+ - **Input:** Text
11
+ - **Output:** Text
12
+ - **Supported Hardware Microarchitecture:** AMD MI350/MI355
13
+ - **ROCm:** 7.0
14
+ - **Operating System(s):** Linux
15
+ - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
16
+ - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
17
+ - **moe**
18
+ - **Weight quantization:** MOE-only, OCP MXFP4, Static
19
+ - **Activation quantization:** MOE-only, OCP MXFP4, Dynamic
20
+ - **KV cache quantization:** OCP FP8, Static
21
+ - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
22
+
23
+ This model was built from the GLM-4.7 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.
24
+
25
+ # Model Quantization
26
+
27
+ The model was quantized from [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.
28
+ AMD-Quark was installed from source inside the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`.
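
To make the MXFP4 format mentioned above more concrete, here is a minimal pure-Python sketch of block quantization in the style of the OCP Microscaling (MX) format: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 number. This illustrates the format only and is not AMD-Quark's implementation. The function names are hypothetical, and ties are rounded toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (a sign bit is stored separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # MX formats share one scale per 32-element block
E2M1_EMAX = 2     # largest E2M1 exponent: 6.0 == 1.5 * 2**2


def quantize_block(block):
    """Return (shared_exponent, fp4_values) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the largest magnitude lands near the top of E2M1's range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # E8M0 scale range
    scale = 2.0 ** shared_exp
    elems = []
    for x in block:
        mag = min(abs(x) / scale, FP4_E2M1_VALUES[-1])  # clamp to the largest FP4 value
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
        elems.append(math.copysign(nearest, x))
    return shared_exp, elems


def dequantize_block(shared_exp, elems):
    """Reconstruct approximate floats from the shared exponent and FP4 values."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elems]


block = [48.0, 24.0, 8.0, -4.0] + [0.0] * (BLOCK_SIZE - 4)
shared_exp, elems = quantize_block(block)
restored = dequantize_block(shared_exp, elems)
```

In this example every input is exactly representable after scaling, so `restored` equals `block`; general inputs incur rounding error that depends on the block's dynamic range.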
29
+
30
+ **Quantization scripts:**
31
+
32
+ Step 1: Create `quantize_glm.py`
33
+ ```
34
+ import runpy
35
+ from quark.torch import LLMTemplate
36
+
37
+ # Register GLM-4 MoE template
38
+ glm4_moe_template = LLMTemplate(
39
+     model_type="glm4_moe",
39
+     kv_layers_name=["*k_proj", "*v_proj"],
40
+     q_layer_name="*q_proj",
41
+     exclude_layers_name=["lm_head","*mlp.gate","*self_attn*","*shared_experts.*","*mlp.down_proj","*mlp.gate_proj","*mlp.up_proj"],
43
+ )
44
+ LLMTemplate.register_template(glm4_moe_template)
45
+ print(f"[INFO]: Registered template '{glm4_moe_template.model_type}'")
46
+
47
+ # Run quantize_quark.py
48
+ # Get the absolute path to the quantize_quark.py script
49
+ quantize_script = "/app/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py"
50
+
51
+ runpy.run_path(quantize_script, run_name="__main__")
52
+ ```
53
+ Step 2: Quantize with `quantize_glm.py`
54
+ ```
55
+ export CUDA_VISIBLE_DEVICES=0,1,2,3
56
+ export MODEL_DIR=zai-org/GLM-4.7
57
+ export output_dir=amd/GLM-4.7-MXFP4
58
+
59
+ exclude_layers="*self_attn* *mlp.gate lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj *shared_experts.*"
60
+ python3 quantize_glm.py --model_dir $MODEL_DIR \
61
+ --quant_scheme mxfp4 \
62
+ --num_calib_data 128 \
63
+ --exclude_layers $exclude_layers \
64
+ --kv_cache_dtype fp8 \
65
+ --model_export hf_format \
66
+ --output_dir $output_dir \
67
+ --multi_gpu
68
+ ```
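
The exclude patterns above determine which layers stay unquantized. The sketch below shows how they would select layers, assuming the patterns are matched glob-style against full module names (an assumption about AMD-Quark's behavior, not confirmed by the source). The layer names are hypothetical examples of what a GLM-4 MoE checkpoint might contain.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quantization command above.
EXCLUDE_PATTERNS = [
    "*self_attn*", "*mlp.gate", "lm_head",
    "*mlp.gate_proj", "*mlp.up_proj", "*mlp.down_proj", "*shared_experts.*",
]


def is_excluded(name, patterns=EXCLUDE_PATTERNS):
    """True if a module name matches any exclude pattern (assumed glob semantics)."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Hypothetical module names, for illustration only.
layer_names = [
    "lm_head",
    "model.layers.0.mlp.gate_proj",               # dense MLP layer
    "model.layers.3.self_attn.q_proj",            # attention projection
    "model.layers.3.mlp.gate",                    # MoE router
    "model.layers.3.mlp.shared_experts.up_proj",  # shared expert
    "model.layers.3.mlp.experts.0.gate_proj",     # routed expert
    "model.layers.3.mlp.experts.0.down_proj",     # routed expert
]

quantized = [name for name in layer_names if not is_excluded(name)]
```

Under these assumptions only the routed-expert projections remain in `quantized`, which matches the "MOE-only" description in the model overview.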
69
+
70
+ # Deployment
71
+ ### Use with vLLM
72
+
73
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
74
+
75
+ ## Evaluation
76
+ The model was evaluated on the GSM8K benchmark.
77
+
78
+ ### Accuracy
79
+
80
+ <table>
81
+ <tr>
82
+ <td><strong>Benchmark</strong>
83
+ </td>
84
+ <td><strong>GLM-4.7 </strong>
85
+ </td>
86
+ <td><strong>GLM-4.7-MXFP4 (this model)</strong>
87
+ </td>
88
+ <td><strong>Recovery</strong>
89
+ </td>
90
+ </tr>
91
+ <tr>
92
+ <td>GSM8K
93
+ </td>
94
+ <td>94.16
95
+ </td>
96
+ <td>93.63
97
+ </td>
98
+ <td>99.44%
99
+ </td>
100
+ </tr>
101
+ </table>
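
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score; the table's figures are consistent with that reading. A minimal check, with a hypothetical helper name:

```python
def recovery_percent(baseline_score, quantized_score):
    """Quantized score as a percentage of the baseline score."""
    return quantized_score / baseline_score * 100.0


# GSM8K scores from the table above.
gsm8k_recovery = round(recovery_percent(94.16, 93.63), 2)
```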
102
+
103
+ ### Reproduction
104
+
105
+ The GSM8K results were obtained using the `lm-evaluation-harness` framework, based on the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`, with vLLM, lm-eval and amd-quark compiled and installed from source inside the image.
106
+
107
+ #### Launching the server
108
+ ```
109
+ vllm serve amd/GLM-4.7-MXFP4 \
110
+ --tensor-parallel-size 4 \
111
+ --tool-call-parser glm47 \
112
+ --reasoning-parser glm45 \
113
+ --enable-auto-tool-choice \
114
+ --kv-cache-dtype fp8
115
+ ```
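
Once the server is running, it can be queried through vLLM's OpenAI-compatible completions endpoint. The following is a minimal standard-library sketch, assuming the server listens on the default port 8000 on the local machine. The helper names and the prompt are illustrative and not part of the original README.

```python
import json
import urllib.request

# Assumed endpoint: vLLM's default port on the local machine.
BASE_URL = "http://localhost:8000/v1/completions"


def build_completion_request(prompt, model="amd/GLM-4.7-MXFP4",
                             max_tokens=64, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text (needs a live server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `complete("Question: What is 12 * 7?\nAnswer:")` would return the generated continuation as a string.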
116
+
117
+ #### Evaluating the model in a new terminal
118
+ ```
119
+ lm_eval \
120
+ --model local-completions \
121
+ --model_args "model=amd/GLM-4.7-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
122
+ --tasks gsm8k \
123
+ --num_fewshot 5 \
124
+ --batch_size 1
125
+ ```
126
+
127
+ # License
128
+ Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.