---
license: mit
base_model:
- zai-org/GLM-4.7
---

# Model Overview

- **Model Architecture:** GLM-4.7
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
  - **MoE**
    - **Weight quantization:** MOE-only, OCP MXFP4, Static
    - **Activation quantization:** MOE-only, OCP MXFP4, Dynamic
  - **KV cache quantization:** OCP FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the GLM-4.7 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) MXFP4 quantization.

# Model Quantization

The model was quantized from [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Only the MoE expert weights and activations are quantized to MXFP4; the KV cache is quantized to FP8.
AMD-Quark was installed from source inside the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`.
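
To make the MXFP4 format concrete, here is a minimal pure-Python sketch of block quantization as described by the OCP Microscaling (MX) format: 32 elements share one power-of-two scale, and each element is stored as an FP4 (E2M1) value. The function names are hypothetical, ties round toward the smaller magnitude, and the limited range of the shared exponent is ignored; AMD-Quark's actual kernels may choose scales and round differently.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements sharing one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 magnitude (6.0 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exponent, quantized_elements) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale that maps the largest magnitude near the top of the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        # Clamp to the largest E2M1 magnitude, then snap to the nearest grid value.
        mag = min(abs(v) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Reconstruct approximate floats from the shared exponent and FP4 values."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


# Example: one block of BLOCK_SIZE values, mostly zeros.
example = [6.0, 3.0, -1.0, 0.4] + [0.0] * (BLOCK_SIZE - 4)
exp, q = quantize_block(example)
print(exp, q[:4], dequantize_block(exp, q)[:4])
```

With static weight quantization the scales are computed once, offline; with dynamic activation quantization the same per-block computation runs at inference time on each activation tensor.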

**Quantization scripts:**

Note that GLM-4.7 is not in the built-in model template list of Quark V0.11, so it has to be registered before quantization.

- **Step 1:** Register the model template by creating the file `Quark/examples/torch/language_modeling/llm_ptq/quantize_glm.py`:
```
import runpy
from quark.torch import LLMTemplate

# Register GLM-4 MoE template
glm4_moe_template = LLMTemplate(
    model_type="glm4_moe",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=["lm_head","*mlp.gate","*self_attn*","*shared_experts.*","*mlp.down_proj","*mlp.gate_proj","*mlp.up_proj"],
)
LLMTemplate.register_template(glm4_moe_template)
print(f"[INFO]: Registered template '{glm4_moe_template.model_type}'")

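# Run quantize_quark.py as if it were invoked directly, so the command-line
# arguments passed to this script are forwarded to it.
# The hard-coded path assumes Quark is located at /app/Quark inside the Docker image.
quantize_script = "/app/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py"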

runpy.run_path(quantize_script, run_name="__main__")
```
- **Step 2:** Run the quantization with `quantize_glm.py`:
```
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL_DIR=zai-org/GLM-4.7
export output_dir=amd/GLM-4.7-MXFP4

exclude_layers="*self_attn* *mlp.gate lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj *shared_experts.*"
python3 quantize_glm.py --model_dir $MODEL_DIR \
                        --quant_scheme mxfp4 \
                        --num_calib_data 128 \
                        --exclude_layers $exclude_layers \
                        --kv_cache_dtype fp8 \
                        --model_export hf_format \
                        --output_dir $output_dir \
                        --multi_gpu
```
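
The exclusion patterns above determine which layers stay in their original precision. The sketch below shows how such wildcard patterns could select layers, using Python's `fnmatch` as a stand-in for Quark's own matching (an assumption; Quark's matching rules may differ). The layer names are hypothetical examples modeled on common Hugging Face naming, not names read from the GLM-4.7 checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the template registration above.
EXCLUDE_PATTERNS = [
    "lm_head", "*mlp.gate", "*self_attn*", "*shared_experts.*",
    "*mlp.down_proj", "*mlp.gate_proj", "*mlp.up_proj",
]


def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """Return True if the layer name matches any exclusion pattern."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


# Hypothetical layer names, for illustration only.
layer_names = [
    "lm_head",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.3.mlp.experts.7.down_proj",
    "model.layers.3.mlp.experts.7.gate_proj",
]

for name in layer_names:
    status = "excluded" if is_excluded(name) else "quantized to MXFP4"
    print(f"{name}: {status}")
```

Under these assumptions only the routed expert projections fall through every pattern, which is consistent with the MoE-only quantization listed in the model overview.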

# Deployment
### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
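
Once a server is running (for example, started with the `vllm serve` command shown under "Launching server"), it exposes an OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `localhost` on vLLM's default port 8000; the helper names `build_payload` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumptions: the server listens on vLLM's default port 8000 on this machine
# and serves the model under the name passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL_NAME = "amd/GLM-4.7-MXFP4"


def build_payload(prompt, max_tokens=128, temperature=0.0):
    """Build the JSON body for an OpenAI-style text completion request."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, url=BASE_URL, timeout=600):
    """POST the prompt to the server and return the first generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    return result["choices"][0]["text"]
```

Calling `complete("Question: What is 12 * 7? Answer:")` while the server is up returns the generated continuation as a string.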

## Evaluation
The model was evaluated on the GSM8K benchmark.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>GLM-4.7</strong>
   </td>
   <td><strong>GLM-4.7-MXFP4 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>GSM8K (strict-match)
   </td>
   <td>94.16
   </td>
   <td>93.63
   </td>
   <td>99.44%
   </td>
  </tr>
</table>
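
The Recovery column is the quantized model's score expressed as a percentage of the baseline score. A short check of the arithmetic, using a hypothetical helper name:

```python
def recovery_percent(baseline_score, quantized_score):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized_score / baseline_score


# GSM8K (strict-match) scores from the table above.
print(f"{recovery_percent(94.16, 93.63):.2f}%")  # prints "99.44%"
```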

### Reproduction

The GSM8K results were obtained with the `lm-evaluation-harness` framework inside the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`, in which vLLM, lm-eval, and amd-quark were compiled and installed from source.

#### Launching server
```
vllm serve amd/GLM-4.7-MXFP4 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --kv-cache-dtype fp8
```

#### Evaluating model in a new terminal
```
lm_eval \
  --model local-completions \
  --model_args "model=amd/GLM-4.7-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1
```

# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.