---
license: mit
base_model:
- zai-org/GLM-5
---
# Model Overview

- **Model Architecture:** GLM-5
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11.1)
  - **Weight quantization:** MOE-only, OCP MXFP4, Static
  - **Activation quantization:** MOE-only, OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from GLM-5 by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

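The model was quantized from [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Quantization is MoE-only: the weights and activations of the MoE expert layers are quantized to MXFP4, and the layers matched by `exclude_layers` in the script below are left unquantized.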
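
For readers unfamiliar with the format, the sketch below illustrates block-scaled FP4 quantization along the lines of the OCP Microscaling (MX) specification: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 number. It is a plain-Python illustration and not AMD-Quark's implementation. The function names are hypothetical, ties are rounded toward the smaller magnitude for simplicity, and the clamping of the shared exponent to the E8M0 range is omitted.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements sharing one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exp, fp4_values) for one block of floats (illustrative)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    _, e = math.frexp(max_abs)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    fp4_values = []
    for v in block:
        scaled = abs(v) / scale
        # Nearest representable magnitude; values above 6 saturate to 6.
        nearest = min(FP4_E2M1_VALUES, key=lambda q: abs(q - scaled))
        fp4_values.append(math.copysign(nearest, v))
    return shared_exp, fp4_values


def dequantize_block(shared_exp, fp4_values):
    """Reconstruct floats from the shared exponent and the FP4 values."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in fp4_values]


block = [1.0, 0.25, -0.75] + [0.0] * (BLOCK_SIZE - 3)
shared_exp, fp4_values = quantize_block(block)
restored = dequantize_block(shared_exp, fp4_values)
```

In this picture, "static" weight quantization means the scales are computed once when the checkpoint is written, while "dynamic" activation quantization means they are computed per block at inference time.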

**Quantization script:**

```python
from quark.torch import LLMTemplate, ModelQuantizer

# --- Register GLM-5 template ---
GLM5_template = LLMTemplate(
    model_type="glm_moe_dsa",
    kv_layers_name=["*kv_a_proj_with_mqa", "*kv_b_proj"],
    q_layer_name="*q_a_proj",
    exclude_layers_name=["lm_head"],
)
LLMTemplate.register_template(GLM5_template)
print(f"[INFO]: Registered template '{GLM5_template.model_type}'")

# --- Configuration ---
model_dir = "zai-org/GLM-5"
output_dir = "amd/GLM-5-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "*self_attn*",
    "*mlp.gate",
    "*lm_head",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
    "*shared_experts*",
]

# --- Build quant config from template ---
template = LLMTemplate.get("glm_moe_dsa")
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# --- File-to-file quantization (memory-efficient, no full model loading) ---
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=model_dir,
    save_path=output_dir,
)

print(f"[INFO]: Quantization complete. Output saved to {output_dir}")

```
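
To see which modules the `exclude_layers` patterns leave for quantization, the sketch below applies them to a handful of module names. Two assumptions are made here: that Quark matches these patterns glob-style (emulated with Python's `fnmatchcase`), and that the module names, which are hypothetical examples written in the style of DeepSeek-like MoE checkpoints, resemble those in the GLM-5 checkpoint.

```python
from fnmatch import fnmatchcase

# Same patterns as in the quantization script above.
exclude_layers = [
    "*self_attn*",
    "*mlp.gate",
    "*lm_head",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
    "*shared_experts*",
]

# Hypothetical module names, for illustration only.
module_names = [
    "model.layers.0.self_attn.q_a_proj",          # attention
    "model.layers.0.mlp.gate_proj",               # dense MLP
    "model.layers.5.mlp.gate",                    # MoE router
    "model.layers.5.mlp.shared_experts.up_proj",  # shared expert
    "model.layers.5.mlp.experts.0.gate_proj",     # routed expert
    "lm_head",
]


def is_excluded(name, patterns=exclude_layers):
    """True if any exclude pattern matches the whole module name."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


quantized = [name for name in module_names if not is_excluded(name)]
print(quantized)
```

Under these assumptions only the routed-expert projection survives the filter, which matches the "MOE-only" description in the overview.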

# Deployment
## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
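
Once a server is running (for example with the `vllm serve` command in the Reproduction section), it can be queried through vLLM's OpenAI-compatible completions endpoint. The following is a minimal client sketch using only the standard library. The URL assumes the default port 8000 on the local machine, and the helper names `build_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumption: the server listens on the default vLLM port on this machine.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "amd/GLM-5-MXFP4"


def build_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server up, `complete("The capital of France is")` returns the model's continuation as a string.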

## Evaluation
The model was evaluated on the GSM8K benchmark.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>GLM-5</strong>
   </td>
   <td><strong>GLM-5-MXFP4 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>GSM8K (flexible-extract)
   </td>
   <td>95.45
   </td>
   <td>95.00
   </td>
   <td>99.53%
   </td>
  </tr>
</table>
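
The Recovery figure appears to be the quantized model's score expressed as a percentage of the baseline score. The short check below reproduces it from the two scores in the table; the function name is hypothetical.

```python
def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100.0


gsm8k_recovery = recovery_percent(baseline=95.45, quantized=95.00)
print(f"{gsm8k_recovery:.2f}%")  # prints 99.53%
```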

### Reproduction

The GSM8K results were obtained using the `lm-evaluation-harness` framework in the Docker image `rocm/pytorch-private:vllm_glm5_0225`, with vLLM and lm-eval compiled and installed from source inside the image.
The Docker image contains the necessary vLLM code modifications to support this model.

#### Launching server
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
vllm serve amd/GLM-5-MXFP4 \
  -tp 8 \
  --block-size 1 \
  --trust-remote-code \
  --max-model-len 4096
```

#### Evaluating model in a new terminal
```bash
lm_eval \
  --model local-completions \
  --model_args '{"model": "amd/GLM-5-MXFP4", "base_url": "http://localhost:8000/v1/completions", "num_concurrent": 32, "max_retries": 10, "max_gen_toks": 2048, "tokenizer_backend":"None","tokenized_requests":"False" }' \
  --tasks gsm8k \
  --batch_size auto \
  --num_fewshot 5 \
  --trust_remote_code
```

# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.