---
license: mit
base_model:
- zai-org/GLM-4.7
---
# Model Overview
- **Model Architecture:** GLM-4.7
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
- **Quantization:**
  - **Weight quantization:** MoE-only, OCP MXFP4, Static
  - **Activation quantization:** MoE-only, OCP MXFP4, Dynamic
  - **KV cache quantization:** OCP FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the GLM-4.7 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) MXFP4 quantization.
# Model Quantization
The model was quantized from [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.
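In the OCP MX formats, each 32-element block shares a single power-of-two (E8M0) scale and stores elements as FP4 E2M1 values. A simplified sketch of the per-block quantization step (illustrative only, not Quark's actual kernel):

```python
import numpy as np

# FP4 E2M1 representable magnitudes (1 sign, 2 exponent, 1 mantissa bit)
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Quantize one 32-element block: shared power-of-two scale + FP4 values."""
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block), 1.0
    # Shared E8M0 scale per the OCP MX spec: floor(log2(amax)) minus the
    # element format's max exponent (2 for E2M1, whose largest value is 6)
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = block / scale
    # Round each element to the nearest representable FP4 magnitude
    idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_E2M1[idx], scale

rng = np.random.default_rng(0)
x = rng.standard_normal(32).astype(np.float32)
q, s = quantize_mxfp4_block(x)
print(f"max abs error: {np.abs(q * s - x).max():.4f} (scale = {s})")
```

Values above the FP4 range saturate to ±6 after scaling; the activation path works the same way but computes its scales dynamically at runtime.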
AMD-Quark was installed from source inside the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`.
**Quantization scripts:**
Note that GLM-4.7 is not in Quark V0.11's built-in model template list, so it must be registered before quantization.
- **Step 1:** Register the model template: create the file `Quark/examples/torch/language_modeling/llm_ptq/quantize_glm.py`
```python
import runpy
from quark.torch import LLMTemplate

# Register the GLM-4 MoE template
glm4_moe_template = LLMTemplate(
    model_type="glm4_moe",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=[
        "lm_head", "*mlp.gate", "*self_attn*", "*shared_experts.*",
        "*mlp.down_proj", "*mlp.gate_proj", "*mlp.up_proj",
    ],
)
LLMTemplate.register_template(glm4_moe_template)
print(f"[INFO]: Registered template '{glm4_moe_template.model_type}'")

# Hand control to the stock quantization entry point
quantize_script = "/app/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py"
runpy.run_path(quantize_script, run_name="__main__")
```
- **Step 2:** Quantize with `quantize_glm.py`:
```shell
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL_DIR=zai-org/GLM-4.7
export OUTPUT_DIR=amd/GLM-4.7-MXFP4
exclude_layers="*self_attn* *mlp.gate lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj *shared_experts.*"

python3 quantize_glm.py --model_dir $MODEL_DIR \
    --quant_scheme mxfp4 \
    --num_calib_data 128 \
    --exclude_layers $exclude_layers \
    --kv_cache_dtype fp8 \
    --model_export hf_format \
    --output_dir $OUTPUT_DIR \
    --multi_gpu
```
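The `--exclude_layers` patterns are shell-style globs matched against full layer names; any layer that matches is left unquantized, so only the per-expert MoE projections receive MXFP4. A minimal sketch of that selection logic (the layer names are illustrative, assuming the GLM-4 MoE naming scheme):

```python
from fnmatch import fnmatch

# Illustrative layer names (an assumption about the GLM-4 MoE naming scheme)
layers = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.experts.0.gate_proj",
    "model.layers.3.mlp.shared_experts.down_proj",
    "lm_head",
]

exclude = [
    "*self_attn*", "*mlp.gate", "lm_head",
    "*mlp.gate_proj", "*mlp.up_proj", "*mlp.down_proj",
    "*shared_experts.*",
]

# A layer is quantized only if no exclude pattern matches its name
quantized = [n for n in layers if not any(fnmatch(n, p) for p in exclude)]
print(quantized)
```

Note that `*mlp.gate_proj` matches the dense MLP projections but not the per-expert `mlp.experts.N.gate_proj` layers, which is what keeps the quantization MoE-only.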
# Deployment
## Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
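Once a server is running (see the launch command under Reproduction below), it exposes vLLM's OpenAI-compatible API. A minimal sketch of a chat request, assuming the default port 8000:

```python
import json

# Hypothetical request payload for the OpenAI-compatible chat endpoint,
# assuming the server from the Reproduction section is listening on port 8000
payload = {
    "model": "amd/GLM-4.7-MXFP4",
    "messages": [{"role": "user", "content": "What is 12 * 7?"}],
    "max_tokens": 256,
    "temperature": 0.6,
}

# POST this to http://localhost:8000/v1/chat/completions, e.g. with requests:
#   import requests
#   r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
#   print(r.json()["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```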
## Evaluation
The model was evaluated on the GSM8K benchmark.
### Accuracy
<table>
<tr>
<td><strong>Benchmark</strong>
</td>
<td><strong>GLM-4.7 </strong>
</td>
<td><strong>GLM-4.7-MXFP4 (this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>
</tr>
<tr>
<td>GSM8K (strict-match)
</td>
<td>94.16
</td>
<td>93.63
</td>
<td>99.44%
</td>
</tr>
</table>
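Recovery here is the quantized score expressed as a percentage of the baseline GLM-4.7 score:

```python
baseline = 94.16   # GLM-4.7, GSM8K strict-match
quantized = 93.63  # GLM-4.7-MXFP4 (this model)

recovery = 100 * quantized / baseline
print(f"{recovery:.2f}%")  # → 99.44%
```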
### Reproduction
The GSM8K results were obtained using the `lm-evaluation-harness` framework, based on the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`, with vLLM, lm-eval and amd-quark compiled and installed from source inside the image.
#### Launching server
```shell
vllm serve amd/GLM-4.7-MXFP4 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --kv-cache-dtype fp8
```
#### Evaluating the model in a new terminal
```shell
lm_eval \
    --model local-completions \
    --model_args "model=amd/GLM-4.7-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 1
```
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.