---
license: mit
base_model:
- zai-org/GLM-5
---
# Model Overview
- **Model Architecture:** GLM-5
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11.1)
- **Weight quantization:** MOE-only, OCP MXFP4, Static
- **Activation quantization:** MOE-only, OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
This model was built from GLM-5 by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) MXFP4 quantization.
# Model Quantization
The model was quantized from [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.
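To illustrate what MXFP4 means numerically, the sketch below quantizes one block of values the way OCP microscaling formats do: each block shares a single power-of-two scale (E8M0), and each element is stored as a 4-bit FP4 (E2M1) code. This is a minimal educational sketch of the format, not AMD-Quark's actual kernel or packing logic.

```python
import math

# Representable magnitudes of FP4 E2M1 (the sign bit is handled separately).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize one block of floats to MXFP4-style values:
    a shared power-of-two scale plus FP4 E2M1 elements.
    Returns (shared_exponent, dequantized_values)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Shared exponent aligns the block maximum with E2M1's
    # largest exponent (2, since its max magnitude is 6 = 1.5 * 2^2).
    shared_exp = math.floor(math.log2(max_abs)) - 2
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)          # clamp to E2M1 range
        q = min(FP4_E2M1, key=lambda c: abs(c - mag))  # round to nearest code
        out.append(math.copysign(q * scale, v))
    return shared_exp, out

print(quantize_mxfp4_block([0.1, -0.7, 3.0, 6.0]))  # → (0, [0.0, -0.5, 3.0, 6.0])
```

In the real format the 32-element block shares one 8-bit exponent, so small values in a block dominated by a large outlier lose precision; this is why per-block scaling matters for MoE expert weights.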
**Quantization scripts:**
```python
from quark.torch import LLMTemplate, ModelQuantizer

# --- Register GLM-5 template ---
GLM5_template = LLMTemplate(
    model_type="glm_moe_dsa",
    kv_layers_name=["*kv_a_proj_with_mqa", "*kv_b_proj"],
    q_layer_name="*q_a_proj",
    exclude_layers_name=["lm_head"],
)
LLMTemplate.register_template(GLM5_template)
print(f"[INFO]: Registered template '{GLM5_template.model_type}'")

# --- Configuration ---
model_dir = "zai-org/GLM-5"
output_dir = "amd/GLM-5-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "*self_attn*",
    "*mlp.gate",
    "*lm_head",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
    "*shared_experts*",
]

# --- Build quant config from template ---
template = LLMTemplate.get("glm_moe_dsa")
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# --- File-to-file quantization (memory-efficient, no full model loading) ---
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=model_dir,
    save_path=output_dir,
)
print(f"[INFO]: Quantization complete. Output saved to {output_dir}")
```
# Deployment
## Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
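As a minimal sketch, the quantized checkpoint can be served like any other vLLM model; the flags here are illustrative, and the Reproduction section below lists the exact environment and flags used during evaluation.

```shell
vllm serve amd/GLM-5-MXFP4 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

This exposes an OpenAI-compatible endpoint on port 8000 by default.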
# Evaluation
The model was evaluated on the GSM8K benchmark.
### Accuracy
<table>
<tr>
<td><strong>Benchmark</strong>
</td>
<td><strong>GLM-5</strong>
</td>
<td><strong>GLM-5-MXFP4 (this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>
</tr>
<tr>
<td>GSM8K (flexible-extract)
</td>
<td>95.45
</td>
<td>95.00
</td>
<td>99.53%
</td>
</tr>
</table>
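Recovery is the quantized model's score expressed as a percentage of the baseline score. For the table above:

```python
baseline = 95.45   # GSM8K flexible-extract, original GLM-5
quantized = 95.00  # this MXFP4 checkpoint
recovery = quantized / baseline * 100
print(f"{recovery:.2f}%")  # → 99.53%
```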
### Reproduction
The GSM8K results were obtained using the `lm-evaluation-harness` framework, based on the Docker image `rocm/pytorch-private:vllm_glm5_0225`, with vLLM and lm-eval compiled and installed from source inside the image.
The Docker image contains the necessary vLLM code modifications to support this model.
#### Launching server
```shell
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

vllm serve amd/GLM-5-MXFP4 \
    -tp 8 \
    --block-size 1 \
    --trust-remote-code \
    --max-model-len 4096
```
#### Evaluating model in a new terminal
```shell
lm_eval \
    --model local-completions \
    --model_args '{"model": "amd/GLM-5-MXFP4", "base_url": "http://localhost:8000/v1/completions", "num_concurrent": 32, "max_retries": 10, "max_gen_toks": 2048, "tokenizer_backend": "None", "tokenized_requests": "False"}' \
    --tasks gsm8k \
    --batch_size auto \
    --num_fewshot 5 \
    --trust_remote_code
```
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.