| | --- |
| | license: mit |
| | --- |
| | # Model Overview |
| |
|
| | - **Model Architecture:** GLM-5 |
| | - **Input:** Text |
| | - **Output:** Text |
| | - **Supported Hardware Microarchitecture:** AMD MI350/MI355 |
| | - **ROCm:** 7.1.0 |
| | - **Operating System(s):** Linux |
| | - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/) |
| | - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11.1) |
| | - **MoE:** |
| |   - **Weight quantization:** MoE-only, OCP MXFP4, Static |
| |   - **Activation quantization:** MoE-only, OCP MXFP4, Dynamic |
| | - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup) |
| |
|
| | This model was built from the GLM-5 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization. |
| |
|
| | # Model Quantization |
| |
|
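| | The model was quantized from [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Only the MoE expert layers have their weights and activations quantized to MXFP4; the attention, dense MLP, shared-expert, router gate, and `lm_head` layers are excluded from quantization. |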
| |
|
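To make the target format concrete, here is a minimal pure-Python sketch of MXFP4 "fake quantization" for a single block. It relies on the OCP Microscaling (MX) definition: 32 elements share one power-of-two scale (E8M0), and each element is an FP4 E2M1 value. The function names are hypothetical, ties are resolved toward the smaller magnitude for simplicity, and this is an illustration of the number format only, not AMD-Quark's implementation.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign bit is separate).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # number of elements sharing one scale in the OCP MX formats
E2M1_EMAX = 2     # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block(block):
    """Fake-quantize one block; return (shared_exponent, dequantized_values)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale derived from the largest magnitude in the block,
    # clamped to the exponent range that an E8M0 scale can hold.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in block:
        # Snap the scaled magnitude to the nearest grid point.
        # Scaled values above 6.0 saturate to 6.0.
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        dequantized.append(math.copysign(mag * scale, v))
    return shared_exp, dequantized


def quantize_vector(values):
    """Apply quantize_block to consecutive chunks of BLOCK_SIZE elements."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(deq)
    return out
```

With "Static" weight quantization the per-block scales can be computed once offline, whereas "Dynamic" activation quantization would compute them from each incoming tensor at inference time.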
| | **Quantization script:** |
| |
|
| | ```python |
| | from quark.torch import LLMTemplate, ModelQuantizer |
| | |
| | # --- Register GLM-5 template --- |
| | GLM5_template = LLMTemplate( |
| |     model_type="glm_moe_dsa", |
| |     kv_layers_name=["*kv_a_proj_with_mqa", "*kv_b_proj"], |
| |     q_layer_name="*q_a_proj", |
| |     exclude_layers_name=["lm_head"], |
| | ) |
| | LLMTemplate.register_template(GLM5_template) |
| | print(f"[INFO]: Registered template '{GLM5_template.model_type}'") |
| | |
| | # --- Configuration --- |
| | model_dir = "zai-org/GLM-5" |
| | output_dir = "amd/GLM-5-MXFP4" |
| | quant_scheme = "mxfp4" |
| | exclude_layers = [ |
| |     "*self_attn*", |
| |     "*mlp.gate", |
| |     "*lm_head", |
| |     "*mlp.gate_proj", |
| |     "*mlp.up_proj", |
| |     "*mlp.down_proj", |
| |     "*shared_experts*", |
| | ] |
| | |
| | # --- Build quant config from template --- |
| | template = LLMTemplate.get("glm_moe_dsa") |
| | quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers) |
| | |
| | # --- File-to-file quantization (memory-efficient, no full model loading) --- |
| | quantizer = ModelQuantizer(quant_config) |
| | quantizer.direct_quantize_checkpoint( |
| |     pretrained_model_path=model_dir, |
| |     save_path=output_dir, |
| | ) |
| | |
| | print(f"[INFO]: Quantization complete. Output saved to {output_dir}") |
| | |
| | ``` |
| |
|
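The `exclude_layers` patterns in the script above determine which modules stay unquantized. The sketch below shows how such patterns would select layers, assuming glob-style matching equivalent to Python's `fnmatch` (how AMD-Quark matches names internally is not confirmed here). The layer names in `SAMPLE_LAYERS` are hypothetical and only illustrate a typical MoE naming scheme.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quantization script.
EXCLUDE_LAYERS = [
    "*self_attn*",
    "*mlp.gate",
    "*lm_head",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
    "*shared_experts*",
]

# Hypothetical module names, for illustration only.
SAMPLE_LAYERS = [
    "model.layers.0.self_attn.q_a_proj",          # attention
    "model.layers.0.mlp.down_proj",               # dense MLP
    "model.layers.5.mlp.gate",                    # MoE router
    "model.layers.5.mlp.shared_experts.up_proj",  # shared expert
    "model.layers.5.mlp.experts.17.gate_proj",    # routed expert
    "lm_head",
]


def is_excluded(layer_name, patterns=EXCLUDE_LAYERS):
    """Return True if any exclude pattern matches the full layer name."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


def quantized_layers(layer_names):
    """Return the layers that no exclude pattern matches."""
    return [name for name in layer_names if not is_excluded(name)]
```

Under these assumptions, `quantized_layers(SAMPLE_LAYERS)` keeps only the routed-expert projection, which is consistent with the "MOE-only" quantization listed in the overview.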
| | # Deployment |
| | ## Use with vLLM |
| |
|
| | This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. |
| |
|
| | ## Evaluation |
| | The model was evaluated on the GSM8K benchmark. |
| |
|
| | ### Accuracy |
| |
|
| | <table> |
| | <tr> |
| | <td><strong>Benchmark</strong> |
| | </td> |
| | <td><strong>GLM-5 </strong> |
| | </td> |
| | <td><strong>GLM-5-MXFP4 (this model)</strong> |
| | </td> |
| | <td><strong>Recovery</strong> |
| | </td> |
| | </tr> |
| | <tr> |
| | <td>GSM8K (flexible-extract) |
| | </td> |
| | <td>95.45 |
| | </td> |
| | <td>95.00 |
| | </td> |
| | <td>99.53% |
| | </td> |
| | </tr> |
| | </table> |
| |
|
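The Recovery column is consistent with the quantized score expressed as a percentage of the baseline score. A short check of that arithmetic, with a hypothetical helper name:

```python
def recovery(quantized_score, baseline_score):
    """Quantized score as a percentage of the baseline, rounded to 2 decimals."""
    return round(100.0 * quantized_score / baseline_score, 2)


# Scores taken from the accuracy table above.
print(f"{recovery(95.00, 95.45):.2f}%")  # prints 99.53%
```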
| | ### Reproduction |
| |
|
| | The GSM8K results were obtained with the `lm-evaluation-harness` framework inside the Docker image `rocm/pytorch-private:vllm_glm5_0225`, with vLLM and lm-eval built and installed from source. |
| | The image contains the vLLM code modifications needed to support this model. |
| |
|
| | #### Launching server |
| | ```bash |
| | export VLLM_ROCM_USE_AITER=1 |
| | export VLLM_ROCM_USE_AITER_FP8BMM=0 |
| | export VLLM_ROCM_USE_AITER_FP4BMM=0 |
| | vllm serve amd/GLM-5-MXFP4 \ |
| |     -tp 8 \ |
| |     --block-size 1 \ |
| |     --trust-remote-code \ |
| |     --max-model-len 4096 |
| | ``` |
| |
|
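Once the server is running, a quick sanity check is to send one request to its OpenAI-compatible completions endpoint before starting a full evaluation. The sketch below uses only the Python standard library and assumes the server listens on `localhost:8000`, matching the `base_url` used in the evaluation command. The helper names and the default sampling parameters are illustrative choices, not part of the original setup.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # assumed default host and port
MODEL = "amd/GLM-5-MXFP4"


def build_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

For example, `complete("Q: What is 12 * 7?\nA:")` returns the model's continuation as a string. Because the server is started with `--max-model-len 4096`, the prompt and `max_tokens` together need to fit within that limit.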
| | #### Evaluating model in a new terminal |
| | ```bash |
| | lm_eval \ |
| |     --model local-completions \ |
| |     --model_args '{"model": "amd/GLM-5-MXFP4", "base_url": "http://localhost:8000/v1/completions", "num_concurrent": 32, "max_retries": 10, "max_gen_toks": 2048, "tokenizer_backend":"None","tokenized_requests":"False" }' \ |
| |     --tasks gsm8k \ |
| |     --batch_size auto \ |
| |     --num_fewshot 5 \ |
| |     --trust_remote_code |
| | ``` |
| |
|
| | # License |
| | Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved. |