---
base_model:
- zai-org/GLM-4.7
---

# Model Overview

- **Model Architecture:** GLM-4.7
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** MoE-only, OCP MXFP4, Static
- **Activation quantization:** MoE-only, OCP MXFP4, Dynamic
- **KV cache quantization:** OCP FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the GLM-4.7 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.
AMD-Quark was installed from source inside the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`.

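For context, MXFP4 refers to the OCP Microscaling FP4 format: values are grouped into blocks of 32, each block shares a single power-of-two (E8M0) scale, and each element is stored as a 4-bit E2M1 float. The sketch below is purely illustrative of the block format; Quark's actual kernels may differ in scale selection and rounding.

```
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is a separate bit)
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_block(x):
    """Quantize and dequantize one 32-element block: a shared
    power-of-two scale plus round-to-nearest onto the FP4 grid."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x)
    # Shared E8M0 scale: a power of two placing the block max
    # within FP4's representable range (max magnitude 6.0)
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    mags = np.clip(np.abs(x) / scale, 0.0, 6.0)
    # Snap each magnitude to the nearest FP4 value, then restore the sign
    idx = np.abs(mags[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_VALUES[idx] * scale

block = np.random.randn(32)
print(np.abs(mxfp4_block(block) - block).max())  # worst-case block error
```
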
**Quantization scripts:**

Step 1: Create `quantize_glm.py`:
```
import runpy
from quark.torch import LLMTemplate

# Register a GLM-4 MoE template so Quark knows which layers to quantize
glm4_moe_template = LLMTemplate(
    model_type="glm4_moe",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=["lm_head", "*mlp.gate", "*self_attn*", "*shared_experts.*", "*mlp.down_proj", "*mlp.gate_proj", "*mlp.up_proj"],
)
LLMTemplate.register_template(glm4_moe_template)
print(f"[INFO]: Registered template '{glm4_moe_template.model_type}'")

# Absolute path to the quantize_quark.py example script shipped with Quark
quantize_script = "/app/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py"

runpy.run_path(quantize_script, run_name="__main__")
```
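
The `runpy` wrapper registers the GLM-4 MoE template in-process and then hands control (including any command-line arguments) to Quark's stock `quantize_quark.py` example, so no Quark source files need to be edited.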

Step 2: Quantize with `quantize_glm.py`:
```
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL_DIR=zai-org/GLM-4.7
export output_dir=amd/GLM-4.7-MXFP4

exclude_layers="*self_attn* *mlp.gate lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj *shared_experts.*"
python3 quantize_glm.py --model_dir $MODEL_DIR \
    --quant_scheme mxfp4 \
    --num_calib_data 128 \
    --exclude_layers $exclude_layers \
    --kv_cache_dtype fp8 \
    --model_export hf_format \
    --output_dir $output_dir \
    --multi_gpu
```
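
As a quick sanity check after the export (a sketch, assuming the run above wrote to the `output_dir` shown), you can list the exported files and inspect the quantization metadata in `config.json`:

```
import json
import pathlib

# Directory written by the quantization command above
out = pathlib.Path("amd/GLM-4.7-MXFP4")
print(sorted(p.name for p in out.iterdir()))  # expect config.json, *.safetensors, tokenizer files
cfg = json.loads((out / "config.json").read_text())
print(cfg.get("quantization_config"))  # quantization settings recorded by the hf_format export
```
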

# Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
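
The snippet below is a minimal offline-inference sketch using vLLM's Python API; it assumes a multi-GPU node and that the checkpoint is available as `amd/GLM-4.7-MXFP4`, and it mirrors the parallelism and KV-cache settings of the serve command shown under Reproduction.

```
from vllm import LLM, SamplingParams

# Settings mirror the `vllm serve` command in the Reproduction section
llm = LLM(
    model="amd/GLM-4.7-MXFP4",
    tensor_parallel_size=4,
    kv_cache_dtype="fp8",
)
sampling = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain MXFP4 quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```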

## Evaluation

The model was evaluated on the GSM8K benchmark.

### Accuracy

<table>
  <tr>
    <td><strong>Benchmark</strong></td>
    <td><strong>GLM-4.7</strong></td>
    <td><strong>GLM-4.7-MXFP4 (this model)</strong></td>
    <td><strong>Recovery</strong></td>
  </tr>
  <tr>
    <td>GSM8K</td>
    <td>94.16</td>
    <td>93.63</td>
    <td>99.44%</td>
  </tr>
</table>

Recovery is the quantized score as a fraction of the baseline: 93.63 / 94.16 ≈ 99.44%.

### Reproduction

The GSM8K results were obtained with the `lm-evaluation-harness` framework, using the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`, with vLLM, lm-eval, and amd-quark compiled and installed from source inside the image.

#### Launching the server
```
vllm serve amd/GLM-4.7-MXFP4 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --kv-cache-dtype fp8
```
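
Before running the harness, you can optionally smoke-test the completions endpoint. This is a sketch that assumes the `openai` Python client is installed and the server above is listening on port 8000.

```
from openai import OpenAI

# Points at the local vLLM OpenAI-compatible server started above
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="amd/GLM-4.7-MXFP4",
    prompt="Question: What is 12 * 17? Answer:",
    max_tokens=32,
)
print(resp.choices[0].text)
```
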

#### Evaluating the model in a new terminal
```
lm_eval \
    --model local-completions \
    --model_args "model=amd/GLM-4.7-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 1
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.