add model card and update tokenizer file

#3
Files changed (2)
  1. README.md +115 -5
  2. tokenizer.json +2 -2
README.md CHANGED
@@ -1,9 +1,119 @@
1
  ---
2
- license: mit
3
  ---
4
 
5
- **Disclaimer**
6
 
7
- This model is provided for research and evaluation purposes only.
8
- Quantization may introduce accuracy or behavioral differences compared to the original model.
9
- Users are responsible for validating the model in their own environments and complying with the original model license.
1
  ---
2
+ license: other
3
+ license_name: modified-mit
4
+ license_link: LICENSE
5
+ base_model:
6
+ - zai-org/GLM-5.1
7
  ---
8
 
9
+ # Model Overview
10
 
11
+ - **Model Architecture:** GLM-5.1
12
+ - **Input:** Text
13
+ - **Output:** Text
14
+ - **Supported Hardware Microarchitecture:** AMD MI300/MI350/MI355 (emulation)
15
+ - **ROCm:** 7.2.2
16
+ - **Operating System(s):** Linux
17
+ - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
18
+ - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
19
+ - **Quantized layers:** `experts` and `shared_experts`
20
+ - **Weight quantization:** NVFP4, Static
21
+ - **Activation quantization:** NVFP4, Dynamic
22
+ - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
23
+
24
+ This model was built from the GLM-5.1 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for NVFP4 quantization.
25
+ # Model Quantization
26
+
27
+ The model was quantized from [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to NVFP4.
28
+
29
+ **Quantization scripts:**
30
+ ```
31
+ sudo sysctl -w vm.max_map_count=4194304
32
+ cd Quark/examples/torch/language_modeling/llm_ptq/
33
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
34
+ export MODEL_DIR=/zai-org/GLM-5.1
35
+ export output_dir=/amd/GLM-5.1-NVFP4
36
+ exclude_layers="*self_attn* *mlp.gate lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj"
37
+ python3 quantize_quark.py --model_dir $MODEL_DIR \
38
+ --quant_scheme nvfp4 \
39
+ --num_calib_data 128 \
40
+ --exclude_layers $exclude_layers \
41
+ --model_export hf_format \
42
+ --output_dir $output_dir \
43
+ --multi_gpu balanced
44
+ ```
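The `exclude_layers` patterns in the script above decide which modules stay unquantized. A minimal sketch of how such glob patterns could be checked, assuming shell-style matching as implemented by Python's `fnmatch` (whether AMD-Quark matches names exactly this way is an assumption) and using hypothetical module names that are not taken from the GLM-5.1 checkpoint:

```python
from fnmatch import fnmatchcase

# Patterns copied from the exclude_layers variable in the quantization script.
EXCLUDE_PATTERNS = (
    "*self_attn* *mlp.gate lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj"
).split()


def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """Return True if the layer name matches any exclusion pattern."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


# Hypothetical module names, for illustration only.
EXAMPLE_LAYERS = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.experts.0.gate_proj",
    "model.layers.3.mlp.shared_experts.down_proj",
    "lm_head",
]

for name in EXAMPLE_LAYERS:
    status = "excluded" if is_excluded(name) else "quantized"
    print(f"{name}: {status}")
```

Under these assumptions, only the `experts` and `shared_experts` projections fall through every pattern, which is consistent with the "Quantized layers" entry in the model overview.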
45
+
46
+ # Deployment
47
+ ## Use with vLLM
48
+
49
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
50
+
51
+ ## Evaluation
52
+ The model was evaluated on the GSM8K benchmark.
53
+
54
+ ### Accuracy
55
+
56
+ <table>
57
+ <tr>
58
+ <td><strong>Benchmark</strong>
59
+ </td>
60
+ <td><strong>GLM-5.1 </strong>
61
+ </td>
62
+ <td><strong>GLM-5.1-NVFP4 (this model)</strong>
63
+ </td>
64
+ <td><strong>Recovery</strong>
65
+ </td>
66
+ </tr>
67
+ <tr>
68
+ <td>GSM8K (flexible-extract)
69
+ </td>
70
+ <td>95.38
71
+ </td>
72
+ <td>95.68
73
+ </td>
74
+ <td>100.31%
75
+ </td>
76
+ </tr>
77
+
79
+ </table>
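The Recovery column in the table above appears to be the quantized score expressed as a percentage of the baseline score. A short sketch of that arithmetic, assuming this definition (the function name is illustrative):

```python
def recovery_percent(quantized_score, baseline_score):
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized_score / baseline_score


# GSM8K (flexible-extract) scores from the table.
baseline = 95.38   # GLM-5.1
quantized = 95.68  # GLM-5.1-NVFP4

print(f"{recovery_percent(quantized, baseline):.2f}%")  # prints 100.31%
```

A value slightly above 100% means the quantized model scored marginally higher on this run; the table alone does not show whether that difference exceeds run-to-run noise.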
80
+
81
+ ### Reproduction
82
+
83
+ The GSM8K result was obtained using the `lm-evaluation-harness` framework, based on the Docker image `rocm/vllm-dev:nightly_main_20260603`.
84
+
85
+ First, install `lm-eval` (version 0.4.12) in the container.
86
+ ```
87
+ pip install lm-eval
88
+ pip install lm-eval[api]
89
+ ```
90
+
91
+ #### Launching server
92
+ ```
93
+ export VLLM_ROCM_USE_AITER=1
94
+ export VLLM_ROCM_USE_AITER_FP8BMM=0
95
+ export VLLM_ROCM_USE_AITER_FP4BMM=0
96
+ HIP_VISIBLE_DEVICES=4,5,6,7 vllm serve /amd/GLM-5.1-NVFP4 \
97
+ -tp 4 \
98
+ --block-size 1 \
99
+ --trust-remote-code \
100
+ --max-model-len 4096 \
101
+ --port 8082
102
+ ```
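Once the server is running, a quick smoke test can confirm that it responds before starting the full evaluation. The sketch below assumes the OpenAI-compatible `/v1/completions` endpoint on the port used in the command above; the prompt and the helper names are illustrative, and only the standard library is used:

```python
import json
import urllib.request

# Endpoint and model path taken from the serve command above.
BASE_URL = "http://localhost:8082/v1/completions"
MODEL = "/amd/GLM-5.1-NVFP4"


def build_request(prompt, max_tokens=64):
    """Build a POST request for the completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64):
    """Send the request and return the first completion's text."""
    request = build_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]

# Example call (requires the server to be up):
#   print(complete("Q: What is 12 * 7?\nA:"))
```

Prompt tokens plus `max_tokens` must stay within the 4096-token `--max-model-len` set in the serve command.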
103
+
104
+ #### Evaluating model in a new terminal
105
+ ```
106
+ lm_eval \
107
+ --model local-completions \
108
+ --model_args '{"model": "/amd/GLM-5.1-NVFP4", "base_url": "http://localhost:8082/v1/completions", "num_concurrent": 32, "max_retries": 10, "max_gen_toks": 2048, "tokenizer_backend": null, "tokenized_requests": false}' \
109
+ --tasks gsm8k \
110
+ --batch_size auto \
111
+ --num_fewshot 5 \
112
+ --trust_remote_code
113
+ ```
115
+
116
+
117
+ # License
118
+ Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.
119
+
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:47757b9678da19e468edb3ae37a853996599945b5006914e5b088aff30002386
3
- size 20217707
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:19e773648cb4e65de8660ea6365e10acca112d42a854923df93db4a6f333a82d
3
+ size 20217442
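The `tokenizer.json` change above is a Git LFS pointer update, so the diff shows only the new object hash and size, not the tokenizer content. A small sketch for comparing the two pointers, assuming the standard three-line pointer layout shown in the diff (the helper name is illustrative):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:47757b9678da19e468edb3ae37a853996599945b5006914e5b088aff30002386
size 20217707
"""
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:19e773648cb4e65de8660ea6365e10acca112d42a854923df93db4a6f333a82d
size 20217442
"""

old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)
print("content changed:", old["digest"] != new["digest"])  # prints content changed: True
print("size delta (bytes):", new["size"] - old["size"])    # prints size delta (bytes): -265
```

The new file is 265 bytes smaller; the pointer itself does not show what changed, so reviewing the actual tokenizer change requires fetching both LFS objects.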