frida-a commited on
Commit
4be97d5
·
verified ·
1 Parent(s): 8e56369

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +52 -7
README.md CHANGED
@@ -60,6 +60,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
60
  ## Software Integration:
61
  **Supported Runtime Engine(s):** <br>
62
  * SGLang <br>
 
63
 
64
  **Supported Hardware Microarchitecture Compatibility:** <br>
65
  * NVIDIA Blackwell <br>
@@ -100,22 +101,68 @@ We did not perform training or testing for this Model Optimizer release. The met
100
 
101
 
102
  ## Inference:
103
- **Acceleration Engine:** SGLang <br>
104
  **Test Hardware:** B300 <br>
105
 
106
  ## Post Training Quantization
107
- This model was obtained by quantizing the weights and activations of GLM-5.1 to NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within transformer blocks in MoE experts are quantized. The shared expert is not quantized.
108
 
109
  ## Usage
110
 
 
 
111
  To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:dev-cu13` (the `cu13` variant is required for B300; for other GPUs, use the corresponding build) and run the sample command below:
112
 
113
  ```sh
114
- python3 -m sglang.launch_server --model nvidia/GLM-5.1-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --chunked-prefill-size 131072 --mem-fraction-static 0.80
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
115
  ```
116
 
117
  ## Evaluation
118
- The accuracy benchmark results are presented in the table below:
119
  <table>
120
  <tr>
121
  <td><strong>Precision</strong>
@@ -162,7 +209,7 @@ The accuracy benchmark results are presented in the table below:
162
  </table>
163
 
164
  > Baseline: [GLM-5.1-FP8](https://huggingface.co/zai-org/GLM-5.1-FP8).
165
- > Benchmarked with temperature=1.0, top_p=0.96, max num tokens 131072
166
 
167
  ## Model Limitations:
168
  The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
@@ -172,5 +219,3 @@ The base model was trained on data that contains toxic language and societal bia
172
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
173
 
174
  Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
175
-
176
-
 
60
  ## Software Integration:
61
  **Supported Runtime Engine(s):** <br>
62
  * SGLang <br>
63
+ * vLLM <br>
64
 
65
  **Supported Hardware Microarchitecture Compatibility:** <br>
66
  * NVIDIA Blackwell <br>
 
101
 
102
 
103
  ## Inference:
104
+ **Acceleration Engine:** SGLang, vLLM <br>
105
  **Test Hardware:** B300 <br>
106
 
107
  ## Post Training Quantization
108
+ This model was obtained by quantizing the weights and activations of GLM-5.1 to NVFP4 data type, ready for inference with SGLang and vLLM. Only the weights and activations of the linear operators within transformer blocks in MoE experts are quantized. The shared expert is not quantized.
109
 
110
  ## Usage
111
 
112
+ ### SGLang
113
+
114
  To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:dev-cu13` (the `cu13` variant is required for B300; for other GPUs, use the corresponding build) and run the sample command below:
115
 
116
  ```sh
117
+ python3 -m sglang.launch_server \
118
+ --model nvidia/GLM-5.1-NVFP4 \
119
+ --tensor-parallel-size 8 \
120
+ --quantization modelopt_fp4 \
121
+ --tool-call-parser glm47 \
122
+ --reasoning-parser glm45 \
123
+ --trust-remote-code \
124
+ --chunked-prefill-size 131072 \
125
+ --mem-fraction-static 0.80
126
+ ```
127
+
128
+
129
+ ### vLLM
130
+
131
+ To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can use the docker image `vllm/vllm-openai:v0.19.1` and run the sample command below:
132
+
133
+ ```sh
134
+ vllm serve nvidia/GLM-5.1-NVFP4 \
135
+ --tensor-parallel-size 8 \
136
+ --trust-remote-code \
137
+ --gpu-memory-utilization 0.95 \
138
+ --port 8000
139
+ ```
140
+
141
+ To enable expert parallel, reasoning, and tool calling:
142
+
143
+ ```sh
144
+ vllm serve nvidia/GLM-5.1-NVFP4 \
145
+ --tensor-parallel-size 8 \
146
+ --pipeline-parallel-size 1 \
147
+ --data-parallel-size 1 \
148
+ --enable-expert-parallel \
149
+ --trust-remote-code \
150
+ --gpu-memory-utilization 0.9 \
151
+ --reasoning-parser glm45 \
152
+ --tool-call-parser glm47 \
153
+ --enable-auto-tool-choice \
154
+ --enable-chunked-prefill \
155
+ --max-num-batched-tokens 8192 \
156
+ --max-num-seqs 1024 \
157
+ --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 128}' \
158
+ --chat-template-content-format string \
159
+ -cc.pass_config.fuse_allreduce_rms=False \
160
+ --host 0.0.0.0 \
161
+ --port 8000
162
  ```
163
 
164
  ## Evaluation
165
+ The accuracy benchmark results are presented in the table below (evaluated using vLLM):
166
  <table>
167
  <tr>
168
  <td><strong>Precision</strong>
 
209
  </table>
210
 
211
  > Baseline: [GLM-5.1-FP8](https://huggingface.co/zai-org/GLM-5.1-FP8).
212
+ > Benchmarked with vLLM (vllm/vllm-openai:v0.19.1), temperature=1.0, top_p=0.95, max num tokens 64000
213
 
214
  ## Model Limitations:
215
  The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
 
219
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
220
 
221
  Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).