nvidia
/

GLM-5.1-NVFP4

@@ -60,6 +60,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 ## Software Integration:
 **Supported Runtime Engine(s):** <br>
 * SGLang <br>
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * NVIDIA Blackwell <br>
@@ -100,22 +101,68 @@ We did not perform training or testing for this Model Optimizer release. The met
 ## Inference:
-**Acceleration Engine:** SGLang <br>
 **Test Hardware:** B300 <br>
 ## Post Training Quantization
-This model was obtained by quantizing the weights and activations of GLM-5.1 to NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within transformer blocks in MoE experts are quantized. The shared expert is not quantized.
 ## Usage
 To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:dev-cu13` (the `cu13` variant is required for B300; for other GPUs, use the corresponding build) and run the sample command below:
 ```sh
-python3 -m sglang.launch_server --model nvidia/GLM-5.1-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --chunked-prefill-size 131072  --mem-fraction-static 0.80
 ```
 ## Evaluation
-The accuracy benchmark results are presented in the table below:
 <table>
   <tr>
    <td><strong>Precision</strong>
@@ -162,7 +209,7 @@ The accuracy benchmark results are presented in the table below:
 </table>
 > Baseline: [GLM-5.1-FP8](https://huggingface.co/zai-org/GLM-5.1-FP8).
-> Benchmarked with temperature=1.0, top_p=0.96, max num tokens 131072
 ## Model Limitations:
 The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
@@ -172,5 +219,3 @@ The base model was trained on data that contains toxic language and societal bia
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).

 ## Software Integration:
 **Supported Runtime Engine(s):** <br>
 * SGLang <br>
+* vLLM <br>
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * NVIDIA Blackwell <br>
 ## Inference:
+**Acceleration Engine:** SGLang, vLLM <br>
 **Test Hardware:** B300 <br>
 ## Post Training Quantization
+This model was obtained by quantizing the weights and activations of GLM-5.1 to NVFP4 data type, ready for inference with SGLang and vLLM. Only the weights and activations of the linear operators within transformer blocks in MoE experts are quantized. The shared expert is not quantized.
 ## Usage
+### SGLang
 To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:dev-cu13` (the `cu13` variant is required for B300; for other GPUs, use the corresponding build) and run the sample command below:
 ```sh
+python3 -m sglang.launch_server \
+    --model nvidia/GLM-5.1-NVFP4 \
+    --tensor-parallel-size 8 \
+    --quantization modelopt_fp4 \
+    --tool-call-parser glm47 \
+    --reasoning-parser glm45 \
+    --trust-remote-code \
+    --chunked-prefill-size 131072 \
+    --mem-fraction-static 0.80
+```
+### vLLM
+To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can use the docker image `vllm/vllm-openai:v0.19.1` and run the sample command below:
+```sh
+vllm serve nvidia/GLM-5.1-NVFP4 \
+    --tensor-parallel-size 8 \
+    --trust-remote-code \
+    --gpu-memory-utilization 0.95 \
+    --port 8000
+```
+To enable expert parallel, reasoning, and tool calling:
+```sh
+vllm serve nvidia/GLM-5.1-NVFP4 \
+    --tensor-parallel-size 8 \
+    --pipeline-parallel-size 1 \
+    --data-parallel-size 1 \
+    --enable-expert-parallel \
+    --trust-remote-code \
+    --gpu-memory-utilization 0.9 \
+    --reasoning-parser glm45 \
+    --tool-call-parser glm47 \
+    --enable-auto-tool-choice \
+    --enable-chunked-prefill \
+    --max-num-batched-tokens 8192 \
+    --max-num-seqs 1024 \
+    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 128}' \
+    --chat-template-content-format string \
+    -cc.pass_config.fuse_allreduce_rms=False \
+    --host 0.0.0.0 \
+    --port 8000
 ```
 ## Evaluation
+The accuracy benchmark results are presented in the table below (evaluated using vLLM):
 <table>
   <tr>
    <td><strong>Precision</strong>
 </table>
 > Baseline: [GLM-5.1-FP8](https://huggingface.co/zai-org/GLM-5.1-FP8).
+> Benchmarked with vLLM (vllm/vllm-openai:v0.19.1), temperature=1.0, top_p=0.95, max num tokens 64000
 ## Model Limitations:
 The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).