Update README.md
#5
by kaihangj - opened
README.md
CHANGED

````diff
@@ -59,6 +59,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 
 ## Software Integration:
 **Supported Runtime Engine(s):** <br>
+* vLLM <br>
 * SGLang <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
@@ -98,14 +99,20 @@ The model version is NVFP4 1.0 version and is quantized with nvidia-modelopt **v
 
 
 ## Inference:
-**Acceleration Engine:** SGLang <br>
+**Acceleration Engine:** vLLM, SGLang <br>
 **Test Hardware:** B300 <br>
 
 ## Post Training Quantization
-This model was obtained by quantizing the weights and activations of GLM-5 to NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized.
+This model was obtained by quantizing the weights and activations of GLM-5 to NVFP4 data type, ready for inference with vLLM and SGLang. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized.
 
 ## Usage
 
+To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can start the docker `vllm/vllm-openai:latest` and run the sample command below:
+
+```sh
+vllm serve nvidia/GLM-5-NVFP4 --tensor-parallel-size 8 --trust-remote-code --enable-auto-tool-choice --tool-call-parser glm47 --reasoning-parser glm45 --enable-chunked-prefill --max-num-batched-tokens 131072 --gpu-memory-utilization 0.80
+```
+
 To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:nightly-dev-cu13-20260305-33c92732` and run the sample command below (when the nightly docker becomes unavailable, use `lmsysorg/sglang:latest`):
 
 ```sh
````