Developers looking to take off-the-shelf, pre-quantized models for deployment in AI Agent systems, chatbots, RAG systems, and other AI-powered applications. <br>

### Release Date: <br>
Huggingface 03/06/2026 via https://huggingface.co/nvidia/GLM-5-NVFP4 <br>

## Model Architecture:
**Architecture Type:** Transformers <br>

[...]

## Inference:
**Acceleration Engine:** SGLang <br>
**Test Hardware:** B300 <br>

## Post Training Quantization
This model was obtained by quantizing the weights and activations of GLM-5 to the NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within the MoE transformer blocks are quantized.
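For intuition, NVFP4 represents each quantized weight as a 4-bit E2M1 value (grid ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}) plus a small per-block scale. The sketch below fake-quantizes a weight list under those assumptions (16-element blocks, plain float scales); real NVFP4 additionally stores FP8 block scales and a global scale, which this illustration omits:

```python
# E2M1 (FP4) magnitudes used by the NVFP4 format.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Fake-quantize one block: choose a scale so the block maximum
    maps onto the largest FP4 value, round every element to the
    nearest FP4 grid point, then scale back (dequantize)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_VALUES[-1]
    out = []
    for x in block:
        m = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(m * scale * (1 if x >= 0 else -1))
    return out

def quantize_nvfp4(weights, block_size=16):
    """Apply block-wise FP4 fake-quantization over a flat weight list."""
    return [y for i in range(0, len(weights), block_size)
            for y in quantize_block(weights[i:i + block_size])]

w = [0.07 * ((-1) ** i) * i for i in range(32)]
wq = quantize_nvfp4(w)
print(max(abs(a - b) for a, b in zip(w, wq)))
```

Because the block maximum is pinned to the top of the FP4 grid, the per-element rounding error is bounded by the block scale, which is what makes the small 16-element blocks effective.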
## Usage

To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the Docker image `lmsysorg/sglang:nightly-dev-cu13-20260305-33c92732` and run the sample command below:

```sh
python3 -m sglang.launch_server --model nvidia/GLM-5-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --chunked-prefill-size 131072 --mem-fraction-static 0.80
```
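Once the server is up, it exposes an OpenAI-compatible HTTP API (port 30000 is SGLang's default). A minimal stdlib-only client sketch, assuming that default host/port and the standard chat-completions path; the request is built locally and only sent once you uncomment the final call:

```python
import json
import urllib.request

# Endpoint of the server launched above (adjust --host/--port as needed).
URL = "http://localhost:30000/v1/chat/completions"

# Standard OpenAI-style chat-completions payload.
payload = {
    "model": "nvidia/GLM-5-NVFP4",
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])

print(req.get_method(), req.full_url)
```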