zhiyucheng committed · Commit 3642b8b · verified · Parent(s): 331c1b5

Upload README.md with huggingface_hub

Files changed (1): README.md (+4 −4)
README.md CHANGED
@@ -36,7 +36,7 @@ Global <br>
 Developers looking to take off-the-shelf, pre-quantized models for deployment in AI Agent systems, chatbots, RAG systems, and other AI-powered applications. <br>
 
 ### Release Date: <br>
-Huggingface 03/03/2026 via https://huggingface.co/nvidia/GLM-5-NVFP4 <br>
+Huggingface 03/06/2026 via https://huggingface.co/nvidia/GLM-5-NVFP4 <br>
 
 ## Model Architecture:
 **Architecture Type:** Transformers <br>
@@ -99,17 +99,17 @@ The model version is NVFP4 1.0 version and is quantized with nvidia-modelopt **v
 
 ## Inference:
 **Acceleration Engine:** SGLang <br>
-**Test Hardware:** B200 <br>
+**Test Hardware:** B300 <br>
 
 ## Post Training Quantization
 This model was obtained by quantizing the weights and activations of GLM-5 to NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized.
 
 ## Usage
 
-To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:glm5-blackwell` and run the sample command below:
+To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:nightly-dev-cu13-20260305-33c92732` and run the sample command below:
 
 ```sh
-python3 -m sglang.launch_server --model nvidia/GLM-5-NVFP4 --tensor-parallel-size 4 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code
+python3 -m sglang.launch_server --model nvidia/GLM-5-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --chunked-prefill-size 131072 --mem-fraction-static 0.80
 ```
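Once the server from the updated `launch_server` command is up, it can be queried over SGLang's OpenAI-compatible HTTP API (served on port 30000 by default). The sketch below uses only the Python standard library; `build_chat_request` and `chat` are hypothetical helper names for illustration, and the base URL assumes a local server with default settings.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for the served model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """POST the payload to SGLang's OpenAI-compatible endpoint and
    return the first choice's message content."""
    payload = build_chat_request("nvidia/GLM-5-NVFP4", prompt)
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Example call, once the server is running: `chat("Summarize NVFP4 in one sentence.")`.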