zhiyucheng committed on
Commit f35b051 · verified · 1 Parent(s): 7019896

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +51 -1
README.md CHANGED

@@ -36,7 +36,7 @@ Global <br>
  Developers looking to take off-the-shelf, pre-quantized models for deployment in AI Agent systems, chatbots, RAG systems, and other AI-powered applications. <br>
 
  ### Release Date: <br>
- Huggingface 03/06/2026 via https://huggingface.co/nvidia/GLM-5-NVFP4 <br>
+ Huggingface 03/16/2026 via https://huggingface.co/nvidia/GLM-5-NVFP4 <br>
 
  ## Model Architecture:
  **Architecture Type:** Transformers <br>

@@ -112,6 +112,56 @@ To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), y
  python3 -m sglang.launch_server --model nvidia/GLM-5-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --chunked-prefill-size 131072 --mem-fraction-static 0.80
  ```
 
+ If you would like to enable expert parallelism when launching the SGLang endpoint, please build a Docker image with the provided [dockerfile](https://huggingface.co/nvidia/GLM-5-NVFP4/blob/main/dockerfile).
+
+ ## Evaluation
+ The accuracy benchmark results are presented in the table below:
+ <table>
+ <tr>
+ <td><strong>Precision</strong></td>
+ <td><strong>MMLU Pro</strong></td>
+ <td><strong>GPQA Diamond</strong></td>
+ <td><strong>SciCode</strong></td>
+ <td><strong>IFBench</strong></td>
+ <td><strong>HLE</strong></td>
+ </tr>
+ <tr>
+ <td>FP8</td>
+ <td>0.858</td>
+ <td>0.862</td>
+ <td>0.488</td>
+ <td>0.717</td>
+ <td>0.274</td>
+ </tr>
+ <tr>
+ <td>NVFP4</td>
+ <td>0.861</td>
+ <td>0.855</td>
+ <td>0.478</td>
+ <td>0.712</td>
+ <td>0.275</td>
+ </tr>
+ </table>
+
 
  ## Model Limitations:
  The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
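
The `sglang.launch_server` command in the diff above starts an OpenAI-compatible HTTP server. As a minimal sketch of how a client might query it — the base URL (SGLang's default port 30000), the `/v1/chat/completions` path, and the prompt text are assumptions here, not part of the model card:

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "nvidia/GLM-5-NVFP4") -> dict:
    """Build an OpenAI-style chat-completions payload for the served model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }


def query_server(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """POST the payload to the (assumed) local SGLang endpoint and return the reply text."""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-style response: first choice's message content holds the reply.
    return body["choices"][0]["message"]["content"]
```

With the server from the README running locally, `query_server("Hello")` would return the model's reply; only the standard library is used, so no extra client dependency is needed.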