Developers looking to take off-the-shelf, pre-quantized models for deployment in AI Agent systems, chatbots, RAG systems, and other AI-powered applications. <br>

### Release Date: <br>
Huggingface 03/06/2026 via https://huggingface.co/nvidia/GLM-5-NVFP4 <br>

## Model Architecture:
**Architecture Type:** Transformers <br>

[...]

## Inference:
**Acceleration Engine:** SGLang <br>
**Test Hardware:** B300 <br>

## Post Training Quantization
This model was obtained by quantizing the weights and activations of GLM-5 to the NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within the MoE transformer blocks are quantized.
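For intuition, NVFP4 represents each quantized weight as a 4-bit E2M1 value (grid ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}) plus a small per-block scale. The sketch below fake-quantizes a weight list under those assumptions (16-element blocks, plain float scales); real NVFP4 additionally stores FP8 block scales and a global scale, which this illustration omits:

```python
# E2M1 (FP4) magnitudes used by the NVFP4 format.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Fake-quantize one block: choose a scale so the block maximum
    maps onto the largest FP4 value, round every element to the
    nearest FP4 grid point, then scale back (dequantize)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_VALUES[-1]
    out = []
    for x in block:
        m = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(m * scale * (1 if x >= 0 else -1))
    return out

def quantize_nvfp4(weights, block_size=16):
    """Apply block-wise FP4 fake-quantization over a flat weight list."""
    return [y for i in range(0, len(weights), block_size)
            for y in quantize_block(weights[i:i + block_size])]

w = [0.07 * ((-1) ** i) * i for i in range(32)]
wq = quantize_nvfp4(w)
print(max(abs(a - b) for a, b in zip(w, wq)))
```

Because the block maximum is pinned to the top of the FP4 grid, the per-element rounding error is bounded by the block scale, which is what makes the small 16-element blocks effective.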
## Usage

To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the Docker image `lmsysorg/sglang:nightly-dev-cu13-20260305-33c92732` and run the sample command below:

```sh
python3 -m sglang.launch_server --model nvidia/GLM-5-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --chunked-prefill-size 131072 --mem-fraction-static 0.80
```
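Once the server is up, it exposes an OpenAI-compatible HTTP API (port 30000 is SGLang's default). A minimal stdlib-only client sketch, assuming that default host/port and the standard chat-completions path; the request is built locally and only sent once you uncomment the final call:

```python
import json
import urllib.request

# Endpoint of the server launched above (adjust --host/--port as needed).
URL = "http://localhost:30000/v1/chat/completions"

# Standard OpenAI-style chat-completions payload.
payload = {
    "model": "nvidia/GLM-5-NVFP4",
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])

print(req.get_method(), req.full_url)
```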