Update README.md
#5
by kaihangj - opened
README.md
CHANGED

````diff
@@ -59,6 +59,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 
 ## Software Integration:
 **Supported Runtime Engine(s):** <br>
+* vLLM <br>
 * SGLang <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
@@ -98,14 +99,20 @@ The model version is NVFP4 1.0 version and is quantized with nvidia-modelopt **v
 
 
 ## Inference:
-**Acceleration Engine:** SGLang <br>
+**Acceleration Engine:** vLLM, SGLang <br>
 **Test Hardware:** B300 <br>
 
 ## Post Training Quantization
-This model was obtained by quantizing the weights and activations of GLM-5 to NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized.
+This model was obtained by quantizing the weights and activations of GLM-5 to NVFP4 data type, ready for inference with vLLM and SGLang. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized.
 
 ## Usage
 
+To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can start the docker `vllm/vllm-openai:latest` and run the sample command below:
+
+```sh
+vllm serve nvidia/GLM-5-NVFP4 --tensor-parallel-size 8 --trust-remote-code --enable-auto-tool-choice --tool-call-parser glm47 --reasoning-parser glm45 --enable-chunked-prefill --max-num-batched-tokens 131072 --gpu-memory-utilization 0.80
+```
+
 To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:nightly-dev-cu13-20260305-33c92732` and run the sample command below (when the nightly docker becomes unavailable, use `lmsysorg/sglang:latest`):
 
 ```sh
````