Update to sgl #1
by vincentzed-hf - opened

README.md CHANGED
@@ -69,7 +69,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment

## Model Version(s):
-** The model is quantized with nvidia-modelopt **

## Training, Testing, and Evaluation Datasets:

@@ -95,18 +95,14 @@ The integration of foundation and fine-tuned models into AI systems requires add


## Inference:
-**Acceleration Engine:**
-**Test Hardware:**

## Post Training Quantization
This model was obtained by quantizing the weights and activations of Qwen3-Coder-Next to NVFP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 4x.

## Usage

-### Deploy with TensorRT-LLM
-
-<!-- TODO: Add TensorRT-LLM deployment instructions and sample code -->
-
### Deploy with SGLang

To serve the quantized NVFP4 checkpoint with [SGLang](https://github.com/sgl-project/sglang):

@@ -114,10 +110,12 @@ To serve the quantized NVFP4 checkpoint with [SGLang](https://github.com/sgl-pro
```bash
sglang serve --model-path vincentzed-hf/Qwen3-Coder-Next-NVFP4 --quantization modelopt_fp4
```

### Reproduce with ModelOpt

-To reproduce the NVFP4 quantized checkpoint using [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer):

```bash
python3 examples/llm_ptq/hf_ptq.py \

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment

## Model Version(s):
+The model is quantized with nvidia-modelopt **latest** <br>

## Training, Testing, and Evaluation Datasets:


## Inference:
+**Acceleration Engine:** SGLang <br>
+**Test Hardware:** B300 <br>

## Post Training Quantization
This model was obtained by quantizing the weights and activations of Qwen3-Coder-Next to NVFP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 4x.
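The ~4x figure above can be sanity-checked with quick arithmetic. In this sketch the parameter count is a placeholder (not taken from this card), and the small per-block scale-factor overhead that NVFP4 checkpoints carry is ignored:

```shell
# Back-of-envelope check of the ~4x weight-memory reduction (illustrative;
# PARAMS_B is a placeholder parameter count, in billions of parameters).
PARAMS_B=80
FP16_GB=$(( PARAMS_B * 16 / 8 ))  # 16-bit weights -> 2 bytes per parameter
NVFP4_GB=$(( PARAMS_B * 4 / 8 ))  # 4-bit weights -> 0.5 bytes per parameter
echo "FP16: ${FP16_GB} GB, NVFP4: ${NVFP4_GB} GB, ratio: $(( FP16_GB / NVFP4_GB ))x"
```

In practice the realized saving is slightly below 4x because unquantized layers (embeddings, norms) and the NVFP4 block scales remain in higher precision.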

## Usage

### Deploy with SGLang

To serve the quantized NVFP4 checkpoint with [SGLang](https://github.com/sgl-project/sglang):

```bash
sglang serve --model-path vincentzed-hf/Qwen3-Coder-Next-NVFP4 --quantization modelopt_fp4
```
+Please install SGLang from source from this branch: https://github.com/sgl-project/sglang/pull/18224
+Once the branch is cloned, run `pip install -e .` and then run the serve command above.
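The install-from-source step can be sketched as follows. This is an assumed workflow, not part of the PR itself: the local branch name `pr-18224` is arbitrary, and the exact install path or extras may differ from SGLang's install docs:

```shell
# Sketch: build SGLang from the PR branch before serving (assumed workflow).
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/18224/head:pr-18224  # fetch the PR head via GitHub's refspec
git checkout pr-18224
pip install -e .                           # editable install from source
sglang serve --model-path vincentzed-hf/Qwen3-Coder-Next-NVFP4 --quantization modelopt_fp4
```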

### Reproduce with ModelOpt

+You may want to produce this checkpoint yourself. To reproduce the NVFP4 quantized checkpoint using [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer):

```bash
python3 examples/llm_ptq/hf_ptq.py \