Files changed (1)

README.md (+6 -8)
@@ -69,7 +69,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

 ## Model Version(s):
-** The model is quantized with nvidia-modelopt **v0.41.0** <br>
+**The model is quantized with nvidia-modelopt latest** <br>

 ## Training, Testing, and Evaluation Datasets:

@@ -95,18 +95,14 @@ The integration of foundation and fine-tuned models into AI systems requires add


 ## Inference:
-**Acceleration Engine:** TensorRT-LLM <br>
-**Test Hardware:** B200 <br>
+**Acceleration Engine:** SGLang <br>
+**Test Hardware:** B300 <br>

 ## Post Training Quantization
 This model was obtained by quantizing the weights and activations of Qwen3-Coder-Next to NVFP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 4x.

 ## Usage

-### Deploy with TensorRT-LLM
-
-<!-- TODO: Add TensorRT-LLM deployment instructions and sample code -->
-
 ### Deploy with SGLang

 To serve the quantized NVFP4 checkpoint with [SGLang](https://github.com/sgl-project/sglang):
@@ -114,10 +110,12 @@ To serve the quantized NVFP4 checkpoint with [SGLang](https://github.com/sgl-pro
 ```bash
 sglang serve --model-path vincentzed-hf/Qwen3-Coder-Next-NVFP4 --quantization modelopt_fp4
 ```
+Please use this SGLang branch and install from source: https://github.com/sgl-project/sglang/pull/18224
+Once the branch is cloned, run `pip install -e .` and then the serve command above.

 ### Reproduce with ModelOpt

-To reproduce the NVFP4 quantized checkpoint using [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer):
+You may want to reproduce this checkpoint yourself. To reproduce the NVFP4 quantized checkpoint using [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer):

 ```bash
 python3 examples/llm_ptq/hf_ptq.py \
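As an aside on the Post Training Quantization section above, the "approximately 4x" figure follows from simple arithmetic over bits per parameter; here is a quick sketch (the 80B parameter count is purely illustrative, not the actual size of Qwen3-Coder-Next):

```python
# Back-of-envelope check of the ~4x reduction claimed in the README:
# NVFP4 stores 4 bits per quantized weight versus 16 bits for BF16.
# Quantization scale factors and unquantized layers are ignored here,
# which is why the real-world ratio is only "approximately" 4x.
def model_size_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB."""
    return num_params * bits_per_param / 8 / 2**30


params = 80e9  # illustrative parameter count, not the model's actual size
bf16_gib = model_size_gib(params, 16)
nvfp4_gib = model_size_gib(params, 4)
print(f"BF16: {bf16_gib:.1f} GiB, NVFP4: {nvfp4_gib:.1f} GiB, "
      f"ratio: {bf16_gib / nvfp4_gib:.1f}x")
```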
 
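For completeness, once the `sglang serve` command from the diff is running, the checkpoint can be exercised with a minimal OpenAI-style client. This is a hedged sketch, not official usage: the default port (30000) and the OpenAI-compatible response schema are assumptions based on SGLang's server documentation.

```python
# Hypothetical client for a running `sglang serve` instance. SGLang
# exposes an OpenAI-compatible HTTP API; the port and response shape
# below are assumptions, so adjust them to your deployment.
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "vincentzed-hf/Qwen3-Coder-Next-NVFP4",
                       max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def query(prompt: str, base_url: str = "http://localhost:30000/v1") -> str:
    """POST the payload and return the first completion's text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(query("Write a Python function that reverses a string."))
```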