GRM-2.6-Plus-NVFP4
NVFP4 post-training quantization of OrionLLM/GRM-2.6-Plus produced with NVIDIA ModelOpt on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition.
Quantization
- Quant config:
ModelOpt NVFP4_DEFAULT_CFG - Scheme: ModelOpt standard NVFP4 dynamic 4-bit quantization with the preset's built-in exclusions for lm_head, output layers, routing gates, and convolutional linear-attention components.
- Tooling:
nvidia-modeloptviamtq.quantizeandexport_hf_checkpoint. - Calibration:
512samples fromcnn_dailymail, sequence length512, batch size2.
Runtime
Use a recent vLLM build with ModelOpt quantization support on NVIDIA Blackwell:
vllm serve rressl/GRM-2.6-Plus-NVFP4 \
--quantization modelopt \
--max-model-len 262144 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--trust-remote-code
NVFP4 requires Blackwell-class NVIDIA hardware for the fast path.
- Downloads last month
- -
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support