GRM-2.6-Plus-NVFP4

NVFP4 post-training quantization of OrionLLM/GRM-2.6-Plus produced with NVIDIA ModelOpt on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition.

Quantization

Quant config: ModelOpt NVFP4_DEFAULT_CFG
Scheme: ModelOpt standard NVFP4 dynamic 4-bit quantization with the preset's built-in exclusions for lm_head, output layers, routing gates, and convolutional linear-attention components.
Tooling: nvidia-modelopt via mtq.quantize and export_hf_checkpoint.
Calibration: 512 samples from cnn_dailymail, sequence length 512, batch size 2.

Runtime

Use a recent vLLM build with ModelOpt quantization support on NVIDIA Blackwell:

vllm serve rressl/GRM-2.6-Plus-NVFP4 \
  --quantization modelopt \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

NVFP4 requires Blackwell-class NVIDIA hardware for the fast path.

Downloads last month: -

Safetensors

Model size

15B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ressl/GRM-2.6-Plus-NVFP4

Base model

Qwen/Qwen3.6-27B

Finetuned

OrionLLM/GRM-2.6-Plus

Quantized

(9)

this model