
NVFP4 Quantized Model

This 30B-parameter model has been quantized to NVFP4 (4-bit floating point) using NVIDIA ModelOpt.

Loading Instructions

import torch
from transformers import AutoModel, AutoTokenizer
from modelopt.torch.quantization.plugins import init_quantized_weights, set_quantizer_state_dict
import modelopt.torch.quantization as mtq

# Load quantizer state
quantizer_state = torch.load("quantizer_state.pt")

# Load model with quantized weights context
with init_quantized_weights(mtq.NVFP4_DEFAULT_CFG):
    model = AutoModel.from_pretrained(
        ".",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )

# Restore quantizer state
set_quantizer_state_dict(model, quantizer_state)

# Model is now ready for inference with NVFP4 compression (~75% VRAM savings)
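NVFP4 stores each weight as a 4-bit FP4 (E2M1) value with a shared scale per 16-element block. The following is a minimal illustrative sketch of that quantize/dequantize round-trip in plain Python; the rounding and scale-selection details of ModelOpt's actual kernels may differ:

```python
# Illustrative sketch of NVFP4-style block quantization (NOT ModelOpt's
# actual implementation): each block of 16 values shares one scale, and
# each value is rounded to the nearest FP4 (E2M1) representable magnitude.

FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block):
    """Quantize one 16-element block to signed FP4 codes plus a shared scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest |x| to 6.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        q = min(FP4_VALUES, key=lambda v: abs(v - mag))  # nearest E2M1 value
        codes.append(-q if x < 0 else q)
    return codes, scale

def dequantize_block(codes, scale):
    """Reconstruct approximate weights from FP4 codes and the block scale."""
    return [c * scale for c in codes]

block = [0.1, -0.9, 3.2, 0.0, 1.5, -2.7, 0.45, 6.4,
         0.2, -0.2, 1.1, 0.8, -4.9, 2.2, 0.05, -0.6]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
```

Each code fits in 4 bits, so a block of 16 BF16 weights (32 bytes) shrinks to 8 bytes of codes plus one shared scale at runtime.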

Quantization Details

  • Backend: nvidia-modelopt 0.37.0
  • Preset: nvfp4_blockscale_w4a16
  • Method: NVFP4 (4-bit floating point)
  • Mode: CPU quantization (works on any system with sufficient RAM)

NVFP4 Characteristics

  • Checkpoint Size: Same as BF16 (no storage compression)
  • Runtime VRAM: ~75% reduction during inference
  • Use Case: Ship full-size checkpoints with quantizer state for VRAM-efficient inference
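For intuition on where the VRAM figure comes from, here is back-of-envelope arithmetic for the 30B parameters, assuming 4-bit weights plus one FP8 (1-byte) scale per 16-element block; this counts weights only, so activations, KV cache, and framework overhead are extra:

```python
# Back-of-envelope weight-memory estimate for a 30B-parameter model
# (sketch under stated assumptions, not a measured footprint).
params = 30e9

bf16_bytes = params * 2                        # 16 bits per weight
# NVFP4: 4-bit weights plus one 1-byte scale per 16-element block
nvfp4_bytes = params * 0.5 + (params / 16) * 1

savings = 1 - nvfp4_bytes / bf16_bytes
print(f"BF16 weights:  {bf16_bytes / 1e9:.0f} GB")   # 60 GB
print(f"NVFP4 weights: {nvfp4_bytes / 1e9:.1f} GB")  # 16.9 GB
print(f"savings: {savings:.0%}")                     # 72% for weights alone
```

The weight-only reduction works out to roughly 72%; the ~75% figure above presumably reflects additional end-to-end savings during inference.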