
NVFP4 Quantized Model

This 30B-parameter model has been quantized to NVFP4 (4-bit floating point) using NVIDIA ModelOpt.

Loading Instructions

import torch
from transformers import AutoModel, AutoTokenizer
from modelopt.torch.quantization.plugins import init_quantized_weights, set_quantizer_state_dict
import modelopt.torch.quantization as mtq

# Load quantizer state
quantizer_state = torch.load("quantizer_state.pt")

# Load model with quantized weights context
with init_quantized_weights(mtq.NVFP4_DEFAULT_CFG):
    model = AutoModel.from_pretrained(
        ".",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )

# Restore quantizer state
set_quantizer_state_dict(model, quantizer_state)

# Model is now ready for inference with NVFP4 compression (~75% VRAM savings)
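NVFP4 stores each weight as a 4-bit FP4 (E2M1) value with a shared scale per 16-element block. The following is a minimal illustrative sketch of that quantize/dequantize round-trip in plain Python; the rounding and scale-selection details of ModelOpt's actual kernels may differ:

```python
# Illustrative sketch of NVFP4-style block quantization (NOT ModelOpt's
# actual implementation): each block of 16 values shares one scale, and
# each value is rounded to the nearest FP4 (E2M1) representable magnitude.

FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block):
    """Quantize one 16-element block to signed FP4 codes plus a shared scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest |x| to 6.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        q = min(FP4_VALUES, key=lambda v: abs(v - mag))  # nearest E2M1 value
        codes.append(-q if x < 0 else q)
    return codes, scale

def dequantize_block(codes, scale):
    """Reconstruct approximate weights from FP4 codes and the block scale."""
    return [c * scale for c in codes]

block = [0.1, -0.9, 3.2, 0.0, 1.5, -2.7, 0.45, 6.4,
         0.2, -0.2, 1.1, 0.8, -4.9, 2.2, 0.05, -0.6]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
```

Each code fits in 4 bits, so a block of 16 BF16 weights (32 bytes) shrinks to 8 bytes of codes plus one shared scale at runtime.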

Quantization Details

  • Backend: nvidia-modelopt 0.37.0
  • Preset: nvfp4_blockscale_w4a16
  • Method: NVFP4 (4-bit floating point)
  • Mode: CPU quantization (works on any system with sufficient RAM)

NVFP4 Characteristics

  • Checkpoint Size: Same as BF16 (no storage compression)
  • Runtime VRAM: ~75% reduction during inference
  • Use Case: Ship full-size checkpoints with quantizer state for VRAM-efficient inference
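For intuition on where the VRAM figure comes from, here is back-of-envelope arithmetic for the 30B parameters, assuming 4-bit weights plus one FP8 (1-byte) scale per 16-element block; this counts weights only, so activations, KV cache, and framework overhead are extra:

```python
# Back-of-envelope weight-memory estimate for a 30B-parameter model
# (sketch under stated assumptions, not a measured footprint).
params = 30e9

bf16_bytes = params * 2                        # 16 bits per weight
# NVFP4: 4-bit weights plus one 1-byte scale per 16-element block
nvfp4_bytes = params * 0.5 + (params / 16) * 1

savings = 1 - nvfp4_bytes / bf16_bytes
print(f"BF16 weights:  {bf16_bytes / 1e9:.0f} GB")   # 60 GB
print(f"NVFP4 weights: {nvfp4_bytes / 1e9:.1f} GB")  # 16.9 GB
print(f"savings: {savings:.0%}")                     # 72% for weights alone
```

The weight-only reduction works out to roughly 72%; the ~75% figure above presumably reflects additional end-to-end savings during inference.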