# NVFP4 Quantized Model

This model has been quantized using the NVIDIA ModelOpt NVFP4 format.
## Loading Instructions
```python
import torch
from transformers import AutoModel, AutoTokenizer
from modelopt.torch.quantization.plugins import init_quantized_weights, set_quantizer_state_dict
import modelopt.torch.quantization as mtq

# Load quantizer state
quantizer_state = torch.load("quantizer_state.pt")

# Load model with quantized weights context
with init_quantized_weights(mtq.NVFP4_DEFAULT_CFG):
    model = AutoModel.from_pretrained(
        ".",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

# Restore quantizer state
set_quantizer_state_dict(model, quantizer_state)

# Model is now ready for inference with NVFP4 compression (~75% VRAM savings)
```
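Once loaded, the model behaves like any other Transformers model. Below is a minimal generation sketch, not part of the official instructions: it reuses `model` from the snippet above and assumes the checkpoint ships a compatible tokenizer; depending on the architecture you may need `AutoModelForCausalLM` rather than `AutoModel` to get a generation head.

```python
import torch
from transformers import AutoTokenizer

# Hypothetical usage sketch: reuses `model` from the loading snippet above.
tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)

prompt = "Explain NVFP4 quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; generation requires a language-modeling head
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```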
## Quantization Details
- Backend: nvidia-modelopt 0.37.0
- Preset: nvfp4_blockscale_w4a16
- Method: NVFP4 (4-bit floating point)
- Mode: CPU quantization (works on any system with sufficient RAM); a calibration sketch follows below
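For reference, a checkpoint in this form is typically produced by inserting quantizers and calibrating with ModelOpt. The sketch below is an assumption about the workflow, not the exact recipe used here: it relies on `mtq.quantize` with the same `NVFP4_DEFAULT_CFG` and a placeholder `calib_dataloader` that you must supply.

```python
import modelopt.torch.quantization as mtq

# Hypothetical calibration loop: run a few representative batches through
# the model so ModelOpt can collect statistics for the NVFP4 quantizers.
def forward_loop(model):
    for batch in calib_dataloader:  # placeholder: your calibration data
        model(**batch)

# Insert NVFP4 quantizers and calibrate (can run on CPU given enough RAM)
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```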
## NVFP4 Characteristics
- Checkpoint Size: Same as BF16 (no storage compression)
- Runtime VRAM: ~75% reduction during inference (see the estimate after this list)
- Use Case: Ship full-size checkpoints with quantizer state for VRAM-efficient inference
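The ~75% figure can be sanity-checked with simple arithmetic. Assuming the usual NVFP4 layout of 4-bit values scaled in 16-element blocks by an 8-bit (FP8) factor, each weight costs about 4 + 8/16 = 4.5 bits versus 16 bits for BF16: a ~72% reduction, or exactly 75% if the block scales are ignored. A back-of-the-envelope sketch (the 7B parameter count is an arbitrary example):

```python
# Back-of-the-envelope VRAM estimate for NVFP4 weight storage.
# Assumes 4-bit values plus one 8-bit (FP8) scale per 16-element block.
BITS_BF16 = 16
BITS_NVFP4 = 4 + 8 / 16  # 4.5 bits per weight

def weight_gb(bits_per_weight: float, params: float) -> float:
    """Approximate weight memory in GB for a given bits-per-weight cost."""
    return params * bits_per_weight / 8 / 1e9

params = 7e9  # example: a 7B-parameter model
print(f"BF16 weights:  {weight_gb(BITS_BF16, params):.1f} GB")   # ~14.0 GB
print(f"NVFP4 weights: {weight_gb(BITS_NVFP4, params):.1f} GB")  # ~3.9 GB (~72% less)
```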