RF-DETR Segmentation NVFP4 Experimental Pack

This repository contains experimental NVFP4-packed variants of Roboflow/rf-detr-segmentation, a Transformers RF-DETR instance segmentation checkpoint trained on COCO.

The goal was to test whether RF-DETR segmentation could get the "brrr factor" from FP4-style storage on Blackwell-class NVIDIA hardware. The answer from the current post-training pack is: storage compression works, but output fidelity does not pass yet.

Status

Do not treat this as production-ready. The artifacts load and run through a dequantizing loader, but source-vs-quant validation on an RTX 5080 Laptop GPU shows large drift in bounding boxes and masks: relative L2 error of about 0.56 to 0.58 on pred_boxes and about 0.96 to 1.01 on pred_masks across the three variants. The packed weights are useful as an experiment, a starting point for calibration-aware/native FP4 work, and a record of what naive NVFP4 packing does to RF-DETR.

Included Artifacts

The root of this repo contains the most compressed variant (Full NVFP4); the other two variants live under variants/:

| Variant | Path | Packed params (of total) | File size | Notes |
| --- | --- | --- | --- | --- |
| Full NVFP4 | ./ | 31,841,152 / 34,153,555 (93.23%) | 26 MB | Most compressed; fails mask fidelity |
| Backbone-only NVFP4 | variants/backbone-only/ | 21,234,048 / 34,153,555 (62.17%) | 61 MB | Keeps heads high precision; still fails mask fidelity |
| Heads/decoder NVFP4 | variants/heads-decoder/ | 10,607,104 / 34,153,555 (31.06%) | 96 MB | Keeps ViT backbone high precision; still fails mask fidelity |

Original source checkpoint size was about 130 MB for model.safetensors.
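
As a rough sanity check on these sizes, the arithmetic can be sketched from the parameter counts alone. This is an estimate, not the packer's own accounting: it assumes unpacked tensors are stored as 4-byte float32 (suggested by the roughly 130 MB source file, but not stated), one 1-byte scale per 16 packed values, and it ignores the safetensors header. The helper name `estimate_mib` is made up for this example.

```python
TOTAL_PARAMS = 34_153_555
BLOCK_SIZE = 16
MIB = 1024 ** 2


def estimate_mib(packed_params, unpacked_bytes_per_param=4):
    """Estimate file size in MiB for a given number of NVFP4-packed parameters."""
    code_bytes = packed_params // 2            # two 4-bit E2M1 codes per byte
    scale_bytes = packed_params // BLOCK_SIZE  # one FP8 scale byte per block of 16
    # Assumption: everything that is not packed stays in float32 (4 bytes).
    unpacked_bytes = (TOTAL_PARAMS - packed_params) * unpacked_bytes_per_param
    return (code_bytes + scale_bytes + unpacked_bytes) / MIB


variants = {
    "source (nothing packed)": 0,
    "full": 31_841_152,
    "backbone-only": 21_234_048,
    "heads-decoder": 10_607_104,
}
for name, packed in variants.items():
    print(f"{name}: ~{estimate_mib(packed):.1f} MiB")
```

Under these assumptions, and reading the reported sizes as MiB, each estimate lands within about 1 MiB of the file size listed above.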

Validation

Validation was run on a single machine (rezo@stallion) with the following setup:

  • GPU: NVIDIA GeForce RTX 5080 Laptop GPU, 16 GB
  • Driver: 580.159.03
  • Runtime: PyTorch 2.11.0+cu130
  • Transformers: 5.10.2
  • dtype: bfloat16
  • Test: one synthetic 432x432 RGB image, source model vs dequantized NVFP4 model

Full NVFP4

| Output | Shape | Rel L2 | Cosine | Max Abs |
| --- | --- | --- | --- | --- |
| logits | [1, 200, 91] | 0.1387 | 0.9906 | 4.7031 |
| pred_boxes | [1, 200, 4] | 0.5793 | 0.8245 | 0.9744 |
| pred_masks | [1, 200, 108, 108] | 1.0144 | 0.6042 | 115.1270 |

Backbone-only NVFP4

| Output | Shape | Rel L2 | Cosine | Max Abs |
| --- | --- | --- | --- | --- |
| logits | [1, 200, 91] | 0.1377 | 0.9905 | 4.5469 |
| pred_boxes | [1, 200, 4] | 0.5668 | 0.8335 | 0.9606 |
| pred_masks | [1, 200, 108, 108] | 0.9849 | 0.5992 | 108.3125 |

Heads/decoder NVFP4

| Output | Shape | Rel L2 | Cosine | Max Abs |
| --- | --- | --- | --- | --- |
| logits | [1, 200, 91] | 0.1223 | 0.9925 | 4.2031 |
| pred_boxes | [1, 200, 4] | 0.5620 | 0.8394 | 0.9595 |
| pred_masks | [1, 200, 108, 108] | 0.9607 | 0.5827 | 119.4375 |

The validation JSON files are included next to each variant.
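
For readers interpreting the Rel L2, Cosine, and Max Abs columns, here is a minimal sketch of the usual definitions of those three metrics. The exact formulas used by the validation script are not stated in this card, so treat this as an assumption. It uses only the standard library and works on flat lists; real model outputs would be flattened first. The function name `drift_metrics` is hypothetical.

```python
import math


def drift_metrics(reference, candidate):
    """Compare two equally sized flat lists of floats (source vs. quantized output)."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    diff_sq = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    ref_sq = sum(r * r for r in reference)
    cand_sq = sum(c * c for c in candidate)
    if ref_sq == 0 or cand_sq == 0:
        raise ValueError("metrics are undefined for an all-zero input")
    dot = sum(r * c for r, c in zip(reference, candidate))
    return {
        # Norm of the difference, relative to the norm of the reference.
        "rel_l2": math.sqrt(diff_sq) / math.sqrt(ref_sq),
        # Cosine of the angle between the two flattened outputs.
        "cosine": dot / (math.sqrt(ref_sq) * math.sqrt(cand_sq)),
        # Largest elementwise absolute difference.
        "max_abs": max(abs(r - c) for r, c in zip(reference, candidate)),
    }


print(drift_metrics([1.0, 0.0], [0.0, 1.0]))
```

With these definitions, a relative L2 near 1.0, as reported for pred_masks, means the difference between the two outputs has roughly the same norm as the source output itself.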

Format

This is a custom storage pack, not a native Transformers quantization format:

  • 2D floating-point tensors are packed as NVFP4 E2M1 codes.
  • Per-block scales are stored as FP8 E4M3 bytes.
  • Block size is 16 along the reduction dimension.
  • Non-2D tensors, convolutional tensors, norms, biases, and incompatible shapes are kept in source dtype.
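
The packing scheme in the list above can be illustrated on a single block of 16 values. This is a simplified sketch, not the packer shipped in this repo: it keeps the block scale as a Python float instead of an FP8 E4M3 byte, breaks rounding ties toward the smaller magnitude, and assumes the low nibble holds the first code of each pair. The function names are hypothetical.

```python
# Magnitudes representable by an E2M1 code (1 sign bit, 2 exponent bits, 1 mantissa bit).
# The list index equals the 3-bit exponent/mantissa pattern.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def quantize_block(values):
    """Return (scale, packed_bytes) for one block of 16 floats."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("expected a block of 16 values")
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    # Simplification: the scale stays a float here rather than an FP8 E4M3 byte.
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        sign_bit = 0b1000 if v < 0 else 0
        codes.append(sign_bit | index)
    # Two 4-bit codes per byte; low-nibble-first is an assumption of this sketch.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK_SIZE, 2))
    return scale, packed


def dequantize_block(scale, packed):
    """Invert quantize_block: unpack the nibbles and multiply by the block scale."""
    values = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):
            magnitude = E2M1_MAGNITUDES[code & 0b0111]
            sign = -1.0 if code & 0b1000 else 1.0
            values.append(sign * magnitude * scale)
    return values


block = [0.1 * i - 0.8 for i in range(BLOCK_SIZE)]
scale, packed = quantize_block(block)
restored = dequantize_block(scale, packed)
print(len(packed), max(abs(a - b) for a, b in zip(block, restored)))
```

Each block has only 8 magnitudes to choose from, so one outlier that sets the scale leaves the remaining values in that block with very coarse resolution.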

The root file is:

nvfp4_model.safetensors

Quantization metadata is in:

quantization_config.json
quant_error_report.json
validation.json

Loading

Use the included loader, which dequantizes the packed tensors into a temporary HF-style checkpoint and then lets the official Transformers RF-DETR loader perform its key conversion:

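# Dequantizes the packed tensors, then loads through the Transformers RF-DETR loader.
from scripts.load_rf_detr_nvfp4 import load_model

# "." is the repo root, which holds the Full NVFP4 variant.
model, processor = load_model(".", dtype_name="bfloat16")
model = model.to("cuda")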

This path is correct for validation, but it does not provide native FP4 inference speed, because the weights are dequantized before the model runs. Native Blackwell acceleration will require a runtime adapter using torchao, Transformer Engine, or ModelOpt-style FP4 kernels, plus a calibration-aware export.

Recommended Next Step

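For a production RF-DETR quant, do not continue with blind, calibration-free packing of every eligible tensor. All three variants fail on pred_masks, including the ones that keep the heads or the backbone in high precision, which points to sensitivity in the segmentation path. The next useful path is: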

  1. Use current Transformers RF-DETR support.
  2. Apply native FP4/NVFP4 quantization through torchao or NVIDIA tooling.
  3. Calibrate on real mukbang/food-delivery frames, not synthetic noise.
  4. Validate with actual post-processed masks and COCO-style detection metrics, not only raw tensor cosine.
  5. Keep mask/head layers in higher precision if needed.
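
To make step 4 concrete, one way to compare post-processed masks rather than raw tensors is intersection-over-union on binarized masks. The sketch below is an illustration only: it assumes mask outputs are logits binarized at 0 (equivalent to a sigmoid threshold of 0.5), works on nested lists, and does not reproduce the Transformers post-processing or the COCO evaluation protocol. The function names are hypothetical.

```python
def binarize(mask_logits, threshold=0.0):
    """Turn a 2D list of mask logits into a 2D list of booleans."""
    return [[value > threshold for value in row] for row in mask_logits]


def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks with the same shape."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += int(a and b)
            union += int(a or b)
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


source_logits = [[2.0, -1.0], [3.0, -4.0]]
quant_logits = [[1.5, 0.5], [2.0, -3.0]]
print(mask_iou(binarize(source_logits), binarize(quant_logits)))
```

A metric like this responds only to pixels that change side of the threshold, so it can disagree with raw-tensor drift in either direction, which is why both views are worth reporting.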

Source Model

  • Base: Roboflow/rf-detr-segmentation
  • License: Apache 2.0
  • Task: COCO instance segmentation
  • Architecture: RF-DETR instance segmentation in Transformers

Citation

@misc{robinson2026rfdetrneuralarchitecturesearch,
      title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
      author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
      year={2026},
      eprint={2511.09554},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://huggingface.co/papers/2511.09554},
}