RF-DETR Segmentation NVFP4 Experimental Pack
This repository contains experimental NVFP4-packed variants of Roboflow/rf-detr-segmentation, a Transformers RF-DETR instance segmentation checkpoint trained on COCO.
The goal was to test whether RF-DETR segmentation could get the "brrr factor" from FP4-style storage on Blackwell-class NVIDIA hardware. The answer from the current post-training pack is: storage compression works, but output fidelity does not pass yet.
Status
Do not treat this as production-ready. The artifacts load and run through a dequantizing loader, but source-vs-quant validation on an RTX 5080 shows large drift in bounding boxes and masks. The packed weights are useful as an experiment, a starting point for calibration-aware/native FP4 work, and a record of what naive NVFP4 packing does to RF-DETR.
Included Artifacts
The root of this repo contains the most compressed variant:
| Variant | Path | Packed Params | File Size | Notes |
|---|---|---|---|---|
| Full NVFP4 | ./ | 31,841,152 / 34,153,555 (93.23%) | 26 MB | Most compressed; fails mask fidelity |
| Backbone-only NVFP4 | variants/backbone-only/ | 21,234,048 / 34,153,555 (62.17%) | 61 MB | Keeps heads high precision; still fails mask fidelity |
| Heads/decoder NVFP4 | variants/heads-decoder/ | 10,607,104 / 34,153,555 (31.06%) | 96 MB | Keeps ViT backbone high precision; still fails mask fidelity |
For comparison, the original source checkpoint (model.safetensors) is about 130 MB.
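As a rough sanity check on the file sizes above, the sketch below estimates a pack's size from the parameter counts, using the layout described under Format (4-bit codes plus one FP8 scale byte per 16-element block). It assumes the unpacked tensors are stored as float32 and ignores safetensors header overhead; `estimate_pack_bytes` is a hypothetical helper, not part of this repo.

```python
def estimate_pack_bytes(packed_params, total_params, source_bytes_per_param=4):
    """Estimate the on-disk size of a pack in bytes.

    Assumption: unpacked tensors use `source_bytes_per_param` bytes each
    (4 for float32). File header overhead is ignored.
    """
    code_bytes = packed_params // 2    # two 4-bit E2M1 codes per byte
    scale_bytes = packed_params // 16  # one FP8 E4M3 scale byte per 16-element block
    unpacked_bytes = (total_params - packed_params) * source_bytes_per_param
    return code_bytes + scale_bytes + unpacked_bytes


TOTAL = 34_153_555
for name, packed in [
    ("full", 31_841_152),
    ("backbone-only", 21_234_048),
    ("heads-decoder", 10_607_104),
]:
    size = estimate_pack_bytes(packed, TOTAL)
    print(f"{name}: {size / 2**20:.1f} MiB")
```

Under those assumptions the estimates come out to roughly 25.9, 60.7, and 95.5 MiB, which is in line with the table if its sizes are read as MiB.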
Validation
Validation was run on rezo@stallion:
- GPU: NVIDIA GeForce RTX 5080 Laptop GPU, 16 GB
- Driver: 580.159.03
- Runtime: PyTorch 2.11.0+cu130
- Transformers: 5.10.2
- dtype: bfloat16
- Test: one synthetic 432x432 RGB image, source model vs dequantized NVFP4 model
Full NVFP4
| Output | Shape | Rel L2 | Cosine | Max Abs |
|---|---|---|---|---|
| logits | [1, 200, 91] | 0.1387 | 0.9906 | 4.7031 |
| pred_boxes | [1, 200, 4] | 0.5793 | 0.8245 | 0.9744 |
| pred_masks | [1, 200, 108, 108] | 1.0144 | 0.6042 | 115.1270 |
Backbone-only NVFP4
| Output | Shape | Rel L2 | Cosine | Max Abs |
|---|---|---|---|---|
| logits | [1, 200, 91] | 0.1377 | 0.9905 | 4.5469 |
| pred_boxes | [1, 200, 4] | 0.5668 | 0.8335 | 0.9606 |
| pred_masks | [1, 200, 108, 108] | 0.9849 | 0.5992 | 108.3125 |
Heads/decoder NVFP4
| Output | Shape | Rel L2 | Cosine | Max Abs |
|---|---|---|---|---|
| logits | [1, 200, 91] | 0.1223 | 0.9925 | 4.2031 |
| pred_boxes | [1, 200, 4] | 0.5620 | 0.8394 | 0.9595 |
| pred_masks | [1, 200, 108, 108] | 0.9607 | 0.5827 | 119.4375 |
The validation JSON files are included next to each variant.
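The three columns in the tables above can be read with the following sketch. It assumes the usual definitions (relative L2 error normalized by the reference norm, cosine similarity over the flattened tensors, and maximum absolute difference); the exact formulas used by the validation script are not stated here, and `compare_outputs` is a hypothetical, dependency-free helper that works on flat lists rather than tensors.

```python
import math


def compare_outputs(reference, candidate):
    """Compare two equally sized flat sequences of floats.

    Assumed definitions:
      rel_l2  = ||ref - cand|| / ||ref||
      cosine  = <ref, cand> / (||ref|| * ||cand||)
      max_abs = max |ref - cand|
    """
    assert len(reference) == len(candidate)
    diff_sq = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    cand_norm = math.sqrt(sum(c * c for c in candidate))
    dot = sum(r * c for r, c in zip(reference, candidate))
    return {
        "rel_l2": math.sqrt(diff_sq) / ref_norm,
        "cosine": dot / (ref_norm * cand_norm),
        "max_abs": max(abs(r - c) for r, c in zip(reference, candidate)),
    }


# Identical outputs give rel_l2 = 0 and cosine = 1.
same = compare_outputs([3.0, 4.0], [3.0, 4.0])
# Orthogonal outputs give cosine = 0.
orthogonal = compare_outputs([1.0, 0.0], [0.0, 1.0])
```

Under these definitions, a relative L2 near 1.0, as in the pred_masks rows, means the error norm is about as large as the reference output itself.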
Format
This is a custom storage pack, not a native Transformers quantization format:
- 2D floating tensors are packed as NVFP4 E2M1 codes.
- Per-block scales are stored as FP8 E4M3 bytes.
- Block size is 16 along the reduction dimension.
- Non-2D tensors, convolutional tensors, norms, biases, and incompatible shapes are kept in source dtype.
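The block scheme above can be illustrated with a minimal pure-Python sketch for a single 16-element block. It is not the packing code from this repo: the function names are hypothetical, the nibble order inside each byte is an assumption, and the scale is kept as a Python float instead of being rounded to an FP8 E4M3 byte.

```python
# Magnitudes representable by an E2M1 code (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # block size along the reduction dimension


def quantize_block(values):
    """Return (scale, codes) for one block of 16 floats."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest E2M1 value (6.0).
    # Simplification: the scale stays a float here, not an FP8 E4M3 byte.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(8), key=lambda i: abs(E2M1_GRID[i] - mag))  # nearest grid value
        sign = 0x08 if v < 0 else 0x00
        codes.append(sign | idx)
    return scale, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte (low nibble first; an assumed order)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize_block(scale, packed):
    """Invert pack_nibbles and quantize_block, up to rounding error."""
    out = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):
            mag = E2M1_GRID[code & 0x07]
            out.append((-mag if code & 0x08 else mag) * scale)
    return out


block = [float(i - 8) for i in range(BLOCK)]  # -8.0 .. 7.0
scale, codes = quantize_block(block)
packed = pack_nibbles(codes)                  # 16 values stored in 8 bytes
restored = dequantize_block(scale, packed)
```

Because the E2M1 grid is coarse near its top (4.0 and 6.0 are adjacent values), elements close to the block maximum can be off by up to one scale unit after the round trip, which gives a feel for why calibration-free packing can drift.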
The root file is:
nvfp4_model.safetensors
Quantization metadata is in:
quantization_config.json
quant_error_report.json
validation.json
Loading
Use the included loader, which dequantizes the packed tensors into a temporary HF-style checkpoint and then lets the official Transformers RF-DETR loader perform its key conversion:
from scripts.load_rf_detr_nvfp4 import load_model
model, processor = load_model(".", dtype_name="bfloat16")
model = model.to("cuda")
This path is correct for validation, but it does not provide native FP4 inference speed. It dequantizes before running. Native Blackwell acceleration will require a runtime adapter using torchao/Transformer Engine/ModelOpt-style FP4 kernels and a calibration-aware export.
Recommended Next Step
For a production RF-DETR quant, do not continue with blind per-tensor packing. The validation results point to sensitivity in the segmentation path. The next useful path is:
- Use current Transformers RF-DETR support.
- Apply native FP4/NVFP4 quantization through torchao or NVIDIA tooling.
- Calibrate on real mukbang/food-delivery frames, not synthetic noise.
- Validate with actual post-processed masks and COCO-style detection metrics, not only raw tensor cosine.
- Keep mask/head layers in higher precision if needed.
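To make the point about post-processed masks concrete, here is a small sketch of comparing binarized masks by intersection-over-union instead of raw tensor similarity. It is an illustration only: `binarize` and `mask_iou` are hypothetical helpers, masks are plain nested lists, and thresholding mask logits at 0 (equivalent to a sigmoid probability above 0.5) is an assumption about the post-processing, not something taken from this repo.

```python
def binarize(mask_logits, threshold=0.0):
    """Turn a 2D grid of mask logits into 0/1 values.

    Assumption: a logit above 0 counts as foreground.
    """
    return [[1 if v > threshold else 0 for v in row] for row in mask_logits]


def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two equally shaped binary masks."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


source_mask = binarize([[0.3, 1.2], [-2.0, -0.1]])
quant_mask = binarize([[0.9, -0.4], [-1.5, -0.2]])
iou = mask_iou(source_mask, quant_mask)
```

A per-query IoU like this, between the source model's mask and the quantized model's mask, is closer to what a user sees than cosine similarity over raw mask logits; full COCO-style evaluation would still be needed for a pass/fail decision.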
Source Model
- Base: Roboflow/rf-detr-segmentation
- License: Apache 2.0
- Task: COCO instance segmentation
- Architecture: RF-DETR instance segmentation in Transformers
Citation
@misc{robinson2026rfdetrneuralarchitecturesearch,
title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
year={2026},
eprint={2511.09554},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://huggingface.co/papers/2511.09554},
}