---
license: apache-2.0
tags:
- image-segmentation
- instance-segmentation
- vision
datasets:
- coco
pipeline_tag: image-segmentation
library_name: transformers
---

# RF-DETR (Segmentation)

RF-DETR is a family of real-time detection transformers introduced in [RF-DETR: Neural Architecture Search for Real-Time Detection Transformers](https://huggingface.co/papers/2511.09554) by Robinson et al. and integrated into 🤗 Transformers via [PR #36895](https://github.com/huggingface/transformers/pull/36895).

## Model description

RF-DETR is an end-to-end instance segmentation model that combines ideas from LW-DETR and Deformable DETR. It has three main parts: a DINOv2-with-registers style ViT backbone (with an RF-DETR windowing pattern for efficient attention), a multi-scale projector between the encoder and decoder, and a multi-scale deformable DETR decoder extended with an instance-segmentation head.

Key Architectural Details:
- **Backbone:** DINOv2-with-registers style ViT with RF-DETR **windowed / full** attention alternation.
- **Multi-scale fusion:** **RF-DETR multi-scale projector** (C2f-style blocks in the LW-DETR lineage) to aggregate multi-level backbone features before the decoder.
- **Decoder:** **Deformable DETR**-style decoder with multi-scale deformable cross-attention; segmentation checkpoints add mask prediction on top of box/class outputs.
- **Queries:** DETR-style object queries with bipartite matching and auxiliary decoder losses.
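
The bipartite matching mentioned in the last bullet can be illustrated with a toy example. This is a minimal sketch, not the RF-DETR implementation: the function name `match_queries_to_targets` and the cost values are made up for illustration, and the brute-force search stands in for the Hungarian algorithm that DETR-style implementations typically use.

```python
from itertools import permutations


def match_queries_to_targets(cost):
    """Return the minimum-cost one-to-one assignment of targets to queries.

    cost[q][t] is a hypothetical matching cost between query q and target t.
    Brute force is only practical for toy sizes.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if cost else 0
    best_total, best_assignment = float("inf"), ()
    # Try every way of giving each target a distinct query.
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best_total, best_assignment = total, queries
    # Pairs of (query_index, target_index); leftover queries mean "no object".
    return [(q, t) for t, q in enumerate(best_assignment)]


# Toy cost matrix: 3 queries (rows) x 2 ground-truth objects (columns).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
matches = match_queries_to_targets(cost)
print(matches)  # [(1, 0), (0, 1)]
```

Query 2 stays unmatched here, which corresponds to a query being supervised as background during training.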

Training Details:
- **Segmentation losses:** mask prediction losses (e.g. focal / dice style terms as configured) in addition to box and classification objectives, with auxiliary decoder supervision.
- **Group DETR:** parallel decoder copies during training for faster convergence.
- **NAS (family-level):** weight-sharing neural architecture search over the accuracy–latency trade-off settings described in the RF-DETR paper, specialized to the target dataset distribution.
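
To make the "dice style" mask term above more concrete, here is a generic soft Dice loss written with NumPy. It is a sketch under stated assumptions: the function name `dice_loss` and the smoothing constant `eps=1.0` are illustrative choices, and the actual loss terms and weights used for these checkpoints are whatever the training configuration specifies.

```python
import numpy as np


def dice_loss(mask_logits, target_mask, eps=1.0):
    """Generic soft Dice loss for one predicted mask (not RF-DETR's exact code).

    mask_logits: float array of raw mask scores, any shape.
    target_mask: binary array of the same shape.
    eps: assumed smoothing constant that avoids division by zero.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(mask_logits, dtype=np.float64)))
    target = np.asarray(target_mask, dtype=np.float64)
    intersection = (probs * target).sum()
    denominator = probs.sum() + target.sum()
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


target = np.array([[1, 1], [0, 0]])
good_logits = np.array([[8.0, 8.0], [-8.0, -8.0]])  # agrees with the target
bad_logits = -good_logits                           # predicts the inverse mask

# A mask that overlaps the target well gets a much lower loss.
print(dice_loss(good_logits, target) < dice_loss(bad_logits, target))  # True
```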

### How to use

You can use the raw model for instance segmentation; it predicts **per-instance masks** together with **bounding boxes and class scores**. Browse the [model hub](https://huggingface.co/models?search=stevenbucaille/rf-detr) for all available RF-DETR models.

Here is how to use this model:

```python
from transformers import AutoImageProcessor, RfDetrForInstanceSegmentation
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("stevenbucaille/rf-detr-segmentation")
model = RfDetrForInstanceSegmentation.from_pretrained("stevenbucaille/rf-detr-segmentation")

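inputs = processor(images=image, return_tensors="pt")
# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# PIL reports (width, height); post-processing expects (height, width).
target_sizes = [image.size[::-1]]
results = processor.post_process_instance_segmentation(
    outputs, target_sizes=target_sizes, threshold=0.5
)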
for item in results:
    for k, v in item.items():
        if hasattr(v, "shape"):
            print(k, tuple(v.shape))
        else:
            print(k, v)
```
This should output:
```
segmentation (480, 640)
segments_info []
```
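
If you need one binary mask per detected instance, the post-processed result can be split as sketched below. This assumes the output follows the convention used by other 🤗 Transformers instance-segmentation post-processors, where `segmentation` holds a segment id per pixel and each `segments_info` entry carries `id`, `label_id`, and `score` keys; that layout has not been verified for this checkpoint. The helper name `instance_masks` and the toy inputs are hypothetical.

```python
import numpy as np


def instance_masks(segmentation, segments_info, id2label=None):
    """Split an instance-id map into one boolean mask per detected segment.

    Assumes each entry of segments_info has "id", "label_id" and "score" keys
    and that pixels of `segmentation` hold the matching segment id.
    """
    seg = np.asarray(segmentation)
    instances = []
    for info in segments_info:
        label_id = info["label_id"]
        instances.append(
            {
                "mask": seg == info["id"],
                "label": id2label[label_id] if id2label else label_id,
                "score": info["score"],
            }
        )
    return instances


# Toy stand-in for results[0]; -1 marks pixels that belong to no instance.
seg = np.array([[-1, 0, 0], [1, 1, -1]])
info = [
    {"id": 0, "label_id": 0, "score": 0.93},
    {"id": 1, "label_id": 1, "score": 0.88},
]
masks = instance_masks(seg, info, id2label={0: "cat", 1: "remote"})
for m in masks:
    print(m["label"], int(m["mask"].sum()), m["score"])
```

With real outputs you would pass `results[0]["segmentation"]`, `results[0]["segments_info"]`, and `model.config.id2label`. An empty `segments_info`, as in the printed output above, yields an empty list.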

## Training data

These checkpoints are trained on [COCO 2017](https://cocodataset.org/#home) and use its standard instance segmentation label space (80 thing categories), as reflected in `config.id2label`.

### BibTeX entry and citation info

```bibtex
@misc{robinson2026rfdetrneuralarchitecturesearch,
      title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
      author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
      year={2026},
      eprint={2511.09554},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://huggingface.co/papers/2511.09554},
}
```

This model was originally contributed to 🤗 Transformers by stevenbucaille.