---
license: apache-2.0
tags:
- image-segmentation
- instance-segmentation
- vision
datasets:
- coco
pipeline_tag: image-segmentation
library_name: transformers
---

# RF-DETR (Segmentation)

RF-DETR is a real-time detection transformer family introduced in [RF-DETR: Neural Architecture Search for Real-Time Detection Transformers](https://huggingface.co/papers/2511.09554) by Robinson et al. and integrated in 🤗 Transformers via [PR #36895](https://github.com/huggingface/transformers/pull/36895).

## Model description

RF-DETR is an end-to-end instance segmentation model that combines ideas from LW-DETR and Deformable DETR: a DINOv2-with-registers style ViT backbone (with an RF-DETR windowing pattern for efficient attention), a multi-scale projector between encoder and decoder, and a multi-scale deformable DETR decoder extended with an instance-segmentation head.

Key Architectural Details:

- **Backbone:** DINOv2-with-registers style ViT with RF-DETR **windowed / full** attention alternation.
- **Multi-scale fusion:** **RF-DETR multi-scale projector** (C2f-style blocks in the LW-DETR lineage) to aggregate multi-level backbone features before the decoder.
- **Decoder:** **Deformable DETR**-style decoder with multi-scale deformable cross-attention; segmentation checkpoints add mask prediction on top of box/class outputs.
- **Queries:** DETR-style object queries with bipartite matching and auxiliary decoder losses.

Training Details:

- **Segmentation losses:** mask prediction losses (e.g. focal / dice style terms as configured) in addition to box and classification objectives, with auxiliary decoder supervision.
- **Group DETR:** parallel decoder copies during training for faster convergence.
- **NAS (family-level):** weight-sharing search over accuracy–latency knobs as in the RF-DETR paper, specialized to the target dataset distribution.

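
The dice-style mask term mentioned under Training Details can be sketched as follows. This is a minimal illustration in plain Python, assuming the common DETR-style formulation (sigmoid over per-pixel mask logits, smoothing constant of 1). The names `sigmoid` and `dice_loss` are hypothetical and not taken from the RF-DETR code, whose actual loss configuration may differ.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dice_loss(mask_logits, target_mask, smooth: float = 1.0) -> float:
    """Dice loss for one flattened mask (illustrative, not the library's API).

    mask_logits: per-pixel raw scores for a single predicted instance.
    target_mask: per-pixel ground truth, 1.0 inside the instance, else 0.0.
    smooth: assumed smoothing constant that avoids division by zero.
    """
    probs = [sigmoid(v) for v in mask_logits]
    # Soft overlap between the predicted and target masks
    intersection = sum(p * t for p, t in zip(probs, target_mask))
    # Total "mass" of both masks
    denominator = sum(probs) + sum(target_mask)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


# Toy four-pixel mask: the first two pixels belong to the instance
target = [1.0, 1.0, 0.0, 0.0]
good = dice_loss([10.0, 10.0, -10.0, -10.0], target)  # prediction agrees with target
bad = dice_loss([-10.0, -10.0, 10.0, 10.0], target)   # prediction is inverted
print("matching prediction:", round(good, 3))
print("inverted prediction:", round(bad, 3))
```

A prediction that agrees with the target yields a loss close to 0, while the inverted prediction yields roughly 0.8 in this four-pixel example. Because the term is driven by overlap rather than per-pixel error, it is typically paired with a focal or cross-entropy style term, as the list above suggests.
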
### How to use

You can use the raw model for instance segmentation; it predicts **per-instance masks** together with **bounding boxes and class scores**. See the [model hub](https://huggingface.co/models?search=stevenbucaille/rf-detr) for all available RF-DETR models.

Here is how to use this model:

```python
from transformers import AutoImageProcessor, RfDetrForInstanceSegmentation
import torch
from PIL import Image
import requests

# Download a sample COCO validation image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the segmentation checkpoint
processor = AutoImageProcessor.from_pretrained("stevenbucaille/rf-detr-segmentation")
model = RfDetrForInstanceSegmentation.from_pretrained("stevenbucaille/rf-detr-segmentation")

inputs = processor(images=image, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# PIL reports size as (width, height); post-processing expects (height, width)
target_sizes = [image.size[::-1]]
results = processor.post_process_instance_segmentation(
    outputs, target_sizes=target_sizes, threshold=0.5
)

# Print the shape of tensor-like entries and the value of everything else
for item in results:
    for k, v in item.items():
        if hasattr(v, "shape"):
            print(k, tuple(v.shape))
        else:
            print(k, v)
```

This should output:

```
segmentation (480, 640)
segments_info []
```

## Training data

These checkpoints are trained on the standard [COCO 2017](https://cocodataset.org/#home) instance segmentation label space (80 thing categories) as reflected in `config.id2label`.

### BibTeX entry and citation info

```bibtex
@misc{robinson2026rfdetrneuralarchitecturesearch,
  title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
  author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
  year={2026},
  eprint={2511.09554},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://huggingface.co/papers/2511.09554},
}
```

This model was originally contributed by stevenbucaille in 🤗 transformers.