---
license: apache-2.0
tags:
- image-segmentation
- instance-segmentation
- vision
datasets:
- coco
pipeline_tag: image-segmentation
library_name: transformers
---

# RF-DETR (Segmentation)

RF-DETR is a family of real-time detection transformers introduced in [RF-DETR: Neural Architecture Search for Real-Time Detection Transformers](https://huggingface.co/papers/2511.09554) by Robinson et al. and integrated into 🤗 Transformers via [PR #36895](https://github.com/huggingface/transformers/pull/36895).

## Model description

This RF-DETR checkpoint is an end-to-end instance segmentation model that combines ideas from LW-DETR and Deformable DETR. It uses a DINOv2-with-registers-style ViT backbone (with RF-DETR's windowing pattern for efficient attention), a multi-scale projector between the encoder and decoder, and a multi-scale deformable DETR decoder extended with an instance segmentation head.
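To make the windowing idea concrete, here is a minimal NumPy sketch of windowed versus full self-attention over a grid of patch tokens. It is not RF-DETR's backbone code. It assumes single-head attention with identity query/key/value projections and no register tokens. It also uses a hypothetical schedule (`full_every`) in which every third layer attends globally. The actual window layout and layer pattern used by RF-DETR may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(tokens):
    # tokens: (N, C); single-head self-attention with identity projections
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    weights = softmax(tokens @ tokens.T * scale)
    return weights @ tokens

def window_partition(grid, window):
    # grid: (H, W, C) -> (num_windows, window * window, C)
    H, W, C = grid.shape
    assert H % window == 0 and W % window == 0
    g = grid.reshape(H // window, window, W // window, window, C)
    g = g.transpose(0, 2, 1, 3, 4)
    return g.reshape(-1, window * window, C)

def window_unpartition(windows, H, W, window):
    # Inverse of window_partition: (num_windows, window * window, C) -> (H, W, C)
    C = windows.shape[-1]
    g = windows.reshape(H // window, W // window, window, window, C)
    g = g.transpose(0, 2, 1, 3, 4)
    return g.reshape(H, W, C)

def windowed_attention(grid, window):
    # Each token attends only to tokens inside its own window x window patch
    H, W, _ = grid.shape
    windows = window_partition(grid, window)
    out = np.stack([full_attention(w) for w in windows])
    return window_unpartition(out, H, W, window)

def backbone_block(grid, layer_idx, window, full_every=3):
    # Hypothetical schedule: every `full_every`-th layer uses full attention,
    # all other layers use windowed attention
    H, W, C = grid.shape
    if (layer_idx + 1) % full_every == 0:
        return full_attention(grid.reshape(-1, C)).reshape(H, W, C)
    return windowed_attention(grid, window)
```

Restricting attention to windows makes each layer's cost quadratic in the window size instead of in the total number of tokens. The interleaved full-attention layers still let information flow between windows.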

Key architectural details:
- Backbone: DINOv2-with-registers-style ViT that alternates between RF-DETR windowed attention and full attention.
- Multi-scale fusion: RF-DETR multi-scale projector (C2f-style blocks in the LW-DETR lineage) that aggregates multi-level backbone features before the decoder.
- Decoder: Deformable DETR-style decoder with multi-scale deformable cross-attention; segmentation checkpoints add mask prediction on top of the box and class outputs.
- Queries: DETR-style object queries trained with bipartite matching and auxiliary decoder losses.
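The bipartite matching step can be illustrated with a small sketch. The cost uses only a class-probability term and an L1 box term. The weights (`w_class`, `w_l1`) are hypothetical, and DETR-style matchers typically add a GIoU term as well. The assignment is found by exhaustive search, which only works for toy sizes. Production matchers solve the same problem with the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.

```python
import itertools
import numpy as np

def matching_cost(pred_probs, pred_boxes, gt_labels, gt_boxes, w_class=1.0, w_l1=5.0):
    # pred_probs: (Q, K) class probabilities; pred_boxes: (Q, 4); gt_boxes: (G, 4)
    # Illustrative cost: -w_class * p(target class) + w_l1 * L1 box distance
    class_cost = -pred_probs[:, gt_labels]                                   # (Q, G)
    l1_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (Q, G)
    return w_class * class_cost + w_l1 * l1_cost

def brute_force_match(cost):
    # Exhaustive one-to-one assignment of G targets to Q >= G queries.
    # Returns (query index for each target, target indices).
    Q, G = cost.shape
    best, best_perm = np.inf, None
    for perm in itertools.permutations(range(Q), G):
        total = sum(cost[q, g] for g, q in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return list(best_perm), list(range(G))

# Toy example: 3 queries, 2 ground-truth objects (classes 0 and 1)
pred_probs = np.array([[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]])
pred_boxes = np.array([[0.7, 0.7, 0.2, 0.2], [0.5, 0.5, 0.9, 0.9], [0.2, 0.2, 0.1, 0.1]])
gt_labels = np.array([0, 1])
gt_boxes = np.array([[0.2, 0.2, 0.1, 0.1], [0.7, 0.7, 0.2, 0.2]])
cost = matching_cost(pred_probs, pred_boxes, gt_labels, gt_boxes)
print(brute_force_match(cost))  # ([2, 0], [0, 1])
```

Queries left unmatched are supervised as "no object". In DETR-style training, the matching is repeated for each decoder layer's auxiliary predictions.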

Training details:
- Segmentation losses: mask losses (e.g., focal- or dice-style terms, as configured) in addition to the box and classification objectives, with auxiliary supervision on the intermediate decoder outputs.
- Group DETR: several groups of object queries, each with its own one-to-one matching, pass through the shared decoder in parallel during training for faster convergence; a single group is used at inference.
- NAS (family-level): weight-sharing search over accuracy–latency configuration parameters, as in the RF-DETR paper, specialized to the target dataset distribution.
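The mask terms named above can be sketched as follows. The sketch assumes per-instance mask logits that have already been matched to ground-truth masks and flattened to `P` pixels. It uses a DETR-style dice loss with +1 smoothing and the sigmoid focal loss of Lin et al. The plain averaging and the default `alpha`/`gamma` values are assumptions, not RF-DETR's configured weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dice_loss(mask_logits, targets, eps=1.0):
    # mask_logits, targets: (N, P) per-instance logits and binary masks over P pixels
    probs = sigmoid(mask_logits)
    numerator = 2.0 * (probs * targets).sum(-1)
    denominator = probs.sum(-1) + targets.sum(-1)
    # +eps smoothing keeps the ratio defined for empty masks
    return (1.0 - (numerator + eps) / (denominator + eps)).mean()

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Element-wise focal loss on sigmoid probabilities, averaged over all pixels
    probs = sigmoid(logits)
    # Numerically stable binary cross-entropy with logits
    bce = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    p_t = probs * targets + (1 - probs) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t) ** gamma down-weights pixels that are already well classified
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

With `gamma=0` and `alpha=0.5`, the focal term reduces to half of the mean binary cross-entropy. This is a quick way to sanity-check an implementation.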

### How to use

You can use the raw model for instance segmentation: it predicts per-instance masks together with bounding boxes and class scores. See the [model hub](https://huggingface.co/models?search=stevenbucaille/rf-detr) to find all available RF-DETR models.

Here is how to use this model:

```python
from transformers import AutoImageProcessor, RfDetrForInstanceSegmentation
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("stevenbucaille/rf-detr-segmentation")
model = RfDetrForInstanceSegmentation.from_pretrained("stevenbucaille/rf-detr-segmentation")

inputs = processor(images=image, return_tensors="pt")
# Inference only: disable gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# PIL reports (width, height); post-processing expects (height, width)
target_sizes = [image.size[::-1]]
results = processor.post_process_instance_segmentation(
    outputs, target_sizes=target_sizes, threshold=0.5
)

# Print tensor shapes, and other values as-is
for item in results:
    for k, v in item.items():
        if hasattr(v, "shape"):
            print(k, tuple(v.shape))
        else:
            print(k, v)
```
This should output:
```
segmentation (480, 640)
segments_info []
```
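To work with individual instances, you can split the post-processed output into per-instance binary masks. This helper is a hypothetical sketch, not part of 🤗 Transformers. It assumes a Mask2Former-style output convention: `segmentation` is an `(H, W)` map of segment ids, and each `segments_info` entry is a dict with `id`, `label_id`, and `score` keys. Check the keys actually returned by the RF-DETR processor before relying on it.

```python
import numpy as np

def split_instances(segmentation, segments_info, id2label):
    # Hypothetical helper. segmentation: (H, W) integer map of segment ids.
    # Assumed keys per segment: "id", "label_id", "score".
    instances = []
    for seg in segments_info:
        mask = segmentation == seg["id"]
        instances.append({
            # Fall back to the raw id if the label is missing from id2label
            "label": id2label.get(seg["label_id"], str(seg["label_id"])),
            "score": float(seg["score"]),
            "mask": mask,
            "area": int(mask.sum()),
        })
    # Highest-scoring instances first
    return sorted(instances, key=lambda inst: inst["score"], reverse=True)
```

With the example above, it could be called as `split_instances(results[0]["segmentation"].numpy(), results[0]["segments_info"], model.config.id2label)`. Because `segments_info` is empty in the output shown, it would return an empty list.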

## Training data

These checkpoints are trained on the standard [COCO 2017](https://cocodataset.org/#home) instance segmentation label space (80 thing categories), as reflected in `config.id2label`.

### BibTeX entry and citation info

```bibtex
@misc{robinson2026rfdetrneuralarchitecturesearch,
  title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
  author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
  year={2026},
  eprint={2511.09554},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://huggingface.co/papers/2511.09554},
}
```

This model was contributed to 🤗 Transformers by stevenbucaille.