---
license: apache-2.0
tags:
- object-detection
- vision
datasets:
- coco
pipeline_tag: object-detection
library_name: transformers
---

# RF-DETR (Medium)

RF-DETR is a real-time detection transformer family introduced in [RF-DETR: Neural Architecture Search for Real-Time Detection Transformers](https://huggingface.co/papers/2511.09554) by Robinson et al. and integrated in 🤗 Transformers via [PR #36895](https://github.com/huggingface/transformers/pull/36895).

## Model description

RF-DETR is an end-to-end object detection model that combines ideas from LW-DETR and Deformable DETR: a DINOv2-with-registers style ViT backbone (with an RF-DETR windowing pattern for efficient attention), a multi-scale projector between encoder and decoder, and a multi-scale deformable DETR decoder for fast convergence and strong accuracy–latency tradeoffs.

Key Architectural Details:

- **Backbone:** DINOv2-with-registers style ViT with RF-DETR **windowed / full** attention alternation (instead of a purely convolutional encoder).
- **Multi-scale fusion:** **RF-DETR multi-scale projector** (C2f-style blocks in the LW-DETR lineage) to aggregate multi-level backbone features before the decoder.
- **Decoder:** **Deformable DETR**-style decoder with multi-scale deformable cross-attention; depth and input resolution vary by checkpoint (NAS frontier).
- **Queries:** DETR-style object queries with bipartite matching and auxiliary decoder losses for training stability.

Training Details:

- **Detection losses:** classification plus bounding-box L1 and GIoU, with auxiliary losses on intermediate decoder layers.
- **Group DETR:** parallel decoder copies during training for faster convergence (same high-level idea as LW-DETR's Group DETR).
- **NAS (family-level):** the RF-DETR paper uses weight-sharing neural architecture search over practical accuracy–latency knobs after adapting a shared backbone on the target dataset, so many checkpoints correspond to different subnets without full independent retrains for every point on the frontier.

### How to use

You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=stevenbucaille/rf-detr) to find all available RF-DETR models.

Here is how to use this model:

```python
from transformers import AutoImageProcessor, RfDetrForObjectDetection
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("stevenbucaille/rf-detr-medium")
model = RfDetrForObjectDetection.from_pretrained("stevenbucaille/rf-detr-medium")

inputs = processor(images=image, return_tensors="pt")
# inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# rescale the predicted boxes to the original image size (height, width)
# and keep only detections with score > 0.35
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.35)[0]

# print at most the first 8 detections
for score, label, box in list(zip(results["scores"], results["labels"], results["boxes"]))[:8]:
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```

This should output:

```
Detected remote with confidence 0.988 at location [40.11, 73.16, 175.23, 118.2]
Detected cat with confidence 0.988 at location [347.22, 23.4, 639.47, 374.62]
Detected cat with confidence 0.987 at location [7.72, 55.88, 316.65, 473.55]
Detected remote with confidence 0.98 at location [334.08, 76.82, 370.65, 188.08]
Detected couch with confidence 0.414 at location [1.54, 0.42, 639.09, 475.48]
```

## Training data

These checkpoints are trained on the [COCO 2017 object detection dataset](https://cocodataset.org/#home) and use its standard label space (80 categories), as reflected in `config.id2label`.

### BibTeX entry and citation info

```bibtex
@misc{robinson2026rfdetrneuralarchitecturesearch,
      title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
      author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
      year={2026},
      eprint={2511.09554},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://huggingface.co/papers/2511.09554},
}
```

This model was originally contributed by stevenbucaille in 🤗 transformers.
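
### Illustrative sketch: box regression terms

The training details above list bounding-box L1 and GIoU as the box regression terms. The snippet below is a minimal, dependency-free sketch of how those two quantities can be computed for a single predicted/target box pair. It is not taken from the RF-DETR or 🤗 Transformers code. The function names (`box_area`, `generalized_iou`, `l1_distance`) are hypothetical, and the boxes are assumed to be in `(x_min, y_min, x_max, y_max)` corner format for readability, whereas DETR-style training code typically applies the L1 term to normalized center-format boxes.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; 0.0 if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def generalized_iou(a, b):
    """Generalized IoU of two corner-format boxes; the value lies in [-1, 1]."""
    # Overlap rectangle; box_area returns 0.0 when the boxes do not intersect.
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    iou = inter / union if union > 0 else 0.0

    # Smallest axis-aligned box that encloses both inputs.
    enclosing = box_area((min(a[0], b[0]), min(a[1], b[1]),
                          max(a[2], b[2]), max(a[3], b[3])))
    if enclosing <= 0:
        return iou
    # Penalize the part of the enclosing box covered by neither input.
    return iou - (enclosing - union) / enclosing


def l1_distance(pred, target):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


if __name__ == "__main__":
    pred = (0.0, 0.0, 2.0, 2.0)    # hypothetical predicted box
    target = (1.0, 1.0, 3.0, 3.0)  # hypothetical ground-truth box
    giou_loss = 1.0 - generalized_iou(pred, target)
    print(f"L1 term: {l1_distance(pred, target):.4f}")
    print(f"GIoU loss term: {giou_loss:.4f}")
```

Unlike plain IoU, GIoU stays informative when the boxes do not overlap, since the enclosing-box penalty still changes as the prediction moves toward the target. In an actual training loop these terms would be weighted and summed with the classification loss over the matched query/target pairs; the weights are checkpoint- and config-specific and are not shown here.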