Initial commit

Files changed (5)

.gitattributes +35 -0
README.md +94 -0
config.json +285 -0
model.safetensors +3 -0
preprocessor_config.json +26 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
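As a rough check of which files in this commit the rules above route through Git LFS, the sketch below matches file names against a few of the listed patterns with Python's `fnmatch`. This only approximates gitattributes matching, which has its own rules for paths and `**`, so treat it as an illustration rather than what Git does internally. The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# A subset of the patterns from the .gitattributes above.
LFS_PATTERNS = ["*.bin", "*.h5", "*.onnx", "*.pt", "*.safetensors", "*.tar.*", "*tfevents*"]

# The five files touched by this commit.
COMMIT_FILES = [
    ".gitattributes",
    "README.md",
    "config.json",
    "model.safetensors",
    "preprocessor_config.json",
]

def is_lfs_tracked(filename, patterns=LFS_PATTERNS):
    """Return True if the bare file name matches any pattern (fnmatch approximation)."""
    return any(fnmatch(filename, pattern) for pattern in patterns)

tracked = [name for name in COMMIT_FILES if is_lfs_tracked(name)]
print(tracked)
```

Under this approximation only `model.safetensors` matches, which is consistent with that file appearing in the diff as a three-line pointer rather than as binary content.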

README.md ADDED

	@@ -0,0 +1,94 @@

+---
+license: apache-2.0
+tags:
+- object-detection
+- vision
+datasets:
+- coco
+pipeline_tag: object-detection
+library_name: transformers
+---
+# RF-DETR (Base 2)
+RF-DETR is a real-time detection transformer family introduced in [RF-DETR: Neural Architecture Search for Real-Time Detection Transformers](https://huggingface.co/papers/2511.09554) by Robinson et al. and integrated in 🤗 Transformers via [PR #36895](https://github.com/huggingface/transformers/pull/36895).
+## Model description
+RF-DETR is an end-to-end object detection model that combines ideas from LW-DETR and Deformable DETR: a DINOv2-with-registers style ViT backbone (with an RF-DETR windowing pattern for efficient attention), a multi-scale projector between encoder and decoder, and a multi-scale deformable DETR decoder for fast convergence and strong accuracy–latency tradeoffs.
+Key Architectural Details:
+- **Backbone:** DINOv2-with-registers style ViT with RF-DETR **windowed / full** attention alternation (instead of a purely convolutional encoder).
+- **Multi-scale fusion:** **RF-DETR multi-scale projector** (C2f-style blocks in the LW-DETR lineage) to aggregate multi-level backbone features before the decoder.
+- **Decoder:** **Deformable DETR**-style decoder with multi-scale deformable cross-attention; depth and input resolution vary by checkpoint (NAS frontier).
+- **Queries:** DETR-style object queries with bipartite matching and auxiliary decoder losses for training stability.
+Training Details:
+- **Detection losses:** classification plus bounding-box L1 and GIoU, with auxiliary losses on intermediate decoder layers.
+- **Group DETR:** parallel decoder copies during training for faster convergence (same high-level idea as LW-DETR's Group DETR).
+- **NAS (family-level):** the RF-DETR paper first adapts a shared backbone on the target dataset, then runs weight-sharing neural architecture search over practical accuracy–latency knobs. Many checkpoints therefore correspond to different subnets, without a full independent retrain for every point on the frontier.
+### How to use
+You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=stevenbucaille/rf-detr) for all available RF-DETR models.
+Here is how to use this model:
+```python
+from transformers import AutoImageProcessor, RfDetrForObjectDetection
+import torch
+from PIL import Image
+import requests
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+processor = AutoImageProcessor.from_pretrained("stevenbucaille/rf-detr-base-2")
+model = RfDetrForObjectDetection.from_pretrained("stevenbucaille/rf-detr-base-2")
+inputs = processor(images=image, return_tensors="pt")
+outputs = model(**inputs)
+# convert outputs (bounding boxes and class logits) to absolute (xmin, ymin, xmax, ymax) boxes
+# let's only keep detections with score > 0.35
+target_sizes = torch.tensor([image.size[::-1]])
+results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.35)[0]
+for score, label, box in list(zip(results["scores"], results["labels"], results["boxes"]))[:8]:
+    box = [round(i, 2) for i in box.tolist()]
+    print(
+        f"Detected {model.config.id2label[label.item()]} with confidence "
+        f"{round(score.item(), 3)} at location {box}"
+    )
+```
+This should output:
+```
+Detected remote with confidence 0.981 at location [40.78, 72.72, 175.68, 117.19]
+Detected cat with confidence 0.979 at location [7.45, 54.47, 316.23, 473.51]
+Detected cat with confidence 0.964 at location [343.26, 23.5, 636.68, 371.82]
+Detected remote with confidence 0.821 at location [333.94, 77.32, 370.25, 186.78]
+Detected couch with confidence 0.446 at location [0.62, 1.44, 639.34, 475.39]
+Detected chair with confidence 0.113 at location [2.55, 1.5, 640.67, 476.14]
+Detected bed with confidence 0.165 at location [5.88, 116.99, 638.36, 472.72]
+Detected couch with confidence 0.183 at location [0.57, 1.34, 639.31, 271.61]
+```
+## Training data
+These checkpoints are trained on the standard [COCO 2017 object detection dataset](https://cocodataset.org/#home). `config.id2label` keeps the original COCO category ids (0 to 90), of which 80 are real categories and the remaining ids are `N/A` placeholders.
+### BibTeX entry and citation info
+```bibtex
+@misc{robinson2026rfdetrneuralarchitecturesearch,
+      title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
+      author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
+      year={2026},
+      eprint={2511.09554},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://huggingface.co/papers/2511.09554},
+}
+```
+This model was originally contributed to 🤗 Transformers by stevenbucaille.
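The `threshold=0.35` argument in the README snippet can be illustrated with plain Python. The sketch below takes the eight scores from the README's sample output and keeps those strictly above the threshold, on the assumption that post-processing filters with `score > threshold`. The helper `keep_above` is hypothetical and not part of Transformers.

```python
# (label, score) pairs copied from the sample output in the README.
detections = [
    ("remote", 0.981), ("cat", 0.979), ("cat", 0.964), ("remote", 0.821),
    ("couch", 0.446), ("chair", 0.113), ("bed", 0.165), ("couch", 0.183),
]

def keep_above(items, threshold):
    """Keep (label, score) pairs whose score is strictly greater than threshold."""
    return [(label, score) for label, score in items if score > threshold]

kept = keep_above(detections, threshold=0.35)
dropped = [d for d in detections if d not in kept]
print(len(kept), "kept;", len(dropped), "dropped")
```

Under that assumption, the last three listed detections (scores 0.113, 0.165 and 0.183) would not be printed with `threshold=0.35`, so the sample output may have been produced with a lower threshold than the one in the snippet.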

config.json ADDED

	@@ -0,0 +1,285 @@

+{
+  "activation_dropout": 0.0,
+  "activation_function": "silu",
+  "architectures": [
+    "RfDetrForObjectDetection"
+  ],
+  "attention_bias": true,
+  "attention_dropout": 0.0,
+  "auxiliary_loss": true,
+  "backbone_config": {
+    "apply_layernorm": true,
+    "attention_probs_dropout_prob": 0.0,
+    "drop_path_rate": 0.0,
+    "dtype": "float32",
+    "hidden_act": "gelu",
+    "hidden_dropout_prob": 0.0,
+    "hidden_size": 384,
+    "image_size": 518,
+    "initializer_range": 0.02,
+    "layer_norm_eps": 1e-06,
+    "layerscale_value": 1.0,
+    "mlp_ratio": 4,
+    "model_type": "rf_detr_dinov2",
+    "num_attention_heads": 6,
+    "num_channels": 3,
+    "num_hidden_layers": 12,
+    "num_windows": 4,
+    "out_features": [
+      "stage2",
+      "stage5",
+      "stage8",
+      "stage11"
+    ],
+    "out_indices": [
+      2,
+      5,
+      8,
+      11
+    ],
+    "patch_size": 14,
+    "qkv_bias": true,
+    "reshape_hidden_states": true,
+    "stage_names": [
+      "stem",
+      "stage1",
+      "stage2",
+      "stage3",
+      "stage4",
+      "stage5",
+      "stage6",
+      "stage7",
+      "stage8",
+      "stage9",
+      "stage10",
+      "stage11",
+      "stage12"
+    ],
+    "use_mask_token": true,
+    "use_swiglu_ffn": false,
+    "window_block_indexes": [
+      0,
+      1,
+      3,
+      4,
+      6,
+      7,
+      9,
+      10
+    ]
+  },
+  "bbox_cost": 5,
+  "bbox_loss_coefficient": 5,
+  "c2f_num_blocks": 3,
+  "class_cost": 2,
+  "class_loss_coefficient": 1,
+  "d_model": 256,
+  "decoder_activation_function": "relu",
+  "decoder_cross_attention_heads": 16,
+  "decoder_ffn_dim": 2048,
+  "decoder_layers": 3,
+  "decoder_n_points": 2,
+  "decoder_self_attention_heads": 8,
+  "dice_loss_coefficient": 1,
+  "disable_custom_kernels": true,
+  "dropout": 0.1,
+  "dtype": "float32",
+  "eos_coefficient": 0.1,
+  "focal_alpha": 0.25,
+  "giou_cost": 2,
+  "giou_loss_coefficient": 2,
+  "group_detr": 13,
+  "hidden_expansion": 0.5,
+  "id2label": {
+    "0": "N/A",
+    "1": "person",
+    "2": "bicycle",
+    "3": "car",
+    "4": "motorcycle",
+    "5": "airplane",
+    "6": "bus",
+    "7": "train",
+    "8": "truck",
+    "9": "boat",
+    "10": "traffic light",
+    "11": "fire hydrant",
+    "12": "N/A",
+    "13": "stop sign",
+    "14": "parking meter",
+    "15": "bench",
+    "16": "bird",
+    "17": "cat",
+    "18": "dog",
+    "19": "horse",
+    "20": "sheep",
+    "21": "cow",
+    "22": "elephant",
+    "23": "bear",
+    "24": "zebra",
+    "25": "giraffe",
+    "26": "N/A",
+    "27": "backpack",
+    "28": "umbrella",
+    "29": "N/A",
+    "30": "N/A",
+    "31": "handbag",
+    "32": "tie",
+    "33": "suitcase",
+    "34": "frisbee",
+    "35": "skis",
+    "36": "snowboard",
+    "37": "sports ball",
+    "38": "kite",
+    "39": "baseball bat",
+    "40": "baseball glove",
+    "41": "skateboard",
+    "42": "surfboard",
+    "43": "tennis racket",
+    "44": "bottle",
+    "45": "N/A",
+    "46": "wine glass",
+    "47": "cup",
+    "48": "fork",
+    "49": "knife",
+    "50": "spoon",
+    "51": "bowl",
+    "52": "banana",
+    "53": "apple",
+    "54": "sandwich",
+    "55": "orange",
+    "56": "broccoli",
+    "57": "carrot",
+    "58": "hot dog",
+    "59": "pizza",
+    "60": "donut",
+    "61": "cake",
+    "62": "chair",
+    "63": "couch",
+    "64": "potted plant",
+    "65": "bed",
+    "66": "N/A",
+    "67": "dining table",
+    "68": "N/A",
+    "69": "N/A",
+    "70": "toilet",
+    "71": "N/A",
+    "72": "tv",
+    "73": "laptop",
+    "74": "mouse",
+    "75": "remote",
+    "76": "keyboard",
+    "77": "cell phone",
+    "78": "microwave",
+    "79": "oven",
+    "80": "toaster",
+    "81": "sink",
+    "82": "refrigerator",
+    "83": "N/A",
+    "84": "book",
+    "85": "clock",
+    "86": "vase",
+    "87": "scissors",
+    "88": "teddy bear",
+    "89": "hair drier",
+    "90": "toothbrush"
+  },
+  "init_std": 0.02,
+  "intermediate_size": 1024,
+  "label2id": {
+    "N/A": 83,
+    "airplane": 5,
+    "apple": 53,
+    "backpack": 27,
+    "banana": 52,
+    "baseball bat": 39,
+    "baseball glove": 40,
+    "bear": 23,
+    "bed": 65,
+    "bench": 15,
+    "bicycle": 2,
+    "bird": 16,
+    "boat": 9,
+    "book": 84,
+    "bottle": 44,
+    "bowl": 51,
+    "broccoli": 56,
+    "bus": 6,
+    "cake": 61,
+    "car": 3,
+    "carrot": 57,
+    "cat": 17,
+    "cell phone": 77,
+    "chair": 62,
+    "clock": 85,
+    "couch": 63,
+    "cow": 21,
+    "cup": 47,
+    "dining table": 67,
+    "dog": 18,
+    "donut": 60,
+    "elephant": 22,
+    "fire hydrant": 11,
+    "fork": 48,
+    "frisbee": 34,
+    "giraffe": 25,
+    "hair drier": 89,
+    "handbag": 31,
+    "horse": 19,
+    "hot dog": 58,
+    "keyboard": 76,
+    "kite": 38,
+    "knife": 49,
+    "laptop": 73,
+    "microwave": 78,
+    "motorcycle": 4,
+    "mouse": 74,
+    "orange": 55,
+    "oven": 79,
+    "parking meter": 14,
+    "person": 1,
+    "pizza": 59,
+    "potted plant": 64,
+    "refrigerator": 82,
+    "remote": 75,
+    "sandwich": 54,
+    "scissors": 87,
+    "sheep": 20,
+    "sink": 81,
+    "skateboard": 41,
+    "skis": 35,
+    "snowboard": 36,
+    "spoon": 50,
+    "sports ball": 37,
+    "stop sign": 13,
+    "suitcase": 33,
+    "surfboard": 42,
+    "teddy bear": 88,
+    "tennis racket": 43,
+    "tie": 32,
+    "toaster": 80,
+    "toilet": 70,
+    "toothbrush": 90,
+    "traffic light": 10,
+    "train": 7,
+    "truck": 8,
+    "tv": 72,
+    "umbrella": 28,
+    "vase": 86,
+    "wine glass": 46,
+    "zebra": 24
+  },
+  "layer_norm_eps": 1e-05,
+  "mask_class_loss_coefficient": 5.0,
+  "mask_dice_loss_coefficient": 5.0,
+  "mask_downsample_ratio": 4,
+  "mask_loss_coefficient": 1,
+  "mask_point_sample_ratio": 16,
+  "model_type": "rf_detr",
+  "num_feature_levels": 1,
+  "num_queries": 300,
+  "projector_scale_factors": [
+    1.0
+  ],
+  "segmentation_head_activation_function": "gelu",
+  "transformers_version": "5.8.0.dev0"
+}
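Two properties of this config can be checked with a few lines of Python. The sketch below rebuilds only the parts it needs: the ids marked `N/A` in `id2label`, and the backbone's `window_block_indexes` and `out_indices`. It assumes that backbone blocks not listed in `window_block_indexes` use full attention, which matches the windowed / full alternation described in the README but is not stated in the config itself. All variable names are illustrative.

```python
# Ids labeled "N/A" in id2label above; every other id in 0..90 is a real category.
NA_IDS = {0, 12, 26, 29, 30, 45, 66, 68, 69, 71, 83}
NUM_IDS = 91  # id2label keys run from "0" to "90"

num_real_classes = NUM_IDS - len(NA_IDS)

# Inverting a label map by name collapses all "N/A" ids onto the last one seen,
# consistent with label2id holding a single "N/A" entry mapped to 83.
id2label_stub = {i: "N/A" for i in sorted(NA_IDS)}
label2id_stub = {label: i for i, label in id2label_stub.items()}

# Backbone attention layout, copied from backbone_config.
NUM_HIDDEN_LAYERS = 12
WINDOW_BLOCK_INDEXES = [0, 1, 3, 4, 6, 7, 9, 10]
OUT_INDICES = [2, 5, 8, 11]

# Assumption: blocks outside window_block_indexes use full (global) attention.
full_attention_blocks = [
    i for i in range(NUM_HIDDEN_LAYERS) if i not in WINDOW_BLOCK_INDEXES
]

print(num_real_classes)
print(label2id_stub)
print(full_attention_blocks == OUT_INDICES)
```

If the assumption holds, the feature maps exposed through `out_indices` come exactly from the full-attention blocks, and the 91 ids reduce to the 80 COCO categories mentioned in the README.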

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d84a769cbbe85f8be86702fb6c27ef26d679606dac384736d879e6cbf968bea4
+size 128757944
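The three added lines are a Git LFS pointer rather than the weights themselves. A minimal sketch of reading such a pointer follows. `parse_lfs_pointer` is a hypothetical helper, and the value-count estimate assumes every tensor is stored as 4-byte float32 (the `dtype` in config.json) and ignores the safetensors header, so it is only a rough upper bound.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d84a769cbbe85f8be86702fb6c27ef26d679606dac384736d879e6cbf968bea4
size 128757944
"""

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / (1024 ** 2)
# Assumption: 4 bytes per value (float32), header overhead ignored.
approx_values_millions = pointer["size_bytes"] / 4 / 1e6

print(f"{size_mib:.1f} MiB, at most about {approx_values_millions:.1f}M float32 values")
```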

preprocessor_config.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "do_convert_annotations": true,
+  "do_normalize": true,
+  "do_pad": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "format": "coco_detection",
+  "image_mean": [
+    0.485,
+    0.456,
+    0.406
+  ],
+  "image_processor_type": "DetrImageProcessor",
+  "image_std": [
+    0.229,
+    0.224,
+    0.225
+  ],
+  "resample": 2,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "height": 560,
+    "width": 560
+  },
+  "use_fast": true
+}
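The preprocessing values above imply some simple arithmetic, sketched here in plain Python. `rescale_factor` equals 1/255, each channel is then normalized with the listed mean and std, and a 560 by 560 input split into 14-pixel patches (the backbone's `patch_size` from config.json) gives a 40 by 40 patch grid. The function name and the order "rescale, then normalize" are assumptions about how the image processor applies these settings.

```python
import math

RESCALE_FACTOR = 0.00392156862745098
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]
SIZE = {"height": 560, "width": 560}
PATCH_SIZE = 14  # from backbone_config in config.json

def preprocess_pixel(rgb):
    """Rescale an 8-bit RGB pixel to [0, 1], then normalize each channel."""
    return [
        (value * RESCALE_FACTOR - mean) / std
        for value, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    ]

# The rescale factor is 1/255 up to floating-point precision.
rescale_is_one_over_255 = math.isclose(RESCALE_FACTOR, 1 / 255)

# Patch grid for the resized input.
grid_h = SIZE["height"] // PATCH_SIZE
grid_w = SIZE["width"] // PATCH_SIZE
num_patches = grid_h * grid_w

print(rescale_is_one_over_255)
print(preprocess_pixel([255, 255, 255]))
print(grid_h, grid_w, num_patches)
```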