AnnaZhang committed
Commit 150b510 · verified · 1 Parent(s): 45cafe9

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +92 -3
  2. config.json +286 -0
  3. model.safetensors +3 -0
  4. preprocessor_config.json +26 -0
README.md CHANGED
@@ -1,3 +1,92 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - object-detection
+ - vision
+ datasets:
+ - coco
+ pipeline_tag: object-detection
+ library_name: transformers
+ ---
+
+ # LW-DETR (Light-Weight Detection Transformer)
+
+ LW-DETR (Light-Weight DEtection TRansformer) is a real-time object detector designed to offer a better speed-accuracy trade-off than both conventional convolutional (YOLO-style) detectors and earlier transformer-based (DETR) methods. It was introduced in the paper [LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection](https://huggingface.co/papers/2406.03459) by Chen et al. and first released in this repository.
+
+ Disclaimer: This model was contributed to 🤗 Transformers by [stevenbucaille](https://huggingface.co/stevenbucaille).
+
+ ## Model description
+
+ LW-DETR is an end-to-end object detection model that pairs a plain Vision Transformer (ViT) encoder with a simple convolutional projector and a shallow DETR decoder. The core idea is to keep the strengths of the transformer architecture while applying several efficiency-focused techniques to reach real-time performance.
+
+ Key architectural details (a configuration sketch follows this list):
+ - ViT Encoder: Uses a plain ViT architecture. To reduce the quadratic cost of global self-attention, it interleaves window attention and global attention across the encoder blocks.
+ - Window-Major Organization: Feature maps are organized window-major for attention computation, which avoids the costly memory permutations otherwise needed when switching between window and global attention and lowers inference latency.
+ - Feature Aggregation: Features from intermediate and final ViT encoder layers are aggregated to form a richer input for the decoder.
+ - Projector: A C2f block (from YOLOv8) connects the encoder and decoder. The larger variants (large/xlarge) output two-scale features ($1/8$ and $1/32$) to the decoder.
+ - Shallow DETR Decoder: A computationally efficient 3-layer transformer decoder (instead of the standard 6 layers) with deformable cross-attention, for faster convergence and lower latency.
+ - Object Queries: A mixed-query selection scheme forms the object queries from learnable content queries and generated spatial queries (based on top-K features from the projector).
+
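+ The interleaved attention pattern, the shallow decoder, and the Group DETR setting are all visible in this checkpoint's `config.json` (added in this commit). Below is a minimal inspection sketch, assuming the config keys shown later in this commit map directly to config attributes and that your installed `transformers` version includes the LW-DETR classes:
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+
+ # Encoder: which of the ViT blocks use window attention (the rest use global attention)
+ print(config.backbone_config.window_block_indices)  # [0, 1, 3, 6, 7, 9]
+
+ # Decoder: 3 layers instead of DETR's usual 6, with deformable cross-attention
+ print(config.decoder_layers)  # 3
+
+ # Group DETR groups (training-time only) and number of object queries at inference
+ print(config.group_detr)   # 13
+ print(config.num_queries)  # 300
+ ```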
+ Training details:
+ - IoU-aware Classification Loss (IA-BCE): Enhances the classification branch by folding IoU information into the target score $t = s^{\alpha} u^{1-\alpha}$, where $s$ is the predicted classification score and $u$ is the IoU between the matched predicted box and its ground-truth box (a small numeric sketch follows this list).
+ - Group DETR: Uses a Group DETR strategy (13 parallel weight-sharing decoders) for faster training convergence without affecting inference speed.
+ - Pretraining: Uses a two-stage strategy: the ViT is first pretrained on Objects365 with a Masked Image Modeling method (CAEv2), then the encoder is retrained with supervision while the projector and decoder are trained on Objects365. This provides a significant performance boost (on average $\approx 5.5$ mAP).
+
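+ As an illustration of the IoU-aware target above, here is a minimal, self-contained sketch of the target computation. The helper name and the choice $\alpha = 0.25$ are illustrative assumptions, not this repository's training code:
+
+ ```python
+ import torch
+
+ def ia_bce_targets(scores: torch.Tensor, ious: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
+     # t = s^alpha * u^(1 - alpha): blend the predicted class score s with the IoU u
+     # between each matched predicted box and its ground-truth box.
+     return scores.clamp(min=1e-6) ** alpha * ious.clamp(min=1e-6) ** (1 - alpha)
+
+ scores = torch.tensor([0.9, 0.4])  # predicted class scores of matched queries
+ ious = torch.tensor([0.8, 0.7])    # IoU of each matched box with its ground truth
+ print(ia_bce_targets(scores, ious))  # targets fed to the BCE classification loss
+ ```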
+ ### How to use
+
+ You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=stevenbucaille/lw-detr) for all available LW-DETR models.
+
+ Here is how to use this model:
+
+ ```python
+ from transformers import AutoImageProcessor, LwDetrForObjectDetection
+ import torch
+ from PIL import Image
+ import requests
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ processor = AutoImageProcessor.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+ model = LwDetrForObjectDetection.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+
+ inputs = processor(images=image, return_tensors="pt")
+ outputs = model(**inputs)
+
+ # convert outputs (bounding boxes and class logits) to COCO API
+ # let's only keep detections with score > 0.7
+ target_sizes = torch.tensor([image.size[::-1]])
+ results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]
+
+ for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+     box = [round(i, 2) for i in box.tolist()]
+     print(
+         f"Detected {model.config.id2label[label.item()]} with confidence "
+         f"{round(score.item(), 3)} at location {box}"
+     )
+ ```
+ This should output:
+ ```
+ Detected cat with confidence 0.944 at location [343.19, 24.52, 640.4, 372.93]
+ Detected cat with confidence 0.937 at location [9.79, 53.67, 317.63, 472.49]
+ Detected remote with confidence 0.913 at location [40.47, 73.09, 176.19, 117.61]
+ Detected couch with confidence 0.78 at location [1.26, 1.01, 639.71, 471.57]
+ ```
+
+ Currently, both the image processor and the model support PyTorch only.
+
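+ For GPU or batched inference, the usual PyTorch idioms apply. A minimal sketch (the two example images are arbitrary placeholders):
+
+ ```python
+ import torch
+ import requests
+ from PIL import Image
+ from transformers import AutoImageProcessor, LwDetrForObjectDetection
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ processor = AutoImageProcessor.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+ model = LwDetrForObjectDetection.from_pretrained("stevenbucaille/lwdetr_small_60e_coco").to(device).eval()
+
+ urls = [
+     "http://images.cocodataset.org/val2017/000000039769.jpg",
+     "http://images.cocodataset.org/val2017/000000039769.jpg",
+ ]
+ images = [Image.open(requests.get(u, stream=True).raw) for u in urls]
+
+ with torch.inference_mode():
+     inputs = processor(images=images, return_tensors="pt").to(device)
+     outputs = model(**inputs)
+
+ # one result dict per image, in the same order as the inputs
+ target_sizes = torch.tensor([img.size[::-1] for img in images])
+ results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)
+ ```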
+ ## Training data
+
+ The LW-DETR models are trained/finetuned on the following datasets:
+ - Pretraining: Primarily conducted on [Objects365](https://www.objects365.org/overview.html), a large-scale, high-quality dataset for object detection.
+ - Finetuning: Final training is performed on the standard [COCO 2017 object detection dataset](https://cocodataset.org/#home).
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @article{chen2024lw,
+   title={LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection},
+   author={Chen, Qiang and Su, Xiangbo and Zhang, Xinyu and Wang, Jian and Chen, Jiahui and Shen, Yunpeng and Han, Chuchu and Chen, Ziliang and Xu, Weixiang and Li, Fanrong and others},
+   journal={arXiv preprint arXiv:2406.03459},
+   year={2024}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,286 @@
+ {
+   "activation_dropout": 0.0,
+   "activation_function": "silu",
+   "architectures": [
+     "LwDetrForObjectDetection"
+   ],
+   "attention_bias": true,
+   "attention_dropout": 0.0,
+   "auxiliary_loss": true,
+   "backbone": null,
+   "backbone_config": {
+     "cae_init_values": 0.1,
+     "dropout_prob": 0.0,
+     "hidden_act": "gelu",
+     "hidden_size": 192,
+     "image_size": 1024,
+     "initializer_range": 0.02,
+     "layer_norm_eps": 1e-06,
+     "mlp_ratio": 4,
+     "model_type": "lw_detr_vit",
+     "num_attention_heads": 12,
+     "num_channels": 3,
+     "num_hidden_layers": 10,
+     "num_windows": 16,
+     "num_windows_side": 4,
+     "out_features": [
+       "stage3",
+       "stage5",
+       "stage6",
+       "stage10"
+     ],
+     "out_indices": [
+       3,
+       5,
+       6,
+       10
+     ],
+     "patch_size": 16,
+     "pretrain_image_size": 224,
+     "qkv_bias": true,
+     "stage_names": [
+       "stem",
+       "stage1",
+       "stage2",
+       "stage3",
+       "stage4",
+       "stage5",
+       "stage6",
+       "stage7",
+       "stage8",
+       "stage9",
+       "stage10"
+     ],
+     "use_absolute_position_embeddings": true,
+     "window_block_indices": [
+       0,
+       1,
+       3,
+       6,
+       7,
+       9
+     ]
+   },
+   "backbone_kwargs": null,
+   "batch_norm_eps": 1e-05,
+   "bbox_cost": 5,
+   "bbox_loss_coefficient": 5,
+   "class_cost": 2,
+   "d_model": 256,
+   "decoder_activation_function": "relu",
+   "decoder_cross_attention_heads": 16,
+   "decoder_ffn_dim": 2048,
+   "decoder_layers": 3,
+   "decoder_n_points": 2,
+   "decoder_self_attention_heads": 8,
+   "dice_loss_coefficient": 1,
+   "disable_custom_kernels": true,
+   "dropout": 0.1,
+   "dtype": "float32",
+   "eos_coefficient": 0.1,
+   "focal_alpha": 0.25,
+   "giou_cost": 2,
+   "giou_loss_coefficient": 2,
+   "group_detr": 13,
+   "hidden_expansion": 0.5,
+   "id2label": {
+     "0": "N/A",
+     "1": "person",
+     "10": "traffic light",
+     "11": "fire hydrant",
+     "12": "street sign",
+     "13": "stop sign",
+     "14": "parking meter",
+     "15": "bench",
+     "16": "bird",
+     "17": "cat",
+     "18": "dog",
+     "19": "horse",
+     "2": "bicycle",
+     "20": "sheep",
+     "21": "cow",
+     "22": "elephant",
+     "23": "bear",
+     "24": "zebra",
+     "25": "giraffe",
+     "26": "hat",
+     "27": "backpack",
+     "28": "umbrella",
+     "29": "shoe",
+     "3": "car",
+     "30": "eye glasses",
+     "31": "handbag",
+     "32": "tie",
+     "33": "suitcase",
+     "34": "frisbee",
+     "35": "skis",
+     "36": "snowboard",
+     "37": "sports ball",
+     "38": "kite",
+     "39": "baseball bat",
+     "4": "motorcycle",
+     "40": "baseball glove",
+     "41": "skateboard",
+     "42": "surfboard",
+     "43": "tennis racket",
+     "44": "bottle",
+     "45": "plate",
+     "46": "wine glass",
+     "47": "cup",
+     "48": "fork",
+     "49": "knife",
+     "5": "airplane",
+     "50": "spoon",
+     "51": "bowl",
+     "52": "banana",
+     "53": "apple",
+     "54": "sandwich",
+     "55": "orange",
+     "56": "broccoli",
+     "57": "carrot",
+     "58": "hot dog",
+     "59": "pizza",
+     "6": "bus",
+     "60": "donut",
+     "61": "cake",
+     "62": "chair",
+     "63": "couch",
+     "64": "potted plant",
+     "65": "bed",
+     "66": "mirror",
+     "67": "dining table",
+     "68": "window",
+     "69": "desk",
+     "7": "train",
+     "70": "toilet",
+     "71": "door",
+     "72": "tv",
+     "73": "laptop",
+     "74": "mouse",
+     "75": "remote",
+     "76": "keyboard",
+     "77": "cell phone",
+     "78": "microwave",
+     "79": "oven",
+     "8": "truck",
+     "80": "toaster",
+     "81": "sink",
+     "82": "refrigerator",
+     "83": "blender",
+     "84": "book",
+     "85": "clock",
+     "86": "vase",
+     "87": "scissors",
+     "88": "teddy bear",
+     "89": "hair drier",
+     "9": "boat",
+     "90": "toothbrush"
+   },
+   "init_std": 0.02,
+   "label2id": {
+     "N/A": 0,
+     "airplane": 5,
+     "apple": 53,
+     "backpack": 27,
+     "banana": 52,
+     "baseball bat": 39,
+     "baseball glove": 40,
+     "bear": 23,
+     "bed": 65,
+     "bench": 15,
+     "bicycle": 2,
+     "bird": 16,
+     "blender": 83,
+     "boat": 9,
+     "book": 84,
+     "bottle": 44,
+     "bowl": 51,
+     "broccoli": 56,
+     "bus": 6,
+     "cake": 61,
+     "car": 3,
+     "carrot": 57,
+     "cat": 17,
+     "cell phone": 77,
+     "chair": 62,
+     "clock": 85,
+     "couch": 63,
+     "cow": 21,
+     "cup": 47,
+     "desk": 69,
+     "dining table": 67,
+     "dog": 18,
+     "donut": 60,
+     "door": 71,
+     "elephant": 22,
+     "eye glasses": 30,
+     "fire hydrant": 11,
+     "fork": 48,
+     "frisbee": 34,
+     "giraffe": 25,
+     "hair drier": 89,
+     "handbag": 31,
+     "hat": 26,
+     "horse": 19,
+     "hot dog": 58,
+     "keyboard": 76,
+     "kite": 38,
+     "knife": 49,
+     "laptop": 73,
+     "microwave": 78,
+     "mirror": 66,
+     "motorcycle": 4,
+     "mouse": 74,
+     "orange": 55,
+     "oven": 79,
+     "parking meter": 14,
+     "person": 1,
+     "pizza": 59,
+     "plate": 45,
+     "potted plant": 64,
+     "refrigerator": 82,
+     "remote": 75,
+     "sandwich": 54,
+     "scissors": 87,
+     "sheep": 20,
+     "shoe": 29,
+     "sink": 81,
+     "skateboard": 41,
+     "skis": 35,
+     "snowboard": 36,
+     "spoon": 50,
+     "sports ball": 37,
+     "stop sign": 13,
+     "street sign": 12,
+     "suitcase": 33,
+     "surfboard": 42,
+     "teddy bear": 88,
+     "tennis racket": 43,
+     "tie": 32,
+     "toaster": 80,
+     "toilet": 70,
+     "toothbrush": 90,
+     "traffic light": 10,
+     "train": 7,
+     "truck": 8,
+     "tv": 72,
+     "umbrella": 28,
+     "vase": 86,
+     "window": 68,
+     "wine glass": 46,
+     "zebra": 24
+   },
+   "model_type": "lw_detr",
+   "num_feature_levels": 1,
+   "num_queries": 300,
+   "projector_in_channels": [
+     256
+   ],
+   "projector_out_channels": 256,
+   "projector_scale_factors": [
+     1.0
+   ],
+   "transformers_version": "5.0.0.dev0",
+   "use_pretrained_backbone": false,
+   "use_timm_backbone": false
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1fe4c143db36025ee8e560784ae675d158e12f1efaa514c142778182285b1a5a
+ size 58296488
preprocessor_config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "do_convert_annotations": true,
+   "do_normalize": true,
+   "do_pad": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "format": "coco_detection",
+   "image_mean": [
+     0.485,
+     0.456,
+     0.406
+   ],
+   "image_processor_type": "DeformableDetrImageProcessor",
+   "image_std": [
+     0.229,
+     0.224,
+     0.225
+   ],
+   "pad_size": null,
+   "resample": 2,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 640,
+     "width": 640
+   }
+ }
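
For reference, a minimal sketch of the single-image preprocessing these settings describe: resize to 640x640 with bilinear resampling (`"resample": 2` is PIL's bilinear filter), rescale by 1/255, then normalize with the ImageNet mean and std. Batch padding (`do_pad`) is omitted here, and in practice the bundled image processor handles all of this; the function below is only illustrative:

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image) -> np.ndarray:
    # do_resize: "size" is 640 x 640, "resample": 2 corresponds to bilinear interpolation
    image = image.convert("RGB").resize((640, 640), resample=Image.BILINEAR)
    # do_rescale: "rescale_factor" is 1/255
    pixels = np.asarray(image).astype(np.float32) * (1.0 / 255.0)
    # do_normalize: "image_mean" and "image_std" are the ImageNet statistics
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    pixels = (pixels - mean) / std
    # channels-first layout expected by the model
    return pixels.transpose(2, 0, 1)
```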