AnnaZhang committed
Commit 150b510 · verified · 1 Parent(s): 45cafe9

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +92 -3
  2. config.json +286 -0
  3. model.safetensors +3 -0
  4. preprocessor_config.json +26 -0
README.md CHANGED
@@ -1,3 +1,92 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - object-detection
+ - vision
+ datasets:
+ - coco
+ pipeline_tag: object-detection
+ library_name: transformers
+ ---
+
+ # LW-DETR (Light-Weight Detection Transformer)
+
+ LW-DETR (Light-Weight DEtection TRansformer) is a real-time object detector designed to offer a better speed-accuracy trade-off than both conventional convolutional (YOLO-style) detectors and earlier transformer-based (DETR) methods. It was introduced in the paper [LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection](https://huggingface.co/papers/2406.03459) by Chen et al. and first released in this repository.
+
+ Disclaimer: This model was contributed to 🤗 Transformers by [stevenbucaille](https://huggingface.co/stevenbucaille).
+
+ ## Model description
+
+ LW-DETR is an end-to-end object detection model that pairs a plain Vision Transformer (ViT) encoder with a simple convolutional projector and a shallow DETR decoder. The core idea is to keep the strengths of the transformer architecture while applying several efficiency-focused techniques to reach real-time performance.
+
+ Key architectural details (a configuration sketch follows this list):
+ - ViT Encoder: Uses a plain ViT architecture. To reduce the quadratic cost of global self-attention, it interleaves window attention and global attention across the encoder blocks.
+ - Window-Major Organization: Feature maps are organized window-major for attention computation, which avoids the costly memory permutations otherwise needed when switching between window and global attention and lowers inference latency.
+ - Feature Aggregation: Features from intermediate and final ViT encoder layers are aggregated to form a richer input for the decoder.
+ - Projector: A C2f block (from YOLOv8) connects the encoder and decoder. The larger variants (large/xlarge) output two-scale features ($1/8$ and $1/32$) to the decoder.
+ - Shallow DETR Decoder: A computationally efficient 3-layer transformer decoder (instead of the standard 6 layers) with deformable cross-attention, for faster convergence and lower latency.
+ - Object Queries: A mixed-query selection scheme forms the object queries from learnable content queries and generated spatial queries (based on top-K features from the projector).
+
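+ The interleaved attention pattern, the shallow decoder, and the Group DETR setting are all visible in this checkpoint's `config.json` (added in this commit). Below is a minimal inspection sketch, assuming the config keys shown later in this commit map directly to config attributes and that your installed `transformers` version includes the LW-DETR classes:
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+
+ # Encoder: which of the ViT blocks use window attention (the rest use global attention)
+ print(config.backbone_config.window_block_indices)  # [0, 1, 3, 6, 7, 9]
+
+ # Decoder: 3 layers instead of DETR's usual 6, with deformable cross-attention
+ print(config.decoder_layers)  # 3
+
+ # Group DETR groups (training-time only) and number of object queries at inference
+ print(config.group_detr)   # 13
+ print(config.num_queries)  # 300
+ ```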
+ Training details:
+ - IoU-aware Classification Loss (IA-BCE): Enhances the classification branch by folding IoU information into the target score $t = s^{\alpha} u^{1-\alpha}$, where $s$ is the predicted classification score and $u$ is the IoU between the matched predicted box and its ground-truth box (a small numeric sketch follows this list).
+ - Group DETR: Uses a Group DETR strategy (13 parallel weight-sharing decoders) for faster training convergence without affecting inference speed.
+ - Pretraining: Uses a two-stage strategy: the ViT is first pretrained on Objects365 with a Masked Image Modeling method (CAEv2), then the encoder is retrained with supervision while the projector and decoder are trained on Objects365. This provides a significant performance boost (on average $\approx 5.5$ mAP).
+
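+ As an illustration of the IoU-aware target above, here is a minimal, self-contained sketch of the target computation. The helper name and the choice $\alpha = 0.25$ are illustrative assumptions, not this repository's training code:
+
+ ```python
+ import torch
+
+ def ia_bce_targets(scores: torch.Tensor, ious: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
+     # t = s^alpha * u^(1 - alpha): blend the predicted class score s with the IoU u
+     # between each matched predicted box and its ground-truth box.
+     return scores.clamp(min=1e-6) ** alpha * ious.clamp(min=1e-6) ** (1 - alpha)
+
+ scores = torch.tensor([0.9, 0.4])  # predicted class scores of matched queries
+ ious = torch.tensor([0.8, 0.7])    # IoU of each matched box with its ground truth
+ print(ia_bce_targets(scores, ious))  # targets fed to the BCE classification loss
+ ```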
+ ### How to use
+
+ You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=stevenbucaille/lw-detr) for all available LW-DETR models.
+
+ Here is how to use this model:
+
+ ```python
+ from transformers import AutoImageProcessor, LwDetrForObjectDetection
+ import torch
+ from PIL import Image
+ import requests
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ processor = AutoImageProcessor.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+ model = LwDetrForObjectDetection.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+
+ inputs = processor(images=image, return_tensors="pt")
+ outputs = model(**inputs)
+
+ # convert outputs (bounding boxes and class logits) to COCO API
+ # let's only keep detections with score > 0.7
+ target_sizes = torch.tensor([image.size[::-1]])
+ results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]
+
+ for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+     box = [round(i, 2) for i in box.tolist()]
+     print(
+         f"Detected {model.config.id2label[label.item()]} with confidence "
+         f"{round(score.item(), 3)} at location {box}"
+     )
+ ```
+ This should output:
+ ```
+ Detected cat with confidence 0.944 at location [343.19, 24.52, 640.4, 372.93]
+ Detected cat with confidence 0.937 at location [9.79, 53.67, 317.63, 472.49]
+ Detected remote with confidence 0.913 at location [40.47, 73.09, 176.19, 117.61]
+ Detected couch with confidence 0.78 at location [1.26, 1.01, 639.71, 471.57]
+ ```
+
+ Currently, both the image processor and the model support PyTorch only.
+
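+ For GPU or batched inference, the usual PyTorch idioms apply. A minimal sketch (the two example images are arbitrary placeholders):
+
+ ```python
+ import torch
+ import requests
+ from PIL import Image
+ from transformers import AutoImageProcessor, LwDetrForObjectDetection
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ processor = AutoImageProcessor.from_pretrained("stevenbucaille/lwdetr_small_60e_coco")
+ model = LwDetrForObjectDetection.from_pretrained("stevenbucaille/lwdetr_small_60e_coco").to(device).eval()
+
+ urls = [
+     "http://images.cocodataset.org/val2017/000000039769.jpg",
+     "http://images.cocodataset.org/val2017/000000039769.jpg",
+ ]
+ images = [Image.open(requests.get(u, stream=True).raw) for u in urls]
+
+ with torch.inference_mode():
+     inputs = processor(images=images, return_tensors="pt").to(device)
+     outputs = model(**inputs)
+
+ # one result dict per image, in the same order as the inputs
+ target_sizes = torch.tensor([img.size[::-1] for img in images])
+ results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)
+ ```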
+ ## Training data
+
+ The LW-DETR models are trained/finetuned on the following datasets:
+ - Pretraining: Primarily conducted on [Objects365](https://www.objects365.org/overview.html), a large-scale, high-quality dataset for object detection.
+ - Finetuning: Final training is performed on the standard [COCO 2017 object detection dataset](https://cocodataset.org/#home).
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @article{chen2024lw,
+   title={LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection},
+   author={Chen, Qiang and Su, Xiangbo and Zhang, Xinyu and Wang, Jian and Chen, Jiahui and Shen, Yunpeng and Han, Chuchu and Chen, Ziliang and Xu, Weixiang and Li, Fanrong and others},
+   journal={arXiv preprint arXiv:2406.03459},
+   year={2024}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,286 @@
+ {
+   "activation_dropout": 0.0,
+   "activation_function": "silu",
+   "architectures": [
+     "LwDetrForObjectDetection"
+   ],
+   "attention_bias": true,
+   "attention_dropout": 0.0,
+   "auxiliary_loss": true,
+   "backbone": null,
+   "backbone_config": {
+     "cae_init_values": 0.1,
+     "dropout_prob": 0.0,
+     "hidden_act": "gelu",
+     "hidden_size": 192,
+     "image_size": 1024,
+     "initializer_range": 0.02,
+     "layer_norm_eps": 1e-06,
+     "mlp_ratio": 4,
+     "model_type": "lw_detr_vit",
+     "num_attention_heads": 12,
+     "num_channels": 3,
+     "num_hidden_layers": 10,
+     "num_windows": 16,
+     "num_windows_side": 4,
+     "out_features": [
+       "stage3",
+       "stage5",
+       "stage6",
+       "stage10"
+     ],
+     "out_indices": [
+       3,
+       5,
+       6,
+       10
+     ],
+     "patch_size": 16,
+     "pretrain_image_size": 224,
+     "qkv_bias": true,
+     "stage_names": [
+       "stem",
+       "stage1",
+       "stage2",
+       "stage3",
+       "stage4",
+       "stage5",
+       "stage6",
+       "stage7",
+       "stage8",
+       "stage9",
+       "stage10"
+     ],
+     "use_absolute_position_embeddings": true,
+     "window_block_indices": [
+       0,
+       1,
+       3,
+       6,
+       7,
+       9
+     ]
+   },
+   "backbone_kwargs": null,
+   "batch_norm_eps": 1e-05,
+   "bbox_cost": 5,
+   "bbox_loss_coefficient": 5,
+   "class_cost": 2,
+   "d_model": 256,
+   "decoder_activation_function": "relu",
+   "decoder_cross_attention_heads": 16,
+   "decoder_ffn_dim": 2048,
+   "decoder_layers": 3,
+   "decoder_n_points": 2,
+   "decoder_self_attention_heads": 8,
+   "dice_loss_coefficient": 1,
+   "disable_custom_kernels": true,
+   "dropout": 0.1,
+   "dtype": "float32",
+   "eos_coefficient": 0.1,
+   "focal_alpha": 0.25,
+   "giou_cost": 2,
+   "giou_loss_coefficient": 2,
+   "group_detr": 13,
+   "hidden_expansion": 0.5,
+   "id2label": {
+     "0": "N/A",
+     "1": "person",
+     "10": "traffic light",
+     "11": "fire hydrant",
+     "12": "street sign",
+     "13": "stop sign",
+     "14": "parking meter",
+     "15": "bench",
+     "16": "bird",
+     "17": "cat",
+     "18": "dog",
+     "19": "horse",
+     "2": "bicycle",
+     "20": "sheep",
+     "21": "cow",
+     "22": "elephant",
+     "23": "bear",
+     "24": "zebra",
+     "25": "giraffe",
+     "26": "hat",
+     "27": "backpack",
+     "28": "umbrella",
+     "29": "shoe",
+     "3": "car",
+     "30": "eye glasses",
+     "31": "handbag",
+     "32": "tie",
+     "33": "suitcase",
+     "34": "frisbee",
+     "35": "skis",
+     "36": "snowboard",
+     "37": "sports ball",
+     "38": "kite",
+     "39": "baseball bat",
+     "4": "motorcycle",
+     "40": "baseball glove",
+     "41": "skateboard",
+     "42": "surfboard",
+     "43": "tennis racket",
+     "44": "bottle",
+     "45": "plate",
+     "46": "wine glass",
+     "47": "cup",
+     "48": "fork",
+     "49": "knife",
+     "5": "airplane",
+     "50": "spoon",
+     "51": "bowl",
+     "52": "banana",
+     "53": "apple",
+     "54": "sandwich",
+     "55": "orange",
+     "56": "broccoli",
+     "57": "carrot",
+     "58": "hot dog",
+     "59": "pizza",
+     "6": "bus",
+     "60": "donut",
+     "61": "cake",
+     "62": "chair",
+     "63": "couch",
+     "64": "potted plant",
+     "65": "bed",
+     "66": "mirror",
+     "67": "dining table",
+     "68": "window",
+     "69": "desk",
+     "7": "train",
+     "70": "toilet",
+     "71": "door",
+     "72": "tv",
+     "73": "laptop",
+     "74": "mouse",
+     "75": "remote",
+     "76": "keyboard",
+     "77": "cell phone",
+     "78": "microwave",
+     "79": "oven",
+     "8": "truck",
+     "80": "toaster",
+     "81": "sink",
+     "82": "refrigerator",
+     "83": "blender",
+     "84": "book",
+     "85": "clock",
+     "86": "vase",
+     "87": "scissors",
+     "88": "teddy bear",
+     "89": "hair drier",
+     "9": "boat",
+     "90": "toothbrush"
+   },
+   "init_std": 0.02,
+   "label2id": {
+     "N/A": 0,
+     "airplane": 5,
+     "apple": 53,
+     "backpack": 27,
+     "banana": 52,
+     "baseball bat": 39,
+     "baseball glove": 40,
+     "bear": 23,
+     "bed": 65,
+     "bench": 15,
+     "bicycle": 2,
+     "bird": 16,
+     "blender": 83,
+     "boat": 9,
+     "book": 84,
+     "bottle": 44,
+     "bowl": 51,
+     "broccoli": 56,
+     "bus": 6,
+     "cake": 61,
+     "car": 3,
+     "carrot": 57,
+     "cat": 17,
+     "cell phone": 77,
+     "chair": 62,
+     "clock": 85,
+     "couch": 63,
+     "cow": 21,
+     "cup": 47,
+     "desk": 69,
+     "dining table": 67,
+     "dog": 18,
+     "donut": 60,
+     "door": 71,
+     "elephant": 22,
+     "eye glasses": 30,
+     "fire hydrant": 11,
+     "fork": 48,
+     "frisbee": 34,
+     "giraffe": 25,
+     "hair drier": 89,
+     "handbag": 31,
+     "hat": 26,
+     "horse": 19,
+     "hot dog": 58,
+     "keyboard": 76,
+     "kite": 38,
+     "knife": 49,
+     "laptop": 73,
+     "microwave": 78,
+     "mirror": 66,
+     "motorcycle": 4,
+     "mouse": 74,
+     "orange": 55,
+     "oven": 79,
+     "parking meter": 14,
+     "person": 1,
+     "pizza": 59,
+     "plate": 45,
+     "potted plant": 64,
+     "refrigerator": 82,
+     "remote": 75,
+     "sandwich": 54,
+     "scissors": 87,
+     "sheep": 20,
+     "shoe": 29,
+     "sink": 81,
+     "skateboard": 41,
+     "skis": 35,
+     "snowboard": 36,
+     "spoon": 50,
+     "sports ball": 37,
+     "stop sign": 13,
+     "street sign": 12,
+     "suitcase": 33,
+     "surfboard": 42,
+     "teddy bear": 88,
+     "tennis racket": 43,
+     "tie": 32,
+     "toaster": 80,
+     "toilet": 70,
+     "toothbrush": 90,
+     "traffic light": 10,
+     "train": 7,
+     "truck": 8,
+     "tv": 72,
+     "umbrella": 28,
+     "vase": 86,
+     "window": 68,
+     "wine glass": 46,
+     "zebra": 24
+   },
+   "model_type": "lw_detr",
+   "num_feature_levels": 1,
+   "num_queries": 300,
+   "projector_in_channels": [
+     256
+   ],
+   "projector_out_channels": 256,
+   "projector_scale_factors": [
+     1.0
+   ],
+   "transformers_version": "5.0.0.dev0",
+   "use_pretrained_backbone": false,
+   "use_timm_backbone": false
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1fe4c143db36025ee8e560784ae675d158e12f1efaa514c142778182285b1a5a
+ size 58296488
preprocessor_config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "do_convert_annotations": true,
+   "do_normalize": true,
+   "do_pad": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "format": "coco_detection",
+   "image_mean": [
+     0.485,
+     0.456,
+     0.406
+   ],
+   "image_processor_type": "DeformableDetrImageProcessor",
+   "image_std": [
+     0.229,
+     0.224,
+     0.225
+   ],
+   "pad_size": null,
+   "resample": 2,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 640,
+     "width": 640
+   }
+ }
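
For reference, a minimal sketch of the single-image preprocessing these settings describe: resize to 640x640 with bilinear resampling (`"resample": 2` is PIL's bilinear filter), rescale by 1/255, then normalize with the ImageNet mean and std. Batch padding (`do_pad`) is omitted here, and in practice the bundled image processor handles all of this; the function below is only illustrative:

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image) -> np.ndarray:
    # do_resize: "size" is 640 x 640, "resample": 2 corresponds to bilinear interpolation
    image = image.convert("RGB").resize((640, 640), resample=Image.BILINEAR)
    # do_rescale: "rescale_factor" is 1/255
    pixels = np.asarray(image).astype(np.float32) * (1.0 / 255.0)
    # do_normalize: "image_mean" and "image_std" are the ImageNet statistics
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    pixels = (pixels - mean) / std
    # channels-first layout expected by the model
    return pixels.transpose(2, 0, 1)
```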