---
license: apache-2.0
tags:
- object-detection
- vision
datasets:
- coco
pipeline_tag: object-detection
library_name: transformers
---

# LW-DETR (Light-Weight Detection Transformer)

LW-DETR, a Light-Weight DEtection TRansformer, is a real-time object detector designed to outperform both conventional convolutional (YOLO-style) detectors and earlier transformer-based (DETR) methods in the speed-accuracy trade-off. It was introduced in the paper [LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection](https://huggingface.co/papers/2406.03459) by Chen et al. and first released in this repository.
Disclaimer: This model was originally contributed by [stevenbucaille](https://huggingface.co/stevenbucaille) in 🤗 transformers.

## Model description

LW-DETR is an end-to-end object detection model that uses a Vision Transformer (ViT) backbone as its encoder, a simple convolutional projector, and a shallow DETR decoder. The core philosophy is to leverage the power of transformers while implementing several efficiency-focused techniques to achieve real-time performance.

Key Architectural Details:
- ViT Encoder: Uses a plain ViT architecture. To reduce the quadratic complexity of global self-attention, it adopts interleaved window and global attentions.
- Window-Major Organization: Feature maps are organized window-major for attention computation, which avoids the costly memory permutations otherwise needed when switching between global and window attention, lowering inference latency.
- Feature Aggregation: It aggregates features from multiple levels (intermediate and final layers) of the ViT encoder to create richer input for the decoder.
- Projector: A C2f block (from YOLOv8) connects the encoder and decoder. For larger versions (large/xlarge), it outputs two-scale features ($1/8$ and $1/32$) to the decoder.
- Shallow DETR Decoder: It uses a computationally efficient 3-layer transformer decoder (instead of the standard 6 layers), incorporating deformable cross-attention for faster convergence and lower latency.
- Object Queries: It uses a mixed-query selection scheme to form the object queries from both learnable content queries and generated spatial queries (based on top-K features from the Projector).
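The cost the window-major organization avoids can be seen in the window partition/reverse round-trip that a naive row-major layout pays for on every switch between window and global attention. A minimal, self-contained sketch (function names are illustrative, not taken from the LW-DETR implementation):

```python
import torch

def window_partition(x, ws):
    # [B, H, W, C] -> [B * num_windows, ws * ws, C] non-overlapping windows
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition: back to [B, H, W, C]
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

# a 16x16 feature map with 8x8 windows -> 4 windows of 64 tokens per image
feat = torch.randn(2, 16, 16, 64)
wins = window_partition(feat, 8)
print(wins.shape)  # torch.Size([8, 64, 64])
print(torch.equal(window_reverse(wins, 8, 16, 16), feat))  # True
```

Each `permute` here forces a memory copy; keeping features window-major throughout removes these round-trips from the hot path.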

Training Details:
- IoU-aware Classification Loss (IA-BCE loss): Enhances the classification branch by incorporating IoU information into the target score $t=s^{\alpha}u^{1-\alpha}$.
- Group DETR: Uses a Group DETR strategy (13 parallel weight-sharing decoders) for faster training convergence without affecting inference speed.
- Pretraining: Uses a two-stage pretraining strategy: first, ViT is pretrained on Objects365 using a Masked Image Modeling (MIM) method (CAEv2), followed by supervised retraining of the encoder and training of the projector and decoder on Objects365. This provides a significant performance boost (average of $\approx 5.5\text{ mAP}$).
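The IA-BCE target above rewards predictions that are both confident and well localized. A minimal sketch of computing it (the helper name and the default `alpha` are illustrative, not taken from the LW-DETR implementation; see the paper for the exact loss formulation):

```python
import torch

def ia_bce_target(s, u, alpha=0.25):
    """IoU-aware target t = s**alpha * u**(1 - alpha), where s is the
    predicted classification score and u is the IoU between the predicted
    box and its matched ground-truth box; alpha balances the two terms."""
    return s.pow(alpha) * u.pow(1.0 - alpha)

s = torch.tensor([0.9, 0.9])  # identical classification scores ...
u = torch.tensor([0.9, 0.3])  # ... but different localization quality
t = ia_bce_target(s, u)
print(t)  # the poorly localized box receives a lower target score
```

The resulting `t` is then used as the soft target in a binary cross-entropy loss on the classification branch.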
  
### How to use

You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=stevenbucaille/lw-detr) to find all available LW-DETR models.

Here is how to use this model:

```python
from transformers import AutoImageProcessor, LwDetrForObjectDetection
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("stevenbucaille/lwdetr_tiny_60e_coco")
model = LwDetrForObjectDetection.from_pretrained("stevenbucaille/lwdetr_tiny_60e_coco")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# convert outputs (bounding boxes and class logits) to COCO API
# let's only keep detections with score > 0.7
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```
This should output:
```
Detected cat with confidence 0.945 at location [7.88, 55.28, 318.25, 470.77]
Detected cat with confidence 0.867 at location [340.48, 25.23, 640.26, 373.63]
Detected remote with confidence 0.815 at location [40.64, 72.26, 176.47, 118.28]
```

Currently, both the image processor and the model support PyTorch.

## Training data

The LW-DETR models are trained/finetuned on the following datasets:
- Pretraining: Primarily conducted on [Objects365](https://www.objects365.org/overview.html), a large-scale, high-quality dataset for object detection.
- Finetuning: Final training is performed on the standard [COCO 2017 object detection dataset](https://cocodataset.org/#home).

### BibTeX entry and citation info

```bibtex
@article{chen2024lw,
        title={LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection},
        author={Chen, Qiang and Su, Xiangbo and Zhang, Xinyu and Wang, Jian and Chen, Jiahui and Shen, Yunpeng and Han, Chuchu and Chen, Ziliang and Xu, Weixiang and Li, Fanrong and others},
        journal={arXiv preprint arXiv:2406.03459},
        year={2024}
    }
```