---
license: apache-2.0
tags:
- object-detection
- vision
datasets:
- coco
pipeline_tag: object-detection
library_name: transformers
---

# LW-DETR (Light-Weight Detection Transformer)

LW-DETR, a Light-Weight DEtection TRansformer, is a real-time object detector designed to outperform both conventional convolutional (YOLO-style) detectors and earlier transformer-based (DETR) methods in the speed-accuracy trade-off. It was introduced in the paper [LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection](https://huggingface.co/papers/2406.03459) by Chen et al. and first released in this repository.

Disclaimer: This model was contributed to 🤗 Transformers by [stevenbucaille](https://huggingface.co/stevenbucaille).

## Model description

LW-DETR is an end-to-end object detection model that uses a Vision Transformer (ViT) backbone as its encoder, a simple convolutional projector, and a shallow DETR decoder. The core idea is to keep the strengths of the transformer architecture while applying several efficiency-focused techniques to reach real-time performance.

Key Architectural Details:
- ViT Encoder: Uses a plain ViT backbone. To reduce the quadratic complexity of global self-attention, it interleaves window attention and global attention across the encoder layers.
- Window-Major Organization: The feature map is kept in a window-major layout for attention computation, which removes most of the costly memory permutations otherwise needed when switching between window and global attention, lowering inference latency (see the sketch after this list).
- Feature Aggregation: Features from multiple encoder levels (intermediate and final layers) are aggregated to give the decoder richer input.
- Projector: A C2f block (from YOLOv8) connects the encoder and decoder. The large and xlarge variants output two feature scales ($1/8$ and $1/32$) to the decoder.
- Shallow DETR Decoder: A computationally efficient 3-layer transformer decoder (instead of the standard 6 layers) with deformable cross-attention, for faster convergence and lower latency.
- Object Queries: A mixed-query selection scheme forms the object queries from learnable content queries and spatial queries generated from the top-K Projector features.
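
To make the window-attention side of this concrete, here is a rough, generic window-partition sketch of the kind used by plain-ViT detectors. It is not LW-DETR's actual window-major implementation (window size, layer indices, and memory layout in the released checkpoints may differ):

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows,
    returning a (B * num_windows, window_size * window_size, C) tensor."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, window_size * window_size, C)

# Example: a 1/16-scale feature map of a 512x512 image with 256 channels.
feat = torch.randn(1, 32, 32, 256)
windows = window_partition(feat, window_size=16)
print(windows.shape)  # torch.Size([4, 256, 256])
```

The window-major scheme keeps features grouped in this window order, so alternating between window and global attention layers does not require repeating this kind of permutation.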

Training Details:
- IoU-aware Classification Loss (IA-BCE): Enhances the classification branch by folding IoU information into the classification target score, $t = s^{\alpha} u^{1-\alpha}$, where $s$ is the predicted classification score and $u$ is the IoU between the predicted and matched ground-truth boxes (see the sketch after this list).
- Group DETR: Uses a Group DETR training strategy (13 parallel weight-sharing decoders) for faster convergence without affecting inference speed.
- Pretraining: Uses a two-stage strategy: first, the ViT is pretrained on Objects365 with a Masked Image Modeling (MIM) method (CAEv2); then the encoder is retrained, and the projector and decoder are trained, in a supervised manner on Objects365. This pretraining provides a significant performance boost (about 5.5 mAP on average).
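
To make the target-score formula concrete, the following sketch computes $t = s^{\alpha} u^{1-\alpha}$. The value of $\alpha$ used here is an assumption chosen for illustration and may differ from the one used to train the released checkpoints:

```python
import torch

def ia_bce_target(scores: torch.Tensor, ious: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
    """IoU-aware soft classification target t = s**alpha * u**(1 - alpha).

    scores: predicted classification scores s in [0, 1]
    ious:   IoUs u between predicted and matched ground-truth boxes
    alpha:  blending exponent (0.25 is an assumed value, for illustration only)
    """
    return scores.pow(alpha) * ious.pow(1.0 - alpha)

s = torch.tensor([0.9, 0.6, 0.3])  # predicted scores
u = torch.tensor([0.8, 0.5, 0.9])  # IoUs of the matched boxes
print(ia_bce_target(s, u))         # soft targets fed to the BCE loss
```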

### How to use

You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=AnnaZhang/lw-detr) to look for all available LW-DETR models.

Here is how to use this model:

```python
from transformers import AutoImageProcessor, LwDetrForObjectDetection
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("AnnaZhang/lwdetr_xlarge_60e_coco")
model = LwDetrForObjectDetection.from_pretrained("AnnaZhang/lwdetr_xlarge_60e_coco")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# convert outputs (bounding boxes and class logits) to COCO API format
# and keep only detections with score > 0.7
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```
This should output:
```
Detected cat with confidence 0.945 at location [7.89, 54.09, 314.97, 472.48]
Detected cat with confidence 0.94 at location [345.47, 22.71, 640.24, 372.13]
Detected couch with confidence 0.912 at location [1.56, 1.29, 639.97, 474.73]
Detected remote with confidence 0.893 at location [39.96, 73.65, 175.91, 117.11]
Detected remote with confidence 0.783 at location [333.79, 77.38, 370.37, 186.66]
```
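
Alternatively, you can run inference through the high-level `pipeline` API. This is the generic 🤗 Transformers object-detection pipeline pattern; it assumes this checkpoint is registered for the `object-detection` pipeline task:

```python
from transformers import pipeline

# Generic object-detection pipeline usage; assumes the checkpoint supports this task.
detector = pipeline("object-detection", model="AnnaZhang/lwdetr_xlarge_60e_coco")

predictions = detector("http://images.cocodataset.org/val2017/000000039769.jpg", threshold=0.7)
for pred in predictions:
    print(pred["label"], round(pred["score"], 3), pred["box"])
```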

Currently, both the image processor and the model support PyTorch only.

## Training data

The LW-DETR models are pretrained and finetuned on the following datasets:
- Pretraining: primarily on [Objects365](https://www.objects365.org/overview.html), a large-scale, high-quality object detection dataset.
- Finetuning: on the standard [COCO 2017 object detection dataset](https://cocodataset.org/#home).

### BibTeX entry and citation info

```bibtex
@article{chen2024lw,
  title={LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection},
  author={Chen, Qiang and Su, Xiangbo and Zhang, Xinyu and Wang, Jian and Chen, Jiahui and Shen, Yunpeng and Han, Chuchu and Chen, Ziliang and Xu, Weixiang and Li, Fanrong and others},
  journal={arXiv preprint arXiv:2406.03459},
  year={2024}
}
```