---
license: apache-2.0
tags:
- object-detection
- computer-vision
- detr
- diffusion
- pytorch
datasets:
- coco
- lvis
pipeline_tag: object-detection
---

<div align="center">
<img src="https://raw.githubusercontent.com/MBadran2000/DiffuDETR/main/docs/logo.png" width="50%">

[![Paper](https://img.shields.io/badge/📄_Paper-ICLR_2026-4F46E5?style=for-the-badge)](https://iclr.cc/virtual/2026/poster/10007459)
[![GitHub](https://img.shields.io/badge/💻_Code-GitHub-black?style=for-the-badge)](https://github.com/MBadran2000/DiffuDETR)
[![Project Page](https://img.shields.io/badge/🌐_Project-Page-0F766E?style=for-the-badge)](https://mbadran2000.github.io/DiffuDETR/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue?style=for-the-badge)](https://github.com/MBadran2000/DiffuDETR/blob/main/LICENSE)

# DiffuDETR: Rethinking Detection Transformers with Denoising Diffusion Process

### [ICLR 2026](https://iclr.cc/virtual/2026/poster/10007459)

**[Youssef Nawar](https://scholar.google.com/citations?hl=en&user=HQWsM2gAAAAJ)\*&nbsp;&nbsp;&nbsp;[Mohamed Badran](https://scholar.google.com/citations?hl=en&user=HkQmlHoAAAAJ)\*&nbsp;&nbsp;&nbsp;[Marwan Torki](https://scholar.google.com/citations?hl=en&user=aYLNZT4AAAAJ)**

Alexandria University &nbsp;·&nbsp; Technical University of Munich &nbsp;·&nbsp; Applied Innovation Center

<sub>* Equal Contribution</sub>

<br>

<img src="https://raw.githubusercontent.com/MBadran2000/DiffuDETR/main/docs/figures/framework.png" alt="DiffuDETR Framework" width="100%"/>

<p align="center"><em>DiffuDETR reformulates object detection as a <strong>conditional query generation task</strong> using denoising diffusion, achieving state-of-the-art results on COCO, LVIS, and V3Det.</em></p>

</div>

---

## 🔥 Highlights

<table>
<tr>
<td align="center"><h2>51.9</h2><sub>mAP on COCO</sub><br><code>+1.0 over DINO</code></td>
<td align="center"><h2>28.9</h2><sub>AP on LVIS</sub><br><code>+2.4 over DINO</code></td>
<td align="center"><h2>50.3</h2><sub>AP on V3Det</sub><br><code>+8.3 over DINO</code></td>
<td align="center"><h2>3×</h2><sub>Decoder Passes</sub><br><code>Only ~17% Extra FLOPs</code></td>
</tr>
</table>

- 🎯 **Diffusion-Based Query Generation** — Reformulates object detection in DETR as a denoising diffusion process, progressively denoising queries' reference points from Gaussian noise to precise object locations
- 🏗️ **Two Powerful Variants** — DiffuDETR (built on Deformable DETR) and DiffuDINO (built on DINO with contrastive denoising queries), demonstrating the generality of our approach
- ⚡ **Efficient Inference** — Only the lightweight decoder runs multiple times; the backbone and encoder execute once, adding just ~17% extra FLOPs with 3 decoder passes
- 📊 **Consistent Gains Across Benchmarks** — Improvements on COCO 2017, LVIS, and V3Det across multiple backbones (ResNet-50, ResNet-101, Swin-B) with high multi-seed stability (±0.2 AP)

---

## 📥 Model Weights

> **Note for Hugging Face Users:** The pre-trained model weights (`.pth` files) for DiffuDETR and DiffuDINO are available in the [checkpoints folder](https://huggingface.co/MBadran/DiffuDETR/tree/main/checkpoints); download them to evaluate the models or fine-tune them on your custom datasets. Please visit our [GitHub repository](https://github.com/MBadran2000/DiffuDETR) for complete documentation on the architecture and data preparation.

---

## 📋 Abstract

We present **DiffuDETR**, a novel approach that formulates object detection as a **conditional object query generation task**, conditioned on the image and a set of noisy reference points. We integrate DETR-based models with denoising diffusion training to generate object queries' reference points from a prior Gaussian distribution. We propose two variants: **DiffuDETR**, built on top of the Deformable DETR decoder, and **DiffuDINO**, based on DINO's decoder with contrastive denoising queries. To improve inference efficiency, we further introduce a lightweight sampling scheme in which only the decoder runs multiple forward passes.

Our method demonstrates consistent improvements across multiple backbones and datasets, including **COCO 2017**, **LVIS**, and **V3Det**, with both variants surpassing their respective baselines and showing notable gains in complex and crowded scenes.

---

## 🏛️ Method

<div align="center">
<img src="https://raw.githubusercontent.com/MBadran2000/DiffuDETR/main/docs/figures/decoder-arch.png" alt="Decoder Architecture" width="85%"/>
<br>
<sub><b>Decoder Architecture</b> — Timestep embeddings are injected after self-attention, followed by multi-scale deformable cross-attention with noisy reference points attending to encoded image features.</sub>
</div>
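The timestep injection described above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's actual module: the class and function names are hypothetical, and we stand in a plain multi-head self-attention plus FFN for the full deformable decoder layer.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_timestep_embedding(t, dim):
    # Standard DDPM-style sinusoidal embedding of integer timesteps (shape: [B, dim])
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimestepConditionedBlock(nn.Module):
    """Illustrative decoder block: a projected timestep embedding is added
    to the queries after self-attention, before cross-attention."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t_proj = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                                    nn.Linear(d_model, d_model))
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, queries, t):
        # Self-attention over the object queries
        q, _ = self.self_attn(queries, queries, queries)
        # Inject the timestep embedding (broadcast over the query dimension)
        q = q + self.t_proj(sinusoidal_timestep_embedding(t, q.size(-1)))[:, None, :]
        # The real model applies multi-scale deformable cross-attention here,
        # with the noisy reference points attending to encoder features
        return self.ffn(q)
```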

<br>

### How It Works

| Step | Description |
|:---:|:---|
| **Feature Extraction** | A backbone (ResNet / Swin) + transformer encoder extracts multi-scale image features |
| **Forward Diffusion** *(training)* | Ground-truth box coordinates are corrupted with Gaussian noise at a random timestep $t \sim U(0, 100)$ via a cosine noise schedule |
| **Reverse Denoising** *(inference)* | Reference points start as pure Gaussian noise and are iteratively denoised using DDIM sampling with only **3 decoder forward passes** |
| **Timestep Conditioning** | The decoder integrates timestep embeddings after self-attention: $q_n = \text{FFN}(\text{MSDA}(\text{SA}(q_{n-1}) + t), r_t, O_{\text{enc}})$ |
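
The forward corruption and the DDIM-style reverse update can be sketched as below. This is a sketch under assumptions, not the released training code: the cosine-schedule constant `s=0.008` follows the standard recipe, the 3-step timestep sub-sequence is illustrative, and a simple clamp stands in for the decoder's prediction of clean reference points.

```python
import math
import torch

T = 100  # diffusion steps, matching t ~ U(0, 100)

def alpha_bar(t, s=0.008):
    # Cumulative signal level ᾱ_t of the cosine noise schedule
    f = math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0

def forward_diffuse(ref_points, t):
    # Training: corrupt normalized reference points with Gaussian noise at step t
    ab = alpha_bar(t)
    noise = torch.randn_like(ref_points)
    return math.sqrt(ab) * ref_points + math.sqrt(1 - ab) * noise

def ddim_step(x_t, x0_pred, t, t_prev):
    # Deterministic DDIM update: recover the implied noise, then
    # re-noise the predicted clean points down to level t_prev
    ab_t, ab_prev = alpha_bar(t), alpha_bar(t_prev)
    eps = (x_t - math.sqrt(ab_t) * x0_pred) / math.sqrt(1 - ab_t)
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1 - ab_prev) * eps

# Inference: 3 decoder passes walk an (illustrative) sub-sequence t = 100 -> 66 -> 33 -> 0
x = torch.randn(300, 4)           # reference points start as pure Gaussian noise
for t, t_prev in [(100, 66), (66, 33), (33, 0)]:
    x0_pred = x.clamp(-1, 1)      # stand-in for one decoder forward pass
    x = ddim_step(x, x0_pred, t, t_prev)
```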

---

## 📊 Main Results

### COCO 2017 val — Object Detection

| Model | Backbone | Epochs | AP | AP₅₀ | AP₇₅ | APₛ | APₘ | APₗ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Pix2Seq | R50 | 300 | 43.2 | 61.0 | 46.1 | 26.6 | 47.0 | 58.6 |
| DiffusionDet | R50 | — | 46.8 | 65.3 | 51.8 | 29.6 | 49.3 | 62.2 |
| Deformable DETR | R50 | 50 | 48.2 | 67.0 | 52.2 | 30.7 | 51.4 | 63.0 |
| Align-DETR | R50 | 24 | 51.4 | 69.1 | 55.8 | 35.5 | 54.6 | 65.7 |
| DINO | R50 | 36 | 50.9 | 69.0 | 55.3 | 34.6 | 54.1 | 64.6 |
| **DiffuDETR (Ours)** | R50 | 50 | **50.2** *(+2.0)* | 66.8 | 55.2 | 33.3 | 53.9 | 65.8 |
| **DiffuAlignDETR (Ours)** | R50 | 24 | **51.9** *(+0.5)* | 69.2 | 56.4 | 34.9 | 55.6 | 66.2 |
| **DiffuDINO (Ours)** | R50 | 50 | **51.9** *(+1.0)* | 69.4 | 55.7 | 35.8 | 55.7 | 67.1 |
| Pix2Seq | R101 | 300 | 44.5 | 62.8 | 47.5 | 26.0 | 48.2 | 60.3 |
| DiffusionDet | R101 | — | 47.5 | 65.7 | 52.0 | 30.8 | 50.4 | 63.1 |
| Align-DETR | R101 | 12 | 51.2 | 68.8 | 55.7 | 32.9 | 55.1 | 66.6 |
| DINO | R101 | 12 | 50.0 | 67.7 | 54.4 | 32.2 | 53.4 | 64.3 |
| **DiffuAlignDETR (Ours)** | R101 | 12 | **51.7** *(+0.5)* | 69.3 | 56.1 | 34.0 | 55.6 | 67.0 |
| **DiffuDINO (Ours)** | R101 | 12 | **51.2** *(+1.2)* | 68.6 | 55.8 | 33.2 | 55.6 | 67.2 |

### LVIS val — Large Vocabulary Detection

| Model | Backbone | AP | AP₅₀ | APr | APc | APf |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| DINO | R50 | 26.5 | 35.9 | 9.2 | 24.6 | 36.2 |
| **DiffuDINO (Ours)** | R50 | **28.9** *(+2.4)* | 38.5 | **13.7** *(+4.5)* | 27.6 | 36.9 |
| DINO | R101 | 30.9 | 40.4 | 13.9 | 29.7 | 39.7 |
| **DiffuDINO (Ours)** | R101 | **32.5** *(+1.6)* | 42.4 | 13.5 | 32.0 | 41.5 |

### V3Det val — Vast Vocabulary Detection (13,204 categories)

| Model | Backbone | AP | AP₅₀ | AP₇₅ |
|:---|:---:|:---:|:---:|:---:|
| DINO | R50 | 33.5 | 37.7 | 35.0 |
| **DiffuDINO (Ours)** | R50 | **35.7** *(+2.2)* | 41.4 | 37.7 |
| DINO | Swin-B | 42.0 | 46.8 | 43.9 |
| **DiffuDINO (Ours)** | Swin-B | **50.3** *(+8.3)* | 56.6 | 52.9 |

---

## 📈 Convergence & Qualitative Results

<div align="center">
<img src="https://raw.githubusercontent.com/MBadran2000/DiffuDETR/main/docs/figures/Converage-Comparsion.png" alt="Convergence Comparison" width="80%"/>
<br>
<sub><b>Training Convergence</b> — COCO val2017 AP (%) vs. training epochs. DiffuDINO converges to the highest AP, surpassing all baseline methods.</sub>
</div>

<br>

<div align="center">
<img src="https://raw.githubusercontent.com/MBadran2000/DiffuDETR/main/docs/figures/comparsion-withBaseline.png" alt="Qualitative Comparison" width="95%"/>
<br>
<sub><b>Qualitative Comparison</b> — Deformable DETR vs. DiffuDETR and DINO vs. DiffuDINO on COCO 2017 val. Our models produce more accurate and complete detections, especially in crowded scenes.</sub>
</div>

---

## 🔬 Ablation Studies

> All ablations on COCO 2017 val with DiffuDINO (R50 backbone).

| Ablation | Setting | AP |
|:---|:---|:---:|
| **Noise Distribution** | Gaussian *(best)* | **51.9** |
| | Sigmoid | 50.4 |
| | Beta | 49.5 |
| **Noise Scheduler** | Cosine *(best)* | **51.9** |
| | Linear | 51.6 |
| | Sqrt | 51.4 |
| **Decoder Evaluations** | 1 eval | 51.6 |
| | **3 evals** *(best)* | **51.9** |
| | 5 evals | 51.8 |
| | 10 evals | 51.4 |
| **FLOPs** | 1 eval → 244.5G | — |
| | 3 evals → 285.2G *(+17%)* | — |
| | 5 evals → 326.0G | — |
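
The quoted overhead follows directly from the FLOPs rows above; a quick arithmetic check:

```python
# GFLOPs per decoder-evaluation setting, from the ablation table
flops = {1: 244.5, 3: 285.2, 5: 326.0}
# Relative overhead (%) versus a single decoder pass
overhead = {k: round((v / flops[1] - 1) * 100, 1) for k, v in flops.items()}
# 3 evals add ~16.6% (rounded to ~17% in the text); 5 evals add ~33.3%
```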

> 🛡️ **Multi-Seed Robustness:** Across 5 random seeds, the standard deviation remains **below ±0.2 AP** in all settings.

---

## 🛠️ Installation

DiffuDETR is built on top of [detrex](https://github.com/IDEA-Research/detrex) and [detectron2](https://github.com/facebookresearch/detectron2). For a complete local setup, follow the steps below.

### Prerequisites

- Linux with Python ≥ 3.11
- PyTorch ≥ 2.3.1 and the corresponding torchvision
- CUDA 12.x

### Step-by-Step Setup

```bash
# 1. Create and activate a conda environment
conda create -n diffudetr python=3.11 -y
conda activate diffudetr

# 2. Install PyTorch
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121

# 3. Clone detrex and fetch its submodules
git clone https://github.com/IDEA-Research/detrex.git
cd detrex
git submodule init
git submodule update

# 4. Install detectron2 (bundled as a submodule)
python -m pip install -e detectron2 --no-build-isolation

# 5. Install detrex
pip install -e . --no-build-isolation

# 6. Fix setuptools compatibility
pip uninstall setuptools -y
pip install "setuptools<81"

# 7. Install additional dependencies
pip install pytorch_metric_learning lvis

# 8. Add DiffuDETR to PYTHONPATH
export PYTHONPATH="/path/to/DiffuDETR/:$PYTHONPATH"

# 9. Set the dataset path
export DETECTRON2_DATASETS=/path/to/datasets/
```

---

## 🚀 Usage

Point the commands below at a checkpoint downloaded from this Hugging Face repository by replacing `/path/to/checkpoint.pth` with your local path.

### Evaluation

```bash
python /path/to/detrex/tools/train_net.py \
    --num-gpus 2 \
    --eval-only \
    --config-file projects/diffu_dino/configs/dino-resnet/coco-r50-4scales-50ep.py \
    train.init_checkpoint=/path/to/checkpoint.pth
```

### Training

```bash
# DiffuDINO with ResNet-50 on COCO
python /path/to/detrex/tools/train_net.py \
    --num-gpus 2 \
    --config-file projects/diffu_dino/configs/dino-resnet/coco-r50-4scales-50ep.py

# DiffuDINO with ResNet-101 on COCO
python /path/to/detrex/tools/train_net.py \
    --num-gpus 2 \
    --config-file projects/diffu_dino/configs/dino-resnet/coco-r101-4scales-12ep.py

# DiffuDINO on V3Det
python /path/to/detrex/tools/train_net.py \
    --num-gpus 2 \
    --config-file projects/diffu_dino/configs/dino-resnet/v3det-r50-4scales-24ep.py

# DiffuAlignDETR on COCO
python /path/to/detrex/tools/train_net.py \
    --num-gpus 2 \
    --config-file projects/diffu_align_detr/configs/coco-r50-4scales-24ep.py
```

---

## 📝 Citation

If you find DiffuDETR useful in your research, please consider citing our paper:

```bibtex
@inproceedings{nawar2026diffudetr,
  title     = {DiffuDETR: Rethinking Detection Transformers with Denoising Diffusion Process},
  author    = {Nawar, Youssef and Badran, Mohamed and Torki, Marwan},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```

---

## 🙏 Acknowledgements

This project is built upon the following open-source works:

- [detrex](https://github.com/IDEA-Research/detrex) — Benchmarking Detection Transformers
- [detectron2](https://github.com/facebookresearch/detectron2) — Facebook AI Research's detection library
- [DINO](https://github.com/IDEA-Research/DINO) — DETR with Improved DeNoising Anchor Boxes
- [AlignDETR](https://github.com/FelixCaae/AlignDETR) — Improving DETR with IoU-Aware BCE Loss
- [DiffusionDet](https://github.com/ShoufaChen/DiffusionDet) — Diffusion Model for Object Detection

---

## 📄 License

This project is released under the [Apache 2.0 License](LICENSE).