---
library_name: keras-hub
---
### Model Overview
# Model Summary

D-FINE is a family of lightweight, real-time object detection models built on the DETR (DEtection TRansformer) architecture. It achieves high localization precision by redefining the bounding box regression task: instead of predicting fixed coordinates, it iteratively refines probability distributions over box edges. Pretrained on large image datasets, D-FINE identifies and localizes objects with high accuracy and speed, and its balance of performance and computational efficiency makes it suitable for both research and deployment in real-time applications.

Key Features:

  * Transformer-based Architecture: A modern, efficient design based on the DETR framework for direct set prediction of objects.
  * Open Source Code: Code is publicly available, promoting accessibility and innovation.
  * Strong Performance: Achieves state-of-the-art results on object detection benchmarks like COCO for its size.
  * Multiple Sizes: Comes in various sizes (e.g., Nano, Small, Large, X-Large) to fit different hardware capabilities.
  * Advanced Bounding Box Refinement: Instead of predicting fixed coordinates, it iteratively refines probability distributions for precise object localization using Fine-grained Distribution Refinement (FDR).
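
The intuition behind FDR can be shown with a minimal NumPy sketch: rather than regressing a single offset per box edge, the head predicts logits over a set of discrete bins, and the refined offset is the expectation of the resulting distribution. The bin layout and the `reg_max`/`reg_scale` values below are simplified assumptions for illustration, not the exact D-FINE configuration.

```python
import numpy as np

def expected_edge_offset(logits, reg_max=32, reg_scale=4.0):
    """Turn per-edge bin logits into a single refined offset.

    `logits` has shape (reg_max + 1,): one score per discrete bin.
    The offset is the probability-weighted average of the bin values.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # softmax over bins
    bin_values = np.linspace(-reg_scale, reg_scale, reg_max + 1)
    return float((probs * bin_values).sum())

# Uniform logits put equal mass on symmetric bins, so the offset is ~0;
# sharply peaked logits push the offset toward the corresponding bin.
print(expected_edge_offset(np.zeros(33)))
```

Because each refinement step reshapes a distribution rather than overwriting a point estimate, later decoder layers can make fine-grained corrections to earlier predictions.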

Training Strategies:

D-FINE is pre-trained on large and diverse datasets like COCO and Objects365. The training process utilizes Global Optimal Localization Self-Distillation (GO-LSD), a bidirectional optimization strategy that transfers localization knowledge from refined distributions in deeper layers to shallower layers. This accelerates convergence and improves the overall performance of the model.

Weights are released under the [Apache 2.0 License](https://github.com/Peterande/D-FINE/blob/main/LICENSE).

## Links

  * [D-FINE Quickstart Notebook](https://www.kaggle.com/code/harshaljanjani/d-fine-quickstart-notebook)
  * [D-FINE API Documentation](https://keras.io/keras_hub/api/models/d_fine/)
  * [D-FINE Paper (arXiv)](https://arxiv.org/abs/2410.13842)
  * [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
  * [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```shell
pip install -U -q keras-hub
pip install -U -q keras
```

JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.

## Available D-FINE Presets
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
| Preset |  Parameters | Description |
|--------|------------|-------------|
| dfine_nano_coco | 3.79M | D-FINE Nano model, the smallest variant in the family, pretrained on the COCO dataset. Ideal for applications where computational resources are limited. |
| dfine_small_coco | 10.33M | D-FINE Small model pretrained on the COCO dataset. Offers a balance between performance and computational efficiency. |
| dfine_medium_coco | 19.62M | D-FINE Medium model pretrained on the COCO dataset. A solid baseline with strong performance for general-purpose object detection. |
| dfine_large_coco | 31.34M | D-FINE Large model pretrained on the COCO dataset. Provides high accuracy and is suitable for more demanding tasks. |
| dfine_xlarge_coco | 62.83M | D-FINE X-Large model, the largest COCO-pretrained variant, designed for state-of-the-art performance where accuracy is the top priority. |
| dfine_small_obj365 | 10.62M | D-FINE Small model pretrained on the large-scale Objects365 dataset, enhancing its ability to recognize a wider variety of objects. |
| dfine_medium_obj365 | 19.99M | D-FINE Medium model pretrained on the Objects365 dataset. Benefits from a larger and more diverse pretraining corpus. |
| dfine_large_obj365 | 31.86M | D-FINE Large model pretrained on the Objects365 dataset for improved generalization and performance on diverse object categories. |
| dfine_xlarge_obj365 | 63.35M | D-FINE X-Large model pretrained on the Objects365 dataset, offering maximum performance by leveraging a vast number of object categories during pretraining. |
| dfine_small_obj2coco | 10.33M | D-FINE Small model first pretrained on Objects365 and then fine-tuned on COCO, combining broad feature learning with benchmark-specific adaptation. |
| dfine_medium_obj2coco | 19.62M | D-FINE Medium model using a two-stage training process: pretraining on Objects365 followed by fine-tuning on COCO. |
| dfine_large_obj2coco_e25 | 31.34M | D-FINE Large model pretrained on Objects365 and then fine-tuned on COCO for 25 epochs. A high-performance model with specialized tuning. |
| dfine_xlarge_obj2coco | 62.83M | D-FINE X-Large model, pretrained on Objects365 and fine-tuned on COCO, representing the most powerful model in this series for COCO-style tasks. |

## Example Usage
### Imports
```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```

### Load a Pretrained Model
Use `from_preset()` to load a D-FINE model with pretrained weights.
```python
object_detector = DFineObjectDetector.from_preset(
    "dfine_xlarge_coco"
)
```

### Make a Prediction
Call `predict()` on a batch of images. The images will be automatically preprocessed.
```python
# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")

# Make predictions.
predictions = object_detector.predict(image)

# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
```
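
Since the prediction outputs are padded to a fixed number of candidates per image, a small post-processing helper is handy for pulling out only the valid, confident detections. The sketch below assumes the output dictionary has exactly the keys printed above (`boxes`, `labels`, `confidence`, `num_detections`); the threshold and indexing conventions are illustrative assumptions, not part of the KerasHub API.

```python
import numpy as np

def extract_detections(predictions, image_index=0, score_threshold=0.5):
    # Number of valid (non-padded) detections for this image.
    n = int(predictions["num_detections"][image_index])
    boxes = np.asarray(predictions["boxes"])[image_index][:n]
    labels = np.asarray(predictions["labels"])[image_index][:n]
    scores = np.asarray(predictions["confidence"])[image_index][:n]
    # Keep only detections whose confidence clears the threshold.
    keep = scores >= score_threshold
    return boxes[keep], labels[keep], scores[keep]
```

For example, `extract_detections(predictions, score_threshold=0.3)` returns only the boxes, labels, and scores for confident detections in the first image of the batch.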

### Fine-Tune a Pre-trained Model
You can load a pretrained backbone and attach a new detection head for a different number of classes.
```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
    "dfine_xlarge_coco"
)

# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10  # Example: fine-tuning on a new dataset with 10 classes
)

# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```

### Create a Model From Scratch
You can also build a D-FINE detector by first creating its components, such as the underlying `HGNetV2Backbone`.
```python
# 1. Define a base backbone for feature extraction.
hgnetv2_backbone = HGNetV2Backbone(
    stem_channels=[3, 16, 16],
    stackwise_stage_filters=[
        [16, 16, 64, 1, 3, 3],
        [64, 32, 256, 1, 3, 3],
        [256, 64, 512, 2, 3, 5],
        [512, 128, 1024, 1, 3, 5],
    ],
    apply_downsample=[False, True, True, True],
    use_lightweight_conv_block=[False, False, True, True],
    depths=[1, 1, 2, 1],
    hidden_sizes=[64, 256, 512, 1024],
    embedding_size=16,
    image_shape=(256, 256, 3),
    out_features=["stage3", "stage4"],
)

# 2. Create the D-FINE backbone, which includes the hybrid encoder and decoder.
d_fine_backbone = DFineBackbone(
    backbone=hgnetv2_backbone,
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=0, # Denoising is off
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# 3. Create the final object detector model.
object_detector_scratch = DFineObjectDetector(
    backbone=d_fine_backbone,
    num_classes=80,
    bounding_box_format="yxyx",
)
```

### Train the Model
Call `fit()` on a batch of images and ground truth bounding boxes. The `compute_loss` method from the detector handles the complex loss calculations.
```python
# Prepare sample training data.
images = np.random.uniform(
    low=0, high=255, size=(2, 256, 256, 3)
).astype("float32")
bounding_boxes = {
    "boxes": [
        np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.8, 0.8]], dtype="float32"),
        np.array([[0.2, 0.2, 0.4, 0.4]], dtype="float32"),
    ],
    "labels": [
        np.array([1, 10], dtype="int32"),
        np.array([20], dtype="int32"),
    ],
}

# Compile the model with the built-in loss function.
object_detector_scratch.compile(
    optimizer="adam",
    loss=object_detector_scratch.compute_loss,
)

# Train the model.
object_detector_scratch.fit(x=images, y=bounding_boxes, epochs=1)
```
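
The ground truth boxes above are already in normalized coordinates. If your annotations are in pixel space, a small conversion helper keeps them consistent with the image size. The sketch below assumes `[y_min, x_min, y_max, x_max]` ordering to match the `"yxyx"` format the detector was built with; adjust the scale vector if your labels use a different ordering.

```python
import numpy as np

def normalize_boxes(boxes_pixels, image_height, image_width):
    """Scale pixel-space [y_min, x_min, y_max, x_max] boxes into [0, 1]."""
    boxes_pixels = np.asarray(boxes_pixels, dtype="float32")
    scale = np.array(
        [image_height, image_width, image_height, image_width], dtype="float32"
    )
    return boxes_pixels / scale

# e.g. a (64, 32, 192, 96) box in a 256x256 image
# becomes (0.25, 0.125, 0.75, 0.375).
```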

### Train with Contrastive Denoising
To enable contrastive denoising for training, provide ground truth `labels` when initializing the `DFineBackbone`.
```python
# Sample ground truth labels for initializing the denoising generator.
labels_for_denoising = [
    {
        "boxes": np.array([[0.5, 0.5, 0.2, 0.2]]), "labels": np.array([1])
    },
    {
        "boxes": np.array([[0.6, 0.6, 0.3, 0.3]]), "labels": np.array([2])
    },
]

# Create a D-FINE backbone with denoising enabled.
d_fine_backbone_denoising = DFineBackbone(
    backbone=hgnetv2_backbone, # Using the hgnetv2_backbone from before
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=100,  # Number of denoising queries
    label_noise_ratio=0.5,
    box_noise_scale=1.0,
    labels=labels_for_denoising, # Provide labels at initialization
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# Create the final detector.
object_detector_denoising = DFineObjectDetector(
    backbone=d_fine_backbone_denoising,
    num_classes=80
)

# This model can now be compiled and trained as shown in the previous example.
```
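
Under the hood, contrastive denoising trains the decoder to recover clean boxes from noised copies of the ground truth. The sketch below illustrates what a parameter like `box_noise_scale` controls, using a deliberately simplified noising scheme (uniform jitter proportional to box size); the exact noise distribution and positive/negative query construction in D-FINE differ, so treat this purely as intuition.

```python
import numpy as np

def jitter_boxes(boxes_cxcywh, box_noise_scale=1.0, rng=None):
    """Add noise to center-format boxes, proportional to their size.

    A larger scale produces harder noised queries that the decoder
    must learn to correct back toward the clean ground truth.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    boxes = np.asarray(boxes_cxcywh, dtype="float32")
    half_wh = boxes[:, 2:] / 2.0
    noise = rng.uniform(-1.0, 1.0, size=boxes.shape).astype("float32")
    noised = boxes.copy()
    # Shift centers and perturb sizes by up to scale * half the box extent.
    noised[:, :2] += noise[:, :2] * half_wh * box_noise_scale
    noised[:, 2:] += noise[:, 2:] * half_wh * box_noise_scale
    return np.clip(noised, 0.0, 1.0)
```

With `box_noise_scale=0.0` the boxes pass through unchanged; increasing the scale widens the jitter and makes the denoising task harder.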

## Example Usage with Hugging Face URI

### Imports
```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```

### Load a Pretrained Model
Use `from_preset()` to load a D-FINE model with pretrained weights.
```python
object_detector = DFineObjectDetector.from_preset(
    "hf://keras/dfine_xlarge_coco"
)
```

### Make a Prediction
Call `predict()` on a batch of images. The images will be automatically preprocessed.
```python
# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")

# Make predictions.
predictions = object_detector.predict(image)

# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
```

### Fine-Tune a Pre-trained Model
You can load a pretrained backbone and attach a new detection head for a different number of classes.
```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
    "hf://keras/dfine_xlarge_coco"
)

# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10  # Example: fine-tuning on a new dataset with 10 classes
)

# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```

Creating a model from scratch, training, and training with contrastive denoising follow exactly the same steps as in the Example Usage section above; only the preset identifier passed to `from_preset()` changes when loading from Hugging Face.