---
library_name: keras-hub
---
### Model Overview
# Model Summary
D-FINE is a family of lightweight, real-time object detection models built on the DETR (DEtection TRansformer) architecture. It achieves outstanding localization precision by redefining the bounding box regression task: rather than predicting fixed coordinates, it iteratively refines probability distributions over box edges. Trained on large-scale image datasets, D-FINE identifies and localizes objects with high accuracy and speed, and its balance of performance and computational efficiency makes it suitable for both research and deployment in real-time applications.
Key Features:
* Transformer-based Architecture: A modern, efficient design based on the DETR framework for direct set prediction of objects.
* Open Source Code: Code is publicly available, promoting accessibility and innovation.
* Strong Performance: Achieves state-of-the-art results on object detection benchmarks like COCO for its size.
* Multiple Sizes: Comes in various sizes (e.g., Nano, Small, Large, X-Large) to fit different hardware capabilities.
* Advanced Bounding Box Refinement: Instead of predicting fixed coordinates, it iteratively refines probability distributions for precise object localization using Fine-grained Distribution Refinement (FDR).
Training Strategies:
D-FINE is pre-trained on large and diverse datasets like COCO and Objects365. The training process utilizes Global Optimal Localization Self-Distillation (GO-LSD), a bidirectional optimization strategy that transfers localization knowledge from refined distributions in deeper layers to shallower layers. This accelerates convergence and improves the overall performance of the model.
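As a toy illustration of the FDR idea (not the library's implementation), each box edge can be modeled as a probability distribution over discrete offset bins, with the refined offset taken as the distribution's expectation:

```python
import numpy as np

def expected_offset(logits, bin_values):
    # Softmax over the offset bins, then take the expectation.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return (probs * bin_values).sum(axis=-1)

# 8 candidate offsets for one box edge, in normalized units.
bin_values = np.linspace(-0.5, 0.5, 8)
# A peaked distribution favoring a small positive offset.
logits = np.array([0.0, 0.0, 0.0, 0.0, 2.0, 4.0, 2.0, 0.0])
offset = expected_offset(logits, bin_values)
print(offset)  # A small positive value near the distribution's peak.
```

Refining the distribution (sharpening or shifting the logits) moves this expectation, which is what successive decoder layers do to tighten box edges.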
Weights are released under the [Apache 2.0 License](https://github.com/Peterande/D-FINE/blob/main/LICENSE).
## Links
* [D-FINE Quickstart Notebook](https://www.kaggle.com/code/harshaljanjani/d-fine-quickstart-notebook)
* [D-FINE API Documentation](https://keras.io/keras_hub/api/models/d_fine/)
* [D-FINE Paper](https://arxiv.org/abs/2410.13842)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)
## Installation
Keras and KerasHub can be installed with:
```shell
pip install -U -q keras-hub
pip install -U -q keras
```
JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
## Available D-FINE Presets
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
| Preset | Parameters | Description |
|--------|------------|-------------|
| dfine_nano_coco | 3.79M | D-FINE Nano model, the smallest variant in the family, pretrained on the COCO dataset. Ideal for applications where computational resources are limited. |
| dfine_small_coco | 10.33M | D-FINE Small model pretrained on the COCO dataset. Offers a balance between performance and computational efficiency. |
| dfine_medium_coco | 19.62M | D-FINE Medium model pretrained on the COCO dataset. A solid baseline with strong performance for general-purpose object detection. |
| dfine_large_coco | 31.34M | D-FINE Large model pretrained on the COCO dataset. Provides high accuracy and is suitable for more demanding tasks. |
| dfine_xlarge_coco | 62.83M | D-FINE X-Large model, the largest COCO-pretrained variant, designed for state-of-the-art performance where accuracy is the top priority. |
| dfine_small_obj365 | 10.62M | D-FINE Small model pretrained on the large-scale Objects365 dataset, enhancing its ability to recognize a wider variety of objects. |
| dfine_medium_obj365 | 19.99M | D-FINE Medium model pretrained on the Objects365 dataset. Benefits from a larger and more diverse pretraining corpus. |
| dfine_large_obj365 | 31.86M | D-FINE Large model pretrained on the Objects365 dataset for improved generalization and performance on diverse object categories. |
| dfine_xlarge_obj365 | 63.35M | D-FINE X-Large model pretrained on the Objects365 dataset, offering maximum performance by leveraging a vast number of object categories during pretraining. |
| dfine_small_obj2coco | 10.33M | D-FINE Small model first pretrained on Objects365 and then fine-tuned on COCO, combining broad feature learning with benchmark-specific adaptation. |
| dfine_medium_obj2coco | 19.62M | D-FINE Medium model using a two-stage training process: pretraining on Objects365 followed by fine-tuning on COCO. |
| dfine_large_obj2coco_e25 | 31.34M | D-FINE Large model pretrained on Objects365 and then fine-tuned on COCO for 25 epochs. A high-performance model with specialized tuning. |
| dfine_xlarge_obj2coco | 62.83M | D-FINE X-Large model, pretrained on Objects365 and fine-tuned on COCO, representing the most powerful model in this series for COCO-style tasks. |
## Example Usage
### Imports
```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```
### Load a Pretrained Model
Use `from_preset()` to load a D-FINE model with pretrained weights.
```python
object_detector = DFineObjectDetector.from_preset(
"dfine_xlarge_coco"
)
```
### Make a Prediction
Call `predict()` on a batch of images. The images will be automatically preprocessed.
```python
# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")
# Make predictions.
predictions = object_detector.predict(image)
# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
```
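The output tensors are padded to a fixed number of candidates, so a common next step is to slice with `num_detections` and filter by a confidence threshold. A minimal sketch using a synthetic predictions dictionary in the same layout (in practice this dictionary comes from `predict()`):

```python
import numpy as np

# Synthetic predictions shaped like the detector's output.
predictions = {
    "boxes": np.random.uniform(size=(1, 300, 4)).astype("float32"),
    "labels": np.zeros((1, 300), dtype="int32"),
    "confidence": np.linspace(1.0, 0.0, 300, dtype="float32")[None, :],
    "num_detections": np.array([300], dtype="int32"),
}

# Keep only detections above a confidence threshold for image 0.
n = int(predictions["num_detections"][0])
keep = predictions["confidence"][0, :n] > 0.5
boxes = predictions["boxes"][0, :n][keep]
labels = predictions["labels"][0, :n][keep]
print(boxes.shape, labels.shape)
```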
### Fine-Tune a Pre-trained Model
You can load a pretrained backbone and attach a new detection head for a different number of classes.
```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
"dfine_xlarge_coco"
)
# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
backbone=backbone,
num_classes=10 # Example: fine-tuning on a new dataset with 10 classes
)
# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```
### Create a Model From Scratch
You can also build a D-FINE detector by first creating its components, such as the underlying `HGNetV2Backbone`.
```python
# 1. Define a base backbone for feature extraction.
hgnetv2_backbone = HGNetV2Backbone(
stem_channels=[3, 16, 16],
stackwise_stage_filters=[
[16, 16, 64, 1, 3, 3],
[64, 32, 256, 1, 3, 3],
[256, 64, 512, 2, 3, 5],
[512, 128, 1024, 1, 3, 5],
],
apply_downsample=[False, True, True, True],
use_lightweight_conv_block=[False, False, True, True],
depths=[1, 1, 2, 1],
hidden_sizes=[64, 256, 512, 1024],
embedding_size=16,
image_shape=(256, 256, 3),
out_features=["stage3", "stage4"],
)
# 2. Create the D-FINE backbone, which includes the hybrid encoder and decoder.
d_fine_backbone = DFineBackbone(
backbone=hgnetv2_backbone,
decoder_in_channels=[128, 128],
encoder_hidden_dim=128,
num_denoising=0, # Denoising is off
num_labels=80,
hidden_dim=128,
learn_initial_query=False,
num_queries=300,
anchor_image_size=(256, 256),
feat_strides=[16, 32],
num_feature_levels=2,
encoder_in_channels=[512, 1024],
encode_proj_layers=[1],
num_attention_heads=8,
encoder_ffn_dim=512,
num_encoder_layers=1,
hidden_expansion=0.34,
depth_multiplier=0.5,
eval_idx=-1,
num_decoder_layers=3,
decoder_attention_heads=8,
decoder_ffn_dim=512,
decoder_n_points=[6, 6],
lqe_hidden_dim=64,
num_lqe_layers=2,
image_shape=(256, 256, 3),
)
# 3. Create the final object detector model.
object_detector_scratch = DFineObjectDetector(
backbone=d_fine_backbone,
num_classes=80,
bounding_box_format="yxyx",
)
```
### Train the Model
Call `fit()` on a batch of images and ground truth bounding boxes. The `compute_loss` method from the detector handles the complex loss calculations.
```python
# Prepare sample training data.
images = np.random.uniform(
low=0, high=255, size=(2, 256, 256, 3)
).astype("float32")
bounding_boxes = {
"boxes": [
np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.8, 0.8]], dtype="float32"),
np.array([[0.2, 0.2, 0.4, 0.4]], dtype="float32"),
],
"labels": [
np.array([1, 10], dtype="int32"),
np.array([20], dtype="int32"),
],
}
# Compile the model with the built-in loss function.
object_detector_scratch.compile(
optimizer="adam",
loss=object_detector_scratch.compute_loss,
)
# Train the model.
object_detector_scratch.fit(x=images, y=bounding_boxes, epochs=1)
```
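The sample boxes above are normalized to [0, 1]. If your annotations are in pixel-space `xyxy` instead, a small helper (the one below is hypothetical, not part of KerasHub) can convert them to the normalized `yxyx` layout configured on the detector:

```python
import numpy as np

def xyxy_pixels_to_normalized_yxyx(boxes, height, width):
    # Convert pixel-space [x1, y1, x2, y2] boxes to
    # normalized [y1, x1, y2, x2].
    x1, y1, x2, y2 = np.split(boxes.astype("float32"), 4, axis=-1)
    return np.concatenate(
        [y1 / height, x1 / width, y2 / height, x2 / width], axis=-1
    )

boxes_px = np.array([[64.0, 32.0, 192.0, 96.0]])
print(xyxy_pixels_to_normalized_yxyx(boxes_px, 256, 256))
```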
### Train with Contrastive Denoising
To enable contrastive denoising for training, provide ground truth `labels` when initializing the `DFineBackbone`.
```python
# Sample ground truth labels for initializing the denoising generator.
labels_for_denoising = [
{
"boxes": np.array([[0.5, 0.5, 0.2, 0.2]]), "labels": np.array([1])
},
{
"boxes": np.array([[0.6, 0.6, 0.3, 0.3]]), "labels": np.array([2])
},
]
# Create a D-FINE backbone with denoising enabled.
d_fine_backbone_denoising = DFineBackbone(
backbone=hgnetv2_backbone, # Using the hgnetv2_backbone from before
decoder_in_channels=[128, 128],
encoder_hidden_dim=128,
num_denoising=100, # Number of denoising queries
label_noise_ratio=0.5,
box_noise_scale=1.0,
labels=labels_for_denoising, # Provide labels at initialization
num_labels=80,
hidden_dim=128,
learn_initial_query=False,
num_queries=300,
anchor_image_size=(256, 256),
feat_strides=[16, 32],
num_feature_levels=2,
encoder_in_channels=[512, 1024],
encode_proj_layers=[1],
num_attention_heads=8,
encoder_ffn_dim=512,
num_encoder_layers=1,
hidden_expansion=0.34,
depth_multiplier=0.5,
eval_idx=-1,
num_decoder_layers=3,
decoder_attention_heads=8,
decoder_ffn_dim=512,
decoder_n_points=[6, 6],
lqe_hidden_dim=64,
num_lqe_layers=2,
image_shape=(256, 256, 3),
)
# Create the final detector.
object_detector_denoising = DFineObjectDetector(
backbone=d_fine_backbone_denoising,
num_classes=80
)
# This model can now be compiled and trained as shown in the previous example.
```
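Conceptually, contrastive denoising appends noised copies of the ground truth as extra decoder queries: class labels are randomly flipped and boxes are jittered in proportion to their size. A toy numpy sketch of that idea (not the library's implementation, which lives inside `DFineBackbone`; boxes here are assumed to be `[cx, cy, w, h]`):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_labels(labels, num_labels, label_noise_ratio):
    # Flip roughly `label_noise_ratio` of the labels to random classes.
    flip = rng.uniform(size=labels.shape) < label_noise_ratio
    random_labels = rng.integers(0, num_labels, size=labels.shape)
    return np.where(flip, random_labels, labels)

def noise_boxes(boxes, box_noise_scale):
    # Jitter [cx, cy, w, h] boxes proportionally to their width/height.
    wh = boxes[..., 2:4]
    jitter = (rng.uniform(size=boxes.shape) * 2.0 - 1.0) * box_noise_scale
    return boxes + jitter * np.concatenate([wh, wh], axis=-1) * 0.5

labels = np.array([1, 2, 3])
boxes = np.array([[0.5, 0.5, 0.2, 0.2]])
print(noise_labels(labels, num_labels=80, label_noise_ratio=0.5))
print(noise_boxes(boxes, box_noise_scale=1.0))
```

The model learns to reconstruct the clean targets from these noised queries, which stabilizes the bipartite matching during training.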
## Example Usage with Hugging Face URI
### Imports
```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```
### Load a Pretrained Model
Use `from_preset()` to load a D-FINE model with pretrained weights.
```python
object_detector = DFineObjectDetector.from_preset(
"hf://keras/dfine_xlarge_coco"
)
```
### Make a Prediction
Call `predict()` on a batch of images. The images will be automatically preprocessed.
```python
# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")
# Make predictions.
predictions = object_detector.predict(image)
# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
```
### Fine-Tune a Pre-trained Model
You can load a pretrained backbone and attach a new detection head for a different number of classes.
```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
"hf://keras/dfine_xlarge_coco"
)
# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
backbone=backbone,
num_classes=10 # Example: fine-tuning on a new dataset with 10 classes
)
# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```
### Create a Model From Scratch
You can also build a D-FINE detector by first creating its components, such as the underlying `HGNetV2Backbone`.
```python
# 1. Define a base backbone for feature extraction.
hgnetv2_backbone = HGNetV2Backbone(
stem_channels=[3, 16, 16],
stackwise_stage_filters=[
[16, 16, 64, 1, 3, 3],
[64, 32, 256, 1, 3, 3],
[256, 64, 512, 2, 3, 5],
[512, 128, 1024, 1, 3, 5],
],
apply_downsample=[False, True, True, True],
use_lightweight_conv_block=[False, False, True, True],
depths=[1, 1, 2, 1],
hidden_sizes=[64, 256, 512, 1024],
embedding_size=16,
image_shape=(256, 256, 3),
out_features=["stage3", "stage4"],
)
# 2. Create the D-FINE backbone, which includes the hybrid encoder and decoder.
d_fine_backbone = DFineBackbone(
backbone=hgnetv2_backbone,
decoder_in_channels=[128, 128],
encoder_hidden_dim=128,
num_denoising=0, # Denoising is off
num_labels=80,
hidden_dim=128,
learn_initial_query=False,
num_queries=300,
anchor_image_size=(256, 256),
feat_strides=[16, 32],
num_feature_levels=2,
encoder_in_channels=[512, 1024],
encode_proj_layers=[1],
num_attention_heads=8,
encoder_ffn_dim=512,
num_encoder_layers=1,
hidden_expansion=0.34,
depth_multiplier=0.5,
eval_idx=-1,
num_decoder_layers=3,
decoder_attention_heads=8,
decoder_ffn_dim=512,
decoder_n_points=[6, 6],
lqe_hidden_dim=64,
num_lqe_layers=2,
image_shape=(256, 256, 3),
)
# 3. Create the final object detector model.
object_detector_scratch = DFineObjectDetector(
backbone=d_fine_backbone,
num_classes=80,
bounding_box_format="yxyx",
)
```
### Train the Model
Call `fit()` on a batch of images and ground truth bounding boxes. The `compute_loss` method from the detector handles the complex loss calculations.
```python
# Prepare sample training data.
images = np.random.uniform(
low=0, high=255, size=(2, 256, 256, 3)
).astype("float32")
bounding_boxes = {
"boxes": [
np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.8, 0.8]], dtype="float32"),
np.array([[0.2, 0.2, 0.4, 0.4]], dtype="float32"),
],
"labels": [
np.array([1, 10], dtype="int32"),
np.array([20], dtype="int32"),
],
}
# Compile the model with the built-in loss function.
object_detector_scratch.compile(
optimizer="adam",
loss=object_detector_scratch.compute_loss,
)
# Train the model.
object_detector_scratch.fit(x=images, y=bounding_boxes, epochs=1)
```
### Train with Contrastive Denoising
To enable contrastive denoising for training, provide ground truth `labels` when initializing the `DFineBackbone`.
```python
# Sample ground truth labels for initializing the denoising generator.
labels_for_denoising = [
{
"boxes": np.array([[0.5, 0.5, 0.2, 0.2]]), "labels": np.array([1])
},
{
"boxes": np.array([[0.6, 0.6, 0.3, 0.3]]), "labels": np.array([2])
},
]
# Create a D-FINE backbone with denoising enabled.
d_fine_backbone_denoising = DFineBackbone(
backbone=hgnetv2_backbone, # Using the hgnetv2_backbone from before
decoder_in_channels=[128, 128],
encoder_hidden_dim=128,
num_denoising=100, # Number of denoising queries
label_noise_ratio=0.5,
box_noise_scale=1.0,
labels=labels_for_denoising, # Provide labels at initialization
num_labels=80,
hidden_dim=128,
learn_initial_query=False,
num_queries=300,
anchor_image_size=(256, 256),
feat_strides=[16, 32],
num_feature_levels=2,
encoder_in_channels=[512, 1024],
encode_proj_layers=[1],
num_attention_heads=8,
encoder_ffn_dim=512,
num_encoder_layers=1,
hidden_expansion=0.34,
depth_multiplier=0.5,
eval_idx=-1,
num_decoder_layers=3,
decoder_attention_heads=8,
decoder_ffn_dim=512,
decoder_n_points=[6, 6],
lqe_hidden_dim=64,
num_lqe_layers=2,
image_shape=(256, 256, 3),
)
# Create the final detector.
object_detector_denoising = DFineObjectDetector(
backbone=d_fine_backbone_denoising,
num_classes=80
)
# This model can now be compiled and trained as shown in the previous example.
```