---
library_name: keras-hub
---
### Model Overview
# Model Summary

D-FINE is a family of lightweight, real-time object detection models built on the DETR (DEtection TRansformer) architecture. It achieves high localization precision by redefining the bounding box regression task: instead of predicting fixed coordinates, it iteratively refines probability distributions over box edges. Pretrained on large image datasets, D-FINE identifies and localizes objects with high accuracy and speed, and its balance of performance and computational efficiency makes it suitable for both research and deployment in real-time applications.

Key Features:

  * Transformer-based Architecture: A modern, efficient design based on the DETR framework for direct set prediction of objects.
  * Open Source Code: Code is publicly available, promoting accessibility and innovation.
  * Strong Performance: Achieves state-of-the-art results on object detection benchmarks like COCO for its size.
  * Multiple Sizes: Comes in various sizes (e.g., Nano, Small, Large, X-Large) to fit different hardware capabilities.
  * Advanced Bounding Box Refinement: Instead of predicting fixed coordinates, it iteratively refines probability distributions for precise object localization using Fine-grained Distribution Refinement (FDR).
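
The intuition behind FDR can be shown with a minimal NumPy sketch: rather than regressing a single offset per box edge, the head predicts logits over a set of discrete bins, and the refined offset is the expectation of the resulting distribution. The bin layout and the `reg_max`/`reg_scale` values below are simplified assumptions for illustration, not the exact D-FINE configuration.

```python
import numpy as np

def expected_edge_offset(logits, reg_max=32, reg_scale=4.0):
    """Turn per-edge bin logits into a single refined offset.

    `logits` has shape (reg_max + 1,): one score per discrete bin.
    The offset is the probability-weighted average of the bin values.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # softmax over bins
    bin_values = np.linspace(-reg_scale, reg_scale, reg_max + 1)
    return float((probs * bin_values).sum())

# Uniform logits put equal mass on symmetric bins, so the offset is ~0;
# sharply peaked logits push the offset toward the corresponding bin.
print(expected_edge_offset(np.zeros(33)))
```

Because each refinement step reshapes a distribution rather than overwriting a point estimate, later decoder layers can make fine-grained corrections to earlier predictions.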

Training Strategies:

D-FINE is pre-trained on large and diverse datasets like COCO and Objects365. The training process utilizes Global Optimal Localization Self-Distillation (GO-LSD), a bidirectional optimization strategy that transfers localization knowledge from refined distributions in deeper layers to shallower layers. This accelerates convergence and improves the overall performance of the model.

Weights are released under the [Apache 2.0 License](https://github.com/Peterande/D-FINE/blob/main/LICENSE).

## Links

  * [D-FINE Quickstart Notebook](https://www.kaggle.com/code/harshaljanjani/d-fine-quickstart-notebook)
  * [D-FINE API Documentation](https://keras.io/keras_hub/api/models/d_fine/)
  * [D-FINE Paper (arXiv)](https://arxiv.org/abs/2410.13842)
  * [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
  * [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```shell
pip install -U -q keras-hub
pip install -U -q keras
```

JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.

## Available D-FINE Presets
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
| Preset |  Parameters | Description |
|--------|------------|-------------|
| dfine_nano_coco | 3.79M | D-FINE Nano model, the smallest variant in the family, pretrained on the COCO dataset. Ideal for applications where computational resources are limited. |
| dfine_small_coco | 10.33M | D-FINE Small model pretrained on the COCO dataset. Offers a balance between performance and computational efficiency. |
| dfine_medium_coco | 19.62M | D-FINE Medium model pretrained on the COCO dataset. A solid baseline with strong performance for general-purpose object detection. |
| dfine_large_coco | 31.34M | D-FINE Large model pretrained on the COCO dataset. Provides high accuracy and is suitable for more demanding tasks. |
| dfine_xlarge_coco | 62.83M | D-FINE X-Large model, the largest COCO-pretrained variant, designed for state-of-the-art performance where accuracy is the top priority. |
| dfine_small_obj365 | 10.62M | D-FINE Small model pretrained on the large-scale Objects365 dataset, enhancing its ability to recognize a wider variety of objects. |
| dfine_medium_obj365 | 19.99M | D-FINE Medium model pretrained on the Objects365 dataset. Benefits from a larger and more diverse pretraining corpus. |
| dfine_large_obj365 | 31.86M | D-FINE Large model pretrained on the Objects365 dataset for improved generalization and performance on diverse object categories. |
| dfine_xlarge_obj365 | 63.35M | D-FINE X-Large model pretrained on the Objects365 dataset, offering maximum performance by leveraging a vast number of object categories during pretraining. |
| dfine_small_obj2coco | 10.33M | D-FINE Small model first pretrained on Objects365 and then fine-tuned on COCO, combining broad feature learning with benchmark-specific adaptation. |
| dfine_medium_obj2coco | 19.62M | D-FINE Medium model using a two-stage training process: pretraining on Objects365 followed by fine-tuning on COCO. |
| dfine_large_obj2coco_e25 | 31.34M | D-FINE Large model pretrained on Objects365 and then fine-tuned on COCO for 25 epochs. A high-performance model with specialized tuning. |
| dfine_xlarge_obj2coco | 62.83M | D-FINE X-Large model, pretrained on Objects365 and fine-tuned on COCO, representing the most powerful model in this series for COCO-style tasks. |

## Example Usage
### Imports
```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```

### Load a Pretrained Model
Use `from_preset()` to load a D-FINE model with pretrained weights.
```python
object_detector = DFineObjectDetector.from_preset(
    "dfine_xlarge_coco"
)
```

### Make a Prediction
Call `predict()` on a batch of images. The images will be automatically preprocessed.
```python
# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")

# Make predictions.
predictions = object_detector.predict(image)

# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
```
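
Since the prediction outputs are padded to a fixed number of candidates per image, a small post-processing helper is handy for pulling out only the valid, confident detections. The sketch below assumes the output dictionary has exactly the keys printed above (`boxes`, `labels`, `confidence`, `num_detections`); the threshold and indexing conventions are illustrative assumptions, not part of the KerasHub API.

```python
import numpy as np

def extract_detections(predictions, image_index=0, score_threshold=0.5):
    # Number of valid (non-padded) detections for this image.
    n = int(predictions["num_detections"][image_index])
    boxes = np.asarray(predictions["boxes"])[image_index][:n]
    labels = np.asarray(predictions["labels"])[image_index][:n]
    scores = np.asarray(predictions["confidence"])[image_index][:n]
    # Keep only detections whose confidence clears the threshold.
    keep = scores >= score_threshold
    return boxes[keep], labels[keep], scores[keep]
```

For example, `extract_detections(predictions, score_threshold=0.3)` returns only the boxes, labels, and scores for confident detections in the first image of the batch.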

### Fine-Tune a Pre-trained Model
You can load a pretrained backbone and attach a new detection head for a different number of classes.
```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
    "dfine_xlarge_coco"
)

# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10  # Example: fine-tuning on a new dataset with 10 classes
)

# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```

### Create a Model From Scratch
You can also build a D-FINE detector by first creating its components, such as the underlying `HGNetV2Backbone`.
```python
# 1. Define a base backbone for feature extraction.
hgnetv2_backbone = HGNetV2Backbone(
    stem_channels=[3, 16, 16],
    stackwise_stage_filters=[
        [16, 16, 64, 1, 3, 3],
        [64, 32, 256, 1, 3, 3],
        [256, 64, 512, 2, 3, 5],
        [512, 128, 1024, 1, 3, 5],
    ],
    apply_downsample=[False, True, True, True],
    use_lightweight_conv_block=[False, False, True, True],
    depths=[1, 1, 2, 1],
    hidden_sizes=[64, 256, 512, 1024],
    embedding_size=16,
    image_shape=(256, 256, 3),
    out_features=["stage3", "stage4"],
)

# 2. Create the D-FINE backbone, which includes the hybrid encoder and decoder.
d_fine_backbone = DFineBackbone(
    backbone=hgnetv2_backbone,
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=0, # Denoising is off
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# 3. Create the final object detector model.
object_detector_scratch = DFineObjectDetector(
    backbone=d_fine_backbone,
    num_classes=80,
    bounding_box_format="yxyx",
)
```

### Train the Model
Call `fit()` on a batch of images and ground truth bounding boxes. The `compute_loss` method from the detector handles the complex loss calculations.
```python
# Prepare sample training data.
images = np.random.uniform(
    low=0, high=255, size=(2, 256, 256, 3)
).astype("float32")
bounding_boxes = {
    "boxes": [
        np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.8, 0.8]], dtype="float32"),
        np.array([[0.2, 0.2, 0.4, 0.4]], dtype="float32"),
    ],
    "labels": [
        np.array([1, 10], dtype="int32"),
        np.array([20], dtype="int32"),
    ],
}

# Compile the model with the built-in loss function.
object_detector_scratch.compile(
    optimizer="adam",
    loss=object_detector_scratch.compute_loss,
)

# Train the model.
object_detector_scratch.fit(x=images, y=bounding_boxes, epochs=1)
```
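
The ground truth boxes above are already in normalized coordinates. If your annotations are in pixel space, a small conversion helper keeps them consistent with the image size. The sketch below assumes `[y_min, x_min, y_max, x_max]` ordering to match the `"yxyx"` format the detector was built with; adjust the scale vector if your labels use a different ordering.

```python
import numpy as np

def normalize_boxes(boxes_pixels, image_height, image_width):
    """Scale pixel-space [y_min, x_min, y_max, x_max] boxes into [0, 1]."""
    boxes_pixels = np.asarray(boxes_pixels, dtype="float32")
    scale = np.array(
        [image_height, image_width, image_height, image_width], dtype="float32"
    )
    return boxes_pixels / scale

# e.g. a (64, 32, 192, 96) box in a 256x256 image
# becomes (0.25, 0.125, 0.75, 0.375).
```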

### Train with Contrastive Denoising
To enable contrastive denoising for training, provide ground truth `labels` when initializing the `DFineBackbone`.
```python
# Sample ground truth labels for initializing the denoising generator.
labels_for_denoising = [
    {
        "boxes": np.array([[0.5, 0.5, 0.2, 0.2]]), "labels": np.array([1])
    },
    {
        "boxes": np.array([[0.6, 0.6, 0.3, 0.3]]), "labels": np.array([2])
    },
]

# Create a D-FINE backbone with denoising enabled.
d_fine_backbone_denoising = DFineBackbone(
    backbone=hgnetv2_backbone, # Using the hgnetv2_backbone from before
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=100,  # Number of denoising queries
    label_noise_ratio=0.5,
    box_noise_scale=1.0,
    labels=labels_for_denoising, # Provide labels at initialization
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# Create the final detector.
object_detector_denoising = DFineObjectDetector(
    backbone=d_fine_backbone_denoising,
    num_classes=80
)

# This model can now be compiled and trained as shown in the previous example.
```
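
Under the hood, contrastive denoising trains the decoder to recover clean boxes from noised copies of the ground truth. The sketch below illustrates what a parameter like `box_noise_scale` controls, using a deliberately simplified noising scheme (uniform jitter proportional to box size); the exact noise distribution and positive/negative query construction in D-FINE differ, so treat this purely as intuition.

```python
import numpy as np

def jitter_boxes(boxes_cxcywh, box_noise_scale=1.0, rng=None):
    """Add noise to center-format boxes, proportional to their size.

    A larger scale produces harder noised queries that the decoder
    must learn to correct back toward the clean ground truth.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    boxes = np.asarray(boxes_cxcywh, dtype="float32")
    half_wh = boxes[:, 2:] / 2.0
    noise = rng.uniform(-1.0, 1.0, size=boxes.shape).astype("float32")
    noised = boxes.copy()
    # Shift centers and perturb sizes by up to scale * half the box extent.
    noised[:, :2] += noise[:, :2] * half_wh * box_noise_scale
    noised[:, 2:] += noise[:, 2:] * half_wh * box_noise_scale
    return np.clip(noised, 0.0, 1.0)
```

With `box_noise_scale=0.0` the boxes pass through unchanged; increasing the scale widens the jitter and makes the denoising task harder.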

## Example Usage with Hugging Face URI

### Imports
```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```

### Load a Pretrained Model
Use `from_preset()` to load a D-FINE model with pretrained weights.
```python
object_detector = DFineObjectDetector.from_preset(
    "hf://keras/dfine_xlarge_coco"
)
```

### Make a Prediction
Call `predict()` on a batch of images. The images will be automatically preprocessed.
```python
# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")

# Make predictions.
predictions = object_detector.predict(image)

# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
```

### Fine-Tune a Pre-trained Model
You can load a pretrained backbone and attach a new detection head for a different number of classes.
```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
    "hf://keras/dfine_xlarge_coco"
)

# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10  # Example: fine-tuning on a new dataset with 10 classes
)

# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```

Creating a model from scratch, training, and training with contrastive denoising follow exactly the same steps as in the Example Usage section above; only the preset identifier passed to `from_preset()` changes when loading from Hugging Face.