---
library_name: keras-hub
---
### Model Overview
# Model Summary

D-FINE is a family of lightweight, real-time object detection models built on the DETR (DEtection TRansformer) architecture. It achieves outstanding localization precision by redefining the bounding box regression task. Trained on massive image datasets, D-FINE excels at identifying and localizing objects with high accuracy and speed, and its balance of performance and computational efficiency makes it suitable for both research and deployment in real-time applications.

Key Features:

* Transformer-based Architecture: A modern, efficient design based on the DETR framework for direct set prediction of objects.
* Open Source Code: Code is publicly available, promoting accessibility and innovation.
* Strong Performance: Achieves state-of-the-art results for its size on object detection benchmarks like COCO.
* Multiple Sizes: Comes in various sizes (Nano, Small, Medium, Large, X-Large) to fit different hardware capabilities.
* Advanced Bounding Box Refinement: Instead of predicting fixed coordinates, it iteratively refines probability distributions for precise object localization using Fine-grained Distribution Refinement (FDR).
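
The FDR idea can be illustrated with a toy example: each box edge is predicted as a probability distribution over discrete offset bins, and the final coordinate is the expectation of that distribution. The sketch below is a simplified illustration in numpy, with made-up bin counts and logits, not the actual D-FINE implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits over 8 discrete offset bins for one box edge.
logits = np.array([0.1, 0.3, 2.0, 4.0, 2.0, 0.3, 0.1, 0.0])
bin_offsets = np.linspace(-1.0, 1.0, num=8)  # candidate offsets per bin

probs = softmax(logits)
refined_offset = float(np.sum(probs * bin_offsets))  # expectation over bins

# A refinement layer sharpens `probs` around the true offset, nudging
# the expectation toward a more precise edge location.
```

Sharpening the distribution (rather than regressing a single scalar) is what lets the model express and reduce localization uncertainty layer by layer.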

Training Strategies:

D-FINE is pretrained on large, diverse datasets such as COCO and Objects365. Training uses Global Optimal Localization Self-Distillation (GO-LSD), a bidirectional optimization strategy that transfers localization knowledge from the refined distributions in deeper layers to shallower layers. This accelerates convergence and improves the overall performance of the model.
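
Conceptually, the self-distillation step treats a deep layer's refined distribution as a soft target for a shallower layer. The numpy sketch below illustrates that knowledge transfer with a KL-divergence term; the shapes, logits, and loss form are illustrative assumptions, not the actual GO-LSD code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-edge offset distributions from two decoder layers.
shallow_logits = np.array([0.5, 1.0, 1.5, 1.0, 0.5])
deep_logits = np.array([0.1, 0.5, 3.0, 0.5, 0.1])  # sharper after refinement

p_shallow = softmax(shallow_logits)
p_deep = softmax(deep_logits)

# KL(deep || shallow): the deep layer's refined distribution acts as a
# soft target that the shallow layer's prediction is pulled toward.
kl = float(np.sum(p_deep * np.log(p_deep / p_shallow)))
```

Minimizing such a term during training pushes early layers toward the localization quality of later ones, which is why convergence speeds up.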

Weights are released under the [Apache 2.0 License](https://github.com/Peterande/D-FINE/blob/main/LICENSE).

## Links

* [D-FINE Quickstart Notebook](https://www.kaggle.com/code/harshaljanjani/d-fine-quickstart-notebook)
* [D-FINE API Documentation](https://keras.io/keras_hub/api/models/d_fine/)
* [D-FINE Paper](https://arxiv.org/abs/2410.13842)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```
pip install -U -q keras-hub
pip install -U -q keras
```

JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
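
Keras selects its backend from the `KERAS_BACKEND` environment variable, which must be set before `keras` is imported:

```python
import os

# Choose the backend: "jax", "tensorflow", or "torch".
os.environ["KERAS_BACKEND"] = "jax"

# import keras  # Import keras only after setting the variable above.
```

Setting the variable after importing keras has no effect, so placing this at the very top of a script or notebook is the safest pattern.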

## Available D-FINE Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset | Parameters | Description |
|--------|------------|-------------|
| dfine_nano_coco | 3.79M | D-FINE Nano model, the smallest variant in the family, pretrained on the COCO dataset. Ideal for applications where computational resources are limited. |
| dfine_small_coco | 10.33M | D-FINE Small model pretrained on the COCO dataset. Offers a balance between performance and computational efficiency. |
| dfine_medium_coco | 19.62M | D-FINE Medium model pretrained on the COCO dataset. A solid baseline with strong performance for general-purpose object detection. |
| dfine_large_coco | 31.34M | D-FINE Large model pretrained on the COCO dataset. Provides high accuracy and is suitable for more demanding tasks. |
| dfine_xlarge_coco | 62.83M | D-FINE X-Large model, the largest COCO-pretrained variant, designed for state-of-the-art performance where accuracy is the top priority. |
| dfine_small_obj365 | 10.62M | D-FINE Small model pretrained on the large-scale Objects365 dataset, enhancing its ability to recognize a wider variety of objects. |
| dfine_medium_obj365 | 19.99M | D-FINE Medium model pretrained on the Objects365 dataset. Benefits from a larger and more diverse pretraining corpus. |
| dfine_large_obj365 | 31.86M | D-FINE Large model pretrained on the Objects365 dataset for improved generalization and performance on diverse object categories. |
| dfine_xlarge_obj365 | 63.35M | D-FINE X-Large model pretrained on the Objects365 dataset, offering maximum performance by leveraging a vast number of object categories during pretraining. |
| dfine_small_obj2coco | 10.33M | D-FINE Small model first pretrained on Objects365 and then fine-tuned on COCO, combining broad feature learning with benchmark-specific adaptation. |
| dfine_medium_obj2coco | 19.62M | D-FINE Medium model using a two-stage training process: pretraining on Objects365 followed by fine-tuning on COCO. |
| dfine_large_obj2coco_e25 | 31.34M | D-FINE Large model pretrained on Objects365 and then fine-tuned on COCO for 25 epochs. A high-performance model with specialized tuning. |
| dfine_xlarge_obj2coco | 62.83M | D-FINE X-Large model, pretrained on Objects365 and fine-tuned on COCO, representing the most powerful model in this series for COCO-style tasks. |

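The parameter counts above can guide preset selection. As a small illustration (a hypothetical helper, using only the counts listed in the table), here is one way to pick the largest COCO preset that fits a parameter budget:

```python
# Parameter counts (in millions) for the COCO presets in the table above.
COCO_PRESETS = {
    "dfine_nano_coco": 3.79,
    "dfine_small_coco": 10.33,
    "dfine_medium_coco": 19.62,
    "dfine_large_coco": 31.34,
    "dfine_xlarge_coco": 62.83,
}

def largest_preset_under(budget_m):
    """Return the largest preset whose parameter count fits the budget."""
    candidates = {k: v for k, v in COCO_PRESETS.items() if v <= budget_m}
    if not candidates:
        raise ValueError(f"No preset fits under {budget_m}M parameters.")
    return max(candidates, key=candidates.get)

print(largest_preset_under(20))  # -> dfine_medium_coco
```
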
## Example Usage

### Imports

```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```

### Load a Pretrained Model

Use `from_preset()` to load a D-FINE model with pretrained weights.

```python
object_detector = DFineObjectDetector.from_preset(
    "dfine_nano_coco"
)
```

### Make a Prediction

Call `predict()` on a batch of images. The images will be automatically preprocessed.

```python
# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")

# Make predictions.
predictions = object_detector.predict(image)

# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
```
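
A common next step is to keep only confident detections. The snippet below filters a mocked prediction dictionary with numpy; the keys mirror the output described above, but the values and shapes here are fabricated for illustration.

```python
import numpy as np

# Mocked model output: 1 image, 5 candidate detections (values fabricated).
predictions = {
    "boxes": np.random.uniform(0, 256, size=(1, 5, 4)).astype("float32"),
    "labels": np.array([[1, 3, 3, 7, 0]]),
    "confidence": np.array([[0.92, 0.15, 0.67, 0.04, 0.88]]),
    "num_detections": np.array([5]),
}

# Keep detections whose confidence clears a threshold.
threshold = 0.5
keep = predictions["confidence"][0] >= threshold

filtered_boxes = predictions["boxes"][0][keep]
filtered_labels = predictions["labels"][0][keep]  # labels with conf >= 0.5
```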

### Fine-Tune a Pretrained Model

You can load a pretrained backbone and attach a new detection head for a different number of classes.

```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
    "dfine_nano_coco"
)

# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10,  # Example: fine-tuning on a new dataset with 10 classes.
)

# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```

### Create a Model From Scratch

You can also build a D-FINE detector by first creating its components, such as the underlying `HGNetV2Backbone`.

```python
# 1. Define a base backbone for feature extraction.
hgnetv2_backbone = HGNetV2Backbone(
    stem_channels=[3, 16, 16],
    stackwise_stage_filters=[
        [16, 16, 64, 1, 3, 3],
        [64, 32, 256, 1, 3, 3],
        [256, 64, 512, 2, 3, 5],
        [512, 128, 1024, 1, 3, 5],
    ],
    apply_downsample=[False, True, True, True],
    use_lightweight_conv_block=[False, False, True, True],
    depths=[1, 1, 2, 1],
    hidden_sizes=[64, 256, 512, 1024],
    embedding_size=16,
    image_shape=(256, 256, 3),
    out_features=["stage3", "stage4"],
)

# 2. Create the D-FINE backbone, which includes the hybrid encoder and decoder.
d_fine_backbone = DFineBackbone(
    backbone=hgnetv2_backbone,
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=0,  # Denoising is off.
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# 3. Create the final object detector model.
object_detector_scratch = DFineObjectDetector(
    backbone=d_fine_backbone,
    num_classes=80,
    bounding_box_format="yxyx",
)
```
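
The `bounding_box_format="yxyx"` argument controls the coordinate order of boxes. Converting between `yxyx` and the common `xyxy` layout is just a column swap; the helper below is a generic numpy sketch, independent of the KerasHub API.

```python
import numpy as np

def yxyx_to_xyxy(boxes):
    """Swap (y_min, x_min, y_max, x_max) -> (x_min, y_min, x_max, y_max)."""
    boxes = np.asarray(boxes)
    return boxes[..., [1, 0, 3, 2]]

boxes_yxyx = np.array([[10.0, 20.0, 110.0, 220.0]])
print(yxyx_to_xyxy(boxes_yxyx))  # [[ 20.  10. 220. 110.]]
```

The same index swap also converts `xyxy` back to `yxyx`, since the permutation is its own inverse.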

### Train the Model

Call `fit()` on a batch of images and ground truth bounding boxes. The detector's `compute_loss` method handles the complex loss calculations.

```python
# Prepare sample training data.
images = np.random.uniform(
    low=0, high=255, size=(2, 256, 256, 3)
).astype("float32")
bounding_boxes = {
    "boxes": [
        np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.8, 0.8]], dtype="float32"),
        np.array([[0.2, 0.2, 0.4, 0.4]], dtype="float32"),
    ],
    "labels": [
        np.array([1, 10], dtype="int32"),
        np.array([20], dtype="int32"),
    ],
}

# Compile the model with the built-in loss function.
object_detector_scratch.compile(
    optimizer="adam",
    loss=object_detector_scratch.compute_loss,
)

# Train the model.
object_detector_scratch.fit(x=images, y=bounding_boxes, epochs=1)
```

### Train with Contrastive Denoising

To enable contrastive denoising for training, provide ground truth `labels` when initializing the `DFineBackbone`.

```python
# Sample ground truth labels for initializing the denoising generator.
labels_for_denoising = [
    {"boxes": np.array([[0.5, 0.5, 0.2, 0.2]]), "labels": np.array([1])},
    {"boxes": np.array([[0.6, 0.6, 0.3, 0.3]]), "labels": np.array([2])},
]

# Create a D-FINE backbone with denoising enabled.
d_fine_backbone_denoising = DFineBackbone(
    backbone=hgnetv2_backbone,  # Reuse the hgnetv2_backbone from before.
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=100,  # Number of denoising queries.
    label_noise_ratio=0.5,
    box_noise_scale=1.0,
    labels=labels_for_denoising,  # Provide labels at initialization.
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# Create the final detector.
object_detector_denoising = DFineObjectDetector(
    backbone=d_fine_backbone_denoising,
    num_classes=80,
)

# This model can now be compiled and trained as shown in the previous example.
```
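
Intuitively, `label_noise_ratio` and `box_noise_scale` control how corrupted the denoising queries are: ground-truth boxes are jittered and a fraction of labels are flipped, and the decoder learns to recover the clean targets. The numpy sketch below illustrates that idea only; it is not the actual `DFineBackbone` noise generator.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def noise_targets(boxes, labels, num_labels=80,
                  label_noise_ratio=0.5, box_noise_scale=1.0):
    """Return jittered boxes and partially flipped labels (toy version)."""
    boxes = np.asarray(boxes, dtype="float32")  # (N, 4) as (cx, cy, w, h)
    labels = np.asarray(labels)

    # Box noise: random shifts scaled by each box's width/height.
    wh = np.concatenate([boxes[:, 2:], boxes[:, 2:]], axis=-1)
    jitter = rng.uniform(-0.5, 0.5, boxes.shape) * wh * box_noise_scale
    noisy_boxes = np.clip(boxes + jitter, 0.0, 1.0)

    # Label noise: flip a fraction of labels to random classes.
    flip = rng.uniform(size=labels.shape) < label_noise_ratio
    random_labels = rng.integers(0, num_labels, size=labels.shape)
    noisy_labels = np.where(flip, random_labels, labels)
    return noisy_boxes, noisy_labels

boxes = np.array([[0.5, 0.5, 0.2, 0.2], [0.6, 0.6, 0.3, 0.3]])
labels = np.array([1, 2])
noisy_boxes, noisy_labels = noise_targets(boxes, labels)
```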

## Example Usage with Hugging Face URI

### Imports

```python
import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone
```

### Load a Pretrained Model

Use `from_preset()` with an `hf://` URI to load a D-FINE model with pretrained weights from Hugging Face.

```python
object_detector = DFineObjectDetector.from_preset(
    "hf://keras/dfine_nano_coco"
)
```

### Fine-Tune a Pretrained Model

You can load a pretrained backbone and attach a new detection head for a different number of classes.

```python
# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
    "hf://keras/dfine_nano_coco"
)

# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10,  # Example: fine-tuning on a new dataset with 10 classes.
)

# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.
```

Apart from the `hf://keras/...` preset URIs shown above, the rest of the workflow (prediction, building a model from scratch, training, and contrastive denoising) is identical to the examples in the previous section.