---
license: apache-2.0
language: en
library_name: pytorch
pipeline_tag: object-detection
tags:
- rtdetr
- object-detection
- knowledge-distillation
- taco-dataset
- dinov3
- convnext
---

# RT-DisDINOv3-ConvNext: A Distilled RT-DETR-L Model

This model is an **RT-DETR-L** whose backbone and encoder were pre-trained by knowledge distillation from a **DINOv3 ConvNeXt-Base** teacher model. Distillation was performed on feature maps computed from images in the [TACO (Trash Annotations in Context)](https://tacodataset.org/) dataset.

This pre-trained checkpoint holds the distilled backbone and encoder weights. It is intended as a starting point for fine-tuning on downstream object detection tasks, where it may perform better than standard pre-trained weights.

This work is part of the **RT-DisDINOv3** project. For full details on the training pipeline, baseline comparisons, and analysis, please visit the [main GitHub repository](https://github.com/your-username/your-repo-name). <!-- TODO: Add your GitHub repo link here -->

## How to Use

You can load these distilled weights into the backbone and encoder of the original RT-DETR-L model before fine-tuning.

```python
import torch
from torch.hub import load_state_dict_from_url

# 1. Build the original RT-DETR-L architecture.
# torch.hub fetches the lyuwenyu/RT-DETR repository from GitHub on first use;
# the repository's Python dependencies must already be installed.
rtdetr_l = torch.hub.load('lyuwenyu/RT-DETR', 'rtdetrv2_l', pretrained=True)
model = rtdetr_l.model

# 2. Download the distilled weights from this Hugging Face Hub repository.
MODEL_URL = "https://huggingface.co/hnamt/RT-DisDINOv3-ConvNext-Base/resolve/main/distilled_rtdetr_convnext_teacher_BEST.pth"
distilled_state_dict = load_state_dict_from_url(MODEL_URL, map_location='cpu')['model']

# 3. Load the weights into the model.
# `strict=False` ignores keys that exist on only one side, so layers missing from
# the checkpoint keep their current weights. Shape mismatches still raise an error.
load_result = model.load_state_dict(distilled_state_dict, strict=False)

# Report what was skipped instead of assuming the load succeeded.
print(f"Model keys not found in the checkpoint: {len(load_result.missing_keys)}")
print(f"Checkpoint keys not used by the model: {len(load_result.unexpected_keys)}")

# The model is now ready for fine-tuning on your own dataset.
# For example:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# model.train()
# ... your fine-tuning loop ...
```
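
If you want to be explicit about which tensors are transferred, one option is to filter the checkpoint before loading it. The sketch below is illustrative only: `select_compatible` is a hypothetical helper, the `backbone.` and `encoder.` key prefixes are an assumption about how the RT-DETR state dict is named, and `ToyDetector` is a stand-in used so the example runs without downloading anything.

```python
import torch
from torch import nn


def select_compatible(checkpoint, model, prefixes=("backbone.", "encoder.")):
    """Keep checkpoint tensors whose key starts with one of `prefixes` and
    whose shape matches the same-named tensor in `model`."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    target = model.state_dict()
    kept, skipped = {}, []
    for key, tensor in checkpoint.items():
        if (key.startswith(prefixes)
                and key in target
                and tensor.shape == target[key].shape):
            kept[key] = tensor
        else:
            skipped.append(key)
    return kept, skipped


class ToyDetector(nn.Module):
    """Tiny stand-in with the assumed backbone / encoder / decoder layout."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 2)


toy = ToyDetector()
toy_checkpoint = {
    "backbone.weight": torch.ones(8, 4),  # prefix and shape match: kept
    "backbone.bias": torch.ones(8),       # prefix and shape match: kept
    "encoder.weight": torch.ones(3, 3),   # wrong shape: skipped
    "decoder.weight": torch.ones(2, 8),   # prefix not selected: skipped
}

kept, skipped = select_compatible(toy_checkpoint, toy)
result = toy.load_state_dict(kept, strict=False)
print("loaded:", sorted(kept))
print("skipped:", sorted(skipped))
print("left at their previous values:", result.missing_keys)
```

With the real model, the same call would be `select_compatible(distilled_state_dict, model)`, followed by `model.load_state_dict(kept, strict=False)`. Printing the skipped keys makes it easy to confirm that the decoder was left untouched.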

## Training Details

- **Student Model**: RT-DETR-L (`rtdetrv2_l` from [lyuwenyu/RT-DETR](https://github.com/lyuwenyu/RT-DETR)).
- **Teacher Model**: DINOv3 ConvNeXt-Base (`facebook/dinov3-convnext-base-pretrain-lvd1689m`).
- **Dataset for Distillation**: TACO dataset images.
- **Distillation Procedure**: The student model's backbone and encoder were trained to minimize the Mean Squared Error (MSE) between their output feature maps and those of the teacher model.
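
The distillation objective described above can be sketched as follows. This is a minimal illustration, not the project's training code: the `FeatureDistillLoss` class, the 1x1 projection layer, the bilinear resize, and the channel counts and spatial sizes (256 for the student, 1024 for the teacher) are all assumptions made so that the two feature maps can be compared element-wise.

```python
import torch
from torch import nn
import torch.nn.functional as F


class FeatureDistillLoss(nn.Module):
    """MSE between a projected student feature map and a frozen teacher map."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Assumed adapter: a 1x1 conv maps student channels to teacher channels.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        student_feat = self.proj(student_feat)
        # Resize to the teacher's spatial size if the two maps differ.
        if student_feat.shape[-2:] != teacher_feat.shape[-2:]:
            student_feat = F.interpolate(
                student_feat, size=teacher_feat.shape[-2:],
                mode="bilinear", align_corners=False)
        # detach() keeps gradients from flowing into the teacher.
        return F.mse_loss(student_feat, teacher_feat.detach())


torch.manual_seed(0)
criterion = FeatureDistillLoss(student_channels=256, teacher_channels=1024)

# Random tensors stand in for real student and teacher feature maps.
student_feat = torch.randn(2, 256, 20, 20, requires_grad=True)
teacher_feat = torch.randn(2, 1024, 10, 10)

loss = criterion(student_feat, teacher_feat)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```

In a real run, `student_feat` would come from the RT-DETR backbone and encoder and `teacher_feat` from the DINOv3 ConvNeXt-Base model evaluated under `torch.no_grad()`. The optimizer would then update the student and the projection layer only.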

## Evaluation Results

After the distillation pre-training, the model was fine-tuned on the TACO dataset. Compared with the baseline, mAP@50-95 rises from 2.80% to 3.60% and mAP@50 from 4.60% to 5.30%, while speed is essentially unchanged (49.80 ms vs. 50.05 ms).

| Model | mAP@50-95 | mAP@50 | Speed (ms) | Notes |
| ------------------------------ | :-------: | :-------: | :--------: | ----------------------------------- |
| RT-DETR-L (Baseline) | 2.80% | 4.60% | 50.05 | Fine-tuned from COCO pre-trained. |
| **RT-DisDINOv3 (w/ ConvNeXt)** | **3.60%** | **5.30%** | 49.80 | **+28.6% relative mAP@50-95 increase over baseline.** |
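
Because the absolute scores are small, it helps to separate the absolute gain (in percentage points) from the relative gain. The short sketch below recomputes both from the table values; `relative_gain` is a hypothetical helper written for this example.

```python
def relative_gain(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (baseline, distilled) values copied from the results table, in percent mAP.
results = {
    "mAP@50-95": (2.80, 3.60),
    "mAP@50": (4.60, 5.30),
}

for metric, (baseline, distilled) in results.items():
    print(f"{metric}: +{distilled - baseline:.2f} points absolute, "
          f"+{relative_gain(baseline, distilled):.1f}% relative")

# Prints:
# mAP@50-95: +0.80 points absolute, +28.6% relative
# mAP@50: +0.70 points absolute, +15.2% relative
```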

## License

The weights in this repository are released under the Apache 2.0 License. Please note that the models used for training (RT-DETR, DINOv3) are distributed under their own licenses.