	---
	license: apache-2.0
	language: en
	library_name: pytorch
	pipeline_tag: object-detection
	tags:
	- rtdetr
	- object-detection
	- knowledge-distillation
	- taco-dataset
	- dinov3
	- convnext
	---

	# RT-DisDINOv3-ConvNext: A Distilled RT-DETR-L Model

	This model is an RT-DETR-L whose backbone and encoder have been pre-trained by knowledge distillation from a DINOv3 ConvNeXt-Base teacher model. Distillation was performed on feature maps computed from images in the [TACO (Trash Annotations in Context)](https://tacodataset.org/) dataset.

	This pre-trained checkpoint contains the "distilled knowledge" and is intended to be used as a starting point for fine-tuning on downstream object detection tasks, potentially leading to better performance compared to standard pre-trained weights.

	This work is part of the RT-DisDINOv3 project. For full details on the training pipeline, baseline comparisons, and analysis, please visit the [main GitHub repository](https://github.com/your-username/your-repo-name). <!--- <<< TODO: Add your GitHub repo link here -->

	## How to Use

	You can load these distilled weights and apply them to the original RT-DETR-L model's backbone and encoder before fine-tuning.

	```python
	import torch
	from torch.hub import load_state_dict_from_url

	# 1. Load the original RT-DETR-L model architecture.
	# torch.hub.load downloads the 'lyuwenyu/RT-DETR' repository from GitHub on first use;
	# that repository's Python dependencies must already be installed.
	rtdetr_l = torch.hub.load('lyuwenyu/RT-DETR', 'rtdetrv2_l', pretrained=True)
	model = rtdetr_l.model

	# 2. Load the distilled weights from this Hugging Face Hub repository.
	MODEL_URL = "https://huggingface.co/hnamt/RT-DisDINOv3-ConvNext-Base/resolve/main/distilled_rtdetr_convnext_teacher_BEST.pth"
	checkpoint = load_state_dict_from_url(MODEL_URL, map_location='cpu')
	distilled_state_dict = checkpoint['model']

	# 3. Load the weights into the model's backbone and encoder.
	# With `strict=False`, keys that are missing from the checkpoint or absent from the
	# model are skipped instead of raising an error. Shape mismatches still raise.
	result = model.load_state_dict(distilled_state_dict, strict=False)

	# Report what was skipped so a silent mismatch does not go unnoticed.
	print(f"Missing keys: {len(result.missing_keys)}, unexpected keys: {len(result.unexpected_keys)}")

	# Now the 'model' is ready for fine-tuning on your own dataset.
	# For example:
	# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
	# model.train()
	# ... your fine-tuning loop ...
	```
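
Because `strict=False` skips non-matching keys without raising, it can help to check beforehand which checkpoint entries actually line up with the model. The following is a minimal sketch using plain dictionaries; the helper names `select_by_prefix` and `compare_keys`, the `backbone.` / `encoder.` prefixes, and the toy key names are assumptions for illustration and may not match the real RT-DETR parameter names.

```python
from typing import Any, Dict, Iterable, List


def select_by_prefix(state_dict: Dict[str, Any], prefixes: Iterable[str]) -> Dict[str, Any]:
    """Keep only the entries whose key starts with one of the given prefixes."""
    prefixes = tuple(prefixes)
    return {key: value for key, value in state_dict.items() if key.startswith(prefixes)}


def compare_keys(model_keys: Iterable[str], checkpoint_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Report which keys would be loaded, left untouched, or ignored."""
    model_set, checkpoint_set = set(model_keys), set(checkpoint_keys)
    return {
        "loaded": sorted(model_set & checkpoint_set),
        "missing": sorted(model_set - checkpoint_set),
        "unexpected": sorted(checkpoint_set - model_set),
    }


# Toy stand-ins for model.state_dict().keys() and the downloaded checkpoint (hypothetical names).
toy_model_keys = ["backbone.conv1.weight", "encoder.proj.weight", "decoder.head.weight"]
toy_checkpoint = {
    "backbone.conv1.weight": 0,
    "encoder.proj.weight": 1,
    "teacher.norm.weight": 2,
}

subset = select_by_prefix(toy_checkpoint, ["backbone.", "encoder."])
report = compare_keys(toy_model_keys, subset)
print(report)
```

With a real model, `toy_model_keys` would be replaced by `model.state_dict().keys()` and `toy_checkpoint` by the loaded state dict; a non-empty `loaded` list confirms that the distilled weights are actually being applied.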

	## Training Details

	- Student Model: RT-DETR-L (`rtdetrv2_l` from [lyuwenyu/RT-DETR](https://github.com/lyuwenyu/RT-DETR)).
	- Teacher Model: DINOv3 ConvNeXt-Base (`facebook/dinov3-convnext-base-pretrain-lvd1689m`).
	- Dataset for Distillation: TACO dataset images.
	- Distillation Procedure: The student model's backbone and encoder were trained to minimize the Mean Squared Error (MSE) between their output feature maps and those of the teacher model.
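
The MSE objective described above can be sketched as follows. This is an illustration under stated assumptions, not the project's training code: the class name `FeatureDistillationLoss`, the 1x1 projection used to match channel counts, the bilinear resize used to match spatial size, and the tensor shapes are all hypothetical choices made so that student and teacher feature maps are comparable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDistillationLoss(nn.Module):
    """MSE between projected student features and frozen teacher features.

    Assumption: a learned 1x1 convolution maps student channels to teacher
    channels, and a bilinear resize aligns the spatial dimensions.
    """

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        aligned = self.proj(student_feat)
        if aligned.shape[-2:] != teacher_feat.shape[-2:]:
            aligned = F.interpolate(
                aligned, size=teacher_feat.shape[-2:], mode="bilinear", align_corners=False
            )
        # The teacher is treated as a fixed target, so no gradient flows into it.
        return F.mse_loss(aligned, teacher_feat.detach())


torch.manual_seed(0)
# Illustrative shapes only: (batch, channels, height, width).
student_feat = torch.randn(2, 256, 20, 20, requires_grad=True)
teacher_feat = torch.randn(2, 1024, 10, 10)

criterion = FeatureDistillationLoss(student_channels=256, teacher_channels=1024)
loss = criterion(student_feat, teacher_feat)
loss.backward()  # gradients reach the student features and the projection layer
print(f"distillation loss: {loss.item():.4f}")
```

In an actual training loop, `student_feat` would come from the RT-DETR backbone and encoder, and `teacher_feat` from the DINOv3 ConvNeXt-Base model run under `torch.no_grad()`.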

	## Evaluation Results

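	After distillation pre-training, the model was fine-tuned on the TACO dataset. Compared with the baseline fine-tuned from COCO pre-trained weights, mAP@50-95 rose from 2.80% to 3.60% and mAP@50 from 4.60% to 5.30%, with comparable inference time.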

	| Model | mAP@50-95 | mAP@50 | Speed (ms) | Notes |
	| ----------------------------- | :-------: | :----: | :--------: | ----------------------------------- |
	| RT-DETR-L (Baseline) | 2.80% | 4.60% | 50.05 | Fine-tuned from COCO pre-trained weights. |
	| RT-DisDINOv3 (w/ ConvNeXt) | 3.60% | 5.30% | 49.80 | +28.6% relative mAP@50-95 increase over baseline. |
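
The 28.6% figure in the table is a relative gain, not an absolute difference in mAP points. A short sketch of the arithmetic, using only the values from the table (the helper name `relative_gain` is hypothetical):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Percentage change of `improved` relative to `baseline`."""
    return (improved - baseline) / baseline * 100.0


map_gain = relative_gain(2.80, 3.60)    # mAP@50-95 from the table
map50_gain = relative_gain(4.60, 5.30)  # mAP@50 from the table

print(f"mAP@50-95: +{map_gain:.1f}%")  # prints "mAP@50-95: +28.6%"
print(f"mAP@50:    +{map50_gain:.1f}%")  # prints "mAP@50:    +15.2%"
```

In absolute terms the improvement is 0.80 percentage points of mAP@50-95 and 0.70 percentage points of mAP@50.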

	## License
	The weights in this repository are released under the Apache 2.0 License. Please be aware that the models used for training (RT-DETR, DINOv3) have their own licenses.