Create README.md
README.md
CHANGED
|
@@ -1,3 +1,125 @@
# Model Card for DINOv2 Model (Trained with DinoMX)

This is a Vision Transformer model trained using the DinoMX framework, following methods related to "DINOv2: Learning Robust Visual Features without Supervision" and "Vision Transformers Need Registers."

## Model Details
*(You can adapt this section from the original DINOv2 model card based on your specific model's architecture, or provide details if they differ due to DinoMX training)*

The model takes an image as input and returns a class token and patch tokens, and optionally register tokens.

The embedding dimension is:
*(Specify for your ViT-S/B/L/g variant)*

The model follows a Transformer architecture, with a patch size of 14.
*(Specify if registers are used, as in the example: "In the case of registers, we add 4 register tokens, learned during training, to the input sequence after the patch embedding.")*

For a 224x224 image, this results in 1 class token + 256 patch tokens *(+ optionally X register tokens)*.

The models can accept larger images provided the image height and width are multiples of the patch size (14). If this condition is not met, the model will crop the image to the closest smaller multiple of the patch size.
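
The token counts and the cropping rule described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of DinoMX or DINOv2: the function name `token_layout` and its parameters are hypothetical, and it assumes the crop is applied independently to height and width.

```python
def token_layout(height, width, patch_size=14, num_registers=0):
    """Return the cropped image size and token counts for a ViT input.

    Hypothetical helper for illustration; not a DinoMX or DINOv2 API.
    """
    if height < patch_size or width < patch_size:
        raise ValueError("image must be at least one patch in each dimension")
    # Crop each side down to the closest smaller multiple of the patch size.
    cropped_h = (height // patch_size) * patch_size
    cropped_w = (width // patch_size) * patch_size
    # One patch token per non-overlapping patch_size x patch_size tile.
    num_patches = (cropped_h // patch_size) * (cropped_w // patch_size)
    return {
        "cropped_size": (cropped_h, cropped_w),
        "patch_tokens": num_patches,
        "register_tokens": num_registers,
        "total_tokens": 1 + num_registers + num_patches,  # 1 class token
    }


print(token_layout(224, 224))                   # 256 patch tokens, 257 in total
print(token_layout(500, 375, num_registers=4))  # cropped to 490x364, 915 in total
```

For a 224x224 input this reproduces the 1 + 256 tokens stated above; a 500x375 input would be cropped to 490x364 before patching.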

### Model Description

* **Developed by:** *(Your Organization/Name)*
* **Model type:** Vision Transformer (DINOv2)
* **License:** *(Specify License, e.g., Apache License 2.0 or as appropriate)*
* **Training System:** DinoMX Modular & Flexible Training Framework
* **Repository:** *(Link to your model repository, if any)*
* **Paper(s):**
  * "DINOv2: Learning Robust Visual Features without Supervision" (https://arxiv.org/abs/2304.07193)
  * "Vision Transformers Need Registers" (https://arxiv.org/abs/2309.16588)
  * *(Optionally, link to or mention "DINO-MX: Modular & Flexible Training Framework" if it's published)*
* **Demo:** *(Link to your demo, if any)*

## Uses

*(Adapt from the original DINOv2 model card based on intended uses)*

The models are vision backbones providing multi-purpose features for downstream tasks.

### Direct Use
*(As per DINOv2, e.g., depth estimation, semantic segmentation, image classification with k-NN or linear layers, image retrieval)*

### Downstream Use
*(As per DINOv2, e.g., fine-tuning, though often good out-of-the-box performance is expected)*

## Bias, Risks, and Limitations
*(Address any known biases, risks, and limitations, potentially referencing the original DINOv2 card and adding any specific to your training data or the DinoMX system if applicable)*

### Recommendations
*(As per DINOv2 or specific to your model)*

## How to Get Started with the Model
*(Provide code snippets for loading and using your model. DinoMX emphasizes Hugging Face compatibility, so if your model is available via Hugging Face, that would be a good starting point.)*

Example (adjust based on your actual Hugging Face path or loading mechanism):
```python
import torch

# Example only: replace the placeholder repository and entry-point names with your own.
# Note: torch.hub.load expects a GitHub repository ("owner/repo") that contains a hubconf.py.
# dinov2_model_dinomx = torch.hub.load('your-hf-hub/dinomx-dinov2-model', 'dinov2_vitb14_dinomx_trained')
# For a checkpoint hosted on the Hugging Face Hub, the usual route is the transformers library:
# from transformers import AutoModel
# dinov2_model_dinomx = AutoModel.from_pretrained('your-hf-hub/dinomx-dinov2-model')
```

## Training Details with DinoMX Framework

The DinoMX framework provides a modular and flexible environment for training Vision Transformer models, including DINO and DINOv2 variants. It is designed for self-supervised learning (SSL) and integrates with Hugging Face for standardized model checkpoints.

### Training Data

* **Training data:** *(Specify your training dataset, e.g., LVD-142M or your custom dataset)*
* **Training regime:** *(e.g., fp16 using PyTorch-FSDP mixed-precision, bf16)*

### DinoMX Training Procedure

* **Training System:** DinoMX Modular & Flexible Training Framework.
* **Core Training Objective (for DINOv2 within DinoMX):**
  * DINO self-distillation loss with multi-crop.
  * iBOT masked-image modeling loss (DinoMX allows iBOT patch learning adaptation for all models).
  * KoLeo regularization on [CLS] tokens.
  * The DinoMX framework allows training ViT models with DINOv1 or DINOv2 techniques, including the option to fine-tune a pretrained DINOv1 model using the iBOT patch loss (typically associated with DINOv2).
* **Architectures (within DinoMX):**
  * DinoMX supports various Vision Transformer (ViT) architectures, configurable via configuration files. The model being described is a DINOv2 variant.
  * *(Specify your ViT architecture, e.g., ViT-S/B/L/g, patch size, embedding dimension, heads, FFN type, similar to the original DINOv2 card)*
* **Data Augmentation (within DinoMX):**
  * DinoMX supports both natural image and medical image data augmentation strategies.
  * For natural images, standard augmentations such as random cropping, flipping, and color jittering are used.
  * For medical images, specific augmentations such as noise addition and brightness adjustment can be applied, while unsuitable ones such as solarization and color jitter are excluded.
  * DinoMX also features Label-Guided Data Augmentation, where existing labels guide the cropping process to focus on specific regions of interest.
* **Parameter-Efficient Fine-Tuning (PEFT) Options in DinoMX:**
  * **LoRA (Low-Rank Adaptation):** DinoMX incorporates LoRA to adapt pre-trained models by injecting trainable low-rank matrices, significantly reducing the number of trainable parameters. LoRA is applied to the attention and feed-forward layers.
  * **Layer Freezing:** The framework allows freezing the first N layers of a pre-trained model to reduce computational cost and memory use and to help prevent catastrophic forgetting.
* **Model Distillation (Capability of DinoMX):**
  * DinoMX supports knowledge distillation, transferring knowledge from larger foundation models (teacher) to smaller models (student) using a DINO-like self-distillation approach. The teacher model can be frozen or updated via an exponential moving average (EMA) of the student's weights.
* **Parallelization:**
  * DinoMX supports both Distributed Data Parallelism (DDP) and Fully Sharded Data Parallelism (FSDP) for efficient training across multiple GPUs. This is more flexible than frameworks that hardcode a single parallelization strategy.
* **Hugging Face Compatibility:**
  * Models trained with DinoMX are built on the Hugging Face Transformers library, ensuring compatibility and facilitating public sharing. Configuration files allow modification of ViT models while maintaining this compatibility.
* **Cross-Training:**
  * DinoMX allows any transformer-based ViT model to be trained with either DINOv1 or DINOv2 techniques, enabling novel experimental combinations.
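
To give a rough sense of why LoRA reduces the number of trainable parameters, the sketch below compares a dense weight matrix with its low-rank update. It is an illustration only, not DinoMX code: the function `lora_param_counts` is hypothetical, the 768-dimensional layer (a ViT-B-sized projection) and rank 8 are assumed values, and biases are ignored.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters of a dense weight with its LoRA update.

    Hypothetical helper for illustration; sizes and rank are assumptions.
    """
    full = d_in * d_out           # dense weight W has shape (d_out, d_in)
    lora = rank * (d_in + d_out)  # low-rank factors A: (rank, d_in), B: (d_out, rank)
    return full, lora, lora / full


full, lora, ratio = lora_param_counts(768, 768, rank=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {ratio:.2%}")
# full: 589,824  LoRA: 12,288  ratio: 2.08%
```

Under these assumptions, each adapted projection trains about 2% of the parameters that full fine-tuning of the same matrix would update; the actual savings depend on the rank and on which layers are adapted.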

## Evaluation
*(Refer to the original DINOv2 paper or provide your own evaluation results. The DinoMX paper includes experiments on MedMNIST and calcification detection)*

*(Consider including relevant tables from the original DINOv2 model card if your model's performance is comparable or if you've replicated those evaluations. Otherwise, provide your own.)*

## Environmental Impact
*(As per the original DINOv2 model card, or provide your own details if different hardware/software/region was used.)*

* **Hardware Type:** *(e.g., Nvidia A100)*
* **Hours used:** *(Specify)*
* **Cloud Provider:** *(e.g., Private infra, MGB ERIS Research Computing Core)*
* **Compute Region:** *(Specify)*
* **Carbon Emitted:** *(Specify)*

#### Hardware
*(e.g., Nvidia A100 GPUs)*

#### Software
*(e.g., PyTorch, xFormers, Hugging Face Transformers library)*

## BibTeX
*(Include BibTeX for DINOv2 papers, and for DinoMX if available/citable)*

```bibtex
@misc{oquab2023dinov2,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and
|