---
library_name: mlx-image
license: other
license_name: dinov3
license_link: https://ai.meta.com/resources/models-and-libraries/dinov3-license/
tags:
- mlx
- mlx-image
- vision
- dinov3
- image-feature-extraction
pipeline_tag: image-feature-extraction
---

# vit_base_patch16_224.dinov3

A [Vision Transformer](https://arxiv.org/abs/2010.11929v2) feature extraction model trained on the LVD-1689M web dataset with [DINOv3](https://arxiv.org/abs/2508.10104).

The model was trained in a self-supervised fashion; no classification head was trained, only the backbone. This is the **ViT-B/16** variant (86M parameters), distilled from the DINOv3 ViT-7B teacher model.

Disclaimer: this is a port of the Meta AI DINOv3 model weights to the Apple MLX framework.

## How to use
```bash
pip install mlx-image
```

Here is how to use this model for feature extraction:
```python
import mlx.core as mx
from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

# Resize and normalize with ImageNet statistics, then add a batch axis
transform = ImageNetTransform(train=False, img_size=224)
x = transform(read_rgb("image.png"))
x = mx.expand_dims(x, 0)

model = create_model("vit_base_patch16_224.dinov3")
model.eval()

embeds = model(x, is_training=False)
```

You can also retrieve embeddings from the layer before the head:
```python
import mlx.core as mx
from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

transform = ImageNetTransform(train=False, img_size=224)
x = transform(read_rgb("image.png"))
x = mx.expand_dims(x, 0)

# num_classes=0 drops the head, so the model returns the backbone embeddings
model = create_model("vit_base_patch16_224.dinov3", num_classes=0)
model.eval()

embeds = model(x, is_training=False)
```
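
Extracted embeddings are typically compared with cosine similarity, e.g. for retrieval or deduplication. Below is a minimal pure-Python sketch of that comparison, independent of mlx-image; it assumes the embeddings have already been converted to flat Python lists (for instance via `embeds[0].tolist()`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0 (up to float rounding); orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In practice you would batch this with `mx.matmul` on normalized embeddings, but the scalar version above shows the metric itself.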

## Architecture

This model follows the ViT architecture with a patch size of 16. For a 224×224 input this yields **1 class token + 4 register tokens + 196 patch tokens = 201 tokens**.

The model also accepts larger images, provided their height and width are multiples of the patch size (16). If they are not, the model crops each side down to the closest smaller multiple.

| Property | Value |
|---|---|
| Parameters | 86M |
| Patch size | 16 |
| Embedding dim | 768 |
| Depth | 12 |
| Heads | 12 |
| FFN | MLP |
| Position encoding | RoPE |
| Register tokens | 4 |
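
The token arithmetic above can be checked directly. A small sketch (the `n_tokens` helper is hypothetical, not part of mlx-image) that also models the crop-to-smaller-multiple behaviour for odd image sizes:

```python
PATCH_SIZE = 16
NUM_REGISTERS = 4

def n_tokens(height: int, width: int) -> int:
    """Token count for an input image: each side is cropped down to the
    nearest multiple of the patch size, then the sequence is 1 class
    token + register tokens + one token per 16x16 patch."""
    patches_h = height // PATCH_SIZE  # floor division = crop to smaller multiple
    patches_w = width // PATCH_SIZE
    return 1 + NUM_REGISTERS + patches_h * patches_w

print(n_tokens(224, 224))  # 201 = 1 + 4 + 196
print(n_tokens(230, 230))  # also 201: 230 is cropped down to 224
```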

## Available model variants (mlx-image)

| Model name | Params | FFN | IN-ReaL | IN-R | Obj.Net |
|---|---|---|---|---|---|
| `vit_small_patch16_224.dinov3` | 21M | MLP | 87.0 | 60.4 | 50.9 |
| `vit_small_plus_patch16_224.dinov3` | 29M | SwiGLU | 88.0 | 68.8 | 54.6 |
| `vit_base_patch16_224.dinov3` | **86M** | **MLP** | **89.3** | **76.7** | **64.1** |
| `vit_large_patch16_224.dinov3` | 300M | MLP | 90.2 | 88.1 | 74.8 |

## Evaluation results

*Results on global and dense tasks (LVD-1689M pretraining)*

| Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
|---|---|---|---|---|---|---|---|---|---|
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 |

Refer to the [DINOv3 paper](https://arxiv.org/abs/2508.10104) for full evaluation details and protocols.

## Training data

The model was distilled from DINOv3 ViT-7B, which was pretrained on **LVD-1689M**, a curated dataset of 1,689 million public web images collected from Instagram.

## Bias and limitations

DINOv3 delivers generally consistent performance across income categories on geographical fairness benchmarks, though a performance gap between the low-income and high-income buckets remains, and a relative difference is also observed between European and African regions. Fine-tuning may amplify these biases depending on the labels used.

## Acknowledgements

Original model developed by [Meta AI](https://ai.meta.com/dinov3/). See the [blog post](https://ai.meta.com/blog/dinov3-self-supervised-vision-model/) and [paper](https://arxiv.org/abs/2508.10104). Weights ported to MLX by [etornam45](https://github.com/etornam45/mlx_dinov3).