# vit_small_patch16_224.dinov3
A Vision Transformer feature extraction model trained on the LVD-1689M web dataset with DINOv3.
The model was trained in a self-supervised fashion; no classification head was trained, only the backbone. This is the ViT-S/16 variant (21M parameters), distilled from the DINOv3 ViT-7B teacher model.
Disclaimer: This is a port of the Meta AI DINOv3 model weights to the Apple MLX framework.
## How to use
```bash
pip install mlx-image
```
Here is how to use this model for feature extraction:
```python
import mlx.core as mx

from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

# Preprocess a single image and add a batch dimension
transform = ImageNetTransform(train=False, img_size=224)
x = transform(read_rgb("image.png"))
x = mx.expand_dims(x, 0)

model = create_model("vit_small_patch16_224.dinov3")
model.eval()

embeds = model(x, is_training=False)
```
You can also use the embeddings from the layer before the head:
```python
import mlx.core as mx

from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

transform = ImageNetTransform(train=False, img_size=224)
x = transform(read_rgb("image.png"))
x = mx.expand_dims(x, 0)

# num_classes=0 returns the embeddings from the layer before the head
model = create_model("vit_small_patch16_224.dinov3", num_classes=0)
model.eval()

embeds = model(x, is_training=False)
```
## Architecture
This model follows the ViT architecture with a patch size of 16. For a 224×224 image this results in 1 class token + 4 register tokens + 196 patch tokens = 201 tokens.
The model can accept larger images provided the image shapes are multiples of the patch size (16). If this condition is not met, the model will crop to the closest smaller multiple.
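The token arithmetic above can be checked with a quick calculation. This is a plain-Python sketch (not part of the mlx-image API) that also mirrors the crop-to-smaller-multiple behavior for non-multiple input sizes:

```python
def token_count(img_size: int, patch_size: int = 16, num_registers: int = 4) -> int:
    """Number of tokens the ViT produces for a square input:
    1 class token + register tokens + one token per patch."""
    # Crop to the closest smaller multiple of the patch size,
    # mirroring the model's behavior for non-multiple inputs.
    side = (img_size // patch_size) * patch_size
    patches = (side // patch_size) ** 2
    return 1 + num_registers + patches

print(token_count(224))  # 1 + 4 + 196 = 201
print(token_count(230))  # 230 crops down to 224, so also 201
print(token_count(256))  # 1 + 4 + 256 = 261
```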
Key architectural features over DINOv2:
- RoPE: Rotary Position Embeddings for 2D images
- SwiGLU: Efficient SwiGLU feed-forward networks
- LayerScale: For improved training stability in deep transformers
- Register tokens: 4 additional register/storage tokens
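To make the SwiGLU and LayerScale components concrete, here is a minimal NumPy sketch of both operations. Shapes, weight names, and the hidden size are illustrative assumptions, not the mlx-image implementation:

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: gate with silu(x @ w_gate), multiply by
    x @ w_up, then project back down to the embedding dimension."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def layer_scale(x, gamma):
    """LayerScale: per-channel learnable scaling of a residual branch,
    initialized to a small value for stability in deep transformers."""
    return x * gamma

rng = np.random.default_rng(0)
dim, hidden = 384, 1024                   # ViT-S embed dim; hidden size is illustrative
tokens = rng.standard_normal((201, dim))  # 201 tokens for a 224x224 image

w_gate = rng.standard_normal((dim, hidden)) * 0.02
w_up = rng.standard_normal((dim, hidden)) * 0.02
w_down = rng.standard_normal((hidden, dim)) * 0.02
gamma = np.full(dim, 1e-5)                # typical small LayerScale init

# Residual block: x + LayerScale(SwiGLU-FFN(x))
out = tokens + layer_scale(swiglu_ffn(tokens, w_gate, w_up, w_down), gamma)
print(out.shape)  # (201, 384)
```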
## Available model variants (mlx-image)
| Model name | Params | Embed dim | Heads | FFN |
|---|---|---|---|---|
| vit_small_patch16_224.dinov3 | 21M | 384 | 6 | MLP + RoPE |
| vit_base_patch16_224.dinov3 | 86M | 768 | 12 | MLP + RoPE |
| vit_large_patch16_224.dinov3 | 300M | 1024 | 16 | MLP + RoPE |
## Evaluation results
Results on global and dense tasks (LVD-1689M pretraining)
| Model | IN-ReaL | IN-R | Obj.Net | ADE20k | NYUv2 ↓ | DAVIS |
|---|---|---|---|---|---|---|
| DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 47.0 | 0.403 | 72.7 |
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 51.8 | 0.373 | 77.2 |
| DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 54.9 | 0.352 | 79.9 |
Refer to the DINOv3 paper for full evaluation details and protocols.
## Training data
The model was distilled from DINOv3 ViT-7B, which was pretrained on LVD-1689M — a curated dataset of 1,689 million images from public web sources.
## Bias and limitations
DINOv3 delivers generally consistent performance across income categories on geographical fairness benchmarks, though a performance gap remains between the low-income and high-income buckets, and a relative difference is also observed between European and African regions. Fine-tuning may amplify these biases depending on the labels used.
## Acknowledgements
Original model developed by Meta AI. See the blog post and paper. Weights ported to MLX by etornam45.