vit_base_patch16_224.dinov3

A Vision Transformer feature extraction model trained on the LVD-1689M web dataset with DINOv3.

The model was trained in a self-supervised fashion. No classification head was trained, only the backbone. This is the ViT-B/16 variant (86M parameters), distilled from the DINOv3 ViT-7B teacher model.

Disclaimer: This is a port of the Meta AI DINOv3 model weights to the Apple MLX framework.

How to use

pip install mlx-image

Here is how to use this model for feature extraction:

import mlx.core as mx
from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

# Standard ImageNet validation preprocessing (resize, center crop, normalize)
transform = ImageNetTransform(train=False, img_size=224)
x = transform(read_rgb("image.png"))
x = mx.expand_dims(x, 0)  # add the batch dimension

# Load the pretrained DINOv3 backbone and switch to inference mode
model = create_model("vit_base_patch16_224.dinov3")
model.eval()

embeds = model(x, is_training=False)
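
The returned embeddings can be used directly for downstream tasks such as image retrieval. A minimal sketch of cosine similarity between two images, assuming the model returns one embedding vector per image (the file names below are placeholders):

import mlx.core as mx
from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

transform = ImageNetTransform(train=False, img_size=224)
model = create_model("vit_base_patch16_224.dinov3")
model.eval()

# Batch the two (placeholder) images and extract their embeddings
x = mx.stack([transform(read_rgb("image_a.png")), transform(read_rgb("image_b.png"))])
embeds = model(x, is_training=False)

# Cosine similarity between the two embedding vectors
a, b = embeds[0], embeds[1]
similarity = mx.sum(a * b) / (mx.linalg.norm(a) * mx.linalg.norm(b))
print(similarity.item())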

You can also retrieve the embeddings from the layer before the head by creating the model with num_classes=0:

import mlx.core as mx
from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

transform = ImageNetTransform(train=False, img_size=224)
x = transform(read_rgb("image.png"))
x = mx.expand_dims(x, 0)  # add the batch dimension

# num_classes=0 drops the head and returns the features from the layer before it
model = create_model("vit_base_patch16_224.dinov3", num_classes=0)
model.eval()

embeds = model(x, is_training=False)

Architecture

This model follows the ViT architecture with a patch size of 16. For a 224×224 image this results in 1 class token + 4 register tokens + 196 patch tokens = 201 tokens.
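
As a quick check of the token arithmetic in plain Python:

img_size, patch_size = 224, 16
num_patches = (img_size // patch_size) ** 2  # 14 x 14 = 196 patch tokens
num_tokens = 1 + 4 + num_patches             # class token + register tokens + patch tokens
print(num_tokens)  # 201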

The model can also accept larger images, provided the image height and width are multiples of the patch size (16). If this condition is not met, the model crops the input to the closest smaller multiple.
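
For example, a 256×256 input (a multiple of 16) is processed without cropping. A minimal sketch, assuming ImageNetTransform accepts other square sizes through img_size:

import mlx.core as mx
from mlxim.model import create_model
from mlxim.io import read_rgb
from mlxim.transform import ImageNetTransform

# 256 is a multiple of the patch size (16), so the full image is used
transform = ImageNetTransform(train=False, img_size=256)
x = mx.expand_dims(transform(read_rgb("image.png")), 0)

model = create_model("vit_base_patch16_224.dinov3")
model.eval()
embeds = model(x, is_training=False)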

| Property | Value |
|---|---|
| Parameters | 86M |
| Patch size | 16 |
| Embedding dim | 768 |
| Depth | 12 |
| Heads | 12 |
| FFN | MLP |
| Position encoding | RoPE |
| Register tokens | 4 |

Available model variants (mlx-image)

| Model name | Params | FFN | IN-ReaL | IN-R | Obj.Net |
|---|---|---|---|---|---|
| vit_small_patch16_224.dinov3 | 21M | MLP | 87.0 | 60.4 | 50.9 |
| vit_small_plus_patch16_224.dinov3 | 29M | SwiGLU | 88.0 | 68.8 | 54.6 |
| vit_base_patch16_224.dinov3 | 86M | MLP | 89.3 | 76.7 | 64.1 |
| vit_large_patch16_224.dinov3 | 300M | MLP | 90.2 | 88.1 | 74.8 |
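
Any of the variants above can be loaded through the same create_model API; only the model name changes:

from mlxim.model import create_model

# Example: load the ViT-L/16 variant listed in the table above
model = create_model("vit_large_patch16_224.dinov3")
model.eval()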

Evaluation results

Results on global and dense tasks (LVD-1689M pretraining)

| Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
|---|---|---|---|---|---|---|---|---|---|
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 |

Refer to the DINOv3 paper for full evaluation details and protocols.

Training data

The model was distilled from DINOv3 ViT-7B, which was pretrained on LVD-1689M, a curated dataset of 1,689 million images collected from public Instagram posts.

Bias and limitations

DINOv3 delivers generally consistent performance across income categories on geographical fairness benchmarks, though a performance gap between low-income and high-income buckets remains. A relative difference is also observed between European and African regions. Fine-tuning may amplify these biases depending on the fine-tuning labels used.

Acknowledgements

Original model developed by Meta AI. See the blog post and paper. Weights ported to MLX by etornam45.
