# Vision Transformer (ViT) trained with DINOv2 on ImageNet-1K only
Reproduction of the ViT-L/16 results from the DINOv2 repo, trained only on ImageNet-1K at 224x224 resolution.
The original work trains on the much larger LVD-142M dataset and distills a larger ViT-g/14 teacher into a ViT-L/14 model.
## How to use
```python
import torch

# Load the pretrained ViT-L/16 via torch.hub
model = torch.hub.load("BenediktAlkin/torchhub-ssl", "in1k_dinov2_l16")

# Forward pass on a random image of the expected input size (224x224)
image = torch.randn(1, 3, 224, 224)
features = model(image)
```