# Vision Transformer (ViT) trained with DINOv2 on ImageNet-1K only
Reproduction of the ViT-L/16 results from the DINOv2 repo, trained only on ImageNet-1K at 224x224 resolution.
The original work trains on the much larger LVD-142M dataset and distills a larger ViT-g/14 teacher into a ViT-L/14 model.
## How to use
```python
import torch

# Load the pretrained ViT-L/16 via torch.hub
model = torch.hub.load("BenediktAlkin/torchhub-ssl", "in1k_dinov2_l16")

# Forward pass on a random image of the expected input size (224x224)
image = torch.randn(1, 3, 224, 224)
features = model(image)
```