cs-giung/vit-large-patch16-imagenet21k

Image Feature Extraction


Vision Transformer

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The weights were converted from the ViT-L_16.npz file in the GCS buckets listed in the original repository.
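As a rough sanity check on the figures on this card, the sketch below works out the token sequence length for a 224x224 input with 16x16 patches and estimates the backbone parameter count. The layer count, hidden size, and MLP size are assumptions taken from the ViT-L/16 configuration in the paper, not values stated on this card, and the helper names are hypothetical. The classification head and any pooler layer are left out.

```python
# Minimal sketch: token count and approximate backbone size for ViT-L/16.
# ASSUMPTION: the hyperparameters below follow the ViT-Large configuration
# from the paper; they are not stated on this model card.
IMAGE_SIZE = 224   # input resolution stated on the card
PATCH_SIZE = 16    # from "patch16" in the model name
CHANNELS = 3       # RGB input (assumed)
HIDDEN = 1024      # assumed ViT-L hidden size
MLP_DIM = 4096     # assumed ViT-L feed-forward size
LAYERS = 24        # assumed ViT-L encoder depth


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def sequence_length(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens plus one [CLS] token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


def backbone_params() -> int:
    """Approximate parameter count of the encoder, without head or pooler."""
    seq_len = sequence_length()
    # Patch projection, [CLS] token, and learned position embeddings.
    embed = linear_params(PATCH_SIZE * PATCH_SIZE * CHANNELS, HIDDEN)
    embed += HIDDEN
    embed += seq_len * HIDDEN
    # Per encoder block: query, key, value, and output projections.
    attn = 4 * linear_params(HIDDEN, HIDDEN)
    # Per encoder block: two-layer feed-forward network.
    mlp = linear_params(HIDDEN, MLP_DIM) + linear_params(MLP_DIM, HIDDEN)
    # Per encoder block: two LayerNorms, each with scale and bias.
    norms = 2 * 2 * HIDDEN
    # Final LayerNorm after the last block.
    final_norm = 2 * HIDDEN
    return embed + LAYERS * (attn + mlp + norms) + final_norm


if __name__ == "__main__":
    print("tokens per image:", sequence_length())
    print("backbone params (millions):", round(backbone_params() / 1e6, 1))
```

Under these assumptions the input becomes 197 tokens (196 patches plus one class token), and the estimate comes to roughly 303 million parameters, which agrees with the 0.3B model size reported for this checkpoint.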


Safetensors · Model size: 0.3B params · Tensor type: F32


Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929, published Oct 22, 2020)