An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Paper: arXiv:2010.11929
How to use cs-giung/vit-large-patch16-imagenet21k with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-feature-extraction", model="cs-giung/vit-large-patch16-imagenet21k")

# Load model directly
from transformers import AutoImageProcessor, AutoModel
processor = AutoImageProcessor.from_pretrained("cs-giung/vit-large-patch16-imagenet21k")
model = AutoModel.from_pretrained("cs-giung/vit-large-patch16-imagenet21k")

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
The weights were converted from the ViT-L_16.npz file in the GCS buckets linked from the original repository.
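
To make the "patch16" and "224x224" figures concrete, the following minimal sketch works out how many tokens the encoder sees per image. It assumes the standard ViT layout from the paper (non-overlapping square patches plus one prepended class token) and RGB input; the helper names `vit_token_count` and `patch_dim` are illustrative and not part of the Transformers API.

```python
def vit_token_count(image_size: int = 224, patch_size: int = 16,
                    cls_token: bool = True) -> int:
    """Return the sequence length for a square image split into square patches.

    Assumes non-overlapping patches and, optionally, one class token,
    as in the standard ViT design.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    return num_patches + (1 if cls_token else 0)


def patch_dim(patch_size: int = 16, channels: int = 3) -> int:
    """Return the length of one flattened patch before linear projection."""
    return patch_size * patch_size * channels     # 16 * 16 * 3 = 768


print(vit_token_count())   # 197
print(patch_dim())         # 768
```

Under these assumptions, each 224x224 image becomes 196 patch tokens plus one class token, so the second dimension of the encoder output should be 197.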