cs-giung/vit-base-patch16-imagenet21k-augreg

Image Classification

Vision Transformer

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" and further improved in the follow-up paper "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers". The weights were converted from the file B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0.npz, which is hosted in the GCS buckets referenced by the original repository.
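The source checkpoint name packs the training recipe into dash-separated fields. The sketch below splits it into a dictionary and derives the token sequence length for 224x224 inputs. The field meanings (lr = learning rate, aug = augmentation preset, wd = weight decay, do = dropout, sd = stochastic depth) are an assumption based on the naming used for the AugReg checkpoints, and the helper names parse_augreg_checkpoint_name and num_tokens are hypothetical, not part of any library.

```python
def parse_augreg_checkpoint_name(filename):
    """Split an AugReg-style checkpoint name into its fields (assumed layout)."""
    stem = filename[:-len(".npz")] if filename.endswith(".npz") else filename
    arch, dataset, epochs, *rest = stem.split("-")
    variant, patch = arch.split("_")              # "B_16" -> "B", "16"
    info = {
        "variant": variant,
        "patch_size": int(patch),
        "dataset": dataset,                       # "i21k"
        "epochs": int(epochs[:-len("ep")]),       # "300ep" -> 300
    }
    for field in rest:                            # e.g. "lr_0.001", "aug_medium1"
        key, value = field.split("_", 1)
        info[key] = value if key == "aug" else float(value)
    return info


def num_tokens(image_size, patch_size):
    """Patches per image plus one class token (standard ViT layout assumed)."""
    per_side = image_size // patch_size
    return per_side * per_side + 1


CKPT = "B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0.npz"
info = parse_augreg_checkpoint_name(CKPT)
tokens = num_tokens(224, info["patch_size"])
print(info)
print("sequence length:", tokens)
```

With a patch size of 16, a 224x224 image yields a 14x14 grid of patches, so the encoder sees 196 patch tokens plus the class token.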

Safetensors

Model size: 0.1B params

Tensor type: F32
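As a rough cross-check of the 0.1B figure, the parameter count can be estimated from the standard ViT-B/16 dimensions (hidden size 768, 12 layers, MLP size 3072). This is a back-of-the-envelope sketch, not a readout of the actual file: the function name vit_backbone_params is hypothetical, and whether the converted weights include a 21,843-way classification head is an assumption the card does not confirm.

```python
def vit_backbone_params(hidden=768, depth=12, mlp_dim=3072,
                        patch=16, image=224, channels=3):
    """Estimate encoder parameters for a standard ViT (assumed ViT-B/16 sizes)."""
    tokens = (image // patch) ** 2 + 1                      # patches + class token
    patch_embed = patch * patch * channels * hidden + hidden
    cls_and_pos = hidden + tokens * hidden                  # class token + position embeddings
    layer_norm = 2 * hidden                                 # scale and bias
    attention = 4 * (hidden * hidden + hidden)              # q, k, v and output projections
    mlp = (hidden * mlp_dim + mlp_dim) + (mlp_dim * hidden + hidden)
    block = 2 * layer_norm + attention + mlp
    return patch_embed + cls_and_pos + depth * block + layer_norm  # + final layer norm


backbone = vit_backbone_params()
head = 768 * 21843 + 21843          # optional ImageNet-21k classifier (assumption)
fp32_bytes = backbone * 4           # F32 stores 4 bytes per parameter

print(f"backbone: {backbone:,} params")
print(f"with head: {backbone + head:,} params")
print(f"backbone size in F32: {fp32_bytes / 1e6:.0f} MB")
```

Both totals round to 0.1B, so the reported model size is consistent with either layout.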

Collection including cs-giung/vit-base-patch16-imagenet21k-augreg

ViT (ImageNet-21k)

4 items • Updated Jul 7, 2024

Papers for cs-giung/vit-base-patch16-imagenet21k-augreg

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. arXiv:2106.10270, published Jun 18, 2021.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, published Oct 22, 2020.