ViT-Base/32

Dosovitskiy et al., 2021 — An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929)

Lucid port of torchvision/ViT_B_32_Weights.IMAGENET1K_V1, converted to Lucid-native safetensors.

Available weights

| Tag | acc@1 | acc@5 | Params | GFLOPs | Size | Source |
|---|---|---|---|---|---|---|
| `IMAGENET1K_V1` (default) | 75.912 | 92.466 | 88.2M | 4.409 | 336.56 MB | torchvision |
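
The Params and Size figures can be sanity-checked from the standard ViT-B/32 hyperparameters. The sketch below assumes the port keeps the torchvision layer layout (224x224 input, 32x32 patches, hidden size 768, 12 encoder layers, MLP width 3072, 1000 classes) and float32 storage; the function name `vit_param_count` is illustrative and not part of Lucid.

```python
# Sketch: count ViT-B/32 parameters from its hyperparameters.
# Assumes the torchvision layout; `vit_param_count` is a hypothetical helper.

def vit_param_count(image_size=224, patch_size=32, hidden=768, layers=12,
                    mlp_dim=3072, num_classes=1000, channels=3):
    num_patches = (image_size // patch_size) ** 2              # 7 * 7 = 49
    patch_embed = channels * patch_size ** 2 * hidden + hidden  # conv weight + bias
    cls_token = hidden
    pos_embed = (num_patches + 1) * hidden                      # +1 for the class token

    layer_norm = 2 * hidden                                     # weight + bias
    attention = 3 * hidden * hidden + 3 * hidden                # q/k/v projections
    attention += hidden * hidden + hidden                       # output projection
    mlp = (hidden * mlp_dim + mlp_dim) + (mlp_dim * hidden + hidden)
    per_layer = 2 * layer_norm + attention + mlp                # two norms per block

    head = hidden * num_classes + num_classes
    # The trailing `layer_norm` is the final norm applied after the encoder.
    return (patch_embed + cls_token + pos_embed
            + layers * per_layer + layer_norm + head)

total = vit_param_count()
size_mib = total * 4 / 2 ** 20   # float32 = 4 bytes per parameter
print(total, round(size_mib, 2))
```

Under these assumptions the count rounds to the 88.2M in the table, and the float32 footprint lands close to the listed size if "MB" is read as MiB.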

Usage

```python
import lucid.models as models
from lucid.models.weights import ViTBase32Weights

# default tag
model = models.vit_base_32_cls(pretrained=True)

# explicit tag (enum or string)
model = models.vit_base_32_cls(weights=ViTBase32Weights.IMAGENET1K_V1)
model = models.vit_base_32_cls(pretrained="IMAGENET1K_V1")

# preprocessing travels with the weights
weights = ViTBase32Weights.IMAGENET1K_V1
preprocess = weights.transforms()
logits = model(preprocess(image)[None]).logits
```
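
The acc@1 and acc@5 figures are top-1 and top-5 accuracy, so a common next step is to read the top-k classes out of `logits`. The following is a minimal pure-Python sketch that assumes the logits for one image have already been converted to a flat list of floats; how to do that conversion from a Lucid tensor is not covered here, and the helper names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k=5):
    """Return the k highest-probability (class_index, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

def in_top_k(logits, target, k):
    """True if `target` is among the k top-scoring classes (the acc@k criterion)."""
    return target in [i for i, _ in top_k(logits, k)]

# Toy 6-class example; a real ImageNet-1K logits vector has 1000 entries.
example_logits = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2]
predictions = top_k(example_logits, k=3)
for index, prob in predictions:
    print(index, round(prob, 3))
```

Averaging `in_top_k(...)` over a labeled validation set with k=1 or k=5 gives the corresponding accuracy.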

Conversion

Converted from torchvision/ViT_B_32_Weights.IMAGENET1K_V1 with `python -m tools.convert_weights vit_base_32 --tag IMAGENET1K_V1`. The key mapping and numerical parity were both verified against the source weights.
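
To illustrate what "key mapping" and "numerical parity" mean in practice, here is a rough sketch of the two checks. It is not the implementation of `tools.convert_weights`: tensors are stood in for by flat lists of floats, the target key names on the right-hand side of `key_map` are hypothetical, and the tolerance is an assumed value.

```python
def remap_keys(state, key_map):
    """Rename every key of `state` via `key_map`; fail loudly on unmapped keys."""
    missing = [k for k in state if k not in key_map]
    if missing:
        raise KeyError(f"unmapped source keys: {missing}")
    return {key_map[k]: v for k, v in state.items()}

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat lists."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

def has_parity(reference, candidate, atol=1e-5):
    """True if the two outputs agree within an absolute tolerance (assumed 1e-5)."""
    return max_abs_diff(reference, candidate) <= atol

# Source key names follow torchvision's ViT; target names are hypothetical.
key_map = {
    "conv_proj.weight": "patch_embed.proj.weight",
    "heads.head.weight": "classifier.weight",
}
source_state = {"conv_proj.weight": [0.1, 0.2], "heads.head.weight": [0.3]}
converted = remap_keys(source_state, key_map)

# Stand-ins for logits from the source model and the converted model.
reference_out = [1.000000, -2.500000, 0.250000]
converted_out = [1.000001, -2.500002, 0.249999]
print(sorted(converted), has_parity(reference_out, converted_out))
```

Raising on unmapped keys rather than skipping them is a deliberate choice in this sketch: a silently dropped tensor would leave a layer at its random initialization.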

License

bsd-3-clause — inherited from the original weights.

Citation

```bibtex
@inproceedings{dosovitskiy2021image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  booktitle={ICLR},
  year={2021}
}
```

Evaluation results

Self-reported on ImageNet-1K: acc@1 75.912, acc@5 92.466.