ViT-Base/32

Dosovitskiy et al., 2021 — An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929)

Lucid port of torchvision/ViT_B_32_Weights.IMAGENET1K_V1, converted to Lucid-native safetensors.

Available weights

| Tag | acc@1 (%) | acc@5 (%) | Params | GFLOPs | Size | Source |
|---|---|---|---|---|---|---|
| IMAGENET1K_V1 (default) | 75.912 | 92.466 | 88.2M | 4.409 | 336.56 MB | torchvision |
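The 88.2M figure can be reproduced from the standard ViT-Base configuration (hidden size 768, 12 encoder blocks, MLP size 3072, 224x224 input, 32x32 patches, 1000 ImageNet classes). The sketch below is a hypothetical helper, not part of Lucid; it assumes a torchvision-style layout (convolutional patch embedding with bias, a learnable class token, learned position embeddings, pre-norm blocks with biased linear layers, a final LayerNorm and a linear head).

```python
# Hypothetical helper (not a Lucid API): count the parameters of a standard
# ViT classifier, assuming a torchvision-style module layout.

def vit_param_count(image_size=224, patch_size=32, hidden=768, depth=12,
                    mlp_dim=3072, num_classes=1000, channels=3):
    num_patches = (image_size // patch_size) ** 2          # 7 * 7 = 49 for B/32
    patch_embed = channels * patch_size * patch_size * hidden + hidden  # conv weight + bias
    cls_token = hidden
    pos_embed = (num_patches + 1) * hidden                 # +1 slot for the class token
    layer_norm = 2 * hidden                                # weight + bias
    attn = (hidden * 3 * hidden + 3 * hidden) + (hidden * hidden + hidden)  # QKV + output projection
    mlp = (hidden * mlp_dim + mlp_dim) + (mlp_dim * hidden + hidden)        # two linear layers
    block = 2 * layer_norm + attn + mlp                    # two LayerNorms per encoder block
    head = hidden * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head

n = vit_param_count()
print(f"{n:,} parameters")                    # 88,224,232 parameters
print(f"{n * 4 / 2**20:.2f} MiB as float32")  # 336.55 MiB as float32
```

Under these assumptions the float32 tensors alone come to about 336.55 MiB, close to the listed file size; the small remainder is presumably file metadata. Passing `patch_size=16` gives the ViT-B/16 count instead, since only the patch embedding and position embeddings depend on the patch size.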

Usage

import lucid.models as models
from lucid.models.weights import ViTBase32Weights

# default tag
model = models.vit_base_32_cls(pretrained=True)

# explicit tag (enum or string)
model = models.vit_base_32_cls(weights=ViTBase32Weights.IMAGENET1K_V1)
model = models.vit_base_32_cls(pretrained="IMAGENET1K_V1")

# preprocessing travels with the weights
weights = ViTBase32Weights.IMAGENET1K_V1
preprocess = weights.transforms()
# `image` is an input image loaded beforehand; [None] adds the batch dimension
logits = model(preprocess(image)[None]).logits
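The "/32" in the model name is the patch size: at the standard 224x224 crop, each image becomes a 7x7 grid of 49 patches, each flattened to 3 * 32 * 32 = 3072 values before being projected to 768 dimensions, with a class token prepended for 50 tokens in total. The NumPy sketch below illustrates only the patch-splitting step; `patchify` is a hypothetical helper, and the 224x224 input size is an assumption based on the torchvision preprocessing for these weights.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 32) -> np.ndarray:
    """Split a (C, H, W) image into a (num_patches, C * P * P) matrix of flattened patches."""
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "H and W must be multiples of the patch size"
    gh, gw = h // patch_size, w // patch_size
    # (C, gh, P, gw, P) -> (gh, gw, C, P, P): one row per patch, in row-major (raster) order
    patches = image.reshape(c, gh, patch_size, gw, patch_size).transpose(1, 3, 0, 2, 4)
    return patches.reshape(gh * gw, c * patch_size * patch_size)

image = np.random.rand(3, 224, 224).astype(np.float32)  # stand-in for a preprocessed image
tokens = patchify(image)
print(tokens.shape)  # (49, 3072)
```

Each row is flattened in (channel, row, column) order, which matches how a stride-32 convolution with a 32x32 kernel reads its input, so a linear projection of these rows is equivalent to that convolutional patch embedding.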

Conversion

Converted from torchvision/ViT_B_32_Weights.IMAGENET1K_V1 via python -m tools.convert_weights vit_base_32 --tag IMAGENET1K_V1. The key mapping and numerical parity were verified against the source weights.
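The card does not state which inputs or tolerances the parity check used. As an illustration only, a parity check of this kind typically runs the source and converted models on the same batch and compares their logits; the sketch below shows such a comparison in NumPy. `parity_report` and its `atol`/`rtol` defaults are hypothetical choices, not taken from tools.convert_weights.

```python
import numpy as np

def parity_report(reference, candidate, atol=1e-4, rtol=1e-4):
    """Compare logits from the source and converted models, both shaped (N, num_classes)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    return {
        # largest element-wise deviation between the two sets of logits
        "max_abs_diff": float(np.max(np.abs(reference - candidate))),
        # element-wise |candidate - reference| <= atol + rtol * |reference|
        "allclose": bool(np.allclose(candidate, reference, atol=atol, rtol=rtol)),
        # fraction of samples whose predicted class is unchanged by the conversion
        "top1_agreement": float(np.mean(reference.argmax(axis=1) == candidate.argmax(axis=1))),
    }

rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 1000))                # stand-in for source-model logits
cand = ref + 1e-6 * rng.standard_normal(ref.shape)  # stand-in for converted-model logits
print(parity_report(ref, cand))
```

Reporting top-1 agreement alongside the element-wise tolerance separates harmless float32 rounding differences from a real mapping error, such as transposed or misassigned weights, which would change predictions.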

License

bsd-3-clause — inherited from the original weights.

Citation

@inproceedings{dosovitskiy2021image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  booktitle={ICLR}, year={2021}
}