
	# vit_base_patch16_224
	Implementation of the Vision Transformer (ViT), proposed in [An Image is
	Worth 16x16 Words: Transformers for Image Recognition at
	Scale](https://arxiv.org/pdf/2010.11929.pdf)

	The following image from the authors shows the architecture.

	![image](https://github.com/FrancescoSaverioZuppichini/glasses/blob/develop/docs/_static/images/ViT.png?raw=true)

	``` python
	ViT.vit_small_patch16_224()
	ViT.vit_base_patch16_224()
	ViT.vit_base_patch16_384()
	ViT.vit_base_patch32_384()
	ViT.vit_huge_patch16_224()
	ViT.vit_huge_patch32_384()
	ViT.vit_large_patch16_224()
	ViT.vit_large_patch16_384()
	ViT.vit_large_patch32_384()
	```
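The variant names above encode the patch size and the input resolution, which together fix the length of the token sequence the encoder sees. The following is a minimal sketch of that arithmetic, assuming square images, non-overlapping patches, and one extra class token; `vit_token_count` is a hypothetical helper written for illustration and is not part of glasses.

```python
import re


def vit_token_count(variant: str) -> int:
    """Sequence length (patch tokens + 1 class token) for a name such as
    'vit_base_patch16_224'. Hypothetical helper, not part of glasses."""
    match = re.fullmatch(r"vit_([a-z]+)_patch(\d+)_(\d+)", variant)
    if match is None:
        raise ValueError(f"unrecognised variant name: {variant!r}")
    patch_size, image_size = int(match.group(2)), int(match.group(3))
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus the class token (assumed here).
    return patches_per_side ** 2 + 1


for name in ("vit_base_patch16_224", "vit_base_patch16_384", "vit_base_patch32_384"):
    print(name, vit_token_count(name))
# vit_base_patch16_224 197
# vit_base_patch16_384 577
# vit_base_patch32_384 145
```

Smaller patches or larger inputs give longer sequences, so self-attention in the `patch16_384` variants is noticeably more expensive than in the `patch16_224` ones.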

	Examples:

	``` python
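	import torch
	from torch import nn
	# import path assumed from the glasses package layout; adjust to your installed version
	from glasses.models import ViT

	# change activation
	ViT.vit_base_patch16_224(activation=nn.SELU)
	# change number of classes (default is 1000)
	ViT.vit_base_patch16_224(n_classes=100)
	# pass a different block, default is TransformerEncoderBlock
	# (MyCoolTransformerBlock is a placeholder for your own block class)
	ViT.vit_base_patch16_224(block=MyCoolTransformerBlock)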
	# get features
	model = ViT.vit_base_patch16_224()
	# first access .features: this activates the forward hooks and tells the model you'd like to collect the features
	model.encoder.features
	model(torch.randn((1, 3, 224, 224)))
	# get the features from the encoder
	features = model.encoder.features
	print([x.shape for x in features])
	# [torch.Size([1, 197, 768]), torch.Size([1, 197, 768]), ...]
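	# change the tokens, you have to subclass ViTTokens
	# (import path assumed; adjust to your installed version of glasses)
	from glasses.models.classification.vit import ViTTokens

	class MyTokens(ViTTokens):
	    def __init__(self, emb_size: int):
	        super().__init__(emb_size)
	        # extra learnable token, registered alongside the default ones
	        self.my_new_token = nn.Parameter(torch.randn(1, 1, emb_size))

	ViT(tokens=MyTokens)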
	```
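The feature shape `[1, 197, 768]` printed above follows from how the image is turned into tokens: a 224x224 input is cut into 14x14 = 196 patches of 16x16 pixels, each projected to 768 dimensions, with one class token prepended. Below is a minimal sketch of that step in plain PyTorch. `PatchEmbeddingSketch` is a hypothetical module written for illustration; it follows the scheme described in the paper and is not the glasses implementation.

```python
import torch
from torch import nn


class PatchEmbeddingSketch(nn.Module):
    """Illustrative patch embedding; not the glasses implementation."""

    def __init__(self, in_channels: int = 3, patch_size: int = 16,
                 emb_size: int = 768, img_size: int = 224):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.projection = nn.Conv2d(in_channels, emb_size,
                                    kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.randn(1, 1, emb_size))
        num_patches = (img_size // patch_size) ** 2
        # Learnable position embedding for the class token plus every patch.
        self.positions = nn.Parameter(torch.randn(1, num_patches + 1, emb_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.projection(x)              # (batch, emb, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)    # (batch, num_patches, emb)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)      # prepend the class token
        return x + self.positions


tokens = PatchEmbeddingSketch()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])
```

Each encoder block keeps this shape unchanged, which is why every entry in the `features` list has the same size.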