Paper: Visual Transformers: Token-based Image Representation and Processing for Computer Vision (arXiv:2006.03677)
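This model is a fine-tuned version of google/vit-base-patch16-224-in21k on the imagefolder dataset (imagefolder is the generic Datasets loader for a local folder of images, so the underlying data is not identified). On the evaluation set it reaches a validation loss of 0.1044 and an accuracy of 0.9683 after the third and final epoch, as reported in the training results table.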
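As a rough illustration of how a fine-tuned ViT checkpoint like this one might be used for inference, the sketch below wraps the Transformers `image-classification` pipeline. The repository id `your-username/your-finetuned-vit` is a hypothetical placeholder, since this card does not state the model's Hub id, and the helper names `top_prediction` and `classify` are illustrative only.

```python
from typing import Dict, List

# Hypothetical placeholder: replace with the actual Hub repo id of this model.
MODEL_ID = "your-username/your-finetuned-vit"


def top_prediction(predictions: List[Dict]) -> Dict:
    """Return the highest-scoring entry from a list of {"label", "score"} dicts."""
    return max(predictions, key=lambda p: p["score"])


def classify(image_path: str, model_id: str = MODEL_ID) -> Dict:
    """Classify one image and return its most likely label with its score."""
    # Imported lazily so top_prediction() is usable without transformers installed.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=model_id)
    # The pipeline returns a list of {"label": ..., "score": ...} dicts.
    return top_prediction(classifier(image_path))
```

Calling `classify("path/to/image.jpg")` would download the checkpoint on first use, so it needs network access and a valid repo id.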
The full hyperparameter list is not given; the log below shows that training ran for 3 epochs of 2,242 steps each (6,726 steps in total):
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---|---|---|---|---|
| 0.1923 | 1.0 | 2242 | 0.1294 | 0.9563 |
| 0.1569 | 2.0 | 4484 | 0.1086 | 0.9647 |
| 0.1306 | 3.0 | 6726 | 0.1044 | 0.9683 |
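To make the per-epoch gains easier to read, the following minimal sketch converts the accuracies in the table into error rates and computes the relative error reduction between consecutive epochs. The function names are illustrative; the numbers are copied from the table.

```python
# (epoch, training loss, validation loss, accuracy), copied from the table above
RESULTS = [
    (1, 0.1923, 0.1294, 0.9563),
    (2, 0.1569, 0.1086, 0.9647),
    (3, 0.1306, 0.1044, 0.9683),
]


def error_rates(rows):
    """Error rate per epoch, defined as 1 - accuracy."""
    return [round(1.0 - acc, 4) for _, _, _, acc in rows]


def relative_error_reduction(errors):
    """Fraction of the previous epoch's error removed by each later epoch."""
    return [round((prev - cur) / prev, 3) for prev, cur in zip(errors, errors[1:])]


errors = error_rates(RESULTS)
reductions = relative_error_reduction(errors)
print(errors, reductions)
```

The error rate falls from 4.37% after epoch 1 to 3.17% after epoch 3. The second epoch removes roughly 19% of the remaining error and the third roughly 10%, so the gains are already shrinking by the final epoch.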
@misc{wu2020visual,
title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
year={2020},
eprint={2006.03677},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{deng2009imagenet,
title={ImageNet: A large-scale hierarchical image database},
author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
booktitle={2009 IEEE conference on computer vision and pattern recognition},
pages={248--255},
year={2009},
organization={IEEE}
}
@misc{rogge2025transformerstutorials,
author = {Rogge, Niels},
title = {Tutorials},
url = {https://github.com/NielsRogge/Transformers-Tutorials},
year = {2025}
}