---
license: cc-by-4.0
datasets:
- imagenet-1k
metrics:
- accuracy
pipeline_tag: image-classification
language:
- en
tags:
- vision transformer
- simpool
- dino
- computer vision
- deep learning
---

# Self-supervised ViT-S/16 (small-sized Vision Transformer with patch size 16) model with SimPool

ViT-S model with SimPool (gamma=1.25), trained on ImageNet-1k for 300 epochs using self-supervision with [DINO](https://arxiv.org/abs/2104.14294).

SimPool is a simple attention-based pooling method applied at the end of the network, introduced in this ICCV 2023 [paper](https://arxiv.org/pdf/2309.06891.pdf) and released in this [repository](https://github.com/billpsomas/simpool/).

Disclaimer: This model card was written by the author of SimPool, [Bill Psomas](http://users.ntua.gr/psomasbill/).

## Motivation

Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different?

As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem?

## Method

SimPool is a simple attention-based pooling mechanism that serves as a replacement for the default pooling of both convolutional and transformer encoders. For transformers, we completely discard the [CLS] token.

Interestingly, we find that, whether supervised or self-supervised, SimPool improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases. One could thus call SimPool universal.
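
To make the idea concrete, below is a minimal, illustrative sketch of single-query attention pooling over patch tokens, with the query initialized by global average pooling. This is a simplification for exposition only, not the official SimPool implementation (which, among other details, includes the gamma parameter used by this model); see the repository linked above for the exact code.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Illustrative single-query attention pooling over patch tokens.
    Not the official SimPool code (e.g. no gamma exponent)."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, num_patches, dim) patch tokens, no [CLS] token
        q = self.q(x.mean(dim=1, keepdim=True))        # (B, 1, D) query from GAP of patches
        k = self.k(x)                                  # (B, N, D) keys
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, 1, N) attention logits
        attn = attn.softmax(dim=-1)                    # attention map over patches
        pooled = attn @ x                              # (B, 1, D) attention-weighted pooling
        return pooled.squeeze(1), attn.squeeze(1)

tokens = torch.randn(2, 196, 384)  # e.g. ViT-S/16 patch tokens for a 224x224 image
pool = AttentionPooling(dim=384)
vec, attn_map = pool(tokens)
print(vec.shape, attn_map.shape)   # torch.Size([2, 384]) torch.Size([2, 196])
```

The attention map returned by such a pooling layer is what can be visualized to delineate object boundaries, as described above.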

## Evaluation with k-NN

| k | top-1 (%) | top-5 (%) |
| --- | --- | --- |
| 10 | 72.56 | 87.638 |
| 20 | 72.434 | 89.24 |
| 100 | 70.526 | 90.582 |
| 200 | 69.33 | 90.424 |
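
For context, k-NN evaluation classifies each validation image by comparing its frozen backbone feature against the training-set features. Below is a minimal sketch of a cosine-similarity weighted k-NN classifier in the spirit of the DINO evaluation protocol; it assumes features and labels have already been extracted, and it is not the exact script used to produce the numbers above.

```python
import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    """Weighted k-NN on L2-normalized features (cosine similarity). Illustrative only."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    sims = test_feats @ train_feats.t()        # (num_test, num_train) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)  # k nearest training samples per test image
    topk_labels = train_labels[topk_idx]       # (num_test, k) labels of those neighbors

    weights = (topk_sims / T).exp()            # similarity-weighted votes
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)  # accumulate votes per class
    return votes.argmax(dim=1)                   # predicted class per test image

# Hypothetical usage with pre-extracted features:
# preds = knn_classify(train_feats, train_labels, val_feats, k=20)
# top1 = (preds == val_labels).float().mean() * 100
```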

## BibTeX entry and citation info

```
@misc{psomas2023simpool,
      title={Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?},
      author={Bill Psomas and Ioannis Kakogeorgiou and Konstantinos Karantzalos and Yannis Avrithis},
      year={2023},
      eprint={2309.06891},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```