---
license: cc-by-4.0
datasets:
- imagenet-1k
metrics:
- accuracy
pipeline_tag: image-classification
language:
- en
tags:
- resnet
- convolutional neural network
- simpool
- dino
- computer vision
- deep learning
---

# Self-supervised ResNet-50 model with SimPool

ResNet-50 model with SimPool (gamma=2.0), trained on ImageNet-1k for 100 epochs with [DINO](https://arxiv.org/abs/2104.14294) self-supervision.

SimPool is a simple attention-based pooling method applied at the end of the network, introduced in this ICCV 2023 [paper](https://arxiv.org/pdf/2309.06891.pdf) and released in this [repository](https://github.com/billpsomas/simpool/).
Disclaimer: This model card was written by the author of SimPool, i.e. [Bill Psomas](http://users.ntua.gr/psomasbill/).

## Motivation

Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different?
As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem?

## Method

SimPool is a simple attention-based pooling mechanism that replaces the default pooling in both convolutional and transformer encoders. For transformers, we completely discard the [CLS] token.
Interestingly, we find that, whether supervised or self-supervised, SimPool improves performance on pre-training and downstream tasks and provides attention maps that delineate object boundaries in all cases.
One could thus call SimPool universal.

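To make the idea of attention-based pooling concrete, here is a minimal NumPy sketch of a generic single-query attention pool over a flattened feature map. It is an illustration under stated assumptions, not the SimPool implementation: the function name `attention_pool`, the projection matrices `w_q` and `w_k`, and the choice to initialise the query from the average-pooled features are all hypothetical, and the sketch leaves out details such as normalisation layers and the gamma parameter mentioned above. See the repository for the actual method.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feats, w_q, w_k):
    """Pool (n, d) features into one (d,) vector with a single query.

    feats: n spatial locations (a flattened feature map) with d channels.
    w_q, w_k: hypothetical (d, d) projections for the query and the keys.
    """
    n, d = feats.shape
    query = feats.mean(axis=0) @ w_q            # (d,) query built from average pooling
    keys = feats @ w_k                          # (n, d) one key per location
    attn = softmax(keys @ query / np.sqrt(d))   # (n,) weights over locations, sum to 1
    pooled = attn @ feats                       # (d,) attention-weighted average
    return pooled, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(49, 8))   # e.g. a 7x7 feature map with 8 channels
w_q = rng.normal(size=(8, 8))
w_k = rng.normal(size=(8, 8))

pooled, attn = attention_pool(feats, w_q, w_k)
attn_map = attn.reshape(7, 7)      # the same weights viewed as a spatial attention map
```

The attention weights double as a spatial map over the feature grid, which is why a pooling layer of this kind yields attention maps as a by-product. With uniform weights the sketch reduces to plain global average pooling.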
## Evaluation with k-NN

| k | top-1 (%) | top-5 (%) |
| ------- | ------- | ------- |
| 10 | 63.828 | 81.82 |
| 20 | 63.502 | 83.824 |
| 100 | 60.658 | 84.716 |
| 200 | 58.66 | 83.846 |

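For readers unfamiliar with k-NN evaluation of frozen features, the sketch below shows one common form: a similarity-weighted vote among the k nearest training features. This card does not state the exact protocol behind the numbers above, so everything here is an assumption made for illustration, including the function name `knn_predict`, the use of cosine similarity, and the temperature value of 0.07.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, train_feats, train_labels, k=10, temperature=0.07):
    """Rank classes for `query` by a similarity-weighted vote of its k nearest neighbours."""
    sims = [(cosine(query, f), label) for f, label in zip(train_feats, train_labels)]
    neighbours = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in neighbours:
        votes[label] += math.exp(sim / temperature)  # closer neighbours count more
    return sorted(votes, key=votes.get, reverse=True)  # best class first

# Toy data: two classes in a 2-D feature space (illustrative only).
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]

ranking = knn_predict([0.8, 0.2], train_feats, train_labels, k=3)
top1 = ranking[0]
```

Under this scheme, a test image counts as top-1 correct when its true label is `ranking[0]`, and as top-5 correct when the label appears in `ranking[:5]`.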
## BibTeX entry and citation info

```
@misc{psomas2023simpool,
  title={Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?},
  author={Bill Psomas and Ioannis Kakogeorgiou and Konstantinos Karantzalos and Yannis Avrithis},
  year={2023},
  eprint={2309.06891},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```