# LOCA: Location-Aware Self-Supervised Vision Transformers for Semantic Segmentation
JAX implementation and pretrained models for LOCA. For details, see the [arXiv paper](https://arxiv.org/abs/2212.02400).
## Training
Like other projects in Scenic, all model, training, and dataset parameters are specified using configuration files.
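For orientation, a minimal sketch of what such a config looks like is shown below. Scenic configs are `ml_collections.ConfigDict` objects returned by a `get_config()` function; the field names here are illustrative assumptions, not the actual contents of the LOCA configs (see `scenic/projects/loca/configs/` for those).

```python
# Hypothetical sketch of a Scenic-style config file.
# Field names are illustrative; consult the real configs in
# scenic/projects/loca/configs/ for the actual options.
import ml_collections


def get_config() -> ml_collections.ConfigDict:
  """Returns a hypothetical LOCA pretraining config."""
  config = ml_collections.ConfigDict()
  config.experiment_name = 'loca_imnet1k_base16'

  # Dataset (illustrative field names).
  config.dataset_name = 'imagenet'
  config.batch_size = 1024

  # Model (illustrative field names).
  config.model = ml_collections.ConfigDict()
  config.model.variant = 'B/16'

  # Training schedule (illustrative field names).
  config.num_training_epochs = 100
  config.lr = 1e-3
  return config
```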
An example command line for training ViT-Base/16 on the ImageNet-1k dataset for 100 epochs with this config file:
```sh
$ python -m scenic.projects.loca.main \
  --config=scenic/projects/loca/configs/loca_imnet1k_base16.py \
  --workdir=loca_base/
```
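Since the config is exposed through `ml_collections` config flags, individual fields can typically also be overridden directly on the command line, e.g. `--config.batch_size=512` (field name illustrative; it must exist in the chosen config file).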
The resulting checkpoint should reach 46.2 mIoU after fine-tuning on the ADE20k dataset with the linear decoder from Segmenter.
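If you want to inspect the pretrained weights before fine-tuning, the sketch below shows one way to load the latest checkpoint from the `--workdir`, assuming Scenic's default Flax checkpoint format (the structure of the restored dictionary is an assumption and may differ).

```python
# Sketch: restore the latest checkpoint written to --workdir.
# Assumes the standard Flax checkpoint layout used by Scenic.
from flax.training import checkpoints

# target=None returns the raw checkpoint contents as nested dicts.
state = checkpoints.restore_checkpoint(ckpt_dir='loca_base/', target=None)
print(state.keys())  # e.g. parameters, optimizer state, step counter
```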
## Model Zoo
| Architecture | Pretraining data | ADE20k mIoU | Download |
|---|---|---|---|
| ViT-S/16 | ImageNet-1k | 44.8 | checkpoint |
| ViT-B/16 | ImageNet-1k | 48.0 | checkpoint |
| ViT-B/16 | ImageNet-21k | 48.5 | checkpoint |
| ViT-L/16 | ImageNet-21k | 52.3 | checkpoint |
| ViT-H/16 | ImageNet-21k | 54.3 | checkpoint |
## Citation
If you use LOCA, please cite it with the following BibTeX entry:
```bibtex
@article{caron2022location,
  title={Location-Aware Self-Supervised Vision Transformers for Semantic Segmentation},
  author={Caron, Mathilde and Houlsby, Neil and Schmid, Cordelia},
  journal={arXiv preprint arXiv:2212.02400},
  year={2022}
}
```