# LOCA: Location-Aware Self-Supervised Vision Transformers for Semantic Segmentation
JAX implementation and pretrained models for LOCA. For details, see the [arXiv paper](https://arxiv.org/abs/2212.02400).
## Training
Like other projects in Scenic, all model, training, and dataset parameters are specified using configuration files.
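For orientation, a minimal sketch of what such a config looks like is shown below. Scenic configs are `ml_collections.ConfigDict` objects returned by a `get_config()` function; the field names here are illustrative assumptions, not the actual contents of the LOCA configs (see `scenic/projects/loca/configs/` for those).

```python
# Hypothetical sketch of a Scenic-style config file.
# Field names are illustrative; consult the real configs in
# scenic/projects/loca/configs/ for the actual options.
import ml_collections


def get_config() -> ml_collections.ConfigDict:
  """Returns a hypothetical LOCA pretraining config."""
  config = ml_collections.ConfigDict()
  config.experiment_name = 'loca_imnet1k_base16'

  # Dataset (illustrative field names).
  config.dataset_name = 'imagenet'
  config.batch_size = 1024

  # Model (illustrative field names).
  config.model = ml_collections.ConfigDict()
  config.model.variant = 'B/16'

  # Training schedule (illustrative field names).
  config.num_training_epochs = 100
  config.lr = 1e-3
  return config
```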
An example command line for training ViT-Base/16 on the ImageNet-1k dataset for 100 epochs with this config file:
```sh
$ python -m scenic.projects.loca.main \
  --config=scenic/projects/loca/configs/loca_imnet1k_base16.py \
  --workdir=loca_base/
```
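Since the config is exposed through `ml_collections` config flags, individual fields can typically also be overridden directly on the command line, e.g. `--config.batch_size=512` (field name illustrative; it must exist in the chosen config file).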
The resulting checkpoint should reach 46.2 mIoU after fine-tuning on the ADE20k dataset with the linear decoder from Segmenter.
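If you want to inspect the pretrained weights before fine-tuning, the sketch below shows one way to load the latest checkpoint from the `--workdir`, assuming Scenic's default Flax checkpoint format (the structure of the restored dictionary is an assumption and may differ).

```python
# Sketch: restore the latest checkpoint written to --workdir.
# Assumes the standard Flax checkpoint layout used by Scenic.
from flax.training import checkpoints

# target=None returns the raw checkpoint contents as nested dicts.
state = checkpoints.restore_checkpoint(ckpt_dir='loca_base/', target=None)
print(state.keys())  # e.g. parameters, optimizer state, step counter
```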
## Model Zoo
| Architecture | Pretraining data | ADE20k mIoU | Download |
|---|---|---|---|
| ViT-S/16 | ImageNet-1k | 44.8 | checkpoint |
| ViT-B/16 | ImageNet-1k | 48.0 | checkpoint |
| ViT-B/16 | ImageNet-21k | 48.5 | checkpoint |
| ViT-L/16 | ImageNet-21k | 52.3 | checkpoint |
| ViT-H/16 | ImageNet-21k | 54.3 | checkpoint |
## Citation
If you use LOCA, please cite it with the following BibTeX entry:
```bibtex
@article{caron2022location,
  title={Location-Aware Self-Supervised Vision Transformers for Semantic Segmentation},
  author={Caron, Mathilde and Houlsby, Neil and Schmid, Cordelia},
  journal={arXiv preprint arXiv:2212.02400},
  year={2022}
}
```