# MaxViT: Multi-Axis Vision Transformer (ECCV 2022)

Paper

⚠️ DISCLAIMER: This implementation is still under development.

[TOC]

MaxViT is a family of hybrid (CNN + ViT) vision backbones that outperforms both state-of-the-art ConvNets and Transformers across the board in parameter and FLOPs efficiency (Blog). It also scales well to large datasets such as ImageNet-21K. Notably, because its grid attention has linear complexity in the input size, MaxViT scales well to tasks requiring large images, such as object detection and segmentation.

MaxViT meta-architecture: a homogeneously stacked backbone in which each MaxViT block consists of an MBConv layer, block attention (window-based local attention), and grid attention (dilated global attention).
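The key difference between the two attention types is how the feature map is partitioned into groups before self-attention is applied. The sketch below illustrates this partitioning with plain NumPy reshapes (a simplified illustration, not the repository's implementation; the function names are ours): block attention groups contiguous `window x window` patches, while grid attention groups tokens spaced `H // grid` apart, giving each group a dilated, global receptive field.

```python
import numpy as np

def block_partition(x, window):
    # Block (window) attention: split (H, W, C) into contiguous
    # window x window patches -> (num_windows, window*window, C).
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two window axes together
    return x.reshape(-1, window * window, C)

def grid_partition(x, grid):
    # Grid attention: split (H, W, C) into a grid x grid set of groups
    # whose tokens are spaced H // grid apart -> (num_groups, grid*grid, C).
    H, W, C = x.shape
    x = x.reshape(grid, H // grid, grid, W // grid, C)
    x = x.transpose(1, 3, 0, 2, 4)  # group by position within each cell
    return x.reshape(-1, grid * grid, C)

feat = np.arange(8 * 8).reshape(8, 8, 1)   # toy 8x8 feature map
blocks = block_partition(feat, window=4)   # 4 local 4x4 windows
grids = grid_partition(feat, grid=4)       # 4 dilated 4x4 groups
print(blocks.shape, grids.shape)           # (4, 16, 1) (4, 16, 1)
```

In both cases self-attention is then computed within each group of `window * window` (or `grid * grid`) tokens, so its cost stays fixed per group and grows only linearly with the number of groups, i.e. with the image size.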

Results on ImageNet-1k standard train and test:

Results on ImageNet-21k and JFT pre-trained models:

## Model Performance

Note: The DeiT ImageNet pretrain experimental settings differ from the paper. These experiments follow the pre-training hyperparameters in the paper and run pre-training only for a similar number of steps. The paper additionally applies a short fine-tuning stage with different hyperparameters and EMA.

### DeiT ImageNet pretrain {.new-tab}

| Model | Eval Size | Top-1 Acc | Acc on Paper | #Param | #FLOPs | Config |
|---|---|---|---|---|---|---|
| MaxViT-Tiny | 224x224 | 83.1 (-0.5) | 83.6 | 31M | 5.6G | config |
| MaxViT-Small | 224x224 | 84.1 (-0.3) | 84.4 | 69M | 11.7G | config |
| MaxViT-Base | 224x224 | 84.2 (-0.7) | 84.9 | 120M | 23.4G | config |
| MaxViT-Large | 224x224 | 84.6 (-0.6) | 85.2 | 212M | 43.9G | config |
| MaxViT-XLarge | 224x224 | 84.8 | - | 475M | 97.9G | config |

### Cascade RCNN models {.new-tab}

| Model | Image Size | Window Size | Epochs | box AP | box AP on paper | mask AP | Config |
|---|---|---|---|---|---|---|---|
| MaxViT-Tiny | 640x640 | 20x20 | 200 | 49.97 | - | 42.69 | config |
| MaxViT-Tiny | 896x896 | 28x28 | 200 | 52.35 (+0.25) | 52.1 | 44.69 | - |
| MaxViT-Small | 640x640 | 20x20 | 200 | 50.79 | - | 43.36 | - |
| MaxViT-Small | 896x896 | 28x28 | 200 | 53.54 (+0.44) | 53.1 | 45.79 | config |
| MaxViT-Base | 640x640 | 20x20 | 200 | 51.59 | - | 44.07 | config |
| MaxViT-Base | 896x896 | 28x28 | 200 | 53.47 (+0.07) | 53.4 | 45.96 | config |
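The window sizes in the table track the input resolution: in every row the window equals the input size divided by 32, which is consistent with attention being applied at a stride-32 feature map. This relation (our reading of the table, not stated explicitly in the source) can be sketched as:

```python
# Hypothetical helper: the detection window size in the table appears to
# equal input_size / 32 (e.g. 640 -> 20, 896 -> 28).
def detection_window_size(image_size: int, stride: int = 32) -> int:
    assert image_size % stride == 0, "input must be divisible by the stride"
    return image_size // stride

print(detection_window_size(640))  # 20
print(detection_window_size(896))  # 28
```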

### JFT-300M supervised pretrain {.new-tab}

| Model | Pretrain Size | #Param | #FLOPs | globalPR-AUC |
|---|---|---|---|---|
| MaxViT-Base | 224x224 | 120M | 23.4G | 52.75% |
| MaxViT-Large | 224x224 | 212M | 43.9G | 53.77% |
| MaxViT-XLarge | 224x224 | 475M | - | 54.71% |

### ImageNet Finetuning {.new-tab}

| Model | Image Size | Top-1 Acc | Acc on Paper | #Param | #FLOPs | Config |
|---|---|---|---|---|---|---|
| MaxViT-Base | 384x384 | 88.37% (-0.32%) | 88.69% | 120M | 74.2G | config |
| MaxViT-Base | 512x512 | 88.63% (-0.19%) | 88.82% | 120M | 138.3G | config |
| MaxViT-Large | 384x384 | 88.86% (-0.26%) | 89.12% | 212M | 128.7G | config |
| MaxViT-Large | 512x512 | 89.02% (-0.39%) | 89.41% | 212M | 245.2G | config |
| MaxViT-XLarge | 384x384 | 89.21% (-0.15%) | 89.36% | 475M | 293.7G | config |
| MaxViT-XLarge | 512x512 | 89.31% (-0.22%) | 89.53% | 475M | 535.2G | config |

### Cascade RCNN models {.new-tab}

| Model | Image Size | Window Size | Epochs | box AP | box AP on paper | mask AP | Config |
|---|---|---|---|---|---|---|---|
| MaxViT-Base | 896x896 | 28x28 | 200 | 54.31 (+0.91) | 53.4 | 46.31 | config |
| MaxViT-Large | 896x896 | 28x28 | 200 | 54.69 | - | 46.59 | config |

## Citation

Should you find this repository useful, please consider citing:

```bibtex
@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}
```