
# Masked Autoencoders Are Scalable Vision Learners (MAE)

A TensorFlow 2 (TF2) implementation of MAE.
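MAE pretrains a ViT encoder by masking a large random fraction of image patches (75% in the paper) and training a lightweight decoder to reconstruct the missing pixels. As a rough NumPy sketch of the random patch masking, with illustrative function names that are not this repo's actual API:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Keep a random subset of patches, as in MAE pretraining.

    patches: (num_patches, dim) array of flattened patches.
    Returns the kept patches, the kept indices, and a binary mask
    over all patches (1 = masked / removed, 0 = kept).
    """
    rng = np.random.default_rng(seed)
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    # Shuffle patch indices and keep the first num_keep of them.
    ids_shuffle = rng.permutation(num_patches)
    ids_keep = np.sort(ids_shuffle[:num_keep])
    mask = np.ones(num_patches)
    mask[ids_keep] = 0.0
    return patches[ids_keep], ids_keep, mask
```

Only the kept (visible) patches are fed to the encoder, which is what makes MAE pretraining cheap relative to processing all patches.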

## ImageNet pretraining

| Model | Resolution | Patch size | Batch size | Epochs | Target pixel norm | Val MSE |
| ----- | ---------- | ---------- | ---------- | ------ | ----------------- | ------- |
| (a) ViT-L14 | 224x224 | 14 | 4096 | 800 | no | 0.2456 |
| (b) ViT-L14 | 224x224 | 14 | 4096 | 800 | yes | 0.3630 |
| (c) ViT-L16 | 224x224 | 16 | 4096 | 800 | yes | 0.3866 |
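The "target pixel norm" column refers to MAE's normalized-pixel reconstruction target: when enabled, each patch's pixels are normalized by that patch's own mean and standard deviation before the MSE is computed, so the val MSE values with and without normalization are on different scales and not directly comparable. A minimal NumPy sketch of the idea (function names are illustrative, not this repo's API):

```python
import numpy as np

def normalized_pixel_target(patches, eps=1e-6):
    # Normalize each flattened patch by its own mean and std
    # (the "norm pix" reconstruction target in MAE).
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

def masked_mse(pred, target, mask):
    # Mean squared error averaged over masked patches only
    # (mask == 1 on removed patches, 0 on visible ones).
    loss_per_patch = ((pred - target) ** 2).mean(axis=-1)
    return (loss_per_patch * mask).sum() / mask.sum()
```

The loss is computed only on masked patches; visible patches do not contribute to the reconstruction objective.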

## ImageNet linear probing

| Model | Resolution | Patch size | Base learning rate | Batch size | Init checkpoint | Epochs | Top-1 acc | Dashboard |
| ----- | ---------- | ---------- | ------------------ | ---------- | --------------- | ------ | --------- | --------- |
| ViT-L14 | 224x224 | 14 | 0.1 | 16384 | (b) | 90 | 72.8 | - |
| ViT-L16 | 224x224 | 16 | 0.1 | 16384 | (c) | 90 | 73.0 | - |
| ViT-L16 | 224x224 | 16 | 0.1 | 16384 | norm | 90 | 73.9 | Table 1 (d) |

## ImageNet fine-tuning

| Model | Resolution | Patch size | Base learning rate | Batch size | Init checkpoint | Epochs | Top-1 acc | Dashboard |
| ----- | ---------- | ---------- | ------------------ | ---------- | --------------- | ------ | --------- | --------- |
| ViT-L14 | 224x224 | 14 | 0.001 | 1024 | (a) | 50 | 84.4 | - |
| ViT-L14 | 224x224 | 14 | 0.001 | 1024 | (b) | 50 | 85.3 | - |
| ViT-L14 | 224x224 | 14 | 0.00075 | 1024 | (b) | 50 | 85.4 | - |
| ViT-L14 | 224x224 | 14 | 0.0001 | 4096 | scratch | 200 | 82.4 | - |
| ViT-L16 | 224x224 | 16 | 0.001 | 1024 | (c) | 50 | 84.9 | - |
| ViT-L16 | 224x224 | 16 | 0.001 | 1024 | no-norm | 50 | 84.9 | Table 1 (d) |
| ViT-L16 | 224x224 | 16 | 0.001 | 1024 | norm | 50 | 85.4 | paper section 4 |
| ViT-L16 | 224x224 | 16 | 0.0001 | 4096 | scratch | 200 | 82.5 | paper section 4 |

Known discrepancies with the paper:

- ~-0.9 linear probing top-1 accuracy (with norm) compared to the paper's results at patch size 16.
- ~-0.5 fine-tuning top-1 accuracy (with norm) compared to the paper's results at patch size 16.