Masked Autoencoders Are Scalable Vision Learners (MAE)
TF2 implementation of MAE.
ImageNet pretrain
| Model | resolution | patch size | batch size | epochs | target pixel norm | val MSE |
|---|---|---|---|---|---|---|
| (a) ViT-L14 | 224x224 | 14 | 4096 | 800 | no | 0.2456 |
| (b) ViT-L14 | 224x224 | 14 | 4096 | 800 | yes | 0.3630 |
| (c) ViT-L16 | 224x224 | 16 | 4096 | 800 | yes | 0.3866 |
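
The "target pixel norm" column presumably refers to the per-patch normalized reconstruction target described in the MAE paper, where each patch's pixels are normalized by that patch's own mean and variance and the loss is computed only on masked patches. The following is a minimal NumPy sketch of that idea, not this repository's TF2 code: the function names `patchify`, `normalize_patches`, and `masked_mse` are hypothetical, and the use of population variance and `eps=1e-6` are assumptions.

```python
import numpy as np


def patchify(images: np.ndarray, patch_size: int) -> np.ndarray:
    """Split (N, H, W, C) images into (N, num_patches, patch_size * patch_size * C)."""
    n, h, w, c = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = images.reshape(n, gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (N, gh, gw, p, p, C)
    return x.reshape(n, gh * gw, patch_size * patch_size * c)


def normalize_patches(patches: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each patch by its own mean and variance (assumed: ddof=0)."""
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)


def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error averaged over masked patches only.

    mask has shape (N, num_patches) with 1 for masked (hidden) patches.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / mask.sum())


rng = np.random.default_rng(0)
images = rng.random((2, 224, 224, 3))
patches = patchify(images, patch_size=14)  # shape (2, 256, 588)
target = normalize_patches(patches)        # rows "yes" in the table
mask = (rng.random((2, 256)) < 0.75).astype(np.float64)
loss = masked_mse(np.zeros_like(target), target, mask)
```

If the validation MSE is computed against the same target used for training (an assumption here), the value in row (a) is on a different scale from rows (b) and (c) and the two should not be compared directly.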
ImageNet linear probing
| Model | resolution | patch size | base learning rate | batch size | init checkpoint | epochs | top-1 acc (%) | dashboard |
|---|---|---|---|---|---|---|---|---|
| ViT-L14 | 224x224 | 14 | 0.1 | 16384 | (b) | 90 | 72.8 | - |
| ViT-L16 | 224x224 | 16 | 0.1 | 16384 | (c) | 90 | 73.0 | - |
| ViT-L16 | 224x224 | 16 | 0.1 | 16384 | norm | 90 | 73.9 | Table 1 (d) |
ImageNet finetune
| Model | resolution | patch size | base learning rate | batch size | init checkpoint | epochs | top-1 acc (%) | dashboard |
|---|---|---|---|---|---|---|---|---|
| ViT-L14 | 224x224 | 14 | 0.001 | 1024 | (a) | 50 | 84.4 | - |
| ViT-L14 | 224x224 | 14 | 0.001 | 1024 | (b) | 50 | 85.3 | - |
| ViT-L14 | 224x224 | 14 | 0.00075 | 1024 | (b) | 50 | 85.4 | - |
| ViT-L14 | 224x224 | 14 | 0.0001 | 4096 | scratch | 200 | 82.4 | - |
| ViT-L16 | 224x224 | 16 | 0.001 | 1024 | (c) | 50 | 84.9 | - |
| ViT-L16 | 224x224 | 16 | 0.001 | 1024 | no-norm | 50 | 84.9 | Table 1 (d) |
| ViT-L16 | 224x224 | 16 | 0.001 | 1024 | norm | 50 | 85.4 | paper section 4. |
| ViT-L16 | 224x224 | 16 | 0.0001 | 4096 | scratch | 200 | 82.5 | paper section 4. |
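
The "base learning rate" column in the linear probing and finetune tables is likely meant in the sense of the MAE paper, which scales the learning rate linearly with batch size: lr = base_lr × batch_size / 256. Whether this implementation applies exactly that rule is an assumption, and the helper name `effective_lr` is hypothetical. A small sketch of the arithmetic for the configurations listed above:

```python
def effective_lr(base_lr: float, batch_size: int, reference_batch: int = 256) -> float:
    """Linear scaling rule: lr = base_lr * batch_size / reference_batch."""
    return base_lr * batch_size / reference_batch


# (base learning rate, batch size) pairs taken from the tables above.
configs = {
    "linear probing": (0.1, 16384),
    "finetune": (0.001, 1024),
    "finetune, lower base lr": (0.00075, 1024),
    "from scratch": (0.0001, 4096),
}

for name, (base_lr, batch_size) in configs.items():
    print(f"{name}: effective lr = {effective_lr(base_lr, batch_size):.4g}")
```

Under this assumption, the linear probing rows correspond to an effective rate of 6.4, the finetune rows at base rate 0.001 to 0.004, and the from-scratch rows to 0.0016.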
Known discrepancies with the paper (patch size 16, with target pixel normalization):
Linear probing top-1 accuracy is about 0.9 points lower than the paper's result (73.0 vs. 73.9).
Finetune top-1 accuracy is about 0.5 points lower than the paper's result (84.9 vs. 85.4).