| Just read the ViTMAE paper, sharing some highlights 🧶 ViTMAE is a simple yet effective self-supervised pre-training technique, where the authors combined a vision transformer with a masked autoencoder. | |
| The images are first masked (75 percent of the image!) and then the model learns features by trying to reconstruct the original image! | |
|  | |
| The masked patches are not fed to the encoder at all: only the visible patches are passed in (and that is the only thing the encoder sees!). | |
| Next, a mask token is added where the masked patches were (a bit like BERT, if you will), and the mask tokens and encoded patches are fed to the decoder. | |
| The decoder then tries to reconstruct the original image. | |
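
The masking and mask-token steps above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names `random_masking` and `build_decoder_input`, the 224x224 image size, the 16x16 patch size, and the string stand-ins for encoded patches are all assumptions made for the example.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Randomly split patch indices into a visible set and a masked set."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_keep])
    masked = sorted(indices[num_keep:])
    return visible, masked


def build_decoder_input(encoded_visible, visible_idx, num_patches, mask_token):
    """Put encoded patches back at their original positions, mask tokens elsewhere."""
    sequence = [mask_token] * num_patches
    for idx, token in zip(visible_idx, encoded_visible):
        sequence[idx] = token
    return sequence


# Assumption: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
visible_idx, masked_idx = random_masking(num_patches)

# Stand-in for the encoder: it only ever receives the visible patches.
encoded_visible = [f"enc({i})" for i in visible_idx]

# The decoder sees the full-length sequence: encoded patches plus mask tokens.
decoder_input = build_decoder_input(
    encoded_visible, visible_idx, num_patches, mask_token="[MASK]"
)

print(len(visible_idx), "visible,", len(masked_idx), "masked")
```

With a 75 percent masking ratio the encoder processes only a quarter of the patches, while the decoder still receives one token per patch position.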
|  | |
| The authors found that a high masking ratio works well for both fine-tuning on downstream tasks and linear probing 🤯🤯 | |
|  | |
| If you want to try the model or fine-tune it, all the pre-trained ViTMAE models released by Meta are available on [Hugging Face](https://t.co/didvTL9Zkm). | |
| We've built a [demo](https://t.co/PkuACJiKrB) for you to see the intermediate outputs and reconstruction by ViTMAE. | |
| Also there's a nice [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) by [@NielsRogge](https://twitter.com/NielsRogge). | |
|  | |
| > [!TIP] | |
| Resources: | |
| [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377v3) | |
| by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick (2021) | |
| [GitHub](https://github.com/facebookresearch/mae) | |
| [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/vit_mae) | |
| > [!NOTE] | |
| [Original tweet](https://twitter.com/mervenoyann/status/1740688304784183664) (December 29, 2023) | |