---
library_name: tf-keras
license: mit
metrics:
- accuracy
pipeline_tag: video-classification
tags:
- pretraining
- finetuning
- vision
- videomae
---

# VideoMAE

| Paper | Colab | HF Space | HF Hub |
| :--: | :--: | :---: | :---: |
| [Paper](https://arxiv.org/abs/2203.12602) | [Colab](https://colab.research.google.com/drive/1BFisOW2yzdvDEBN_0P3M41vQCwF6dTWR?usp=sharing) | [HF Space](https://huggingface.co/spaces/innat/VideoMAE) | [HF Hub](https://huggingface.co/innat/videomae) |

Video masked autoencoders (**VideoMAE**) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent [ImageMAE](https://arxiv.org/abs/2111.06377), VideoMAE uses customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, which leads the model to extract more effective video representations during pre-training. Some highlights of **VideoMAE**:

- **Masked Video Modeling for Video Pre-Training**
- **A Simple, Efficient and Strong Baseline in SSVP**
- **High performance, but NO extra data required**

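As a rough illustration of tube masking, the sketch below samples one random spatial mask and repeats it across every temporal slice, so that each masked patch stays hidden for the whole clip. It is a minimal standard-library example, not the code used in this repository: the helper name `tube_mask`, the 0.9 mask ratio, and the 16x224x224 input with 2x16x16 tubelets are assumptions chosen for illustration.

```python
import random


def tube_mask(num_temporal, num_spatial, mask_ratio=0.9, seed=None):
    """Return a [num_temporal][num_spatial] grid of booleans (True = masked).

    Hypothetical helper: one spatial mask is sampled and reused for every
    temporal slice, so each masked patch forms a "tube" through time.
    """
    rng = random.Random(seed)
    num_masked = int(mask_ratio * num_spatial)
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    spatial_mask = [pos in masked_positions for pos in range(num_spatial)]
    # Repeat the same spatial mask along the temporal axis.
    return [list(spatial_mask) for _ in range(num_temporal)]


# Assumed token grid: 16 frames of 224x224 pixels, tubelets of 2x16x16.
num_temporal = 16 // 2          # 8 temporal slices
num_spatial = (224 // 16) ** 2  # 196 spatial patches per slice
mask = tube_mask(num_temporal, num_spatial, mask_ratio=0.9, seed=0)

visible = sum(not m for row in mask for m in row)
total = num_temporal * num_spatial
print(f"{visible} of {total} tokens stay visible")
```

With these assumed settings only 160 of the 1568 tokens remain visible, which is what makes the reconstruction task hard.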

This is an unofficial `Keras` reimplementation of the [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) model. The official `PyTorch` implementation can be found [here](https://github.com/MCG-NJU/VideoMAE).

# Model Zoo

The pre-trained and fine-tuned models are listed in [MODEL_ZOO.md](MODEL_ZOO.md). Some highlights follow.

### Kinetics-400

For Kinetics-400, VideoMAE is trained for around **1600** epochs without **any extra data**. The following checkpoints are available in both TensorFlow `SavedModel` and `h5` formats.

| Backbone | \#Frame | Top-1 | Top-5 | Params [FT] MB | Params [PT] MB | FLOPs |
| :--: | :--: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x5x3 | 79.0 | 93.8 | 22 | 24 | 57G |
| ViT-B | 16x5x3 | 81.5 | 95.1 | 87 | 94 | 181G |
| ViT-L | 16x5x3 | 85.2 | 96.8 | 304 | 343 | - |
| ViT-H | 16x5x3 | 86.6 | 97.1 | 632 | ?* | - |

<sup>?* The official VideoMAE `ViT-H` pretrained checkpoint has a weight issue; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.</sup>
<sup>FLOPs are reported for the encoder (FT) models only.</sup>

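The `#Frame` entries read most naturally in the notation of the VideoMAE paper, frames x temporal clips x spatial crops, although this README does not spell that out. Under that assumption, the hypothetical helper `parse_views` below splits an entry into its factors and counts the views evaluated per test video.

```python
def parse_views(spec):
    """Split a '#Frame' entry such as '16x5x3' into its three factors.

    Assumes the order frames x temporal clips x spatial crops.
    """
    frames, clips, crops = (int(part) for part in spec.split("x"))
    return {
        "frames": frames,
        "clips": clips,
        "crops": crops,
        "views": clips * crops,  # views evaluated per test video
    }


for spec in ("16x5x3", "16x2x3"):
    info = parse_views(spec)
    print(spec, "->", info["views"], "views of", info["frames"], "frames each")
```

Read this way, `16x5x3` means each test video is scored on 15 views of 16 frames, and `16x2x3` on 6 views.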

### Something-Something V2

For SSv2, VideoMAE is trained for around **2400** epochs without **any extra data**.

| Backbone | \#Frame | Top-1 | Top-5 | Params [FT] MB | Params [PT] MB | FLOPs |
| :------: | :-----: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x2x3 | 66.8 | 90.3 | 22 | 24 | 57G |
| ViT-B | 16x2x3 | 70.8 | 92.4 | 86 | 94 | 181G |

### UCF101

For UCF101, VideoMAE is trained for around **3200** epochs without **any extra data**.

| Backbone | \#Frame | Top-1 | Top-5 | Params [FT] MB | Params [PT] MB | FLOPs |
| :---: | :-----: | :---: | :---: | :---: | :---: | :---: |
| ViT-B | 16x5x3 | 91.3 | 98.5 | 86 | 94 | 181G |