	---
	library_name: tf-keras
	license: mit
	metrics:
	- accuracy
	pipeline_tag: video-classification
	tags:
	- pretraining
	- finetuning
	- vision
	- videomae
	---

	# VideoMAE

	![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/622dcfbee27c88667db09411/cIVuujQqtTv-jlcRl9Gcf.jpeg)


	\| Paper \| Colab \| HF Space \| HF Hub \|
	\| :--: \| :--: \| :---: \| :---: \|
	\| [![arXiv](https://img.shields.io/badge/arXiv-2203.12602-darkred)](https://arxiv.org/abs/2203.12602) \| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1BFisOW2yzdvDEBN_0P3M41vQCwF6dTWR?usp=sharing) \| [![HuggingFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow.svg)](https://huggingface.co/spaces/innat/VideoMAE) \| [![HuggingFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Hub-yellow.svg)](https://huggingface.co/innat/videomae) \|


	Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent [ImageMAE](https://arxiv.org/abs/2111.06377), VideoMAE uses customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, which encourages the model to extract more effective video representations during pre-training. Some highlights of VideoMAE:

	- Masked Video Modeling for Video Pre-Training
	- A Simple, Efficient and Strong Baseline in SSVP
	- High performance, but NO extra data required
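The tube masking described above can be illustrated with a short NumPy sketch. It is not taken from this repository: the function name `tube_mask`, the 0.9 masking ratio, and the token grid (8 temporal positions by 14 x 14 spatial positions, which assumes 16 input frames at 224 x 224, 16 x 16 patches, and a tubelet length of 2) are assumptions chosen for illustration. The idea is that one random spatial mask is drawn and then repeated along the time axis, so a masked patch stays masked at every temporal position.

```python
import numpy as np

def tube_mask(num_temporal, num_spatial, mask_ratio, rng):
    """Return a boolean mask of shape (num_temporal, num_spatial).

    True marks a masked token. The same spatial positions are masked
    at every temporal index, which forms "tubes" through time.
    """
    num_masked = int(mask_ratio * num_spatial)
    frame_mask = np.zeros(num_spatial, dtype=bool)
    # Pick the spatial positions to hide, without repetition.
    masked_idx = rng.choice(num_spatial, size=num_masked, replace=False)
    frame_mask[masked_idx] = True
    # Repeat the single spatial mask along the temporal axis.
    return np.tile(frame_mask, (num_temporal, 1))

rng = np.random.default_rng(0)
# Assumed token grid: 16 frames / tubelet length 2 = 8, and (224 / 16)^2 = 196.
mask = tube_mask(num_temporal=8, num_spatial=14 * 14, mask_ratio=0.9, rng=rng)
print(mask.shape)        # (8, 196)
print(int(mask.sum()))   # 1408 masked tokens out of 1568
```

Because the mask is identical at every temporal position, a masked patch cannot simply be copied from a neighboring frame, which is one reason the reconstruction task becomes harder.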

	This is an unofficial `Keras` reimplementation of the [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) model. The official `PyTorch` implementation can be found [here](https://github.com/MCG-NJU/VideoMAE).

	# Model Zoo

	The pre-trained and fine-tuned models are listed in [MODEL_ZOO.md](MODEL_ZOO.md). Some highlights follow.

	### Kinetics-400

	For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow `SavedModel` and `h5` formats.


	\| Backbone \| \#Frame \| Top-1 \| Top-5 \| Params [FT] (M) \| Params [PT] (M) \| FLOPs \|
	\| :--: \| :--: \| :---: \| :---: \| :---: \| :---: \| :---: \|
	\| ViT-S \| 16x5x3 \| 79.0 \| 93.8 \| 22 \| 24 \| 57G \|
	\| ViT-B \| 16x5x3 \| 81.5 \| 95.1 \| 87 \| 94 \| 181G \|
	\| ViT-L \| 16x5x3 \| 85.2 \| 96.8 \| 304 \| 343 \| - \|
	\| ViT-H \| 16x5x3 \| 86.6 \| 97.1 \| 632 \| ? \| - \|

	<sup>? The official `ViT-H` pre-trained checkpoint of VideoMAE has a weight issue; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.</sup>
	<sup>FLOPs are reported for the encoder (FT) models only.</sup>
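The `#Frame` column in these tables appears to follow the convention of the VideoMAE paper: frames per clip x temporal clips x spatial crops, so `16x5x3` would mean 15 views of 16 frames each per video. That reading is an assumption, as this README does not define the column. A common way to turn per-view predictions into video-level Top-1 and Top-5 results is to average the class scores across views. The sketch below shows this with NumPy on hand-written toy scores; `aggregate_views` and `top_k_correct` are hypothetical helper names, not functions from this repository.

```python
import numpy as np

def aggregate_views(view_probs):
    """Average class probabilities over all views of one video."""
    return np.asarray(view_probs, dtype=float).mean(axis=0)

def top_k_correct(scores, label, k):
    """Return True if `label` is among the k highest-scoring classes."""
    top_k = np.argsort(scores)[::-1][:k]
    return bool(label in top_k)

# Toy example: 3 views (instead of 5 x 3 = 15) and 4 classes (instead of 400).
view_probs = [
    [0.6, 0.3, 0.1, 0.0],
    [0.2, 0.5, 0.2, 0.1],  # this view alone would predict class 1
    [0.5, 0.4, 0.1, 0.0],
]
video_scores = aggregate_views(view_probs)
prediction = int(np.argmax(video_scores))

print(prediction)                            # 0
print(top_k_correct(video_scores, 1, k=1))   # False
print(top_k_correct(video_scores, 1, k=2))   # True
```

Averaging smooths out views that disagree: the second view on its own favors class 1, but the video-level prediction is class 0.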


	### Something-Something V2

	For SSv2, VideoMAE is trained for around 2400 epochs without any extra data.

	\| Backbone \| \#Frame \| Top-1 \| Top-5 \| Params [FT] (M) \| Params [PT] (M) \| FLOPs \|
	\| :------: \| :-----: \| :---: \| :---: \| :---: \| :---: \| :---: \|
	\| ViT-S \| 16x2x3 \| 66.8 \| 90.3 \| 22 \| 24 \| 57G \|
	\| ViT-B \| 16x2x3 \| 70.8 \| 92.4 \| 86 \| 94 \| 181G \|


	### UCF101

	For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.

	\| Backbone \| \#Frame \| Top-1 \| Top-5 \| Params [FT] (M) \| Params [PT] (M) \| FLOPs \|
	\| :---: \| :-----: \| :---: \| :---: \| :---: \| :---: \| :---: \|
	\| ViT-B \| 16x5x3 \| 91.3 \| 98.5 \| 86 \| 94 \| 181G \|
