---
license: mit
pipeline_tag: video-classification
---
# VideoMAEv2_TikTok

We provide pre-trained weights on the **TikTokActions** dataset for two backbones: **ViT-B** (Vision Transformer-Base) and **ViT-Giant**. We also include weights for both backbones fine-tuned on **Kinetics-400**.

## Pre-trained and Fine-tuned Weights

- **Pre-trained weights on TikTokActions**: trained on TikTok video clips categorized into multiple actions. The dataset consists of 283,582 unique videos across 386 hashtags.
- **Fine-tuned weights on Kinetics-400**: after pre-training, the models were fine-tuned on Kinetics-400, achieving state-of-the-art results.

We also provide the `log.txt` file, which records the fine-tuning process.

To use the weights and fine-tuning scripts, please refer to [VideoMAEv2's GitHub repository](https://github.com/OpenGVLab/VideoMAEv2) for implementation details.
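
As a minimal illustration of what loading these checkpoints typically involves, the sketch below unwraps a checkpoint dict and strips the pre-training parameter prefix. This is a hypothetical example, not this repository's API: the `module`/`model` wrapper keys and the `encoder.` prefix are assumptions based on common VideoMAE conventions, and the placeholder values stand in for the tensors a real `torch.load` call would return. Inspect your checkpoint's keys before relying on them.

```python
# Hypothetical sketch: the "module"/"model" wrapper keys and the "encoder."
# prefix are assumptions (common VideoMAE conventions), not confirmed by this
# repository. Real checkpoints hold torch tensors loaded via torch.load.

def prepare_state_dict(checkpoint, prefix="encoder."):
    # Unwrap the state dict if the checkpoint nests it under a known key.
    state_dict = checkpoint
    for key in ("module", "model"):
        if isinstance(state_dict, dict) and key in state_dict:
            state_dict = state_dict[key]
            break
    # Strip the pre-training prefix so parameter names match the
    # fine-tuning model's state dict.
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}

# Example with placeholder values in place of real weight tensors:
ckpt = {"model": {"encoder.patch_embed.proj.weight": 0, "head.weight": 1}}
print(prepare_state_dict(ckpt))
# {'patch_embed.proj.weight': 0, 'head.weight': 1}
```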

## Citation

For **VideoMAEv2**, please cite the following works:

    @InProceedings{wang2023videomaev2,
      author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
      title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month     = {June},
      year      = {2023},
      pages     = {14549-14560}
    }

    @misc{videomaev2,
      title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
      author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
      year          = {2023},
      eprint        = {2303.16727},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV}
    }

For **our repository**, please cite the following paper:

    @article{qian2024actionrecognition,
      author  = {Qian, Yang and Sun, Yinan and Kargarandehkordi, Ali and Azizian, Parnian and Mutlu, Onur Cezmi and Surabhi, Saimourya and Chen, Pingyi and Jabbar, Zain and Wall, Dennis Paul and Washington, Peter},
      title   = {Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos},
      journal = {arXiv preprint arXiv:2402.08875},
      year    = {2024},
      pages   = {10},
      doi     = {10.48550/arXiv.2402.08875}
    }

## Results

Our model achieves the following results on established action recognition benchmarks using the **ViT-Giant** backbone:

- **UCF101**: 99.05%
- **HMDB51**: 86.08%
- **Kinetics-400**: 85.51%
- **Something-Something V2**: 74.27%

These results highlight the power of using diverse, unlabeled, and dynamic video content for training foundation models, especially in the domain of action recognition.

## License

This project is licensed under the MIT License.