Conna committed · verified
Commit 0c39f6f · 1 Parent(s): 218862a

Update README.md

Files changed (1): README.md (+105, −3)

---
license: apache-2.0
base_model:
- OpenGVLab/VideoMAEv2-Base
---

<h1 align="center">AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning</h1>

> [Xuecheng Wu](https://scholar.google.com.hk/citations?user=MuTEp7sAAAAJ), [Heli Sun](https://scholar.google.com.hk/citations?user=HXjwuE4AAAAJ), Yifan Wang, Jiayu Nie, [Jie Zhang](https://scholar.google.com.hk/citations?user=7YkR3CoAAAAJ), [Yabing Wang](https://scholar.google.com.hk/citations?user=3WVFdMUAAAAJ), [Junxiao Xue](https://scholar.google.com.hk/citations?user=Za9YFVIAAAAJ), Liang He<br>
> Xi'an Jiaotong University & University of Science and Technology of China & A*STAR & Zhejiang Lab<br>
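
This card hosts pre-trained AVF-MAE++ weights built on top of `OpenGVLab/VideoMAEv2-Base` (per the metadata above). The snippet below is only a minimal sketch for fetching the checkpoint files with `huggingface_hub`; the `repo_id` is a placeholder, so substitute the actual path of this repository.

```python
# Minimal sketch: fetch this repository's files locally.
# NOTE: "user/AVF-MAE-plus-plus" is a placeholder repo_id, not the real path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="user/AVF-MAE-plus-plus")
print(f"Files downloaded to: {local_dir}")
```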


## 🌟 Overview
![AVF-MAE++](figs/AVF-MAE++_v6_0315.png)

**Abstract: Affective Video Facial Analysis (AVFA) is important for advancing emotion-aware AI, yet the persistent data scarcity in AVFA presents challenges. Recently, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained significant attention, particularly in its audio-visual adaptation. Insights from general domains suggest that scaling is vital for unlocking impressive improvements, though its effects on AVFA remain largely unexplored. Additionally, capturing both intra- and inter-modal correlations through scalable representations is a crucial challenge in this field. To tackle these gaps, we introduce AVF-MAE++, a series of audio-visual MAEs designed to explore the impact of scaling on AVFA with a focus on advanced correlation modeling. Our method incorporates a novel audio-visual dual masking strategy and an improved modality encoder with a holistic view to better support scalable pre-training. Furthermore, we propose the Iteratively Audio-Visual Correlations Learning Module to improve correlation capture within the SSL framework, bridging the limitations of prior methods. To support smooth adaptation and mitigate overfitting, we also introduce a progressive semantics injection strategy, which structures training in three stages. Extensive experiments across 17 datasets, spanning three key AVFA tasks, demonstrate the superior performance of AVF-MAE++, establishing new state-of-the-art outcomes. Ablation studies provide further insights into the critical design choices driving these gains.**
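
The audio-visual dual masking strategy mentioned above is what makes pre-training efficient: only a small fraction of video and audio tokens reaches the encoders. The sketch below is illustrative only (token counts and mask ratios are assumptions, not the paper's settings); it shows generic MAE-style random masking applied independently per modality.

```python
# Illustrative audio-visual dual masking (not the authors' implementation).
import torch

def random_masking(num_tokens: int, mask_ratio: float, batch: int):
    """Return (visible_idx, mask); mask is True where tokens are hidden."""
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    noise = torch.rand(batch, num_tokens)      # one uniform score per token
    shuffle = noise.argsort(dim=1)             # random permutation of tokens
    visible_idx = shuffle[:, :num_keep]        # tokens the encoder will see
    mask = torch.ones(batch, num_tokens, dtype=torch.bool)
    mask.scatter_(1, visible_idx, False)       # False marks visible tokens
    return visible_idx, mask

batch = 2
video_tokens, audio_tokens = 1568, 512         # assumed token grid sizes
v_idx, v_mask = random_masking(video_tokens, mask_ratio=0.9, batch=batch)
a_idx, a_mask = random_masking(audio_tokens, mask_ratio=0.8, batch=batch)

video_feats = torch.randn(batch, video_tokens, 768)  # stand-in tube embeddings
visible_video = torch.gather(
    video_feats, 1, v_idx.unsqueeze(-1).expand(-1, -1, 768))
print(visible_video.shape)  # torch.Size([2, 156, 768]): ~10% of tokens remain
```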


## 🛫 Main Results

<p align="center">
<img src="figs/radar_1030.png" width=45%> <br>
Performance comparisons of AVF-MAE++ and state-of-the-art AVFA methods on 17 datasets across CEA, DEA, and MER tasks.
</p>


<p align="center">
<img src="figs/CEA-DEA.jpg" width=65%> <br>
Performance comparisons of AVF-MAE++ with state-of-the-art CEA and DEA methods on twelve datasets.
</p>


<p align="center">
<img src="figs/MER.jpg" width=35%> <br>
Performance comparisons of AVF-MAE++ and state-of-the-art MER methods in terms of UF1 (%) on five datasets.
</p>


## 🌞 Visualizations

### 🌟 Audio-visual reconstructions

![Audio-visual_reconstructions](figs/overall_reconstruction-0317.png)

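
As a rough illustration of what these reconstructions optimize, the MAE family typically regresses the raw (often patch-normalized) targets of masked tokens with a mean-squared error. The sketch below shows that generic objective only; it is not necessarily the exact loss used here.

```python
# Generic MAE-style reconstruction loss over masked tokens (illustrative).
import torch
import torch.nn.functional as F

pred = torch.randn(2, 1568, 1536)    # stand-in decoder predictions per token
target = torch.randn(2, 1568, 1536)  # stand-in ground-truth pixel targets
mask = torch.rand(2, 1568) > 0.1     # True where a token was masked

per_token = F.mse_loss(pred, target, reduction="none").mean(dim=-1)
loss = (per_token * mask).sum() / mask.sum()  # average over masked tokens only
print(loss.item())
```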

### 🌟 Confusion matrix on MAFW (11-class) dataset

![Confusion_matrix_on_MAFW](figs/MAFW-Fold5-0315.png)


## 👍 Acknowledgements

This project is built upon [HiCMAE](https://github.com/sunlicai/HiCMAE), [MAE-DFER](https://github.com/sunlicai/MAE-DFER), [VideoMAE](https://github.com/MCG-NJU/VideoMAE), and [AudioMAE](https://github.com/facebookresearch/AudioMAE). Thanks for their insightful and great codebases.


## ✏️ Citation
**If you find this paper useful in your research, please consider citing:**

```bibtex
@InProceedings{Wu_2025_CVPR,
  author    = {Wu, Xuecheng and Sun, Heli and Wang, Yifan and Nie, Jiayu and Zhang, Jie and Wang, Yabing and Xue, Junxiao and He, Liang},
  title     = {AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {9142--9153}
}
```

**You can also consider citing the following related papers:**

```bibtex
@article{sun2024hicmae,
  title     = {HiCMAE: Hierarchical contrastive masked autoencoder for self-supervised audio-visual emotion recognition},
  author    = {Sun, Licai and Lian, Zheng and Liu, Bin and Tao, Jianhua},
  journal   = {Information Fusion},
  volume    = {108},
  pages     = {102382},
  year      = {2024},
  publisher = {Elsevier}
}
```

```bibtex
@inproceedings{sun2023mae,
  title     = {MAE-DFER: Efficient masked autoencoder for self-supervised dynamic facial expression recognition},
  author    = {Sun, Licai and Lian, Zheng and Liu, Bin and Tao, Jianhua},
  booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
  pages     = {6110--6121},
  year      = {2023}
}
```

```bibtex
@article{sun2024svfap,
  title     = {SVFAP: Self-supervised video facial affect perceiver},
  author    = {Sun, Licai and Lian, Zheng and Wang, Kexin and He, Yu and Xu, Mingyu and Sun, Haiyang and Liu, Bin and Tao, Jianhua},
  journal   = {IEEE Transactions on Affective Computing},
  year      = {2024},
  publisher = {IEEE}
}
```