---
license: apache-2.0
base_model:
- OpenGVLab/VideoMAEv2-Base
---

<h1 align="center">AVF-MAE++ : Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning</h1>

[Xuecheng Wu](https://scholar.google.com.hk/citations?user=MuTEp7sAAAAJ), [Heli Sun](https://scholar.google.com.hk/citations?user=HXjwuE4AAAAJ), Yifan Wang, Jiayu Nie, [Jie Zhang](https://scholar.google.com.hk/citations?user=7YkR3CoAAAAJ), [Yabing Wang](https://scholar.google.com.hk/citations?user=3WVFdMUAAAAJ), [Junxiao Xue](https://scholar.google.com.hk/citations?user=Za9YFVIAAAAJ), Liang He<br>
Xi'an Jiaotong University & University of Science and Technology of China & A*STAR & Zhejiang Lab<br>



## 🌟 Overview
![AVF-MAE++](figs/AVF-MAE++_v6_0315.png)

**Abstract: Affective Video Facial Analysis (AVFA) is important for advancing emotion-aware AI, yet the persistent data scarcity in AVFA presents challenges. Recently, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained significant attention, particularly in its audio-visual adaptation. Insights from general domains suggest that scaling is vital for unlocking impressive improvements, though its effects on AVFA remain largely unexplored. Additionally, capturing both intra- and inter-modal correlations through scalable representations is a crucial challenge in this field. To address these gaps, we introduce AVF-MAE++, a series of audio-visual MAEs designed to explore the impact of scaling on AVFA with a focus on advanced correlation modeling. Our method incorporates a novel audio-visual dual masking strategy and an improved modality encoder with a holistic view to better support scalable pre-training. Furthermore, we propose the Iteratively Audio-Visual Correlations Learning Module to improve correlation capture within the SSL framework, addressing the limitations of prior methods. To support smooth adaptation and mitigate overfitting, we also introduce a progressive semantics injection strategy, which structures training in three stages. Extensive experiments across 17 datasets, spanning three key AVFA tasks, demonstrate the superior performance of AVF-MAE++, establishing new state-of-the-art outcomes. Ablation studies provide further insights into the critical design choices driving these gains.**


## 🛫 Main Results

<p align="center">
  <img src="figs/radar_1030.png" width=55%> <br>
   Performance comparisons of AVF-MAE++ and state-of-the-art AVFA methods on 17 datasets across CEA, DEA, and MER tasks.
</p>


<p align="center">
  <img src="figs/CEA-DEA.jpg" width=75%> <br>
   Performance comparisons of AVF-MAE++ with state-of-the-art CEA and DEA methods on twelve datasets.
</p>


<p align="center">
  <img src="figs/MER.jpg" width=55%> <br>
   Performance comparisons of AVF-MAE++ and state-of-the-art MER methods in terms of UF1 (%) on five datasets.
</p>
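
For readers unfamiliar with the UF1 metric used in the MER comparison above, here is a minimal sketch of how an unweighted (macro-averaged) F1 score is commonly computed. It is an illustration only, not the evaluation code of this project: the function name `unweighted_f1` and the toy labels are hypothetical, and we assume the average is taken over the classes that appear in the ground truth.

```python
from collections import Counter


def unweighted_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight.

    Assumption: the average runs over classes present in y_true.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted wrongly
            fn[t] += 1          # class t was missed

    f1_scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Toy labels, purely illustrative
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(f"UF1 = {100 * unweighted_f1(y_true, y_pred):.2f}%")
```

Because every class contributes equally regardless of its sample count, UF1 is less dominated by majority classes than plain accuracy, which matters for the small, imbalanced datasets typical of MER.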


## 🌞 Visualizations

### 🌟 Audio-visual reconstructions

![Audio-visual_reconstructions](figs/overall_reconstruction-0317.png)


### 🌟 Confusion matrix on MAFW (11-class) dataset


![Confusion_matrix_on_MAFW](figs/MAFW-Fold5-0315.png)



## πŸ‘ Acknowledgements

This project is built upon [HiCMAE](https://github.com/sunlicai/HiCMAE), [MAE-DFER](https://github.com/sunlicai/MAE-DFER), [VideoMAE](https://github.com/MCG-NJU/VideoMAE), and [AudioMAE](https://github.com/facebookresearch/AudioMAE). We thank the authors for their insightful, high-quality codebases.


## ✏️ Citation
**If you find this paper useful in your research, please consider citing:**

```
@InProceedings{Wu_2025_CVPR,
    author    = {Wu, Xuecheng and Sun, Heli and Wang, Yifan and Nie, Jiayu and Zhang, Jie and Wang, Yabing and Xue, Junxiao and He, Liang},
    title     = {AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {9142-9153}
}
```

**You can also consider citing the following related papers:**

```
@article{sun2024hicmae,
  title={Hicmae: Hierarchical contrastive masked autoencoder for self-supervised audio-visual emotion recognition},
  author={Sun, Licai and Lian, Zheng and Liu, Bin and Tao, Jianhua},
  journal={Information Fusion},
  volume={108},
  pages={102382},
  year={2024},
  publisher={Elsevier}
}
```

```
@inproceedings{sun2023mae,
  title={Mae-dfer: Efficient masked autoencoder for self-supervised dynamic facial expression recognition},
  author={Sun, Licai and Lian, Zheng and Liu, Bin and Tao, Jianhua},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={6110--6121},
  year={2023}
}
```

```
@article{sun2024svfap,
  title={SVFAP: Self-supervised video facial affect perceiver},
  author={Sun, Licai and Lian, Zheng and Wang, Kexin and He, Yu and Xu, Mingyu and Sun, Haiyang and Liu, Bin and Tao, Jianhua},
  journal={IEEE Transactions on Affective Computing},
  year={2024},
  publisher={IEEE}
}
```