markitantov committed on
Commit
57b4a79
·
1 Parent(s): cacd1f7

Initial commit

Files changed (3)
  1. README.md +137 -0
  2. emoaffectnet.pt +3 -0
  3. lefsa.pt +3 -0
README.md ADDED
@@ -0,0 +1,137 @@
+ ---
+ library_name: pytorch
+ tags:
+ - chimera-ml
+ - lefsa
+ - pytorch
+ - audio
+ - video
+ - text
+ - multimodal
+ - emotion-recognition
+ - sentiment-analysis
+ - affective-computing
+ - emoaffectnet
+ - wav2vec2
+ - whisper
+ - jina-embeddings
+ datasets:
+ - RAMAS
+ - MELD
+ - CMU-MOSEI
+ base_model:
+ - openai/whisper-base
+ - FacebookAI/xlm-roberta-base
+ - jinaai/jina-embeddings-v3
+ ---
+
+ # LEFSA Models
+
+ This repository contains LEFSA model weights for multimodal affective state recognition.
+ LEFSA stands for **Label Encoder Fusion Strategy with Averaging** and is designed for joint **emotion recognition** and **sentiment recognition** from audio, video, and text modalities.
+
+ ## Files
+
+ - `lefsa.pt` — LEFSA checkpoint for joint audio-video-text emotion and sentiment recognition.
+ - `emoaffectnet.pt` — EmoAffectNet model for visual feature extraction. The original EmoAffectNet repository is available here: https://github.com/ElenaRyumina/EMO-AffectNetModel.
+
+ ## What the Model Predicts
+
+ The model has two classification heads.
+
+ | Task | Number of classes | Class order |
+ |---|---:|---|
+ | Emotion recognition | 7 | `neutral`, `happy`, `sad`, `anger`, `surprise`, `disgust`, `fear` |
+ | Sentiment recognition | 3 | `negative`, `neutral`, `positive` |
+
+ Use this exact class order when converting logits or probabilities to labels.
+
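The class order above can be applied as in the following minimal sketch. It assumes each head returns one raw score (logit) per class in the listed order; the logit values and the helper names `softmax` and `decode` are hypothetical and are not part of the released code.

```python
import math

# Class order copied from the table above; index i of each head maps to label i.
EMOTION_LABELS = ["neutral", "happy", "sad", "anger", "surprise", "disgust", "fear"]
SENTIMENT_LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, labels):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(labels):
        raise ValueError(f"expected {len(labels)} scores, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical head outputs for one sample (made-up numbers, not model output).
emotion_logits = [0.2, 2.1, -0.5, 0.0, 0.3, -1.2, -0.8]
sentiment_logits = [-0.7, 0.1, 1.4]

emotion, emotion_prob = decode(emotion_logits, EMOTION_LABELS)
sentiment, sentiment_prob = decode(sentiment_logits, SENTIMENT_LABELS)
print(emotion, sentiment)  # prints: happy positive
```

The length check is there because a mismatch between head size and label list is the most likely way to silently mislabel predictions.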
+ ## Model Overview
+ - Acoustic, visual, and linguistic features are downsampled to a common temporal representation.
+ - The model applies cross-modal transformer blocks to model interactions between modalities.
+ - A label encoder produces unimodal emotion and sentiment predictions and injects this label-level context back into the fusion module.
+ - In LEFSA, unimodal predictions are additionally averaged with multimodal predictions to improve robustness.
+
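The averaging step in the last bullet can be illustrated with a small sketch. It assumes a uniform mean over class probabilities from the three unimodal branches and the multimodal branch; the checkpoint may instead average logits or use different weights, so treat this as an illustration of the idea rather than the released implementation. The function names and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_predictions(unimodal_logits, multimodal_logits):
    """Uniformly average class probabilities over all prediction sources.

    Assumption: every source gets equal weight and averaging happens in
    probability space. This is an illustrative choice, not a confirmed detail.
    """
    sources = [softmax(scores) for scores in unimodal_logits]
    sources.append(softmax(multimodal_logits))
    num_classes = len(sources[0])
    return [sum(p[i] for p in sources) / len(sources) for i in range(num_classes)]


# Hypothetical sentiment logits (negative, neutral, positive) for one sample.
unimodal = [
    [0.1, 0.3, 1.0],   # audio branch
    [0.5, 0.4, 0.2],   # video branch (weakly prefers "negative")
    [-0.2, 0.0, 2.0],  # text branch
]
multimodal = [0.0, 0.2, 1.5]

fused = average_predictions(unimodal, multimodal)
predicted_index = max(range(len(fused)), key=fused.__getitem__)
print(predicted_index)  # prints: 2, i.e. "positive" in the class order above
```

In this toy case the video branch disagrees with the others, and the averaged distribution still favors the class supported by the remaining sources, which is the robustness effect the bullet describes.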
+ ## Research Corpora
+
+ The model family was evaluated in a multilingual and multicorpus setting.
+
+ | Corpus | Language / domain | Modalities | Tasks |
+ |---|---|---|---|
+ | RAMAS | Russian, dyadic semi-spontaneous interactions | Audio, video, text | Emotion, sentiment |
+ | MELD | English, scripted TV-series dialogues | Audio, video, text | Emotion, sentiment |
+ | CMU-MOSEI | English, in-the-wild YouTube monologues | Audio, video, text | Emotion, sentiment |
+
+ Emotion labels are mapped to seven classes: neutral, happiness, sadness, anger, surprise, disgust, and fear.
+ Sentiment labels are mapped to three classes: negative, neutral, and positive.
+
+ ### Feature and Fusion Strategy Comparison
+
+ `MF` is macro F1-score. For CMU-MOSEI emotion recognition, `mMF` is the mean macro F1-score over binary positive/negative emotion classes. The `Average` column is the integral score reported in the paper; in the table it equals the arithmetic mean of the six per-task scores in each row.
+
+ | ID | Audio | Video | Text | Fusion | RAMAS Emotion MF7 | RAMAS Sentiment MF3 | MELD Emotion MF7 | MELD Sentiment MF3 | CMU-MOSEI Emotion mMF6 | CMU-MOSEI Sentiment MF3 | Average |
+ |---:|---|---|---|---|---:|---:|---:|---:|---:|---:|---:|
+ | 1 | Wav2Vec2 | EmoAffectNet | JINA | BFS | 60.57 | 65.02 | 38.56 | 65.94 | 62.46 | 62.56 | 59.18 |
+ | 2 | ExHuBERT | EmoAffectNet | JINA | BFS | 62.35 | 64.02 | 38.40 | 63.93 | 62.10 | 62.32 | 58.85 |
+ | 3 | Wav2Vec2 | EmoAffectNet | RoBERTa | BFS | 57.64 | 67.14 | 35.48 | 65.03 | 61.58 | 60.19 | 57.84 |
+ | 4 | ExHuBERT | EmoAffectNet | RoBERTa | BFS | 60.54 | 64.70 | 35.05 | 63.78 | 60.78 | 59.25 | 57.35 |
+ | 5 | Wav2Vec2 | ResEmoteNet | JINA | BFS | 55.88 | 59.93 | 37.75 | 63.81 | 62.22 | 63.69 | 57.21 |
+ | 6 | ExHuBERT | ResEmoteNet | JINA | BFS | 57.21 | 61.40 | 38.25 | 64.06 | 61.11 | 60.81 | 57.14 |
+ | 7 | ExHuBERT | ResEmoteNet | RoBERTa | BFS | 49.62 | 54.44 | 32.56 | 62.15 | 59.88 | 60.47 | 53.19 |
+ | 8 | Wav2Vec2 | ResEmoteNet | RoBERTa | BFS | 52.29 | 54.87 | 34.08 | 61.65 | 58.79 | 57.42 | 53.18 |
+ | 9 | Wav2Vec2 | EmoAffectNet | JINA | LEFS | 61.38 | 66.57 | 39.79 | 65.70 | 62.79 | 61.59 | 59.64 |
+ | 10 | Wav2Vec2 | EmoAffectNet | JINA | LEFSA | **62.52** | 64.96 | **40.09** | **67.02** | 62.30 | 62.00 | **59.81** |
+
+ The best LEFSA configuration uses **Wav2Vec2 + EmoAffectNet + JINA** features with Label Encoder Fusion Strategy with Averaging.
+
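For readers who want to reproduce the table's metrics, the sketch below shows the standard definition of macro F1 (the unweighted mean of per-class F1 scores) and checks that the `Average` value of row 10 matches the arithmetic mean of its six per-task scores. The function names and the toy labels are hypothetical; the authors' evaluation code may handle edge cases such as empty classes differently.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores.

    A class with no true and no predicted samples contributes 0.0 here;
    that convention is an assumption of this sketch.
    """
    scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def integral_score(per_task_scores):
    """Arithmetic mean of the per-task scores of one table row."""
    return sum(per_task_scores) / len(per_task_scores)


# Toy 3-class example (made-up labels, not corpus data).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_mf = macro_f1(toy_true, toy_pred, num_classes=3)

# Row 10 (LEFSA) of the table above: six per-task scores and the reported average.
row_10 = [62.52, 64.96, 40.09, 67.02, 62.30, 62.00]
reported_average = 59.81
row_10_mean = integral_score(row_10)
```

The mean of row 10 is 358.89 / 6 ≈ 59.815, which agrees with the reported 59.81 to within rounding.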
+ ### Comparison with Prior Multimodal Approaches
+
+ `ST` means single-task recognition and `MT` means multitask recognition.
+
+ | Approach | Corpus | Setup | Emotion A7 | Emotion WF7 | Sentiment A3 | Sentiment WF3 |
+ |---|---|---|---:|---:|---:|---:|
+ | Ours | RAMAS | MT | 68.99 | 67.79 | 84.11 | 84.02 |
+ | Zhang et al. | MELD | MT | 41.17 | 41.22 | 67.33 | 67.21 |
+ | Van et al. | MELD | ST | 66.28 | 65.69 | — | — |
+ | Hwang et al. | MELD | ST | 66.70 | 65.93 | — | — |
+ | Tu et al. | MELD | ST | 67.85 | 67.02 | — | — |
+ | Ours | MELD | MT | 62.30 | 59.79 | 69.20 | 69.02 |
+
+ | Approach | Corpus | Setup | Emotion mWA6 | Emotion mWF6 | Sentiment A2 | Sentiment WF2 |
+ |---|---|---|---:|---:|---:|---:|
+ | Chauhan et al. | CMU-MOSEI | MT | 62.97 | 79.02 | 80.37 | 78.23 |
+ | Sangwan et al. | CMU-MOSEI | MT | 63.16 | 79.06 | 80.15 | 78.30 |
+ | Hwang et al. | CMU-MOSEI | ST | — | — | 87.40 | 87.30 |
+ | Zheng et al. | CMU-MOSEI | ST | — | — | 85.90 | 86.00 |
+ | Ours | CMU-MOSEI | MT | 64.78 | 79.06 | 84.83 | 84.90 |
+
+ ## Related Publications
+
+ Markitantov M., Ryumina E., Kaya H., Karpov A. Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion // In Proc. Interspeech 2025, pp. 3010–3014. https://doi.org/10.21437/Interspeech.2025-2060
+
+ ```bibtex
+ @inproceedings{markitantov25_interspeech,
+ title = {{Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion}},
+ author = {Maxim Markitantov and Elena Ryumina and Heysem Kaya and Alexey Karpov},
+ year = {2025},
+ booktitle = {{Interspeech 2025}},
+ pages = {3010--3014},
+ doi = {10.21437/Interspeech.2025-2060},
+ issn = {2958-1796}
+ }
+ ```
+
+ Markitantov M., Ryumina E., Dvoynikova A., Karpov A. Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion // Information Fusion, 2026, vol. 132, article 104207. https://doi.org/10.1016/j.inffus.2026.104207
+
+ ```bibtex
+ @article{markitantov2026triplefusion,
+ title = {Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion},
+ author = {Markitantov, Maxim and Ryumina, Elena and Dvoynikova, Anastasia and Karpov, Alexey},
+ journal = {Information Fusion},
+ volume = {132},
+ pages = {104207},
+ year = {2026},
+ doi = {10.1016/j.inffus.2026.104207},
+ url = {https://doi.org/10.1016/j.inffus.2026.104207}
+ }
+ ```
emoaffectnet.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8274190b5be4355bd2f07b59f593fcdb294f9d7c563bfa9ac9e5ea06c10692d2
+ size 98562934
lefsa.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:247acfa8fb55c528917d6535428c64d8769e2e1ba396f8499112f94535271f2b
+ size 28791882
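The three-line entries recorded for `emoaffectnet.pt` and `lefsa.pt` are Git LFS pointers: they store the SHA-256 digest and byte size of the real weight file rather than the weights themselves. A downloaded checkpoint can be checked against its pointer with a sketch like the one below. It uses only the Python standard library; the helper names and the local path in the usage comment are hypothetical.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest in `pointer`."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents copied from the lefsa.pt entry above.
LEFSA_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:247acfa8fb55c528917d6535428c64d8769e2e1ba396f8499112f94535271f2b
size 28791882
"""

pointer = parse_lfs_pointer(LEFSA_POINTER)

# Usage (hypothetical local path):
# matches_pointer("lefsa.pt", pointer)
```

The size is compared first because it is cheap and catches truncated downloads before any hashing is done.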