Commit 57b4a79
Parent(s): cacd1f7
Initial commit
- README.md +137 -0
- emoaffectnet.pt +3 -0
- lefsa.pt +3 -0
README.md
ADDED
@@ -0,0 +1,137 @@
---
library_name: pytorch
tags:
- chimera-ml
- lefsa
- pytorch
- audio
- video
- text
- multimodal
- emotion-recognition
- sentiment-analysis
- affective-computing
- emoaffectnet
- wav2vec2
- whisper
- jina-embeddings
datasets:
- RAMAS
- MELD
- CMU-MOSEI
base_model:
- openai/whisper-base
- FacebookAI/xlm-roberta-base
- jinaai/jina-embeddings-v3
---

# LEFSA Models

This repository contains LEFSA model weights for multimodal affective state recognition.
LEFSA stands for **Label Encoder Fusion Strategy with Averaging** and is designed for joint **emotion recognition** and **sentiment recognition** from audio, video, and text modalities.

## Files

- `lefsa.pt` — LEFSA checkpoint for joint audio-video-text emotion and sentiment recognition.
- `emoaffectnet.pt` — EmoAffectNet model for visual feature extraction. The original EmoAffectNet repository is available here: https://github.com/ElenaRyumina/EMO-AffectNetModel.
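
A minimal loading sketch, assuming both files are ordinary PyTorch checkpoints that `torch.load` can read. The README does not say whether a file holds a fully pickled module or a `state_dict`, so the helper below only reports what it finds. The names `summarize_checkpoint` and `load_checkpoint` are hypothetical and not part of any published LEFSA API.

```python
from collections.abc import Mapping


def summarize_checkpoint(obj):
    """Describe a loaded checkpoint without assuming its layout."""
    if isinstance(obj, Mapping):
        # Probably a state_dict, or a dict that wraps one: list a few keys.
        keys = [str(k) for k in obj.keys()]
        return {"kind": "mapping", "num_entries": len(keys), "first_keys": keys[:5]}
    # Otherwise it may be a fully pickled module or some other object.
    return {"kind": type(obj).__name__, "num_entries": None, "first_keys": []}


def load_checkpoint(path):
    """Load a .pt file onto the CPU. Requires PyTorch, imported lazily."""
    import torch

    # Only load checkpoints from sources you trust: unpickling can run code.
    # Recent PyTorch releases default to weights_only=True, which may reject
    # a fully pickled module; that case needs the original model code.
    return torch.load(path, map_location="cpu")
```

With the files downloaded, `summarize_checkpoint(load_checkpoint("lefsa.pt"))` gives a quick look at the layout before any model code is written around it.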

## What the Model Predicts

The model has two classification heads.

| Task | Number of classes | Class order |
|---|---:|---|
| Emotion recognition | 7 | `neutral`, `happy`, `sad`, `anger`, `surprise`, `disgust`, `fear` |
| Sentiment recognition | 3 | `negative`, `neutral`, `positive` |

Use this exact class order when converting logits or probabilities to labels.
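
The conversion can be sketched in plain Python as follows. The class lists are copied from the table above. The logits are made-up values, and the assumption that each head returns a flat list of scores is ours, since the README does not document the output format.

```python
import math

# Class orders copied from the table above.
EMOTIONS = ["neutral", "happy", "sad", "anger", "surprise", "disgust", "fear"]
SENTIMENTS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, class_names):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(class_names):
        raise ValueError("logit count does not match the number of classes")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Made-up logits, for illustration only.
emotion_logits = [0.1, 2.3, -1.0, 0.4, 0.0, -0.5, -2.0]
sentiment_logits = [-0.7, 0.2, 1.9]
print(decode(emotion_logits, EMOTIONS)[0])      # happy
print(decode(sentiment_logits, SENTIMENTS)[0])  # positive
```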

## Model Overview
- Acoustic, visual, and linguistic features are downsampled to a common temporal representation.
- The model applies cross-modal transformer blocks to model interactions between modalities.
- A label encoder produces unimodal emotion and sentiment predictions and injects this label-level context back into the fusion module.
- In LEFSA, unimodal predictions are additionally averaged with multimodal predictions to improve robustness.
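
The averaging step in the last bullet can be illustrated with a small sketch. It assumes an equal-weight mean over class probabilities. The README does not state the weighting, or whether logits or probabilities are averaged, so treat this as one plausible reading and not the released implementation. `average_predictions` is a hypothetical name.

```python
def average_predictions(unimodal, multimodal):
    """Element-wise mean of unimodal and multimodal class probabilities.

    unimodal: list of probability vectors, one per modality.
    multimodal: probability vector from the fusion branch.
    """
    vectors = list(unimodal) + [multimodal]
    n_classes = len(multimodal)
    if any(len(v) != n_classes for v in vectors):
        raise ValueError("all prediction vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n_classes)]


# Made-up sentiment probabilities in the order negative, neutral, positive.
audio = [0.2, 0.5, 0.3]
video = [0.1, 0.3, 0.6]
text = [0.6, 0.3, 0.1]
fused = [0.1, 0.2, 0.7]
averaged = average_predictions([audio, video, text], fused)
```

Because every input vector sums to one, the averaged vector does too, so it can be decoded like any other probability vector.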

## Research Corpora

The model family was evaluated in a multilingual and multicorpus setting.

| Corpus | Language / domain | Modalities | Tasks |
|---|---|---|---|
| RAMAS | Russian, dyadic semi-spontaneous interactions | Audio, video, text | Emotion, sentiment |
| MELD | English, scripted TV-series dialogues | Audio, video, text | Emotion, sentiment |
| CMU-MOSEI | English, in-the-wild YouTube monologues | Audio, video, text | Emotion, sentiment |

Emotion labels are mapped to seven classes: neutral, happiness, sadness, anger, surprise, disgust, and fear.
Sentiment labels are mapped to three classes: negative, neutral, and positive.

### Feature and Fusion Strategy Comparison

`MF` is the macro F1-score. For CMU-MOSEI emotion recognition, `mMF` is the mean macro F1-score over binary positive/negative emotion classes. The `Average` column is the arithmetic mean of the six scores in each row and is the integral score reported in the paper.

| ID | Audio | Video | Text | Fusion | RAMAS Emotion MF7 | RAMAS Sentiment MF3 | MELD Emotion MF7 | MELD Sentiment MF3 | CMU-MOSEI Emotion mMF6 | CMU-MOSEI Sentiment MF3 | Average |
|---:|---|---|---|---|---:|---:|---:|---:|---:|---:|---:|
| 1 | Wav2Vec2 | EmoAffectNet | JINA | BFS | 60.57 | 65.02 | 38.56 | 65.94 | 62.46 | 62.56 | 59.18 |
| 2 | ExHuBERT | EmoAffectNet | JINA | BFS | 62.35 | 64.02 | 38.40 | 63.93 | 62.10 | 62.32 | 58.85 |
| 3 | Wav2Vec2 | EmoAffectNet | RoBERTa | BFS | 57.64 | 67.14 | 35.48 | 65.03 | 61.58 | 60.19 | 57.84 |
| 4 | ExHuBERT | EmoAffectNet | RoBERTa | BFS | 60.54 | 64.70 | 35.05 | 63.78 | 60.78 | 59.25 | 57.35 |
| 5 | Wav2Vec2 | ResEmoteNet | JINA | BFS | 55.88 | 59.93 | 37.75 | 63.81 | 62.22 | 63.69 | 57.21 |
| 6 | ExHuBERT | ResEmoteNet | JINA | BFS | 57.21 | 61.40 | 38.25 | 64.06 | 61.11 | 60.81 | 57.14 |
| 7 | ExHuBERT | ResEmoteNet | RoBERTa | BFS | 49.62 | 54.44 | 32.56 | 62.15 | 59.88 | 60.47 | 53.19 |
| 8 | Wav2Vec2 | ResEmoteNet | RoBERTa | BFS | 52.29 | 54.87 | 34.08 | 61.65 | 58.79 | 57.42 | 53.18 |
| 9 | Wav2Vec2 | EmoAffectNet | JINA | LEFS | 61.38 | 66.57 | 39.79 | 65.70 | 62.79 | 61.59 | 59.64 |
| 10 | Wav2Vec2 | EmoAffectNet | JINA | LEFSA | **62.52** | 64.96 | **40.09** | **67.02** | 62.30 | 62.00 | **59.81** |

The best configuration by average score (ID 10) combines **Wav2Vec2 + EmoAffectNet + JINA** features with the LEFSA fusion strategy.
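
For readers who want to reproduce the `MF` metric or check the `Average` column, here is a small standard-library sketch. The toy labels are invented. The zero score for a class with no true or predicted samples is one common convention, not something the README specifies.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples scores 0 (a convention).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, for illustration only.
y_true = ["negative", "neutral", "positive", "positive"]
y_pred = ["negative", "positive", "positive", "positive"]
toy_mf3 = macro_f1(y_true, y_pred, ["negative", "neutral", "positive"])

# Scores of configuration 10, copied from the table above.
row_10 = [62.52, 64.96, 40.09, 67.02, 62.30, 62.00]
average_10 = sum(row_10) / len(row_10)  # close to the reported 59.81
```

In the toy example the per-class F1 scores are 1.0, 0.0, and 0.8, so the macro average is 0.6 even though three of four predictions are correct. That shows how macro averaging penalizes a missed minority class.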

### Comparison with Prior Multimodal Approaches

`ST` means single-task recognition and `MT` means multitask recognition.

| Approach | Corpus | Setup | Emotion A7 | Emotion WF7 | Sentiment A3 | Sentiment WF3 |
|---|---|---|---:|---:|---:|---:|
| Ours | RAMAS | MT | 68.99 | 67.79 | 84.11 | 84.02 |
| Zhang et al. | MELD | MT | 41.17 | 41.22 | 67.33 | 67.21 |
| Van et al. | MELD | ST | 66.28 | 65.69 | — | — |
| Hwang et al. | MELD | ST | 66.70 | 65.93 | — | — |
| Tu et al. | MELD | ST | 67.85 | 67.02 | — | — |
| Ours | MELD | MT | 62.30 | 59.79 | 69.20 | 69.02 |

| Approach | Corpus | Setup | Emotion mWA6 | Emotion mWF6 | Sentiment A2 | Sentiment WF2 |
|---|---|---|---:|---:|---:|---:|
| Chauhan et al. | CMU-MOSEI | MT | 62.97 | 79.02 | 80.37 | 78.23 |
| Sangwan et al. | CMU-MOSEI | MT | 63.16 | 79.06 | 80.15 | 78.30 |
| Hwang et al. | CMU-MOSEI | ST | — | — | 87.40 | 87.30 |
| Zheng et al. | CMU-MOSEI | ST | — | — | 85.90 | 86.00 |
| Ours | CMU-MOSEI | MT | 64.78 | 79.06 | 84.83 | 84.90 |

## Related Publications

Markitantov M., Ryumina E., Kaya H., Karpov A. Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion // In Proc. Interspeech 2025, pp. 3010–3014. https://doi.org/10.21437/Interspeech.2025-2060

```bibtex
@inproceedings{markitantov25_interspeech,
  title = {{Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion}},
  author = {Maxim Markitantov and Elena Ryumina and Heysem Kaya and Alexey Karpov},
  year = {2025},
  booktitle = {{Interspeech 2025}},
  pages = {3010--3014},
  doi = {10.21437/Interspeech.2025-2060},
  issn = {2958-1796}
}
```

Markitantov M., Ryumina E., Dvoynikova A., Karpov A. Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion // Information Fusion, 2026, vol. 132, article 104207. https://doi.org/10.1016/j.inffus.2026.104207

```bibtex
@article{markitantov2026triplefusion,
  title = {Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion},
  author = {Markitantov, Maxim and Ryumina, Elena and Dvoynikova, Anastasia and Karpov, Alexey},
  journal = {Information Fusion},
  volume = {132},
  pages = {104207},
  year = {2026},
  doi = {10.1016/j.inffus.2026.104207},
  url = {https://doi.org/10.1016/j.inffus.2026.104207}
}
```

emoaffectnet.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8274190b5be4355bd2f07b59f593fcdb294f9d7c563bfa9ac9e5ea06c10692d2
size 98562934

lefsa.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:247acfa8fb55c528917d6535428c64d8769e2e1ba396f8499112f94535271f2b
size 28791882
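
The two `.pt` entries above are Git LFS pointer files, which record the SHA-256 digest and byte size of the real weights. As a hedged sketch, a downloaded file could be checked against its pointer as shown below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the pointer text is copied from the `lefsa.pt` entry in this commit.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Check a file's size and SHA-256 digest against a parsed pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large checkpoints are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


# Pointer for lefsa.pt, copied from this commit.
LEFSA_POINTER = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:247acfa8fb55c528917d6535428c64d8769e2e1ba396f8499112f94535271f2b\n"
    "size 28791882\n"
)
```

Calling `matches_pointer("lefsa.pt", LEFSA_POINTER)` on a local copy returns `True` only if both the size and the digest agree, which catches truncated downloads and un-smudged pointer files.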