---
library_name: pytorch
tags:
- chimera-ml
- lefsa
- pytorch
- audio
- video
- text
- multimodal
- emotion-recognition
- sentiment-analysis
- affective-computing
- emoaffectnet
- wav2vec2
- whisper
- jina-embeddings
datasets:
- RAMAS
- MELD
- CMU-MOSEI
---

# LEFSA Models

This repository contains LEFSA model weights for multimodal affective state recognition.
LEFSA stands for **Label Encoder Fusion Strategy with Averaging** and is designed for joint **emotion recognition** and **sentiment recognition** from audio, video, and text modalities.

## Files

- `lefsa.pt`: LEFSA checkpoint for joint audio-video-text emotion and sentiment recognition.
- `emoaffectnet.pt`: EmoAffectNet model for visual feature extraction. The original EmoAffectNet repository is available here: https://github.com/ElenaRyumina/EMO-AffectNetModel.

## What the Model Predicts

The model has two classification heads.

| Task | Number of classes | Class order |
|---|---:|---|
| Emotion recognition | 7 | `neutral`, `happy`, `sad`, `anger`, `surprise`, `disgust`, `fear` |
| Sentiment recognition | 3 | `negative`, `neutral`, `positive` |

Use this exact class order when converting logits or probabilities to labels.

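A minimal sketch of that conversion, using only the Python standard library. The helper names (`softmax`, `decode`) and the logit values are hypothetical and serve only to illustrate the index-to-label mapping; the checkpoint's actual output format is not documented here.

```python
import math

# Class orders copied from the table above; index i of each head maps to label i.
EMOTION_LABELS = ["neutral", "happy", "sad", "anger", "surprise", "disgust", "fear"]
SENTIMENT_LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, labels):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(labels):
        raise ValueError(f"expected {len(labels)} scores, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for one utterance, only to show the call pattern.
emotion_logits = [0.1, 2.3, -0.5, 0.0, 0.7, -1.2, -0.9]
sentiment_logits = [-0.8, 0.2, 1.9]

emotion, emotion_prob = decode(emotion_logits, EMOTION_LABELS)
sentiment, sentiment_prob = decode(sentiment_logits, SENTIMENT_LABELS)
print(emotion, sentiment)  # happy positive
```

With PyTorch tensors, the same two steps would typically be `torch.softmax(logits, dim=-1)` followed by `argmax(dim=-1)`. The length check guards against pairing a head with the wrong label list.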

## Model Overview

- Acoustic, visual, and linguistic features are downsampled to a common temporal representation.
- The model applies cross-modal transformer blocks to model interactions between modalities.
- A label encoder produces unimodal emotion and sentiment predictions and injects this label-level context back into the fusion module.
- In LEFSA, unimodal predictions are additionally averaged with multimodal predictions to improve robustness.

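The averaging step in the last bullet can be illustrated with a small sketch. It assumes an equal-weight mean over per-class probabilities from one multimodal head and three unimodal heads; whether the released model averages probabilities or logits, and with which weights, is not stated here, so the function name `average_predictions` and all numbers are hypothetical.

```python
def average_predictions(multimodal, unimodal):
    """Equal-weight mean of the multimodal and all unimodal probability vectors."""
    vectors = [multimodal, *unimodal.values()]
    n_classes = len(multimodal)
    if any(len(v) != n_classes for v in vectors):
        raise ValueError("all prediction vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n_classes)]


# Made-up sentiment probabilities in the order negative, neutral, positive.
unimodal = {
    "audio": [0.50, 0.30, 0.20],
    "video": [0.20, 0.50, 0.30],
    "text": [0.10, 0.20, 0.70],
}
multimodal = [0.20, 0.20, 0.60]

averaged = average_predictions(multimodal, unimodal)
labels = ["negative", "neutral", "positive"]
print(labels[max(range(len(averaged)), key=averaged.__getitem__)])  # positive
```

In this toy case the audio and video heads each favour a different class, yet the averaged vector (0.25, 0.30, 0.45) still selects `positive`, which shows how averaging can smooth out disagreement between single modalities.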

## Research Corpora

The model family was evaluated in a multilingual and multicorpus setting.

| Corpus | Language / domain | Modalities | Tasks |
|---|---|---|---|
| RAMAS | Russian, dyadic semi-spontaneous interactions | Audio, video, text | Emotion, sentiment |
| MELD | English, scripted TV-series dialogues | Audio, video, text | Emotion, sentiment |
| CMU-MOSEI | English, in-the-wild YouTube monologues | Audio, video, text | Emotion, sentiment |

Emotion labels are mapped to seven classes: neutral, happiness, sadness, anger, surprise, disgust, and fear.
Sentiment labels are mapped to three classes: negative, neutral, and positive.

### Feature and Fusion Strategy Comparison

`MF` is the macro F1-score; the number after each metric name is the number of classes. For CMU-MOSEI emotion recognition, `mMF` is the mean macro F1-score over binary positive/negative emotion classes. The `Average` column is the integral score reported in the paper.

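As a reference for how these metrics are usually computed, here is a plain-Python sketch of macro F1 and of a mean macro F1 over several binary problems. It assumes `mMF` treats each emotion as its own present/absent task and averages the two-class macro F1 across emotions; that reading, the function names, and the toy labels are assumptions, not the paper's evaluation code.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; a class with no TP, FP or FN scores 0."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


def mean_macro_f1(true_by_task, pred_by_task):
    """Average the two-class macro F1 over several binary (absent/present) tasks."""
    scores = [macro_f1(t, p, 2) for t, p in zip(true_by_task, pred_by_task)]
    return sum(scores) / len(scores)


# Toy 3-class sentiment labels (0 = negative, 1 = neutral, 2 = positive).
mf3 = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3)

# Two toy binary emotion tasks (1 = emotion present, 0 = absent).
mmf = mean_macro_f1(
    [[1, 0, 1, 0], [0, 0, 1, 1]],
    [[1, 0, 0, 0], [0, 0, 1, 1]],
)
print(f"MF3={mf3:.4f} mMF={mmf:.4f}")
```

Macro averaging gives every class the same weight regardless of how often it occurs, which matters for corpora with rare emotion classes.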

| ID | Audio | Video | Text | Fusion | RAMAS Emotion MF7 | RAMAS Sentiment MF3 | MELD Emotion MF7 | MELD Sentiment MF3 | CMU-MOSEI Emotion mMF6 | CMU-MOSEI Sentiment MF3 | Average |
|---:|---|---|---|---|---:|---:|---:|---:|---:|---:|---:|
| 1 | Wav2Vec2 | EmoAffectNet | JINA | BFS | 60.57 | 65.02 | 38.56 | 65.94 | 62.46 | 62.56 | 59.18 |
| 2 | ExHuBERT | EmoAffectNet | JINA | BFS | 62.35 | 64.02 | 38.40 | 63.93 | 62.10 | 62.32 | 58.85 |
| 3 | Wav2Vec2 | EmoAffectNet | RoBERTa | BFS | 57.64 | 67.14 | 35.48 | 65.03 | 61.58 | 60.19 | 57.84 |
| 4 | ExHuBERT | EmoAffectNet | RoBERTa | BFS | 60.54 | 64.70 | 35.05 | 63.78 | 60.78 | 59.25 | 57.35 |
| 5 | Wav2Vec2 | ResEmoteNet | JINA | BFS | 55.88 | 59.93 | 37.75 | 63.81 | 62.22 | 63.69 | 57.21 |
| 6 | ExHuBERT | ResEmoteNet | JINA | BFS | 57.21 | 61.40 | 38.25 | 64.06 | 61.11 | 60.81 | 57.14 |
| 7 | ExHuBERT | ResEmoteNet | RoBERTa | BFS | 49.62 | 54.44 | 32.56 | 62.15 | 59.88 | 60.47 | 53.19 |
| 8 | Wav2Vec2 | ResEmoteNet | RoBERTa | BFS | 52.29 | 54.87 | 34.08 | 61.65 | 58.79 | 57.42 | 53.18 |
| 9 | Wav2Vec2 | EmoAffectNet | JINA | LEFS | 61.38 | 66.57 | 39.79 | 65.70 | 62.79 | 61.59 | 59.64 |
| 10 | Wav2Vec2 | EmoAffectNet | JINA | LEFSA | **62.52** | 64.96 | **40.09** | **67.02** | 62.30 | 62.00 | **59.81** |

The best configuration (ID 10) combines **Wav2Vec2 + EmoAffectNet + JINA** features with the Label Encoder Fusion Strategy with Averaging and reaches the highest average score, 59.81.

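The `Average` column appears to be the plain arithmetic mean of the six metric columns. This is an observation from the numbers in the table, not a definition taken from the paper; the short check below reproduces it for three rows, with values copied from the table.

```python
# Six metric columns per row and the reported Average, copied from the table
# above (rows 1, 9 and 10).
rows = {
    1: ([60.57, 65.02, 38.56, 65.94, 62.46, 62.56], 59.18),
    9: ([61.38, 66.57, 39.79, 65.70, 62.79, 61.59], 59.64),
    10: ([62.52, 64.96, 40.09, 67.02, 62.30, 62.00], 59.81),
}

for row_id, (scores, reported) in rows.items():
    mean = sum(scores) / len(scores)
    # Allow 0.01 of slack for rounding in the published table.
    assert abs(mean - reported) < 0.01, (row_id, mean, reported)
    print(f"ID {row_id}: mean={mean:.3f} reported={reported:.2f}")
```

Because every column carries the same weight, a gain on one corpus can be offset by a loss on another; LEFSA leads on the average without being best in every individual column.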

### Comparison with Prior Multimodal Approaches

`ST` means single-task recognition and `MT` means multitask recognition.

| Approach | Corpus | Setup | Emotion A7 | Emotion WF7 | Sentiment A3 | Sentiment WF3 |
|---|---|---|---:|---:|---:|---:|
| Ours | RAMAS | MT | 68.99 | 67.79 | 84.11 | 84.02 |
| Zhang et al. | MELD | MT | 41.17 | 41.22 | 67.33 | 67.21 |
| Van et al. | MELD | ST | 66.28 | 65.69 | - | - |
| Hwang et al. | MELD | ST | 66.70 | 65.93 | - | - |
| Tu et al. | MELD | ST | 67.85 | 67.02 | - | - |
| Ours | MELD | MT | 62.30 | 59.79 | 69.20 | 69.02 |

| Approach | Corpus | Setup | Emotion mWA6 | Emotion mWF6 | Sentiment A2 | Sentiment WF2 |
|---|---|---|---:|---:|---:|---:|
| Chauhan et al. | CMU-MOSEI | MT | 62.97 | 79.02 | 80.37 | 78.23 |
| Sangwan et al. | CMU-MOSEI | MT | 63.16 | 79.06 | 80.15 | 78.30 |
| Hwang et al. | CMU-MOSEI | ST | - | - | 87.40 | 87.30 |
| Zheng et al. | CMU-MOSEI | ST | - | - | 85.90 | 86.00 |
| Ours | CMU-MOSEI | MT | 64.78 | 79.06 | 84.83 | 84.90 |

## Related Publications

Markitantov M., Ryumina E., Kaya H., Karpov A. Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion // In Proc. Interspeech 2025, pp. 3010–3014. https://doi.org/10.21437/Interspeech.2025-2060

```bibtex
@inproceedings{markitantov25_interspeech,
  title = {{Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion}},
  author = {Maxim Markitantov and Elena Ryumina and Heysem Kaya and Alexey Karpov},
  year = {2025},
  booktitle = {{Interspeech 2025}},
  pages = {3010--3014},
  doi = {10.21437/Interspeech.2025-2060},
  issn = {2958-1796}
}
```

Markitantov M., Ryumina E., Dvoynikova A., Karpov A. Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion // Information Fusion, 2026, vol. 132, article 104207. https://doi.org/10.1016/j.inffus.2026.104207

```bibtex
@article{markitantov2026triplefusion,
  title = {Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion},
  author = {Markitantov, Maxim and Ryumina, Elena and Dvoynikova, Anastasia and Karpov, Alexey},
  journal = {Information Fusion},
  volume = {132},
  pages = {104207},
  year = {2026},
  doi = {10.1016/j.inffus.2026.104207},
  url = {https://doi.org/10.1016/j.inffus.2026.104207}
}
```