# Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning
This is the official implementation of the paper: "Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning".
**Note:** Due to the anonymous review process, the code, model weights, and dataset links will be updated here upon acceptance.
## 💡 Key Contributions
- SABER Dataset: A large-scale multimodal emotion reasoning dataset containing ~600K video clips, featuring a unique six-dimensional annotation schema.
- SED Paradigm: Structured Evidence Decomposition forces the model to disentangle and analyze uni-modal evidence (Visual, Acoustic, etc.) before synthesizing a final emotional conclusion.
- CA-DPO: Consistency-Aware Direct Preference Optimization refines the model's judgment in modality-conflicting scenarios (e.g., a "sarcastic smile" with a "hostile tone").
- SOTA Performance: Outperforms existing open-source baselines on EMER, EmoBench-M, and SABER-Test.
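CA-DPO builds on standard Direct Preference Optimization. As an illustration only, the sketch below computes the vanilla DPO loss on a single preference pair, with a hypothetical per-sample `consistency_weight` standing in for the consistency-aware term; the actual CA-DPO formulation is the one defined in the paper, and all names and values here are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected,
             beta=0.1, consistency_weight=1.0):
    """Vanilla DPO loss for one preference pair.

    The optional per-sample weight is a hypothetical stand-in for the
    consistency-aware term in CA-DPO (an assumption, not the paper's
    exact objective).
    """
    # Implicit reward margin: beta * difference of policy-vs-reference
    # log-probability ratios between the chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return consistency_weight * loss
```

When the chosen response is genuinely preferred by the policy (positive margin), the loss drops below log 2; a tie yields exactly log 2.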
## 📊 Data Pipeline and Model Architecture
Our data construction pipeline integrates a unified fine-grained annotation strategy with automated quality control mechanisms across three stages.
*Figure 1: (a) Overview of the SABER data pipeline, featuring Raw Data Cleaning, Fine-grained Multimodal Annotation, and Instruction Generation. (b) Training Paradigm: Stage 1 (SED) for sequential grounding and Stage 2 (CA-DPO) for preference alignment in conflicting scenarios.*
SABER-LLM utilizes a two-stage training paradigm to ensure robust evidence grounding.
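The Stage 1 (SED) output can be pictured as a structured response in which per-modality evidence must appear before the final conclusion. The template below is a sketch: the bracketed field names mirror the six-dimensional schema described below, but the exact wording and format are assumptions, not the released prompt.

```python
# Hypothetical SED response template: unimodal evidence fields first,
# holistic conclusion last. Field names follow the annotation schema;
# the concrete format is an assumption for illustration only.
SED_RESPONSE_TEMPLATE = """\
[Video Description]: {video_description}
[Facial Expression]: {facial_expression}
[Body Language]: {body_language}
[Acoustic Features]: {acoustic_features}
[Speech Content]: {speech_content}
[Multimodal Emotion Analysis]: {emotion_analysis}
"""
```

Ordering the fields this way forces the model to commit to each stream of evidence before synthesizing them, which is the core idea of Structured Evidence Decomposition.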
### Six-Dimensional Annotation Schema
- Video Description: Macro scene context.
- Facial Expression: Micro-expressions and gaze.
- Body Language: Posture, gestures, and social signals.
- Acoustic Features: Prosody, pitch, and tonal intensity.
- Speech Content: Verbatim transcripts and semantic info.
- Multimodal Emotion Analysis: Final holistic reasoning and causal logic.
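A single clip's annotation can be pictured as a simple record. The dataclass below is a sketch only: the field names mirror the six dimensions above, but the names, types, and `clip_id` key are assumptions, not the released data format.

```python
from dataclasses import dataclass

@dataclass
class SaberAnnotation:
    """Hypothetical container for one clip's six-dimensional annotation.

    Field names follow the schema above; types and the clip_id key are
    assumptions for illustration.
    """
    clip_id: str
    video_description: str   # macro scene context
    facial_expression: str   # micro-expressions and gaze
    body_language: str       # posture, gestures, social signals
    acoustic_features: str   # prosody, pitch, tonal intensity
    speech_content: str      # verbatim transcript and semantic info
    emotion_analysis: str    # holistic reasoning and causal logic
```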
## 📋 To-Do List
- [ ] Release SABER-Test benchmark (1,800 clips)
- [ ] Release SABER-LLM-7B model weights
- [ ] Release the full SABER training dataset
- [ ] Provide automated data annotation scripts
- [ ] Provide Quick Start and inference example scripts
## ✨ Model Weights
The SABER-LLM-7B model weights will be hosted on Hugging Face at: https://huggingface.co/XXXXXX/XXXXX (anonymized placeholder; the final link will be added upon acceptance).