
Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning

This is the official implementation of the paper: "Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning".


Note: Due to the anonymous review process, the code, model weights and dataset links will be updated here upon acceptance.


πŸ’‘ Key Contributions

  • SABER Dataset: A large-scale multimodal emotion reasoning dataset containing ~600K video clips, featuring a unique six-dimensional annotation schema.
  • SED Paradigm: Structured Evidence Decomposition forces the model to disentangle and analyze uni-modal evidence (Visual, Acoustic, etc.) before synthesizing a final emotional conclusion (a minimal sketch of this two-step flow follows this list).
  • CA-DPO: Consistency-Aware Direct Preference Optimization refines the model's judgment in modality-conflicting scenarios (e.g., a "sarcastic smile" with a "hostile tone").
  • SOTA Performance: Outperforms existing open-source baselines on EMER, EmoBench-M, and SABER-Test.
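
The exact SED prompting format is not given in this card; the snippet below is only a minimal sketch of the two-step decompose-then-synthesize flow, assuming a chat-style multimodal model behind a hypothetical `query_model` helper and evidence dimensions mirroring the six-dimensional annotation schema described further down.

```python
# Minimal sketch of Structured Evidence Decomposition (SED)-style inference.
# `query_model` is a hypothetical helper wrapping whatever multimodal chat API
# the released model ends up using; prompts and field names are illustrative.
EVIDENCE_DIMENSIONS = [
    "video description",
    "facial expression",
    "body language",
    "acoustic features",
    "speech content",
]

def sed_reasoning(video_path, query_model):
    # Step 1: elicit uni-modal evidence, one dimension at a time.
    evidence = {}
    for dim in EVIDENCE_DIMENSIONS:
        prompt = f"Describe only the {dim} in this clip, without guessing the emotion."
        evidence[dim] = query_model(video_path, prompt)

    # Step 2: synthesize a final emotional conclusion from the collected evidence.
    evidence_block = "\n".join(f"- {k}: {v}" for k, v in evidence.items())
    synthesis_prompt = (
        "Given the following evidence, reason about any cross-modal conflicts "
        f"and state the speaker's emotion with a brief justification:\n{evidence_block}"
    )
    return query_model(video_path, synthesis_prompt)
```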

πŸ“Š Data Pipeline and Model Architecture

Our data construction pipeline integrates a unified fine-grained annotation strategy with automated quality control mechanisms across three stages.

Figure 1: (a) Overview of the SABER data pipeline, featuring Raw Data Cleaning, Fine-grained Multimodal Annotation, and Instruction Generation. (b) Training Paradigm: Stage 1 (SED) for sequential grounding and Stage 2 (CA-DPO) for preference alignment in conflicting scenarios.


SABER-LLM utilizes a two-stage training paradigm to ensure robust evidence grounding.
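
The paper's exact CA-DPO objective is not reproduced in this card. The following PyTorch sketch assumes (as an assumption, not the authors' formulation) that Stage 2 weights the standard DPO loss by a per-pair consistency score between the preferred response's modality evidence and its final emotion label; all tensor inputs are hypothetical.

```python
# Sketch of a consistency-weighted DPO loss (assumed form of CA-DPO).
# policy_logps_* / ref_logps_*: summed log-probs of chosen and rejected
# responses under the policy and a frozen reference model (shape [batch]).
# consistency: score in [0, 1] for how well the chosen response's per-modality
# evidence agrees with its final emotion label (shape [batch]).
import torch
import torch.nn.functional as F

def ca_dpo_loss(policy_logps_chosen, policy_logps_rejected,
                ref_logps_chosen, ref_logps_rejected,
                consistency, beta=0.1):
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    chosen_reward = beta * (policy_logps_chosen - ref_logps_chosen)
    rejected_reward = beta * (policy_logps_rejected - ref_logps_rejected)
    margin = chosen_reward - rejected_reward
    # Assumed consistency weighting: preference pairs whose chosen response is
    # internally consistent contribute more to the gradient.
    per_pair = -F.logsigmoid(margin) * consistency
    return per_pair.mean()
```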

Model Architecture

Six-Dimensional Annotation Schema

  1. Video Description: Macro scene context.
  2. Facial Expression: Micro-expressions and gaze.
  3. Body Language: Posture, gestures, and social signals.
  4. Acoustic Features: Prosody, pitch, and tonal intensity.
  5. Speech Content: Verbatim transcripts and semantic info.
  6. Multimodal Emotion Analysis: Final holistic reasoning and causal logic.
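
For illustration only, a single annotation record covering these six dimensions might look like the hypothetical example below; the actual field names and format of the released SABER dataset may differ.

```python
# Hypothetical annotation record following the six-dimensional schema.
example_annotation = {
    "clip_id": "saber_000001",  # illustrative identifier, not a real dataset entry
    "video_description": "Two colleagues argue across an office desk.",
    "facial_expression": "Tight-lipped smile with narrowed eyes and averted gaze.",
    "body_language": "Arms crossed, leaning away from the speaker.",
    "acoustic_features": "Flat prosody, slightly raised pitch, clipped delivery.",
    "speech_content": "\"Sure, whatever you say.\"",
    "multimodal_emotion_analysis": (
        "The agreeable words conflict with the hostile tone and closed posture, "
        "indicating sarcasm masking irritation rather than genuine agreement."
    ),
}
```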

πŸ“… To-Do List

  • Release SABER-Test benchmark (1,800 clips)
  • Release SABER-LLM-7B model weights
  • Release the full SABER training dataset
  • Provide automated data annotation scripts
  • Provide quick start and inference example scripts

✨ Model Weights

The SABER-LLM-7B model weights are now available on Hugging Face!

You can download them from the following repository: https://huggingface.co/XXXXXX/XXXXX (this anonymized link will be updated upon acceptance).
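
Once the final repository ID is public, the weights can be fetched with the standard Hugging Face Hub client; the sketch below keeps the anonymized placeholder ID from this card and assumes nothing about the model's architecture or loading code.

```python
# Minimal download sketch using the Hugging Face Hub client.
# Replace the placeholder repo ID once the repository is de-anonymized; if the
# repo is gated, authenticate first (e.g. `huggingface-cli login`).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="XXXXXX/XXXXX")  # placeholder repo ID
print("Weights downloaded to:", local_dir)
```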

