Early Fusion Multimodal Emotion Recognition Model

A multimodal emotion recognition model trained on the MELD dataset using early fusion of audio and text representations. The model combines complementary acoustic and semantic information at the feature level to improve emotion classification performance.

Model Summary

  • Task: Multimodal Emotion Recognition
  • Dataset: MELD
  • Audio Encoder: facebook/wav2vec2-base
  • Text Encoder: bert-base-uncased
  • Fusion Strategy: Early fusion (feature concatenation)
  • Classifier: MLP with class-weighted loss
  • Classes: 7 emotion categories
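
Both encoders are off-the-shelf Hugging Face checkpoints. As a quick sanity check (a minimal sketch; variable names are illustrative), both produce 768-dimensional hidden states, which fixes the fusion sizes used below:

  from transformers import BertModel, Wav2Vec2Model

  # Load the two pretrained encoders listed above.
  audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
  text_encoder = BertModel.from_pretrained("bert-base-uncased")

  # Both backbones emit 768-dim hidden states.
  print(audio_encoder.config.hidden_size, text_encoder.config.hidden_size)  # 768 768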

Architecture

  1. Audio Branch

    • Wav2Vec 2.0 encoder
    • Frame-level representations
    • Temporal pooling (mean + std over time)
    • Fixed-size audio embedding (768-dim)
  2. Text Branch

    • BERT encoder
    • [CLS] token representation
    • Fixed-size text embedding (768-dim)
  3. Early Fusion

    • Concatenation of audio and text embeddings
    • Joint multimodal representation (1536-dim; see the sketch after this list)
  4. Fusion Classifier

    • Fully connected MLP
    • ReLU activation and dropout
    • Softmax output layer
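
The four components above compose as in the following minimal PyTorch sketch. This is not the released training code: the class name EarlyFusionClassifier, the hidden width, the dropout rate, and the linear projection that maps the 2×768 mean+std pooling statistics back to the 768-dim audio embedding stated above are all assumptions chosen to match the listed sizes.

  import torch
  import torch.nn as nn
  from transformers import BertModel, Wav2Vec2Model

  class EarlyFusionClassifier(nn.Module):
      def __init__(self, num_classes=7, hidden_dim=256, dropout=0.3):
          super().__init__()
          self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
          self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
          # Assumed: mean+std pooling yields 2 * 768 values, so a linear layer
          # projects them back to the fixed 768-dim audio embedding.
          self.audio_proj = nn.Linear(2 * 768, 768)
          self.classifier = nn.Sequential(
              nn.Linear(768 + 768, hidden_dim),  # joint 1536-dim representation in
              nn.ReLU(),
              nn.Dropout(dropout),
              nn.Linear(hidden_dim, num_classes),
          )

      def forward(self, input_values, input_ids, attention_mask):
          # Audio branch: frame-level features, then temporal mean+std pooling.
          frames = self.audio_encoder(input_values).last_hidden_state   # (B, T, 768)
          pooled = torch.cat([frames.mean(dim=1), frames.std(dim=1)], dim=-1)
          audio_emb = self.audio_proj(pooled)                           # (B, 768)
          # Text branch: [CLS] token representation.
          text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
          text_emb = text_out.last_hidden_state[:, 0]                   # (B, 768)
          # Early fusion: concatenate at the feature level.
          fused = torch.cat([audio_emb, text_emb], dim=-1)              # (B, 1536)
          # Returns logits; the softmax is applied inside the loss during
          # training and explicitly at inference time.
          return self.classifier(fused)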

Class Imbalance Handling

The MELD dataset exhibits strong class imbalance, with neutral utterances far outnumbering minority emotions such as fear and disgust. To address this, class weights are applied in the cross-entropy loss, which improves macro-averaged emotion recognition performance.
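
A common choice is to weight each class by its inverse frequency in the training split; the exact scheme used for this model is not specified, so the helper below is only a sketch under that assumption (train_labels is a hypothetical input):

  from collections import Counter

  import torch
  import torch.nn as nn

  def make_weighted_loss(train_labels, num_classes=7):
      # train_labels: integer emotion labels (0..6) for the MELD training split.
      labels = list(train_labels)
      counts = Counter(labels)
      # Inverse-frequency weights: rare classes contribute more to the loss.
      weights = torch.tensor(
          [len(labels) / (num_classes * counts.get(c, 1)) for c in range(num_classes)],
          dtype=torch.float,
      )
      return nn.CrossEntropyLoss(weight=weights)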

Training Details

  • Audio sampling rate: 16 kHz
  • Max audio duration: 6 seconds
  • Max text length: 128 tokens
  • Optimizer: Adam
  • Loss: CrossEntropyLoss (with class weights)
  • Metrics: Accuracy, Macro F1, Weighted F1
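
These settings translate into a short preprocessing step. The sketch below assumes torchaudio for loading and resampling plus the standard Hugging Face feature extractor and tokenizer; the helper name and the max-length padding strategy are illustrative:

  import torchaudio
  from transformers import BertTokenizer, Wav2Vec2FeatureExtractor

  SAMPLE_RATE = 16_000      # 16 kHz
  MAX_AUDIO_SECONDS = 6
  MAX_TEXT_TOKENS = 128

  feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

  def preprocess(wav_path, transcript):
      waveform, sr = torchaudio.load(wav_path)
      if sr != SAMPLE_RATE:
          waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
      # Downmix to mono and truncate to the 6-second maximum.
      waveform = waveform.mean(dim=0)[: SAMPLE_RATE * MAX_AUDIO_SECONDS]
      audio = feature_extractor(waveform.numpy(), sampling_rate=SAMPLE_RATE,
                                return_tensors="pt")
      text = tokenizer(transcript, truncation=True, max_length=MAX_TEXT_TOKENS,
                       padding="max_length", return_tensors="pt")
      return audio.input_values, text.input_ids, text.attention_mask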

Usage

  • Standalone multimodal emotion classifier
  • Benchmark model for comparison with unimodal and late-fusion approaches
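
Putting the sketches above together, single-utterance inference might look like the following; the emotion label order is an assumption and should be checked against the training label encoding:

  import torch

  # MELD's seven emotion categories; this ordering is assumed, not confirmed.
  EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

  model = EarlyFusionClassifier()   # sketched in the Architecture section
  model.eval()                      # trained weights would be loaded here

  input_values, input_ids, attention_mask = preprocess("utterance.wav", "Oh. My. God.")
  with torch.no_grad():
      logits = model(input_values, input_ids, attention_mask)
  probs = torch.softmax(logits, dim=-1)
  print(EMOTIONS[probs.argmax(dim=-1).item()])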