Audio Emotion Recognition Model

An audio-based emotion recognition model trained on the MELD dataset.
It serves as a strong unimodal audio baseline and as the audio encoder in a multimodal emotion recognition system.

Model Summary

  • Task: Speech Emotion Recognition
  • Dataset: MELD
  • Backbone: facebook/wav2vec2-base
  • Pooling: Temporal pooling (mean + std over time)
  • Classifier: MLP with class-weighted loss
  • Classes: 7 MELD emotion categories (anger, disgust, fear, joy, neutral, sadness, surprise)

Architecture

  1. Wav2Vec 2.0 Encoder
    Extracts frame-level representations from raw audio.

  2. Temporal Pooling
    Mean and standard deviation pooling over the time dimension to obtain a fixed-size utterance embedding.

  3. MLP Classifier
    Fully connected layers with ReLU and dropout, followed by a softmax output layer.
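The pooling step (2) can be sketched in isolation. The snippet below is a minimal illustration, assuming frame-level encoder output of shape (T, 768) as produced by wav2vec2-base; the function name is hypothetical:

```python
import numpy as np

def temporal_pool(frames: np.ndarray) -> np.ndarray:
    """Pool variable-length frame features (T, D) into a fixed 2*D utterance embedding
    by concatenating the per-dimension mean and standard deviation over time."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Example: 300 frames of 768-dim wav2vec2-base features -> one 1536-dim embedding,
# regardless of how many frames the utterance has.
frames = np.random.randn(300, 768)
emb = temporal_pool(frames)
assert emb.shape == (1536,)
```

Because the output size no longer depends on T, utterances of any duration map to the same fixed-size input for the MLP classifier.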

Class Imbalance Handling

Class imbalance in MELD is addressed using class weights in the cross-entropy loss, improving macro-level performance on underrepresented emotions.
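One common way to derive such weights (a sketch, not necessarily the exact scheme used here) is inverse class frequency, so that rare emotions contribute more to the loss:

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Weight class c by N / (num_classes * count_c); balanced data gives all-1 weights."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]

# Toy neutral-heavy distribution: the majority class gets weight < 1,
# minority classes get weight > 1.
labels = [0] * 50 + [1] * 30 + [2] * 20
weights = inverse_frequency_weights(labels, num_classes=3)
```

The resulting list can be passed as the `weight` tensor of PyTorch's `CrossEntropyLoss`.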

Training Details

  • Sampling rate: 16 kHz
  • Max utterance length: 6 seconds
  • Optimizer: Adam
  • Loss: CrossEntropyLoss (with class weights)
  • Metrics: Accuracy, Macro F1, Weighted F1
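The 16 kHz / 6-second settings imply a fixed input length of 96,000 samples. A minimal preprocessing sketch (function name hypothetical) that truncates long clips and zero-pads short ones:

```python
import numpy as np

SAMPLE_RATE = 16_000           # Hz, per the training setup
MAX_SECONDS = 6
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 96,000 samples

def pad_or_truncate(wave: np.ndarray) -> np.ndarray:
    """Force every utterance to exactly 6 s: cut long clips, zero-pad short ones."""
    if len(wave) >= MAX_SAMPLES:
        return wave[:MAX_SAMPLES]
    return np.pad(wave, (0, MAX_SAMPLES - len(wave)))

short_clip = np.ones(16_000)    # 1 s -> padded with 80,000 zeros
long_clip = np.ones(200_000)    # 12.5 s -> truncated
assert pad_or_truncate(short_clip).shape == (96_000,)
assert pad_or_truncate(long_clip).shape == (96_000,)
```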

Usage

  • Standalone audio emotion classifier
  • Audio branch for early and late fusion in multimodal emotion recognition
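As one concrete illustration of the late-fusion use case, per-branch class probabilities can simply be averaged with a mixing weight; this is a generic sketch, not the specific fusion rule of any particular system:

```python
import numpy as np

def late_fusion(audio_probs: np.ndarray, text_probs: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Weighted average of softmax outputs from two unimodal branches;
    alpha controls how much the audio branch contributes."""
    fused = alpha * audio_probs + (1 - alpha) * text_probs
    return fused / fused.sum()   # renormalize to a valid distribution

# Toy 3-class example: audio favors class 0, text favors class 1.
audio = np.array([0.7, 0.2, 0.1])
text = np.array([0.3, 0.6, 0.1])
fused = late_fusion(audio, text)   # -> [0.5, 0.4, 0.1], prediction = class 0
```

For early fusion, the 1536-dim pooled audio embedding would instead be concatenated with the other modality's features before a joint classifier.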