Audio Emotion Recognition Model

An audio-based emotion recognition model trained on the MELD dataset.
It serves as a strong unimodal audio baseline and as the audio encoder in a multimodal emotion recognition system.

Model Summary

  • Task: Speech Emotion Recognition
  • Dataset: MELD
  • Backbone: facebook/wav2vec2-base
  • Pooling: Temporal pooling (mean + std over time)
  • Classifier: MLP with class-weighted loss
  • Classes: 7 MELD emotion categories (anger, disgust, fear, joy, neutral, sadness, surprise)

Architecture

  1. Wav2Vec 2.0 Encoder
    Extracts frame-level representations from raw audio.

  2. Temporal Pooling
    Mean and standard deviation pooling over the time dimension to obtain a fixed-size utterance embedding.

  3. MLP Classifier
    Fully connected layers with ReLU and dropout, followed by a softmax output layer.
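The pooling step (2) can be sketched in isolation. The snippet below is a minimal illustration, assuming frame-level encoder output of shape (T, 768) as produced by wav2vec2-base; the function name is hypothetical:

```python
import numpy as np

def temporal_pool(frames: np.ndarray) -> np.ndarray:
    """Pool variable-length frame features (T, D) into a fixed 2*D utterance embedding
    by concatenating the per-dimension mean and standard deviation over time."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Example: 300 frames of 768-dim wav2vec2-base features -> one 1536-dim embedding,
# regardless of how many frames the utterance has.
frames = np.random.randn(300, 768)
emb = temporal_pool(frames)
assert emb.shape == (1536,)
```

Because the output size no longer depends on T, utterances of any duration map to the same fixed-size input for the MLP classifier.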

Class Imbalance Handling

Class imbalance in MELD is addressed using class weights in the cross-entropy loss, improving macro-level performance on underrepresented emotions.
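One common way to derive such weights (a sketch, not necessarily the exact scheme used here) is inverse class frequency, so that rare emotions contribute more to the loss:

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Weight class c by N / (num_classes * count_c); balanced data gives all-1 weights."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]

# Toy neutral-heavy distribution: the majority class gets weight < 1,
# minority classes get weight > 1.
labels = [0] * 50 + [1] * 30 + [2] * 20
weights = inverse_frequency_weights(labels, num_classes=3)
```

The resulting list can be passed as the `weight` tensor of PyTorch's `CrossEntropyLoss`.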

Training Details

  • Sampling rate: 16 kHz
  • Max utterance length: 6 seconds
  • Optimizer: Adam
  • Loss: CrossEntropyLoss (with class weights)
  • Metrics: Accuracy, Macro F1, Weighted F1
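The 16 kHz / 6-second settings imply a fixed input length of 96,000 samples. A minimal preprocessing sketch (function name hypothetical) that truncates long clips and zero-pads short ones:

```python
import numpy as np

SAMPLE_RATE = 16_000           # Hz, per the training setup
MAX_SECONDS = 6
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 96,000 samples

def pad_or_truncate(wave: np.ndarray) -> np.ndarray:
    """Force every utterance to exactly 6 s: cut long clips, zero-pad short ones."""
    if len(wave) >= MAX_SAMPLES:
        return wave[:MAX_SAMPLES]
    return np.pad(wave, (0, MAX_SAMPLES - len(wave)))

short_clip = np.ones(16_000)    # 1 s -> padded with 80,000 zeros
long_clip = np.ones(200_000)    # 12.5 s -> truncated
assert pad_or_truncate(short_clip).shape == (96_000,)
assert pad_or_truncate(long_clip).shape == (96_000,)
```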

Usage

  • Standalone audio emotion classifier
  • Audio branch for early and late fusion in multimodal emotion recognition
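As one concrete illustration of the late-fusion use case, per-branch class probabilities can simply be averaged with a mixing weight; this is a generic sketch, not the specific fusion rule of any particular system:

```python
import numpy as np

def late_fusion(audio_probs: np.ndarray, text_probs: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Weighted average of softmax outputs from two unimodal branches;
    alpha controls how much the audio branch contributes."""
    fused = alpha * audio_probs + (1 - alpha) * text_probs
    return fused / fused.sum()   # renormalize to a valid distribution

# Toy 3-class example: audio favors class 0, text favors class 1.
audio = np.array([0.7, 0.2, 0.1])
text = np.array([0.3, 0.6, 0.1])
fused = late_fusion(audio, text)   # -> [0.5, 0.4, 0.1], prediction = class 0
```

For early fusion, the 1536-dim pooled audio embedding would instead be concatenated with the other modality's features before a joint classifier.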