Audio Emotion Recognition Model
An audio-based emotion recognition model trained on the MELD dataset.
It serves as a strong unimodal audio baseline and as the audio encoder in a multimodal emotion recognition system.
Model Summary
- Task: Speech Emotion Recognition
- Dataset: MELD
- Backbone: facebook/wav2vec2-base
- Pooling: Temporal pooling (mean + std over time)
- Classifier: MLP with class-weighted loss
- Classes: 7 emotion categories
Architecture
- Wav2Vec 2.0 encoder: extracts frame-level representations from raw audio.
- Temporal pooling: mean and standard deviation pooling over the time dimension to obtain a fixed-size utterance embedding.
- MLP classifier: fully connected layers with ReLU and dropout, followed by a softmax output layer.
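The pooling-plus-classifier head above can be sketched in PyTorch. This is an illustrative sketch, not the released implementation: the hidden size of 768 matches facebook/wav2vec2-base, but the MLP width (256) and dropout rate are assumptions since the card does not specify them.

```python
import torch
import torch.nn as nn

class PoolingClassifier(nn.Module):
    """Mean + std temporal pooling over wav2vec2 frame features, then an MLP head.

    Sketch only: 768 is the wav2vec2-base hidden size; the MLP width and
    dropout rate are hypothetical values, not taken from the model card.
    """

    def __init__(self, hidden=768, num_classes=7, dropout=0.3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, 256),  # 2x because mean and std are concatenated
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, frames):
        # frames: (batch, time, hidden) frame-level wav2vec2 representations
        mean = frames.mean(dim=1)
        std = frames.std(dim=1)
        pooled = torch.cat([mean, std], dim=-1)  # fixed-size utterance embedding
        return self.mlp(pooled)  # logits; softmax is applied inside the loss
```

Returning raw logits and letting `CrossEntropyLoss` apply log-softmax internally is the standard PyTorch pattern; an explicit softmax layer is only needed at inference time.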
Class Imbalance Handling
Class imbalance in MELD is addressed using class weights in the cross-entropy loss, improving macro-level performance on underrepresented emotions.
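One common way to derive such weights is inverse class frequency. The card does not state the exact weighting formula, so the scheme below (total count over `num_classes * class count`) is an assumption:

```python
import torch
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Inverse-frequency class weights: N / (num_classes * count_c).

    A common weighting scheme; the model card does not specify the exact
    formula used for this model.
    """
    counts = Counter(labels)
    n = len(labels)
    weights = [n / (num_classes * counts.get(c, 1)) for c in range(num_classes)]
    return torch.tensor(weights, dtype=torch.float)

# Usage with the weighted loss mentioned above (train_labels is hypothetical):
# weights = inverse_frequency_weights(train_labels, num_classes=7)
# criterion = torch.nn.CrossEntropyLoss(weight=weights)
```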
Training Details
- Sampling rate: 16 kHz
- Max utterance length: 6 seconds
- Optimizer: Adam
- Loss: CrossEntropyLoss (with class weights)
- Metrics: Accuracy, Macro F1, Weighted F1
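The 16 kHz sampling rate and 6-second cap imply a fixed input length of 96,000 samples per utterance. A minimal preprocessing sketch, assuming longer clips are truncated and shorter ones zero-padded (the card does not state the padding strategy):

```python
import torch

SAMPLE_RATE = 16_000   # Hz, per the training details
MAX_SECONDS = 6        # max utterance length, per the training details
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 96,000 samples

def pad_or_truncate(waveform: torch.Tensor) -> torch.Tensor:
    """Crop or zero-pad a mono 1-D waveform to exactly 6 s at 16 kHz.

    Zero-padding is an assumed strategy; the card only specifies the cap.
    """
    if waveform.shape[0] >= MAX_SAMPLES:
        return waveform[:MAX_SAMPLES]
    pad = torch.zeros(MAX_SAMPLES - waveform.shape[0], dtype=waveform.dtype)
    return torch.cat([waveform, pad])
```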
Usage
- Standalone audio emotion classifier
- Audio branch for early and late fusion in multimodal emotion recognition
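For the late-fusion use case, one simple approach is a weighted average of per-modality class probabilities. This is a hypothetical sketch (the fusion method and the `alpha` weight are not specified by the card), shown with a text branch as the second modality:

```python
import torch

def late_fusion(audio_logits, text_logits, alpha=0.5):
    """Hypothetical late fusion: weighted average of per-modality softmax
    probabilities. alpha balances the audio branch against the other modality.
    """
    p_audio = torch.softmax(audio_logits, dim=-1)
    p_text = torch.softmax(text_logits, dim=-1)
    return alpha * p_audio + (1 - alpha) * p_text  # still a valid distribution
```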