Audio Emotion Recognition on MELD

Baseline: Mean Temporal Pooling

This repository contains a pretrained audio-only emotion recognition model evaluated on the MELD dataset.

The model uses a pretrained Wav2Vec2 encoder and simple mean pooling over temporal frames to obtain utterance-level representations.


Dataset

MELD (declare-lab/MELD)

Seven emotion classes: neutral, joy, surprise, anger, sadness, fear, disgust


Model Architecture

  • Audio encoder: facebook/wav2vec2-base
  • Pooling: Mean pooling over time frames
  • Classifier: Fully connected layer
  • Output: 7 emotion classes

This model serves as an audio-only baseline before introducing more advanced temporal or attention-based pooling mechanisms.
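The architecture above can be sketched in PyTorch. This is a minimal illustration, not the repository's actual code: the class name is hypothetical, and the encoder is passed in as a generic callable that returns frame-level features. With the transformers library's `Wav2Vec2Model`, the frame features would be taken from the model output's `last_hidden_state`.

```python
import torch
import torch.nn as nn

class MeanPoolEmotionClassifier(nn.Module):
    """Audio encoder -> mean pooling over time frames -> linear head."""

    def __init__(self, encoder, hidden_size=768, num_classes=7):
        super().__init__()
        # `encoder` maps raw audio (batch, samples) to frame features
        # (batch, frames, hidden). With transformers' Wav2Vec2Model you
        # would wrap it to return `.last_hidden_state` (assumption).
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, waveform):
        frames = self.encoder(waveform)  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)      # mean pooling over temporal frames
        return self.classifier(pooled)   # (batch, num_classes) logits
```

Mean pooling discards temporal order entirely, which is exactly what makes this a baseline for the attention-based pooling variants mentioned above.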


Training Setup (Summary)

  • Sampling rate: 16 kHz
  • Batch size: 32
  • Learning rate: 1e-4
  • Optimizer: Adam
  • Scheduler: ReduceLROnPlateau
  • Epochs: 30
  • Early stopping on validation weighted F1-score
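The optimizer, scheduler, and early-stopping logic from the summary can be sketched as follows. The training-pass body is omitted, and the `validate` callback (returning the epoch's validation weighted F1) is an assumption introduced for illustration; the patience value of 5 is also a placeholder, as the summary does not state one.

```python
import torch

def train(model, num_epochs=30, patience=5, validate=None):
    """Skeleton matching the summary: Adam @ 1e-4, ReduceLROnPlateau,
    early stopping on validation weighted F1 (higher is better)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # mode="max" because the monitored metric is an F1 score
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    best_f1, epochs_without_improvement = 0.0, 0
    for epoch in range(num_epochs):
        # ... one pass over the training set would go here ...
        val_f1 = validate(model)      # hypothetical validation callback
        scheduler.step(val_f1)        # reduce LR when F1 plateaus
        if val_f1 > best_f1:
            best_f1, epochs_without_improvement = val_f1, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                 # early stopping
    return best_f1
```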

Evaluation Metrics

  • Accuracy
  • Weighted F1-score
  • Confusion matrix
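All three metrics are available in scikit-learn; a minimal evaluation helper (the function name is hypothetical) could look like this:

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

EMOTIONS = ["neutral", "joy", "surprise", "anger", "sadness", "fear", "disgust"]

def evaluate(y_true, y_pred):
    """Compute the metrics reported for this baseline from integer labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # weighted F1 accounts for MELD's class imbalance
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        # fix the label set so the matrix is always 7x7
        "confusion_matrix": confusion_matrix(
            y_true, y_pred, labels=range(len(EMOTIONS))
        ),
    }
```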

Files

  • pytorch_model.bin – Audio model weights
  • config.json – Model configuration

Reproducibility

The full training and evaluation pipeline is available in the corresponding GitHub repository notebooks.
