Audio Emotion Recognition on MELD

Baseline: Mean Temporal Pooling

This repository contains a pretrained audio-only emotion recognition model evaluated on the MELD dataset.

The model uses a pretrained Wav2Vec2 encoder and simple mean pooling over temporal frames to obtain utterance-level representations.


Dataset

MELD (declare-lab/MELD)

Seven emotion classes: neutral, joy, surprise, anger, sadness, fear, disgust


Model Architecture

  • Audio encoder: facebook/wav2vec2-base
  • Pooling: Mean pooling over time frames
  • Classifier: Fully connected layer
  • Output: 7 emotion classes

This model serves as an audio-only baseline before introducing more advanced temporal or attention-based pooling mechanisms.
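The architecture above can be sketched in PyTorch. This is a minimal illustration, not the repository's actual code: the class name is hypothetical, and the encoder is passed in as a generic callable that returns frame-level features. With the transformers library's `Wav2Vec2Model`, the frame features would be taken from the model output's `last_hidden_state`.

```python
import torch
import torch.nn as nn

class MeanPoolEmotionClassifier(nn.Module):
    """Audio encoder -> mean pooling over time frames -> linear head."""

    def __init__(self, encoder, hidden_size=768, num_classes=7):
        super().__init__()
        # `encoder` maps raw audio (batch, samples) to frame features
        # (batch, frames, hidden). With transformers' Wav2Vec2Model you
        # would wrap it to return `.last_hidden_state` (assumption).
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, waveform):
        frames = self.encoder(waveform)  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)      # mean pooling over temporal frames
        return self.classifier(pooled)   # (batch, num_classes) logits
```

Mean pooling discards temporal order entirely, which is exactly what makes this a baseline for the attention-based pooling variants mentioned above.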


Training Setup (Summary)

  • Sampling rate: 16 kHz
  • Batch size: 32
  • Learning rate: 1e-4
  • Optimizer: Adam
  • Scheduler: ReduceLROnPlateau
  • Epochs: 30
  • Early stopping on validation weighted F1-score
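The optimizer, scheduler, and early-stopping logic from the summary can be sketched as follows. The training-pass body is omitted, and the `validate` callback (returning the epoch's validation weighted F1) is an assumption introduced for illustration; the patience value of 5 is also a placeholder, as the summary does not state one.

```python
import torch

def train(model, num_epochs=30, patience=5, validate=None):
    """Skeleton matching the summary: Adam @ 1e-4, ReduceLROnPlateau,
    early stopping on validation weighted F1 (higher is better)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # mode="max" because the monitored metric is an F1 score
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    best_f1, epochs_without_improvement = 0.0, 0
    for epoch in range(num_epochs):
        # ... one pass over the training set would go here ...
        val_f1 = validate(model)      # hypothetical validation callback
        scheduler.step(val_f1)        # reduce LR when F1 plateaus
        if val_f1 > best_f1:
            best_f1, epochs_without_improvement = val_f1, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                 # early stopping
    return best_f1
```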

Evaluation Metrics

  • Accuracy
  • Weighted F1-score
  • Confusion matrix
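All three metrics are available in scikit-learn; a minimal evaluation helper (the function name is hypothetical) could look like this:

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

EMOTIONS = ["neutral", "joy", "surprise", "anger", "sadness", "fear", "disgust"]

def evaluate(y_true, y_pred):
    """Compute the metrics reported for this baseline from integer labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # weighted F1 accounts for MELD's class imbalance
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        # fix the label set so the matrix is always 7x7
        "confusion_matrix": confusion_matrix(
            y_true, y_pred, labels=range(len(EMOTIONS))
        ),
    }
```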

Files

  • pytorch_model.bin – Audio model weights
  • config.json – Model configuration

Reproducibility

The full training and evaluation pipeline is available in the corresponding GitHub repository notebooks.
