# Audio Emotion Recognition on MELD
## Baseline: Mean Temporal Pooling
This repository contains a pretrained audio-only emotion recognition model evaluated on the MELD dataset.
The model uses a pretrained Wav2Vec2 encoder and simple mean pooling over temporal frames to obtain utterance-level representations.
## Dataset

MELD (declare-lab/MELD)

Seven emotion classes: neutral, joy, surprise, anger, sadness, fear, disgust
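For clarity, the seven classes can be given integer indices. The ordering below is an assumption for illustration (the actual mapping should be taken from `config.json`):

```python
# Hypothetical label mapping for the seven MELD emotion classes.
# The index order here is an assumption; the authoritative mapping
# lives in the model's config.json.
MELD_EMOTIONS = ["neutral", "joy", "surprise", "anger", "sadness", "fear", "disgust"]

LABEL2ID = {label: i for i, label in enumerate(MELD_EMOTIONS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}
```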
## Model Architecture

- Audio encoder: facebook/wav2vec2-base
- Pooling: Mean pooling over time frames
- Classifier: Fully connected layer
- Output: 7 emotion classes
This model serves as an audio-only baseline before introducing more advanced temporal or attention-based pooling mechanisms.
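The pooling-and-classification head described above can be sketched in PyTorch as follows. This is a minimal illustration, assuming the encoder emits `(batch, frames, 768)` features as `facebook/wav2vec2-base` does; the class and variable names are not from the released code:

```python
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):
    """Mean-pools frame-level encoder features, then classifies the utterance.

    Assumes the upstream encoder (e.g. facebook/wav2vec2-base) produces a
    (batch, frames, hidden) tensor; hidden_size=768 matches wav2vec2-base.
    """

    def __init__(self, hidden_size: int = 768, num_classes: int = 7):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # Average over the temporal dimension: (batch, frames, hidden) -> (batch, hidden)
        pooled = frame_features.mean(dim=1)
        return self.classifier(pooled)  # (batch, num_classes)

# Example with stand-in "encoder outputs": 2 utterances, 50 frames each.
feats = torch.randn(2, 50, 768)
logits = MeanPoolClassifier()(feats)
print(logits.shape)  # torch.Size([2, 7])
```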
## Training Setup (Summary)
- Sampling rate: 16 kHz
- Batch size: 32
- Learning rate: 1e-4
- Optimizer: Adam
- Scheduler: ReduceLROnPlateau
- Epochs: 30
- Early stopping on validation weighted F1-score
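The optimizer and scheduler from the summary above can be wired up as in this sketch; the `patience` value and the `mode="max"` choice (stepping on validation weighted F1, which should be maximized) are assumptions not stated in the card:

```python
import torch

# Stand-in for the full audio model; only the optimizer wiring matters here.
model = torch.nn.Linear(768, 7)

# Settings from the training summary: Adam, lr=1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# ReduceLROnPlateau stepping on validation weighted F1 (higher is better,
# hence mode="max"); patience=2 is an assumed value.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=2
)

# Typical per-epoch usage after computing the validation metric:
# scheduler.step(val_weighted_f1)
```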
## Evaluation Metrics
- Accuracy
- Weighted F1-score
- Confusion matrix
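All three metrics are available in scikit-learn; a toy example with hypothetical predictions over the 7-class label space:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels: indices into the 7 emotion classes (illustrative data only).
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
wf1 = f1_score(y_true, y_pred, average="weighted")
# Pass labels= so the matrix is always 7x7, even if a class is absent.
cm = confusion_matrix(y_true, y_pred, labels=list(range(7)))

print(acc)       # 0.8
print(cm.shape)  # (7, 7)
```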
## Files

- `pytorch_model.bin`: Audio model weights
- `config.json`: Model configuration
## Reproducibility
The full training and evaluation pipeline is available in the corresponding GitHub repository notebooks.