Speech Emotion Recognition (CNN-BiLSTM-Attention)

Front-end: 4-block CNN that extracts features from mel spectrograms.
Mid-section: Bidirectional LSTM for temporal dependencies.
Pooling: Multi-head Attention pooling.
Back-end: Fully connected classifier.
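
To make the data flow through the four stages above concrete, here is a rough shape trace for a single utterance. It is only a sketch: the card does not give hyperparameters, so the mel-bin count, frame count, channel widths, 2x2 pooling per CNN block, and LSTM hidden size are all assumed values, and `trace_shapes` is a hypothetical helper rather than part of the released model.

```python
def trace_shapes(n_mels=128, n_frames=256, channels=(32, 64, 128, 256),
                 lstm_hidden=128, n_classes=7):
    """Return (stage, shape) pairs for one utterance; batch dimension omitted.

    Every default here is an assumption for illustration, not a value
    taken from the model card.
    """
    shapes = [("mel spectrogram", (1, n_mels, n_frames))]
    freq, time = n_mels, n_frames

    for i, ch in enumerate(channels, start=1):
        # Assumed block layout: same-padded convolution followed by
        # 2x2 max-pooling, which halves both frequency and time axes.
        freq, time = freq // 2, time // 2
        shapes.append((f"cnn block {i}", (ch, freq, time)))

    # Fold channels and frequency into one feature vector per time step
    # so the BiLSTM sees a (time, features) sequence.
    feat = channels[-1] * freq
    shapes.append(("sequence for LSTM", (time, feat)))

    # A bidirectional LSTM concatenates forward and backward hidden states.
    shapes.append(("BiLSTM output", (time, 2 * lstm_hidden)))

    # Attention pooling collapses the time axis into a single vector.
    shapes.append(("attention pooling", (2 * lstm_hidden,)))

    # The fully connected classifier emits one logit per emotion class.
    shapes.append(("classifier logits", (n_classes,)))
    return shapes


for stage, shape in trace_shapes():
    print(f"{stage:>20}: {shape}")
```

The point of the trace is that the CNN shortens the time axis before the recurrent layer, and the attention pooling is what turns a variable-length sequence into the fixed-size vector the classifier needs.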

This model was trained from scratch on the RAVDESS and TESS datasets.

Model Architecture

Class labels (output index: emotion): 0: neutral, 1: calm, 2: happy, 3: sad, 4: angry, 5: fearful, 6: disgust
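
A minimal sketch of turning the classifier output into one of these labels, assuming the model returns seven raw logits in the index order listed above. The `decode` helper and the example logits are made up for illustration and are not part of the model's published code.

```python
import math

# Index-to-emotion mapping as listed in the card.
ID2LABEL = {
    0: "neutral", 1: "calm", 2: "happy", 3: "sad",
    4: "angry", 5: "fearful", 6: "disgust",
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Map raw logits (assumed output format) to (label, probability)."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits, for illustration only.
example_logits = [0.1, -0.3, 2.4, 0.0, 1.1, -1.2, 0.2]
label, confidence = decode(example_logits)
print(label, round(confidence, 3))
```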
