Speech Emotion Recognition (SER) System

Overview

A production-quality Speech Emotion Recognition system that detects six core emotions from voice/audio:

  • Angry | Disgust | Fear | Happy | Neutral | Sad

Architecture

Fusion Model: CNN + BiLSTM + Multi-Head Self-Attention (spectrogram features) + emotion2vec embeddings
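
The exact hyperparameters live in the shipped checkpoints and outputs/config.json; the Keras sketch below only illustrates the fusion topology, with assumed (not checkpoint-accurate) layer sizes:

import tensorflow as tf
from tensorflow.keras import layers

spec_in = tf.keras.Input(shape=(170, 200, 1))   # stacked spectrogram features
e2v_in = tf.keras.Input(shape=(768,))           # emotion2vec embedding

# CNN front end over (features, frames, 1)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(spec_in)
x = layers.MaxPooling2D(2)(x)                   # -> (85, 100, 32)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                   # -> (42, 50, 64)

# Time-major sequence for the BiLSTM: one 42*64 vector per frame
x = layers.Permute((2, 1, 3))(x)                # -> (50, 42, 64)
x = layers.Reshape((50, 42 * 64))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# Multi-head self-attention over time, then pool
x = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
x = layers.GlobalAveragePooling1D()(x)

# Late fusion with the emotion2vec embedding
fused = layers.Concatenate()([x, e2v_in])
out = layers.Dense(6, activation="softmax")(fused)   # 6 emotion classes

model = tf.keras.Model([spec_in, e2v_in], out)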

Feature Pipeline

Feature                  Dimensions
Mel Spectrogram          128 bands
MFCC                     40 coefficients
Zero Crossing Rate       1
RMS Energy               1
Total                    170 × 200 → (170, 200, 1)
emotion2vec embedding    768-dim
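
A minimal librosa sketch of this stacking, assuming default hop/FFT settings (the training pipeline's exact parameters may differ):

import numpy as np
import librosa

def extract_features(path, n_frames=200):
    y, sr = librosa.load(path, sr=22050)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))  # (128, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)           # (40, T)
    zcr = librosa.feature.zero_crossing_rate(y)                  # (1, T)
    rms = librosa.feature.rms(y=y)                               # (1, T)
    feats = np.vstack([mel, mfcc, zcr, rms])                     # (170, T)
    # Pad or truncate to a fixed 200 frames, then add a channel axis
    if feats.shape[1] < n_frames:
        feats = np.pad(feats, ((0, 0), (0, n_frames - feats.shape[1])))
    else:
        feats = feats[:, :n_frames]
    return feats[..., np.newaxis]                                # (170, 200, 1)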

Training Data

  • CREMA-D: 7,442 clips, 91 actors (train/val/test split provided)
  • RAVDESS: 1,056 speech clips, 24 actors (70/15/15 split)
  • Augmentation: pitch shift, time stretch, Gaussian noise, SpecAugment
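
A sketch of these augmentations with librosa and NumPy; the parameter ranges are assumptions, and SpecAugment is shown as simple frequency/time masking on the spectrogram:

import numpy as np
import librosa

def augment_waveform(y, sr):
    # Random pitch shift and time stretch (ranges are illustrative)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
    y = librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))
    # Additive Gaussian noise
    return y + np.random.normal(0.0, 0.005, size=y.shape)

def spec_augment(spec, max_f=16, max_t=20):
    # Zero out one random frequency band and one random time span
    spec = spec.copy()
    f = np.random.randint(1, max_f)
    f0 = np.random.randint(0, spec.shape[0] - f)
    spec[f0:f0 + f, :] = 0.0
    t = np.random.randint(1, max_t)
    t0 = np.random.randint(0, spec.shape[1] - t)
    spec[:, t0:t0 + t] = 0.0
    return spec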

Results

Model                         Val Accuracy   Test Accuracy
CNN+BiLSTM+Attention          56.0%          59.2%
Fusion (CNN + emotion2vec)    53.2%          54.9%
Human baseline (audio-only)   –              40.9%

Best: Model 1 (CNN+BiLSTM+Attention), 59.2% test accuracy (+18.3 pp over the audio-only human baseline)

Quick Start

# Install dependencies
pip install tensorflow librosa numpy funasr modelscope

# Predict from Python
from predict import predict_emotion

label, confidence, probs = predict_emotion("audio.wav", model_dir="./outputs")
# Prints: Predicted Emotion: HAPPY, Confidence: 87.3%
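
Assuming probs is the array of six class probabilities ordered as the emotion list above (an assumption, not a documented guarantee), the full distribution can be inspected like this:

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]
for name, p in zip(EMOTIONS, probs):
    print(f"{name:>8}: {p:.1%}")   # e.g. "   happy: 87.3%"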

Download & Use Locally

# Clone the repo
git lfs install
git clone https://huggingface.co/SamOp224/speech-emotion-recognition
cd speech-emotion-recognition

# Run prediction
python outputs/predict.py your_audio.wav outputs

Files

  • outputs/fusion_model.keras – Fusion model (CNN + emotion2vec)
  • outputs/model1_cnn_bilstm_attn.keras – CNN+BiLSTM+Attention standalone (best test accuracy)
  • outputs/predict.py – Prediction script with visualization
  • outputs/config.json – Configuration and results
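
The .keras checkpoints can also be loaded directly with Keras, e.g. for inspection or fine-tuning (if a checkpoint contains custom layers, they would need to be supplied via custom_objects):

import tensorflow as tf

# Load the standalone model; swap in fusion_model.keras for the fusion variant
model = tf.keras.models.load_model("outputs/model1_cnn_bilstm_attn.keras")
model.summary()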