tags:
  - audio-classification
  - speech-emotion-recognition
  - tensorflow
  - keras
  - emotion2vec
language:
  - en
license: apache-2.0
metrics:
  - accuracy

Speech Emotion Recognition (SER) System

Overview

Production-quality Speech Emotion Recognition that detects 6 core emotions from voice audio:

  • Angry | Disgust | Fear | Happy | Neutral | Sad

Architecture

Fusion Model: CNN + BiLSTM + Multi-Head Self-Attention (spectrogram features) + emotion2vec embeddings
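As a rough illustration of how such a fusion model could be wired in Keras, here is a minimal sketch. The filter counts, LSTM width, number of attention heads, and dense sizes are all assumptions chosen for illustration, not the values used in the released models; only the input shapes (170, 200, 1) and 768, and the 6 output classes, come from this README.

```python
from tensorflow.keras import layers, Model


def build_fusion_model(num_classes=6, emb_dim=768):
    """Sketch of a CNN + BiLSTM + self-attention branch fused with an embedding branch.

    All layer sizes below are illustrative assumptions.
    """
    # Spectrogram branch: (features, frames, channel)
    spec_in = layers.Input(shape=(170, 200, 1), name="spectrogram")
    x = spec_in
    for filters in (32, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)  # 170x200 -> 85x100 -> 42x50

    # Put the time axis first and flatten features: (50, 42 * 64)
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((50, 42 * 64))(x)

    # Temporal modelling followed by multi-head self-attention
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))
    x = layers.GlobalAveragePooling1D()(x)

    # Embedding branch: precomputed emotion2vec vector
    emb_in = layers.Input(shape=(emb_dim,), name="emotion2vec")
    e = layers.Dense(128, activation="relu")(emb_in)

    # Late fusion by concatenation, then classification
    fused = layers.Concatenate()([x, e])
    fused = layers.Dense(128, activation="relu")(fused)
    out = layers.Dense(num_classes, activation="softmax")(fused)
    return Model(inputs=[spec_in, emb_in], outputs=out)
```

Dropping the `emotion2vec` input and the `Concatenate` layer would give a standalone spectrogram model in the spirit of the CNN+BiLSTM+Attention variant.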

Feature Pipeline

Feature                  Dimensions
Mel Spectrogram          128 bands
MFCC                     40 coefficients
Zero Crossing Rate       1
RMS Energy               1
Total                    170 × 200 → (170, 200, 1)
emotion2vec embedding    768-dim
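The 170 rows are the sum of the individual features (128 + 40 + 1 + 1). A minimal sketch of how they could be stacked into the (170, 200, 1) input follows. It assumes that 200 is the number of time frames, that shorter clips are zero-padded and longer ones truncated, and that audio is loaded at 16 kHz with librosa's default frame settings; none of these choices is confirmed by this README, and the helper names are hypothetical.

```python
import numpy as np

N_FRAMES = 200  # assumed: number of time frames per example


def fix_length(feat, n_frames=N_FRAMES):
    """Zero-pad or truncate a (n_features, T) matrix along time to n_frames."""
    t = feat.shape[1]
    if t >= n_frames:
        return feat[:, :n_frames]
    return np.pad(feat, ((0, 0), (0, n_frames - t)))


def stack_features(mel, mfcc, zcr, rms):
    """Stack per-frame features into a single (170, 200, 1) array."""
    feats = np.vstack([mel, mfcc, zcr, rms])  # (128 + 40 + 1 + 1, T)
    feats = fix_length(feats)
    return feats[..., np.newaxis].astype(np.float32)


def extract_features(path, sr=16000):
    """Compute the four features with librosa (sample rate is an assumption)."""
    import librosa  # imported lazily so the helpers above work without it

    y, sr = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    )
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    zcr = librosa.feature.zero_crossing_rate(y)
    rms = librosa.feature.rms(y=y)
    return stack_features(mel, mfcc, zcr, rms)
```

With librosa's defaults all four features share the same hop length, so their frame counts match and they can be stacked directly.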

Training Data

  • CREMA-D: 7,442 clips, 91 actors (train/val/test split provided)
  • RAVDESS: 1,056 speech clips, 24 actors (70/15/15 split)
  • Augmentation: pitch shift, time stretch, Gaussian noise, SpecAugment
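Two of the augmentations listed above can be sketched with NumPy alone. The SNR level, the mask widths, and the choice to fill masked regions with zeros are assumptions for illustration, not the settings used in training. Pitch shift and time stretch are omitted here; librosa provides `librosa.effects.pitch_shift` and `librosa.effects.time_stretch` for those.

```python
import numpy as np


def add_gaussian_noise(y, snr_db=20.0, rng=None):
    """Add white Gaussian noise to a waveform at the given signal-to-noise ratio."""
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)


def spec_augment(spec, max_freq_mask=15, max_time_mask=20, rng=None):
    """Apply one frequency mask and one time mask to a (n_freq, n_time) array."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()  # leave the input untouched
    n_freq, n_time = out.shape

    # Frequency mask: zero out f consecutive rows starting at f0
    f = rng.integers(0, min(max_freq_mask, n_freq) + 1)
    f0 = rng.integers(0, n_freq - f + 1)
    out[f0:f0 + f, :] = 0.0

    # Time mask: zero out t consecutive columns starting at t0
    t = rng.integers(0, min(max_time_mask, n_time) + 1)
    t0 = rng.integers(0, n_time - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out
```

For spectrograms in dB, filling masked regions with the array's mean or minimum instead of zero may be a better fit.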

Results

Model                          Val Accuracy    Test Accuracy
CNN+BiLSTM+Attention           56.0%           59.2%
Fusion (CNN + emotion2vec)     53.2%           54.9%
Human baseline (audio-only)    -               40.9%

Best: Model 1 (CNN+BiLSTM+Attention), 59.2% test accuracy (+18.3 pp over the audio-only human baseline)

Quick Start

# Install dependencies (shell)
pip install tensorflow librosa numpy funasr modelscope

# Predict from Python
from predict import predict_emotion

label, confidence, probs = predict_emotion("audio.wav", model_dir="./outputs")
# Prints: Predicted Emotion: HAPPY, Confidence: 87.3%
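If `probs` is an array of six class probabilities, the label and confidence can be recovered from it as sketched below. This is an assumption about the return format: the README does not say whether `probs` is an array or a mapping, and the label order used here (the alphabetical order from the Overview) is also assumed. Both helper names are hypothetical.

```python
import numpy as np

# Assumed class order, taken from the emotion list in the Overview
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]


def top_emotion(probs, labels=EMOTIONS):
    """Return (label, probability) for the most likely class."""
    probs = np.asarray(probs, dtype=float)
    idx = int(np.argmax(probs))
    return labels[idx], float(probs[idx])


def ranked_emotions(probs, labels=EMOTIONS):
    """Return all (label, probability) pairs, most likely first."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    return [(labels[i], float(probs[i])) for i in order]
```

Checking the runner-up in `ranked_emotions` is a cheap way to spot low-margin predictions before acting on the top label.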

Download & Use Locally

# Clone the repo
git lfs install
git clone https://huggingface.co/SamOp224/speech-emotion-recognition
cd speech-emotion-recognition

# Run prediction
python outputs/predict.py your_audio.wav outputs
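As an alternative to git, the repository can probably also be fetched from Python with the `huggingface_hub` library. This is a minimal sketch that assumes `huggingface_hub` is installed and the repo is public; the wrapper name `download_repo` is hypothetical.

```python
def download_repo(repo_id="SamOp224/speech-emotion-recognition", local_dir=None):
    """Download every file in the model repo and return the local folder path."""
    # Imported lazily so this module loads even without huggingface_hub installed
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_repo(local_dir="speech-emotion-recognition")` should leave the `outputs` folder under that directory, ready for the prediction command above.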

Files

  • outputs/fusion_model.keras — Fusion model (CNN + emotion2vec)
  • outputs/model1_cnn_bilstm_attn.keras — CNN+BiLSTM+Attention standalone (best test accuracy, 59.2%)
  • outputs/predict.py — Prediction script with visualization
  • outputs/config.json — Configuration and results
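A quick way to check that a download is complete is to test for the four files listed above and read the configuration. This is a sketch using only the standard library; the helper name `check_outputs` is hypothetical and nothing is assumed about the keys inside `config.json`.

```python
import json
from pathlib import Path

# File names taken from the list above
EXPECTED_FILES = [
    "fusion_model.keras",
    "model1_cnn_bilstm_attn.keras",
    "predict.py",
    "config.json",
]


def check_outputs(model_dir="outputs"):
    """Return (missing_files, config), where config is None if config.json is absent."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]

    config = None
    cfg_path = root / "config.json"
    if cfg_path.is_file():
        config = json.loads(cfg_path.read_text(encoding="utf-8"))
    return missing, config
```

An empty `missing` list means all four files are present. It does not verify that the `.keras` files are full model weights rather than Git LFS pointer files.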