---
tags:
- audio-classification
- speech-emotion-recognition
- tensorflow
- keras
- emotion2vec
language:
- en
license: apache-2.0
metrics:
- accuracy
---

# Speech Emotion Recognition (SER) System

## Overview

Production-quality Speech Emotion Recognition that detects **6 core emotions** from voice audio:

- **Angry** | **Disgust** | **Fear** | **Happy** | **Neutral** | **Sad**

## Architecture

**Fusion Model**: CNN + BiLSTM + Multi-Head Self-Attention (spectrogram features) + emotion2vec embeddings

### Feature Pipeline

| Feature | Dimensions |
|---------|-----------|
| Mel Spectrogram | 128 bands |
| MFCC | 40 coefficients |
| Zero Crossing Rate | 1 |
| RMS Energy | 1 |
| **Total (stacked input)** | **170 × 200 → (170, 200, 1)** |
| emotion2vec embedding | 768-dim |

### Training Data

- **CREMA-D**: 7,442 clips, 91 actors (train/val/test split provided)
- **RAVDESS**: 1,056 speech clips, 24 actors (70/15/15 split)
- **Augmentation**: pitch shift, time stretch, Gaussian noise, SpecAugment

## Results

| Model | Val Accuracy | Test Accuracy |
|-------|-------------|---------------|
| **Model 1: CNN+BiLSTM+Attention** | **56.0%** | **59.2%** |
| Fusion (CNN + emotion2vec) | 53.2% | 54.9% |
| Human baseline (audio-only) | - | 40.9% |

**Best: Model 1 (CNN+BiLSTM+Attention), 59.2% test accuracy (+18.3 pp over the human baseline)**

## Quick Start

```bash
pip install tensorflow librosa numpy funasr modelscope
```

```python
from predict import predict_emotion

# Returns the predicted label, its confidence, and the per-class probabilities
label, confidence, probs = predict_emotion("audio.wav", model_dir="./outputs")
# Prints, for example: Predicted Emotion: HAPPY, Confidence: 87.3%
```

## Download & Use Locally

```bash
# Clone the repo
git lfs install
git clone https://huggingface.co/SamOp224/speech-emotion-recognition
cd speech-emotion-recognition

# Run prediction
python outputs/predict.py your_audio.wav outputs
```

## Files

- `outputs/fusion_model.keras`: Fusion model (CNN + emotion2vec)
- `outputs/model1_cnn_bilstm_attn.keras`: CNN+BiLSTM+Attention standalone (Model 1, best test accuracy)
- `outputs/predict.py`: Prediction script with visualization
- `outputs/config.json`: Configuration and results
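
## Feature Stacking Sketch

The Feature Pipeline table lists four per-frame features that add up to 170 rows (128 + 40 + 1 + 1) over 200 frames. The following is a minimal sketch of how such arrays could be stacked into a `(170, 200, 1)` input; it is not taken from `predict.py`. The function name `stack_features`, the zero-padding of short clips, and the truncation of long clips are assumptions made for illustration. The inputs could come from, for example, `librosa.feature.melspectrogram`, `librosa.feature.mfcc`, `librosa.feature.zero_crossing_rate`, and `librosa.feature.rms` computed with the same hop length.

```python
import numpy as np

N_FRAMES = 200  # fixed time axis, taken from the feature table


def stack_features(mel, mfcc, zcr, rms, n_frames=N_FRAMES):
    """Stack per-frame features into a (170, n_frames, 1) float32 array.

    Expected shapes: mel (128, T), mfcc (40, T), zcr (1, T), rms (1, T).
    Hypothetical helper: padding and truncation strategy are assumptions.
    """
    # Concatenate along the feature axis: 128 + 40 + 1 + 1 = 170 rows
    feats = np.concatenate([mel, mfcc, zcr, rms], axis=0)

    t = feats.shape[1]
    if t < n_frames:
        # Assumption: short clips are right-padded with zeros
        feats = np.pad(feats, ((0, 0), (0, n_frames - t)), mode="constant")
    else:
        # Assumption: long clips keep only the first n_frames frames
        feats = feats[:, :n_frames]

    # Add a trailing channel axis so a 2D CNN can consume the array
    return feats[..., np.newaxis].astype(np.float32)


# Demo with random stand-in features for a clip of 137 frames
rng = np.random.default_rng(0)
T = 137
x = stack_features(
    rng.normal(size=(128, T)),
    rng.normal(size=(40, T)),
    rng.normal(size=(1, T)),
    rng.normal(size=(1, T)),
)
print(x.shape)
```

Fixing the time axis lets clips of different lengths share one input shape; a batch of such arrays has shape `(batch, 170, 200, 1)`.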