---
language: en
tags:
- audio
- emotion-recognition
- speech
- pytorch
- cnn
- ravdess
license: mit
datasets:
- ravdess
metrics:
- accuracy
- f1
model-index:
- name: speech-emotion-recognition-v2
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: RAVDESS
      type: ravdess
    metrics:
    - type: accuracy
      value: 75
      name: Validation Accuracy
    - type: accuracy
      value: 66.2
      name: Test Accuracy
---
# Speech Emotion Recognition (Enhanced Model V2)

## Model Description
This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves 75% validation accuracy and 66.2% test accuracy on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms.
## Model Architecture
- Type: Convolutional Neural Network with Residual Blocks
- Parameters: 11,873,480
- Input: 196-dimensional audio features × 128 time steps
- Output: 8 emotion classes
### Architecture Details
- 4 Residual Layers (2 blocks each)
- Channel Attention Mechanisms
- Dual Global Pooling (Average + Max)
- Fully Connected Layers: 1024 → 512 → 256 → 8
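The released code lives in `models/emotion_cnn_v2.py`; the following is a minimal PyTorch sketch of the blocks listed above. The channel widths, attention reduction ratio, and dropout rates are assumptions (chosen so the dual-pooled vector is 1024-dimensional, matching the head), and the actual `ImprovedEmotionCNN` may differ in detail:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate; the reduction ratio is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.attn = ChannelAttention(out_ch)
        self.skip = nn.Identity() if (in_ch == out_ch and stride == 1) else nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return torch.relu(self.attn(self.body(x)) + self.skip(x))

class EmotionCNNSketch(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # 4 residual layers x 2 blocks, doubling channels at each downsampling layer
        self.layers = nn.Sequential(
            ResidualBlock(64, 64), ResidualBlock(64, 64),
            ResidualBlock(64, 128, stride=2), ResidualBlock(128, 128),
            ResidualBlock(128, 256, stride=2), ResidualBlock(256, 256),
            ResidualBlock(256, 512, stride=2), ResidualBlock(512, 512))
        self.avg, self.max = nn.AdaptiveAvgPool2d(1), nn.AdaptiveMaxPool2d(1)
        self.head = nn.Sequential(                        # 1024 -> 512 -> 256 -> 8
            nn.Linear(1024, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(256, num_classes))

    def forward(self, x):                                 # x: (batch, 1, 196, 128)
        h = self.layers(self.stem(x))
        h = torch.cat([self.avg(h), self.max(h)], dim=1)  # dual global pooling
        return self.head(h.flatten(1))
```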
### Features (196 dimensions)
- Mel-spectrograms: 128 bands
- MFCCs: 13 coefficients
- Delta MFCCs: 13 (temporal dynamics)
- Delta-Delta MFCCs: 13 (acceleration)
- Chromagram: 12 (pitch content)
- Spectral Contrast: 7 (texture)
- Tonnetz: 6 (harmonic content)
- Additional: 4 (ZCR, centroid, rolloff, bandwidth)
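These seven groups sum to 196 rows per time frame. For reference, a hypothetical librosa re-implementation of the feature stack follows; the repo's `extract_features` in `data/prepare_data.py` is authoritative, and the hop length, dB scaling, and zero-padding strategy here are assumptions:

```python
import numpy as np
import librosa

def extract_features_sketch(path, sr=22050, n_frames=128):
    """Build the 196 x 128 feature matrix described in the model card."""
    y, sr = librosa.load(path, sr=sr, duration=3.0)

    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))  # 128
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                                 # 13
    d1 = librosa.feature.delta(mfcc)                                                   # 13
    d2 = librosa.feature.delta(mfcc, order=2)                                          # 13
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)                                   # 12
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)                           # 7
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)            # 6
    extras = np.vstack([
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
    ])                                                                                 # 4

    feats = np.vstack([mel, mfcc, d1, d2, chroma, contrast, tonnetz, extras])  # (196, T)
    # Pad or truncate the time axis to a fixed 128 frames.
    if feats.shape[1] < n_frames:
        feats = np.pad(feats, ((0, 0), (0, n_frames - feats.shape[1])))
    return feats[:, :n_frames].astype(np.float32)
```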
## Intended Use

### Primary Use Cases
- Emotion detection from speech audio
- Affective computing research
- Human-computer interaction
- Mental health monitoring
- Call center analytics
### Out-of-Scope Use
- Real-time streaming audio (model requires 3-second clips)
- Non-speech audio (music, environmental sounds)
- Languages other than English
- Clinical diagnosis without professional oversight
## Training Data

**RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)**
- 1,440 speech files
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- 24 professional actors (12 male, 12 female)
- Controlled recording environment
- Split: 70% train / 15% validation / 15% test (a split sketch follows below)
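The split itself is not published; a hypothetical stratified 70/15/15 split with scikit-learn could look like this, where the placeholder index stands in for paths and labels parsed from the RAVDESS filename convention:

```python
from sklearn.model_selection import train_test_split

# Placeholder index; real paths/labels come from parsing RAVDESS filenames,
# whose third field encodes the emotion (01=neutral ... 08=surprised).
files = [f"Actor_01/03-01-0{e}-01-01-01-01.wav" for e in range(1, 9)] * 20
labels = [e - 1 for e in range(1, 9)] * 20

# 70% train, then split the remaining 30% evenly into validation and test.
train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    rest_f, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
```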
## Performance

### Overall Metrics
| Metric | Value |
|---|---|
| Validation Accuracy | 75.00% |
| Test Accuracy | 66.20% |
| Macro F1-Score | 0.660 |
| Weighted F1-Score | 0.658 |
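Here the macro F1 is the unweighted mean of the eight per-class F1 scores, while the weighted F1 weights each class by its test-set support. Assuming scikit-learn (not among the listed dependencies), both reduce to:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Placeholder labels; in practice y_true/y_pred come from the test split.
y_true = np.array([0, 1, 2, 2, 5, 7])
y_pred = np.array([0, 1, 2, 3, 4, 7])

print(accuracy_score(y_true, y_pred))             # overall accuracy
print(f1_score(y_true, y_pred, average='macro'))  # unweighted per-class mean
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class support
```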
### Per-Class Performance (Test Set)
| Emotion | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Neutral | 71.43% | 0.667 | 0.714 | 0.690 |
| Calm | 85.71% | 0.686 | 0.857 | 0.762 |
| Happy | 58.62% | 0.531 | 0.586 | 0.557 |
| Sad | 51.72% | 0.500 | 0.517 | 0.508 |
| Angry | 68.97% | 0.769 | 0.690 | 0.727 |
| Fearful | 41.38% | 0.706 | 0.414 | 0.522 |
| Disgust | 75.86% | 0.688 | 0.759 | 0.721 |
| Surprised | 79.31% | 0.793 | 0.793 | 0.793 |
### Comparison with Baseline
| Metric | Baseline | Enhanced | Improvement |
|---|---|---|---|
| Validation Acc | 38.89% | 75.00% | +36.11% |
| Test Acc | 39.81% | 66.20% | +26.39% |
| Parameters | 536K | 11.8M | 22x |
## Usage

### Installation
```bash
pip install torch torchaudio librosa numpy
```
### Quick Start
```python
import torch
import librosa
import numpy as np
from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features

# Load model
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load and process audio
features = extract_features('path/to/audio.wav')
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)  # (1, 1, 196, 128)

# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)

predicted_idx = output.argmax(1).item()
emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})")
```
## Limitations

### Known Issues
- Fearful Emotion: lowest per-class accuracy (41.38%); frequently confused with the other negative emotions
- Validation-Test Gap: 75% validation vs. 66.2% test accuracy suggests some overfitting
- Dataset Bias: Trained on professional actors in controlled environment
- Language: English only
- Audio Quality: Requires clear speech without background noise
### Ethical Considerations
- Privacy: Emotion detection from voice raises privacy concerns
- Bias: May not generalize well across different demographics, accents, or cultures
- Misuse: Should not be used for surveillance or manipulation
- Context: Emotions are complex and context-dependent; model provides probabilities, not certainties
## Training Procedure

### Hyperparameters
```python
{
    'batch_size': 24,
    'learning_rate': 0.001,
    'epochs': 150,
    'optimizer': 'AdamW',
    'weight_decay': 1e-4,
    'loss': 'CrossEntropyLoss + Label Smoothing (0.1)',
    'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)',
    'early_stopping': 'patience=20',
    'mixed_precision': 'FP16',
    'gradient_clipping': 'max_norm=1.0'
}
```
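Assembled, this configuration corresponds roughly to the following loop. This is a sketch: the data loader and the metric the scheduler monitors are assumptions, and `EmotionCNNSketch` is the hypothetical model sketched under Architecture Details:

```python
import torch
import torch.nn as nn

model = EmotionCNNSketch(num_classes=8).cuda()        # hypothetical model from above
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing 0.1
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=8)    # assumed to track val accuracy
scaler = torch.cuda.amp.GradScaler()                  # FP16 mixed precision

for features, targets in train_loader:                # train_loader is assumed
    features, targets = features.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(features), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                        # unscale before clipping
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
# After each validation epoch: scheduler.step(val_accuracy)
```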
### Data Augmentation
- SpecAugment (time and frequency masking)
- Gaussian noise injection
- Time shifting
- Augmentation probability: 60%
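A sketch of how these augmentations could be applied to a single (196, 128) feature tensor, using torchaudio's SpecAugment-style masking transforms; the mask widths, noise scale, shift range, and the circular time shift are all assumptions:

```python
import numpy as np
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=20)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)

def augment(feats: torch.Tensor, p: float = 0.6) -> torch.Tensor:
    """Apply the full augmentation stack with probability p."""
    if np.random.rand() > p:
        return feats
    feats = time_mask(freq_mask(feats))             # SpecAugment masking
    feats = feats + 0.01 * torch.randn_like(feats)  # Gaussian noise injection
    shift = int(np.random.randint(-10, 11))
    return torch.roll(feats, shifts=shift, dims=-1)  # time shift (circular)
```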
### Hardware
- GPU: NVIDIA RTX 5060 Ti
- Training Time: ~2.5 hours (150 epochs)
- CUDA: 13.0
- PyTorch: 2.0+
## Citation
If you use this model, please cite:
```bibtex
@misc{speech-emotion-recognition-v2,
  title={Speech Emotion Recognition with Enhanced CNN},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}}
}
```
### RAVDESS Dataset Citation
```bibtex
@article{livingstone2018ravdess,
  title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English},
  author={Livingstone, Steven R and Russo, Frank A},
  journal={PLoS ONE},
  volume={13},
  number={5},
  pages={e0196391},
  year={2018},
  publisher={Public Library of Science}
}
```
## License

MIT License; see the LICENSE file for details.
## Contact
For questions or issues, please open an issue on the GitHub repository.
## Acknowledgments
- RAVDESS dataset creators
- PyTorch team
- librosa developers
- Hugging Face community