|
|
--- |
|
|
language: en |
|
|
tags: |
|
|
- audio |
|
|
- emotion-recognition |
|
|
- speech |
|
|
- pytorch |
|
|
- cnn |
|
|
- ravdess |
|
|
license: mit |
|
|
datasets: |
|
|
- ravdess |
|
|
metrics: |
|
|
- accuracy |
|
|
- f1 |
|
|
model-index: |
|
|
- name: speech-emotion-recognition-v2 |
|
|
results: |
|
|
- task: |
|
|
type: audio-classification |
|
|
name: Speech Emotion Recognition |
|
|
dataset: |
|
|
name: RAVDESS |
|
|
type: ravdess |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 75.0 |
|
|
name: Validation Accuracy |
|
|
- type: accuracy |
|
|
value: 66.2 |
|
|
name: Test Accuracy |
|
|
--- |
|
|
|
|
|
# Speech Emotion Recognition (Enhanced Model V2) |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves **75% validation accuracy** and **66.2% test accuracy** on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms. |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Type**: Convolutional Neural Network with Residual Blocks |
|
|
- **Parameters**: 11,873,480 |
|
|
- **Input**: 196-dimensional audio features × 128 time steps |
|
|
- **Output**: 8 emotion classes |
|
|
|
|
|
**Architecture Details:** |
|
|
- 4 Residual Layers (2 blocks each) |
|
|
- Channel Attention Mechanisms |
|
|
- Dual Global Pooling (Average + Max) |
|
|
- Fully Connected Layers: 1024 → 512 → 256 → 8 |
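The authoritative implementation is `models/emotion_cnn_v2.py` (imported in the Quick Start below); the following is a minimal, hypothetical PyTorch sketch of how such an architecture could be assembled. Channel widths, kernel sizes, and dropout rates are assumptions; only the 4×2 residual block layout, channel attention, dual global pooling, and the 1024 → 512 → 256 → 8 head come from this card.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze: global average per channel
        return x * w.view(b, c, 1, 1)              # excite: rescale channels

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.attn = ChannelAttention(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False), nn.BatchNorm2d(out_ch)
        ) if (stride != 1 or in_ch != out_ch) else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.attn(self.bn2(self.conv2(out)))
        return self.relu(out + self.down(x))

class ImprovedEmotionCNN(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # 4 residual layers, 2 blocks each; the widths 64/128/256/512 are assumptions
        self.layers = nn.Sequential(
            ResidualBlock(64, 64), ResidualBlock(64, 64),
            ResidualBlock(64, 128, stride=2), ResidualBlock(128, 128),
            ResidualBlock(128, 256, stride=2), ResidualBlock(256, 256),
            ResidualBlock(256, 512, stride=2), ResidualBlock(512, 512))
        self.head = nn.Sequential(                 # 1024 -> 512 -> 256 -> 8
            nn.Linear(1024, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(256, num_classes))

    def forward(self, x):                          # x: (batch, 1, 196, 128)
        x = self.layers(self.stem(x))
        # dual global pooling: concatenate average- and max-pooled descriptors
        x = torch.cat([x.mean(dim=(2, 3)), x.amax(dim=(2, 3))], dim=1)  # (batch, 1024)
        return self.head(x)
```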
|
|
|
|
|
### Features (196 dimensions) |
|
|
|
|
|
- Mel-spectrograms: 128 bands |
|
|
- MFCCs: 13 coefficients |
|
|
- Delta MFCCs: 13 (temporal dynamics) |
|
|
- Delta-Delta MFCCs: 13 (acceleration) |
|
|
- Chromagram: 12 (pitch content) |
|
|
- Spectral Contrast: 7 (texture) |
|
|
- Tonnetz: 6 (harmonic content) |
|
|
- Additional: 4 (zero-crossing rate, spectral centroid, rolloff, bandwidth)
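These blocks sum to 196 rows per frame (128 + 13 + 13 + 13 + 12 + 7 + 6 + 4). The repository's `extract_features` in `data/prepare_data.py` is the authoritative implementation; below is a hedged librosa sketch of how such a stack could be computed, with the sample rate, hop length, normalization, and padding strategy as assumptions:

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, duration=3.0, n_frames=128):
    """Hypothetical 196-row feature stack; sr and framing defaults are assumptions."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    blocks = [
        librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
    ]
    blocks.append(librosa.feature.delta(blocks[1]))               # delta MFCCs
    blocks.append(librosa.feature.delta(blocks[1], order=2))      # delta-delta MFCCs
    blocks.append(librosa.feature.chroma_stft(y=y, sr=sr))        # 12 chroma bins
    blocks.append(librosa.feature.spectral_contrast(y=y, sr=sr))  # 7 contrast bands
    blocks.append(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr))  # 6
    blocks.append(np.vstack([                                     # 4 extra descriptors
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
    ]))
    # blocks may disagree by a frame or two; align to the shortest, then stack
    t = min(b.shape[1] for b in blocks)
    feats = np.vstack([b[:, :t] for b in blocks])                 # (196, t)
    # zero-pad or truncate to the fixed 128-frame window
    if t < n_frames:
        feats = np.pad(feats, ((0, 0), (0, n_frames - t)))
    return feats[:, :n_frames].astype(np.float32)
```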
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
- Emotion detection from speech audio |
|
|
- Affective computing research |
|
|
- Human-computer interaction |
|
|
- Mental health monitoring |
|
|
- Call center analytics |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- Real-time streaming audio (model requires 3-second clips) |
|
|
- Non-speech audio (music, environmental sounds) |
|
|
- Languages other than English |
|
|
- Clinical diagnosis without professional oversight |
|
|
|
|
|
## Training Data |
|
|
|
|
|
**RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song) |
|
|
- 1,440 speech files |
|
|
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised |
|
|
- 24 professional actors (12 male, 12 female) |
|
|
- Controlled recording environment |
|
|
- Split: 70% train, 15% validation, 15% test |
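Labels can be read directly from the RAVDESS filename convention, in which the third of seven dash-separated fields is the emotion code. A small sketch (the exact split strategy used for training is not specified here beyond the 70/15/15 ratio):

```python
from pathlib import Path

# RAVDESS filenames encode seven dash-separated fields:
# Modality-VocalChannel-Emotion-Intensity-Statement-Repetition-Actor,
# e.g. 03-01-06-01-02-01-12.wav -> emotion code 06 (fearful), actor 12.
EMOTIONS = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def label_from_filename(path):
    return EMOTIONS[Path(path).stem.split('-')[2]]
```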
|
|
|
|
|
## Performance |
|
|
|
|
|
### Overall Metrics |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Validation Accuracy | 75.00% | |
|
|
| Test Accuracy | 66.20% | |
|
|
| Macro F1-Score | 0.660 | |
|
|
| Weighted F1-Score | 0.658 | |
|
|
|
|
|
### Per-Class Performance (Test Set) |
|
|
|
|
|
| Emotion | Accuracy | Precision | Recall | F1-Score | |
|
|
|---------|----------|-----------|--------|----------| |
|
|
| Neutral | 71.43% | 0.667 | 0.714 | 0.690 | |
|
|
| Calm | 85.71% | 0.686 | 0.857 | 0.762 | |
|
|
| Happy | 58.62% | 0.531 | 0.586 | 0.557 | |
|
|
| Sad | 51.72% | 0.500 | 0.517 | 0.508 | |
|
|
| Angry | 68.97% | 0.769 | 0.690 | 0.727 | |
|
|
| Fearful | 41.38% | 0.706 | 0.414 | 0.522 | |
|
|
| Disgust | 75.86% | 0.688 | 0.759 | 0.721 | |
|
|
| Surprised | 79.31% | 0.793 | 0.793 | 0.793 | |
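A table like this can be reproduced with scikit-learn (an extra dependency beyond the install line below) once test-set predictions are collected. A hedged sketch, where `loader` is an assumed `DataLoader` yielding `(features, labels)` batches shaped as in the Quick Start:

```python
import torch
from sklearn.metrics import classification_report

EMOTIONS = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

def report(model, loader):
    """Collect test-set predictions and print per-class precision/recall/F1."""
    y_true, y_pred = [], []
    model.eval()
    with torch.no_grad():
        for features, labels in loader:
            y_pred.extend(model(features).argmax(1).tolist())
            y_true.extend(labels.tolist())
    print(classification_report(y_true, y_pred, target_names=EMOTIONS, digits=3))
```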
|
|
|
|
|
### Comparison with Baseline |
|
|
|
|
|
| Metric | Baseline | Enhanced | Improvement | |
|
|
|--------|----------|----------|-------------| |
|
|
| Validation Acc | 38.89% | 75.00% | +36.11 pp |
|
|
| Test Acc | 39.81% | 66.20% | +26.39 pp |
|
|
| Parameters | 536K | 11.8M | 22x | |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install torch torchaudio librosa numpy |
|
|
``` |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import librosa |
|
|
import numpy as np |
|
|
from models.emotion_cnn_v2 import ImprovedEmotionCNN |
|
|
from data.prepare_data import extract_features |
|
|
|
|
|
# Load model |
|
|
model = ImprovedEmotionCNN(num_classes=8) |
|
|
checkpoint = torch.load('best_model_v2.pth', map_location='cpu') |
|
|
model.load_state_dict(checkpoint['model_state_dict']) |
|
|
model.eval() |
|
|
|
|
|
# Load and process audio |
|
|
features = extract_features('path/to/audio.wav') |
|
|
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0) |
|
|
|
|
|
# Predict |
|
|
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted_idx = output.argmax(1).item()
|
|
|
|
|
emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised'] |
|
|
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})") |
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
|
|
|
### Known Issues |
|
|
|
|
|
1. **Fearful Emotion**: Lowest per-class accuracy (41.38%); fearful speech is often confused with the other negative emotions
|
|
2. **Test-Validation Gap**: The 8.8-point drop from 75.0% validation to 66.2% test accuracy suggests some overfitting to the validation split
|
|
3. **Dataset Bias**: Trained on professional actors in controlled environment |
|
|
4. **Language**: English only |
|
|
5. **Audio Quality**: Requires clear speech without background noise |
|
|
|
|
|
### Ethical Considerations |
|
|
|
|
|
- **Privacy**: Emotion detection from voice raises privacy concerns |
|
|
- **Bias**: May not generalize well across different demographics, accents, or cultures |
|
|
- **Misuse**: Should not be used for surveillance or manipulation |
|
|
- **Context**: Emotions are complex and context-dependent; model provides probabilities, not certainties |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
```python |
|
|
{ |
|
|
'batch_size': 24, |
|
|
'learning_rate': 0.001, |
|
|
'epochs': 150, |
|
|
'optimizer': 'AdamW', |
|
|
'weight_decay': 1e-4, |
|
|
'loss': 'CrossEntropyLoss + Label Smoothing (0.1)', |
|
|
'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)', |
|
|
'early_stopping': 'patience=20', |
|
|
'mixed_precision': 'FP16', |
|
|
'gradient_clipping': 'max_norm=1.0' |
|
|
} |
|
|
``` |
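A hedged sketch of how these settings wire together in a single training epoch using standard PyTorch 2.x APIs; the actual training script is not part of this card, and `loader`, `optimizer`, and `scaler` are assumed to be set up as in the trailing comments:

```python
import torch
from torch import nn

def train_one_epoch(model, loader, optimizer, scaler, device='cuda'):
    """One training epoch combining the settings listed above (illustrative)."""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)       # label smoothing 0.1
    model.train()
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.autocast(device_type='cuda', dtype=torch.float16):  # FP16 AMP
            loss = criterion(model(features), labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)                             # unscale before clipping
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()

# Assumed surrounding setup:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
#     optimizer, mode='max', factor=0.5, patience=8)
# scaler = torch.cuda.amp.GradScaler()
# per epoch: scheduler.step(val_accuracy); stop after 20 epochs without improvement
```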
|
|
|
|
|
### Data Augmentation |
|
|
|
|
|
- SpecAugment (time and frequency masking) |
|
|
- Gaussian noise injection |
|
|
- Time shifting |
|
|
- Augmentation probability: 60% |
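A minimal NumPy sketch of this pipeline, assuming it operates on the (196, 128) feature matrix; mask widths, noise scale, and shift range are illustrative, not the values used in training:

```python
import numpy as np

def augment(feats, p=0.6, rng=None):
    """Hypothetical augmentation pipeline; each clip is augmented with probability p."""
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return feats
    feats = feats.copy()
    # SpecAugment-style frequency mask, restricted to the 128 mel rows
    w = int(rng.integers(1, 16))
    f0 = int(rng.integers(0, 128 - w))
    feats[f0:f0 + w, :] = 0.0
    # SpecAugment-style time mask
    w = int(rng.integers(1, 16))
    t0 = int(rng.integers(0, feats.shape[1] - w))
    feats[:, t0:t0 + w] = 0.0
    # Gaussian noise injection
    feats = feats + rng.normal(0.0, 0.01, feats.shape).astype(feats.dtype)
    # Time shift: circular roll along the frame axis
    return np.roll(feats, int(rng.integers(-10, 11)), axis=1)
```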
|
|
|
|
|
### Hardware |
|
|
|
|
|
- GPU: NVIDIA RTX 5060 Ti |
|
|
- Training Time: ~2.5 hours (150 epochs) |
|
|
- CUDA: 13.0 |
|
|
- PyTorch: 2.0+ |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{speech-emotion-recognition-v2, |
|
|
title={Speech Emotion Recognition with Enhanced CNN}, |
|
|
author={Your Name}, |
|
|
year={2024}, |
|
|
publisher={Hugging Face}, |
|
|
howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}} |
|
|
} |
|
|
``` |
|
|
|
|
|
### RAVDESS Dataset Citation |
|
|
|
|
|
```bibtex |
|
|
@article{livingstone2018ravdess, |
|
|
title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English},
|
|
author={Livingstone, Steven R and Russo, Frank A}, |
|
|
journal={PLoS ONE}, |
|
|
volume={13}, |
|
|
number={5}, |
|
|
pages={e0196391}, |
|
|
year={2018}, |
|
|
publisher={Public Library of Science} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License; see the LICENSE file for details.
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/speech-emotion-recognition). |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- RAVDESS dataset creators |
|
|
- PyTorch team |
|
|
- librosa developers |
|
|
- Hugging Face community |
|
|
|