---
language: en
tags:
- audio
- emotion-recognition
- speech
- pytorch
- cnn
- ravdess
license: mit
datasets:
- ravdess
metrics:
- accuracy
- f1
model-index:
- name: speech-emotion-recognition-v2
results:
- task:
type: audio-classification
name: Speech Emotion Recognition
dataset:
name: RAVDESS
type: ravdess
metrics:
- type: accuracy
value: 75.0
name: Validation Accuracy
- type: accuracy
value: 66.2
name: Test Accuracy
---
# Speech Emotion Recognition (Enhanced Model V2)
## Model Description
This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves **75% validation accuracy** and **66.2% test accuracy** on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms.
### Model Architecture
- **Type**: Convolutional Neural Network with Residual Blocks
- **Parameters**: 11,873,480
- **Input**: 196-dimensional audio features × 128 time steps
- **Output**: 8 emotion classes
**Architecture Details:**
- 4 Residual Layers (2 blocks each)
- Channel Attention Mechanisms
- Dual Global Pooling (Average + Max)
- Fully Connected Layers: 1024 → 512 → 256 → 8
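The residual-plus-attention pattern above can be sketched as follows. This is an illustrative reconstruction, not the repository's actual `ImprovedEmotionCNN` code: the class names, the squeeze-and-excitation-style attention, and the reduction factor are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight channels by learned importance

class ResidualBlock(nn.Module):
    """Two Conv-BN layers with a skip connection and channel attention."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.attn = ChannelAttention(out_ch)
        # 1x1 projection when the shapes of input and output differ
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, stride, bias=False))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.attn(self.bn2(self.conv2(out)))
        return torch.relu(out + self.skip(x))
```

Stacking four such layers (two blocks each) and feeding the dual-pooled output through the 1024 → 512 → 256 → 8 head yields the architecture described above.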
### Features (196 dimensions)
- Mel-spectrograms: 128 bands
- MFCCs: 13 coefficients
- Delta MFCCs: 13 (temporal dynamics)
- Delta-Delta MFCCs: 13 (acceleration)
- Chromagram: 12 (pitch content)
- Spectral Contrast: 7 (texture)
- Tonnetz: 6 (harmonic content)
- Additional: 4 (ZCR, centroid, rolloff, bandwidth)
## Intended Use
### Primary Use Cases
- Emotion detection from speech audio
- Affective computing research
- Human-computer interaction
- Mental health monitoring
- Call center analytics
### Out-of-Scope Use
- Real-time streaming audio (model requires 3-second clips)
- Non-speech audio (music, environmental sounds)
- Languages other than English
- Clinical diagnosis without professional oversight
## Training Data
**RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song)
- 1,440 speech files
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- 24 professional actors (12 male, 12 female)
- Controlled recording environment
- Split: 70% train, 15% validation, 15% test
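Labels for these files come straight from the filenames, assuming the standard RAVDESS naming convention of seven hyphen-separated fields (modality-channel-emotion-intensity-statement-repetition-actor), where the third field encodes the emotion as 01 through 08 in the order listed above:

```python
# Parse an emotion label from a RAVDESS filename, assuming the standard
# 7-field naming convention, e.g. "03-01-06-01-02-01-12.wav".
EMOTIONS = ['neutral', 'calm', 'happy', 'sad',
            'angry', 'fearful', 'disgust', 'surprised']

def ravdess_label(filename):
    parts = filename.split('.')[0].split('-')
    emotion_code = int(parts[2])      # third field, "01".."08"
    return EMOTIONS[emotion_code - 1]

print(ravdess_label('03-01-06-01-02-01-12.wav'))  # fearful
```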
## Performance
### Overall Metrics
| Metric | Value |
|--------|-------|
| Validation Accuracy | 75.00% |
| Test Accuracy | 66.20% |
| Macro F1-Score | 0.660 |
| Weighted F1-Score | 0.658 |
### Per-Class Performance (Test Set)
| Emotion | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| Neutral | 71.43% | 0.667 | 0.714 | 0.690 |
| Calm | 85.71% | 0.686 | 0.857 | 0.762 |
| Happy | 58.62% | 0.531 | 0.586 | 0.557 |
| Sad | 51.72% | 0.500 | 0.517 | 0.508 |
| Angry | 68.97% | 0.769 | 0.690 | 0.727 |
| Fearful | 41.38% | 0.706 | 0.414 | 0.522 |
| Disgust | 75.86% | 0.688 | 0.759 | 0.721 |
| Surprised | 79.31% | 0.793 | 0.793 | 0.793 |
### Comparison with Baseline
| Metric | Baseline | Enhanced V2 | Improvement |
|--------|----------|-------------|-------------|
| Validation Accuracy | 38.89% | 75.00% | +36.11 pp |
| Test Accuracy | 39.81% | 66.20% | +26.39 pp |
| Parameters | 536K | 11.8M | 22× |
## Usage
### Installation
```bash
pip install torch torchaudio librosa numpy
```
### Quick Start
```python
import torch
import librosa
import numpy as np
from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features
# Load model
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
# Load and process audio
features = extract_features('path/to/audio.wav')
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)  # (1, 1, 196, 128)
# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted_idx = output.argmax(1).item()
emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})")
```
## Limitations
### Known Issues
1. **Fearful Emotion**: Lower accuracy (41.38%) - often confused with other negative emotions
2. **Test-Validation Gap**: 75% validation vs 66.2% test suggests some overfitting
3. **Dataset Bias**: Trained on professional actors in controlled environment
4. **Language**: English only
5. **Audio Quality**: Requires clear speech without background noise
### Ethical Considerations
- **Privacy**: Emotion detection from voice raises privacy concerns
- **Bias**: May not generalize well across different demographics, accents, or cultures
- **Misuse**: Should not be used for surveillance or manipulation
- **Context**: Emotions are complex and context-dependent; model provides probabilities, not certainties
## Training Procedure
### Hyperparameters
```python
{
    'batch_size': 24,
    'learning_rate': 0.001,
    'epochs': 150,
    'optimizer': 'AdamW',
    'weight_decay': 1e-4,
    'loss': 'CrossEntropyLoss + Label Smoothing (0.1)',
    'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)',
    'early_stopping': 'patience=20',
    'mixed_precision': 'FP16',
    'gradient_clipping': 'max_norm=1.0'
}
```
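These hyperparameters wire together in a standard PyTorch training step roughly as sketched below. The model here is a trivial stand-in (the real `ImprovedEmotionCNN` lives in the repository), and mixed precision is omitted for brevity:

```python
import torch
import torch.nn as nn

# Stand-in model; the real ImprovedEmotionCNN is in models/emotion_cnn_v2.py
model = nn.Sequential(nn.Flatten(), nn.Linear(196 * 128, 8))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=8)

def train_step(features, labels):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    # Gradient clipping at max_norm=1.0, as in the config above
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# After each epoch, step the scheduler on the monitored metric:
# scheduler.step(val_accuracy)
```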
### Data Augmentation
- SpecAugment (time and frequency masking)
- Gaussian noise injection
- Time shifting
- Augmentation probability: 60%
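Applied to a (features × frames) array, the three augmentations above look roughly like this. The mask widths, noise scale, and shift range here are illustrative choices, not the repository's exact settings:

```python
import numpy as np

def augment(feats, rng, p=0.6):
    """SpecAugment-style masking, noise injection, and time shifting
    on a (n_features, n_frames) array (illustrative sketch)."""
    if rng.random() >= p:               # 60% augmentation probability
        return feats
    out = feats.copy()
    n_feat, n_frames = out.shape
    # Time masking: zero out a random span of frames
    t0 = rng.integers(0, n_frames - 16)
    out[:, t0:t0 + rng.integers(1, 16)] = 0.0
    # Frequency masking: zero out a random band of feature rows
    f0 = rng.integers(0, n_feat - 8)
    out[f0:f0 + rng.integers(1, 8), :] = 0.0
    # Gaussian noise injection
    out += rng.normal(0.0, 0.01, out.shape)
    # Time shifting: circularly shift the frame axis
    out = np.roll(out, rng.integers(-8, 9), axis=1)
    return out
```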
### Hardware
- GPU: NVIDIA RTX 5060 Ti
- Training Time: ~2.5 hours (150 epochs)
- CUDA: 13.0
- PyTorch: 2.0+
## Citation
If you use this model, please cite:
```bibtex
@misc{speech-emotion-recognition-v2,
title={Speech Emotion Recognition with Enhanced CNN},
author={Your Name},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}}
}
```
### RAVDESS Dataset Citation
```bibtex
@article{livingstone2018ravdess,
title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)},
author={Livingstone, Steven R and Russo, Frank A},
journal={PLoS ONE},
volume={13},
number={5},
pages={e0196391},
year={2018},
publisher={Public Library of Science}
}
```
## License
MIT License - See LICENSE file for details
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/speech-emotion-recognition).
## Acknowledgments
- RAVDESS dataset creators
- PyTorch team
- librosa developers
- Hugging Face community