|
|
--- |
|
|
language: en |
|
|
tags: |
|
|
- audio |
|
|
- emotion-recognition |
|
|
- speech |
|
|
- pytorch |
|
|
- cnn |
|
|
- ravdess |
|
|
license: mit |
|
|
datasets: |
|
|
- ravdess |
|
|
metrics: |
|
|
- accuracy |
|
|
- f1 |
|
|
library_name: pytorch |
|
|
pipeline_tag: audio-classification |
|
|
--- |
|
|
|
|
|
# Speech Emotion Recognition
|
|
|
|
|
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/)

[![PyTorch](https://img.shields.io/badge/PyTorch-ee4c2c.svg)](https://pytorch.org/)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
|
|
|
|
|
A production-ready deep learning system for detecting emotions from speech using the RAVDESS dataset. **Achieved 75% validation accuracy** through an enhanced CNN architecture with residual connections, attention mechanisms, and comprehensive data augmentation.
|
|
|
|
|
## Project Achievements
|
|
|
|
|
✅ **Primary Goal Met**: 75% validation accuracy (66.2% test accuracy)

✅ **Enhanced Features**: 196-dimensional feature vectors

✅ **Advanced Architecture**: 11.8M parameter CNN with residual blocks and attention

✅ **Production Ready**: Complete pipeline from data to deployment
|
|
|
|
|
## Results Summary
|
|
|
|
|
| Metric | Baseline Model | Enhanced Model | Improvement |
|--------|---------------|----------------|-------------|
| **Validation Accuracy** | 38.89% | **75.00%** | **+36.11%** |
| **Test Accuracy** | 39.81% | **66.20%** | **+26.39%** |
| **Parameters** | 536K | 11.8M | 22x larger |
| **Features** | 143 | 196 | +37% richer |
|
|
|
|
|
### Per-Class Performance (Test Set) |
|
|
|
|
|
| Emotion | Baseline | Enhanced | Improvement | Status |
|---------|----------|----------|-------------|--------|
| Neutral | 78.57% | 71.43% | -7.14% | ✅ Good |
| Calm | 85.71% | 85.71% | +0.00% | ✅ Excellent |
| **Happy** | 6.90% | **58.62%** | **+51.72%** | Huge gain |
| **Sad** | 0.00% | **51.72%** | **+51.72%** | Huge gain |
| Angry | 31.03% | 68.97% | +37.94% | ✅ Major gain |
| Fearful | 13.79% | 41.38% | +27.59% | ✅ Good gain |
| Disgust | 68.97% | 75.86% | +6.89% | ✅ Improved |
| Surprised | 55.17% | 79.31% | +24.14% | ✅ Major gain |
|
|
|
|
|
## Quick Start
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
# Clone the repository |
|
|
git clone https://github.com/yourusername/speech-emotion-recognition.git |
|
|
cd speech-emotion-recognition |
|
|
|
|
|
# Create conda environment |
|
|
conda create -n voice_ai python=3.10 |
|
|
conda activate voice_ai |
|
|
|
|
|
# Install dependencies |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Usage |
|
|
|
|
|
#### 1. Download Dataset |
|
|
```bash |
|
|
python data/download_dataset.py |
|
|
``` |
|
|
|
|
|
#### 2. Prepare Features |
|
|
```bash |
|
|
python data/prepare_data.py |
|
|
``` |
|
|
|
|
|
#### 3. Train Enhanced Model |
|
|
```bash |
|
|
python models/train_v2.py |
|
|
``` |
|
|
|
|
|
#### 4. Evaluate Model |
|
|
```bash |
|
|
python models/evaluate_v2.py |
|
|
``` |
|
|
|
|
|
#### 5. Run Streamlit Demo |
|
|
```bash |
|
|
streamlit run deployment/app.py |
|
|
``` |
|
|
|
|
|
### Quick Inference |
|
|
|
|
|
```python
import torch

from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features

# Load the trained model
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('results/best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Extract features from audio and add batch + channel dimensions
features = extract_features('path/to/audio.wav')
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)

# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted = output.argmax(dim=1).item()  # plain int, usable as list index

emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted emotion: {emotions[predicted]}")
print(f"Confidence: {probs[0][predicted].item():.2%}")
```
|
|
|
|
|
## Architecture
|
|
|
|
|
### Enhanced Model (V2) - 75% Accuracy |
|
|
|
|
|
**Features (196 dimensions):** |
|
|
- Mel-spectrograms: 128 bands |
|
|
- MFCCs: 13 coefficients |
|
|
- Delta MFCCs: 13 (temporal dynamics) |
|
|
- Delta-Delta MFCCs: 13 (acceleration) |
|
|
- Chromagram: 12 (pitch content) |
|
|
- Spectral Contrast: 7 (texture) |
|
|
- Tonnetz: 6 (harmonic content) |
|
|
- Additional: 4 (ZCR, centroid, rolloff, bandwidth) |
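
The exact extraction lives in `data/prepare_data.py`. A minimal sketch of how a 196-row feature matrix along these lines can be assembled with librosa (the 128-frame padding and the default parameters are illustrative assumptions, not necessarily the repository's exact choices):

```python
import numpy as np
import librosa

def extract_features_sketch(path: str, n_frames: int = 128) -> np.ndarray:
    """Stack per-frame features into a (196, n_frames) matrix."""
    y, sr = librosa.load(path, sr=22050)

    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))  # 128
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                                 # 13
    d_mfcc = librosa.feature.delta(mfcc)                                               # 13
    dd_mfcc = librosa.feature.delta(mfcc, order=2)                                     # 13
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)                                   # 12
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)                           # 7
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)                                      # 6
    extras = np.vstack([                                                               # 4
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
    ])

    feats = np.vstack([mel, mfcc, d_mfcc, dd_mfcc, chroma, contrast, tonnetz, extras])  # (196, T)

    # Pad or truncate the time axis to a fixed length
    if feats.shape[1] < n_frames:
        feats = np.pad(feats, ((0, 0), (0, n_frames - feats.shape[1])))
    return feats[:, :n_frames]
```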
|
|
|
|
|
**Model Architecture:** |
|
|
```
Input (1, 196, 128)
        ↓
Conv2d 7×7, stride 2 → 64 channels
        ↓
Residual Block × 2 (64 channels) + Channel Attention
        ↓
Residual Block × 2 (128 channels) + Channel Attention
        ↓
Residual Block × 2 (256 channels) + Channel Attention
        ↓
Residual Block × 2 (512 channels) + Channel Attention
        ↓
Dual Global Pooling (Avg + Max) → 1024 features
        ↓
FC 1024 → 512 → 256 → 8 (emotions)

Total Parameters: 11,873,480
```
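
A condensed sketch of the two building blocks in the diagram, a residual block gated by squeeze-and-excitation-style channel attention (`emotion_cnn_v2.py` holds the actual implementation; the names and reduction ratio here are assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate over channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight each frequency-band channel

class ResidualBlock(nn.Module):
    """Two 3x3 convs with batch norm, channel attention, and a skip connection."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.attn = ChannelAttention(out_ch)
        self.skip = nn.Identity() if (stride == 1 and in_ch == out_ch) else nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch)
        )

    def forward(self, x):
        return torch.relu(self.attn(self.body(x)) + self.skip(x))
```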
|
|
|
|
|
**Key Improvements:** |
|
|
- ✅ Residual connections for deeper learning
- ✅ Channel attention mechanisms
- ✅ Dual pooling (average + max)
- ✅ Batch normalization throughout
- ✅ Dropout (0.4) for regularization
|
|
|
|
|
### Baseline Model (V1) - 39% Accuracy |
|
|
|
|
|
**Features (143 dimensions):** |
|
|
- Mel-spectrograms: 128 |
|
|
- MFCCs: 13 |
|
|
- ZCR: 1 |
|
|
- Spectral Centroid: 1 |
|
|
|
|
|
**Model Architecture:** |
|
|
- 3 Conv blocks (64 → 128 → 256)
|
|
- Global average pooling |
|
|
- FC layers: 256 → 128 → 8
|
|
- Total Parameters: 536,584 |
|
|
|
|
|
## Project Structure
|
|
|
|
|
```
speech-emotion-recognition/
├── data/
│   ├── download_dataset.py        # RAVDESS dataset downloader
│   ├── prepare_data.py            # Enhanced feature extraction (196 features)
│   ├── dataset.py                 # PyTorch Dataset with train/val/test splits
│   └── augmentation.py            # Data augmentation (SpecAugment, noise, etc.)
│
├── models/
│   ├── emotion_cnn.py             # Baseline CNN (536K params)
│   ├── emotion_cnn_v2.py          # Enhanced CNN (11.8M params) ⭐
│   ├── train.py                   # Baseline training script
│   ├── train_v2.py                # Enhanced training script ⭐
│   ├── evaluate.py                # Baseline evaluation
│   └── evaluate_v2.py             # Enhanced evaluation ⭐
│
├── deployment/
│   ├── app.py                     # Streamlit demo application
│   └── requirements.txt           # Deployment dependencies
│
├── notebooks/
│   └── emotion_eda.ipynb          # Exploratory analysis + model comparison
│
├── results/
│   ├── best_model.pth             # Baseline model weights
│   ├── best_model_v2.pth          # Enhanced model weights ⭐
│   ├── confusion_matrix_v2.png    # Confusion matrix visualization
│   ├── per_class_accuracy_v2.png  # Per-class performance chart
│   └── model_comparison.png       # Baseline vs Enhanced comparison
│
├── runs/                          # TensorBoard logs
├── README.md                      # This file
├── requirements.txt               # Python dependencies
└── LICENSE                        # MIT License
```
|
|
|
|
|
## Technical Details
|
|
|
|
|
### Dataset: RAVDESS |
|
|
|
|
|
**Ryerson Audio-Visual Database of Emotional Speech and Song** |
|
|
- 1,440 speech files |
|
|
- 8 emotion classes (neutral, calm, happy, sad, angry, fearful, disgust, surprised) |
|
|
- 24 professional actors (12 male, 12 female) |
|
|
- Controlled recording environment |
|
|
- Download: https://zenodo.org/record/1188976 |
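
RAVDESS filenames encode their labels as seven hyphen-separated fields (modality, vocal channel, emotion, intensity, statement, repetition, actor), with emotion codes 01–08. A small sketch of mapping a filename to its label (the helper name is illustrative, not the repository's API):

```python
EMOTIONS = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}

def emotion_from_filename(filename: str) -> str:
    # e.g. "03-01-06-01-02-01-12.wav" -> emotion field "06" -> "fearful"
    parts = filename.removesuffix('.wav').split('-')
    return EMOTIONS[parts[2]]
```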
|
|
|
|
|
### Training Configuration (Enhanced Model) |
|
|
|
|
|
```python |
|
|
config = { |
|
|
'batch_size': 24, |
|
|
'learning_rate': 0.001, |
|
|
'epochs': 150, |
|
|
'optimizer': 'AdamW', |
|
|
'weight_decay': 1e-4, |
|
|
'loss': 'CrossEntropyLoss + Label Smoothing (0.1)', |
|
|
'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)', |
|
|
'early_stopping': 'patience=20', |
|
|
'mixed_precision': 'FP16', |
|
|
'gradient_clipping': 'max_norm=1.0', |
|
|
'data_augmentation': True |
|
|
} |
|
|
``` |
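
A skeletal sketch of how these settings fit together in a PyTorch training step (an assumption-laden outline, not `train_v2.py`'s exact code; `train_loader` is a standard `DataLoader` over the prepared features):

```python
import torch
from torch import nn
from models.emotion_cnn_v2 import ImprovedEmotionCNN

model = ImprovedEmotionCNN(num_classes=8).cuda()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=8)
scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

for features, labels in train_loader:
    features, labels = features.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(features), labels)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true gradient norms
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()

# once per epoch, after computing val_loss:
# scheduler.step(val_loss)
```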
|
|
|
|
|
### Data Augmentation |
|
|
|
|
|
- **SpecAugment**: Time and frequency masking |
|
|
- **Gaussian Noise**: Random noise injection |
|
|
- **Time Shifting**: Temporal variations |
|
|
- **Augmentation Probability**: 60% |
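
A minimal SpecAugment-style masking sketch operating on a (features × time) matrix (mask widths and counts are illustrative; `data/augmentation.py` is the reference implementation):

```python
import numpy as np

def spec_augment(feats: np.ndarray, n_masks: int = 2,
                 max_f: int = 12, max_t: int = 16) -> np.ndarray:
    """Zero out random frequency bands and time spans of an (F, T) matrix."""
    out = feats.copy()
    F, T = out.shape
    for _ in range(n_masks):
        f = np.random.randint(0, max_f)   # frequency mask width
        f0 = np.random.randint(0, F - f)
        out[f0:f0 + f, :] = 0.0
        t = np.random.randint(0, max_t)   # time mask width
        t0 = np.random.randint(0, T - t)
        out[:, t0:t0 + t] = 0.0
    return out

# applied stochastically, e.g. with probability 0.6 per training sample:
# if np.random.rand() < 0.6: feats = spec_augment(feats)
```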
|
|
|
|
|
### Hardware Requirements |
|
|
|
|
|
- **Recommended**: NVIDIA GPU with 8GB+ VRAM |
|
|
- **Tested on**: RTX 5060 Ti |
|
|
- **Training Time**: ~2.5 hours (150 epochs) |
|
|
- **Inference**: <1 second per file |
|
|
|
|
|
## Monitoring & Visualization
|
|
|
|
|
### TensorBoard |
|
|
|
|
|
```bash |
|
|
tensorboard --logdir=runs/ |
|
|
``` |
|
|
|
|
|
View real-time training metrics: |
|
|
- Training/validation loss |
|
|
- Training/validation accuracy |
|
|
- Learning rate schedule |
|
|
- Per-class accuracy |
|
|
|
|
|
### Generated Visualizations |
|
|
|
|
|
- **Confusion Matrix**: Shows emotion confusion patterns |
|
|
- **Per-Class Accuracy**: Bar chart of individual emotion performance |
|
|
- **Model Comparison**: Baseline vs Enhanced side-by-side |
|
|
|
|
|
## Key Learnings
|
|
|
|
|
### What Worked |
|
|
|
|
|
1. **Enhanced Features**: Delta MFCCs and Chromagram were crucial for distinguishing similar emotions |
|
|
2. **Residual Connections**: Enabled much deeper learning without degradation |
|
|
3. **Channel Attention**: Helped model focus on important frequency bands |
|
|
4. **Data Augmentation**: SpecAugment significantly improved generalization |
|
|
5. **Label Smoothing**: Prevented overconfidence and improved calibration |
|
|
|
|
|
### Challenges Overcome |
|
|
|
|
|
- **Happy vs Sad Confusion**: Solved with chromagram (pitch) and delta MFCCs (dynamics) |
|
|
- **Overfitting**: Addressed with dropout, weight decay, and augmentation |
|
|
- **Training Stability**: Fixed with gradient clipping and batch normalization |
|
|
|
|
|
### Remaining Challenges |
|
|
|
|
|
- **Fearful Emotion**: Still only 41.38% accuracy (confused with other negative emotions) |
|
|
- **Test-Val Gap**: 75% validation vs 66.2% test suggests some overfitting |
|
|
|
|
|
## Deployment
|
|
|
|
|
### Hugging Face Model Hub |
|
|
|
|
|
The trained model is available on Hugging Face: |
|
|
|
|
|
```python |
|
|
from huggingface_hub import hf_hub_download |
|
|
|
|
|
model_path = hf_hub_download( |
|
|
repo_id="yourusername/speech-emotion-recognition", |
|
|
filename="best_model_v2.pth" |
|
|
) |
|
|
``` |
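
The downloaded checkpoint can then be loaded with `ImprovedEmotionCNN` exactly as in the Quick Inference example above.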
|
|
|
|
|
### Streamlit Cloud |
|
|
|
|
|
Live demo: [Coming Soon] |
|
|
|
|
|
### Local Demo |
|
|
|
|
|
```bash |
|
|
streamlit run deployment/app.py |
|
|
``` |
|
|
|
|
|
Features: |
|
|
- Audio file upload |
|
|
- Real-time emotion prediction |
|
|
- Confidence scores visualization |
|
|
- Top-3 predictions |
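
A pared-down sketch of what such an app can look like (`deployment/app.py` is the actual demo; the temp-file handling and widget layout here are illustrative assumptions):

```python
import pandas as pd
import streamlit as st
import torch

from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features

EMOTIONS = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

@st.cache_resource
def load_model():
    model = ImprovedEmotionCNN(num_classes=8)
    ckpt = torch.load('results/best_model_v2.pth', map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])
    return model.eval()

st.title("Speech Emotion Recognition")
uploaded = st.file_uploader("Upload a WAV file", type=["wav"])
if uploaded is not None:
    st.audio(uploaded)
    with open("temp.wav", "wb") as f:  # extract_features expects a file path
        f.write(uploaded.getbuffer())
    x = torch.FloatTensor(extract_features("temp.wav")).unsqueeze(0).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(load_model()(x), dim=1)[0]
    top = probs.topk(3)
    st.subheader(f"Prediction: {EMOTIONS[top.indices[0]]}")
    for rank, (i, p) in enumerate(zip(top.indices.tolist(), top.values.tolist()), 1):
        st.write(f"{rank}. {EMOTIONS[i]}: {p:.1%}")
    st.bar_chart(pd.DataFrame({"probability": probs.tolist()}, index=EMOTIONS))
```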
|
|
|
|
|
## Performance Metrics
|
|
|
|
|
### Classification Report (Enhanced Model) |
|
|
|
|
|
```
              precision    recall  f1-score   support

     neutral      0.667     0.714     0.690        14
        calm      0.686     0.857     0.762        28
       happy      0.531     0.586     0.557        29
         sad      0.500     0.517     0.508        29
       angry      0.769     0.690     0.727        29
     fearful      0.706     0.414     0.522        29
     disgust      0.688     0.759     0.721        29
   surprised      0.793     0.793     0.793        29

    accuracy                          0.662       216
   macro avg      0.667     0.666     0.660       216
weighted avg      0.667     0.662     0.658       216
```
|
|
|
|
|
## Development
|
|
|
|
|
### Running Tests |
|
|
|
|
|
```bash |
|
|
# Test model architecture |
|
|
python models/emotion_cnn_v2.py |
|
|
|
|
|
# Test dataset loading |
|
|
python data/dataset.py |
|
|
|
|
|
# Check environment |
|
|
python quick_start.py |
|
|
``` |
|
|
|
|
|
### Training from Scratch |
|
|
|
|
|
```bash |
|
|
# Complete pipeline |
|
|
./run_pipeline.sh |
|
|
|
|
|
# Or step by step: |
|
|
python data/download_dataset.py |
|
|
python data/prepare_data.py |
|
|
python models/train_v2.py |
|
|
python models/evaluate_v2.py |
|
|
``` |
|
|
|
|
|
## References
|
|
|
|
|
1. **RAVDESS Dataset**: Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). PLoS ONE 13(5): e0196391. |
|
|
|
|
|
2. **SpecAugment**: Park et al. (2019) "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition" |
|
|
|
|
|
3. **ResNet**: He et al. (2016) "Deep Residual Learning for Image Recognition" |
|
|
|
|
|
4. **Channel Attention**: Hu et al. (2018) "Squeeze-and-Excitation Networks" |
|
|
|
|
|
## Contributing
|
|
|
|
|
Contributions are welcome! Please feel free to submit a Pull Request. |
|
|
|
|
|
1. Fork the repository |
|
|
2. Create your feature branch (`git checkout -b feature/AmazingFeature`) |
|
|
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`) |
|
|
4. Push to the branch (`git push origin feature/AmazingFeature`) |
|
|
5. Open a Pull Request |
|
|
|
|
|
## License
|
|
|
|
|
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. |
|
|
|
|
|
## Acknowledgments
|
|
|
|
|
- RAVDESS dataset creators for the high-quality emotion database |
|
|
- PyTorch team for the excellent deep learning framework |
|
|
- librosa developers for comprehensive audio processing tools |
|
|
|
|
|
## Contact
|
|
|
|
|
For questions or feedback, please open an issue on GitHub. |
|
|
|
|
|
--- |
|
|
|
|
|
**Built with ❤️ using PyTorch, librosa, and Streamlit**
|
|
|