---
language: en
tags:
- audio
- emotion-recognition
- speech
- pytorch
- cnn
- ravdess
license: mit
datasets:
- ravdess
metrics:
- accuracy
- f1
model-index:
- name: speech-emotion-recognition-v2
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: RAVDESS
      type: ravdess
    metrics:
    - type: accuracy
      value: 75.0
      name: Validation Accuracy
    - type: accuracy
      value: 66.2
      name: Test Accuracy
---

# Speech Emotion Recognition (Enhanced Model V2)

## Model Description

This model is a deep CNN classifier that detects emotions from speech audio. It reaches **75% validation accuracy** and **66.2% test accuracy** on the RAVDESS dataset using enhanced feature extraction, residual connections, and channel attention.

### Model Architecture

- **Type**: Convolutional Neural Network with Residual Blocks
- **Parameters**: 11,873,480
- **Input**: 196-dimensional audio features × 128 time steps
- **Output**: 8 emotion classes

**Architecture Details:**

- 4 Residual Layers (2 blocks each)
- Channel Attention Mechanisms
- Dual Global Pooling (Average + Max)
- Fully Connected Layers: 1024 → 512 → 256 → 8

### Features (196 dimensions)

- Mel-spectrograms: 128 bands
- MFCCs: 13 coefficients
- Delta MFCCs: 13 (temporal dynamics)
- Delta-Delta MFCCs: 13 (acceleration)
- Chromagram: 12 (pitch content)
- Spectral Contrast: 7 (texture)
- Tonnetz: 6 (harmonic content)
- Additional: 4 (ZCR, centroid, rolloff, bandwidth)

## Intended Use

### Primary Use Cases

- Emotion detection from speech audio
- Affective computing research
- Human-computer interaction
- Mental health monitoring
- Call center analytics

### Out-of-Scope Use

- Real-time streaming audio (model requires 3-second clips)
- Non-speech audio (music, environmental sounds)
- Languages other than English
- Clinical diagnosis without professional oversight

## Training Data

**RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song)

- 1,440 speech files
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- 24 professional actors (12 male, 12 female)
- Controlled recording environment
- Split: 70% train, 15% validation, 15% test

## Performance

### Overall Metrics

| Metric | Value |
|--------|-------|
| Validation Accuracy | 75.00% |
| Test Accuracy | 66.20% |
| Macro F1-Score | 0.660 |
| Weighted F1-Score | 0.658 |

### Per-Class Performance (Test Set)

Accuracy is the fraction of each class's test clips that were classified correctly, so it equals Recall expressed as a percentage.

| Emotion | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| Neutral | 71.43% | 0.667 | 0.714 | 0.690 |
| Calm | 85.71% | 0.686 | 0.857 | 0.762 |
| Happy | 58.62% | 0.531 | 0.586 | 0.557 |
| Sad | 51.72% | 0.500 | 0.517 | 0.508 |
| Angry | 68.97% | 0.769 | 0.690 | 0.727 |
| Fearful | 41.38% | 0.706 | 0.414 | 0.522 |
| Disgust | 75.86% | 0.688 | 0.759 | 0.721 |
| Surprised | 79.31% | 0.793 | 0.793 | 0.793 |

### Comparison with Baseline

Accuracy improvements are absolute differences in percentage points (pp).

| Metric | Baseline | Enhanced | Improvement |
|--------|----------|----------|-------------|
| Validation Acc | 38.89% | 75.00% | +36.11 pp |
| Test Acc | 39.81% | 66.20% | +26.39 pp |
| Parameters | 536K | 11.8M | 22x |

## Usage

### Installation

```bash
pip install torch torchaudio librosa numpy
```

### Quick Start

```python
import torch

# Project-local modules: assumed to be importable from this model's source repository.
from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features

# Load model
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load and process audio
features = extract_features('path/to/audio.wav')
# Add batch and channel dimensions in front of the feature matrix
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)

# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted_idx = output.argmax(1).item()

emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})")
```

## Limitations

### Known Issues

1. **Fearful Emotion**: Lowest per-class accuracy (41.38%); often confused with other negative emotions
2. **Test-Validation Gap**: 75% validation vs 66.2% test accuracy suggests some overfitting, likely to the validation split
3. **Dataset Bias**: Trained on professional actors in a controlled environment
4. **Language**: English only
5. **Audio Quality**: Requires clear speech without background noise

### Ethical Considerations

- **Privacy**: Emotion detection from voice raises privacy concerns
- **Bias**: May not generalize well across different demographics, accents, or cultures
- **Misuse**: Should not be used for surveillance or manipulation
- **Context**: Emotions are complex and context-dependent; the model provides probabilities, not certainties

## Training Procedure

### Hyperparameters

```python
{
    'batch_size': 24,
    'learning_rate': 0.001,
    'epochs': 150,
    'optimizer': 'AdamW',
    'weight_decay': 1e-4,
    'loss': 'CrossEntropyLoss + Label Smoothing (0.1)',
    'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)',
    'early_stopping': 'patience=20',
    'mixed_precision': 'FP16',
    'gradient_clipping': 'max_norm=1.0'
}
```

### Data Augmentation

- SpecAugment (time and frequency masking)
- Gaussian noise injection
- Time shifting
- Augmentation probability: 60%

### Hardware

- GPU: NVIDIA RTX 5060 Ti
- Training Time: ~2.5 hours (150 epochs)
- CUDA: 13.0
- PyTorch: 2.0+

## Citation

If you use this model, please cite:

```bibtex
@misc{speech-emotion-recognition-v2,
  title={Speech Emotion Recognition with Enhanced CNN},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}}
}
```

### RAVDESS Dataset Citation

```bibtex
@article{livingstone2018ravdess,
  title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)},
  author={Livingstone, Steven R and Russo, Frank A},
  journal={PLoS ONE},
  volume={13},
  number={5},
  pages={e0196391},
  year={2018},
  publisher={Public Library of Science}
}
```

## License

MIT License. See the LICENSE file for details.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/speech-emotion-recognition).

## Acknowledgments

- RAVDESS dataset creators
- PyTorch team
- librosa developers
- Hugging Face community
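
## Appendix: Cross-Checking the Reported Metrics

The per-class precision, recall, and F1 values in the Performance section can be checked against each other and against the reported macro F1-score. The sketch below is not part of the original evaluation code: it uses only the standard definitions (F1 as the harmonic mean of precision and recall, macro F1 as the unweighted mean of per-class F1), and the names `PER_CLASS`, `REPORTED_F1`, `f1_score`, and `macro_f1` are hypothetical helpers introduced here for illustration.

```python
# Per-class (precision, recall) copied from the test-set table in this card.
PER_CLASS = {
    "neutral": (0.667, 0.714),
    "calm": (0.686, 0.857),
    "happy": (0.531, 0.586),
    "sad": (0.500, 0.517),
    "angry": (0.769, 0.690),
    "fearful": (0.706, 0.414),
    "disgust": (0.688, 0.759),
    "surprised": (0.793, 0.793),
}

# F1 values as reported in the same table, plus the reported macro F1.
REPORTED_F1 = {
    "neutral": 0.690,
    "calm": 0.762,
    "happy": 0.557,
    "sad": 0.508,
    "angry": 0.727,
    "fearful": 0.522,
    "disgust": 0.721,
    "surprised": 0.793,
}
REPORTED_MACRO_F1 = 0.660


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class: dict) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1_score(p, r) for p, r in per_class.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Print recomputed values next to the reported ones for comparison.
    for emotion, (p, r) in PER_CLASS.items():
        recomputed = f1_score(p, r)
        print(f"{emotion:<10} F1 = {recomputed:.3f} (reported {REPORTED_F1[emotion]:.3f})")
    print(f"macro F1 = {macro_f1(PER_CLASS):.3f} (reported {REPORTED_MACRO_F1:.3f})")
```

Because the table values are already rounded to three decimals, a recomputed score can differ from the reported one in the last digit; a difference larger than that would suggest a transcription error in the table.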