---
language: en
tags:
  - audio
  - emotion-recognition
  - speech
  - pytorch
  - cnn
  - ravdess
license: mit
datasets:
  - ravdess
metrics:
  - accuracy
  - f1
model-index:
  - name: speech-emotion-recognition-v2
    results:
      - task:
          type: audio-classification
          name: Speech Emotion Recognition
        dataset:
          name: RAVDESS
          type: ravdess
        metrics:
          - type: accuracy
            value: 75.0
            name: Validation Accuracy
          - type: accuracy
            value: 66.2
            name: Test Accuracy
---

# Speech Emotion Recognition (Enhanced Model V2)

## Model Description

This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves 75% validation accuracy and 66.2% test accuracy on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms.

## Model Architecture

- **Type:** Convolutional Neural Network with Residual Blocks
- **Parameters:** 11,873,480
- **Input:** 196-dimensional audio features × 128 time steps
- **Output:** 8 emotion classes

**Architecture Details:**

- 4 Residual Layers (2 blocks each)
- Channel Attention Mechanisms
- Dual Global Pooling (Average + Max)
- Fully Connected Layers: 1024 → 512 → 256 → 8
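The released weights target `ImprovedEmotionCNN` from `models/emotion_cnn_v2.py`; the class below is only an illustrative PyTorch sketch of the components listed above (residual blocks with channel attention, dual global pooling, the 1024 → 512 → 256 → 8 head). Channel widths, strides, and dropout rates are assumptions and will not reproduce the exact 11.9M parameter count:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (reduction is an assumption)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight each channel

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.attn = ChannelAttention(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:  # match shape for the skip connection
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.attn(self.bn2(self.conv2(out)))
        return torch.relu(out + self.shortcut(x))

class EmotionCNNSketch(nn.Module):
    """Hypothetical stand-in for ImprovedEmotionCNN, not the released architecture."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        blocks, ch = [], 64
        for out_ch in (64, 128, 256, 512):  # 4 residual layers, 2 blocks each
            blocks += [ResidualBlock(ch, out_ch, stride=2),
                       ResidualBlock(out_ch, out_ch)]
            ch = out_ch
        self.layers = nn.Sequential(*blocks)
        self.avg = nn.AdaptiveAvgPool2d(1)   # dual global pooling:
        self.max = nn.AdaptiveMaxPool2d(1)   # avg + max concatenated -> 1024 dims
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(256, num_classes))

    def forward(self, x):  # x: (batch, 1, 196, 128)
        f = self.layers(self.stem(x))
        pooled = torch.cat([self.avg(f), self.max(f)], dim=1).flatten(1)
        return self.head(pooled)  # (batch, 8) logits
```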

## Features (196 dimensions)

- Mel-spectrogram: 128 bands
- MFCCs: 13 coefficients
- Delta MFCCs: 13 (temporal dynamics)
- Delta-delta MFCCs: 13 (acceleration)
- Chromagram: 12 (pitch content)
- Spectral contrast: 7 (texture)
- Tonnetz: 6 (harmonic content)
- Additional: 4 (ZCR, centroid, rolloff, bandwidth)

## Intended Use

### Primary Use Cases

- Emotion detection from speech audio
- Affective computing research
- Human-computer interaction
- Mental health monitoring
- Call center analytics

### Out-of-Scope Use

- Real-time streaming audio (the model requires 3-second clips)
- Non-speech audio (music, environmental sounds)
- Languages other than English
- Clinical diagnosis without professional oversight

## Training Data

**RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song)

- 1,440 speech files
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- 24 professional actors (12 male, 12 female)
- Controlled recording environment
- Split: 70% train / 15% validation / 15% test

## Performance

### Overall Metrics

| Metric              | Value  |
|---------------------|--------|
| Validation Accuracy | 75.00% |
| Test Accuracy       | 66.20% |
| Macro F1-Score      | 0.660  |
| Weighted F1-Score   | 0.658  |

### Per-Class Performance (Test Set)

| Emotion   | Accuracy | Precision | Recall | F1-Score |
|-----------|----------|-----------|--------|----------|
| Neutral   | 71.43%   | 0.667     | 0.714  | 0.690    |
| Calm      | 85.71%   | 0.686     | 0.857  | 0.762    |
| Happy     | 58.62%   | 0.531     | 0.586  | 0.557    |
| Sad       | 51.72%   | 0.500     | 0.517  | 0.508    |
| Angry     | 68.97%   | 0.769     | 0.690  | 0.727    |
| Fearful   | 41.38%   | 0.706     | 0.414  | 0.522    |
| Disgust   | 75.86%   | 0.688     | 0.759  | 0.721    |
| Surprised | 79.31%   | 0.793     | 0.793  | 0.793    |

### Comparison with Baseline

| Metric         | Baseline | Enhanced | Improvement |
|----------------|----------|----------|-------------|
| Validation Acc | 38.89%   | 75.00%   | +36.11 pp   |
| Test Acc       | 39.81%   | 66.20%   | +26.39 pp   |
| Parameters     | 536K     | 11.8M    | 22×         |

## Usage

### Installation

```bash
pip install torch torchaudio librosa numpy
```

### Quick Start

```python
import torch
from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features

# Load the trained model
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Extract the 196-dim feature matrix and add batch/channel dimensions
features = extract_features('path/to/audio.wav')
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)

# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted_idx = output.argmax(1).item()

emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})")
```

## Limitations

### Known Issues

1. **Fearful emotion:** lowest per-class accuracy (41.38%); often confused with other negative emotions
2. **Test-validation gap:** 75% validation vs. 66.2% test accuracy suggests some overfitting
3. **Dataset bias:** trained on professional actors in a controlled environment
4. **Language:** English only
5. **Audio quality:** requires clear speech without background noise

## Ethical Considerations

- **Privacy:** emotion detection from voice raises privacy concerns
- **Bias:** may not generalize across different demographics, accents, or cultures
- **Misuse:** should not be used for surveillance or manipulation
- **Context:** emotions are complex and context-dependent; the model outputs probabilities, not certainties

## Training Procedure

### Hyperparameters

```python
{
    'batch_size': 24,
    'learning_rate': 0.001,
    'epochs': 150,
    'optimizer': 'AdamW',
    'weight_decay': 1e-4,
    'loss': 'CrossEntropyLoss + Label Smoothing (0.1)',
    'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)',
    'early_stopping': 'patience=20',
    'mixed_precision': 'FP16',
    'gradient_clipping': 'max_norm=1.0'
}
```
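Mixed precision, gradient clipping, and label smoothing interact in a specific order; the sketch below shows one way to wire the settings above together in PyTorch. The function names are hypothetical, and the `mode="max"` scheduler target is an assumption (monitoring validation accuracy rather than loss):

```python
import torch
import torch.nn as nn

def make_training_tools(model, device="cuda"):
    """Build loss/optimizer/scheduler/scaler from the hyperparameters above."""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=8)  # step with val accuracy
    # FP16 autocasting is a no-op when training on CPU
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
    return criterion, optimizer, scheduler, scaler

def train_step(model, batch, labels, criterion, optimizer, scaler, device="cuda"):
    """One mixed-precision step: scale -> backward -> unscale -> clip -> step."""
    model.train()
    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = criterion(model(batch), labels)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # gradients must be unscaled before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```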

### Data Augmentation

- SpecAugment (time and frequency masking)
- Gaussian noise injection
- Time shifting
- Augmentation probability: 60%
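Applied to the 196 × 128 feature matrix, SpecAugment simply zeroes a random frequency band and a random time band; a minimal numpy sketch (mask widths are assumptions, only the 60% probability comes from the list above):

```python
import numpy as np

def spec_augment(feats, rng, p=0.6, max_f=16, max_t=24):
    """SpecAugment-style masking applied with probability p; returns a copy."""
    if rng.random() >= p:
        return feats
    out = feats.copy()
    n_feat, n_time = out.shape
    f0 = rng.integers(0, n_feat - max_f)   # frequency mask start
    fw = rng.integers(1, max_f + 1)        # frequency mask width
    t0 = rng.integers(0, n_time - max_t)   # time mask start
    tw = rng.integers(1, max_t + 1)        # time mask width
    out[f0:f0 + fw, :] = 0.0
    out[:, t0:t0 + tw] = 0.0
    return out
```

Gaussian noise injection and time shifting can be applied to the raw waveform before feature extraction in the same probabilistic fashion.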

### Hardware

- GPU: NVIDIA RTX 5060 Ti
- Training time: ~2.5 hours (150 epochs)
- CUDA: 13.0
- PyTorch: 2.0+

## Citation

If you use this model, please cite:

```bibtex
@misc{speech-emotion-recognition-v2,
  title={Speech Emotion Recognition with Enhanced CNN},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}}
}
```

### RAVDESS Dataset Citation

```bibtex
@article{livingstone2018ravdess,
  title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)},
  author={Livingstone, Steven R and Russo, Frank A},
  journal={PLoS ONE},
  volume={13},
  number={5},
  pages={e0196391},
  year={2018},
  publisher={Public Library of Science}
}
```

## License

MIT License. See the LICENSE file for details.

## Contact

For questions or issues, please open an issue on the GitHub repository.

## Acknowledgments

- RAVDESS dataset creators
- PyTorch team
- librosa developers
- Hugging Face community