---
language: en
tags:
- audio
- emotion-recognition
- speech
- pytorch
- cnn
- ravdess
license: mit
datasets:
- ravdess
metrics:
- accuracy
- f1
model-index:
- name: speech-emotion-recognition-v2
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: RAVDESS
      type: ravdess
    metrics:
    - type: accuracy
      value: 75
      name: Validation Accuracy
    - type: accuracy
      value: 66.2
      name: Test Accuracy
---
# Speech Emotion Recognition (Enhanced Model V2)

## Model Description
This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves 75% validation accuracy and 66.2% test accuracy on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms.
## Model Architecture
- Type: Convolutional Neural Network with Residual Blocks
- Parameters: 11,873,480
- Input: 196-dimensional audio features × 128 time steps
- Output: 8 emotion classes
### Architecture Details
- 4 Residual Layers (2 blocks each)
- Channel Attention Mechanisms
- Dual Global Pooling (Average + Max)
- Fully Connected Layers: 1024 → 512 → 256 → 8
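The released code lives in `models/emotion_cnn_v2.py`; the following is a minimal PyTorch sketch of the blocks listed above. The channel widths, attention reduction ratio, and dropout rates are assumptions (chosen so the dual-pooled vector is 1024-dimensional, matching the head), and the actual `ImprovedEmotionCNN` may differ in detail:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate; the reduction ratio is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.attn = ChannelAttention(out_ch)
        self.skip = nn.Identity() if (in_ch == out_ch and stride == 1) else nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return torch.relu(self.attn(self.body(x)) + self.skip(x))

class EmotionCNNSketch(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # 4 residual layers x 2 blocks, doubling channels at each downsampling layer
        self.layers = nn.Sequential(
            ResidualBlock(64, 64), ResidualBlock(64, 64),
            ResidualBlock(64, 128, stride=2), ResidualBlock(128, 128),
            ResidualBlock(128, 256, stride=2), ResidualBlock(256, 256),
            ResidualBlock(256, 512, stride=2), ResidualBlock(512, 512))
        self.avg, self.max = nn.AdaptiveAvgPool2d(1), nn.AdaptiveMaxPool2d(1)
        self.head = nn.Sequential(                        # 1024 -> 512 -> 256 -> 8
            nn.Linear(1024, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(256, num_classes))

    def forward(self, x):                                 # x: (batch, 1, 196, 128)
        h = self.layers(self.stem(x))
        h = torch.cat([self.avg(h), self.max(h)], dim=1)  # dual global pooling
        return self.head(h.flatten(1))
```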
### Features (196 dimensions)
- Mel-spectrograms: 128 bands
- MFCCs: 13 coefficients
- Delta MFCCs: 13 (temporal dynamics)
- Delta-Delta MFCCs: 13 (acceleration)
- Chromagram: 12 (pitch content)
- Spectral Contrast: 7 (texture)
- Tonnetz: 6 (harmonic content)
- Additional: 4 (ZCR, centroid, rolloff, bandwidth)
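These seven groups sum to 196 rows per time frame. For reference, a hypothetical librosa re-implementation of the feature stack follows; the repo's `extract_features` in `data/prepare_data.py` is authoritative, and the hop length, dB scaling, and zero-padding strategy here are assumptions:

```python
import numpy as np
import librosa

def extract_features_sketch(path, sr=22050, n_frames=128):
    """Build the 196 x 128 feature matrix described in the model card."""
    y, sr = librosa.load(path, sr=sr, duration=3.0)

    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))  # 128
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                                 # 13
    d1 = librosa.feature.delta(mfcc)                                                   # 13
    d2 = librosa.feature.delta(mfcc, order=2)                                          # 13
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)                                   # 12
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)                           # 7
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)            # 6
    extras = np.vstack([
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
    ])                                                                                 # 4

    feats = np.vstack([mel, mfcc, d1, d2, chroma, contrast, tonnetz, extras])  # (196, T)
    # Pad or truncate the time axis to a fixed 128 frames.
    if feats.shape[1] < n_frames:
        feats = np.pad(feats, ((0, 0), (0, n_frames - feats.shape[1])))
    return feats[:, :n_frames].astype(np.float32)
```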
## Intended Use

### Primary Use Cases
- Emotion detection from speech audio
- Affective computing research
- Human-computer interaction
- Mental health monitoring
- Call center analytics
### Out-of-Scope Use
- Real-time streaming audio (model requires 3-second clips)
- Non-speech audio (music, environmental sounds)
- Languages other than English
- Clinical diagnosis without professional oversight
## Training Data

**RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)**
- 1,440 speech files
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- 24 professional actors (12 male, 12 female)
- Controlled recording environment
- Split: 70% train / 15% validation / 15% test (a split sketch follows below)
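The split itself is not published; a hypothetical stratified 70/15/15 split with scikit-learn could look like this, where the placeholder index stands in for paths and labels parsed from the RAVDESS filename convention:

```python
from sklearn.model_selection import train_test_split

# Placeholder index; real paths/labels come from parsing RAVDESS filenames,
# whose third field encodes the emotion (01=neutral ... 08=surprised).
files = [f"Actor_01/03-01-0{e}-01-01-01-01.wav" for e in range(1, 9)] * 20
labels = [e - 1 for e in range(1, 9)] * 20

# 70% train, then split the remaining 30% evenly into validation and test.
train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    rest_f, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
```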
## Performance

### Overall Metrics
| Metric | Value |
|---|---|
| Validation Accuracy | 75.00% |
| Test Accuracy | 66.20% |
| Macro F1-Score | 0.660 |
| Weighted F1-Score | 0.658 |
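Here the macro F1 is the unweighted mean of the eight per-class F1 scores, while the weighted F1 weights each class by its test-set support. Assuming scikit-learn (not among the listed dependencies), both reduce to:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Placeholder labels; in practice y_true/y_pred come from the test split.
y_true = np.array([0, 1, 2, 2, 5, 7])
y_pred = np.array([0, 1, 2, 3, 4, 7])

print(accuracy_score(y_true, y_pred))             # overall accuracy
print(f1_score(y_true, y_pred, average='macro'))  # unweighted per-class mean
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class support
```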
### Per-Class Performance (Test Set)
| Emotion | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Neutral | 71.43% | 0.667 | 0.714 | 0.690 |
| Calm | 85.71% | 0.686 | 0.857 | 0.762 |
| Happy | 58.62% | 0.531 | 0.586 | 0.557 |
| Sad | 51.72% | 0.500 | 0.517 | 0.508 |
| Angry | 68.97% | 0.769 | 0.690 | 0.727 |
| Fearful | 41.38% | 0.706 | 0.414 | 0.522 |
| Disgust | 75.86% | 0.688 | 0.759 | 0.721 |
| Surprised | 79.31% | 0.793 | 0.793 | 0.793 |
### Comparison with Baseline
| Metric | Baseline | Enhanced | Improvement |
|---|---|---|---|
| Validation Acc | 38.89% | 75.00% | +36.11% |
| Test Acc | 39.81% | 66.20% | +26.39% |
| Parameters | 536K | 11.8M | 22x |
## Usage

### Installation
```bash
pip install torch torchaudio librosa numpy
```
### Quick Start
```python
import torch
import librosa
import numpy as np
from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features

# Load model
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load and process audio
features = extract_features('path/to/audio.wav')
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)  # (1, 1, 196, 128)

# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)

predicted_idx = output.argmax(1).item()
emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})")
```
## Limitations

### Known Issues
- Fearful Emotion: lowest per-class accuracy (41.38%); frequently confused with the other negative emotions
- Validation-Test Gap: 75% validation vs. 66.2% test accuracy suggests some overfitting
- Dataset Bias: Trained on professional actors in controlled environment
- Language: English only
- Audio Quality: Requires clear speech without background noise
### Ethical Considerations
- Privacy: Emotion detection from voice raises privacy concerns
- Bias: May not generalize well across different demographics, accents, or cultures
- Misuse: Should not be used for surveillance or manipulation
- Context: Emotions are complex and context-dependent; model provides probabilities, not certainties
## Training Procedure

### Hyperparameters
```python
{
    'batch_size': 24,
    'learning_rate': 0.001,
    'epochs': 150,
    'optimizer': 'AdamW',
    'weight_decay': 1e-4,
    'loss': 'CrossEntropyLoss + Label Smoothing (0.1)',
    'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)',
    'early_stopping': 'patience=20',
    'mixed_precision': 'FP16',
    'gradient_clipping': 'max_norm=1.0'
}
```
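Assembled, this configuration corresponds roughly to the following loop. This is a sketch: the data loader and the metric the scheduler monitors are assumptions, and `EmotionCNNSketch` is the hypothetical model sketched under Architecture Details:

```python
import torch
import torch.nn as nn

model = EmotionCNNSketch(num_classes=8).cuda()        # hypothetical model from above
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing 0.1
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=8)    # assumed to track val accuracy
scaler = torch.cuda.amp.GradScaler()                  # FP16 mixed precision

for features, targets in train_loader:                # train_loader is assumed
    features, targets = features.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(features), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                        # unscale before clipping
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
# After each validation epoch: scheduler.step(val_accuracy)
```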
### Data Augmentation
- SpecAugment (time and frequency masking)
- Gaussian noise injection
- Time shifting
- Augmentation probability: 60%
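A sketch of how these augmentations could be applied to a single (196, 128) feature tensor, using torchaudio's SpecAugment-style masking transforms; the mask widths, noise scale, shift range, and the circular time shift are all assumptions:

```python
import numpy as np
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=20)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)

def augment(feats: torch.Tensor, p: float = 0.6) -> torch.Tensor:
    """Apply the full augmentation stack with probability p."""
    if np.random.rand() > p:
        return feats
    feats = time_mask(freq_mask(feats))             # SpecAugment masking
    feats = feats + 0.01 * torch.randn_like(feats)  # Gaussian noise injection
    shift = int(np.random.randint(-10, 11))
    return torch.roll(feats, shifts=shift, dims=-1)  # time shift (circular)
```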
### Hardware
- GPU: NVIDIA RTX 5060 Ti
- Training Time: ~2.5 hours (150 epochs)
- CUDA: 13.0
- PyTorch: 2.0+
## Citation
If you use this model, please cite:
```bibtex
@misc{speech-emotion-recognition-v2,
  title={Speech Emotion Recognition with Enhanced CNN},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}}
}
```
### RAVDESS Dataset Citation
```bibtex
@article{livingstone2018ravdess,
  title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English},
  author={Livingstone, Steven R and Russo, Frank A},
  journal={PLoS ONE},
  volume={13},
  number={5},
  pages={e0196391},
  year={2018},
  publisher={Public Library of Science}
}
```
## License

MIT License; see the LICENSE file for details.
## Contact
For questions or issues, please open an issue on the GitHub repository.
## Acknowledgments
- RAVDESS dataset creators
- PyTorch team
- librosa developers
- Hugging Face community