|
|
--- |
|
|
language: en |
|
|
tags: |
|
|
- audio |
|
|
- emotion-recognition |
|
|
- speech |
|
|
- pytorch |
|
|
- cnn |
|
|
- ravdess |
|
|
license: mit |
|
|
datasets: |
|
|
- ravdess |
|
|
metrics: |
|
|
- accuracy |
|
|
- f1 |
|
|
model-index: |
|
|
- name: speech-emotion-recognition-v2 |
|
|
results: |
|
|
- task: |
|
|
type: audio-classification |
|
|
name: Speech Emotion Recognition |
|
|
dataset: |
|
|
name: RAVDESS |
|
|
type: ravdess |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 75.0 |
|
|
name: Validation Accuracy |
|
|
- type: accuracy |
|
|
value: 66.2 |
|
|
name: Test Accuracy |
|
|
--- |
|
|
|
|
|
# Speech Emotion Recognition (Enhanced Model V2) |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves **75% validation accuracy** and **66.2% test accuracy** on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms. |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Type**: Convolutional Neural Network with Residual Blocks |
|
|
- **Parameters**: 11,873,480 |
|
|
- **Input**: 196-dimensional audio features × 128 time steps |
|
|
- **Output**: 8 emotion classes |
|
|
|
|
|
**Architecture Details:** |
|
|
- 4 Residual Layers (2 blocks each) |
|
|
- Channel Attention Mechanisms |
|
|
- Dual Global Pooling (Average + Max) |
|
|
- Fully Connected Layers: 1024 → 512 → 256 → 8 |
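The authoritative implementation is `models/emotion_cnn_v2.py` (imported in the Quick Start below); the following is a minimal, hypothetical PyTorch sketch of how such an architecture could be assembled. Channel widths, kernel sizes, and dropout rates are assumptions; only the 4×2 residual block layout, channel attention, dual global pooling, and the 1024 → 512 → 256 → 8 head come from this card.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze: global average per channel
        return x * w.view(b, c, 1, 1)              # excite: rescale channels

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.attn = ChannelAttention(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False), nn.BatchNorm2d(out_ch)
        ) if (stride != 1 or in_ch != out_ch) else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.attn(self.bn2(self.conv2(out)))
        return self.relu(out + self.down(x))

class ImprovedEmotionCNN(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # 4 residual layers, 2 blocks each; the widths 64/128/256/512 are assumptions
        self.layers = nn.Sequential(
            ResidualBlock(64, 64), ResidualBlock(64, 64),
            ResidualBlock(64, 128, stride=2), ResidualBlock(128, 128),
            ResidualBlock(128, 256, stride=2), ResidualBlock(256, 256),
            ResidualBlock(256, 512, stride=2), ResidualBlock(512, 512))
        self.head = nn.Sequential(                 # 1024 -> 512 -> 256 -> 8
            nn.Linear(1024, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(256, num_classes))

    def forward(self, x):                          # x: (batch, 1, 196, 128)
        x = self.layers(self.stem(x))
        # dual global pooling: concatenate average- and max-pooled descriptors
        x = torch.cat([x.mean(dim=(2, 3)), x.amax(dim=(2, 3))], dim=1)  # (batch, 1024)
        return self.head(x)
```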
|
|
|
|
|
### Features (196 dimensions) |
|
|
|
|
|
- Mel-spectrograms: 128 bands |
|
|
- MFCCs: 13 coefficients |
|
|
- Delta MFCCs: 13 (temporal dynamics) |
|
|
- Delta-Delta MFCCs: 13 (acceleration) |
|
|
- Chromagram: 12 (pitch content) |
|
|
- Spectral Contrast: 7 (texture) |
|
|
- Tonnetz: 6 (harmonic content) |
|
|
- Additional: 4 (zero-crossing rate, spectral centroid, rolloff, bandwidth)
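These blocks sum to 196 rows per frame (128 + 13 + 13 + 13 + 12 + 7 + 6 + 4). The repository's `extract_features` in `data/prepare_data.py` is the authoritative implementation; below is a hedged librosa sketch of how such a stack could be computed, with the sample rate, hop length, normalization, and padding strategy as assumptions:

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, duration=3.0, n_frames=128):
    """Hypothetical 196-row feature stack; sr and framing defaults are assumptions."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    blocks = [
        librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
    ]
    blocks.append(librosa.feature.delta(blocks[1]))               # delta MFCCs
    blocks.append(librosa.feature.delta(blocks[1], order=2))      # delta-delta MFCCs
    blocks.append(librosa.feature.chroma_stft(y=y, sr=sr))        # 12 chroma bins
    blocks.append(librosa.feature.spectral_contrast(y=y, sr=sr))  # 7 contrast bands
    blocks.append(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr))  # 6
    blocks.append(np.vstack([                                     # 4 extra descriptors
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
    ]))
    # blocks may disagree by a frame or two; align to the shortest, then stack
    t = min(b.shape[1] for b in blocks)
    feats = np.vstack([b[:, :t] for b in blocks])                 # (196, t)
    # zero-pad or truncate to the fixed 128-frame window
    if t < n_frames:
        feats = np.pad(feats, ((0, 0), (0, n_frames - t)))
    return feats[:, :n_frames].astype(np.float32)
```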
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
- Emotion detection from speech audio |
|
|
- Affective computing research |
|
|
- Human-computer interaction |
|
|
- Mental health monitoring |
|
|
- Call center analytics |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- Real-time streaming audio (model requires 3-second clips) |
|
|
- Non-speech audio (music, environmental sounds) |
|
|
- Languages other than English |
|
|
- Clinical diagnosis without professional oversight |
|
|
|
|
|
## Training Data |
|
|
|
|
|
**RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song) |
|
|
- 1,440 speech files |
|
|
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised |
|
|
- 24 professional actors (12 male, 12 female) |
|
|
- Controlled recording environment |
|
|
- Split: 70% train, 15% validation, 15% test |
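Labels can be read directly from the RAVDESS filename convention, in which the third of seven dash-separated fields is the emotion code. A small sketch (the exact split strategy used for training is not specified here beyond the 70/15/15 ratio):

```python
from pathlib import Path

# RAVDESS filenames encode seven dash-separated fields:
# Modality-VocalChannel-Emotion-Intensity-Statement-Repetition-Actor,
# e.g. 03-01-06-01-02-01-12.wav -> emotion code 06 (fearful), actor 12.
EMOTIONS = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def label_from_filename(path):
    return EMOTIONS[Path(path).stem.split('-')[2]]
```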
|
|
|
|
|
## Performance |
|
|
|
|
|
### Overall Metrics |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Validation Accuracy | 75.00% | |
|
|
| Test Accuracy | 66.20% | |
|
|
| Macro F1-Score | 0.660 | |
|
|
| Weighted F1-Score | 0.658 | |
|
|
|
|
|
### Per-Class Performance (Test Set) |
|
|
|
|
|
| Emotion | Accuracy | Precision | Recall | F1-Score | |
|
|
|---------|----------|-----------|--------|----------| |
|
|
| Neutral | 71.43% | 0.667 | 0.714 | 0.690 | |
|
|
| Calm | 85.71% | 0.686 | 0.857 | 0.762 | |
|
|
| Happy | 58.62% | 0.531 | 0.586 | 0.557 | |
|
|
| Sad | 51.72% | 0.500 | 0.517 | 0.508 | |
|
|
| Angry | 68.97% | 0.769 | 0.690 | 0.727 | |
|
|
| Fearful | 41.38% | 0.706 | 0.414 | 0.522 | |
|
|
| Disgust | 75.86% | 0.688 | 0.759 | 0.721 | |
|
|
| Surprised | 79.31% | 0.793 | 0.793 | 0.793 | |
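A table like this can be reproduced with scikit-learn (an extra dependency beyond the install line below) once test-set predictions are collected. A hedged sketch, where `loader` is an assumed `DataLoader` yielding `(features, labels)` batches shaped as in the Quick Start:

```python
import torch
from sklearn.metrics import classification_report

EMOTIONS = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

def report(model, loader):
    """Collect test-set predictions and print per-class precision/recall/F1."""
    y_true, y_pred = [], []
    model.eval()
    with torch.no_grad():
        for features, labels in loader:
            y_pred.extend(model(features).argmax(1).tolist())
            y_true.extend(labels.tolist())
    print(classification_report(y_true, y_pred, target_names=EMOTIONS, digits=3))
```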
|
|
|
|
|
### Comparison with Baseline |
|
|
|
|
|
| Metric | Baseline | Enhanced | Improvement | |
|
|
|--------|----------|----------|-------------| |
|
|
| Validation Acc | 38.89% | 75.00% | +36.11 pp |
|
|
| Test Acc | 39.81% | 66.20% | +26.39 pp |
|
|
| Parameters | 536K | 11.8M | 22x | |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install torch torchaudio librosa numpy |
|
|
``` |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import librosa |
|
|
import numpy as np |
|
|
from models.emotion_cnn_v2 import ImprovedEmotionCNN |
|
|
from data.prepare_data import extract_features |
|
|
|
|
|
# Load model |
|
|
model = ImprovedEmotionCNN(num_classes=8) |
|
|
checkpoint = torch.load('best_model_v2.pth', map_location='cpu') |
|
|
model.load_state_dict(checkpoint['model_state_dict']) |
|
|
model.eval() |
|
|
|
|
|
# Load and process audio |
|
|
features = extract_features('path/to/audio.wav') |
|
|
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0) |
|
|
|
|
|
# Predict |
|
|
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted_idx = output.argmax(1).item()
|
|
|
|
|
emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised'] |
|
|
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})") |
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
|
|
|
### Known Issues |
|
|
|
|
|
1. **Fearful Emotion**: Lowest per-class accuracy (41.38%); fearful speech is often confused with the other negative emotions
|
|
2. **Test-Validation Gap**: The 8.8-point drop from 75.0% validation to 66.2% test accuracy suggests some overfitting to the validation split
|
|
3. **Dataset Bias**: Trained on professional actors in controlled environment |
|
|
4. **Language**: English only |
|
|
5. **Audio Quality**: Requires clear speech without background noise |
|
|
|
|
|
### Ethical Considerations |
|
|
|
|
|
- **Privacy**: Emotion detection from voice raises privacy concerns |
|
|
- **Bias**: May not generalize well across different demographics, accents, or cultures |
|
|
- **Misuse**: Should not be used for surveillance or manipulation |
|
|
- **Context**: Emotions are complex and context-dependent; model provides probabilities, not certainties |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
```python |
|
|
{ |
|
|
'batch_size': 24, |
|
|
'learning_rate': 0.001, |
|
|
'epochs': 150, |
|
|
'optimizer': 'AdamW', |
|
|
'weight_decay': 1e-4, |
|
|
'loss': 'CrossEntropyLoss + Label Smoothing (0.1)', |
|
|
'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)', |
|
|
'early_stopping': 'patience=20', |
|
|
'mixed_precision': 'FP16', |
|
|
'gradient_clipping': 'max_norm=1.0' |
|
|
} |
|
|
``` |
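A hedged sketch of how these settings wire together in a single training epoch using standard PyTorch 2.x APIs; the actual training script is not part of this card, and `loader`, `optimizer`, and `scaler` are assumed to be set up as in the trailing comments:

```python
import torch
from torch import nn

def train_one_epoch(model, loader, optimizer, scaler, device='cuda'):
    """One training epoch combining the settings listed above (illustrative)."""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)       # label smoothing 0.1
    model.train()
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.autocast(device_type='cuda', dtype=torch.float16):  # FP16 AMP
            loss = criterion(model(features), labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)                             # unscale before clipping
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()

# Assumed surrounding setup:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
#     optimizer, mode='max', factor=0.5, patience=8)
# scaler = torch.cuda.amp.GradScaler()
# per epoch: scheduler.step(val_accuracy); stop after 20 epochs without improvement
```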
|
|
|
|
|
### Data Augmentation |
|
|
|
|
|
- SpecAugment (time and frequency masking) |
|
|
- Gaussian noise injection |
|
|
- Time shifting |
|
|
- Augmentation probability: 60% |
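A minimal NumPy sketch of this pipeline, assuming it operates on the (196, 128) feature matrix; mask widths, noise scale, and shift range are illustrative, not the values used in training:

```python
import numpy as np

def augment(feats, p=0.6, rng=None):
    """Hypothetical augmentation pipeline; each clip is augmented with probability p."""
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return feats
    feats = feats.copy()
    # SpecAugment-style frequency mask, restricted to the 128 mel rows
    w = int(rng.integers(1, 16))
    f0 = int(rng.integers(0, 128 - w))
    feats[f0:f0 + w, :] = 0.0
    # SpecAugment-style time mask
    w = int(rng.integers(1, 16))
    t0 = int(rng.integers(0, feats.shape[1] - w))
    feats[:, t0:t0 + w] = 0.0
    # Gaussian noise injection
    feats = feats + rng.normal(0.0, 0.01, feats.shape).astype(feats.dtype)
    # Time shift: circular roll along the frame axis
    return np.roll(feats, int(rng.integers(-10, 11)), axis=1)
```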
|
|
|
|
|
### Hardware |
|
|
|
|
|
- GPU: NVIDIA RTX 5060 Ti |
|
|
- Training Time: ~2.5 hours (150 epochs) |
|
|
- CUDA: 13.0 |
|
|
- PyTorch: 2.0+ |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{speech-emotion-recognition-v2, |
|
|
title={Speech Emotion Recognition with Enhanced CNN}, |
|
|
author={Your Name}, |
|
|
year={2024}, |
|
|
publisher={Hugging Face}, |
|
|
howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}} |
|
|
} |
|
|
``` |
|
|
|
|
|
### RAVDESS Dataset Citation |
|
|
|
|
|
```bibtex |
|
|
@article{livingstone2018ravdess, |
|
|
title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English},
|
|
author={Livingstone, Steven R and Russo, Frank A}, |
|
|
journal={PLoS ONE}, |
|
|
volume={13}, |
|
|
number={5}, |
|
|
pages={e0196391}, |
|
|
year={2018}, |
|
|
publisher={Public Library of Science} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License; see the LICENSE file for details.
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/speech-emotion-recognition). |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- RAVDESS dataset creators |
|
|
- PyTorch team |
|
|
- librosa developers |
|
|
- Hugging Face community |
|
|
|