🎤 Wav2Vec2 Speech Emotion Recognition for English

🧠 Model Overview

🔹 Model name: dihuzz/wav2vec2-ser-english-finetuned
✨ This model is fine-tuned for recognizing emotions in English speech using the Wav2Vec2 architecture. It can detect the following emotions:

  • 😒 Sadness
  • 😠 Anger
  • 🤢 Disgust
  • 😨 Fear
  • 😊 Happiness
  • 😐 Neutral

🔧 The model was created by fine-tuning r-f/wav2vec-english-speech-emotion-recognition on several prominent Speech Emotion Recognition datasets containing English emotional speech samples.

📊 Performance Metrics:

  • 🎯 Accuracy: 92.42%
  • 📉 Loss: 0.219

πŸ‹οΈ Training Procedure

βš™οΈ Training Details

  • Base Model: r-f/wav2vec-english-speech-emotion-recognition
  • πŸ’» Hardware: P100 GPU on Kaggle
  • ⏱ Training Duration: 10 epochs
  • πŸ“š Learning Rate: 5e-4
  • 🧩 Batch Size: 4
  • πŸ“ˆ Gradient Accumulation Steps: 8
  • βš–οΈ Optimizer: AdamW (β₁=0.9, Ξ²β‚‚=0.999)
  • πŸ“‰ Loss Function: Cross Entropy Loss
  • ⏳ Learning Rate Scheduler: None
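
With a per-device batch size of 4 and 8 gradient accumulation steps, the effective batch size works out to 32. The exact training script is not published; as a rough sketch, the hyperparameters above map onto a Hugging Face `TrainingArguments` configuration like the following (only the listed hyperparameters come from this card; `output_dir` and everything else is a placeholder or assumption):

```python
from transformers import TrainingArguments

# Sketch only: reproduces the hyperparameters listed above; other
# arguments (output_dir, logging, evaluation) are assumed, not documented.
training_args = TrainingArguments(
    output_dir="wav2vec2-ser-english-finetuned",  # placeholder path
    num_train_epochs=10,
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size: 4 * 8 = 32
    adam_beta1=0.9,
    adam_beta2=0.999,
    lr_scheduler_type="constant",    # "None" scheduler => constant learning rate
)
```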

📜 Training Results

| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 1.0257 | 61.20%   |
| 2     | 0.7025 | 73.88%   |
| 3     | 0.5901 | 78.25%   |
| 4     | 0.4960 | 81.56%   |
| 5     | 0.4105 | 85.04%   |
| 6     | 0.3516 | 87.70%   |
| 7     | 0.3140 | 88.87%   |
| 8     | 0.2649 | 90.45%   |
| 9     | 0.2178 | 92.42%   |
| 10    | 0.2187 | 92.29%   |

🛠 How to Use

🔌 Installation

```bash
pip install transformers torch torchaudio
```

💻 Example Usage

Here is an example of how to use the model to classify the emotion in an English-language .wav audio file:

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import torchaudio

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Load the fine-tuned model and feature extractor
model_name = "dihuzz/wav2vec2-ser-english-finetuned"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name).to(device)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()

# Load and preprocess the audio file, then predict the emotion label
def predict_emotion(audio_path):
    # Load audio (librosa can also be used here)
    waveform, sample_rate = torchaudio.load(audio_path)

    # Resample to 16 kHz if necessary
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Extract features and move them to device
    inputs = feature_extractor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Perform inference (batch size of 1 here; increase it for faster inference)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        predicted_class_id = torch.argmax(logits, dim=-1).item()

    # Map predicted class ID to emotion label
    label = model.config.id2label[predicted_class_id]
    return label

# Example usage
audio_file = "/path/to/your/audio.wav"
predicted_emotion = predict_emotion(audio_file)
print(f"Predicted Emotion: {predicted_emotion}")
```
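
For batched inference, `padding=True` is what lets the feature extractor stack clips of different lengths: shorter signals are zero-padded to the longest one, and an attention mask records which samples are real. A minimal pure-Python sketch of that mechanism (the helper name `pad_batch` is illustrative, not part of the transformers API):

```python
def pad_batch(waveforms, pad_value=0.0):
    """Zero-pad variable-length 1-D signals to a common length,
    mirroring what the feature extractor's padding=True does."""
    max_len = max(len(w) for w in waveforms)
    padded = [list(w) + [pad_value] * (max_len - len(w)) for w in waveforms]
    # 1 marks a real sample, 0 marks padding
    mask = [[1] * len(w) + [0] * (max_len - len(w)) for w in waveforms]
    return padded, mask

# Two clips of different lengths padded to a common length
batch, attention_mask = pad_batch([[0.1, 0.2, 0.3], [0.5]])
```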

πŸ“ Example Output

The model returns a string representing the predicted emotion:

Predicted Emotion: <emotion_label>
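
Beyond the single top label, per-class probabilities can be useful, for example to apply a confidence threshold. With PyTorch you would apply `torch.softmax(outputs.logits, dim=-1)`; the numerically stable computation behind it looks like this pure-Python sketch (the logit values are made up for illustration):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the six emotion classes
logits = [2.1, -0.3, 0.5, -1.2, 0.8, 0.0]
probs = softmax(logits)
top = probs.index(max(probs))  # index of the predicted class
```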

Limitations

📌 Note: This model has several important limitations:

  • 🌐 Language Specificity: English-only support
  • πŸ—£οΈ Dialect Sensitivity: Variable performance across accents
  • 🎧 Audio Quality Needs: Requires clean, clear recordings
  • βš–οΈ Potential Biases: May reflect cultural biases in training data
  • 6️⃣ Limited Categories: Only detects 6 basic emotions
  • 🧠 Context Unaware: Doesn't consider speech content meaning