# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
# Model Details
```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```
# Dataset Information
```
Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```
**Note:** Split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training, as the original dataset provides only a single "train" split.
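A minimal sketch of reproducing this split with the `datasets` library (the shuffle seed is an assumption; the exact seed used during training is not documented):

```python
from datasets import load_dataset

# Load the single "train" split (1,440 examples).
ds = load_dataset("narad/ravdess", split="train")

# Reproduce the 80/20 train/validation split; seed=42 is an assumed value.
split = ds.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))  # 1152 288
```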
# Available Splits:
- **Train:** 1,152 examples (after 80/20 split)
- **Validation:** 288 examples (after 80/20 split)
- **Test:** Not provided; external audio used for testing
# Feature Representation:
- **audio:** Raw waveform (48 kHz, resampled to 16 kHz during preprocessing; see the sketch below)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer labels for emotions (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of speaker (e.g., "male")
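One common way to perform this resampling with the `datasets` library is to cast the `audio` column so that each clip is decoded at 16 kHz on access; a sketch:

```python
from datasets import Audio, load_dataset

ds = load_dataset("narad/ravdess", split="train")

# Decode every clip at 16 kHz, the rate expected by facebook/wav2vec2-base.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

print(ds[0]["audio"]["sampling_rate"])  # 16000
```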
# Training Details
- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence); a configuration sketch follows this list
- **Learning rate:** 3e-5, with warmup steps (100) and weight decay (0.1)
- **Batch size:** 4 with gradient accumulation (effective batch size 8)
- **Dropout:** attention_dropout=0.1 and hidden_dropout=0.1 for regularization
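A hedged sketch of how these hyperparameters map onto a `transformers` training setup (the `output_dir` is a placeholder, and dataset preprocessing and `Trainer` wiring are omitted):

```python
from transformers import TrainingArguments, Wav2Vec2ForSequenceClassification

# Dropout is set through from_pretrained kwargs; num_labels=8 matches
# the eight RAVDESS emotion classes.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=8,
    attention_dropout=0.1,
    hidden_dropout=0.1,
)

# Hyperparameters from this card; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
)
```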
# Performance Metrics
- **Epochs:** 10
- **Training Loss:** ~0.8
- **Validation Loss:** ~1.2
- **Accuracy:** ~0.65
- **F1 Score:** ~0.63 (see the metric sketch below)
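Accuracy and F1 can be computed with a metric callback of the kind passed to `Trainer(compute_metrics=...)`; this sketch assumes a weighted F1 average, which the card does not specify:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),  # assumed averaging mode
    }
```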
# Inference Example
```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa

EMOTIONS = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device

def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and resample audio to the 16 kHz rate expected by the model
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
        max_length=160000,  # truncate clips longer than 10 seconds
        truncation=True,
    )
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_label = torch.argmax(logits, dim=1).item()
    probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()
    return EMOTIONS[predicted_label], dict(zip(EMOTIONS, probabilities.tolist()))

# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```
# Quantization & Optimization
- **Quantization:** Optional FP16 version created using PyTorch's `.half()` for faster inference with a reduced memory footprint (see the sketch below).
- **Optimized:** Suitable for deployment on GPU-enabled devices; the FP16 version reduces model size by ~50%.
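A minimal sketch of creating the FP16 variant with `.half()` (the paths are placeholders; inputs must also be cast to half precision, and FP16 inference is only well supported on GPU):

```python
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "path/to/wav2vec2-ravdess-emotion/final_model"  # placeholder path
)

# Convert weights to FP16 and move to GPU for inference.
model = model.half().to("cuda").eval()

# Save the half-precision weights (~50% smaller on disk).
model.save_pretrained("wav2vec2-ravdess-emotion-fp16")  # assumed output dir
```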
# Usage
- **Input:** Raw audio files (.wav) resampled to 16 kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities
# Limitations
- **Generalization:** Trained on acted speech (RAVDESS), so it may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, potentially insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.
# Future Improvements
- **Data Augmentation:** Incorporate noise, pitch shift, or speed changes to improve robustness (see the sketch below).
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).
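As a starting point for the augmentation idea above, a sketch of simple waveform-level transforms with librosa (the parameter ranges are illustrative assumptions, not values used in training):

```python
import numpy as np
import librosa

def augment(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Additive Gaussian noise at a small, fixed amplitude.
    audio = audio + 0.005 * np.random.randn(len(audio))
    # Random pitch shift of up to +/- 2 semitones.
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=float(np.random.uniform(-2, 2)))
    # Random speed change between 0.9x and 1.1x (also changes duration).
    audio = librosa.effects.time_stretch(audio, rate=float(np.random.uniform(0.9, 1.1)))
    return audio
```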