# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition

# Model Details

```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```

# Dataset Information

```
Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```

**Note:** The original dataset provides only a single "train" split; it was split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training.

# Available Splits
- **Train:** 1,152 examples (after the 80/20 split)
- **Validation:** 288 examples (after the 80/20 split)
- **Test:** Not provided; external audio was used for testing

# Feature Representation
- **audio:** Raw waveform (48 kHz, resampled to 16 kHz during preprocessing)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer labels for emotions (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of the speaker (e.g., "male")

# Training Details
- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence)
- **Learning Rate:** 3e-5, with 100 warmup steps and weight decay of 0.1
- **Batch Size:** 4, with gradient accumulation (effective batch size 8)
- **Regularization:** Dropout added (attention_dropout=0.1, hidden_dropout=0.1)

A minimal fine-tuning sketch using these settings is provided after the Quantization & Optimization section below.

# Performance Metrics
- **Epochs:** 10
- **Training Loss:** ~0.8
- **Validation Loss:** ~1.2
- **Accuracy:** ~0.65
- **F1 Score:** ~0.63

# Inference Example

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa


def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device


def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and preprocess audio (resampled to 16 kHz, truncated to 10 s)
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
        max_length=160000,
        truncation=True,
    )
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).item()
        probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()

    emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
    return emotions[predicted_label], {emotion: float(prob) for emotion, prob in zip(emotions, probabilities)}


# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```

# Quantization & Optimization
- **Quantization:** An optional FP16 version was created using PyTorch's `.half()` for faster inference with a reduced memory footprint.
- **Optimization:** Suitable for deployment on GPU-enabled devices; the FP16 version reduces model size by roughly 50%.
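The snippet below is a minimal sketch of how such an FP16 copy can be produced with `.half()`; the paths are illustrative placeholders, not the actual repository layout.

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification

# Load the fine-tuned FP32 checkpoint (path is illustrative)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "path/to/wav2vec2-ravdess-emotion/final_model"
)

# Convert the weights to half precision and save a separate FP16 copy
# (roughly half the size on disk; intended for GPU inference)
model = model.half()
model.save_pretrained("path/to/wav2vec2-ravdess-emotion/final_model_fp16")
```

When running inference with the FP16 weights on a CUDA device, the input tensor must be cast to the same precision (e.g., `input_values.half().to("cuda")`), otherwise the model will raise a dtype mismatch.

For completeness, here is a hypothetical reconstruction of the fine-tuning setup using the hyperparameters listed under Training Details. The 80/20 split, padding length, column handling, and the use of `Wav2Vec2FeatureExtractor` are assumptions for illustration, not the exact original training code.

```python
import numpy as np
import torch
from datasets import load_dataset, Audio
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Load the single "train" split and resample audio from 48 kHz to 16 kHz
dataset = load_dataset("narad/ravdess", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Manual 80/20 split (1,152 train / 288 validation); the seed is arbitrary
dataset = dataset.train_test_split(test_size=0.2, seed=42)

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess(example):
    # Pad/truncate to 10 s (160,000 samples), matching the inference example
    inputs = feature_extractor(
        example["audio"]["array"],
        sampling_rate=16_000,
        max_length=160_000,
        padding="max_length",
        truncation=True,
    )
    example["input_values"] = inputs.input_values[0]
    return example

encoded = dataset.map(
    preprocess,
    remove_columns=["audio", "text", "speaker_id", "speaker_gender"],
)

# 8 emotion classes, with the dropout values from Training Details
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=8,
    attention_dropout=0.1,
    hidden_dropout=0.1,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
    eval_strategy="epoch",          # `evaluation_strategy` on older transformers releases
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```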
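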
# Usage
- **Input:** Raw audio files (.wav), resampled to 16 kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities

# Limitations
- **Generalization:** Trained on acted speech (RAVDESS); may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, potentially insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.

# Future Improvements
- **Data Augmentation:** Incorporate noise, pitch shifting, or speed changes to improve robustness (see the sketch after this list).
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).
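As an illustration of the augmentation idea above, the helper below is a minimal sketch using librosa; the function name and perturbation strengths are arbitrary choices and are not part of the current training pipeline.

```python
import numpy as np
import librosa

def augment_waveform(audio: np.ndarray, sr: int = 16000) -> list[np.ndarray]:
    """Return a few perturbed copies of a waveform (hypothetical helper)."""
    augmented = []

    # Additive Gaussian noise at a modest level
    noise = np.random.normal(0.0, 0.005, size=audio.shape)
    augmented.append(audio + noise)

    # Pitch shift by +/- 2 semitones
    augmented.append(librosa.effects.pitch_shift(y=audio, sr=sr, n_steps=2))
    augmented.append(librosa.effects.pitch_shift(y=audio, sr=sr, n_steps=-2))

    # Speed change via time stretching (rate > 1 is faster)
    augmented.append(librosa.effects.time_stretch(y=audio, rate=1.1))
    augmented.append(librosa.effects.time_stretch(y=audio, rate=0.9))

    return augmented
```

Each augmented copy would then be preprocessed and labeled in the same way as the original clip before being added to the training set.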