# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
# Model Details
```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```
# Dataset Information
```
Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```
**Note:** Split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training, as the original dataset provides only a single "train" split.
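A minimal sketch of reproducing this split with the `datasets` library (the shuffle seed is an assumption; the exact seed used during training is not documented):

```python
from datasets import load_dataset

# Load the single "train" split (1,440 examples).
ds = load_dataset("narad/ravdess", split="train")

# Reproduce the 80/20 train/validation split; seed=42 is an assumed value.
split = ds.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))  # 1152 288
```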
# Available Splits:
- **Train:** 1,152 examples (after 80/20 split)
- **Validation:** 288 examples (after 80/20 split)
- **Test:** Not provided; external audio used for testing
# Feature Representation:
- **audio:** Raw waveform (48 kHz, resampled to 16 kHz during preprocessing; see the sketch below)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer labels for emotions (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of speaker (e.g., "male")
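One common way to perform this resampling with the `datasets` library is to cast the `audio` column so that each clip is decoded at 16 kHz on access; a sketch:

```python
from datasets import Audio, load_dataset

ds = load_dataset("narad/ravdess", split="train")

# Decode every clip at 16 kHz, the rate expected by facebook/wav2vec2-base.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

print(ds[0]["audio"]["sampling_rate"])  # 16000
```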
# Training Details
- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence); a configuration sketch follows this list
- **Learning rate:** 3e-5, with warmup steps (100) and weight decay (0.1)
- **Batch size:** 4 with gradient accumulation (effective batch size 8)
- **Dropout:** attention_dropout=0.1 and hidden_dropout=0.1 for regularization
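A hedged sketch of how these hyperparameters map onto a `transformers` training setup (the `output_dir` is a placeholder, and dataset preprocessing and `Trainer` wiring are omitted):

```python
from transformers import TrainingArguments, Wav2Vec2ForSequenceClassification

# Dropout is set through from_pretrained kwargs; num_labels=8 matches
# the eight RAVDESS emotion classes.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=8,
    attention_dropout=0.1,
    hidden_dropout=0.1,
)

# Hyperparameters from this card; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
)
```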
# Performance Metrics
- **Epochs:** 10
- **Training Loss:** ~0.8
- **Validation Loss:** ~1.2
- **Accuracy:** ~0.65
- **F1 Score:** ~0.63 (see the metric sketch below)
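Accuracy and F1 can be computed with a metric callback of the kind passed to `Trainer(compute_metrics=...)`; this sketch assumes a weighted F1 average, which the card does not specify:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),  # assumed averaging mode
    }
```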
# Inference Example
```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa

EMOTIONS = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device

def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and resample audio to the 16 kHz rate expected by the model
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
        max_length=160000,  # truncate clips longer than 10 seconds
        truncation=True,
    )
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_label = torch.argmax(logits, dim=1).item()
    probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()
    return EMOTIONS[predicted_label], dict(zip(EMOTIONS, probabilities.tolist()))

# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```
# Quantization & Optimization
- **Quantization:** Optional FP16 version created using PyTorch's `.half()` for faster inference with a reduced memory footprint (see the sketch below).
- **Optimized:** Suitable for deployment on GPU-enabled devices; the FP16 version reduces model size by ~50%.
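A minimal sketch of creating the FP16 variant with `.half()` (the paths are placeholders; inputs must also be cast to half precision, and FP16 inference is only well supported on GPU):

```python
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "path/to/wav2vec2-ravdess-emotion/final_model"  # placeholder path
)

# Convert weights to FP16 and move to GPU for inference.
model = model.half().to("cuda").eval()

# Save the half-precision weights (~50% smaller on disk).
model.save_pretrained("wav2vec2-ravdess-emotion-fp16")  # assumed output dir
```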
# Usage
- **Input:** Raw audio files (.wav) resampled to 16 kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities
# Limitations
- **Generalization:** Trained on acted speech (RAVDESS), so it may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, potentially insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.
# Future Improvements
- **Data Augmentation:** Incorporate noise, pitch shift, or speed changes to improve robustness (see the sketch below).
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).
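As a starting point for the augmentation idea above, a sketch of simple waveform-level transforms with librosa (the parameter ranges are illustrative assumptions, not values used in training):

```python
import numpy as np
import librosa

def augment(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Additive Gaussian noise at a small, fixed amplitude.
    audio = audio + 0.005 * np.random.randn(len(audio))
    # Random pitch shift of up to +/- 2 semitones.
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=float(np.random.uniform(-2, 2)))
    # Random speed change between 0.9x and 1.1x (also changes duration).
    audio = librosa.effects.time_stretch(audio, rate=float(np.random.uniform(0.9, 1.1)))
    return audio
```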