SER Wav2Vec2 Finetuned on GEMEP (French)
This model is a fine-tuned version of audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim for Speech Emotion Recognition (SER) in French.
It specifically targets the prediction of emotional dimensions: Valence, Arousal, and Dominance (VAD).
Model Description
The model was fine-tuned on the GEMEP corpus (Geneva Multimodal Emotion Portrayals), which contains pseudo-sentences uttered by professional actors. For the associated research paper, we collected new annotations for Valence, Arousal, and Dominance through a dedicated user study. The regression targets are the mean values of these human annotations.
- Base Model: Wav2Vec2-Large-Robust-12
- Pre-trained weights by: audeering
- Language: French (Pseudo-speech)
- Task: Dimensional Emotion Regression (VAD)
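As a rough illustration of how mean-annotation targets like those described above can be built, the sketch below averages each dimension over annotators. The clip ids, the rating values, the rating scale, and the helper name `mean_vad` are all hypothetical; the actual annotation format of the user study is not specified here.

```python
from statistics import mean

# Hypothetical ratings: clip id -> one (valence, arousal, dominance) tuple per annotator.
# The values and the rating scale are made up for illustration only.
ratings = {
    "clip_001": [(0.2, 0.8, 0.6), (0.4, 0.6, 0.5), (0.3, 0.7, 0.7)],
    "clip_002": [(0.9, 0.5, 0.4), (0.7, 0.3, 0.6)],
}


def mean_vad(annotations):
    """Average each dimension over annotators and return the target scores."""
    valence, arousal, dominance = zip(*annotations)
    return {
        "valence": mean(valence),
        "arousal": mean(arousal),
        "dominance": mean(dominance),
    }


# One regression target (a VAD triple) per clip.
targets = {clip_id: mean_vad(anns) for clip_id, anns in ratings.items()}
```

Averaging per clip is an assumption of this sketch; other aggregation choices (for example, per-annotator normalization before averaging) would change the targets.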
Usage
To use this model, you need the transformers library (with PyTorch) and the custom regression-head logic used by the audeering base model.
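import numpy as np
import torch
import torch.nn as nn  # used when defining an audeering-style regression head
from transformers import AutoModel, AutoFeatureExtractor

# Load model and processor
model_name = "rosalied/ser-w2v2-finetuned"
processor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Example: inference on a 16 kHz mono audio array
# (placeholder: one second of silence; replace with your own float32 waveform)
audio_array = np.zeros(16000, dtype=np.float32)
input_values = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_values
with torch.no_grad():
    outputs = model(input_values)
# The regression head predicts [Arousal, Dominance, Valence], in that order.
# Note: a plain AutoModel load typically returns the Wav2Vec2 encoder outputs
# (hidden states); the VAD scores come from the regression head, attached as in
# the audeering base model.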
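The audeering base model wraps the Wav2Vec2 encoder in a small custom class that mean-pools the hidden states over time and applies a regression head. The sketch below follows that pattern. It assumes that this fine-tuned checkpoint keeps the same layout and weight names (`wav2vec2`, `classifier.dense`, `classifier.out_proj`), which is not confirmed here; `predict_vad` is a hypothetical convenience helper.

```python
import torch
import torch.nn as nn
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)


class RegressionHead(nn.Module):
    """Two-layer head mapping pooled features to config.num_labels scores."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features):
        x = self.dropout(features)
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)


class EmotionModel(Wav2Vec2PreTrainedModel):
    """Wav2Vec2 encoder, mean pooling over time, then a regression head."""

    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.post_init()

    def forward(self, input_values):
        hidden_states = self.wav2vec2(input_values)[0]  # (batch, frames, hidden)
        pooled = hidden_states.mean(dim=1)              # (batch, hidden)
        return pooled, self.classifier(pooled)          # scores: (batch, num_labels)


def predict_vad(model, input_values):
    """Score the first item of a batch.

    Assumes the head outputs three values ordered [arousal, dominance, valence].
    """
    model.eval()
    with torch.no_grad():
        _, scores = model(input_values)
    arousal, dominance, valence = scores[0].tolist()
    return {"arousal": arousal, "dominance": dominance, "valence": valence}
```

Under these assumptions, the checkpoint could be loaded with `EmotionModel.from_pretrained("rosalied/ser-w2v2-finetuned")` and scored with `predict_vad(model, input_values)`. If the loader warns about missing or unexpected keys, the checkpoint layout differs from this sketch and the class would need to be adapted.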