Whisper-base for Speech Emotion Recognition in Russian on the Dusha dataset

Whisper-base encoder with classification head for speech emotion recognition.

Dusha dataset: https://github.com/salute-developers/golos/tree/master/dusha

Multiclass classification into 5 classes (emotion: class id):

angry: 0
sad: 1
neutral: 2
positive: 3
other: 4

The model was fine-tuned on the full Dusha-crowd set with:

Time Shift, Time Masking and Colored Noise augmentations;
a WeightedRandomSampler for drawing training samples.
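
The card does not give the training code, so the following is only a minimal sketch of how such a setup might look. It assumes inverse-class-frequency weights for the sampler and applies Time Shift and Time Masking directly to the raw waveform; the helper names (make_balanced_sampler, time_shift, time_mask) and all parameter values are hypothetical, and Colored Noise is left out.

```python
import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels, generator=None):
    """Build a sampler that draws each class about equally often.

    Assumption: each sample is weighted by 1 / (size of its class).
    The card does not state which weights were actually used.
    """
    labels = torch.as_tensor(labels, dtype=torch.long)
    class_counts = torch.bincount(labels)
    sample_weights = 1.0 / class_counts[labels].double()
    return WeightedRandomSampler(
        sample_weights,
        num_samples=len(labels),
        replacement=True,
        generator=generator,
    )


def time_shift(wav, max_shift, generator=None):
    """Circularly shift the waveform by a random offset in [-max_shift, max_shift]."""
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,), generator=generator))
    return torch.roll(wav, shifts=shift, dims=-1)


def time_mask(wav, max_len, generator=None):
    """Zero out one random contiguous span of at most max_len samples."""
    out = wav.clone()
    max_len = min(max_len, out.shape[-1])
    length = int(torch.randint(0, max_len + 1, (1,), generator=generator))
    start = int(torch.randint(0, out.shape[-1] - length + 1, (1,), generator=generator))
    out[..., start:start + length] = 0.0
    return out
```

The sampler would be passed to a DataLoader through its sampler argument, and the augmentations would be applied to each waveform before feature extraction.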

Usage

import torch
import torchaudio
from transformers import WhisperForAudioClassification, WhisperFeatureExtractor

# load model and feature extractor
model = WhisperForAudioClassification.from_pretrained("waveletdeboshir/whisper-base-ser-dusha")
model.eval()
feature_extractor = WhisperFeatureExtractor.from_pretrained("waveletdeboshir/whisper-base-ser-dusha")

# load audio and resample if necessary
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# compute predictions (wav[0] takes the first channel; the feature extractor expects 16 kHz mono audio)
features = feature_extractor(wav[0].numpy(), sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    preds = model(features)

# get emotion and its probability
probs = torch.nn.functional.softmax(preds.logits, dim=-1)
print(f"Predicted emotion: {model.config.id2label[probs.argmax().item()]} with probability {probs.max().item():.4f}")
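
The usage code above prints only the top class. To inspect the whole distribution, a small helper along these lines might be useful. This is a sketch: rank_emotions is a hypothetical name, and the ID2LABEL dictionary is copied from the class list in this card on the assumption that it matches model.config.id2label (the loaded config is the safer source).

```python
import torch

# Assumption: copied from the class list in this card; prefer
# model.config.id2label when the model is loaded.
ID2LABEL = {0: "angry", 1: "sad", 2: "neutral", 3: "positive", 4: "other"}


def rank_emotions(logits, id2label=ID2LABEL):
    """Return (label, probability) pairs, most likely first.

    logits: tensor of shape (num_classes,) or (1, num_classes),
    i.e. the logits for a single audio clip.
    """
    probs = torch.softmax(logits.reshape(-1), dim=-1)
    order = torch.argsort(probs, descending=True)
    return [(id2label[i], probs[i].item()) for i in order.tolist()]
```

With the objects from the usage code, this would be called as rank_emotions(preds.logits, model.config.id2label).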

Model size: 20.7M params (Safetensors, tensor type F32)

Base model: openai/whisper-base

Evaluation results

Self-reported metrics on the Sberdevices Dusha (crowd) test set:

Test Weighted Accuracy: 0.836
Test F1 macro: 0.843
Test Recall macro: 0.830
Test Precision macro: 0.850
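
For reference, macro-averaged precision, recall and F1 of the kind reported above can be computed as in the following plain-Python sketch. It assumes the usual definition (per-class scores averaged with equal weight, undefined ratios counted as 0); the card does not show its evaluation code, and macro_metrics is a hypothetical helper. Weighted Accuracy is not covered because the card does not define it.

```python
def macro_metrics(y_true, y_pred, num_classes=5):
    """Macro-averaged precision, recall and F1 over integer class ids.

    Each class contributes equally to the average, regardless of how
    many samples it has. A ratio with a zero denominator counts as 0.
    """
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "precision_macro": sum(precisions) / num_classes,
        "recall_macro": sum(recalls) / num_classes,
        "f1_macro": sum(f1s) / num_classes,
    }
```

Here y_true and y_pred are sequences of class ids (0 to 4 for this model), for example the argmax of the logits for each test clip.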