SER Wav2Vec2 — Speech Emotion Recognition

Model Summary

Architecture: Wav2Vec2 encoder with a sequence classification head (Wav2Vec2ForSequenceClassification).
Sampling rate: 16 kHz mono inputs.
Label set: angry, calm, disgust, fear, happy, neutral, sad, surprise.
Base: facebook/wav2vec2-base-960h fine‑tuned for emotion classification.

Intended Use

Input: short audio clips of speech.
Output: per‑class probabilities for the above emotions and a dominant label.
Use cases: UX feedback, call‑center analytics, demo apps, research prototypes.

Out‑of‑Scope

Clinical or safety‑critical decisions.
Non‑speech audio (music, noise) or multilingual speech without adaptation.

Data and Training

Dataset: CREMA‑D (via confit/cremad-parquet).
Preprocessing: resample to 16 kHz, convert to mono, optional padding for very short clips.
Fine‑tuning objective: cross‑entropy classification.

Configuration Snapshot

Hidden size: 768
Attention heads: 12
Hidden layers: 12
Classifier projection: 256
Final dropout: 0.0
Feature extractor norm: group

Metrics

Reported metrics: accuracy, F1‑score.
Benchmark numbers depend on split and preprocessing; fill in with your evaluation results.

Usage

Python (Transformers)

import torch, numpy as np, soundfile as sf, librosa
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor

model_dir = "path/to/best_model"  # local directory or Hub repo id
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
fe = AutoFeatureExtractor.from_pretrained(model_dir)
model.eval()

def load_audio(path, sr=fe.sampling_rate):
    y, s = sf.read(path, always_2d=False)
    if isinstance(y, np.ndarray):
        if y.ndim > 1:
            y = np.mean(y, axis=1)
        if s != sr:
            y = librosa.resample(y.astype(np.float32), orig_sr=s, target_sr=sr)
        y = y.astype(np.float32)
    else:
        y = np.array(y, dtype=np.float32)
    if y.size < sr // 10:
        y = np.pad(y, (0, max(0, sr - y.size)))
    return y

audio = load_audio("sample.wav")
inputs = fe(audio, sampling_rate=fe.sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0].cpu().numpy()
labels = [model.config.id2label[str(i)] if isinstance(list(model.config.id2label.keys())[0], str) else model.config.id2label[i] for i in range(len(probs))]
pairs = sorted(zip(labels, probs), key=lambda x: x[1], reverse=True)
print(pairs[:3])

FastAPI Integration

This repository includes a FastAPI server that exposes POST /predict and returns sorted label probabilities and the dominant emotion.

Input and Output Schema

Input: audio file (wav, mp3, m4a, etc.). Internally converted to 16 kHz mono.
Output JSON:

{
  "results": [{ "label": "happy", "score": 0.81 }, ...],
  "dominant": { "label": "happy", "score": 0.81 }
}

Limitations and Bias

Emotion labels are subjective; datasets may reflect staged emotions.
Performance can degrade with strong noise, reverberation, or accents not present in training.
Not suitable for sensitive decision‑making without rigorous validation.

Ethical Considerations

Obtain consent where required.
Be transparent about the model’s limitations and intended use.
Avoid deployment in contexts where misclassification can cause harm.

Citation

If you use this model, please cite:

Baevski et al., "wav2vec 2.0: A Framework for Self‑Supervised Learning of Speech Representations", NeurIPS 2020.
CREMA‑D: Cao et al., "CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset", IEEE Transactions on Affective Computing.

License

MIT

Downloads last month: 5

Safetensors

Model size

94.6M params

Tensor type

F32

Model tree for marshal-yash/SER_wav2vec

Base model

facebook/wav2vec2-base-960h

Finetuned

(181)

this model

marshal-yash
/

SER_wav2vec