---
license: mit
datasets:
- confit/cremad-parquet
language:
- en
metrics:
- accuracy
- f1
base_model:
- facebook/wav2vec2-base-960h
library_name: transformers
tags:
- speech
- speech-emotion-recognition
- ser
- wav2vec2
---

# SER Wav2Vec2 — Speech Emotion Recognition

## Model Summary

- Architecture: Wav2Vec2 encoder with a sequence classification head (`Wav2Vec2ForSequenceClassification`).
- Sampling rate: 16 kHz mono inputs.
- Label set: `angry`, `calm`, `disgust`, `fear`, `happy`, `neutral`, `sad`, `surprise`.
- Base: `facebook/wav2vec2-base-960h` fine‑tuned for emotion classification.

## Intended Use

- Input: short audio clips of speech.
- Output: per‑class probabilities for the above emotions and a dominant label.
- Use cases: UX feedback, call‑center analytics, demo apps, research prototypes.

## Out‑of‑Scope

- Clinical or safety‑critical decisions.
- Non‑speech audio (music, noise) or multilingual speech without adaptation.

## Data and Training

- Dataset: CREMA‑D (via `confit/cremad-parquet`).
- Preprocessing: resample to 16 kHz, convert to mono, optional padding for very short clips.
- Fine‑tuning objective: cross‑entropy classification.

## Configuration Snapshot

- Hidden size: 768
- Attention heads: 12
- Hidden layers: 12
- Classifier projection: 256
- Final dropout: 0.0
- Feature extractor norm: group

## Metrics

- Reported metrics: accuracy, F1‑score.
- Benchmark numbers depend on split and preprocessing; fill in with your own evaluation results (see the evaluation sketch at the end of this card).

## Usage

### Python (Transformers)

```python
import numpy as np
import soundfile as sf
import librosa
import torch
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor

model_dir = "path/to/best_model"  # local directory or Hub repo id
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
fe = AutoFeatureExtractor.from_pretrained(model_dir)
model.eval()

def load_audio(path, sr=fe.sampling_rate):
    """Load an audio file as 16 kHz mono float32, padding very short clips."""
    y, s = sf.read(path, always_2d=False)
    if isinstance(y, np.ndarray):
        if y.ndim > 1:  # downmix stereo/multichannel to mono
            y = np.mean(y, axis=1)
        if s != sr:  # resample to the feature extractor's rate
            y = librosa.resample(y.astype(np.float32), orig_sr=s, target_sr=sr)
        y = y.astype(np.float32)
    else:
        y = np.array(y, dtype=np.float32)
    if y.size < sr // 10:  # pad clips shorter than 100 ms up to one second
        y = np.pad(y, (0, max(0, sr - y.size)))
    return y

audio = load_audio("sample.wav")
inputs = fe(audio, sampling_rate=fe.sampling_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0].cpu().numpy()

# id2label keys may be ints or strings depending on how the config was saved
id2label = {int(k): v for k, v in model.config.id2label.items()}
labels = [id2label[i] for i in range(len(probs))]
pairs = sorted(zip(labels, probs), key=lambda x: x[1], reverse=True)
print(pairs[:3])
```

### FastAPI Integration

This repository includes a FastAPI server that exposes `POST /predict` and returns sorted label probabilities and the dominant emotion. A minimal sketch of such an endpoint, and of a client calling it, appears at the end of this card.

## Input and Output Schema

- Input: audio file (`wav`, `mp3`, `m4a`, etc.). Internally converted to 16 kHz mono.
- Output JSON:

```json
{
  "results": [{ "label": "happy", "score": 0.81 }, ...],
  "dominant": { "label": "happy", "score": 0.81 }
}
```

## Limitations and Bias

- Emotion labels are subjective; datasets may reflect staged emotions.
- Performance can degrade with strong noise, reverberation, or accents not present in training.
- Not suitable for sensitive decision‑making without rigorous validation.

## Ethical Considerations

- Obtain consent where required.
- Be transparent about the model’s limitations and intended use.
- Avoid deployment in contexts where misclassification can cause harm.
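## Example: Evaluating Accuracy and F1

The Metrics section above leaves benchmark numbers to be filled in after evaluation. The sketch below shows one way to compute accuracy and F1 with scikit-learn over a small labelled set of clips. The model directory, the file paths, the ground‑truth label names, and the choice of macro averaging for F1 are placeholders or assumptions, not details taken from this repository.

```python
# Hedged sketch: accuracy and macro F1 over a small labelled list of clips.
# Placeholder assumptions: model directory, file paths, ground-truth label names,
# and macro averaging for F1.
import librosa
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_dir = "path/to/best_model"  # same placeholder as in the usage example
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
fe = AutoFeatureExtractor.from_pretrained(model_dir)
model.eval()

id2label = {int(k): v for k, v in model.config.id2label.items()}
label2id = {name: idx for idx, name in id2label.items()}

eval_files = ["clip_001.wav", "clip_002.wav"]   # placeholder paths
eval_labels = ["happy", "angry"]                # placeholder ground-truth names

preds, refs = [], []
for path, name in zip(eval_files, eval_labels):
    # Decode to 16 kHz mono, matching the model's expected input.
    audio, _ = librosa.load(path, sr=fe.sampling_rate, mono=True)
    inputs = fe(audio, sampling_rate=fe.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    preds.append(int(logits.argmax(dim=-1)[0]))
    refs.append(label2id[name])

print("accuracy:", accuracy_score(refs, preds))
print("macro F1:", f1_score(refs, preds, average="macro"))
```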
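## Example: FastAPI Server Sketch

The actual FastAPI server ships with the repository; the snippet below is only an illustrative sketch of what a `POST /predict` endpoint returning the documented JSON schema could look like, not the repository's implementation. The model directory, the multipart field name `file`, and the use of `librosa` for decoding are assumptions.

```python
# Illustrative sketch of a POST /predict endpoint, not the repository's actual server.
# Placeholder assumptions: model directory path and the multipart field name "file".
import os
import tempfile

import librosa
import torch
from fastapi import FastAPI, File, UploadFile
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_dir = "path/to/best_model"  # same placeholder as in the usage example
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
fe = AutoFeatureExtractor.from_pretrained(model_dir)
model.eval()
id2label = {int(k): v for k, v in model.config.id2label.items()}

app = FastAPI()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Write the upload to a temporary file so librosa can decode it from disk.
    suffix = os.path.splitext(file.filename or "")[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name
    try:
        # Decode to 16 kHz mono, matching the model's expected input.
        audio, _ = librosa.load(path, sr=fe.sampling_rate, mono=True)
    finally:
        os.remove(path)
    inputs = fe(audio, sampling_rate=fe.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0].tolist()
    results = sorted(
        ({"label": id2label[i], "score": float(p)} for i, p in enumerate(probs)),
        key=lambda r: r["score"],
        reverse=True,
    )
    return {"results": results, "dominant": results[0]}
```

Assuming the sketch is saved as `server.py`, it could be served with `uvicorn server:app --port 8000`.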
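## Example: Calling the Endpoint

A matching client call against the schema in "Input and Output Schema" might look like the following. The host, port, and the multipart field name `file` are assumptions and must match the server actually deployed.

```python
# Hedged sketch of a client call; host, port, and the "file" field name are assumptions.
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/predict", files={"file": f})
resp.raise_for_status()
payload = resp.json()
print(payload["dominant"])       # e.g. {"label": "happy", "score": 0.81}
print(payload["results"][:3])    # top three emotions
```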
## Citation

If you use this model, please cite:

- Baevski et al., "wav2vec 2.0: A Framework for Self‑Supervised Learning of Speech Representations", NeurIPS 2020.
- Cao et al., "CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset", IEEE Transactions on Affective Computing, 2014.

## License

MIT