---
license: mit
datasets:
- confit/cremad-parquet
language:
- en
metrics:
- accuracy
- f1
base_model:
- facebook/wav2vec2-base-960h
library_name: transformers
tags:
- speech
- speech-emotion-recognition
- ser
- wav2vec2
---

# SER Wav2Vec2 — Speech Emotion Recognition

## Model Summary
- Architecture: Wav2Vec2 encoder with a sequence classification head (`Wav2Vec2ForSequenceClassification`).
- Sampling rate: 16 kHz mono input.
- Label set: `angry`, `calm`, `disgust`, `fear`, `happy`, `neutral`, `sad`, `surprise`.
- Base model: `facebook/wav2vec2-base-960h`, fine-tuned for emotion classification.

## Intended Use
- Input: short audio clips of speech.
- Output: per-class probabilities for the emotions above, plus the dominant label.
- Use cases: UX feedback, call-center analytics, demo apps, research prototypes.

## Out-of-Scope
- Clinical or safety-critical decisions.
- Non-speech audio (music, noise) and multilingual speech without adaptation.

## Data and Training
- Dataset: CREMA-D (via `confit/cremad-parquet`).
- Preprocessing: resample to 16 kHz, convert to mono, optionally pad very short clips.
- Fine-tuning objective: cross-entropy classification.

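The mono conversion and short-clip padding above reduce to a few array operations. A minimal numpy sketch with synthetic audio (resampling itself would use a library such as librosa; the function name and 100 ms threshold here are illustrative):

```python
import numpy as np

SR = 16_000  # target sampling rate assumed by the model

def to_mono_padded(y, sr=SR):
    """Average channels to mono and pad clips shorter than 100 ms."""
    y = np.asarray(y, dtype=np.float32)
    if y.ndim > 1:                       # stereo or multi-channel -> mono
        y = y.mean(axis=1)
    if y.size < sr // 10:                # shorter than 100 ms
        y = np.pad(y, (0, sr - y.size))  # zero-pad out to one second
    return y

stereo = np.ones((800, 2), dtype=np.float32)  # 50 ms of stereo samples
mono = to_mono_padded(stereo)
print(mono.shape)  # (16000,)
```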
## Configuration Snapshot
- Hidden size: 768
- Attention heads: 12
- Hidden layers: 12
- Classifier projection size: 256
- Final dropout: 0.0
- Feature extractor norm: group

## Metrics
- Reported metrics: accuracy and F1-score.
- Benchmark numbers depend on the split and preprocessing; fill in your own evaluation results here.

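For reference, accuracy and macro-averaged F1 can be computed in a few lines of plain Python; the `y_true`/`y_pred` lists below are placeholders for illustration, not results from this model:

```python
def accuracy(y_true, y_pred):
    """Fraction of exact label matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lbl in labels:
        tp = sum(t == lbl and p == lbl for t, p in zip(y_true, y_pred))
        fp = sum(t != lbl and p == lbl for t, p in zip(y_true, y_pred))
        fn = sum(t == lbl and p != lbl for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

# Placeholder predictions, for illustration only
y_true = ["happy", "sad", "angry", "neutral", "happy"]
y_pred = ["happy", "sad", "neutral", "neutral", "happy"]
print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 3))  # 0.8 0.667
```

`scikit-learn`'s `accuracy_score` and `f1_score(average="macro")` give the same numbers in practice.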
## Usage

### Python (Transformers)
```python
import numpy as np
import soundfile as sf
import librosa
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_dir = "path/to/best_model"  # local directory or Hub repo id
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
fe = AutoFeatureExtractor.from_pretrained(model_dir)
model.eval()

def load_audio(path, sr=fe.sampling_rate):
    """Load an audio file as mono float32 at the model's sampling rate."""
    y, s = sf.read(path, always_2d=False)
    y = np.asarray(y, dtype=np.float32)
    if y.ndim > 1:                       # average channels to mono
        y = y.mean(axis=1)
    if s != sr:                          # resample to 16 kHz
        y = librosa.resample(y, orig_sr=s, target_sr=sr)
    if y.size < sr // 10:                # pad clips shorter than 100 ms
        y = np.pad(y, (0, sr - y.size))
    return y

audio = load_audio("sample.wav")
inputs = fe(audio, sampling_rate=fe.sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0].numpy()
# from_pretrained normalizes id2label keys to ints
labels = [model.config.id2label[i] for i in range(len(probs))]
pairs = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
print(pairs[:3])  # top-3 emotions with probabilities
```

### FastAPI Integration
This repository includes a FastAPI server that exposes `POST /predict` and returns the label probabilities in descending order along with the dominant emotion.

## Input and Output Schema
- Input: an audio file (`wav`, `mp3`, `m4a`, etc.), converted internally to 16 kHz mono.
- Output JSON:
```json
{
  "results": [{ "label": "happy", "score": 0.81 }, ...],
  "dominant": { "label": "happy", "score": 0.81 }
}
```

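On the client side, the dominant emotion is simply the top-scoring entry of `results`; a small sketch against a hard-coded placeholder response (scores here are made up):

```python
import json

raw = """
{
  "results": [{"label": "happy", "score": 0.81},
              {"label": "neutral", "score": 0.12},
              {"label": "sad", "score": 0.07}],
  "dominant": {"label": "happy", "score": 0.81}
}
"""
response = json.loads(raw)

# The dominant entry should agree with the top-scoring result
top = max(response["results"], key=lambda r: r["score"])
assert top == response["dominant"]
print(top["label"], top["score"])  # happy 0.81
```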
## Limitations and Bias
- Emotion labels are subjective, and acted datasets such as CREMA-D reflect staged rather than spontaneous emotions.
- Performance can degrade with strong noise, reverberation, or accents not represented in the training data.
- Not suitable for sensitive decision-making without rigorous validation.

## Ethical Considerations
- Obtain consent where required before analyzing speech.
- Be transparent about the model's limitations and intended use.
- Avoid deployment in contexts where misclassification can cause harm.

## Citation
If you use this model, please cite:
- Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", NeurIPS 2020.
- Cao et al., "CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset", IEEE Transactions on Affective Computing, 2014.

## License
MIT