---
license: mit
datasets:
- confit/cremad-parquet
language:
- en
metrics:
- accuracy
- f1
base_model:
- facebook/wav2vec2-base-960h
library_name: transformers
tags:
- speech
- speech-emotion-recognition
- ser
- wav2vec2
---

# SER Wav2Vec2 — Speech Emotion Recognition

## Model Summary

- Architecture: Wav2Vec2 encoder with a sequence classification head (`Wav2Vec2ForSequenceClassification`).
- Sampling rate: 16 kHz mono inputs.
- Label set: `angry`, `calm`, `disgust`, `fear`, `happy`, `neutral`, `sad`, `surprise`.
- Base: `facebook/wav2vec2-base-960h` fine‑tuned for emotion classification.
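
For reference, the eight labels map to integer class ids in the classification head. The sketch below assumes alphabetical ordering; the authoritative mapping lives in `model.config.id2label` and may differ.

```python
# Label/id mapping sketch; read the real mapping from model.config.id2label.
LABELS = ["angry", "calm", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for i, label in enumerate(LABELS)}

print(label2id["happy"])  # 4
print(id2label[6])        # sad
```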

## Intended Use

- Input: short audio clips of speech.
- Output: per‑class probabilities for the above emotions and a dominant label.
- Use cases: UX feedback, call‑center analytics, demo apps, research prototypes.
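
Turning raw model logits into the per‑class probabilities and dominant label described above is a softmax followed by an argmax. A NumPy sketch (the logit values are made up for illustration):

```python
import numpy as np

LABELS = ["angry", "calm", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.2, -1.3, 0.1, -0.5, 2.4, 1.1, -0.2, 0.0])  # made-up values
probs = softmax(logits)
dominant = LABELS[int(np.argmax(probs))]

print(dominant)  # happy
```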

## Out‑of‑Scope

- Clinical or safety‑critical decisions.
- Non‑speech audio (music, noise) or multilingual speech without adaptation.

## Data and Training

- Dataset: CREMA‑D (via `confit/cremad-parquet`).
- Preprocessing: resample to 16 kHz, convert to mono, optional padding for very short clips.
- Fine‑tuning objective: cross‑entropy classification.
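
The preprocessing steps above can be sketched with plain NumPy. Linear interpolation stands in for resampling here to keep the example dependency‑free; the actual pipeline may use a dedicated resampler such as `librosa.resample`.

```python
import numpy as np

TARGET_SR = 16_000  # the model expects 16 kHz mono input

def preprocess(y: np.ndarray, sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Downmix to mono, resample to 16 kHz, and pad very short clips."""
    y = np.asarray(y, dtype=np.float32)
    if y.ndim > 1:                      # multi-channel -> average the channels
        y = y.mean(axis=1)
    if sr != target_sr:                 # naive linear-interpolation resample
        n_out = int(round(y.size * target_sr / sr))
        x_old = np.linspace(0.0, 1.0, num=y.size, endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        y = np.interp(x_new, x_old, y).astype(np.float32)
    if y.size < target_sr // 10:        # pad clips shorter than 100 ms
        y = np.pad(y, (0, target_sr - y.size))
    return y

# stereo clip at 44.1 kHz, 50 ms long -> mono, 16 kHz, padded
clip = np.random.randn(2205, 2).astype(np.float32)
out = preprocess(clip, 44_100)
print(out.shape, out.dtype)
```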

## Configuration Snapshot

- Hidden size: 768
- Attention heads: 12
- Hidden layers: 12
- Classifier projection: 256
- Final dropout: 0.0
- Feature extractor norm: group
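
These values correspond to fields in the model's `config.json` (field names as used by `Wav2Vec2Config`); a fragment consistent with the snapshot:

```json
{
  "hidden_size": 768,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "classifier_proj_size": 256,
  "final_dropout": 0.0,
  "feat_extract_norm": "group"
}
```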

## Metrics

- Reported metrics: accuracy, F1‑score.
- Benchmark numbers depend on split and preprocessing; fill in with your evaluation results.
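
Accuracy and macro‑averaged F1 can be computed with `sklearn.metrics`, or by hand as below (toy labels for illustration):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred))    # 4/6
print(macro_f1(y_true, y_pred, 3))
```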

## Usage

### Python (Transformers)

```python
import numpy as np
import librosa
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_dir = "path/to/best_model"  # local directory or Hub repo id
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
fe = AutoFeatureExtractor.from_pretrained(model_dir)
model.eval()


def load_audio(path, sr=fe.sampling_rate):
    """Load an audio file as 16 kHz mono float32, padding very short clips."""
    y, s = sf.read(path, always_2d=False)
    y = np.asarray(y, dtype=np.float32)
    if y.ndim > 1:  # downmix multi-channel audio to mono
        y = y.mean(axis=1)
    if s != sr:
        y = librosa.resample(y, orig_sr=s, target_sr=sr)
    if y.size < sr // 10:  # pad clips shorter than 100 ms up to one second
        y = np.pad(y, (0, sr - y.size))
    return y


audio = load_audio("sample.wav")
inputs = fe(audio, sampling_rate=fe.sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0].numpy()
labels = [model.config.id2label[i] for i in range(len(probs))]  # keys are ints after from_pretrained
pairs = sorted(zip(labels, probs), key=lambda x: x[1], reverse=True)
print(pairs[:3])
```

### FastAPI Integration

This repository includes a FastAPI server that exposes `POST /predict` and returns sorted label probabilities and the dominant emotion.

## Input and Output Schema

- Input: audio file (`wav`, `mp3`, `m4a`, etc.). Internally converted to 16 kHz mono.
- Output JSON:

```json
{
  "results": [{ "label": "happy", "score": 0.81 }, ...],
  "dominant": { "label": "happy", "score": 0.81 }
}
```
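
On the client side, `results` is already sorted, so the dominant emotion is the first (equivalently, the highest‑scoring) entry. A small sketch with a hypothetical response body:

```python
import json

# Hypothetical response matching the schema above.
raw = '''{
  "results": [{"label": "happy", "score": 0.81},
              {"label": "neutral", "score": 0.12},
              {"label": "sad", "score": 0.07}],
  "dominant": {"label": "happy", "score": 0.81}
}'''

payload = json.loads(raw)
top = max(payload["results"], key=lambda r: r["score"])

print(top["label"])               # happy
print(top == payload["dominant"]) # True
```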

## Limitations and Bias

- Emotion labels are subjective, and acted datasets such as CREMA‑D reflect staged rather than spontaneous emotion.
- Performance can degrade with strong noise, reverberation, or accents absent from the training data.
- Not suitable for sensitive decision‑making without rigorous validation.

## Ethical Considerations

- Obtain consent where required.
- Be transparent about the model’s limitations and intended use.
- Avoid deployment in contexts where misclassification can cause harm.

## Citation

If you use this model, please cite:

- Baevski et al., "wav2vec 2.0: A Framework for Self‑Supervised Learning of Speech Representations", NeurIPS 2020.
- Cao et al., "CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset", IEEE Transactions on Affective Computing, 2014.

## License

MIT