DistilHuBERT SER (Speech Emotion Recognition)
Arousal-valence dimensional emotion recognition from speech, built on ntu-spml/distilhubert.
Model Details
- Backbone: DistilHuBERT (23.5M params)
- Head: Linear(768 → 256) → GELU → Dropout(0.3) → Linear(256 → 2) → Tanh
- Output: arousal ∈ [-1, 1], valence ∈ [-1, 1]
- Input: Raw 16 kHz waveform, variable length
- Training data: CREMA-D (7,442 samples, 91 speakers)
- Loss: Concordance Correlation Coefficient (CCC)
- Best CCC: arousal=0.783, valence=0.737, avg=0.760
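For readers unfamiliar with the metric, here is a minimal sketch of how CCC can be computed for one dimension, using only the Python standard library. It uses the common definition 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²) with population statistics. The `ccc_loss` helper, which takes 1 minus the mean CCC over arousal and valence, is an assumption about how the loss might be formed and is not confirmed by this model card. The function names are illustrative.

```python
from statistics import fmean


def ccc(pred, target):
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_t = fmean(pred), fmean(target)
    # Population (biased) variances and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    # Assumption: CCC is undefined when both sequences are constant and
    # equal; this sketch returns 0.0 in that case.
    return 2.0 * cov / denom if denom > 0 else 0.0


def ccc_loss(pred_arousal, true_arousal, pred_valence, true_valence):
    """Hypothetical loss: 1 minus the mean CCC over the two dimensions."""
    mean_ccc = 0.5 * (ccc(pred_arousal, true_arousal)
                      + ccc(pred_valence, true_valence))
    return 1.0 - mean_ccc
```

The reported average is consistent with a plain mean of the two per-dimension scores: (0.783 + 0.737) / 2 = 0.760.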
Files
| File | Size | Description |
|---|---|---|
| | 90 MB | fp32 ONNX model |
| | 48 MB | INT8 dynamic quantized (deployment) |
Usage (ONNX Runtime Web)
Training
Trained with CCC loss on CREMA-D categorical emotions mapped to dimensional centroids:
- ANG → arousal=0.8, valence=-0.6
- DIS → arousal=0.3, valence=-0.7
- FEA → arousal=0.7, valence=-0.5
- HAP → arousal=0.6, valence=0.7
- NEU → arousal=0.0, valence=0.0
- SAD → arousal=-0.5, valence=-0.4
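The centroid mapping above can be written as a small lookup table. The sketch below is illustrative and is not the training code. The helper names are hypothetical, and `target_from_filename` assumes the usual CREMA-D file naming of `ActorID_Sentence_Emotion_Level.wav` (for example `1001_DFA_ANG_XX.wav`), which this model card does not state.

```python
# (arousal, valence) centroid for each CREMA-D emotion code, as listed above.
CENTROIDS = {
    "ANG": (0.8, -0.6),
    "DIS": (0.3, -0.7),
    "FEA": (0.7, -0.5),
    "HAP": (0.6, 0.7),
    "NEU": (0.0, 0.0),
    "SAD": (-0.5, -0.4),
}


def label_to_target(label):
    """Return the (arousal, valence) regression target for an emotion code."""
    try:
        return CENTROIDS[label]
    except KeyError:
        raise ValueError(f"unknown emotion code: {label!r}") from None


def target_from_filename(filename):
    """Hypothetical helper: read the emotion code from a CREMA-D file name.

    Assumes the name has the form ActorID_Sentence_Emotion_Level.wav,
    so the emotion code is the third underscore-separated field.
    """
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    fields = stem.split("_")
    if len(fields) < 3:
        raise ValueError(f"unexpected file name: {filename!r}")
    return label_to_target(fields[2])
```

Because every clip of a given category shares the same target, the model regresses toward six fixed points in the arousal-valence plane, all of which lie inside the Tanh output range of [-1, 1].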
Training ran for 30 epochs with batch size 16 and learning rate 1e-4, using the AdamW optimizer.