DistilHuBERT SER (Speech Emotion Recognition)

Arousal-valence dimensional emotion recognition from speech, built on ntu-spml/distilhubert.

Model Details

Backbone: DistilHuBERT (23.5M params)
Head: Linear(768 → 256) → GELU → Dropout(0.3) → Linear(256 → 2) → Tanh
Output: arousal ∈ [-1, 1], valence ∈ [-1, 1]
Input: Raw 16 kHz waveform, variable length
Training data: CREMA-D (7,442 samples, 91 speakers)
Loss: Concordance Correlation Coefficient (CCC)
Best CCC: arousal=0.783, valence=0.737, avg=0.760
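
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of the concordance correlation coefficient, using the standard definition 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function names `ccc` and `ccc_loss` are illustrative. Taking the loss as 1 − CCC is a common convention and an assumption here, since this card does not spell out the exact loss formulation or how the two dimensions are combined.

```python
from statistics import fmean


def ccc(pred, target):
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_t = fmean(pred), fmean(target)
    # Population (biased) variance and covariance, as in the usual CCC definition.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical; CCC is undefined.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return 2.0 * cov / denom


def ccc_loss(pred, target):
    """Assumed loss form: 1 - CCC, so perfect agreement gives a loss of 0."""
    return 1.0 - ccc(pred, target)
```

Unlike Pearson correlation, CCC penalizes a constant offset: `ccc([0.0, 1.0], [1.0, 2.0])` is 1/3 even though the two sequences are perfectly correlated.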

Files

File	Size	Description
	90 MB	fp32 ONNX model
	48 MB	INT8 dynamic quantized (deployment)

Usage (ONNX Runtime Web)

Training

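Trained with CCC loss on CREMA-D. CREMA-D provides categorical emotion labels only, so each of its six categories is mapped to a fixed arousal-valence centroid that serves as the regression target: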

ANG → arousal=0.8, valence=-0.6
DIS → arousal=0.3, valence=-0.7
FEA → arousal=0.7, valence=-0.5
HAP → arousal=0.6, valence=0.7
NEU → arousal=0.0, valence=0.0
SAD → arousal=-0.5, valence=-0.4
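
The mapping above can be written as a lookup table. The sketch below is illustrative, not the training code: `CENTROIDS`, `target_for_label`, and `label_from_filename` are hypothetical names, and the filename parser assumes the usual CREMA-D clip naming of the form `1001_DFA_ANG_XX.wav` (actor, sentence, emotion, intensity).

```python
from pathlib import PurePath

# Emotion code -> (arousal, valence) centroid, values copied from the list above.
CENTROIDS = {
    "ANG": (0.8, -0.6),
    "DIS": (0.3, -0.7),
    "FEA": (0.7, -0.5),
    "HAP": (0.6, 0.7),
    "NEU": (0.0, 0.0),
    "SAD": (-0.5, -0.4),
}


def target_for_label(label):
    """Return the (arousal, valence) regression target for an emotion code."""
    return CENTROIDS[label.upper()]


def label_from_filename(name):
    """Extract the emotion code from a clip name.

    Assumption: names follow the ActorID_Sentence_Emotion_Level pattern,
    so the emotion code is the third underscore-separated field.
    """
    return PurePath(name).stem.split("_")[2]
```

For example, `target_for_label(label_from_filename("1001_DFA_ANG_XX.wav"))` returns `(0.8, -0.6)`. Because every clip of a given category shares the same target, the model regresses toward six fixed points rather than per-utterance ratings.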

Hyperparameters: 30 epochs, batch size 16, learning rate 1e-4, AdamW optimizer.
