DistilHuBERT SER (Speech Emotion Recognition)

Arousal-valence dimensional emotion recognition from speech, built on ntu-spml/distilhubert.

Model Details

  • Backbone: DistilHuBERT (23.5M params)
  • Head: Linear(768 → 256) → GELU → Dropout(0.3) → Linear(256 → 2) → Tanh
  • Output: arousal ∈ [-1, 1], valence ∈ [-1, 1]
  • Input: Raw 16 kHz waveform, variable length
  • Training data: CREMA-D (7,442 samples, 91 speakers)
  • Loss: Concordance Correlation Coefficient (CCC)
  • Best CCC: arousal=0.783, valence=0.737, avg=0.760
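
To make the head's shapes concrete, here is a minimal, dependency-free Python sketch of its inference-time forward pass. It is an illustration, not the training code. The weights are random placeholders. Feeding the head a single pooled 768-dimensional embedding is an assumption, since the pooling of frame-level backbone features is not described here. The exact (erf-based) GELU and the arousal-then-valence output order are assumptions as well. Dropout is left out because it acts as an identity at inference.

```python
import math
import random

EMBED_DIM, HIDDEN_DIM, OUT_DIM = 768, 256, 2


def init_linear(n_in, n_out, rng):
    """Random placeholder weights; a trained checkpoint would supply the real ones."""
    bound = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weight, bias


def linear(x, weight, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact (erf-based) GELU; the variant used in training is an assumption."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def head_forward(embedding, params):
    """Linear(768 -> 256) -> GELU -> Linear(256 -> 2) -> Tanh (Dropout skipped at inference)."""
    (w1, b1), (w2, b2) = params
    hidden = [gelu(h) for h in linear(embedding, w1, b1)]
    arousal, valence = (math.tanh(o) for o in linear(hidden, w2, b2))
    return arousal, valence


def count_params(params):
    """Number of weights and biases in the head."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


rng = random.Random(0)
params = [init_linear(EMBED_DIM, HIDDEN_DIM, rng), init_linear(HIDDEN_DIM, OUT_DIM, rng)]
# Stand-in for a pooled backbone feature vector (hypothetical input).
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
arousal, valence = head_forward(embedding, params)
print(count_params(params))  # 197378
```

The final Tanh is what keeps both outputs inside [-1, 1], and the head adds only about 0.2M parameters on top of the backbone.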

Files

Size    Description
90 MB   fp32 ONNX model
48 MB   INT8 dynamically quantized model (deployment)

Usage (ONNX Runtime Web)

Training

Trained with CCC loss on CREMA-D categorical emotions mapped to dimensional centroids:

  • ANG → arousal=0.8, valence=-0.6
  • DIS → arousal=0.3, valence=-0.7
  • FEA → arousal=0.7, valence=-0.5
  • HAP → arousal=0.6, valence=0.7
  • NEU → arousal=0.0, valence=0.0
  • SAD → arousal=-0.5, valence=-0.4
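
The mapping above can be written as a small lookup table. The sketch below is illustrative only: `CENTROIDS`, `target_for`, and `label_from_filename` are hypothetical names, and the filename parser assumes the usual CREMA-D clip naming of `ActorID_SentenceID_Emotion_Level.wav` (for example `1001_DFA_ANG_XX.wav`).

```python
from pathlib import Path

# (arousal, valence) regression target for each CREMA-D emotion code.
CENTROIDS = {
    "ANG": (0.8, -0.6),
    "DIS": (0.3, -0.7),
    "FEA": (0.7, -0.5),
    "HAP": (0.6, 0.7),
    "NEU": (0.0, 0.0),
    "SAD": (-0.5, -0.4),
}


def target_for(label):
    """Return the (arousal, valence) centroid for a categorical label."""
    return CENTROIDS[label.upper()]


def label_from_filename(name):
    """Extract the emotion code from a clip name (assumed naming scheme)."""
    return Path(name).stem.split("_")[2]


clip = "1001_DFA_ANG_XX.wav"
print(target_for(label_from_filename(clip)))  # (0.8, -0.6)
```

Because every clip of a given emotion shares one target, the regression targets take only six distinct values.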

Hyperparameters: 30 epochs, batch size 16, learning rate 1e-4, AdamW optimizer.
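
For reference, the concordance correlation coefficient between predictions x and targets y is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below computes it in plain Python. Using 1 − CCC as the quantity to minimise is an assumption here (a common choice), as is the use of population (divide-by-n) statistics. The function names are hypothetical.

```python
def ccc(pred, target):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    # The denominator is zero only when both sequences are constant and equal;
    # a training loop would normally add a small epsilon to guard against that.
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


def ccc_loss(pred, target):
    """Assumed loss form: 1 - CCC, so perfect agreement gives 0."""
    return 1.0 - ccc(pred, target)


print(ccc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect agreement
print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # perfectly correlated but offset by 1
```

The second call shows why CCC is stricter than Pearson correlation: a constant offset leaves the correlation at 1 but lowers the CCC to 4/7, so the model is penalised for bias and scale errors as well as for poor ranking.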
