Whisper → V/A/D Regressor
A lightweight regressor that maps Whisper-large-v3-turbo encoder hidden states to continuous Valence / Arousal / Dominance coordinates in [-1, +1]^3.
Trained on CREMA-D, which is distributed under the Open Database License. Used as an external V/A/D auditor in the remotemedia-sdk affect-pipeline projects (LLM activation steering, audio→blendshape diffusion).
Two variants ship in this repo
| Variant | File | Test Pearson r (V/A/D) | Notes |
|---|---|---|---|
| Ridge ⭐ | ridge.onnx | 0.58 / 0.66 / 0.69 | Recommended. Linear projection, ~15 KB, faster. |
| MLP | mlp.onnx + mlp.onnx.data | 0.28 / 0.62 / 0.55 | Reference. 2-layer MLP, ~1.4 MB external weights. |
Ridge wins on every axis on the test split. The MLP is shipped only as
a comparison baseline; production use should default to ridge.onnx.
I/O signature
Both models share the same ONNX I/O:
| Tensor | Name | Shape | dtype | Description |
|---|---|---|---|---|
| input | whisper_embed | (batch, 1280) | float32 | Mean-pooled Whisper encoder hidden state |
| output | vad | (batch, 3) | float32 | [valence, arousal, dominance] ∈ [-1, +1]^3 |
Critical: input is per-clip, not per-frame. Pool Whisper hidden states over time (mean across the time axis) before feeding to this regressor.
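The pooling step can be sketched in isolation as follows. This is a minimal illustration, assuming the per-frame hidden states are already available as a NumPy array; the helper name `pool_clip` and the constant `EMBED_DIM` are hypothetical and not part of this repo.

```python
import numpy as np

EMBED_DIM = 1280  # hidden size listed in the I/O table above


def pool_clip(hidden: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame hidden states into one embedding per clip.

    Accepts (frames, 1280) for a single clip or (batch, frames, 1280)
    for several clips, and returns a float32 array of shape (batch, 1280).
    """
    hidden = np.asarray(hidden, dtype=np.float32)
    if hidden.ndim == 2:
        hidden = hidden[None, ...]  # add a batch axis for a single clip
    if hidden.ndim != 3 or hidden.shape[-1] != EMBED_DIM:
        raise ValueError(
            f"expected (..., frames, {EMBED_DIM}), got {hidden.shape}"
        )
    return hidden.mean(axis=1)  # average over the time axis
```

The result has the `(batch, 1280)` float32 layout that the `whisper_embed` input expects, so it can be passed to the ONNX session without further reshaping.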
Inference snippet
import librosa
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModel, AutoProcessor

# 1. Load Whisper encoder + this regressor
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3-turbo")
whisper = AutoModel.from_pretrained("openai/whisper-large-v3-turbo").get_encoder().eval()
sess = ort.InferenceSession("ridge.onnx", providers=["CPUExecutionProvider"])

# 2. Extract hidden states from layer -2 (penultimate); mean-pool over time
audio, _ = librosa.load("clip.wav", sr=16000)  # resampled to 16 kHz mono
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = whisper(**inputs, output_hidden_states=True)
    embed = out.hidden_states[-2].mean(dim=1).numpy().astype(np.float32)  # (1, 1280)

# 3. Predict V/A/D
vad = sess.run(["vad"], {"whisper_embed": embed})[0]  # (1, 3)
print(f"valence={vad[0,0]:+.2f} arousal={vad[0,1]:+.2f} dominance={vad[0,2]:+.2f}")
Training
- Architecture (ridge): W ∈ R^(1280 × 3) + bias. Closed-form ridge regression with α=1.0, no regularization on bias. Total params: 3,843.
- Architecture (mlp): Linear(1280, 256) → GELU → Linear(256, 3), AdamW + cosine schedule, 100 epochs. Total params: 328,963.
- Audio frontend: Whisper-large-v3-turbo encoder, layer −2 (penultimate, pre-layernorm) hidden states, mean-pooled over time.
- Dataset: CREMA-D, speaker-disjoint split (train=1391, val=305, test=304 clips).
- Targets: continuous V/A/D coordinates derived from CREMA-D
categorical labels via a fixed hand-authored anchor table
(Russell circumplex + Mehrabian PAD + NRC-VAD style; see
emotion_to_vad.py).
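For the ridge variant described in the list above, one standard way to obtain a closed-form solution with an unpenalized bias is to center the inputs and targets, solve the regularized normal equations, and recover the bias from the means. The sketch below illustrates that approach on small synthetic data; the function names and the synthetic arrays are assumptions for illustration, and the actual training script may differ.

```python
import numpy as np


def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unregularized bias.

    X: (n, d) embeddings, Y: (n, k) targets.
    Returns W with shape (d, k) and b with shape (k,).
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean  # centering keeps the bias out of the penalty
    Yc = Y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def predict(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    return X @ W + b


# Synthetic stand-in data (not CREMA-D): 200 "clips", 8-dim embeddings, 3 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
W_true = rng.normal(size=(8, 3))
b_true = np.array([0.1, -0.2, 0.3])
Y = X @ W_true + b_true

W, b = fit_ridge(X, Y, alpha=1.0)
n_params = W.size + b.size  # d * k + k; at full size 1280 * 3 + 3 = 3,843
```

With d = 1280 and k = 3 the same parameter-count formula gives the 3,843 parameters quoted for the ridge variant.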
Detailed metrics
Ridge (recommended)
{
  "dataset": "crema_d",
  "whisper_model": "openai/whisper-large-v3-turbo",
  "whisper_layer": -2,
  "regressor": "ridge",
  "n_train": 1391,
  "n_val": 305,
  "n_test": 304,
  "speaker_disjoint": true,
  "axes": [
    "valence",
    "arousal",
    "dominance"
  ],
  "val": {
    "n": 305,
    "rmse": {
      "valence": 0.41578346490859985,
      "arousal": 0.27047833800315857,
      "dominance": 0.2509686052799225
    },
    "mae": {
      "valence": 0.3144280016422272,
      "arousal": 0.21582330763339996,
      "dominance": 0.20134007930755615
    },
    "pearson_r": {
      "valence": 0.6142750533217433,
      "arousal": 0.6183148626337297,
      "dominance": 0.7671747949052007
    }
  },
  "test": {
    "n": 304,
    "rmse": {
      "valence": 0.4229687750339508,
      "arousal": 0.2829131484031677,
      "dominance": 0.2870493531227112
    },
    "mae": {
      "valence": 0.3345738351345062,
      "arousal": 0.22188791632652283,
      "dominance": 0.2326992005109787
    },
    "pearson_r": {
      "valence": 0.5816889655848607,
      "arousal": 0.6575878679123864,
      "dominance": 0.6881453235310694
    }
  },
  "dry_run": false
}
MLP (reference)
{
  "dataset": "crema_d",
  "whisper_model": "openai/whisper-large-v3-turbo",
  "whisper_layer": -2,
  "regressor": "mlp",
  "n_train": 1391,
  "n_val": 305,
  "n_test": 304,
  "speaker_disjoint": true,
  "axes": [
    "valence",
    "arousal",
    "dominance"
  ],
  "val": {
    "n": 305,
    "rmse": {
      "valence": 0.4990813732147217,
      "arousal": 0.2946903705596924,
      "dominance": 0.299040824174881
    },
    "mae": {
      "valence": 0.42277732491493225,
      "arousal": 0.24514958262443542,
      "dominance": 0.25509873032569885
    },
    "pearson_r": {
      "valence": 0.339805625682777,
      "arousal": 0.518995070692627,
      "dominance": 0.672127424657303
    }
  },
  "test": {
    "n": 304,
    "rmse": {
      "valence": 0.5018822550773621,
      "arousal": 0.3011777400970459,
      "dominance": 0.32821124792099
    },
    "mae": {
      "valence": 0.4377283751964569,
      "arousal": 0.24337027966976166,
      "dominance": 0.28178495168685913
    },
    "pearson_r": {
      "valence": 0.2784075767971956,
      "arousal": 0.6199177507366342,
      "dominance": 0.5484364257265157
    }
  },
  "dry_run": false
}
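The three per-axis metrics reported above (RMSE, MAE, Pearson r) can be computed as in the following sketch. It uses only the standard library and assumes targets and predictions are given as lists of `[valence, arousal, dominance]` rows; the function names are hypothetical and this is not the repo's evaluation script.

```python
import math

AXES = ("valence", "arousal", "dominance")


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def per_axis_metrics(targets, preds, axes=AXES):
    """Build a report with the same nesting as the JSON above: metric -> axis -> value."""
    report = {"n": len(targets), "rmse": {}, "mae": {}, "pearson_r": {}}
    for i, axis in enumerate(axes):
        t = [row[i] for row in targets]
        p = [row[i] for row in preds]
        report["rmse"][axis] = rmse(t, p)
        report["mae"][axis] = mae(t, p)
        report["pearson_r"][axis] = pearson_r(t, p)
    return report
```

Pearson r is invariant to the scale and offset of the predictions, so it is worth reading alongside RMSE and MAE: predictions that are perfectly correlated with the targets but compressed toward zero still score r = 1 while carrying a nonzero error.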
License
MIT (this model). Upstream licenses:
- CREMA-D training data: Open Database License — commercial use allowed with attribution.
- Whisper-large-v3-turbo audio frontend: Apache 2.0.
If you publish work using this regressor, cite CREMA-D + the Whisper paper:
@article{cao2014cremad,
title={CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset},
author={Cao, Houwei and Cooper, David G and Keutmann, Michael K and Gur, Ruben C and Nenkova, Ani and Verma, Ragini},
journal={IEEE Transactions on Affective Computing},
year={2014}
}
@inproceedings{radford2023whisper,
title={Robust Speech Recognition via Large-Scale Weak Supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
booktitle={ICML},
year={2023}
}