Whisper → V/A/D Regressor

A lightweight regressor that maps Whisper-large-v3-turbo encoder hidden states to continuous Valence / Arousal / Dominance coordinates in [-1, +1]^3.

Trained on CREMA-D under the Open Database License. Used as an external V/A/D auditor in the remotemedia-sdk affect-pipeline projects (LLM activation steering, audio→blendshape diffusion).

Two variants ship in this repo

| Variant | File | Test Pearson r (V/A/D) | Notes |
|---|---|---|---|
| Ridge | ridge.onnx | 0.58 / 0.66 / 0.69 | Recommended. Linear projection, ~15 KB, faster. |
| MLP | mlp.onnx + mlp.onnx.data | 0.28 / 0.62 / 0.55 | Reference. 2-layer MLP, ~1.4 MB external weights. |

Ridge wins on every axis on the test split. The MLP is shipped only as a comparison baseline; production use should default to ridge.onnx.
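
The comparison can be checked directly from the reported numbers. The sketch below copies the test-split Pearson r values from the detailed metrics (rounded to four decimals) and computes the per-axis gap; the variable names are illustrative only.

```python
# Test-split Pearson r per axis, copied from the detailed metrics in this card.
ridge_r = {"valence": 0.5817, "arousal": 0.6576, "dominance": 0.6881}
mlp_r = {"valence": 0.2784, "arousal": 0.6199, "dominance": 0.5484}

# Per-axis gap (ridge minus MLP); a positive value means ridge is ahead.
gap = {axis: round(ridge_r[axis] - mlp_r[axis], 4) for axis in ridge_r}
ridge_wins_everywhere = all(g > 0 for g in gap.values())

for axis, g in gap.items():
    print(f"{axis:9s} ridge={ridge_r[axis]:.2f} mlp={mlp_r[axis]:.2f} gap={g:+.2f}")
```

The gap is largest on valence and smallest on arousal, where the two variants are close.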

I/O signature

Both models share the same ONNX I/O:

| Tensor | Name | Shape | dtype | Description |
|---|---|---|---|---|
| input | whisper_embed | (batch, 1280) | float32 | Mean-pooled Whisper encoder hidden state |
| output | vad | (batch, 3) | float32 | [valence, arousal, dominance] ∈ [-1, +1]^3 |

Critical: input is per-clip, not per-frame. Pool Whisper hidden states over time (mean across the time axis) before feeding to this regressor.
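
As a minimal sketch of that pooling step, the helper below turns per-frame states of shape (batch, time, 1280) into the (batch, 1280) float32 array the regressor expects. The function name `pool_frames` and the random stand-in data are assumptions for illustration; real input would come from the Whisper encoder.

```python
import numpy as np

EMBED_DIM = 1280  # encoder width, per the I/O table


def pool_frames(hidden: np.ndarray) -> np.ndarray:
    """Mean-pool (batch, time, 1280) frame states into (batch, 1280) clip embeddings."""
    if hidden.ndim != 3 or hidden.shape[-1] != EMBED_DIM:
        raise ValueError(f"expected (batch, time, {EMBED_DIM}), got {hidden.shape}")
    # Average over the time axis (axis=1), then cast to the dtype the ONNX input expects.
    return hidden.mean(axis=1).astype(np.float32)


# Stand-in for real encoder output: 2 clips, 50 frames each (arbitrary sizes).
frames = np.random.default_rng(0).standard_normal((2, 50, EMBED_DIM))
clip_embed = pool_frames(frames)
print(clip_embed.shape, clip_embed.dtype)
```

Passing an already-pooled 2-D array raises a `ValueError`, which catches the common mistake of pooling twice or skipping the time axis.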

Inference snippet

import librosa
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModel, AutoProcessor

# 1. Load the Whisper encoder and this regressor
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3-turbo")
whisper = AutoModel.from_pretrained("openai/whisper-large-v3-turbo").get_encoder().eval()
sess = ort.InferenceSession("ridge.onnx", providers=["CPUExecutionProvider"])

# 2. Extract hidden states from layer -2 (penultimate) and mean-pool over time
audio, _ = librosa.load("clip.wav", sr=16000)  # resampled to 16 kHz mono
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = whisper(**inputs, output_hidden_states=True)
embed = out.hidden_states[-2].mean(dim=1).numpy().astype(np.float32)  # (1, 1280)

# 3. Predict V/A/D
vad = sess.run(["vad"], {"whisper_embed": embed})[0]  # (1, 3)
print(f"valence={vad[0,0]:+.2f}  arousal={vad[0,1]:+.2f}  dominance={vad[0,2]:+.2f}")
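
The ridge variant is described as a linear projection plus bias, so unless the exported graph clamps its output (this card does not say), predictions may land slightly outside the nominal [-1, +1] range. The sketch below shows one possible post-processing step; `clip_vad`, `as_records`, and the sample values are hypothetical and not part of this repo.

```python
import numpy as np

AXES = ("valence", "arousal", "dominance")


def clip_vad(vad: np.ndarray) -> np.ndarray:
    """Clamp raw (batch, 3) predictions to the nominal [-1, +1] range."""
    if vad.ndim != 2 or vad.shape[1] != len(AXES):
        raise ValueError(f"expected (batch, 3), got {vad.shape}")
    return np.clip(vad, -1.0, 1.0)


def as_records(vad: np.ndarray) -> list[dict[str, float]]:
    """Turn each row into an {axis: value} dict for logging or JSON output."""
    return [dict(zip(AXES, map(float, row))) for row in vad]


# Stand-in for `sess.run(...)[0]`; the values are made up for illustration.
raw = np.array([[0.31, 1.07, -0.42], [-1.20, 0.05, 0.88]], dtype=np.float32)
records = as_records(clip_vad(raw))
print(records)
```

Whether to clamp depends on the consumer: an auditor that compares against targets in [-1, +1] may prefer clamped values, while one that tracks drift may want the raw output.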

Training

  • Architecture (ridge): W ∈ R^(1280 × 3) + bias. Closed-form ridge regression with α=1.0, no regularization on bias. Total params: 3,843.
  • Architecture (mlp): Linear(1280, 256) → GELU → Linear(256, 3), AdamW + cosine schedule, 100 epochs. Total params: 328,963.
  • Audio frontend: Whisper-large-v3-turbo encoder, layer −2 (penultimate, pre-layernorm) hidden states, mean-pooled over time.
  • Dataset: CREMA-D, speaker-disjoint split (train=1391, val=305, test=304 clips).
  • Targets: continuous V/A/D coordinates derived from CREMA-D categorical labels via a fixed hand-authored anchor table (Russell circumplex + Mehrabian PAD + NRC-VAD style; see emotion_to_vad.py).
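
The ridge bullet can be sketched in a few lines of NumPy. This is an illustration of closed-form ridge regression with an unpenalized bias (handled by centering the data before solving), not the repo's training script; `fit_ridge` and the synthetic data are assumptions, and a small feature dimension stands in for 1280.

```python
import numpy as np


def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge with an unpenalized bias.

    X: (n, d) embeddings, Y: (n, k) targets. Returns W (d, k) and b (k,).
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean  # centering keeps the bias out of the penalty
    d = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc instead of forming an explicit inverse.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


# Synthetic stand-in data (NOT CREMA-D): small d so the demo runs instantly.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 3
X = rng.standard_normal((n, d))
W_true = rng.standard_normal((d, k)) * 0.1
b_true = np.array([0.2, -0.1, 0.05])
Y = X @ W_true + b_true + 0.01 * rng.standard_normal((n, k))

W, b = fit_ridge(X, Y, alpha=1.0)
pred = X @ W + b
print("max abs error:", float(np.abs(pred - Y).max()))
```

With the real embeddings, `W` would have shape (1280, 3) and `b` shape (3,), matching the stated 3,843 parameters.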

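The stated parameter totals can be recomputed from the layer shapes listed above. The helper `linear_params` is illustrative; it counts weights plus bias for a dense layer.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer: weights plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


ridge_params = linear_params(1280, 3)
# GELU has no parameters, so only the two Linear layers contribute.
mlp_params = linear_params(1280, 256) + linear_params(256, 3)

print(ridge_params)  # 3843
print(mlp_params)    # 328707
```

The ridge figure matches the stated total. The two listed MLP layers sum to 328,707, which is 256 below the stated 328,963; this card does not say where the difference comes from, so the layer list above may be incomplete.
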
Detailed metrics

Ridge (recommended)

{
  "dataset": "crema_d",
  "whisper_model": "openai/whisper-large-v3-turbo",
  "whisper_layer": -2,
  "regressor": "ridge",
  "n_train": 1391,
  "n_val": 305,
  "n_test": 304,
  "speaker_disjoint": true,
  "axes": [
    "valence",
    "arousal",
    "dominance"
  ],
  "val": {
    "n": 305,
    "rmse": {
      "valence": 0.41578346490859985,
      "arousal": 0.27047833800315857,
      "dominance": 0.2509686052799225
    },
    "mae": {
      "valence": 0.3144280016422272,
      "arousal": 0.21582330763339996,
      "dominance": 0.20134007930755615
    },
    "pearson_r": {
      "valence": 0.6142750533217433,
      "arousal": 0.6183148626337297,
      "dominance": 0.7671747949052007
    }
  },
  "test": {
    "n": 304,
    "rmse": {
      "valence": 0.4229687750339508,
      "arousal": 0.2829131484031677,
      "dominance": 0.2870493531227112
    },
    "mae": {
      "valence": 0.3345738351345062,
      "arousal": 0.22188791632652283,
      "dominance": 0.2326992005109787
    },
    "pearson_r": {
      "valence": 0.5816889655848607,
      "arousal": 0.6575878679123864,
      "dominance": 0.6881453235310694
    }
  },
  "dry_run": false
}

MLP (reference)

{
  "dataset": "crema_d",
  "whisper_model": "openai/whisper-large-v3-turbo",
  "whisper_layer": -2,
  "regressor": "mlp",
  "n_train": 1391,
  "n_val": 305,
  "n_test": 304,
  "speaker_disjoint": true,
  "axes": [
    "valence",
    "arousal",
    "dominance"
  ],
  "val": {
    "n": 305,
    "rmse": {
      "valence": 0.4990813732147217,
      "arousal": 0.2946903705596924,
      "dominance": 0.299040824174881
    },
    "mae": {
      "valence": 0.42277732491493225,
      "arousal": 0.24514958262443542,
      "dominance": 0.25509873032569885
    },
    "pearson_r": {
      "valence": 0.339805625682777,
      "arousal": 0.518995070692627,
      "dominance": 0.672127424657303
    }
  },
  "test": {
    "n": 304,
    "rmse": {
      "valence": 0.5018822550773621,
      "arousal": 0.3011777400970459,
      "dominance": 0.32821124792099
    },
    "mae": {
      "valence": 0.4377283751964569,
      "arousal": 0.24337027966976166,
      "dominance": 0.28178495168685913
    },
    "pearson_r": {
      "valence": 0.2784075767971956,
      "arousal": 0.6199177507366342,
      "dominance": 0.5484364257265157
    }
  },
  "dry_run": false
}
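
For readers who want to reproduce numbers in this layout on their own predictions, here is a minimal sketch of the three per-axis metrics. It assumes the standard definitions of RMSE, MAE, and Pearson r; the function name `vad_metrics` and the toy arrays are hypothetical, and the repo's evaluation code may differ in detail.

```python
import numpy as np

AXES = ("valence", "arousal", "dominance")


def vad_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Per-axis RMSE, MAE and Pearson r for (n, 3) target and prediction arrays."""
    err = y_pred - y_true
    out = {"n": int(len(y_true)), "rmse": {}, "mae": {}, "pearson_r": {}}
    for i, axis in enumerate(AXES):
        out["rmse"][axis] = float(np.sqrt(np.mean(err[:, i] ** 2)))
        out["mae"][axis] = float(np.mean(np.abs(err[:, i])))
        # np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r.
        out["pearson_r"][axis] = float(np.corrcoef(y_true[:, i], y_pred[:, i])[0, 1])
    return out


# Tiny made-up example, not real model output.
y_true = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, -0.5], [-1.0, -0.5, 0.5], [0.5, 1.0, 0.0]])
y_pred = y_true + 0.1  # constant offset: error is 0.1 everywhere, correlation stays 1
metrics = vad_metrics(y_true, y_pred)
print(metrics)
```

The toy case also shows why the card reports both error and correlation: a constant offset leaves Pearson r at 1 while RMSE and MAE are nonzero.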

License

MIT (this model). Upstream licenses:

  • CREMA-D training data: Open Database License — commercial use allowed with attribution.
  • Whisper-large-v3-turbo audio frontend: Apache 2.0.

If you publish work using this regressor, cite CREMA-D + the Whisper paper:

@article{cao2014cremad,
  title={CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset},
  author={Cao, Houwei and Cooper, David G and Keutmann, Michael K and Gur, Ruben C and Nenkova, Ani and Verma, Ragini},
  journal={IEEE Transactions on Affective Computing},
  year={2014}
}
@inproceedings{radford2023whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle={ICML},
  year={2023}
}