VoiceCLAP-Large-v2 (PCL)

A rank-16 LoRA fine-tune of LCO-Embedding-Omni-7B (Qwen2.5-Omni thinker), trained with InfoNCE plus a Prototypical Contrastive loss (PCL) on the VoiceCLAP 9-corpus mix with MOSS-Audio k=2 sampled captions. Successor to laion/voiceclap-large; better on every benchmark we measure.

What's new vs voiceclap-large

  1. PCL auxiliary loss (weight 0.1): 39 learned emotion prototypes; cross-entropy of audio embeddings vs prototypes on pseudo-labeled clips.
  2. z-scored pseudo-labels: each clip's label is the argmax over per-emotion z-scores of emolia's emotion_annotation scalars, computed against corpus base rates. Raw argmax is degenerate: high-base-rate dimensions win ~99% of clips. A threshold of z ≥ 1.5 labels ~80-98% of emolia across all 39 emotion classes.
  3. LoRA rank 16 (α=32): shown equivalent to r=32 in a controlled A/B.
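
The first two items can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code: the function names, the use of the per-emotion standard deviation as the z-score normaliser, the L2 normalisation of embeddings and prototypes, and the toy data are all assumptions. Only the threshold (1.5), the temperature (0.1), and the cross-entropy-vs-prototypes form come from this card.

```python
import numpy as np

def zscore_pseudo_labels(scores, threshold=1.5):
    """Label each clip with its highest-z emotion, or -1 if no z-score reaches the threshold.

    scores: (n_clips, n_emotions) raw annotation scalars.
    """
    mean = scores.mean(axis=0, keepdims=True)        # per-emotion corpus base rate
    std = scores.std(axis=0, keepdims=True) + 1e-8   # assumed normaliser
    z = (scores - mean) / std
    labels = z.argmax(axis=1)
    labels[z.max(axis=1) < threshold] = -1           # leave weak clips unlabeled
    return labels

def pcl_loss(audio_emb, prototypes, labels, temperature=0.1):
    """Cross-entropy of audio embeddings against prototypes, over labeled clips only."""
    keep = labels >= 0
    if not keep.any():
        return 0.0
    # Assumption: cosine similarity, i.e. both sides are L2-normalised.
    a = audio_emb[keep] / np.linalg.norm(audio_emb[keep], axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(int(keep.sum())), labels[keep]].mean())

# Toy data: emotion 0 has a much higher base rate than emotions 1 and 2.
rng = np.random.default_rng(0)
scores = rng.normal(loc=[5.0, 1.0, 1.0], scale=1.0, size=(1000, 3))
raw_labels = scores.argmax(axis=1)       # dominated by emotion 0
z_labels = zscore_pseudo_labels(scores)  # every emotion gets some clips

# Toy prototypes: loss is near zero when embeddings match their labeled prototype.
prototypes = np.eye(3)
loss_matched = pcl_loss(np.eye(3), prototypes, np.array([0, 1, 2]))
loss_mismatched = pcl_loss(np.eye(3), prototypes, np.array([1, 2, 0]))
```

In training, a term like `pcl_loss` would be added to the InfoNCE loss with weight 0.1, and the prototypes would be learned parameters rather than fixed vectors.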

Evaluation (VoiceNet benchmark, human-annotated)

Same-commit comparison on the VoiceNet harness:

| Model | Emo bal@pp | Emo ρ | Ext bal@pp | Ext ρ |
|---|---|---|---|---|
| voiceclap-small | 0.6754 | 0.3176 | 0.6576 | 0.1116 |
| voiceclap-large (anchor re-run) | 0.6991 | 0.3598 | 0.6739 | 0.1897 |
| this model (PCL ep1) | 0.7069 | 0.3865 | 0.6816 | 0.2125 |

Controlled A/B vs its exact no-PCL twin (identical data/recipe, ep1):

| Loss | emolia per-emo | emonet top-1 | emonet ρ |
|---|---|---|---|
| InfoNCE only | 0.6984 | 0.1411 | 0.3651 |
| + PCL w=0.1 | 0.7053 | 0.1544 | 0.3993 |

Ensemble notes: averaging this model's similarities with gijs/voiceclap-lco-7b-lora and the k=10 MOSS variant sets the current VoiceNet records (Emo bal@pp 0.7102; Ext bal@pp 0.6883).
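
A minimal sketch of similarity averaging, assuming an unweighted mean over per-model audio-by-text similarity matrices. The function name and the toy matrices are hypothetical, and whether the reported ensemble rescales each model's similarities before averaging is not stated here.

```python
import numpy as np

def ensemble_similarity(sim_matrices):
    """Average a list of (n_audio, n_text) similarity matrices, one per model."""
    stacked = np.stack(sim_matrices, axis=0)  # (n_models, n_audio, n_text)
    return stacked.mean(axis=0)

# Toy similarities from three hypothetical models for 2 clips x 2 captions.
sims_a = np.array([[0.8, 0.2], [0.3, 0.7]])
sims_b = np.array([[0.6, 0.4], [0.6, 0.4]])  # alone, this model picks caption 0 for clip 1
sims_c = np.array([[0.7, 0.3], [0.3, 0.9]])

avg = ensemble_similarity([sims_a, sims_b, sims_c])
predictions = avg.argmax(axis=1)  # best caption index per clip after averaging
```

All models must score the same clips against the same captions, in the same order, for the element-wise mean to be meaningful.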

Training recipe

| Setting | Value |
|---|---|
| Data | 9 corpora (emolia-balanced, Got Talent, Majestrino, bursts, EARS, Expresso, VoxCeleb1/2) |
| Captions | original / k=2-sampled MOSS-Audio sentences, 50/50 |
| Samples seen | 76,000 (1 epoch; best checkpoint) |
| LoRA | r=16, α=32, dropout 0.05, all-linear |
| PCL | weight 0.1, 39 prototypes, temp 0.1, proto-lr 1e-3, z≥1.5 pseudo-labels |
| lr / wd | 1e-4 / 0.01, warmup 200, cosine |
| Batch | 4 × accum 8 × 4 GH200 = effective 128 |
| Precision | bf16 |
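
The batch and schedule numbers above imply a short run. The sketch below works through the arithmetic and a linear-warmup, cosine-decay schedule. It assumes that "warmup 200" means 200 optimizer steps, that the cosine decays to zero, and that the schedule spans exactly the 76,000 samples seen; none of these is confirmed here, and `lr_at` is a hypothetical helper.

```python
import math

SAMPLES_SEEN = 76_000
PER_DEVICE_BATCH, GRAD_ACCUM, NUM_GPUS = 4, 8, 4
PEAK_LR, WARMUP_STEPS = 1e-4, 200

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS  # 4 * 8 * 4 = 128
total_steps = math.ceil(SAMPLES_SEEN / effective_batch)     # 76,000 / 128 = 593.75, rounded up

def lr_at(step, total=total_steps, warmup=WARMUP_STEPS, peak=PEAK_LR):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points.
schedule = {s: lr_at(s) for s in (0, 100, 200, 397, total_steps)}
```

Under these assumptions, warmup takes up roughly a third of the run (200 of about 594 optimizer steps).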

Quick start

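```python
import torch
from sentence_transformers import SentenceTransformer

# Load the model in bf16; the repo ships custom code, hence trust_remote_code.
model = SentenceTransformer(
    "laion/voiceclap-large-v2",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

# Audio files and text captions are embedded into the same space.
audio_emb = model.encode("clip.flac")
text_emb  = model.encode("A person speaking with quiet pride in their voice")

# Dot product of the two embeddings gives the audio-text similarity score.
score     = (audio_emb @ text_emb.T).item()
```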
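
To compare one clip against several candidate captions, the same similarity can be computed for each caption and sorted. A minimal sketch, assuming the embeddings are available as NumPy arrays; `rank_captions` is a hypothetical helper, not part of the model's API, and it re-normalises the vectors in case they are not already unit length. The toy vectors stand in for real embeddings.

```python
import numpy as np

def rank_captions(audio_emb, text_embs, captions):
    """Return (caption, cosine similarity) pairs, best match first."""
    a = np.asarray(audio_emb, dtype=np.float32).reshape(-1)
    t = np.asarray(text_embs, dtype=np.float32)
    a = a / np.linalg.norm(a)                         # unit-length audio vector
    t = t / np.linalg.norm(t, axis=1, keepdims=True)  # unit-length caption vectors
    sims = t @ a                                      # one score per caption
    order = np.argsort(-sims)                         # descending similarity
    return [(captions[i], float(sims[i])) for i in order]

# Toy 2-D vectors standing in for real embeddings.
toy_captions = ["calm", "excited", "mixed"]
toy_audio = np.array([1.0, 0.0])
toy_text = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
ranked = rank_captions(toy_audio, toy_text, toy_captions)
```

With the real model, `text_embs` would come from `model.encode(captions)` on a list of strings and `audio_emb` from `model.encode("clip.flac")`, as in the snippet above.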

License

Apache-2.0 (inherits from the base model).
