VoiceCLAP-Small-v2

Voice-text contrastive (CLAP-style) embedding model: the successor to laion/voiceclap-small, trained with emotion-led MOSS-Audio short captions. Better than v1 on every benchmark we measure, at identical size and inference cost.

Same dual-tower architecture as v1: a BUD-E-Whisper_V1.1 audio encoder paired with sentence-transformers/all-MiniLM-L6-v2 on the text side, joined by an MLP projection on each side and trained with the SigLIP sigmoid contrastive loss.
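
For readers unfamiliar with the sigmoid contrastive loss mentioned above, here is a minimal sketch of it in plain Python. It is not the training code for this model. The function names are hypothetical, the embeddings are assumed to be already L2-normalised lists of floats, and the temperature and bias defaults are illustrative values, not this model's learned parameters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_loss(audio_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sigmoid contrastive loss over a batch of paired embeddings.

    Every (audio i, text j) pair is scored as an independent binary
    problem: label +1 when i == j (a true pair), -1 otherwise.
    """
    n = len(audio_embs)
    total = 0.0
    for i, a in enumerate(audio_embs):
        for j, t in enumerate(text_embs):
            logit = temperature * sum(x * y for x, y in zip(a, t)) + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    # Sum over all pairs, averaged over the batch size.
    return -total / n


# Toy batch: correctly aligned pairs vs. the same texts in swapped order.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = siglip_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped = siglip_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Unlike a softmax contrastive loss, each pair contributes its own term, so no normalisation across the batch is needed.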

| Property | Value |
|---|---|
| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Training | 40 M samples (20 epochs × 2 M), best checkpoint epoch 19 |
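
To make "mean-pooled" and "L2-normalised" concrete, the two operations can be sketched in plain Python as below. The helper names are hypothetical and the vectors are tiny stand-ins. In the model itself an MLP projection sits between pooling and normalisation, which this sketch leaves out.

```python
import math


def mean_pool(token_embs, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, m in zip(token_embs, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalise(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Three token vectors; the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
pooled = mean_pool(tokens, [1, 1, 0])
unit = l2_normalise(pooled)
print(pooled, unit)
```

With unit-length embeddings on both sides, a plain dot product between an audio and a text embedding equals their cosine similarity.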

What's new vs v1

v1 built each clip's caption from k=2 MOSS-Audio attribute sentences chosen uniformly at random. v2 replaces this with an emotion-led short caption: the MOSS-Audio-8B-Thinking EMO sentence (a direct natural-language description of the emotional state) plus one randomly sampled talking-style sentence, re-drawn every epoch. As in v1, these captions are blended 50/50 with each corpus's original captions. The emotion-first structure concentrates contrastive signal on the emotion subspace without sacrificing style coverage.
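
The v2 caption recipe can be sketched roughly as follows. This is an illustration under stated assumptions, not the training pipeline: the function and argument names are hypothetical, the 50/50 blend is assumed to be a per-sample coin flip, and the EMO and style sentences are assumed to be joined with a single space.

```python
import random


def build_caption(emo_sentence, style_sentences, original_caption, rng,
                  p_original=0.5):
    """Draw one training caption for a clip.

    With probability p_original the corpus's own caption is used
    (assumed reading of the 50/50 blend). Otherwise the EMO sentence
    leads, followed by one randomly chosen talking-style sentence.
    """
    if rng.random() < p_original:
        return original_caption
    style = rng.choice(style_sentences)
    return f"{emo_sentence} {style}"


rng = random.Random(0)
emo = "The speaker sounds quietly proud."
styles = ["They speak slowly and evenly.", "Their voice is soft and low."]
original = "a woman talking about her work"

# Calling once per epoch re-draws the caption for the same clip.
for epoch in range(3):
    print(epoch, build_caption(emo, styles, original, rng))
```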

Evaluation

| Benchmark | v1 (released) | v2 (this model) | Δ |
|---|---|---|---|
| EmoNet-Voice top-1 | 0.0902 | 0.1015 | +13% rel |
| EmoNet-Voice Spearman ρ | 0.2280 | 0.2561 | +12% rel |
| MAEB-voice mean (8 tasks) | 0.3861 | 0.3893 | +0.8% |

The ρ gain also clears every arm of the v1 caption-sampling sweep (best: 0.2399 at k=2). MAEB-voice shows no general-speech regression.
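
The Δ column above reports relative change, (v2 - v1) / v1. A short check of that arithmetic from the published scores (the helper name is hypothetical):

```python
def relative_gain(old: float, new: float) -> float:
    """Relative change of new with respect to old."""
    return (new - old) / old


# (v1, v2) scores copied from the evaluation table.
results = {
    "EmoNet-Voice top-1": (0.0902, 0.1015),
    "EmoNet-Voice Spearman rho": (0.2280, 0.2561),
    "MAEB-voice mean (8 tasks)": (0.3861, 0.3893),
}

for name, (v1, v2) in results.items():
    print(f"{name}: {relative_gain(v1, v2):+.1%}")
```

The two EmoNet-Voice gains come out at about 12.5% and 12.3%, which round to the reported +13% and +12%.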

Training data

Trained on the open 9-corpus mixture used in the VoiceNet paper:

  • emolia-balanced-5M-subset (annotated subset of Emilia)
  • laions_got_talent_clean_with_captions
  • majestrino-data
  • synthetic_vocal_bursts + improved_synthetic_vocal_bursts
  • ears, expresso, voxceleb1, voxceleb2 (FCaps captions)

MOSS-Audio-8B-Thinking annotations (18 prompt groups, 61 attribute values per clip) provide the EMO + style sentences for the three large corpora.

Usage

import soundfile as sf
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True runs the model code shipped in the repository
model = AutoModel.from_pretrained("laion/voiceclap-small-v2", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("laion/voiceclap-small-v2")

# audio: raw mono waveform at 16 kHz
wav, sr = sf.read("clip.wav", dtype="float32")
assert sr == 16000, "resample the clip to 16 kHz first"

# text
t = tok(["a person speaking with quiet pride in their voice"], padding=True, return_tensors="pt")

with torch.no_grad():  # inference only, no gradients needed
    audio_emb = model.encode_waveform(torch.from_numpy(wav))
    text_emb = model.encode_text(t["input_ids"], attention_mask=t["attention_mask"])

# similarity between the clip and the caption
score = (audio_emb @ text_emb.T).item()

Conversion from the training checkpoint was verified functionally against the original open_clip implementation (cosine ≥ 0.9999 on both towers).
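
The single audio-text score in the usage snippet extends naturally to ranking several candidate descriptions for one clip. The sketch below uses small stand-in vectors in place of real model outputs and a hypothetical helper name. It assumes the embeddings are L2-normalised, as the model spec states, so the dot product is the cosine similarity. It is not necessarily the protocol behind the benchmark numbers above.

```python
def rank_labels(audio_emb, label_embs, labels):
    """Sort candidate labels by dot-product similarity to one audio embedding."""
    scores = [sum(a * t for a, t in zip(audio_emb, emb)) for emb in label_embs]
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    return [(labels[i], scores[i]) for i in order]


# Stand-in unit vectors; in practice these would come from
# model.encode_waveform and model.encode_text.
audio_emb = [1.0, 0.0]
labels = ["sad", "proud", "angry"]
label_embs = [[0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]

for label, score in rank_labels(audio_emb, label_embs, labels):
    print(f"{label}: {score:.2f}")
```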

License

cc-by-nc-4.0
