VoiceCLAP-Large-v2 (PCL)
A rank-16 LoRA fine-tune of LCO-Embedding-Omni-7B (the Qwen2.5-Omni thinker),
trained with InfoNCE plus a Prototypical Contrastive loss (PCL) on the
VoiceCLAP 9-corpus mix with MOSS-Audio k=2 sampled captions. Successor to
laion/voiceclap-large; it scores higher on every benchmark we measure.
What's new vs voiceclap-large
- PCL auxiliary loss (weight 0.1): 39 learned emotion prototypes; cross-entropy of audio embeddings vs prototypes on pseudo-labeled clips.
- z-scored pseudo-labels: emolia's `emotion_annotation` scalars, argmax over per-emotion z-scores vs corpus base rates (raw argmax is degenerate: high-base-rate dimensions win ~99% of clips). z ≥ 1.5 labels ~80-98% of emolia across all 39 emotion classes.
- LoRA rank 16 (α=32): r=16 shown equivalent to r=32 in a controlled A/B.
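The z-scored pseudo-labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the training code: it assumes "z-score vs corpus base rates" means standardizing each emotion dimension by its corpus-wide mean and standard deviation, and the function name `zscore_pseudo_labels` and the toy data are hypothetical.

```python
import numpy as np


def zscore_pseudo_labels(scores: np.ndarray, threshold: float = 1.5):
    """Assign one pseudo-label per clip from per-emotion annotation scalars.

    scores: array of shape (n_clips, n_emotions) with raw annotation scalars.
    Returns (labels, keep): labels[i] is the argmax over per-emotion z-scores,
    keep[i] is True when the winning z-score reaches the threshold.
    Assumption: z-scores use the corpus mean and std of each emotion column.
    """
    mean = scores.mean(axis=0, keepdims=True)
    std = scores.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    z = (scores - mean) / std
    labels = z.argmax(axis=1)
    keep = z.max(axis=1) >= threshold
    return labels, keep


# Toy corpus: emotion 0 has a much higher base rate than emotions 1 and 2.
rng = np.random.default_rng(0)
toy_scores = rng.normal(loc=[5.0, 1.0, 1.0], scale=0.5, size=(1000, 3))

raw_labels = toy_scores.argmax(axis=1)            # dominated by emotion 0
z_labels, z_keep = zscore_pseudo_labels(toy_scores)

print("raw argmax label counts:", np.bincount(raw_labels, minlength=3))
print("z-score label counts:   ", np.bincount(z_labels, minlength=3))
print("fraction kept at z >= 1.5:", z_keep.mean())
```

On the toy data, the raw argmax collapses onto the high-base-rate dimension, while the z-scored argmax spreads labels across all three dimensions, which is the failure mode the bullet describes.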
Evaluation (VoiceNet benchmark, human-annotated)
Same-commit comparison on the VoiceNet harness:
| Model | Emo bal@pp | Emo ρ | Ext bal@pp | Ext ρ |
|---|---|---|---|---|
| voiceclap-small | 0.6754 | 0.3176 | 0.6576 | 0.1116 |
| voiceclap-large (anchor re-run) | 0.6991 | 0.3598 | 0.6739 | 0.1897 |
| this model (PCL ep1) | 0.7069 | 0.3865 | 0.6816 | 0.2125 |
Controlled A/B vs its exact no-PCL twin (identical data/recipe, ep1):
| Loss | emolia per-emo | emonet top-1 | emonet ρ |
|---|---|---|---|
| InfoNCE only | 0.6984 | 0.1411 | 0.3651 |
| + PCL w=0.1 | 0.7053 | 0.1544 | 0.3993 |
Ensemble notes: averaging this model's similarities with
gijs/voiceclap-lco-7b-lora and the k=10 MOSS variant sets the current
VoiceNet records (Emo bal@pp 0.7102; Ext bal@pp 0.6883).
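A similarity-averaging ensemble of this kind might look like the sketch below. It assumes each model produces an audio-by-text similarity matrix over the same clips and captions and that the average is unweighted; the function name `ensemble_similarities` and the toy matrices are hypothetical and not taken from the VoiceNet harness.

```python
import numpy as np


def ensemble_similarities(sim_matrices):
    """Average several (n_audio, n_text) similarity matrices element-wise.

    Assumption: every matrix covers the same clips and captions in the same
    order, and all models get equal weight.
    """
    stacked = np.stack([np.asarray(m, dtype=np.float64) for m in sim_matrices])
    return stacked.mean(axis=0)


# Toy similarities from three hypothetical models (2 clips x 2 captions).
sims_a = [[0.9, 0.1], [0.2, 0.8]]
sims_b = [[0.4, 0.6], [0.3, 0.7]]
sims_c = [[0.8, 0.2], [0.6, 0.4]]

avg = ensemble_similarities([sims_a, sims_b, sims_c])
predictions = avg.argmax(axis=1)  # best caption index per clip
print(avg)
print(predictions)
```

Averaging at the similarity level keeps the models independent: each one can use its own embedding dimension, since only the final scores are combined.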
Training recipe
| Setting | Value |
|---|---|
| Data | 9 corpora (emolia-balanced, Got Talent, Majestrino, bursts, EARS, Expresso, VoxCeleb1/2) |
| Captions | original / k=2-sampled MOSS-Audio sentences, 50/50 |
| Samples seen | 76,000 (1 epoch; best checkpoint) |
| LoRA | r=16, α=32, dropout 0.05, all-linear |
| PCL | weight 0.1, 39 prototypes, temp 0.1, proto-lr 1e-3, z≥1.5 pseudo-labels |
| lr / wd | 1e-4 / 0.01, warmup 200, cosine |
| Batch | 4 × accum 8 × 4 GH200 = effective 128 |
| Precision | bf16 |
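To make the PCL row concrete, here is a rough NumPy sketch of a prototype cross-entropy term and how it could be combined with the InfoNCE loss at weight 0.1. It is an illustration under assumptions, not the training implementation: cosine-similarity logits divided by the temperature, a mask that skips clips without a pseudo-label, and the names `pcl_loss` and `total_loss` are all hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def pcl_loss(audio_emb, prototypes, labels, mask, temp=0.1):
    """Cross-entropy of audio embeddings against emotion prototypes.

    audio_emb:  (batch, dim) audio embeddings
    prototypes: (n_protos, dim) learned prototype vectors (39 in the recipe)
    labels:     (batch,) pseudo-label index per clip
    mask:       (batch,) True where the clip has a pseudo-label
    Assumption: logits are cosine similarities divided by the temperature.
    """
    logits = l2_normalize(audio_emb) @ l2_normalize(prototypes).T / temp
    per_clip = -log_softmax(logits)[np.arange(len(labels)), labels]
    if not mask.any():
        return 0.0  # no pseudo-labeled clips in this batch
    return float(per_clip[mask].mean())


def total_loss(info_nce_loss: float, pcl: float, weight: float = 0.1) -> float:
    """Combine the contrastive loss with the weighted PCL auxiliary term."""
    return info_nce_loss + weight * pcl


# Toy check: embeddings that coincide with their prototype give a small loss.
protos = np.eye(3, 4)
audio = np.eye(3, 4)
labels = np.array([0, 1, 2])
mask = np.array([True, True, True])
print(pcl_loss(audio, protos, labels, mask))
print(total_loss(1.0, pcl_loss(audio, protos, labels, mask)))
```

In an actual training loop the prototypes would be trainable parameters with their own learning rate (the table lists proto-lr 1e-3), which this forward-only sketch does not model.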
Quick start
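import torch
from sentence_transformers import SentenceTransformer

# Load the model in bfloat16; trust_remote_code allows the repository's
# custom model code to run.
model = SentenceTransformer(
    "laion/voiceclap-large-v2",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

# Embed an audio file and a text description into the same space.
audio_emb = model.encode("clip.flac")
text_emb = model.encode("A person speaking with quiet pride in their voice")

# Dot product of the audio and text embeddings.
score = (audio_emb @ text_emb.T).item()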
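Building on the Quick start, one common use is ranking several candidate descriptions for a single clip. The helper below is a sketch that works on the arrays returned by `model.encode`; the name `rank_captions` is hypothetical, and it uses cosine similarity (explicit L2 normalization) rather than the raw dot product shown above, which is a design choice of this example rather than something the model card prescribes.

```python
import numpy as np


def rank_captions(audio_emb, text_embs, captions):
    """Rank captions by cosine similarity to one audio embedding.

    audio_emb: (dim,) embedding of a single clip
    text_embs: (n_captions, dim) embeddings of the candidate captions
    captions:  list of the n_captions strings, in the same order
    Returns a list of (caption, score) pairs, best match first.
    """
    a = np.asarray(audio_emb, dtype=np.float32)
    t = np.asarray(text_embs, dtype=np.float32)
    a = a / np.linalg.norm(a)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    scores = t @ a
    order = np.argsort(-scores)  # descending similarity
    return [(captions[i], float(scores[i])) for i in order]


# With the model from the Quick start, the inputs would come from:
#   audio_emb = model.encode("clip.flac")
#   text_embs = model.encode(captions)
# Toy vectors stand in for real embeddings here.
captions = ["calm", "proud", "anxious"]
toy_audio = np.array([1.0, 0.0])
toy_texts = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])

for caption, score in rank_captions(toy_audio, toy_texts, captions):
    print(f"{caption}: {score:.3f}")
```

Normalizing first makes scores comparable across captions of different embedding norms; if the model already returns unit-length embeddings, the result matches the plain dot product.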
License
Apache-2.0 (inherits from the base model).
Model tree for laion/voiceclap-large-v2
Base model
LCO-Embedding/LCO-Embedding-Omni-7B