# VoiceCLAP-Small
Voice-text contrastive (CLAP-style) embedding model trained on dense vocal-style captions for the VoiceNet suite.
VoiceCLAP-Small is the smaller of the two voice-text contrastive anchors
released with VoiceNet. It is a dual-tower model: a BUD-E-Whisper_V1.1 audio
encoder paired with a sentence-transformers/all-MiniLM-L6-v2 text encoder,
each tower followed by an MLP projection head into a shared 768-d embedding
space, trained with the SigLIP sigmoid contrastive loss.
| Field | Value |
|---|---|
| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Epochs | 1 |
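The SigLIP objective scores every audio-caption pair in a batch as an independent binary classification problem, which avoids the batch-wide softmax normalisation of the InfoNCE loss used in the original CLAP. The sketch below shows the loss in its standard form; the fixed temperature and bias values are illustrative assumptions, not values from the VoiceNet training run.

```python
import torch
import torch.nn.functional as F

def siglip_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid contrastive loss (SigLIP-style).

    audio_emb, text_emb: (B, 768) L2-normalised embeddings of matched clip/caption pairs.
    t, b: temperature and bias (learnable in SigLIP; fixed constants here for illustration).
    """
    logits = audio_emb @ text_emb.T * t + b  # (B, B) pairwise similarity logits
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0  # +1 diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)  # average loss per example
```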
## Training data
Trained for 1 epoch on the open mixture (9 datasets) used in the VoiceNet paper:
- emolia-balanced-5M-subset (annotated subset of Emilia)
- laions_got_talent_clean_with_captions
- majestrino-data
- synthetic_vocal_bursts
- improved_synthetic_vocal_bursts
- ears
- expresso
- voxceleb1
- voxceleb2
All clips are captioned with MOSS-Audio-8B-Thinking-derived dense vocal-style
captions covering emotions, talking-style attributes, and demographics.
## Standalone load example
Only transformers and torchaudio are required (both on PyPI).
```python
import torch, torchaudio
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("VoiceNet/voiceclap-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("VoiceNet/voiceclap-small")

# Audio: any-length 16 kHz waveform, mono
wav, sr = torchaudio.load("clip.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)  # collapse channels -> (T,)
audio_emb = model.encode_waveform(wav)  # (1, 768), L2-normed

# Text: short caption(s)
enc = tok(["a calm and steady voice"], padding=True, return_tensors="pt")
text_emb = model.encode_text(enc.input_ids, enc.attention_mask)

# Cosine similarity (embeddings already L2-normalised)
print((audio_emb @ text_emb.T).item())
```
`encode_waveform` accepts clips up to 30 s; longer clips should be chunked or
truncated before being passed in. Embeddings are 768-d and unit-norm, so
`a @ t.T` is the cosine similarity used in zero-shot retrieval.
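Continuing from the load example above, the sketch below handles clips longer than 30 s by splitting them into non-overlapping 30 s windows, mean-pooling the per-window embeddings, and re-normalising, then ranks a few style captions against the clip. The `embed_long_clip` helper and the chunk-and-mean-pool strategy are illustrative assumptions, not a documented preprocessing recipe.

```python
import torch
import torch.nn.functional as F

CHUNK_SAMPLES = 30 * 16000  # 30 s at 16 kHz, the longest input encode_waveform accepts

def embed_long_clip(model, wav: torch.Tensor) -> torch.Tensor:
    """Embed a mono 16 kHz waveform of arbitrary length as a single 768-d vector."""
    chunks = torch.split(wav, CHUNK_SAMPLES)                      # non-overlapping 30 s windows
    embs = torch.cat([model.encode_waveform(c) for c in chunks])  # (n_chunks, 768)
    pooled = embs.mean(dim=0, keepdim=True)                       # (1, 768)
    return F.normalize(pooled, dim=-1)                            # back to unit norm

# Zero-shot style retrieval: rank candidate captions against one clip
captions = ["a calm and steady voice", "an excited, fast-paced voice", "a whispering voice"]
enc = tok(captions, padding=True, return_tensors="pt")
text_embs = model.encode_text(enc.input_ids, enc.attention_mask)  # (3, 768)
scores = (embed_long_clip(model, wav) @ text_embs.T).squeeze(0)   # cosine similarities
print(captions[scores.argmax().item()])
```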
## Citation
If you use this model, please cite the VoiceNet paper.