---
license: cc-by-4.0
language:
  - en
library_name: transformers
pipeline_tag: feature-extraction
tags:
  - audio
  - speech
  - emotion
  - clap
  - contrastive
  - voice
---

# VoiceCLAP-Small

Voice-text contrastive (CLAP-style) embedding model trained on dense vocal-style captions for the VoiceNet suite.

VoiceCLAP-Small is the smaller of the two voice-text contrastive anchors released with VoiceNet. It is a dual-tower model: a BUD-E-Whisper_V1.1 audio encoder on one side and sentence-transformers/all-MiniLM-L6-v2 on the text side, each followed by an MLP projection into a shared embedding space and trained with the SigLIP sigmoid contrastive loss.

| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
|---|---|
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Epochs | 1 |
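
For intuition, here is a minimal sketch of the SigLIP sigmoid contrastive objective applied to a batch of paired, L2-normalised audio and text embeddings. It is illustrative only; the `temperature` and `bias` names are assumptions and do not necessarily match the checkpoint's actual parameter names.

```python
import torch
import torch.nn.functional as F

def siglip_loss(audio_emb, text_emb, temperature, bias):
    # audio_emb, text_emb: (B, D) unit-norm embeddings of matched audio-text pairs
    logits = audio_emb @ text_emb.T * temperature + bias               # (B, B) scaled cosine similarities
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1   # +1 for matched pairs, -1 otherwise
    # Each audio-text pair is an independent binary classification,
    # so no softmax over the batch is needed (unlike InfoNCE/CLIP).
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```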

## Training data

Trained for 1 epoch on the open voiceclap_10_safe mixture (9 datasets) used in the VoiceNet paper:

- emolia-balanced-5M-subset (annotated subset of Emilia)
- laions_got_talent_clean_with_captions
- majestrino-data
- synthetic_vocal_bursts
- improved_synthetic_vocal_bursts
- ears
- expresso
- voxceleb1
- voxceleb2

All clips are captioned with MOSS-Audio-8B-Thinking-derived dense vocal-style captions covering emotions, talking-style attributes, and demographics.

## Standalone load example

Only torch, transformers, and torchaudio are required (all available on PyPI).

```python
import torch, torchaudio
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("VoiceNet/voiceclap-small", trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained("VoiceNet/voiceclap-small")

# Audio: any-length 16 kHz waveform, mono
wav, sr = torchaudio.load("clip.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)                                         # (T,)
audio_emb = model.encode_waveform(wav)                    # (1, 768), L2-normed

# Text: short caption(s)
enc      = tok(["a calm and steady voice"], padding=True, return_tensors="pt")
text_emb = model.encode_text(enc.input_ids, enc.attention_mask)

# Cosine similarity (embeddings already L2-normalised)
print((audio_emb @ text_emb.T).item())
```

`encode_waveform` accepts clips up to 30 s; longer clips should be chunked or truncated before being passed in (see the sketch below). Embeddings are 768-d and unit-norm, so `a @ t.T` is the cosine similarity used in zero-shot retrieval.
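
A minimal sketch of one way to handle recordings longer than 30 s, assuming `model` and `text_emb` from the example above are already in memory. Mean-pooling per-chunk embeddings is a simple heuristic for illustration, not part of the released recipe.

```python
import torch
import torchaudio

# Illustrative chunking: split a long 16 kHz mono waveform into 30 s windows,
# embed each window, then average the unit-norm chunk embeddings.
wav, sr = torchaudio.load("long_clip.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)                                                  # mono, (T,)

chunk_len = 30 * 16000                                             # 30 s at 16 kHz
chunks = [wav[i:i + chunk_len] for i in range(0, wav.numel(), chunk_len)]
embs = torch.cat([model.encode_waveform(c) for c in chunks])       # (num_chunks, 768)
clip_emb = torch.nn.functional.normalize(embs.mean(0, keepdim=True), dim=-1)  # (1, 768)

print((clip_emb @ text_emb.T).item())                              # cosine similarity vs. the caption
```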

## Citation

If you use this model, please cite the VoiceNet paper.