---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- emotion
- clap
- contrastive
- voice
---

# VoiceCLAP-Small

A voice-text contrastive (CLAP-style) embedding model trained on dense vocal-style
captions for the [VoiceNet](https://huggingface.co/VoiceNet) suite.

VoiceCLAP-Small is the smaller of the two voice-text contrastive anchors
released with VoiceNet. It is a **dual-tower** model: a
[BUD-E-Whisper_V1.1](https://huggingface.co/laion/BUD-E-Whisper_V1.1) audio
encoder paired with
[`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
on the text side, with an MLP projection head on each tower mapping into a shared
embedding space, trained with the SigLIP sigmoid contrastive loss.
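
The projection heads are small MLPs; the sketch below shows that wiring, where the hidden width, activation, and module names are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def projection_head(in_dim: int, out_dim: int = 768, hidden: int = 1024) -> nn.Module:
    """Illustrative MLP head mapping a tower's pooled output into the shared space."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

audio_head = projection_head(in_dim=768)  # Whisper-style encoder width
text_head = projection_head(in_dim=384)   # MiniLM width (mean-pooled over tokens)

# Toy pooled tower outputs -> 768-d, L2-normalised joint embeddings.
audio_emb = F.normalize(audio_head(torch.randn(8, 768)), dim=-1)
text_emb = F.normalize(text_head(torch.randn(8, 384)), dim=-1)
print(audio_emb.shape, text_emb.shape)  # torch.Size([8, 768]) torch.Size([8, 768])
```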

| | |
| --- | --- |
| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Epochs | 1 |

## Training data

Trained for **1 epoch** on the open `voiceclap_10_safe` mixture (9 datasets)
used in the VoiceNet paper:

- `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
- `laions_got_talent_clean_with_captions`
- `majestrino-data`
- `synthetic_vocal_bursts`
- `improved_synthetic_vocal_bursts`
- `ears`
- `expresso`
- `voxceleb1`
- `voxceleb2`

All clips carry dense vocal-style captions derived from `MOSS-Audio-8B-Thinking`,
covering emotions, talking-style attributes, and demographics.

## Standalone load example

Only `transformers` and `torchaudio` are required (both on PyPI; `torchaudio` brings in `torch`).

```python
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("VoiceNet/voiceclap-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("VoiceNet/voiceclap-small")

# Audio: any-length mono waveform at 16 kHz
wav, sr = torchaudio.load("clip.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)  # collapse channels -> (T,)

with torch.no_grad():
    audio_emb = model.encode_waveform(wav)  # (1, 768), L2-normalised

# Text: short caption(s)
enc = tok(["a calm and steady voice"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = model.encode_text(enc.input_ids, enc.attention_mask)

# Cosine similarity (embeddings are already L2-normalised)
print((audio_emb @ text_emb.T).item())
```

`encode_waveform` accepts clips up to 30 s; longer clips should be chunked or
truncated before being passed in. Embeddings are 768-d and unit-norm, so
`a @ t.T` is the cosine similarity used in zero-shot retrieval.
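
Putting both notes together, the sketch below chunks a longer waveform into 30 s windows, mean-pools the per-chunk embeddings, and ranks a few candidate captions by cosine similarity. It reuses `model` and `tok` from the example above; the chunking and mean-pooling strategy is an assumption, not something this card prescribes:

```python
import torch
import torchaudio

CHUNK = 30 * 16000  # 30 s at 16 kHz

wav, sr = torchaudio.load("long_clip.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)

# Embed each 30 s chunk separately, then average and re-normalise.
chunks = [wav[i:i + CHUNK] for i in range(0, wav.shape[0], CHUNK)]
with torch.no_grad():
    chunk_embs = torch.cat([model.encode_waveform(c) for c in chunks])  # (num_chunks, 768)
audio_emb = torch.nn.functional.normalize(chunk_embs.mean(0, keepdim=True), dim=-1)

# Zero-shot style retrieval: rank candidate captions by cosine similarity.
captions = ["an angry shouting voice", "a calm and steady voice", "an excited whisper"]
enc = tok(captions, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embs = model.encode_text(enc.input_ids, enc.attention_mask)    # (3, 768)

scores = (audio_emb @ text_embs.T).squeeze(0)
for caption, score in sorted(zip(captions, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:+.3f}  {caption}")
```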

## Citation

If you use this model, please cite the VoiceNet paper.