---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- emotion
- clap
- contrastive
- voice
---
# VoiceCLAP-Small
Voice-text contrastive (CLAP-style) embedding model trained on dense vocal-style
captions for the [VoiceNet](https://huggingface.co/VoiceNet) suite.
VoiceCLAP-Small is the smaller of the two voice-text contrastive anchors
released with VoiceNet. It is a **dual-tower** model: a
[BUD-E-Whisper_V1.1](https://huggingface.co/laion/BUD-E-Whisper_V1.1) audio
encoder paired with
[`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
on the text side, joined by an MLP projection on each side and trained with
the SigLIP sigmoid contrastive loss.
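
For readers unfamiliar with the sigmoid contrastive objective, the following is a minimal NumPy sketch of the SigLIP loss on a batch of already L2-normalised embeddings. It is an illustration, not the VoiceNet training code: `siglip_loss` is a hypothetical helper, and the `temperature=10.0` and `bias=-10.0` defaults are the initial values from the SigLIP paper, assumed here rather than taken from this model.

```python
import numpy as np


def siglip_loss(audio_emb, text_emb, temperature=10.0, bias=-10.0):
    """Sigmoid contrastive loss over every audio-text pair in a batch.

    audio_emb, text_emb: (N, D) arrays of L2-normalised embeddings where
    row i of each array comes from the same (clip, caption) pair.
    temperature and bias are learnable in SigLIP; fixed here (assumption).
    """
    audio_emb = np.asarray(audio_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    n = audio_emb.shape[0]

    logits = temperature * (audio_emb @ text_emb.T) + bias  # (N, N)
    labels = 2.0 * np.eye(n) - 1.0  # +1 for matching pairs, -1 otherwise

    # -log(sigmoid(x)) == log(1 + exp(-x)); logaddexp keeps this stable.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / n


# Toy batch: three orthogonal unit vectors, paired correctly vs. shuffled.
matched = np.eye(3)
shuffled = np.roll(matched, 1, axis=0)
print(siglip_loss(matched, matched))   # correctly paired batch
print(siglip_loss(matched, shuffled))  # mispaired batch, larger loss
```

Unlike a softmax (InfoNCE) loss, each pair is scored independently as a binary decision, so the loss does not need a normalisation over the whole batch.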
| Property | Value |
| --- | --- |
| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Epochs | 1 |
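
The table lists the text tower as mean-pooled. The sketch below shows attention-masked mean pooling in NumPy, which is the usual way MiniLM token states are pooled in sentence-transformers. Whether VoiceCLAP's remote code does exactly this is an assumption, and `masked_mean_pool` is a hypothetical helper name.

```python
import numpy as np


def masked_mean_pool(hidden, attention_mask):
    """Average token states over real tokens only, ignoring padding.

    hidden: (B, T, D) token-level hidden states.
    attention_mask: (B, T) with 1 for real tokens and 0 for padding.
    Returns a (B, D) array with one pooled vector per sequence.
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]  # (B, T, 1)
    summed = (hidden * mask).sum(axis=1)             # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


# Toy input: the padded position (value 100) must not affect the result.
hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
          [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]]
mask = [[1, 1, 0],
        [1, 1, 1]]
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```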
## Training data
Trained for **1 epoch** on the open `voiceclap_10_safe` mixture (9 datasets)
used in the VoiceNet paper:
- `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
- `laions_got_talent_clean_with_captions`
- `majestrino-data`
- `synthetic_vocal_bursts`
- `improved_synthetic_vocal_bursts`
- `ears`
- `expresso`
- `voxceleb1`
- `voxceleb2`
All clips are captioned with `MOSS-Audio-8B-Thinking`-derived dense vocal-style
captions covering emotions, talking-style attributes, and demographics.
## Standalone load example
Only `transformers` and `torchaudio` are required (both on PyPI).
```python
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is needed because the model class ships with the repo
model = AutoModel.from_pretrained("VoiceNet/voiceclap-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("VoiceNet/voiceclap-small")

# Audio: 16 kHz mono waveform, at most 30 s
wav, sr = torchaudio.load("clip.wav")  # (channels, T)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)  # downmix to mono: (T,)

# Text: short caption(s)
enc = tok(["a calm and steady voice"], padding=True, return_tensors="pt")

with torch.no_grad():  # inference only, no gradients needed
    audio_emb = model.encode_waveform(wav)  # (1, 768), L2-normed
    text_emb = model.encode_text(enc.input_ids, enc.attention_mask)

# Cosine similarity (embeddings already L2-normalised)
print((audio_emb @ text_emb.T).item())
```
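
The single-pair similarity above extends directly to ranking several candidate captions against one clip. The sketch below uses toy 2-d unit vectors in place of real embeddings so that it runs without downloading the model. `rank_captions` is a hypothetical helper, not part of the model's API, and the example captions are made up.

```python
import numpy as np


def rank_captions(audio_emb, text_embs, captions):
    """Rank captions by cosine similarity to one audio embedding.

    audio_emb: (D,) or (1, D) unit-norm audio embedding.
    text_embs: (K, D) unit-norm text embeddings, one row per caption.
    Returns (caption, score) pairs sorted from best to worst match.
    """
    audio_emb = np.asarray(audio_emb, dtype=np.float64).reshape(-1)
    text_embs = np.asarray(text_embs, dtype=np.float64)
    scores = text_embs @ audio_emb  # dot product == cosine for unit vectors
    order = np.argsort(-scores)     # indices sorted by descending score
    return [(captions[i], float(scores[i])) for i in order]


# Toy stand-ins for real embeddings (all unit-norm).
captions = ["an angry shout", "a calm and steady voice", "a cheerful tone"]
text_embs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
audio_emb = [1.0, 0.0]

ranked = rank_captions(audio_emb, text_embs, captions)
for caption, score in ranked:
    print(f"{score:.2f}  {caption}")
```

With the real model, the tensors from the example above would likely need converting first, for instance with `audio_emb.detach().cpu().numpy()`.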
`encode_waveform` accepts clips up to 30 s; longer clips should be chunked or
truncated before being passed in. Embeddings are 768-d and unit-norm, so
`a @ t.T` is the cosine similarity used in zero-shot retrieval.
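
One way to respect the 30 s limit is to split a long waveform into consecutive chunks before encoding. The sketch below is a plain-Python illustration. `chunk_waveform` is a hypothetical helper, not part of the model's API, and it works on any sliceable 1-d sequence, including a `torch` tensor.

```python
def chunk_waveform(wav, sample_rate=16000, max_seconds=30.0):
    """Split a 1-d waveform into consecutive chunks of at most max_seconds.

    The last chunk may be shorter than the others. Nothing is padded or
    overlapped; those are left as choices for the caller.
    """
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * max_seconds must be positive")
    return [wav[start:start + chunk_len] for start in range(0, len(wav), chunk_len)]


# 70 s of silence at 16 kHz as a stand-in for a real clip.
samples = [0.0] * (16000 * 70)
chunks = chunk_waveform(samples)
print([len(c) / 16000 for c in chunks])  # chunk durations in seconds
```

Each chunk can then be passed to `model.encode_waveform` separately. How to combine the per-chunk embeddings (for example, averaging them and re-normalising to unit length) is not specified in this card, so treat it as a design choice to validate on your own data.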
## Citation
If you use this model, please cite the VoiceNet paper.