---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- emotion
- clap
- contrastive
- voice
---

# VoiceCLAP-Small

A voice-text contrastive (CLAP-style) embedding model trained on dense vocal-style
captions for the [VoiceNet](https://huggingface.co/VoiceNet) suite.

VoiceCLAP-Small is the smaller of the two voice-text contrastive anchors
released with VoiceNet. It is a **dual-tower** model: a
[BUD-E-Whisper_V1.1](https://huggingface.co/laion/BUD-E-Whisper_V1.1) audio
encoder paired with
[`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
on the text side, with an MLP projection head on each tower mapping into a shared
embedding space, trained with the SigLIP sigmoid contrastive loss.
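
The projection heads are small MLPs; the sketch below shows that wiring, where the hidden width, activation, and module names are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def projection_head(in_dim: int, out_dim: int = 768, hidden: int = 1024) -> nn.Module:
    """Illustrative MLP head mapping a tower's pooled output into the shared space."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

audio_head = projection_head(in_dim=768)  # Whisper-style encoder width
text_head = projection_head(in_dim=384)   # MiniLM width (mean-pooled over tokens)

# Toy pooled tower outputs -> 768-d, L2-normalised joint embeddings.
audio_emb = F.normalize(audio_head(torch.randn(8, 768)), dim=-1)
text_emb = F.normalize(text_head(torch.randn(8, 384)), dim=-1)
print(audio_emb.shape, text_emb.shape)  # torch.Size([8, 768]) torch.Size([8, 768])
```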

| | |
| --- | --- |
| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Epochs | 1 |

## Training data

Trained for **1 epoch** on the open `voiceclap_10_safe` mixture (9 datasets)
used in the VoiceNet paper:

- `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
- `laions_got_talent_clean_with_captions`
- `majestrino-data`
- `synthetic_vocal_bursts`
- `improved_synthetic_vocal_bursts`
- `ears`
- `expresso`
- `voxceleb1`
- `voxceleb2`

All clips carry dense vocal-style captions derived from `MOSS-Audio-8B-Thinking`,
covering emotions, talking-style attributes, and demographics.

## Standalone load example

Only `transformers` and `torchaudio` are required (both on PyPI; `torchaudio` brings in `torch`).

```python
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("VoiceNet/voiceclap-small", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("VoiceNet/voiceclap-small")

# Audio: any-length mono waveform at 16 kHz
wav, sr = torchaudio.load("clip.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)  # collapse channels -> (T,)

with torch.no_grad():
    audio_emb = model.encode_waveform(wav)  # (1, 768), L2-normalised

# Text: short caption(s)
enc = tok(["a calm and steady voice"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = model.encode_text(enc.input_ids, enc.attention_mask)

# Cosine similarity (embeddings are already L2-normalised)
print((audio_emb @ text_emb.T).item())
```

`encode_waveform` accepts clips up to 30 s; longer clips should be chunked or
truncated before being passed in. Embeddings are 768-d and unit-norm, so
`a @ t.T` is the cosine similarity used in zero-shot retrieval.
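
Putting both notes together, the sketch below chunks a longer waveform into 30 s windows, mean-pools the per-chunk embeddings, and ranks a few candidate captions by cosine similarity. It reuses `model` and `tok` from the example above; the chunking and mean-pooling strategy is an assumption, not something this card prescribes:

```python
import torch
import torchaudio

CHUNK = 30 * 16000  # 30 s at 16 kHz

wav, sr = torchaudio.load("long_clip.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)

# Embed each 30 s chunk separately, then average and re-normalise.
chunks = [wav[i:i + CHUNK] for i in range(0, wav.shape[0], CHUNK)]
with torch.no_grad():
    chunk_embs = torch.cat([model.encode_waveform(c) for c in chunks])  # (num_chunks, 768)
audio_emb = torch.nn.functional.normalize(chunk_embs.mean(0, keepdim=True), dim=-1)

# Zero-shot style retrieval: rank candidate captions by cosine similarity.
captions = ["an angry shouting voice", "a calm and steady voice", "an excited whisper"]
enc = tok(captions, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embs = model.encode_text(enc.input_ids, enc.attention_mask)    # (3, 768)

scores = (audio_emb @ text_embs.T).squeeze(0)
for caption, score in sorted(zip(captions, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:+.3f}  {caption}")
```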

## Citation

If you use this model, please cite the VoiceNet paper.