---
license: cc-by-4.0
language:
  - en
library_name: transformers
pipeline_tag: feature-extraction
tags:
  - audio
  - speech
  - emotion
  - clap
  - contrastive
  - voice
---

# VoiceCLAP-Small

Voice-text contrastive (CLAP-style) embedding model trained on dense vocal-style
captions for the [VoiceNet](https://huggingface.co/VoiceNet) suite.

VoiceCLAP-Small is the smaller of the two voice-text contrastive anchors
released with VoiceNet. It is a **dual-tower** model: a
[BUD-E-Whisper_V1.1](https://huggingface.co/laion/BUD-E-Whisper_V1.1) audio
encoder paired with
[`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
on the text side, joined by an MLP projection on each side and trained with
the SigLIP sigmoid contrastive loss.

| Property | Value |
| --- | --- |
| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Epochs | 1 |
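
For readers unfamiliar with the sigmoid contrastive objective, the following is a minimal NumPy sketch of the pairwise loss as formulated in the SigLIP paper. It is not the VoiceNet training code: the function name `siglip_loss` is hypothetical, and the temperature `t` and bias `b` (learnable in SigLIP) are fixed here at that paper's initial values, which may differ from what this model used.

```python
import numpy as np

def siglip_loss(audio_emb, text_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid contrastive loss over a batch of unit-norm embeddings.

    audio_emb, text_emb: (N, D) arrays; row i of each forms a matched pair.
    t, b: temperature and bias (assumed values, see the note above).
    """
    logits = t * audio_emb @ text_emb.T + b          # (N, N) scaled cosine similarities
    labels = 2.0 * np.eye(len(logits)) - 1.0         # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x), which is numerically stable
    return np.logaddexp(0.0, -labels * logits).sum() / len(logits)

# Toy batch: four orthogonal unit vectors stand in for real embeddings.
emb = np.eye(4)
matched = siglip_loss(emb, emb)                          # every caption paired with its own clip
shuffled = siglip_loss(emb, np.roll(emb, 1, axis=0))     # every caption paired with the wrong clip
print(matched < shuffled)                                # True
```

Unlike a softmax-based contrastive loss, each audio-text pair is scored as an independent binary decision, so no normalisation across the batch is needed.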

## Training data

Trained for **1 epoch** on the open `voiceclap_10_safe` mixture (9 datasets)
used in the VoiceNet paper:

- `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
- `laions_got_talent_clean_with_captions`
- `majestrino-data`
- `synthetic_vocal_bursts`
- `improved_synthetic_vocal_bursts`
- `ears`
- `expresso`
- `voxceleb1`
- `voxceleb2`

All clips are captioned with `MOSS-Audio-8B-Thinking`-derived dense vocal-style
captions covering emotions, talking-style attributes, and demographics.

## Standalone load example

Only `transformers` and `torchaudio` are required (both on PyPI); installing
`torchaudio` also pulls in `torch`.

```python
import torch, torchaudio
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("VoiceNet/voiceclap-small", trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained("VoiceNet/voiceclap-small")

# Audio: mono 16 kHz waveform, at most 30 s long
wav, sr = torchaudio.load("clip.wav")                     # (channels, T)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0)                                         # downmix to mono: (T,)
audio_emb = model.encode_waveform(wav)                    # (1, 768), L2-normed

# Text: short caption(s)
enc      = tok(["a calm and steady voice"], padding=True, return_tensors="pt")
text_emb = model.encode_text(enc.input_ids, enc.attention_mask)

# Cosine similarity (embeddings already L2-normalised)
print((audio_emb @ text_emb.T).item())
```

`encode_waveform` accepts clips up to 30 s; longer clips should be chunked or
truncated before being passed in. Embeddings are 768-d and unit-norm, so
`a @ t.T` is the cosine similarity used in zero-shot retrieval.

## Citation

If you use this model, please cite the VoiceNet paper.