---
license: cc-by-4.0
language:
- en
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: LCO-Embedding/LCO-Embedding-Omni-7B
tags:
- audio
- speech
- emotion
- clap
- contrastive
- voice
- sentence-transformers
---
# VoiceCLAP-Large
Voice-text contrastive embedding model — the larger of the two anchors
released with [VoiceNet](https://huggingface.co/VoiceNet).
VoiceCLAP-Large is a **single-tower** model: a rank-16 LoRA finetune of
[LCO-Embedding-Omni-7B](https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-7B)
(Qwen2.5-Omni-Thinker-7B backbone with a sentence-transformer
last-token-pooling head) trained with the symmetric InfoNCE loss. The audio
and text embeddings are produced by the same backbone — the modality is
determined by what is fed in via the multimodal chat template.
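As a rough illustration of the last-token pooling mentioned above, the sketch below picks the hidden state of the final non-padding token in each sequence and L2-normalises it. It uses NumPy with random stand-in hidden states; the function name `last_token_pool`, the right-padding assumption, and the toy shapes are illustrative and not taken from the released code.

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """Return the L2-normalised hidden state of each sequence's last real token.

    hidden_states:  (batch, seq_len, dim) float array
    attention_mask: (batch, seq_len) array of 1 (token) / 0 (padding),
                    assumed to be right-padded
    """
    last_idx = attention_mask.sum(axis=1) - 1        # index of the last real token
    batch_idx = np.arange(hidden_states.shape[0])
    pooled = hidden_states[batch_idx, last_idx]      # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / norms

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))                  # toy stand-in, not real model output
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
emb = last_token_pool(hidden, mask)
print(emb.shape)
```

Because the same pooling is applied whatever the input modality, audio and text inputs end up as vectors in one shared space.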
| | |
| --- | --- |
| Architecture | single-tower Omni-Embedding (Qwen2.5-Omni-Thinker-7B + ST last-token-pool) |
| Adaptation | rank-16 LoRA (alpha 32, dropout 0.05), merged into the released weights |
| Joint embedding | 3 584-d, L2-normalised |
| Loss | symmetric InfoNCE (all-gather negatives) |
| Total parameters | ~7 B (full merged model) |
| Epochs | 1 |
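
The symmetric InfoNCE loss listed in the table can be sketched in a few lines of NumPy. This is a single-process illustration, not the training code: the temperature of 0.07 is an assumed placeholder (the card does not state one), and the helper names are hypothetical. In a multi-GPU run with all-gather negatives, the embeddings from every rank would be concatenated before the similarity matrix is built, so each pair is contrasted against the global batch.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched pairs.

    audio_emb, text_emb: (batch, dim), L2-normalised; row i of each is a pair.
    temperature: assumed value, not taken from the model card.
    """
    logits = audio_emb @ text_emb.T / temperature    # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    loss_a2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[diag, diag].mean()   # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)

# Toy check: perfectly matched pairs score far lower than mismatched ones.
pairs = np.eye(4)
print(symmetric_info_nce(pairs, pairs))
print(symmetric_info_nce(pairs, pairs[[1, 2, 3, 0]]))
```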
## Training data
Trained for **1 epoch** on the open `voiceclap_10_safe` mixture (9 datasets)
used in the VoiceNet paper:
- `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
- `laions_got_talent_clean_with_captions`
- `majestrino-data`
- `synthetic_vocal_bursts`
- `improved_synthetic_vocal_bursts`
- `ears`
- `expresso`
- `voxceleb1`
- `voxceleb2`
All clips carry dense vocal-style captions derived from
`MOSS-Audio-8B-Thinking`, covering emotions, talking-style attributes, and
demographics.
## Standalone load example
The model uses the SentenceTransformer multimodal API. Both
`sentence-transformers` and `transformers` are on PyPI; the only other
dependency in the example below is `soundfile`, used to read the audio clip.
```python
import soundfile as sf
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("VoiceNet/voiceclap-large", trust_remote_code=True)

# Text embedding (3 584-d, L2-normalised)
text_emb = model.encode(["a calm and steady voice"])

# Audio embedding: pass a dict with raw samples + sampling rate.
arr, sr = sf.read("clip.wav")
audio_emb = model.encode([{"array": arr, "sampling_rate": sr}])

# Cosine similarity (embeddings already L2-normalised)
print((audio_emb @ text_emb.T).item())
```
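
Because the embeddings are L2-normalised, ranking several candidate captions against one clip reduces to a matrix product followed by a sort. The sketch below shows that step on hand-made toy vectors standing in for `model.encode` outputs; the `rank_captions` helper, the extra captions, and the 8-d vectors are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def rank_captions(audio_emb, text_embs, captions):
    """Sort captions by cosine similarity to one audio embedding.

    audio_emb: (dim,) L2-normalised
    text_embs: (n, dim) L2-normalised, one row per caption
    """
    scores = text_embs @ audio_emb                   # dot product == cosine here
    order = np.argsort(-scores)                      # highest score first
    return [(captions[i], float(scores[i])) for i in order]

captions = ["a calm and steady voice", "an angry shout", "a cheerful laugh"]

# Toy stand-ins for model.encode(...) outputs (not real embeddings).
text_embs = np.eye(3, 8)
audio_emb = np.array([0.2, 0.9, 0.1, 0.3, 0.0, 0.0, 0.0, 0.0])
audio_emb /= np.linalg.norm(audio_emb)

ranked = rank_captions(audio_emb, text_embs, captions)
for caption, score in ranked:
    print(f"{score:+.3f}  {caption}")
```

With real model outputs, `text_embs` would come from encoding the caption list and `audio_emb` from the first row of the audio batch.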
For convenience the LoRA adapter is also shipped under `adapter/` so it can
be reapplied to other LCO-Embedding-Omni-7B forks; the merged
`model.safetensors` already contains it.
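
To make "merged" concrete, the sketch below shows the usual LoRA merge arithmetic on a single toy weight matrix, assuming the standard `alpha / rank` scaling. The rank and alpha match the table in this card; the layer size, random weights, and variable names are illustrative only and do not come from the released checkpoint. Dropout is omitted because it only acts during training.

```python
import numpy as np

rank, alpha = 16, 32              # values listed in this card
d_out, d_in = 64, 48              # toy layer size, not the real model's

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))            # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01      # LoRA down-projection
B = rng.normal(size=(d_out, rank)) * 0.01     # LoRA up-projection

scale = alpha / rank                          # assumed standard LoRA scaling
W_merged = W + scale * (B @ A)                # fold the adapter into the weight

x = rng.normal(size=(d_in,))
y_adapter = W @ x + scale * (B @ (A @ x))     # base layer plus separate adapter
y_merged = W_merged @ x                       # single merged layer
print(np.allclose(y_adapter, y_merged))
```

The two outputs agree up to floating-point error, which is why a merged checkpoint needs no adapter at inference time, while the separate adapter stays useful for applying the same update to a different copy of the base weights.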
## Citation
If you use this model, please cite the VoiceNet paper.