# ParaSpeechCLAP-Combined

ParaSpeechCLAP-Combined is a unified dual-encoder speech-text model trained on both intrinsic and situational speech style data. It supports the full breadth of ParaSpeechCLAP's style vocabulary, spanning speaker-level attributes (pitch, texture, clarity, volume, rhythm) and utterance-level attributes (emotions and speaking styles), in a single model. It is the recommended choice for compositional evaluation, where style descriptions mix both attribute types. It is part of the ParaSpeechCLAP model family from the paper:

**ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining**
Anuj Diwan, Eunsol Choi, David Harwath
## Installation

```bash
git clone https://github.com/ajd12342/paraspeechclap.git
cd paraspeechclap
pip install -r requirements.txt
```

Download the model checkpoint:

```bash
mkdir -p checkpoints
huggingface-cli download ajd12342/paraspeechclap-combined paraspeechclap-combined.pth.tar --local-dir checkpoints
```
## Quick Start

### Command-line inference

```bash
# Compositional: similarity with a description mixing intrinsic and situational attributes
python scripts/inference.py \
    --checkpoint_path ./checkpoints/paraspeechclap-combined.pth.tar \
    --audio_path /path/to/audio.wav \
    --text "A person with a deep, raspy voice is speaking in a whispered style."

# Intrinsic classification
python scripts/inference.py \
    --checkpoint_path ./checkpoints/paraspeechclap-combined.pth.tar \
    --audio_path /path/to/audio.wav \
    --candidates deep shrill nasal husky raspy

# Situational classification
python scripts/inference.py \
    --checkpoint_path ./checkpoints/paraspeechclap-combined.pth.tar \
    --audio_path /path/to/audio.wav \
    --candidates angry happy calm whispered enthusiastic saddened anxious
```
Python usage
import torch
import torchaudio
import torchaudio.transforms as T
from paraspeechclap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor
SPEECH_MODEL = "microsoft/wavlm-large"
TEXT_MODEL = "ibm-granite/granite-embedding-278m-multilingual"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load ParaSpeechCLAP-Combined
model = CLAP(
speech_name=SPEECH_MODEL,
text_name=TEXT_MODEL,
embedding_dim=768,
)
state_dict = torch.load("./checkpoints/paraspeechclap-combined.pth.tar", map_location=DEVICE)
model.load_state_dict(state_dict, strict=False)
model.to(DEVICE).eval()
# Initialize preprocessors
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(SPEECH_MODEL)
tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL)
# Load and preprocess audio (resample to 16 kHz mono)
waveform, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
waveform = T.Resample(sr, 16000)(waveform)
if waveform.shape[0] > 1:
waveform = waveform.mean(dim=0, keepdim=True)
audio = feature_extractor(
waveform.squeeze(0), sampling_rate=16000, return_tensors="pt"
).input_values.to(DEVICE) # (1, num_samples)
# Similarity with a compositional style description (both intrinsic + situational)
text_tokens = tokenizer(
"A person with a deep, raspy voice is speaking in a whispered style.",
return_tensors="pt", padding=True, truncation=True, max_length=512
)
text_tokens = {k: v.to(DEVICE) for k, v in text_tokens.items()}
with torch.no_grad():
audio_emb = model.get_audio_embedding(audio, normalize=True) # (1, 768)
text_emb = model.get_text_embedding(text_tokens, normalize=True) # (1, 768)
similarity = (audio_emb @ text_emb.T).item()
print(f"Similarity: {similarity:.4f}")
# Zero-shot classification across style candidates
candidates = ["angry", "happy", "calm", "whispered", "enthusiastic", "saddened", "anxious"]
prompts = [f"A person is speaking in a {s} style." for s in candidates]
tokens = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=512)
tokens = {k: v.to(DEVICE) for k, v in tokens.items()}
with torch.no_grad():
text_embs = model.get_text_embedding(tokens, normalize=True) # (7, 768)
scores = (audio_emb @ text_embs.T).squeeze(0) # (7,)
pred = candidates[scores.argmax().item()]
print(f"Predicted style: {pred}")
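To report calibrated-looking confidences rather than raw cosine similarities, a softmax over the candidate scores can be applied. Note that CLAP-style models learn a logit scale (temperature) during training; the temperature below is an illustrative assumption, not the model's learned value. A minimal dependency-free sketch:

```python
import math

def softmax(scores, temperature=0.07):
    """Convert a list of cosine similarities into a probability
    distribution over candidates.

    temperature: illustrative value; CLAP-style models learn their own
    logit scale during training, which this sketch does not use.
    """
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example with dummy cosine similarities for 7 style candidates
scores = [0.12, 0.31, 0.05, 0.42, 0.18, 0.09, 0.22]
probs = softmax(scores)
best = probs.index(max(probs))  # same argmax as the raw scores
```

The softmax is monotonic, so the predicted class is unchanged; only the presentation of the scores differs.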
## Best-of-N reranking

Use ParaSpeechCLAP-Combined as an inference-time reward model to select the best speech clip from N TTS candidates:

```bash
python scripts/best_of_n.py \
    checkpoint_path=./checkpoints/paraspeechclap-combined.pth.tar \
    input_base_dir=/path/to/tts_outputs \
    output_dir_name=best_of_N_paraspeechclap_combined
```
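Conceptually, best-of-N reranking scores each candidate clip against the target style description and keeps the highest-scoring one. The sketch below illustrates the selection step only, assuming the N audio embeddings and the text embedding have already been computed (e.g. with `get_audio_embedding` / `get_text_embedding` as shown earlier, with `normalize=True`); the helper name is hypothetical, not part of the repository, and `scripts/best_of_n.py` may differ in detail.

```python
def pick_best_of_n(candidate_embs, text_emb):
    """Return the index of the candidate audio embedding with the
    highest dot product (cosine similarity, for normalized vectors)
    against the target text embedding."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(emb, text_emb) for emb in candidate_embs]
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy 3-dim unit embeddings for three TTS candidates and one description
candidates = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
description = [0.6, 0.8, 0.0]
best_idx = pick_best_of_n(candidates, description)
```

In practice the winning clip (here, the candidate whose embedding matches the description) is copied to the output directory.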
## Evaluation

Evaluate on the paraspeechclap-eval-combined benchmark (compositional retrieval):

```bash
python scripts/evaluate_retrieval.py \
    --config-name eval/retrieval \
    checkpoint_path=./checkpoints/paraspeechclap-combined.pth.tar \
    data.dataset_name=ajd12342/paraspeechclap-eval-combined \
    data.audio_root=/path/to/audio_root \
    meta.results=./results_retrieval/paraspeechclap-eval-combined/ajd12342-paraspeechclap-combined
```
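Cross-modal retrieval benchmarks of this kind are commonly scored with Recall@K: for each text query, check whether its matching audio appears among the K most similar items. The sketch below assumes a square similarity matrix in which row i is query i and the ground-truth match is item i; the exact metrics reported by `scripts/evaluate_retrieval.py` may differ.

```python
def recall_at_k(sim_matrix, k):
    """Fraction of queries whose ground-truth item (index i for row i)
    ranks within the top-k similarities of that row."""
    hits = 0
    for i, row in enumerate(sim_matrix):
        # item indices sorted by descending similarity
        ranking = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranking[:k]:
            hits += 1
    return hits / len(sim_matrix)

# Toy 3x3 text-to-audio similarity matrix
sims = [
    [0.9, 0.1, 0.3],  # query 0: correct item ranked 1st
    [0.2, 0.4, 0.8],  # query 1: correct item ranked 2nd
    [0.1, 0.6, 0.5],  # query 2: correct item ranked 2nd
]
# recall_at_k(sims, 1) -> 1/3, recall_at_k(sims, 2) -> 1.0
```

Audio-to-text retrieval is scored the same way on the transposed matrix.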
## Choosing a Model
| Use case | Recommended model |
|---|---|
| Intrinsic (speaker-level) attributes only | ajd12342/paraspeechclap-intrinsic |
| Situational (utterance-level) attributes only | ajd12342/paraspeechclap-situational |
| Compositional descriptions (both types) | ajd12342/paraspeechclap-combined |
## Related Resources

- GitHub Repository: https://github.com/ajd12342/paraspeechclap
- Models: ajd12342/paraspeechclap-intrinsic, ajd12342/paraspeechclap-situational, and ajd12342/paraspeechclap-combined
- Training Datasets: ajd12342/paraspeechcaps-intrinsic-train and ajd12342/paraspeechcaps-situational-train
- Parent Dataset: ajd12342/paraspeechcaps
- Evaluation Datasets: ajd12342/paraspeechclap-eval-intrinsic, ajd12342/paraspeechclap-eval-situational, ajd12342/paraspeechclap-eval-combined
## Citation

```bibtex
@article{diwan2026paraspeechclap,
  title={{ParaSpeechCLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
  author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
  journal={Under Review},
  year={2026}
}
```