ParaSpeechCLAP-Combined

ParaSpeechCLAP-Combined is a unified dual-encoder speech-text model trained on both intrinsic and situational speech style data. In a single model, it supports the full breadth of ParaSpeechCLAP's style vocabulary, spanning speaker-level attributes (pitch, texture, clarity, volume, rhythm) and utterance-level attributes (emotions and speaking styles). It is the recommended choice for compositional evaluation, where style descriptions mix both attribute types. It is part of the ParaSpeechCLAP model family from the paper:

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
by Anuj Diwan, Eunsol Choi, David Harwath

Installation

git clone https://github.com/ajd12342/paraspeechclap.git
cd paraspeechclap
pip install -r requirements.txt

Download the model checkpoint:

mkdir -p checkpoints
huggingface-cli download ajd12342/paraspeechclap-combined paraspeechclap-combined.pth.tar --local-dir checkpoints

Quick Start

Command-line inference

# Compositional: similarity with a description mixing intrinsic and situational attributes
python scripts/inference.py \
  --checkpoint_path ./checkpoints/paraspeechclap-combined.pth.tar \
  --audio_path /path/to/audio.wav \
  --text "A person with a deep, raspy voice is speaking in a whispered style."

# Intrinsic classification
python scripts/inference.py \
  --checkpoint_path ./checkpoints/paraspeechclap-combined.pth.tar \
  --audio_path /path/to/audio.wav \
  --candidates deep shrill nasal husky raspy

# Situational classification
python scripts/inference.py \
  --checkpoint_path ./checkpoints/paraspeechclap-combined.pth.tar \
  --audio_path /path/to/audio.wav \
  --candidates angry happy calm whispered enthusiastic saddened anxious

Python usage

import torch
import torchaudio
import torchaudio.transforms as T
from paraspeechclap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

SPEECH_MODEL = "microsoft/wavlm-large"
TEXT_MODEL = "ibm-granite/granite-embedding-278m-multilingual"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load ParaSpeechCLAP-Combined
model = CLAP(
    speech_name=SPEECH_MODEL,
    text_name=TEXT_MODEL,
    embedding_dim=768,
)
state_dict = torch.load("./checkpoints/paraspeechclap-combined.pth.tar", map_location=DEVICE)
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates auxiliary checkpoint keys not used at inference
model.to(DEVICE).eval()

# Initialize preprocessors
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(SPEECH_MODEL)
tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL)

# Load and preprocess audio (resample to 16 kHz mono)
waveform, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    waveform = T.Resample(sr, 16000)(waveform)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
audio = feature_extractor(
    waveform.squeeze(0), sampling_rate=16000, return_tensors="pt"
).input_values.to(DEVICE)  # (1, num_samples)

# Similarity with a compositional style description (both intrinsic + situational)
text_tokens = tokenizer(
    "A person with a deep, raspy voice is speaking in a whispered style.",
    return_tensors="pt", padding=True, truncation=True, max_length=512
)
text_tokens = {k: v.to(DEVICE) for k, v in text_tokens.items()}

with torch.no_grad():
    audio_emb = model.get_audio_embedding(audio, normalize=True)   # (1, 768)
    text_emb = model.get_text_embedding(text_tokens, normalize=True)  # (1, 768)
    similarity = (audio_emb @ text_emb.T).item()
    print(f"Similarity: {similarity:.4f}")

# Zero-shot classification across style candidates
candidates = ["angry", "happy", "calm", "whispered", "enthusiastic", "saddened", "anxious"]
prompts = [f"A person is speaking in a {s} style." for s in candidates]
tokens = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=512)
tokens = {k: v.to(DEVICE) for k, v in tokens.items()}

with torch.no_grad():
    text_embs = model.get_text_embedding(tokens, normalize=True)  # (7, 768)
    scores = (audio_emb @ text_embs.T).squeeze(0)  # (7,)
    pred = candidates[scores.argmax().item()]
    print(f"Predicted style: {pred}")
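
The dot-product scores above are unbounded cosine similarities; for an interpretable probability distribution over candidates you can apply a temperature-scaled softmax, as CLAP-style models do with their learned logit scale. A minimal self-contained sketch, with random normalized embeddings standing in for the model outputs and a hypothetical `logit_scale` value:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for model outputs: one audio embedding, 7 candidate text embeddings.
audio_emb = F.normalize(torch.randn(1, 768), dim=-1)   # (1, 768)
text_embs = F.normalize(torch.randn(7, 768), dim=-1)   # (7, 768)

logit_scale = 100.0  # hypothetical temperature; CLAP-style models learn this value
scores = (audio_emb @ text_embs.T).squeeze(0)          # cosine similarities, (7,)
probs = F.softmax(logit_scale * scores, dim=-1)        # distribution over candidates

candidates = ["angry", "happy", "calm", "whispered", "enthusiastic", "saddened", "anxious"]
best = candidates[probs.argmax().item()]
print(best, probs.max().item())
```

The argmax prediction is unchanged by the softmax; the temperature only controls how peaked the resulting distribution is.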

Best-of-N reranking

Use ParaSpeechCLAP-Combined as an inference-time reward model to select the best speech clip from N TTS candidates:

python scripts/best_of_n.py \
  checkpoint_path=./checkpoints/paraspeechclap-combined.pth.tar \
  input_base_dir=/path/to/tts_outputs \
  output_dir_name=best_of_N_paraspeechclap_combined
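
Under the hood, best-of-N reranking reduces to scoring each candidate clip against the target description and keeping the argmax. A hedged sketch of that selection logic, where `embed_audio` is a placeholder for the embed-and-normalize step shown in the Python usage section (here mocked with tiny synthetic embeddings):

```python
import torch

def best_of_n(clips, text_emb, embed_audio):
    """Return the index of the clip whose embedding best matches text_emb.

    clips: list of candidate clips; embed_audio: callable clip -> (1, D)
    L2-normalized embedding; text_emb: (1, D) L2-normalized text embedding.
    """
    scores = torch.cat([embed_audio(c) @ text_emb.T for c in clips]).squeeze(-1)
    return scores.argmax().item()

# Tiny synthetic check: clip 2's embedding is most aligned with the text embedding.
text_emb = torch.nn.functional.normalize(torch.tensor([[1.0, 0.0, 0.0, 0.0]]), dim=-1)
embs = [torch.tensor([[0.0, 1.0, 0.0, 0.0]]),
        torch.tensor([[0.5, 0.5, 0.0, 0.0]]),
        torch.tensor([[1.0, 0.1, 0.0, 0.0]])]
embed = lambda c: torch.nn.functional.normalize(embs[c], dim=-1)
print(best_of_n([0, 1, 2], text_emb, embed))  # → 2
```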

Evaluation

Evaluate on the paraspeechclap-eval-combined benchmark (compositional retrieval):

python scripts/evaluate_retrieval.py \
  --config-name eval/retrieval \
  checkpoint_path=./checkpoints/paraspeechclap-combined.pth.tar \
  data.dataset_name=ajd12342/paraspeechclap-eval-combined \
  data.audio_root=/path/to/audio_root \
  meta.results=./results_retrieval/paraspeechclap-eval-combined/ajd12342-paraspeechclap-combined
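
Retrieval metrics such as recall@k follow directly from the audio-text similarity matrix: for each audio query, check whether its matching text ranks in the top k. A self-contained sketch of that computation with a synthetic similarity matrix (not the evaluation script's exact implementation):

```python
import torch

def recall_at_k(sim, k):
    """sim: (N, N) audio-to-text similarities, where audio i matches text i."""
    topk = sim.topk(k, dim=-1).indices                 # (N, k) best text indices per query
    targets = torch.arange(sim.size(0)).unsqueeze(-1)  # (N, 1) ground-truth index per query
    return (topk == targets).any(dim=-1).float().mean().item()

# Synthetic 3x3 similarity matrix: only query 0 ranks its match first.
sim = torch.tensor([[0.9, 0.1, 0.2],
                    [0.3, 0.2, 0.8],
                    [0.1, 0.7, 0.6]])
print(recall_at_k(sim, 1))  # → 0.333... (1 of 3 queries correct at rank 1)
print(recall_at_k(sim, 2))  # → 0.666... (queries 0 and 2 correct within top 2)
```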

Choosing a Model

| Use case | Recommended model |
| --- | --- |
| Intrinsic (speaker-level) attributes only | ajd12342/paraspeechclap-intrinsic |
| Situational (utterance-level) attributes only | ajd12342/paraspeechclap-situational |
| Compositional descriptions (both types) | ajd12342/paraspeechclap-combined |

Citation

@article{diwan2026paraspeechclap,
  title={{ParaSpeechCLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
  author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
  journal={Under Review},
  year={2026}
}