# ParaSpeechCLAP-Situational
ParaSpeechCLAP-Situational is a dual-encoder speech-text model specialized for situational (utterance-level) style attributes: properties tied to an individual utterance rather than to the speaker's identity, such as emotions and speaking styles like angry, happy, calm, whispered, and enthusiastic. Given a speech clip and a text style description, it embeds both into a shared 768-dimensional space and produces a similarity score. It is part of the ParaSpeechCLAP model family from the paper:
*ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining*.
Anuj Diwan, Eunsol Choi, David Harwath. Under review.
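At its core, scoring reduces to cosine similarity between the two unit-normalized embeddings in the shared space. A minimal sketch of that computation, with toy 4-dimensional vectors standing in for the real 768-dimensional encoder outputs:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity(speech_emb, text_emb):
    """Dot product of unit-norm vectors = cosine similarity in [-1, 1]."""
    a, b = normalize(speech_emb), normalize(text_emb)
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for the speech and text encoder outputs.
score = similarity([0.2, 0.1, 0.9, 0.4], [0.1, 0.0, 0.8, 0.5])
```

Higher scores mean the clip matches the description more closely; the model's embeddings are produced by the encoders shown in the Quick Start below.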
## Installation
```bash
git clone https://github.com/ajd12342/paraspeechclap.git
cd paraspeechclap
pip install -r requirements.txt
```
Download the model checkpoint:
```bash
mkdir -p checkpoints
huggingface-cli download ajd12342/paraspeechclap-situational paraspeechclap-situational.pth.tar --local-dir checkpoints
```
## Quick Start
### Command-line inference
```bash
# Compute similarity between audio and a text style description
python scripts/inference.py \
    --checkpoint_path ./checkpoints/paraspeechclap-situational.pth.tar \
    --audio_path /path/to/audio.wav \
    --text "A person is speaking in a whispered style."

# Zero-shot classification across emotion/speaking-style candidates
python scripts/inference.py \
    --checkpoint_path ./checkpoints/paraspeechclap-situational.pth.tar \
    --audio_path /path/to/audio.wav \
    --candidates angry happy calm whispered enthusiastic saddened anxious
```
### Python usage
```python
import torch
import torchaudio
import torchaudio.transforms as T
from paraspeechclap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

SPEECH_MODEL = "microsoft/wavlm-large"
TEXT_MODEL = "ibm-granite/granite-embedding-278m-multilingual"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load ParaSpeechCLAP-Situational
model = CLAP(
    speech_name=SPEECH_MODEL,
    text_name=TEXT_MODEL,
    embedding_dim=768,
)
state_dict = torch.load("./checkpoints/paraspeechclap-situational.pth.tar", map_location=DEVICE)
model.load_state_dict(state_dict, strict=False)
model.to(DEVICE).eval()

# Initialize preprocessors
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(SPEECH_MODEL)
tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL)

# Load and preprocess audio (resample to 16 kHz mono)
waveform, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    waveform = T.Resample(sr, 16000)(waveform)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
audio = feature_extractor(
    waveform.squeeze(0), sampling_rate=16000, return_tensors="pt"
).input_values.to(DEVICE)  # (1, num_samples)

# Similarity with a free-form situational description
text_tokens = tokenizer(
    "A person is speaking in a whispered style.",
    return_tensors="pt", padding=True, truncation=True, max_length=512
)
text_tokens = {k: v.to(DEVICE) for k, v in text_tokens.items()}
with torch.no_grad():
    audio_emb = model.get_audio_embedding(audio, normalize=True)  # (1, 768)
    text_emb = model.get_text_embedding(text_tokens, normalize=True)  # (1, 768)
similarity = (audio_emb @ text_emb.T).item()
print(f"Similarity: {similarity:.4f}")

# Zero-shot classification across situational candidate styles
candidates = ["angry", "happy", "calm", "whispered", "enthusiastic", "saddened", "anxious"]
prompts = [f"A person is speaking in a {s} style." for s in candidates]
tokens = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=512)
tokens = {k: v.to(DEVICE) for k, v in tokens.items()}
with torch.no_grad():
    text_embs = model.get_text_embedding(tokens, normalize=True)  # (7, 768)
scores = (audio_emb @ text_embs.T).squeeze(0)  # (7,)
pred = candidates[scores.argmax().item()]
print(f"Predicted style: {pred}")
```
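The zero-shot scores are raw cosine similarities; if you want a probability-like view over the candidates, a softmax with a temperature works. A small self-contained sketch (the temperature and scores below are illustrative, not learned model parameters):

```python
import math

def softmax(scores, temperature=0.07):
    """Turn similarity scores into a distribution; lower temperature sharpens it."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative cosine similarities for the 7 candidates above.
scores = [0.12, 0.08, 0.05, 0.31, 0.10, 0.04, 0.06]
probs = softmax(scores)  # sums to 1; largest mass on index 3 ("whispered")
```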
The full set of 21 situational candidate styles used in the ParaSpeechCLAP-Situational evaluation is:
angry, guilt, scared, happy, loud, sarcastic, sympathetic, desirous, enthusiastic, saddened, anxious, sleepy, admiring, disgusted, awed, pained, fast, calm, whispered, enunciated, confused.
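To classify over the full set, you can build prompts for all 21 styles with the same template used in the Python example above (swap in different wording if your setup uses another template):

```python
# All 21 situational candidate styles from the evaluation.
STYLES = [
    "angry", "guilt", "scared", "happy", "loud", "sarcastic", "sympathetic",
    "desirous", "enthusiastic", "saddened", "anxious", "sleepy", "admiring",
    "disgusted", "awed", "pained", "fast", "calm", "whispered", "enunciated",
    "confused",
]
prompts = [f"A person is speaking in a {s} style." for s in STYLES]
```

Pass `prompts` through the tokenizer and `model.get_text_embedding` exactly as in the 7-candidate example.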
## Best-of-N Reranking
Use ParaSpeechCLAP-Situational as an inference-time reward model to select the best speech clip from N TTS candidates:
```bash
python scripts/best_of_n.py \
    checkpoint_path=./checkpoints/paraspeechclap-situational.pth.tar \
    input_base_dir=/path/to/tts_outputs \
    output_dir_name=best_of_N_paraspeechclap_situational
```
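Conceptually, best-of-N reranking scores each candidate clip against the target style description and keeps the highest-scoring one. A minimal sketch of that selection step, with placeholder scores and hypothetical filenames standing in for real ParaSpeechCLAP similarities:

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn."""
    return max(candidates, key=score_fn)

# Placeholder audio-text similarities for three hypothetical TTS candidates.
scores = {"cand_0.wav": 0.21, "cand_1.wav": 0.35, "cand_2.wav": 0.18}
best = best_of_n(scores, scores.get)  # "cand_1.wav"
```

In the real script, each score would come from embedding the clip and the style description and taking their dot product, as in the Python usage example.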
## Evaluation
Evaluate on the `paraspeechclap-eval-situational` benchmark:
```bash
# Retrieval (R@1, R@10, Median Rank)
python scripts/evaluate_retrieval.py \
    --config-name eval/retrieval \
    checkpoint_path=./checkpoints/paraspeechclap-situational.pth.tar \
    data.dataset_name=ajd12342/paraspeechclap-eval-situational \
    data.audio_root=/path/to/audio_root \
    meta.results=./results_retrieval/paraspeechclap-eval-situational/ajd12342-paraspeechclap-situational

# Classification (UAR, Macro F1 — 21 situational classes)
python scripts/evaluate_classification.py \
    --config-name eval/classification/situational \
    checkpoint_path=./checkpoints/paraspeechclap-situational.pth.tar \
    data.audio_root=/path/to/audio_root \
    meta.results=./results_classification/paraspeechclap-eval-situational/ajd12342-paraspeechclap-situational/
```
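For reference, UAR (unweighted average recall) is the mean of per-class recalls, and macro F1 averages per-class F1 scores with equal weight. A small pure-Python illustration on toy 3-class labels (not the benchmark data, which uses 21 classes):

```python
def uar_and_macro_f1(y_true, y_pred, classes):
    """Compute UAR (mean per-class recall) and macro-averaged F1."""
    recalls, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)
    return sum(recalls) / len(recalls), sum(f1s) / len(f1s)

# Toy example: 5 utterances, 3 classes.
y_true = ["angry", "happy", "calm", "angry", "calm"]
y_pred = ["angry", "calm", "calm", "happy", "calm"]
uar, macro_f1 = uar_and_macro_f1(y_true, y_pred, ["angry", "happy", "calm"])
```

Both metrics weight every class equally, so they are robust to the class imbalance typical of style-labeled speech corpora.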
## Choosing a Model
| Use case | Recommended model |
|---|---|
| Intrinsic (speaker-level) attributes only | `ajd12342/paraspeechclap-intrinsic` |
| Situational (utterance-level) attributes only | `ajd12342/paraspeechclap-situational` |
| Compositional descriptions (both types) | `ajd12342/paraspeechclap-combined` |
## Related Resources
- GitHub Repository: https://github.com/ajd12342/paraspeechclap
- Models: `ajd12342/paraspeechclap-intrinsic`, `ajd12342/paraspeechclap-situational`, and `ajd12342/paraspeechclap-combined`
- Training Dataset: `ajd12342/paraspeechcaps-situational-train`
- Parent Dataset: `ajd12342/paraspeechcaps`
- Evaluation Datasets: `ajd12342/paraspeechclap-eval-intrinsic`, `ajd12342/paraspeechclap-eval-situational`, `ajd12342/paraspeechclap-eval-combined`
## Citation
```bibtex
@misc{diwan2026paraspeechclapdualencoderspeechtextmodel,
      title={ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
      author={Anuj Diwan and Eunsol Choi and David Harwath},
      year={2026},
      eprint={2603.28737},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.28737},
}
```