# ParaSpeechCLAP-Situational
ParaSpeechCLAP-Situational is a dual-encoder speech-text model specialized for situational (utterance-level) style attributes: properties tied to an individual utterance rather than to the speaker's identity, such as emotions and speaking styles like angry, happy, calm, whispered, and enthusiastic. Given a speech clip and a text style description, it embeds both into a shared 768-dimensional space and produces a similarity score. It is part of the ParaSpeechCLAP model family from the paper:
*ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining*.
Anuj Diwan, Eunsol Choi, David Harwath. Under review.
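At its core, scoring reduces to cosine similarity between the two unit-normalized embeddings in the shared space. A minimal sketch of that computation, with toy 4-dimensional vectors standing in for the real 768-dimensional encoder outputs:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity(speech_emb, text_emb):
    """Dot product of unit-norm vectors = cosine similarity in [-1, 1]."""
    a, b = normalize(speech_emb), normalize(text_emb)
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for the speech and text encoder outputs.
score = similarity([0.2, 0.1, 0.9, 0.4], [0.1, 0.0, 0.8, 0.5])
```

Higher scores mean the clip matches the description more closely; the model's embeddings are produced by the encoders shown in the Quick Start below.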
## Installation
```bash
git clone https://github.com/ajd12342/paraspeechclap.git
cd paraspeechclap
pip install -r requirements.txt
```
Download the model checkpoint:
```bash
mkdir -p checkpoints
huggingface-cli download ajd12342/paraspeechclap-situational paraspeechclap-situational.pth.tar --local-dir checkpoints
```
## Quick Start
### Command-line inference
```bash
# Compute similarity between audio and a text style description
python scripts/inference.py \
    --checkpoint_path ./checkpoints/paraspeechclap-situational.pth.tar \
    --audio_path /path/to/audio.wav \
    --text "A person is speaking in a whispered style."

# Zero-shot classification across emotion/speaking-style candidates
python scripts/inference.py \
    --checkpoint_path ./checkpoints/paraspeechclap-situational.pth.tar \
    --audio_path /path/to/audio.wav \
    --candidates angry happy calm whispered enthusiastic saddened anxious
```
### Python usage
```python
import torch
import torchaudio
import torchaudio.transforms as T
from paraspeechclap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

SPEECH_MODEL = "microsoft/wavlm-large"
TEXT_MODEL = "ibm-granite/granite-embedding-278m-multilingual"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load ParaSpeechCLAP-Situational
model = CLAP(
    speech_name=SPEECH_MODEL,
    text_name=TEXT_MODEL,
    embedding_dim=768,
)
state_dict = torch.load("./checkpoints/paraspeechclap-situational.pth.tar", map_location=DEVICE)
model.load_state_dict(state_dict, strict=False)
model.to(DEVICE).eval()

# Initialize preprocessors
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(SPEECH_MODEL)
tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL)

# Load and preprocess audio (resample to 16 kHz mono)
waveform, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    waveform = T.Resample(sr, 16000)(waveform)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
audio = feature_extractor(
    waveform.squeeze(0), sampling_rate=16000, return_tensors="pt"
).input_values.to(DEVICE)  # (1, num_samples)

# Similarity with a free-form situational description
text_tokens = tokenizer(
    "A person is speaking in a whispered style.",
    return_tensors="pt", padding=True, truncation=True, max_length=512
)
text_tokens = {k: v.to(DEVICE) for k, v in text_tokens.items()}
with torch.no_grad():
    audio_emb = model.get_audio_embedding(audio, normalize=True)  # (1, 768)
    text_emb = model.get_text_embedding(text_tokens, normalize=True)  # (1, 768)
similarity = (audio_emb @ text_emb.T).item()
print(f"Similarity: {similarity:.4f}")

# Zero-shot classification across situational candidate styles
candidates = ["angry", "happy", "calm", "whispered", "enthusiastic", "saddened", "anxious"]
prompts = [f"A person is speaking in a {s} style." for s in candidates]
tokens = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=512)
tokens = {k: v.to(DEVICE) for k, v in tokens.items()}
with torch.no_grad():
    text_embs = model.get_text_embedding(tokens, normalize=True)  # (7, 768)
scores = (audio_emb @ text_embs.T).squeeze(0)  # (7,)
pred = candidates[scores.argmax().item()]
print(f"Predicted style: {pred}")
```
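The zero-shot scores are raw cosine similarities; if you want a probability-like view over the candidates, a softmax with a temperature works. A small self-contained sketch (the temperature and scores below are illustrative, not learned model parameters):

```python
import math

def softmax(scores, temperature=0.07):
    """Turn similarity scores into a distribution; lower temperature sharpens it."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative cosine similarities for the 7 candidates above.
scores = [0.12, 0.08, 0.05, 0.31, 0.10, 0.04, 0.06]
probs = softmax(scores)  # sums to 1; largest mass on index 3 ("whispered")
```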
The full set of 21 situational candidate styles used in the ParaSpeechCLAP-Situational evaluation is:
angry, guilt, scared, happy, loud, sarcastic, sympathetic, desirous, enthusiastic, saddened, anxious, sleepy, admiring, disgusted, awed, pained, fast, calm, whispered, enunciated, confused.
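To classify over the full set, you can build prompts for all 21 styles with the same template used in the Python example above (swap in different wording if your setup uses another template):

```python
# All 21 situational candidate styles from the evaluation.
STYLES = [
    "angry", "guilt", "scared", "happy", "loud", "sarcastic", "sympathetic",
    "desirous", "enthusiastic", "saddened", "anxious", "sleepy", "admiring",
    "disgusted", "awed", "pained", "fast", "calm", "whispered", "enunciated",
    "confused",
]
prompts = [f"A person is speaking in a {s} style." for s in STYLES]
```

Pass `prompts` through the tokenizer and `model.get_text_embedding` exactly as in the 7-candidate example.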
## Best-of-N Reranking
Use ParaSpeechCLAP-Situational as an inference-time reward model to select the best speech clip from N TTS candidates:
```bash
python scripts/best_of_n.py \
    checkpoint_path=./checkpoints/paraspeechclap-situational.pth.tar \
    input_base_dir=/path/to/tts_outputs \
    output_dir_name=best_of_N_paraspeechclap_situational
```
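Conceptually, best-of-N reranking scores each candidate clip against the target style description and keeps the highest-scoring one. A minimal sketch of that selection step, with placeholder scores and hypothetical filenames standing in for real ParaSpeechCLAP similarities:

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn."""
    return max(candidates, key=score_fn)

# Placeholder audio-text similarities for three hypothetical TTS candidates.
scores = {"cand_0.wav": 0.21, "cand_1.wav": 0.35, "cand_2.wav": 0.18}
best = best_of_n(scores, scores.get)  # "cand_1.wav"
```

In the real script, each score would come from embedding the clip and the style description and taking their dot product, as in the Python usage example.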
## Evaluation
Evaluate on the `paraspeechclap-eval-situational` benchmark:
```bash
# Retrieval (R@1, R@10, Median Rank)
python scripts/evaluate_retrieval.py \
    --config-name eval/retrieval \
    checkpoint_path=./checkpoints/paraspeechclap-situational.pth.tar \
    data.dataset_name=ajd12342/paraspeechclap-eval-situational \
    data.audio_root=/path/to/audio_root \
    meta.results=./results_retrieval/paraspeechclap-eval-situational/ajd12342-paraspeechclap-situational

# Classification (UAR, Macro F1 — 21 situational classes)
python scripts/evaluate_classification.py \
    --config-name eval/classification/situational \
    checkpoint_path=./checkpoints/paraspeechclap-situational.pth.tar \
    data.audio_root=/path/to/audio_root \
    meta.results=./results_classification/paraspeechclap-eval-situational/ajd12342-paraspeechclap-situational/
```
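For reference, UAR (unweighted average recall) is the mean of per-class recalls, and macro F1 averages per-class F1 scores with equal weight. A small pure-Python illustration on toy 3-class labels (not the benchmark data, which uses 21 classes):

```python
def uar_and_macro_f1(y_true, y_pred, classes):
    """Compute UAR (mean per-class recall) and macro-averaged F1."""
    recalls, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)
    return sum(recalls) / len(recalls), sum(f1s) / len(f1s)

# Toy example: 5 utterances, 3 classes.
y_true = ["angry", "happy", "calm", "angry", "calm"]
y_pred = ["angry", "calm", "calm", "happy", "calm"]
uar, macro_f1 = uar_and_macro_f1(y_true, y_pred, ["angry", "happy", "calm"])
```

Both metrics weight every class equally, so they are robust to the class imbalance typical of style-labeled speech corpora.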
## Choosing a Model
| Use case | Recommended model |
|---|---|
| Intrinsic (speaker-level) attributes only | `ajd12342/paraspeechclap-intrinsic` |
| Situational (utterance-level) attributes only | `ajd12342/paraspeechclap-situational` |
| Compositional descriptions (both types) | `ajd12342/paraspeechclap-combined` |
## Related Resources
- GitHub Repository: https://github.com/ajd12342/paraspeechclap
- Models: `ajd12342/paraspeechclap-intrinsic`, `ajd12342/paraspeechclap-situational`, and `ajd12342/paraspeechclap-combined`
- Training Dataset: `ajd12342/paraspeechcaps-situational-train`
- Parent Dataset: `ajd12342/paraspeechcaps`
- Evaluation Datasets: `ajd12342/paraspeechclap-eval-intrinsic`, `ajd12342/paraspeechclap-eval-situational`, `ajd12342/paraspeechclap-eval-combined`
## Citation
```bibtex
@misc{diwan2026paraspeechclapdualencoderspeechtextmodel,
      title={ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
      author={Anuj Diwan and Eunsol Choi and David Harwath},
      year={2026},
      eprint={2603.28737},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.28737},
}
```