# DELULU
Discriminative Embedding Learning Using Latent Units
A Speaker-Aware Self-Trained Speech Foundational Model

📄 Read our Paper · 💬 Questions? Contact Us
## Introduction
DELULU is a speaker-aware self-trained foundational model that achieves 62% relative improvement over HuBERT on speaker verification. While existing SSL models (HuBERT, WavLM, wav2vec 2.0) excel at content-driven tasks, they struggle with speaker-centric applications because their pseudo-labels prioritize phonetic similarity over speaker identity.
Key Innovation: DELULU uses frame-level embeddings from ReDimNet (a SOTA speaker verification model) to guide k-means clustering during pre-training, introducing a strong speaker-discriminative inductive bias.
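The pseudo-labeling idea can be illustrated in a few lines: cluster frame-level speaker embeddings with k-means and use the cluster IDs as discrete targets for masked prediction. The sketch below is a minimal illustration, not the released training code; `make_pseudo_labels` is a hypothetical helper, and random data stands in for real ReDimNet features.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def make_pseudo_labels(frame_embeddings, k=256, seed=0):
    """Cluster frame-level speaker embeddings into k discrete units.

    frame_embeddings: [num_frames, embed_dim] array of ReDimNet-style
    frame embeddings. Returns one integer pseudo-label per frame,
    used as the target of the masked-prediction pre-training loss.
    """
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=1024)
    return km.fit_predict(frame_embeddings)

# Toy run: 2000 frames of 192-dim stand-in "embeddings"
rng = np.random.default_rng(0)
labels = make_pseudo_labels(rng.standard_normal((2000, 192)).astype(np.float32))
print(labels.shape)  # (2000,)
```

Because the embeddings come from a speaker verification model rather than from acoustic features, frames from the same speaker tend to fall into the same clusters, which is what gives the pre-training targets their speaker-discriminative bias.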
## Quick Start

### Installation

```bash
pip install torch torchaudio transformers
```
### Usage

```python
from transformers import AutoModel
import torch
import torchaudio

# Load model
model = AutoModel.from_pretrained("cmu-mlsp/DELULU", trust_remote_code=True)
model.eval()

# Load audio (16 kHz)
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

# Extract features
with torch.no_grad():
    outputs = model(waveform)
features = outputs.last_hidden_state        # [batch, time, 768]
speaker_embedding = features.mean(dim=1)    # [batch, 768]
```
### Speaker Verification Example

```python
import torch
import torch.nn.functional as F

def verify_speaker(model, audio1, audio2, threshold=0.7):
    """Check if two audio samples are from the same speaker."""
    with torch.no_grad():
        emb1 = model(audio1).last_hidden_state.mean(dim=1)
        emb2 = model(audio2).last_hidden_state.mean(dim=1)
    similarity = F.cosine_similarity(emb1, emb2)
    return similarity.item() > threshold, similarity.item()

# Example
same_speaker, score = verify_speaker(model, waveform1, waveform2)
print(f"Same speaker: {same_speaker}, Score: {score:.4f}")
```
## Performance on Benchmarks

### Upstream Speaker Verification (Zero-Shot EER % ↓)
| Model | VoxCeleb1-O | SITW | Rel. Improvement |
|---|---|---|---|
| wav2vec 2.0 | 43.17 | 42.20 | - |
| HuBERT | 34.05 | 42.60 | - |
| WavLM | 35.93 | 44.00 | - |
| DELULU | 13.53 | 25.40 | 62% |
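EER (equal error rate) is the operating point where the false-acceptance rate over impostor trials equals the false-rejection rate over genuine trials, so lower is better. A self-contained sketch of how it can be computed from cosine similarity scores (this is an illustration, not the official evaluation script):

```python
import numpy as np

def compute_eer(genuine, impostor):
    """Equal Error Rate: the threshold where the false-acceptance rate
    (impostor trials scoring above it) equals the false-rejection rate
    (genuine trials scoring below it)."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Perfectly separable scores give 0% EER
print(compute_eer([0.9, 0.8, 0.95], [0.1, 0.2, 0.05]))  # 0.0
```

In practice the benchmark numbers above are computed over large trial lists (e.g. VoxCeleb1-O pairs), but the definition is exactly this.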
### Zero-Shot Speaker Profiling (Macro-F1 % ↑)
| Task | wav2vec 2.0 | HuBERT | WavLM | DELULU |
|---|---|---|---|---|
| Gender | 92.73 | 93.97 | 95.75 | 96.18 |
| Accent | 58.60 | 62.86 | 77.76 | 78.38 |
| Speaker Count | 64.20 | 64.83 | 62.71 | 67.13 |
| Spoof Detection | 52.88 | 53.51 | 51.44 | 57.20 |
| Age Estimation | 31.99 | 29.43 | 32.69 | 36.00 |
### Downstream Speaker Verification (Fine-tuned EER % ↓)
| Model | VoxCeleb1-O |
|---|---|
| MFCC baseline | 13.00 |
| HuBERT | 7.45 |
| DELULU | 5.63 |
## Architecture

DELULU follows the wav2vec 2.0 / HuBERT architecture, with modified convolutional strides that yield a 16 ms frame rate:
| Component | Configuration |
|---|---|
| Conv Layers | 7 layers × 512 channels |
| Kernel Sizes | [10, 3, 3, 3, 3, 2, 2] |
| Strides | [4, 2, 2, 2, 2, 2, 2] ← Key difference |
| Frame Rate | 16 ms (256× downsampling) |
| Transformer | 12 layers, 768 dim, 12 heads |
| Clusters | k=256 (ReDimNet-guided) |
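The 16 ms frame rate in the table follows directly from the stride column: the total downsampling factor is the product of the strides, and dividing by the 16 kHz sample rate gives the frame duration. A quick arithmetic check:

```python
import math

strides = [4, 2, 2, 2, 2, 2, 2]         # conv strides from the table above
downsampling = math.prod(strides)       # samples consumed per output frame
frame_ms = downsampling / 16000 * 1000  # frame duration at 16 kHz input

print(downsampling, frame_ms)  # 256 16.0
```

By comparison, HuBERT's strides [5, 2, 2, 2, 2, 2, 2] give 320× downsampling, i.e. 20 ms frames; the changed first stride is what aligns DELULU's frames to the 16 ms guidance features.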
## Training Details
| Setting | Value |
|---|---|
| Pre-training Data | LibriSpeech 960h |
| Hardware | 4× NVIDIA H100 GPUs |
| Training Steps | 400k updates |
| Batch Size | 87.5 sec audio/GPU |
| Optimizer | AdamW (lr=5e-4, β₁=0.9, β₂=0.98) |
| Warmup | 32k steps (linear) |
| Clustering | MiniBatchKMeans, k=256 |
## Use Cases

- ✅ Speaker Verification – Verify if two audio samples are from the same speaker
- ✅ Speaker Diarization – Segment audio by speaker identity
- ✅ Speaker Profiling – Predict age, gender, accent from voice
- ✅ Speaker Counting – Count unique speakers in audio
- ✅ Spoof Detection – Detect synthetic/manipulated speech
- ✅ Forensic Audio Analysis – Speaker identification in investigations
## Citation
If you find DELULU useful, please cite our paper:
```bibtex
@article{baali2025delulu,
  title={{DELULU}: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Trained Speech Foundational Model},
  author={Baali, Massa and Singh, Rita and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2510.17662},
  year={2025}
}
```
## Authors
Massa Baali, Rita Singh, Bhiksha Raj
Carnegie Mellon University, Language Technologies Institute
📧 mbaali@cs.cmu.edu