# DELULU
Discriminative Embedding Learning Using Latent Units
A Speaker-Aware Self-Trained Speech Foundational Model

📄 Read our Paper · 💬 Questions? Contact Us
## Introduction
DELULU is a speaker-aware self-trained foundational model that achieves 62% relative improvement over HuBERT on speaker verification. While existing SSL models (HuBERT, WavLM, wav2vec 2.0) excel at content-driven tasks, they struggle with speaker-centric applications because their pseudo-labels prioritize phonetic similarity over speaker identity.
Key Innovation: DELULU uses frame-level embeddings from ReDimNet (a SOTA speaker verification model) to guide k-means clustering during pre-training, introducing a strong speaker-discriminative inductive bias.
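The pseudo-labeling idea can be illustrated in a few lines: cluster frame-level speaker embeddings with k-means and use the cluster IDs as discrete targets for masked prediction. The sketch below is a minimal illustration, not the released training code; `make_pseudo_labels` is a hypothetical helper, and random data stands in for real ReDimNet features.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def make_pseudo_labels(frame_embeddings, k=256, seed=0):
    """Cluster frame-level speaker embeddings into k discrete units.

    frame_embeddings: [num_frames, embed_dim] array of ReDimNet-style
    frame embeddings. Returns one integer pseudo-label per frame,
    used as the target of the masked-prediction pre-training loss.
    """
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=1024)
    return km.fit_predict(frame_embeddings)

# Toy run: 2000 frames of 192-dim stand-in "embeddings"
rng = np.random.default_rng(0)
labels = make_pseudo_labels(rng.standard_normal((2000, 192)).astype(np.float32))
print(labels.shape)  # (2000,)
```

Because the embeddings come from a speaker verification model rather than from acoustic features, frames from the same speaker tend to fall into the same clusters, which is what gives the pre-training targets their speaker-discriminative bias.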
## Quick Start

### Installation

```bash
pip install torch torchaudio transformers
```
### Usage

```python
from transformers import AutoModel
import torch
import torchaudio

# Load model
model = AutoModel.from_pretrained("cmu-mlsp/DELULU", trust_remote_code=True)
model.eval()

# Load audio (16 kHz)
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

# Extract features
with torch.no_grad():
    outputs = model(waveform)
features = outputs.last_hidden_state        # [batch, time, 768]
speaker_embedding = features.mean(dim=1)    # [batch, 768]
```
### Speaker Verification Example

```python
import torch
import torch.nn.functional as F

def verify_speaker(model, audio1, audio2, threshold=0.7):
    """Check if two audio samples are from the same speaker."""
    with torch.no_grad():
        emb1 = model(audio1).last_hidden_state.mean(dim=1)
        emb2 = model(audio2).last_hidden_state.mean(dim=1)
    similarity = F.cosine_similarity(emb1, emb2)
    return similarity.item() > threshold, similarity.item()

# Example
same_speaker, score = verify_speaker(model, waveform1, waveform2)
print(f"Same speaker: {same_speaker}, Score: {score:.4f}")
```
## Performance on Benchmarks

### Upstream Speaker Verification (Zero-Shot EER % ↓)
| Model | VoxCeleb1-O | SITW | Rel. Improvement |
|---|---|---|---|
| wav2vec 2.0 | 43.17 | 42.20 | - |
| HuBERT | 34.05 | 42.60 | - |
| WavLM | 35.93 | 44.00 | - |
| DELULU | 13.53 | 25.40 | 62% |
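EER (equal error rate) is the operating point where the false-acceptance rate over impostor trials equals the false-rejection rate over genuine trials, so lower is better. A self-contained sketch of how it can be computed from cosine similarity scores (this is an illustration, not the official evaluation script):

```python
import numpy as np

def compute_eer(genuine, impostor):
    """Equal Error Rate: the threshold where the false-acceptance rate
    (impostor trials scoring above it) equals the false-rejection rate
    (genuine trials scoring below it)."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Perfectly separable scores give 0% EER
print(compute_eer([0.9, 0.8, 0.95], [0.1, 0.2, 0.05]))  # 0.0
```

In practice the benchmark numbers above are computed over large trial lists (e.g. VoxCeleb1-O pairs), but the definition is exactly this.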
### Zero-Shot Speaker Profiling (Macro-F1 % ↑)
| Task | wav2vec 2.0 | HuBERT | WavLM | DELULU |
|---|---|---|---|---|
| Gender | 92.73 | 93.97 | 95.75 | 96.18 |
| Accent | 58.60 | 62.86 | 77.76 | 78.38 |
| Speaker Count | 64.20 | 64.83 | 62.71 | 67.13 |
| Spoof Detection | 52.88 | 53.51 | 51.44 | 57.20 |
| Age Estimation | 31.99 | 29.43 | 32.69 | 36.00 |
### Downstream Speaker Verification (Fine-tuned EER % ↓)
| Model | VoxCeleb1-O |
|---|---|
| MFCC baseline | 13.00 |
| HuBERT | 7.45 |
| DELULU | 5.63 |
## Architecture

DELULU follows the wav2vec 2.0 / HuBERT architecture, with modified convolutional strides that yield a 16 ms frame rate:
| Component | Configuration |
|---|---|
| Conv Layers | 7 layers × 512 channels |
| Kernel Sizes | [10, 3, 3, 3, 3, 2, 2] |
| Strides | [4, 2, 2, 2, 2, 2, 2] ← Key difference |
| Frame Rate | 16 ms (256× downsampling) |
| Transformer | 12 layers, 768 dim, 12 heads |
| Clusters | k=256 (ReDimNet-guided) |
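The 16 ms frame rate in the table follows directly from the stride column: the total downsampling factor is the product of the strides, and dividing by the 16 kHz sample rate gives the frame duration. A quick arithmetic check:

```python
import math

strides = [4, 2, 2, 2, 2, 2, 2]         # conv strides from the table above
downsampling = math.prod(strides)       # samples consumed per output frame
frame_ms = downsampling / 16000 * 1000  # frame duration at 16 kHz input

print(downsampling, frame_ms)  # 256 16.0
```

By comparison, HuBERT's strides [5, 2, 2, 2, 2, 2, 2] give 320× downsampling, i.e. 20 ms frames; the changed first stride is what aligns DELULU's frames to the 16 ms guidance features.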
## Training Details
| Setting | Value |
|---|---|
| Pre-training Data | LibriSpeech 960h |
| Hardware | 4× NVIDIA H100 GPUs |
| Training Steps | 400k updates |
| Batch Size | 87.5 sec audio/GPU |
| Optimizer | AdamW (lr=5e-4, β₁=0.9, β₂=0.98) |
| Warmup | 32k steps (linear) |
| Clustering | MiniBatchKMeans, k=256 |
## Use Cases

- ✅ Speaker Verification – Verify if two audio samples are from the same speaker
- ✅ Speaker Diarization – Segment audio by speaker identity
- ✅ Speaker Profiling – Predict age, gender, accent from voice
- ✅ Speaker Counting – Count unique speakers in audio
- ✅ Spoof Detection – Detect synthetic/manipulated speech
- ✅ Forensic Audio Analysis – Speaker identification in investigations
## Citation
If you find DELULU useful, please cite our paper:
```bibtex
@article{baali2025delulu,
  title={{DELULU}: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Trained Speech Foundational Model},
  author={Baali, Massa and Singh, Rita and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2510.17662},
  year={2025}
}
```
## Authors
Massa Baali, Rita Singh, Bhiksha Raj
Carnegie Mellon University, Language Technologies Institute
📧 mbaali@cs.cmu.edu