DELULU

Discriminative Embedding Learning Using Latent Units
A Speaker-Aware Self-Trained Speech Foundational Model

📖 Read our Paper · 💬 Questions? Contact Us


Introduction

DELULU is a speaker-aware, self-trained foundational model that achieves up to a 62% relative reduction in equal error rate (EER) on zero-shot speaker verification compared with prior SSL models such as HuBERT and WavLM. While existing SSL models (HuBERT, WavLM, wav2vec 2.0) excel at content-driven tasks, they struggle with speaker-centric applications because their pseudo-labels prioritize phonetic similarity over speaker identity.

Key Innovation: DELULU uses frame-level embeddings from ReDimNet (a SOTA speaker verification model) to guide k-means clustering during pre-training, introducing a strong speaker-discriminative inductive bias.
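To make the clustering idea concrete, here is a toy sketch of how cluster indices over speaker embeddings can serve as pseudo-labels. It is not the authors' pipeline: the 2-D points are synthetic stand-ins for ReDimNet frame-level embeddings, and `nearest`, `minibatch_kmeans`, and every hyperparameter shown are hypothetical names and values chosen for illustration. The update rule is a plain mini-batch k-means with per-center running means; a library implementation such as scikit-learn's `MiniBatchKMeans` would be the usual choice in practice.

```python
import random

def nearest(centers, x):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centers[c], x)))

def minibatch_kmeans(points, k, batch_size=32, n_iters=100, seed=0):
    """Toy mini-batch k-means with per-center running-mean updates."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iters):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assigned = [nearest(centers, x) for x in batch]
        for x, c in zip(batch, assigned):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-center learning rate
            centers[c] = [(1 - eta) * m + eta * xi
                          for m, xi in zip(centers[c], x)]
    return centers

# Placeholder "speaker embeddings": two synthetic speakers in a 2-D space.
rng = random.Random(0)
speaker_a = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(100)]
speaker_b = [[rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)] for _ in range(100)]
frames = speaker_a + speaker_b

centers = minibatch_kmeans(frames, k=2)
# Each frame's cluster index plays the role of a pseudo-label, i.e. a
# discrete target the model would be trained to predict for that frame.
pseudo_labels = [nearest(centers, x) for x in frames]
print(pseudo_labels[:5], pseudo_labels[-5:])
```

Because the clusters are formed in a speaker-embedding space rather than an acoustic or phonetic one, frames that share a pseudo-label tend to share speaker characteristics, which is the inductive bias described above.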

Quick Start

Installation

pip install torch torchaudio transformers

Usage

from transformers import AutoModel
import torch
import torchaudio

# Load model
model = AutoModel.from_pretrained("cmu-mlsp/DELULU", trust_remote_code=True)
model.eval()

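# Load audio and convert to 16 kHz mono, shape [1, time]
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
# Downmix channels so a stereo file is not treated as a batch of two utterances
waveform = waveform.mean(dim=0, keepdim=True)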

# Extract features
with torch.no_grad():
    outputs = model(waveform)
    features = outputs.last_hidden_state        # [batch, time, 768]
    speaker_embedding = features.mean(dim=1)    # [batch, 768]

Speaker Verification Example

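import torch
import torch.nn.functional as F

def verify_speaker(model, audio1, audio2, threshold=0.7):
    """Check if two audio samples are from the same speaker.

    audio1 and audio2 are 16 kHz mono waveforms of shape [1, time].
    The default threshold is a placeholder; tune it on held-out trials.
    """
    with torch.no_grad():
        # Mean-pool frame-level features into one embedding per utterance
        emb1 = model(audio1).last_hidden_state.mean(dim=1)
        emb2 = model(audio2).last_hidden_state.mean(dim=1)
        similarity = F.cosine_similarity(emb1, emb2).item()
    return similarity > threshold, similarity

# Example (waveform1 and waveform2 are loaded as in the Usage section)
same_speaker, score = verify_speaker(model, waveform1, waveform2)
print(f"Same speaker: {same_speaker}, Score: {score:.4f}")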
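The fixed 0.7 threshold in the example is only a starting point. One common way to pick an operating point is the equal error rate (EER), the point where the false-accept and false-reject rates coincide. The sketch below is a minimal pure-Python approximation that sweeps the observed scores as candidate thresholds; `compute_eer` is a hypothetical helper and the trial scores are made up for illustration, not taken from any benchmark.

```python
def compute_eer(scores, labels):
    """Return (eer, threshold) for similarity scores and 0/1 same-speaker labels.

    A trial is accepted when its score is >= threshold.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")
    best = None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in nontargets) / len(nontargets)  # false accepts
        frr = sum(s < t for s in targets) / len(targets)         # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Made-up cosine similarities for six trials (1 = same speaker).
scores = [0.91, 0.84, 0.42, 0.63, 0.15, 0.08]
labels = [1, 1, 1, 0, 0, 0]
eer, threshold = compute_eer(scores, labels)
print(f"EER: {eer:.2%} at threshold {threshold:.2f}")
```

With real data, the scores would come from `verify_speaker` (or the underlying cosine similarity) run over a labelled trial list, and the resulting threshold would replace the 0.7 default.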

Performance on Benchmarks

Upstream Speaker Verification (Zero-Shot EER % ↓)

| Model | VoxCeleb1-O | SITW | Rel. Improvement |
|---|---|---|---|
| wav2vec 2.0 | 43.17 | 42.20 | - |
| HuBERT | 34.05 | 42.60 | - |
| WavLM | 35.93 | 44.00 | - |
| DELULU | 13.53 | 25.40 | 62% |
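The table does not say which baseline the 62% figure is measured against. As a worked check, the sketch below recomputes the relative EER reduction of DELULU over each baseline from the numbers in the table, assuming "relative improvement" means (baseline EER − DELULU EER) / baseline EER. The helper name `relative_reduction` is hypothetical.

```python
# Zero-shot EER (%) on (VoxCeleb1-O, SITW), copied from the table above.
eer = {
    "wav2vec 2.0": (43.17, 42.20),
    "HuBERT": (34.05, 42.60),
    "WavLM": (35.93, 44.00),
    "DELULU": (13.53, 25.40),
}

def relative_reduction(baseline, system):
    """Relative EER reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline

delulu_vox, delulu_sitw = eer["DELULU"]
for name, (vox, sitw) in eer.items():
    if name == "DELULU":
        continue
    vox_gain = relative_reduction(vox, delulu_vox)
    sitw_gain = relative_reduction(sitw, delulu_sitw)
    print(f"{name:12s} VoxCeleb1-O: {vox_gain:5.1f}%  SITW: {sitw_gain:5.1f}%")
```

Under this definition, the VoxCeleb1-O reduction comes out at about 60% over HuBERT and about 62% over WavLM, while the SITW reductions are closer to 40%.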

Zero-Shot Speaker Profiling (Macro-F1 % ↑)

| Task | wav2vec 2.0 | HuBERT | WavLM | DELULU |
|---|---|---|---|---|
| Gender | 92.73 | 93.97 | 95.75 | 96.18 |
| Accent | 58.60 | 62.86 | 77.76 | 78.38 |
| Speaker Count | 64.20 | 64.83 | 62.71 | 67.13 |
| Spoof Detection | 52.88 | 53.51 | 51.44 | 57.20 |
| Age Estimation | 31.99 | 29.43 | 32.69 | 36.00 |

Downstream Speaker Verification (Fine-tuned EER % ↓)

| Model | VoxCeleb1-O |
|---|---|
| MFCC baseline | 13.00 |
| HuBERT | 7.45 |
| DELULU | 5.63 |

Architecture

DELULU follows the wav2vec 2.0 / HuBERT architecture, with a first-layer convolutional stride of 4 instead of the standard 5, so the encoder emits one frame every 16 ms rather than every 20 ms:

| Component | Configuration |
|---|---|
| Conv Layers | 7 layers × 512 channels |
| Kernel Sizes | [10, 3, 3, 3, 3, 2, 2] |
| Strides | [4, 2, 2, 2, 2, 2, 2] ← Key difference |
| Frame Rate | 16 ms (256× downsampling) |
| Transformer | 12 layers, 768 dim, 12 heads |
| Clusters | k=256 (ReDimNet-guided) |
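The 16 ms frame rate follows directly from the strides. The sketch below works through the arithmetic, assuming unpadded 1-D convolutions as in the standard wav2vec 2.0 / HuBERT feature encoder (the padding scheme is an assumption, since the table does not state it). The helper names `num_frames` and `receptive_field` are hypothetical.

```python
from math import prod

KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [4, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE = 16_000

def num_frames(num_samples, kernels=KERNELS, strides=STRIDES):
    """Output length of the conv stack, assuming unpadded 1-D convolutions."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

def receptive_field(kernels=KERNELS, strides=STRIDES):
    """Number of input samples seen by one output frame."""
    field, jump = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump
        jump *= s
    return field

hop = prod(STRIDES)                    # total downsampling factor
print(hop, 1000 * hop / SAMPLE_RATE)   # 256 samples, 16.0 ms per frame
print(receptive_field())               # 322 samples (about 20 ms)
print(num_frames(SAMPLE_RATE))         # 62 frames for one second of audio
```

For comparison, the same functions with a first-layer stride of 5 give a 320-sample hop (20 ms) and 49 frames per second of audio.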

Training Details

| Setting | Value |
|---|---|
| Pre-training Data | LibriSpeech 960h |
| Hardware | 4× NVIDIA H100 GPUs |
| Training Steps | 400k updates |
| Batch Size | 87.5 s of audio per GPU |
| Optimizer | AdamW (lr=5e-4, β₁=0.9, β₂=0.98) |
| Warmup | 32k steps (linear) |
| Clustering | MiniBatchKMeans, k=256 |
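As an illustration of the schedule implied by these settings, the sketch below computes the learning rate at a given update. Only the peak rate, the 32k-step linear warmup, and the 400k total updates come from the table; the linear decay to zero after warmup is an assumption, since the decay shape is not stated. `learning_rate` is a hypothetical helper.

```python
PEAK_LR = 5e-4
WARMUP_STEPS = 32_000
TOTAL_STEPS = 400_000

def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (decay is assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

# Start, mid-warmup, peak, mid-decay, and end of training.
for step in (0, 16_000, 32_000, 216_000, 400_000):
    print(step, learning_rate(step))
```

A function of this shape could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier `learning_rate(step) / PEAK_LR` instead of the absolute rate.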

Use Cases

  • ✅ Speaker Verification: verify whether two audio samples are from the same speaker
  • ✅ Speaker Diarization: segment audio by speaker identity
  • ✅ Speaker Profiling: predict age, gender, and accent from voice
  • ✅ Speaker Counting: count unique speakers in audio
  • ✅ Spoof Detection: detect synthetic or manipulated speech
  • ✅ Forensic Audio Analysis: speaker identification in investigations

Citation

If you find DELULU useful, please cite our paper:

@article{baali2025delulu,
  title={{DELULU}: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Trained Speech Foundational Model},
  author={Baali, Massa and Singh, Rita and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2510.17662},
  year={2025}
}

Authors

Massa Baali, Rita Singh, Bhiksha Raj
Carnegie Mellon University, Language Technologies Institute
📧 mbaali@cs.cmu.edu
