Simple Speaker Diarization with SpeechBrain X-Vectors
Comparing Silence-Based VAD vs Neural VAD (pyannote)
Speaker diarization answers a simple question: “who spoke when?” While full diarization pipelines exist, they can feel heavy if you just want a clear, hackable baseline.
This post presents one simple diarization pipeline built on SpeechBrain x-vector embeddings, with two interchangeable Voice Activity Detection (VAD) methods:
- Method A: Silence-based VAD (minimal dependencies)
- Method B: Neural VAD using pyannote (more robust)
Both methods share the same embedding, clustering, and segment-merging logic.
Pipeline Overview (Shared)
Regardless of VAD choice, the diarization flow is:
- Detect speech regions (VAD)
- Extract SpeechBrain x-vector embeddings
- Cluster segments by cosine similarity
- Merge adjacent segments from the same speaker
- Save speaker-separated WAV files
Only Step 1 (VAD) changes.
Why SpeechBrain X-Vectors?
We use:
speechbrain/spkrec-xvect-voxceleb
Why?
- Widely used, well-understood speaker embeddings
- CPU-friendly
- Available directly on Hugging Face
- Easy to integrate into custom pipelines
This makes it ideal for learning, prototyping, and lightweight production tasks.
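Downstream, the pipeline compares x-vectors by cosine similarity: two utterances from the same speaker yield embeddings that point in a similar direction. A minimal sketch of that comparison, using random toy vectors in place of real 512-dimensional x-vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 512-dimensional x-vectors: a second utterance from
# the same speaker is modeled as the same vector plus small noise.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=512)
speaker_a_again = speaker_a + 0.1 * rng.normal(size=512)
speaker_b = rng.normal(size=512)

print(cosine_similarity(speaker_a, speaker_a_again))  # close to 1.0
print(cosine_similarity(speaker_a, speaker_b))        # close to 0.0
```

Real x-vectors behave the same way in aggregate, which is why average-linkage clustering over cosine distance is enough to separate speakers.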
VAD Options
Method A: Silence-Based VAD (Minimal)
Uses pydub.silence.detect_nonsilent.
Pros
- Extremely simple
- No neural models
- Fast and transparent
Cons
- Sensitive to noise
- Poor boundary precision
- No overlap handling
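Under the hood, silence-based VAD is just frame-energy thresholding. The sketch below reimplements the idea in plain numpy (frame size, threshold, and the 16 kHz mono assumption are illustrative choices, not pydub's internals):

```python
import numpy as np

def energy_vad(samples, rate=16000, frame_ms=50, thresh_db=-40.0):
    """Return (start_ms, end_ms) regions whose frame RMS energy exceeds
    thresh_db relative to full scale -- the core idea behind
    silence-based VAD."""
    frame = int(rate * frame_ms / 1000)
    regions, start = [], None
    for i in range(0, len(samples) - frame + 1, frame):
        rms = np.sqrt(np.mean(samples[i:i + frame] ** 2))
        db = 20 * np.log10(rms) if rms > 0 else -np.inf
        t = i * 1000 // rate
        if db > thresh_db and start is None:
            start = t  # speech onset
        elif db <= thresh_db and start is not None:
            regions.append((start, t))  # speech offset
            start = None
    if start is not None:
        regions.append((start, len(samples) * 1000 // rate))
    return regions

# 1 s silence, 1 s 440 Hz tone, 1 s silence
rate = 16000
t = np.arange(rate) / rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
signal = np.concatenate([np.zeros(rate), tone, np.zeros(rate)])
print(energy_vad(signal, rate))  # → [(1000, 2000)]
```

This is also why the method is noise-sensitive: any background energy above the threshold is indistinguishable from speech.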
Method B: Neural VAD (pyannote)
Uses pyannote's VoiceActivityDetection pipeline with its standard segmentation model:
pyannote/segmentation
Pros
- Much better speech boundaries
- Robust to background noise
- Handles quiet speakers well
Cons
- Extra dependency
- Slightly heavier runtime
- Gated model: requires accepting its terms on Hugging Face and an access token
Comparison Summary
| Aspect | Silence-Based VAD | pyannote VAD |
|---|---|---|
| Dependencies | Very low | Moderate |
| Noise robustness | Low | High |
| Boundary accuracy | Rough | Clean |
| Recommended for | Prototypes | Real recordings |
Unified Example Code
Below is a single diarization script. Choose your VAD by toggling one flag.
```python
# diarization_speechbrain.py
# SpeechBrain x-vectors with pluggable VAD
import os
import sys
from collections import defaultdict

import numpy as np
from pydub import AudioSegment
from sklearn.cluster import AgglomerativeClustering
# On speechbrain >= 1.0: from speechbrain.inference import EncoderClassifier
from speechbrain.pretrained import EncoderClassifier

USE_PYANNOTE_VAD = True  # ← toggle here

if len(sys.argv) < 2:
    print("Usage: python diarization_speechbrain.py <input_wav>")
    sys.exit(1)

audio_path = sys.argv[1]
audio = AudioSegment.from_wav(audio_path)
os.makedirs("segment", exist_ok=True)

# -----------------------------
# 1. Voice Activity Detection
# -----------------------------
segments = []
if USE_PYANNOTE_VAD:
    from pyannote.audio.pipelines import VoiceActivityDetection

    # Gated model: accept its terms on Hugging Face and authenticate
    # (e.g. huggingface-cli login) before first use.
    vad = VoiceActivityDetection(segmentation="pyannote/segmentation")
    vad.instantiate({
        "onset": 0.5,
        "offset": 0.5,
        "min_duration_on": 1.0,
        "min_duration_off": 0.5,
    })
    vad_annotation = vad(audio_path)
    for seg in vad_annotation.get_timeline():
        s, e = int(seg.start * 1000), int(seg.end * 1000)
        if e - s >= 1000:  # keep segments of at least 1 s
            segments.append((s, e))
else:
    from pydub.silence import detect_nonsilent

    nonsilent = detect_nonsilent(
        audio,
        min_silence_len=500,
        silence_thresh=audio.dBFS - 16,
        seek_step=100,
    )
    segments = [(s, e) for s, e in nonsilent if e - s >= 1000]

if not segments:
    sys.exit("No speech detected")

# -----------------------------
# 2. Load x-vector model
# -----------------------------
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
    run_opts={"device": "cpu"},
)

# -----------------------------
# 3. Extract embeddings
# -----------------------------
embeddings, valid = [], []
os.makedirs("temp", exist_ok=True)
for i, (s, e) in enumerate(segments):
    path = f"temp/{i}.wav"
    audio[s:e].export(path, format="wav")
    wav = classifier.load_audio(path)
    emb = classifier.encode_batch(wav)
    embeddings.append(emb.squeeze().cpu().numpy())
    valid.append((s, e))
    os.remove(path)
os.rmdir("temp")

# -----------------------------
# 4. Cluster speakers
# -----------------------------
# Assumes at most two speakers; raise n_clusters for more.
labels = AgglomerativeClustering(
    n_clusters=min(2, len(embeddings)),
    metric="cosine",  # scikit-learn >= 1.2; older versions use affinity=
    linkage="average",
).fit_predict(np.array(embeddings))

# -----------------------------
# 5. Merge & save segments
# -----------------------------
groups = defaultdict(list)
for (s, e), lbl in zip(valid, labels):
    groups[lbl].append((s, e))

for lbl, segs in groups.items():
    segs.sort()
    start, end = segs[0]
    merged = []
    for s, e in segs[1:]:
        if s - end < 2000:  # merge gaps shorter than 2 s
            end = e
        else:
            merged.append((start, end))
            start, end = s, e
    merged.append((start, end))
    for i, (s, e) in enumerate(merged):
        audio[s:e].export(f"segment/SPEAKER_{lbl:02d}_{i}.wav", format="wav")
```
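The merge rule in step 5 is easy to test in isolation. Here it is as a standalone, slightly refactored function on toy millisecond timestamps, with the same 2000 ms gap threshold:

```python
def merge_segments(segs, max_gap_ms=2000):
    """Merge adjacent (start, end) segments whose gap is under max_gap_ms."""
    segs = sorted(segs)
    merged = [segs[0]]
    for s, e in segs[1:]:
        start, end = merged[-1]
        if s - end < max_gap_ms:
            merged[-1] = (start, max(end, e))  # bridge the short gap
        else:
            merged.append((s, e))  # gap too long: start a new segment
    return merged

print(merge_segments([(0, 1000), (1500, 3000), (6000, 7000)]))
# → [(0, 3000), (6000, 7000)]
```

The first two segments are 500 ms apart, so they fuse into one speaker turn; the 3 s gap before the last segment keeps it separate.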
Which configuration should you use?
- Learning / quick prototype → Silence-based VAD
- Meetings, podcasts, noisy audio → pyannote VAD
- Production diarization → Consider full pyannote pipelines
Final Thoughts
Most diarization errors come from poor VAD, not clustering. Keeping embeddings and clustering fixed while swapping VAD makes this pipeline easy to reason about and extend.
This pattern—simple core + pluggable components—is often the fastest way to build reliable speech systems.
What has been the biggest bottleneck in your diarization work: VAD, embeddings, or clustering?