Simple Speaker Diarization with SpeechBrain X-Vectors

Community Article Published February 2, 2026

Comparing Silence-Based VAD vs Neural VAD (pyannote)

Speaker diarization answers a simple question: “who spoke when?” While full diarization pipelines exist, they can feel heavy if you just want a clear, hackable baseline.

This post presents one simple diarization pipeline built on SpeechBrain x-vector embeddings, with two interchangeable Voice Activity Detection (VAD) methods:

  • Method A: Silence-based VAD (minimal dependencies)
  • Method B: Neural VAD using pyannote (more robust)

Both methods share the same embedding, clustering, and segment-merging logic.


Pipeline Overview (Shared)

Regardless of VAD choice, the diarization flow is:

  1. Detect speech regions (VAD)
  2. Extract SpeechBrain x-vector embeddings
  3. Cluster segments by cosine similarity
  4. Merge adjacent segments from the same speaker
  5. Save speaker-separated WAV files

Only Step 1 (VAD) changes.


Why SpeechBrain X-Vectors?

We use:

speechbrain/spkrec-xvect-voxceleb

Why?

  • Widely used, well-understood speaker embeddings
  • CPU-friendly
  • Available directly on Hugging Face
  • Easy to integrate into custom pipelines

This makes it ideal for learning, prototyping, and lightweight production tasks.


VAD Options

Method A: Silence-Based VAD (Minimal)

Uses pydub.silence.detect_nonsilent.

Pros

  • Extremely simple
  • No neural models
  • Fast and transparent

Cons

  • Sensitive to noise
  • Poor boundary precision
  • No overlap handling
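For intuition, silence-based VAD boils down to thresholding frame energy and grouping consecutive "loud" frames into segments. Here is a pure-NumPy sketch of that idea on a synthetic signal; the sample rate, frame size, and threshold are arbitrary illustration values, and `pydub.silence.detect_nonsilent` does the equivalent on real audio using dBFS thresholds.

```python
import numpy as np

sr = 1000                                # hypothetical sample rate
signal = np.zeros(3 * sr)
signal[500:1200] = 0.5                   # "speech" burst 1
signal[2000:2600] = 0.5                  # "speech" burst 2

frame = 100                              # split into 100-sample frames
energy = signal.reshape(-1, frame).max(axis=1)
loud = energy > 0.1                      # per-frame "is speech" flag

# Group consecutive loud frames into (start, end) sample spans.
segments, start = [], None
for i, is_loud in enumerate(loud):
    if is_loud and start is None:
        start = i * frame
    elif not is_loud and start is not None:
        segments.append((start, i * frame))
        start = None
if start is not None:
    segments.append((start, len(signal)))

print(segments)  # → [(500, 1200), (2000, 2600)]
```

The cons listed above follow directly from this picture: any background noise that clears the energy threshold looks like speech, and boundaries are only as precise as the frame size.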

Method B: Neural VAD (pyannote)

Uses:

norwoodsystems/norwood-voice-activity-detection

Pros

  • Much better speech boundaries
  • Robust to background noise
  • Handles quiet speakers well

Cons

  • Extra dependency
  • Slightly heavier runtime

Comparison Summary

Aspect              Silence-Based VAD   pyannote VAD
------------------  ------------------  ---------------
Dependencies        Very low            Moderate
Noise robustness    Low                 High
Boundary accuracy   Rough               Clean
Recommended for     Prototypes          Real recordings

Unified Example Code

Below is a single diarization script. Choose your VAD by toggling one flag.

# diarization_speechbrain.py
# SpeechBrain x-vectors with pluggable VAD

import sys, os, numpy as np
from collections import defaultdict
from pydub import AudioSegment
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.speaker in SpeechBrain >= 1.0
from sklearn.cluster import AgglomerativeClustering

USE_PYANNOTE_VAD = True  # ← toggle here

if len(sys.argv) < 2:
    print("Usage: python diarization_speechbrain.py <input_wav>")
    sys.exit(1)

audio_path = sys.argv[1]
audio = AudioSegment.from_wav(audio_path)
os.makedirs("segment", exist_ok=True)

# -----------------------------
# 1. Voice Activity Detection
# -----------------------------
segments = []

if USE_PYANNOTE_VAD:
    from pyannote.audio.pipelines import VoiceActivityDetection

    vad = VoiceActivityDetection(
        segmentation="norwoodsystems/norwood-voice-activity-detection"
    )
    vad.instantiate({
        "onset": 0.5,
        "offset": 0.5,
        "min_duration_on": 1.0,
        "min_duration_off": 0.5,
    })

    vad_annotation = vad(audio_path)
    for seg in vad_annotation.get_timeline():
        s, e = int(seg.start * 1000), int(seg.end * 1000)
        if e - s >= 1000:  # keep only segments of at least 1 s
            segments.append((s, e))

else:
    from pydub.silence import detect_nonsilent

    nonsilent = detect_nonsilent(
        audio,
        min_silence_len=500,
        silence_thresh=audio.dBFS - 16,
        seek_step=100,
    )
    segments = [(s, e) for s, e in nonsilent if e - s >= 1000]

if not segments:
    sys.exit("No speech detected")

# -----------------------------
# 2. Load x-vector model
# -----------------------------
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
    run_opts={"device": "cpu"},
)

# -----------------------------
# 3. Extract embeddings
# -----------------------------
embeddings, valid = [], []
os.makedirs("temp", exist_ok=True)

for i, (s, e) in enumerate(segments):
    path = f"temp/{i}.wav"
    audio[s:e].export(path, format="wav")

    wav = classifier.load_audio(path)
    emb = classifier.encode_batch(wav)

    embeddings.append(emb.squeeze().cpu().numpy())
    valid.append((s, e))
    os.remove(path)

os.rmdir("temp")

# -----------------------------
# 4. Cluster speakers
# -----------------------------
labels = AgglomerativeClustering(
    n_clusters=min(2, len(embeddings)),  # hardcoded two-speaker assumption; adjust for more speakers
    metric="cosine",                     # "metric" requires scikit-learn >= 1.2 (previously "affinity")
    linkage="average",
).fit_predict(np.array(embeddings))

# -----------------------------
# 5. Merge & save segments
# -----------------------------
groups = defaultdict(list)
for (s, e), lbl in zip(valid, labels):
    groups[lbl].append((s, e))

for lbl, segs in groups.items():
    segs.sort()
    start, end = segs[0]

    merged = []
    for s, e in segs[1:]:
        if s - end < 2000:  # merge same-speaker gaps shorter than 2 s
            end = e
        else:
            merged.append((start, end))
            start, end = s, e
    merged.append((start, end))

    for i, (s, e) in enumerate(merged):
        audio[s:e].export(
            f"segment/SPEAKER_{lbl:02d}_{i}.wav",
            format="wav"
        )

Which configuration should you use?

  • Learning / quick prototype → Silence-based VAD
  • Meetings, podcasts, noisy audio → pyannote VAD
  • Production diarization → Consider full pyannote pipelines

Final Thoughts

Most diarization errors come from poor VAD, not clustering. Keeping embeddings and clustering fixed while swapping VAD makes this pipeline easy to reason about and extend.

This pattern—simple core + pluggable components—is often the fastest way to build reliable speech systems.


What has been the biggest bottleneck in your diarization work: VAD, embeddings, or clustering?
