Simple Speaker Diarization with SpeechBrain X-Vectors
Comparing Silence-Based VAD vs Neural VAD (pyannote)
Speaker diarization answers a simple question: “who spoke when?” While full diarization pipelines exist, they can feel heavy if you just want a clear, hackable baseline.
This post presents one simple diarization pipeline built on SpeechBrain x-vector embeddings, with two interchangeable Voice Activity Detection (VAD) methods:
- Method A: Silence-based VAD (minimal dependencies)
- Method B: Neural VAD using pyannote (more robust)
Both methods share the same embedding, clustering, and segment-merging logic.
Pipeline Overview (Shared)
Regardless of VAD choice, the diarization flow is:
- Detect speech regions (VAD)
- Extract SpeechBrain x-vector embeddings
- Cluster segments by cosine similarity
- Merge adjacent segments from the same speaker
- Save speaker-separated WAV files
Only Step 1 (VAD) changes.
Why SpeechBrain X-Vectors?
We use:
speechbrain/spkrec-xvect-voxceleb
Why?
- Widely used, well-understood speaker embeddings
- CPU-friendly
- Available directly on Hugging Face
- Easy to integrate into custom pipelines
This makes it ideal for learning, prototyping, and lightweight production tasks.
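Downstream, the pipeline compares x-vectors by cosine similarity: two utterances from the same speaker yield embeddings that point in a similar direction. A minimal sketch of that comparison, using random toy vectors in place of real 512-dimensional x-vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 512-dimensional x-vectors: a second utterance from
# the same speaker is modeled as the same vector plus small noise.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=512)
speaker_a_again = speaker_a + 0.1 * rng.normal(size=512)
speaker_b = rng.normal(size=512)

print(cosine_similarity(speaker_a, speaker_a_again))  # close to 1.0
print(cosine_similarity(speaker_a, speaker_b))        # close to 0.0
```

Real x-vectors behave the same way in aggregate, which is why average-linkage clustering over cosine distance is enough to separate speakers.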
VAD Options
Method A: Silence-Based VAD (Minimal)
Uses pydub.silence.detect_nonsilent.
Pros
- Extremely simple
- No neural models
- Fast and transparent
Cons
- Sensitive to noise
- Poor boundary precision
- No overlap handling
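Under the hood, silence-based VAD is just frame-energy thresholding. The sketch below reimplements the idea in plain numpy (frame size, threshold, and the 16 kHz mono assumption are illustrative choices, not pydub's internals):

```python
import numpy as np

def energy_vad(samples, rate=16000, frame_ms=50, thresh_db=-40.0):
    """Return (start_ms, end_ms) regions whose frame RMS energy exceeds
    thresh_db relative to full scale -- the core idea behind
    silence-based VAD."""
    frame = int(rate * frame_ms / 1000)
    regions, start = [], None
    for i in range(0, len(samples) - frame + 1, frame):
        rms = np.sqrt(np.mean(samples[i:i + frame] ** 2))
        db = 20 * np.log10(rms) if rms > 0 else -np.inf
        t = i * 1000 // rate
        if db > thresh_db and start is None:
            start = t  # speech onset
        elif db <= thresh_db and start is not None:
            regions.append((start, t))  # speech offset
            start = None
    if start is not None:
        regions.append((start, len(samples) * 1000 // rate))
    return regions

# 1 s silence, 1 s 440 Hz tone, 1 s silence
rate = 16000
t = np.arange(rate) / rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
signal = np.concatenate([np.zeros(rate), tone, np.zeros(rate)])
print(energy_vad(signal, rate))  # → [(1000, 2000)]
```

This is also why the method is noise-sensitive: any background energy above the threshold is indistinguishable from speech.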
Method B: Neural VAD (pyannote)
Uses pyannote's VoiceActivityDetection pipeline with its standard segmentation model:
pyannote/segmentation
Pros
- Much better speech boundaries
- Robust to background noise
- Handles quiet speakers well
Cons
- Extra dependency
- Slightly heavier runtime
- Gated model: requires accepting its terms on Hugging Face and an access token
Comparison Summary
| Aspect | Silence-Based VAD | pyannote VAD |
|---|---|---|
| Dependencies | Very low | Moderate |
| Noise robustness | Low | High |
| Boundary accuracy | Rough | Clean |
| Recommended for | Prototypes | Real recordings |
Unified Example Code
Below is a single diarization script. Choose your VAD by toggling one flag.
```python
# diarization_speechbrain.py
# SpeechBrain x-vectors with pluggable VAD
import os
import sys
from collections import defaultdict

import numpy as np
from pydub import AudioSegment
from sklearn.cluster import AgglomerativeClustering
# On speechbrain >= 1.0: from speechbrain.inference import EncoderClassifier
from speechbrain.pretrained import EncoderClassifier

USE_PYANNOTE_VAD = True  # ← toggle here

if len(sys.argv) < 2:
    print("Usage: python diarization_speechbrain.py <input_wav>")
    sys.exit(1)

audio_path = sys.argv[1]
audio = AudioSegment.from_wav(audio_path)
os.makedirs("segment", exist_ok=True)

# -----------------------------
# 1. Voice Activity Detection
# -----------------------------
segments = []
if USE_PYANNOTE_VAD:
    from pyannote.audio.pipelines import VoiceActivityDetection

    # Gated model: accept its terms on Hugging Face and authenticate
    # (e.g. huggingface-cli login) before first use.
    vad = VoiceActivityDetection(segmentation="pyannote/segmentation")
    vad.instantiate({
        "onset": 0.5,
        "offset": 0.5,
        "min_duration_on": 1.0,
        "min_duration_off": 0.5,
    })
    vad_annotation = vad(audio_path)
    for seg in vad_annotation.get_timeline():
        s, e = int(seg.start * 1000), int(seg.end * 1000)
        if e - s >= 1000:  # keep segments of at least 1 s
            segments.append((s, e))
else:
    from pydub.silence import detect_nonsilent

    nonsilent = detect_nonsilent(
        audio,
        min_silence_len=500,
        silence_thresh=audio.dBFS - 16,
        seek_step=100,
    )
    segments = [(s, e) for s, e in nonsilent if e - s >= 1000]

if not segments:
    sys.exit("No speech detected")

# -----------------------------
# 2. Load x-vector model
# -----------------------------
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
    run_opts={"device": "cpu"},
)

# -----------------------------
# 3. Extract embeddings
# -----------------------------
embeddings, valid = [], []
os.makedirs("temp", exist_ok=True)
for i, (s, e) in enumerate(segments):
    path = f"temp/{i}.wav"
    audio[s:e].export(path, format="wav")
    wav = classifier.load_audio(path)
    emb = classifier.encode_batch(wav)
    embeddings.append(emb.squeeze().cpu().numpy())
    valid.append((s, e))
    os.remove(path)
os.rmdir("temp")

# -----------------------------
# 4. Cluster speakers
# -----------------------------
# Assumes at most two speakers; raise n_clusters for more.
labels = AgglomerativeClustering(
    n_clusters=min(2, len(embeddings)),
    metric="cosine",  # scikit-learn >= 1.2; older versions use affinity=
    linkage="average",
).fit_predict(np.array(embeddings))

# -----------------------------
# 5. Merge & save segments
# -----------------------------
groups = defaultdict(list)
for (s, e), lbl in zip(valid, labels):
    groups[lbl].append((s, e))

for lbl, segs in groups.items():
    segs.sort()
    start, end = segs[0]
    merged = []
    for s, e in segs[1:]:
        if s - end < 2000:  # merge gaps shorter than 2 s
            end = e
        else:
            merged.append((start, end))
            start, end = s, e
    merged.append((start, end))
    for i, (s, e) in enumerate(merged):
        audio[s:e].export(f"segment/SPEAKER_{lbl:02d}_{i}.wav", format="wav")
```
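The merge rule in step 5 is easy to test in isolation. Here it is as a standalone, slightly refactored function on toy millisecond timestamps, with the same 2000 ms gap threshold:

```python
def merge_segments(segs, max_gap_ms=2000):
    """Merge adjacent (start, end) segments whose gap is under max_gap_ms."""
    segs = sorted(segs)
    merged = [segs[0]]
    for s, e in segs[1:]:
        start, end = merged[-1]
        if s - end < max_gap_ms:
            merged[-1] = (start, max(end, e))  # bridge the short gap
        else:
            merged.append((s, e))  # gap too long: start a new segment
    return merged

print(merge_segments([(0, 1000), (1500, 3000), (6000, 7000)]))
# → [(0, 3000), (6000, 7000)]
```

The first two segments are 500 ms apart, so they fuse into one speaker turn; the 3 s gap before the last segment keeps it separate.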
Which configuration should you use?
- Learning / quick prototype → Silence-based VAD
- Meetings, podcasts, noisy audio → pyannote VAD
- Production diarization → Consider full pyannote pipelines
Final Thoughts
Most diarization errors come from poor VAD, not clustering. Keeping embeddings and clustering fixed while swapping VAD makes this pipeline easy to reason about and extend.
This pattern—simple core + pluggable components—is often the fastest way to build reliable speech systems.
What has been the biggest bottleneck in your diarization work: VAD, embeddings, or clustering?