---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
---
# 🎙️ Speaker Diarization System

**Who Spoke When** – Multi-Speaker Audio Segmentation

**Tech Stack:** Python · PyTorch · SpeechBrain · Pyannote.audio · Transformers · FastAPI
## Architecture

```
Audio Input
      │
      ▼
┌──────────────────────────────┐
│  Voice Activity Detection    │ ← pyannote/voice-activity-detection
│  (VAD)                       │   fallback: energy-based VAD
└──────────────┬───────────────┘
               │  speech regions (start, end)
               ▼
┌──────────────────────────────┐
│  Sliding Window Segmentation │ ← 1.5s windows, 50% overlap
└──────────────┬───────────────┘
               │  segment list
               ▼
┌──────────────────────────────┐
│  ECAPA-TDNN Embedding        │ ← speechbrain/spkrec-ecapa-voxceleb
│  Extraction                  │   192-dim L2-normalized vectors
└──────────────┬───────────────┘
               │  embeddings (N × 192)
               ▼
┌──────────────────────────────┐
│  Agglomerative Hierarchical  │ ← cosine distance metric
│  Clustering (AHC)            │   silhouette-based auto k-selection
└──────────────┬───────────────┘
               │  speaker labels
               ▼
┌──────────────────────────────┐
│  Post-processing             │ ← merge consecutive same-speaker segs
│  & Output Formatting         │   timestamped JSON / RTTM / SRT
└──────────────────────────────┘
```
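The sliding-window segmentation stage can be sketched in a few lines of plain Python. This is a simplified illustration, not the project's actual `app/pipeline.py` code; `sliding_windows` is a hypothetical helper whose defaults (1.5 s window, 0.75 s hop) mirror the 50%-overlap setting in the diagram.

```python
def sliding_windows(regions, win=1.5, hop=0.75):
    """Split VAD speech regions into fixed-length, overlapping windows.

    `regions` is a list of (start, end) times in seconds, as produced by
    the VAD stage. Returns (start, end) tuples for embedding extraction.
    """
    segments = []
    for start, end in regions:
        t = start
        while t + win <= end:
            segments.append((round(t, 3), round(t + win, 3)))
            t += hop
        # Keep a final shorter window so trailing speech is not dropped
        if t < end and (not segments or segments[-1][1] < end):
            segments.append((round(t, 3), round(end, 3)))
    return segments
```

Each window is later embedded independently; the 50% overlap trades extra compute for finer temporal resolution at speaker-change points.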
## Project Structure

```
speaker-diarization/
├── app/
│   ├── main.py              # FastAPI app: REST + WebSocket endpoints
│   └── pipeline.py          # Core end-to-end diarization pipeline
├── models/
│   ├── embedder.py          # ECAPA-TDNN speaker embedding extractor
│   └── clusterer.py         # Agglomerative Hierarchical Clustering (AHC)
├── utils/
│   └── audio.py             # Audio loading, chunking, RTTM/SRT export
├── tests/
│   └── test_diarization.py  # Unit + integration tests
├── static/
│   └── index.html           # Web demo UI
├── demo.py                  # CLI interface
└── requirements.txt
```
## Installation

```bash
# 1. Clone / navigate to project
cd speaker-diarization

# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Set HuggingFace token for pyannote VAD
#    Accept terms at: https://huggingface.co/pyannote/voice-activity-detection
export HF_TOKEN=your_token_here
```
## Usage

### CLI Demo

```bash
# Basic usage (auto-detect speaker count)
python demo.py --audio meeting.wav

# Specify 3 speakers
python demo.py --audio call.wav --speakers 3

# Export all formats
python demo.py --audio audio.mp3 \
    --output result.json \
    --rttm output.rttm \
    --srt subtitles.srt
```
Example output:

```
✔ Done in 4.83s
  Speakers found : 3
  Audio duration : 120.50s
  Segments       : 42

  START     END       DUR     SPEAKER
  ────────────────────────────────────
  0.000     3.250     3.250   SPEAKER_00
  3.500     8.120     4.620   SPEAKER_01
  8.200     11.800    3.600   SPEAKER_00
  12.000    17.340    5.340   SPEAKER_02
  ...
```
### FastAPI Server

```bash
# Start the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

# Open the web UI
open http://localhost:8000

# Swagger documentation
open http://localhost:8000/docs
```
### REST API

**POST /diarize** – Upload an audio file

```bash
curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.wav" \
  -F "num_speakers=3"
```
Response:

```json
{
  "status": "success",
  "num_speakers": 3,
  "audio_duration": 120.5,
  "processing_time": 4.83,
  "sample_rate": 16000,
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "segments": [
    { "start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00" },
    { "start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01" }
  ]
}
```
**GET /health** – Service health

```bash
curl http://localhost:8000/health
# {"status":"healthy","device":"cuda","version":"1.0.0"}
```
### WebSocket Streaming

```python
import asyncio, websockets, json

async def stream_audio():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        # Send config
        await ws.send(json.dumps({"sample_rate": 16000, "num_speakers": 2}))

        # Send audio chunks (raw float32 PCM)
        with open("audio.raw", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # Signal end
        await ws.send(json.dumps({"type": "eof"}))

        # Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "segment":
                seg = data["data"]
                print(f"[{seg['speaker']}] {seg['start']:.2f}s – {seg['end']:.2f}s")
            elif data["type"] == "done":
                break

asyncio.run(stream_audio())
```
## Key Design Decisions
| Component | Choice | Rationale |
|---|---|---|
| Speaker Embeddings | ECAPA-TDNN (SpeechBrain) | State-of-the-art speaker verification accuracy on VoxCeleb |
| Clustering | AHC + cosine distance | No predefined k required; works well with L2-normalized embeddings |
| k-selection | Silhouette analysis | Unsupervised, parameter-free speaker count estimation |
| VAD | pyannote (energy fallback) | pyannote VAD reduces false embeddings on silence/noise |
| Embedding window | 1.5s, 50% overlap | Balances temporal resolution vs. embedding stability |
| Post-processing | Merge consecutive same-speaker segments | Reduces over-segmentation artifacts |
## Evaluation Metrics

Standard speaker diarization evaluation uses Diarization Error Rate (DER):

```
DER = (Miss + False Alarm + Speaker Error) / Total Speech Duration
```
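As a worked example of the formula (durations in seconds; the helper name is illustrative, not part of the project API):

```python
def der(miss, false_alarm, speaker_error, total_speech):
    """Diarization Error Rate from its three duration components."""
    return (miss + false_alarm + speaker_error) / total_speech

# 3 s missed speech + 2 s false alarm + 5 s speaker confusion
# over 100 s of reference speech:
rate = der(3.0, 2.0, 5.0, 100.0)  # 0.10, i.e. 10% DER
```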
Export RTTM files for evaluation with md-eval or dscore:

```bash
python demo.py --audio test.wav --rttm hypothesis.rttm
dscore -r reference.rttm -s hypothesis.rttm
```
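The RTTM format written by the exporter is a plain 10-field line per segment. A minimal serializer might look like this (`to_rttm` is a hypothetical sketch; the project's real exporter lives in `utils/audio.py`):

```python
def to_rttm(segments, file_id="audio"):
    """Serialize (start, end, speaker) tuples as NIST RTTM SPEAKER lines.

    Fields: type, file-id, channel, onset, duration, then <NA> placeholders
    around the speaker label; times are seconds with millisecond precision.
    """
    lines = []
    for start, end, speaker in segments:
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)
```

Scoring tools match hypothesis lines against a reference RTTM by file-id, which is why the CLI writes one RTTM per input file.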
## Running Tests

```bash
pytest tests/ -v
pytest tests/ -v -k "clusterer"   # run a specific test class
```
## Limitations & Future Work

- Long audio (>1 hr) should use chunked processing (`utils.audio.chunk_audio`)
- Real-time streaming requires low-latency VAD (not yet implemented in the WS endpoint)
- Speaker overlap (cross-talk) is assigned to a single speaker
- Consider fine-tuning ECAPA-TDNN on domain-specific data for call analytics
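The chunking idea behind the first point can be sketched as span arithmetic. This is an assumption-laden illustration: `chunk_spans`, its defaults (10-minute chunks, 10 s overlap), and its return shape are hypothetical and need not match the actual `utils.audio.chunk_audio` signature.

```python
def chunk_spans(n_samples, sr=16000, chunk_s=600.0, overlap_s=10.0):
    """Yield (start, end) sample spans for diarizing long audio in chunks.

    Adjacent spans overlap so that segments straddling a chunk boundary
    can be re-stitched after per-chunk diarization.
    """
    chunk, overlap = int(chunk_s * sr), int(overlap_s * sr)
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        yield start, end
        if end == n_samples:
            break
        start = end - overlap  # back up by the overlap for the next chunk
```

The remaining work is cross-chunk speaker matching: clustering each chunk yields local labels, which the overlapping regions let you map onto a global speaker inventory.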