---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
---

# 🎙️ Speaker Diarization System

**Who Spoke When** — Multi-Speaker Audio Segmentation

**Tech Stack:** Python · PyTorch · SpeechBrain · Pyannote.audio · Transformers · FastAPI


## Architecture

```
Audio Input
    │
    ▼
┌─────────────────────────────┐
│  Voice Activity Detection   │  ← pyannote/voice-activity-detection
│  (VAD)                      │    fallback: energy-based VAD
└─────────────┬───────────────┘
              │  speech regions (start, end)
              ▼
┌─────────────────────────────┐
│  Sliding Window Segmentation│  ← 1.5 s windows, 50% overlap
└─────────────┬───────────────┘
              │  segment list
              ▼
┌─────────────────────────────┐
│  ECAPA-TDNN Embedding       │  ← speechbrain/spkrec-ecapa-voxceleb
│  Extraction                 │    192-dim L2-normalized vectors
└─────────────┬───────────────┘
              │  embeddings (N × 192)
              ▼
┌─────────────────────────────┐
│  Agglomerative Hierarchical │  ← cosine distance metric
│  Clustering (AHC)           │    silhouette-based auto k-selection
└─────────────┬───────────────┘
              │  speaker labels
              ▼
┌─────────────────────────────┐
│  Post-processing            │  ← merge consecutive same-speaker segs
│  & Output Formatting        │    timestamped JSON / RTTM / SRT
└─────────────────────────────┘
```
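The sliding-window step can be sketched in a few lines (an illustrative sketch, not the project's actual implementation; the function name and parameters are assumptions based on the 1.5 s / 50%-overlap figures above):

```python
def sliding_windows(start, end, win=1.5, hop=0.75):
    """Split a speech region [start, end) into overlapping windows.

    win: window length in seconds (1.5 s, as in the diagram)
    hop: hop size in seconds (50% overlap -> hop = win / 2)
    """
    windows = []
    t = start
    while t + win <= end:
        windows.append((t, t + win))
        t += hop
    # Keep a final shorter window so the tail of the region is not dropped
    if t < end and (not windows or windows[-1][1] < end):
        windows.append((t, end))
    return windows

# A 4-second speech region produces windows every 0.75 s
print(sliding_windows(0.0, 4.0))
# [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.0)]
```

Each window is then fed to the embedding extractor; the overlap means every instant of speech is covered by at least two embeddings, which stabilizes the clustering step.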

## Project Structure

```
speaker-diarization/
├── app/
│   ├── main.py          # FastAPI app — REST + WebSocket endpoints
│   └── pipeline.py      # Core end-to-end diarization pipeline
├── models/
│   ├── embedder.py      # ECAPA-TDNN speaker embedding extractor
│   └── clusterer.py     # Agglomerative Hierarchical Clustering (AHC)
├── utils/
│   └── audio.py         # Audio loading, chunking, RTTM/SRT export
├── tests/
│   └── test_diarization.py  # Unit + integration tests
├── static/
│   └── index.html       # Web demo UI
├── demo.py              # CLI interface
└── requirements.txt
```

## Installation

```bash
# 1. Clone / navigate to project
cd speaker-diarization

# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Set HuggingFace token for pyannote VAD
#    Accept terms at: https://huggingface.co/pyannote/voice-activity-detection
export HF_TOKEN=your_token_here
```

## Usage

### CLI Demo

```bash
# Basic usage (auto-detect speaker count)
python demo.py --audio meeting.wav

# Specify 3 speakers
python demo.py --audio call.wav --speakers 3

# Export all formats
python demo.py --audio audio.mp3 \
    --output result.json \
    --rttm output.rttm \
    --srt subtitles.srt
```

Example output:

```
✅ Done in 4.83s
   Speakers found : 3
   Audio duration : 120.50s
   Segments       : 42

   START       END       DUR  SPEAKER
   ────────────────────────────────────
   0.000     3.250    3.250  SPEAKER_00
   3.500     8.120    4.620  SPEAKER_01
   8.200    11.800    3.600  SPEAKER_00
   12.000   17.340    5.340  SPEAKER_02
   ...
```

### FastAPI Server

```bash
# Start the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

# Open the web UI
open http://localhost:8000

# Swagger documentation
open http://localhost:8000/docs
```

### REST API

**`POST /diarize`** — Upload an audio file

```bash
curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.wav" \
  -F "num_speakers=3"
```

Response:

```json
{
  "status": "success",
  "num_speakers": 3,
  "audio_duration": 120.5,
  "processing_time": 4.83,
  "sample_rate": 16000,
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "segments": [
    { "start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00" },
    { "start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01" }
  ]
}
```

**`GET /health`** — Service health

```bash
curl http://localhost:8000/health
# {"status":"healthy","device":"cuda","version":"1.0.0"}
```

### WebSocket Streaming

```python
import asyncio, websockets, json

async def stream_audio():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        # Send config
        await ws.send(json.dumps({"sample_rate": 16000, "num_speakers": 2}))

        # Send audio chunks (raw float32 PCM)
        with open("audio.raw", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # Signal end
        await ws.send(json.dumps({"type": "eof"}))

        # Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "segment":
                print(f"[{data['data']['speaker']}] {data['data']['start']:.2f}s – {data['data']['end']:.2f}s")
            elif data["type"] == "done":
                break

asyncio.run(stream_audio())
```

## Key Design Decisions

| Component | Choice | Rationale |
|---|---|---|
| Speaker embeddings | ECAPA-TDNN (SpeechBrain) | State-of-the-art speaker verification accuracy on VoxCeleb |
| Clustering | AHC + cosine distance | No predefined k required; works well with L2-normalized embeddings |
| k-selection | Silhouette analysis | Unsupervised, parameter-free speaker count estimation |
| VAD | pyannote (energy fallback) | pyannote VAD reduces false embeddings on silence/noise |
| Embedding window | 1.5 s, 50% overlap | Balances temporal resolution vs. embedding stability |
| Post-processing | Merge consecutive same-speaker segments | Reduces over-segmentation artifacts |

## Evaluation Metrics

Standard speaker diarization evaluation uses Diarization Error Rate (DER):

```
DER = (Miss + False Alarm + Speaker Error) / Total Speech Duration
```
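As a concrete illustration of the formula (toy numbers, not a benchmark result):

```python
def der(miss: float, false_alarm: float, speaker_error: float, total_speech: float) -> float:
    """Diarization Error Rate; all arguments are durations in seconds."""
    return (miss + false_alarm + speaker_error) / total_speech

# 3 s missed speech, 2 s false alarm, 5 s attributed to the wrong speaker,
# over 100 s of reference speech:
print(f"{der(3.0, 2.0, 5.0, 100.0):.1%}")  # 10.0%
```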

Export RTTM files for evaluation with md-eval or dscore:

```bash
python demo.py --audio test.wav --rttm hypothesis.rttm
dscore -r reference.rttm -s hypothesis.rttm
```
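For reference, each RTTM line follows the NIST format `SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <label> <NA> <NA>`. A minimal writer might look like this (a sketch, not the exporter in `utils/audio.py`):

```python
def to_rttm(segments, file_id="audio", channel=1):
    """Serialize (start, end, speaker) tuples as NIST RTTM lines."""
    lines = []
    for start, end, speaker in segments:
        lines.append(
            f"SPEAKER {file_id} {channel} {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)

print(to_rttm([(0.0, 3.25, "SPEAKER_00"), (3.5, 8.12, "SPEAKER_01")], file_id="meeting"))
# SPEAKER meeting 1 0.000 3.250 <NA> <NA> SPEAKER_00 <NA> <NA>
# SPEAKER meeting 1 3.500 4.620 <NA> <NA> SPEAKER_01 <NA> <NA>
```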

## Running Tests

```bash
pytest tests/ -v
pytest tests/ -v -k "clusterer"  # run only tests matching a keyword
```

## Limitations & Future Work

- Long audio (>1 hr) should use chunked processing (`utils.audio.chunk_audio`)
- Real-time streaming requires low-latency VAD (not yet implemented in the WS endpoint)
- Speaker overlap (cross-talk) is assigned to a single speaker
- Consider fine-tuning ECAPA-TDNN on domain-specific data for call analytics