---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
---

# 🎙️ Speaker Diarization System

**Who Spoke When** — Multi-Speaker Audio Segmentation

**Tech Stack:** Python · PyTorch · SpeechBrain · Pyannote.audio · Transformers · FastAPI


## Architecture

```
Audio Input
    │
    ▼
┌─────────────────────────────┐
│  Voice Activity Detection   │  ← pyannote/voice-activity-detection
│  (VAD)                      │    fallback: energy-based VAD
└─────────────┬───────────────┘
              │  speech regions (start, end)
              ▼
┌─────────────────────────────┐
│  Sliding Window Segmentation│  ← 1.5 s windows, 50% overlap
└─────────────┬───────────────┘
              │  segment list
              ▼
┌─────────────────────────────┐
│  ECAPA-TDNN Embedding       │  ← speechbrain/spkrec-ecapa-voxceleb
│  Extraction                 │    192-dim L2-normalized vectors
└─────────────┬───────────────┘
              │  embeddings (N × 192)
              ▼
┌─────────────────────────────┐
│  Agglomerative Hierarchical │  ← cosine distance metric
│  Clustering (AHC)           │    silhouette-based auto k-selection
└─────────────┬───────────────┘
              │  speaker labels
              ▼
┌─────────────────────────────┐
│  Post-processing            │  ← merge consecutive same-speaker segs
│  & Output Formatting        │    timestamped JSON / RTTM / SRT
└─────────────────────────────┘
```
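The sliding-window step can be sketched in a few lines (an illustrative sketch, not the project's actual implementation; the function name and parameters are assumptions based on the 1.5 s / 50%-overlap figures above):

```python
def sliding_windows(start, end, win=1.5, hop=0.75):
    """Split a speech region [start, end) into overlapping windows.

    win: window length in seconds (1.5 s, as in the diagram)
    hop: hop size in seconds (50% overlap -> hop = win / 2)
    """
    windows = []
    t = start
    while t + win <= end:
        windows.append((t, t + win))
        t += hop
    # Keep a final shorter window so the tail of the region is not dropped
    if t < end and (not windows or windows[-1][1] < end):
        windows.append((t, end))
    return windows

# A 4-second speech region produces windows every 0.75 s
print(sliding_windows(0.0, 4.0))
# [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.0)]
```

Each window is then fed to the embedding extractor; the overlap means every instant of speech is covered by at least two embeddings, which stabilizes the clustering step.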

## Project Structure

```
speaker-diarization/
├── app/
│   ├── main.py          # FastAPI app — REST + WebSocket endpoints
│   └── pipeline.py      # Core end-to-end diarization pipeline
├── models/
│   ├── embedder.py      # ECAPA-TDNN speaker embedding extractor
│   └── clusterer.py     # Agglomerative Hierarchical Clustering (AHC)
├── utils/
│   └── audio.py         # Audio loading, chunking, RTTM/SRT export
├── tests/
│   └── test_diarization.py  # Unit + integration tests
├── static/
│   └── index.html       # Web demo UI
├── demo.py              # CLI interface
└── requirements.txt
```

## Installation

```bash
# 1. Clone / navigate to project
cd speaker-diarization

# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Set HuggingFace token for pyannote VAD
#    Accept terms at: https://huggingface.co/pyannote/voice-activity-detection
export HF_TOKEN=your_token_here
```

## Usage

### CLI Demo

```bash
# Basic usage (auto-detect speaker count)
python demo.py --audio meeting.wav

# Specify 3 speakers
python demo.py --audio call.wav --speakers 3

# Export all formats
python demo.py --audio audio.mp3 \
    --output result.json \
    --rttm output.rttm \
    --srt subtitles.srt
```

Example output:

```
✅ Done in 4.83s
   Speakers found : 3
   Audio duration : 120.50s
   Segments       : 42

   START       END       DUR  SPEAKER
   ────────────────────────────────────
   0.000     3.250    3.250  SPEAKER_00
   3.500     8.120    4.620  SPEAKER_01
   8.200    11.800    3.600  SPEAKER_00
   12.000   17.340    5.340  SPEAKER_02
   ...
```

### FastAPI Server

```bash
# Start the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

# Open the web UI
open http://localhost:8000

# Swagger documentation
open http://localhost:8000/docs
```

### REST API

**`POST /diarize`** — Upload an audio file

```bash
curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.wav" \
  -F "num_speakers=3"
```

Response:

```json
{
  "status": "success",
  "num_speakers": 3,
  "audio_duration": 120.5,
  "processing_time": 4.83,
  "sample_rate": 16000,
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "segments": [
    { "start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00" },
    { "start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01" }
  ]
}
```

**`GET /health`** — Service health

```bash
curl http://localhost:8000/health
# {"status":"healthy","device":"cuda","version":"1.0.0"}
```

### WebSocket Streaming

```python
import asyncio, websockets, json

async def stream_audio():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        # Send config
        await ws.send(json.dumps({"sample_rate": 16000, "num_speakers": 2}))

        # Send audio chunks (raw float32 PCM)
        with open("audio.raw", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # Signal end
        await ws.send(json.dumps({"type": "eof"}))

        # Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "segment":
                print(f"[{data['data']['speaker']}] {data['data']['start']:.2f}s – {data['data']['end']:.2f}s")
            elif data["type"] == "done":
                break

asyncio.run(stream_audio())
```

## Key Design Decisions

| Component | Choice | Rationale |
|---|---|---|
| Speaker embeddings | ECAPA-TDNN (SpeechBrain) | State-of-the-art speaker verification accuracy on VoxCeleb |
| Clustering | AHC + cosine distance | No predefined k required; works well with L2-normalized embeddings |
| k-selection | Silhouette analysis | Unsupervised, parameter-free speaker count estimation |
| VAD | pyannote (energy fallback) | pyannote VAD reduces false embeddings on silence/noise |
| Embedding window | 1.5 s, 50% overlap | Balances temporal resolution vs. embedding stability |
| Post-processing | Merge consecutive same-speaker segments | Reduces over-segmentation artifacts |

## Evaluation Metrics

Standard speaker diarization evaluation uses Diarization Error Rate (DER):

```
DER = (Miss + False Alarm + Speaker Error) / Total Speech Duration
```
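As a concrete illustration of the formula (toy numbers, not a benchmark result):

```python
def der(miss: float, false_alarm: float, speaker_error: float, total_speech: float) -> float:
    """Diarization Error Rate; all arguments are durations in seconds."""
    return (miss + false_alarm + speaker_error) / total_speech

# 3 s missed speech, 2 s false alarm, 5 s attributed to the wrong speaker,
# over 100 s of reference speech:
print(f"{der(3.0, 2.0, 5.0, 100.0):.1%}")  # 10.0%
```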

Export RTTM files for evaluation with md-eval or dscore:

```bash
python demo.py --audio test.wav --rttm hypothesis.rttm
dscore -r reference.rttm -s hypothesis.rttm
```
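For reference, each RTTM line follows the NIST format `SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <label> <NA> <NA>`. A minimal writer might look like this (a sketch, not the exporter in `utils/audio.py`):

```python
def to_rttm(segments, file_id="audio", channel=1):
    """Serialize (start, end, speaker) tuples as NIST RTTM lines."""
    lines = []
    for start, end, speaker in segments:
        lines.append(
            f"SPEAKER {file_id} {channel} {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)

print(to_rttm([(0.0, 3.25, "SPEAKER_00"), (3.5, 8.12, "SPEAKER_01")], file_id="meeting"))
# SPEAKER meeting 1 0.000 3.250 <NA> <NA> SPEAKER_00 <NA> <NA>
# SPEAKER meeting 1 3.500 4.620 <NA> <NA> SPEAKER_01 <NA> <NA>
```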

## Running Tests

```bash
pytest tests/ -v
pytest tests/ -v -k "clusterer"  # run only tests matching a keyword
```

## Limitations & Future Work

- Long audio (>1 hr) should use chunked processing (`utils.audio.chunk_audio`)
- Real-time streaming requires low-latency VAD (not yet implemented in the WS endpoint)
- Speaker overlap (cross-talk) is assigned to a single speaker
- Consider fine-tuning ECAPA-TDNN on domain-specific data for call analytics