---
title: Who Spoke When
emoji: πŸŽ™οΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
---
# πŸŽ™ Speaker Diarization System
### *Who Spoke When β€” Multi-Speaker Audio Segmentation*
> **Tech Stack:** Python Β· PyTorch Β· SpeechBrain Β· Pyannote.audio Β· Transformers Β· FastAPI
---
## Architecture
```
Audio Input
              β”‚
              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Voice Activity Detection   β”‚ ← pyannote/voice-activity-detection
β”‚ (VAD)                      β”‚   fallback: energy-based VAD
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚ speech regions (start, end)
              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sliding Window Segmentationβ”‚ ← 1.5s windows, 50% overlap
β”‚                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚ segment list
              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ECAPA-TDNN Embedding       β”‚ ← speechbrain/spkrec-ecapa-voxceleb
β”‚ Extraction                 β”‚   192-dim L2-normalized vectors
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚ embeddings (N Γ— 192)
              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Agglomerative Hierarchical β”‚ ← cosine distance metric
β”‚ Clustering (AHC)           β”‚   silhouette-based auto k-selection
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚ speaker labels
              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Post-processing            β”‚ ← merge consecutive same-speaker segs
β”‚ & Output Formatting        β”‚   timestamped JSON / RTTM / SRT
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
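The windowing stage above can be sketched in a few lines of plain Python. The function name `sliding_windows` and the tail-window handling are illustrative assumptions, not the project's actual API:

```python
def sliding_windows(regions, win=1.5, hop=0.75):
    """Split VAD speech regions into fixed windows (1.5 s, 50% overlap).

    `regions` is a list of (start, end) times in seconds. Each region is
    tiled with `win`-second windows advanced by `hop` seconds.
    """
    segments = []
    for start, end in regions:
        t = start
        while t + win <= end:
            segments.append((t, t + win))
            t += hop
        # Keep a shorter tail window so trailing speech is not dropped
        if t < end:
            segments.append((t, end))
    return segments
```

The tail window preserves trailing speech at the cost of a shorter, noisier embedding, which the later merge step helps smooth over.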
---
## Project Structure
```
speaker-diarization/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ main.py               # FastAPI app β€” REST + WebSocket endpoints
β”‚   └── pipeline.py           # Core end-to-end diarization pipeline
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ embedder.py           # ECAPA-TDNN speaker embedding extractor
β”‚   └── clusterer.py          # Agglomerative Hierarchical Clustering (AHC)
β”œβ”€β”€ utils/
β”‚   └── audio.py              # Audio loading, chunking, RTTM/SRT export
β”œβ”€β”€ tests/
β”‚   └── test_diarization.py   # Unit + integration tests
β”œβ”€β”€ static/
β”‚   └── index.html            # Web demo UI
β”œβ”€β”€ demo.py                   # CLI interface
└── requirements.txt
```
---
## Installation
```bash
# 1. Clone / navigate to project
cd speaker-diarization
# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. (Optional) Set HuggingFace token for pyannote VAD
# Accept terms at: https://huggingface.co/pyannote/voice-activity-detection
export HF_TOKEN=your_token_here
```
---
## Usage
### CLI Demo
```bash
# Basic usage (auto-detect speaker count)
python demo.py --audio meeting.wav
# Specify 3 speakers
python demo.py --audio call.wav --speakers 3
# Export all formats
python demo.py --audio audio.mp3 \
    --output result.json \
    --rttm output.rttm \
    --srt subtitles.srt
```
**Example output:**
```
βœ… Done in 4.83s
Speakers found : 3
Audio duration : 120.50s
Segments       : 42

START     END       DUR       SPEAKER
────────────────────────────────────
0.000     3.250     3.250     SPEAKER_00
3.500     8.120     4.620     SPEAKER_01
8.200     11.800    3.600     SPEAKER_00
12.000    17.340    5.340     SPEAKER_02
...
```
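The post-processing stage turns raw window labels into clean turns like those above by merging adjacent segments that share a speaker. A minimal sketch, where the function name and the 0.5 s gap tolerance are assumptions:

```python
def merge_consecutive(segments, gap=0.5):
    """Merge back-to-back segments from the same speaker.

    `segments` are dicts with "start"/"end"/"speaker" keys, sorted by
    start time; two segments merge when the silence between them is at
    most `gap` seconds (an assumed tolerance).
    """
    merged = []
    for seg in segments:
        last = merged[-1] if merged else None
        if last and last["speaker"] == seg["speaker"] \
                and seg["start"] - last["end"] <= gap:
            last["end"] = seg["end"]      # extend the previous turn
        else:
            merged.append(dict(seg))      # start a new turn
    return merged
```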
### FastAPI Server
```bash
# Start the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
# Open the web UI
open http://localhost:8000
# Swagger documentation
open http://localhost:8000/docs
```
### REST API
**POST /diarize** β€” Upload audio file
```bash
curl -X POST http://localhost:8000/diarize \
    -F "file=@meeting.wav" \
    -F "num_speakers=3"
```
**Response:**
```json
{
"status": "success",
"num_speakers": 3,
"audio_duration": 120.5,
"processing_time": 4.83,
"sample_rate": 16000,
"speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
"segments": [
{ "start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00" },
{ "start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01" }
]
}
```
**GET /health** β€” Service health
```bash
curl http://localhost:8000/health
# {"status":"healthy","device":"cuda","version":"1.0.0"}
```
### WebSocket Streaming
```python
import asyncio
import json

import websockets


async def stream_audio():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        # Send config
        await ws.send(json.dumps({"sample_rate": 16000, "num_speakers": 2}))
        # Send audio chunks (raw float32 PCM)
        with open("audio.raw", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)
        # Signal end
        await ws.send(json.dumps({"type": "eof"}))
        # Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "segment":
                seg = data["data"]
                print(f"[{seg['speaker']}] {seg['start']:.2f}s – {seg['end']:.2f}s")
            elif data["type"] == "done":
                break

asyncio.run(stream_audio())
```
---
## Key Design Decisions
| Component | Choice | Rationale |
|-----------|--------|-----------|
| Speaker Embeddings | ECAPA-TDNN (SpeechBrain) | State-of-the-art speaker verification accuracy on VoxCeleb |
| Clustering | AHC + cosine distance | No predefined k required; works well with L2-normalized embeddings |
| k-selection | Silhouette analysis | Unsupervised, parameter-free speaker count estimation |
| VAD | pyannote (energy fallback) | pyannote VAD reduces false embeddings on silence/noise |
| Embedding window | 1.5s, 50% overlap | Balances temporal resolution vs. embedding stability |
| Post-processing | Merge consecutive same-speaker segments | Reduces over-segmentation artifacts |
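The clustering and k-selection choices above can be sketched with SciPy and scikit-learn. The function name `cluster_speakers` and the `k_min`/`k_max` search bounds are assumptions, not the project's actual API:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score


def cluster_speakers(embeddings, k_min=2, k_max=8):
    """AHC over L2-normalized embeddings with silhouette-based k selection.

    `embeddings` is an (N, D) array. Average-linkage AHC is built once on
    cosine distance; the tree is then cut at each candidate k and the cut
    with the best cosine silhouette score wins.
    """
    Z = linkage(embeddings, method="average", metric="cosine")
    best_k, best_labels, best_score = None, None, -1.0
    for k in range(k_min, min(k_max, len(embeddings) - 1) + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels
```

Building the linkage tree once and cutting it at each candidate k keeps the search cheap relative to re-clustering from scratch for every k.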
---
## Evaluation Metrics
Standard speaker diarization evaluation uses **Diarization Error Rate (DER)**:
```
DER = (Miss + False Alarm + Speaker Error) / Total Speech Duration
```
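A minimal frame-level illustration of the formula. This sketch assumes the hypothesis speaker labels are already mapped to the reference labels and applies no forgiveness collar, so it is not a substitute for `dscore` or `md-eval`:

```python
import numpy as np


def simple_der(reference, hypothesis, step=0.01):
    """Frame-level DER on a 10 ms grid.

    Segments are (start, end, speaker) tuples in seconds. Real scorers
    also solve an optimal reference/hypothesis label mapping and apply a
    collar around boundaries; both are omitted here for clarity.
    """
    total = max(e for _, e, _ in reference + hypothesis)
    n = int(round(total / step))
    ref = np.full(n, "", dtype=object)
    hyp = np.full(n, "", dtype=object)
    for s, e, spk in reference:
        ref[int(round(s / step)):int(round(e / step))] = spk
    for s, e, spk in hypothesis:
        hyp[int(round(s / step)):int(round(e / step))] = spk
    speech = ref != ""
    miss = np.sum(speech & (hyp == ""))          # reference speech, no hypothesis
    false_alarm = np.sum(~speech & (hyp != ""))  # hypothesis speech, no reference
    confusion = np.sum(speech & (hyp != "") & (ref != hyp))
    return float(miss + false_alarm + confusion) / max(int(speech.sum()), 1)
```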
Export RTTM files for evaluation with `md-eval` or `dscore`:
```bash
python demo.py --audio test.wav --rttm hypothesis.rttm
dscore -r reference.rttm -s hypothesis.rttm
```
---
## Running Tests
```bash
pytest tests/ -v
pytest tests/ -v -k "clusterer"   # run tests matching a keyword
```
---
## Limitations & Future Work
- Long audio (>1hr) should use chunked processing (`utils.audio.chunk_audio`)
- Real-time streaming requires low-latency VAD (not yet implemented in WS endpoint)
- Speaker overlap (cross-talk) is assigned to a single speaker
- Consider fine-tuning ECAPA-TDNN on domain-specific data for call analytics