Spaces:
Sleeping
Sleeping
| title: Who Spoke When | |
| emoji: ποΈ | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| app_file: app/main.py | |
| pinned: false | |
| # π Speaker Diarization System | |
| ### *Who Spoke When β Multi-Speaker Audio Segmentation* | |
| > **Tech Stack:** Python Β· PyTorch Β· SpeechBrain Β· Pyannote.audio Β· Transformers Β· FastAPI | |
| --- | |
| ## Architecture | |
| ``` | |
| Audio Input | |
| β | |
| βΌ | |
| βββββββββββββββββββββββββββββββ | |
| β Voice Activity Detection β β pyannote/voice-activity-detection | |
| β (VAD) β fallback: energy-based VAD | |
| ββββββββββββββ¬βββββββββββββββββ | |
| β speech regions (start, end) | |
| βΌ | |
| βββββββββββββββββββββββββββββββ | |
| β Sliding Window Segmentationβ β 1.5s windows, 50% overlap | |
| β β | |
| ββββββββββββββ¬βββββββββββββββββ | |
| β segment list | |
| βΌ | |
| βββββββββββββββββββββββββββββββ | |
| β ECAPA-TDNN Embedding β β speechbrain/spkrec-ecapa-voxceleb | |
| β Extraction β 192-dim L2-normalized vectors | |
| ββββββββββββββ¬βββββββββββββββββ | |
| β embeddings (N Γ 192) | |
| βΌ | |
| βββββββββββββββββββββββββββββββ | |
| β Agglomerative Hierarchical β β cosine distance metric | |
| β Clustering (AHC) β silhouette-based auto k-selection | |
| ββββββββββββββ¬βββββββββββββββββ | |
| β speaker labels | |
| βΌ | |
| βββββββββββββββββββββββββββββββ | |
| β Post-processing β β merge consecutive same-speaker segs | |
| β & Output Formatting β timestamped JSON / RTTM / SRT | |
| βββββββββββββββββββββββββββββββ | |
| ``` | |
| --- | |
| ## Project Structure | |
| ``` | |
| speaker-diarization/ | |
| βββ app/ | |
| β βββ main.py # FastAPI app β REST + WebSocket endpoints | |
| β βββ pipeline.py # Core end-to-end diarization pipeline | |
| βββ models/ | |
| β βββ embedder.py # ECAPA-TDNN speaker embedding extractor | |
| β βββ clusterer.py # Agglomerative Hierarchical Clustering (AHC) | |
| βββ utils/ | |
| β βββ audio.py # Audio loading, chunking, RTTM/SRT export | |
| βββ tests/ | |
| β βββ test_diarization.py # Unit + integration tests | |
| βββ static/ | |
| β βββ index.html # Web demo UI | |
| βββ demo.py # CLI interface | |
| βββ requirements.txt | |
| ``` | |
| --- | |
| ## Installation | |
| ```bash | |
| # 1. Clone / navigate to project | |
| cd speaker-diarization | |
| # 2. Create virtual environment | |
| python -m venv .venv | |
| source .venv/bin/activate # Windows: .venv\Scripts\activate | |
| # 3. Install dependencies | |
| pip install -r requirements.txt | |
| # 4. (Optional) Set HuggingFace token for pyannote VAD | |
| # Accept terms at: https://huggingface.co/pyannote/voice-activity-detection | |
| export HF_TOKEN=your_token_here | |
| ``` | |
| --- | |
| ## Usage | |
| ### CLI Demo | |
| ```bash | |
| # Basic usage (auto-detect speaker count) | |
| python demo.py --audio meeting.wav | |
| # Specify 3 speakers | |
| python demo.py --audio call.wav --speakers 3 | |
| # Export all formats | |
| python demo.py --audio audio.mp3 \ | |
| --output result.json \ | |
| --rttm output.rttm \ | |
| --srt subtitles.srt | |
| ``` | |
| **Example output:** | |
| ``` | |
| β Done in 4.83s | |
| Speakers found : 3 | |
| Audio duration : 120.50s | |
| Segments : 42 | |
| START END DUR SPEAKER | |
| ββββββββββββββββββββββββββββββββββββ | |
| 0.000 3.250 3.250 SPEAKER_00 | |
| 3.500 8.120 4.620 SPEAKER_01 | |
| 8.200 11.800 3.600 SPEAKER_00 | |
| 12.000 17.340 5.340 SPEAKER_02 | |
| ... | |
| ``` | |
| ### FastAPI Server | |
| ```bash | |
| # Start the API server | |
| uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload | |
| # Open the web UI | |
| open http://localhost:8000 | |
| # Swagger documentation | |
| open http://localhost:8000/docs | |
| ``` | |
| ### REST API | |
| **POST /diarize** β Upload audio file | |
| ```bash | |
| curl -X POST http://localhost:8000/diarize \ | |
| -F "file=@meeting.wav" \ | |
| -F "num_speakers=3" | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "status": "success", | |
| "num_speakers": 3, | |
| "audio_duration": 120.5, | |
| "processing_time": 4.83, | |
| "sample_rate": 16000, | |
| "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"], | |
| "segments": [ | |
| { "start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00" }, | |
| { "start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01" } | |
| ] | |
| } | |
| ``` | |
| **GET /health** β Service health | |
| ```bash | |
| curl http://localhost:8000/health | |
| # {"status":"healthy","device":"cuda","version":"1.0.0"} | |
| ``` | |
| ### WebSocket Streaming | |
| ```python | |
| import asyncio, websockets, json, numpy as np | |
| async def stream_audio(): | |
| async with websockets.connect("ws://localhost:8000/ws/stream") as ws: | |
| # Send config | |
| await ws.send(json.dumps({"sample_rate": 16000, "num_speakers": 2})) | |
| # Send audio chunks (raw float32 PCM) | |
| with open("audio.raw", "rb") as f: | |
| while chunk := f.read(4096): | |
| await ws.send(chunk) | |
| # Signal end | |
| await ws.send(json.dumps({"type": "eof"})) | |
| # Receive results | |
| async for msg in ws: | |
| data = json.loads(msg) | |
| if data["type"] == "segment": | |
| print(f"[{data['data']['speaker']}] {data['data']['start']:.2f}s β {data['data']['end']:.2f}s") | |
| elif data["type"] == "done": | |
| break | |
| asyncio.run(stream_audio()) | |
| ``` | |
| --- | |
| ## Key Design Decisions | |
| | Component | Choice | Rationale | | |
| |-----------|--------|-----------| | |
| | Speaker Embeddings | ECAPA-TDNN (SpeechBrain) | State-of-the-art speaker verification accuracy on VoxCeleb | | |
| | Clustering | AHC + cosine distance | No predefined k required; works well with L2-normalized embeddings | | |
| | k-selection | Silhouette analysis | Unsupervised, parameter-free speaker count estimation | | |
| | VAD | pyannote (energy fallback) | pyannote VAD reduces false embeddings on silence/noise | | |
| | Embedding window | 1.5s, 50% overlap | Balances temporal resolution vs. embedding stability | | |
| | Post-processing | Merge consecutive same-speaker | Reduces over-segmentation artifact | | |
| --- | |
| ## Evaluation Metrics | |
| Standard speaker diarization evaluation uses **Diarization Error Rate (DER)**: | |
| ``` | |
| DER = (Miss + False Alarm + Speaker Error) / Total Speech Duration | |
| ``` | |
| Export RTTM files for evaluation with `md-eval` or `dscore`: | |
| ```bash | |
| python demo.py --audio test.wav --rttm hypothesis.rttm | |
| dscore -r reference.rttm -s hypothesis.rttm | |
| ``` | |
| --- | |
| ## Running Tests | |
| ```bash | |
| pytest tests/ -v | |
| pytest tests/ -v -k "clusterer" # run specific test class | |
| ``` | |
| --- | |
| ## Limitations & Future Work | |
| - Long audio (>1hr) should use chunked processing (`utils.audio.chunk_audio`) | |
| - Real-time streaming requires low-latency VAD (not yet implemented in WS endpoint) | |
| - Speaker overlap (cross-talk) is assigned to a single speaker | |
| - Consider fine-tuning ECAPA-TDNN on domain-specific data for call analytics | |