---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
---

# 🎙 Speaker Diarization System
### *Who Spoke When: Multi-Speaker Audio Segmentation*

> **Tech Stack:** Python · PyTorch · SpeechBrain · Pyannote.audio · Transformers · FastAPI

---

## Architecture

```
Audio Input
    │
    ▼
┌─────────────────────────────┐
│  Voice Activity Detection   │  ← pyannote/voice-activity-detection
│  (VAD)                      │    fallback: energy-based VAD
└────────────┬────────────────┘
             │  speech regions (start, end)
             ▼
┌─────────────────────────────┐
│  Sliding Window Segmentation│  ← 1.5s windows, 50% overlap
│                             │
└────────────┬────────────────┘
             │  segment list
             ▼
┌─────────────────────────────┐
│  ECAPA-TDNN Embedding       │  ← speechbrain/spkrec-ecapa-voxceleb
│  Extraction                 │    192-dim L2-normalized vectors
└────────────┬────────────────┘
             │  embeddings (N × 192)
             ▼
┌─────────────────────────────┐
│  Agglomerative Hierarchical │  ← cosine distance metric
│  Clustering (AHC)           │    silhouette-based auto k-selection
└────────────┬────────────────┘
             │  speaker labels
             ▼
┌─────────────────────────────┐
│  Post-processing            │  ← merge consecutive same-speaker segs
│  & Output Formatting        │    timestamped JSON / RTTM / SRT
└─────────────────────────────┘
```
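
The segmentation and post-processing stages in the diagram can be sketched in plain Python. This is a minimal illustration, not the code in `app/pipeline.py`: the function names `sliding_windows` and `merge_same_speaker`, the tail-handling rule, and the `max_gap` parameter are assumptions made for the example.

```python
from typing import List, Tuple

Window = Tuple[float, float]        # (start, end) in seconds
Labeled = Tuple[float, float, str]  # (start, end, speaker label)


def sliding_windows(start: float, end: float,
                    win: float = 1.5, overlap: float = 0.5) -> List[Window]:
    """Cut one speech region into fixed-length windows with fractional overlap."""
    hop = win * (1.0 - overlap)
    if end - start <= win:
        return [(start, end)]  # short region: keep it as a single window
    windows: List[Window] = []
    t = start
    while t + win <= end + 1e-9:
        windows.append((round(t, 3), round(t + win, 3)))
        t += hop
    # Assumed rule: cover a leftover tail with one window aligned to the region end.
    if windows[-1][1] < end - 1e-9:
        windows.append((round(end - win, 3), round(end, 3)))
    return windows


def merge_same_speaker(segs: List[Labeled], max_gap: float = 0.0) -> List[Labeled]:
    """Merge consecutive windows that share a label and touch or overlap."""
    merged: List[Labeled] = []
    for start, end, spk in sorted(segs):
        if merged and merged[-1][2] == spk and start - merged[-1][1] <= max_gap:
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), spk)
        else:
            merged.append((start, end, spk))
    return merged


if __name__ == "__main__":
    wins = sliding_windows(0.0, 3.0)
    print(wins)
    labeled = [(s, e, "SPEAKER_00") for s, e in wins[:2]] + [(*wins[2], "SPEAKER_01")]
    print(merge_same_speaker(labeled))
```

Because the windows overlap, two merged segments with different speakers can still overlap at a speaker change. A real pipeline would need a rule for placing that boundary, which this sketch leaves out.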

---

## Project Structure

```
speaker-diarization/
├── app/
│   ├── main.py          # FastAPI app: REST + WebSocket endpoints
│   └── pipeline.py      # Core end-to-end diarization pipeline
├── models/
│   ├── embedder.py      # ECAPA-TDNN speaker embedding extractor
│   └── clusterer.py     # Agglomerative Hierarchical Clustering (AHC)
├── utils/
│   └── audio.py         # Audio loading, chunking, RTTM/SRT export
├── tests/
│   └── test_diarization.py  # Unit + integration tests
├── static/
│   └── index.html       # Web demo UI
├── demo.py              # CLI interface
└── requirements.txt
```

---

## Installation

```bash
# 1. Clone / navigate to project
cd speaker-diarization

# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Set HuggingFace token for pyannote VAD
#    Accept terms at: https://huggingface.co/pyannote/voice-activity-detection
export HF_TOKEN=your_token_here
```

---

## Usage

### CLI Demo

```bash
# Basic usage (auto-detect speaker count)
python demo.py --audio meeting.wav

# Specify 3 speakers
python demo.py --audio call.wav --speakers 3

# Export all formats
python demo.py --audio audio.mp3 \
    --output result.json \
    --rttm output.rttm \
    --srt subtitles.srt
```

**Example output:**
```
✅ Done in 4.83s
   Speakers found : 3
   Audio duration : 120.50s
   Segments       : 42

   START       END       DUR  SPEAKER
   ────────────────────────────────────
   0.000     3.250    3.250  SPEAKER_00
   3.500     8.120    4.620  SPEAKER_01
   8.200    11.800    3.600  SPEAKER_00
   12.000   17.340    5.340  SPEAKER_02
   ...
```
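
For orientation, segments like the ones listed above could be rendered as SRT cues roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and using the speaker label as the cue text is a guess, since the exact layout written by `--srt` is not shown here.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Build numbered SRT cues; the cue text is the speaker label (assumption)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(segments_to_srt([
        {"start": 0.0, "end": 3.25, "speaker": "SPEAKER_00"},
        {"start": 3.5, "end": 8.12, "speaker": "SPEAKER_01"},
    ]))
```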

### FastAPI Server

```bash
# Start the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

# Open the web UI
open http://localhost:8000

# Swagger documentation
open http://localhost:8000/docs
```

### REST API

**POST /diarize**: Upload audio file
```bash
curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.wav" \
  -F "num_speakers=3"
```

**Response:**
```json
{
  "status": "success",
  "num_speakers": 3,
  "audio_duration": 120.5,
  "processing_time": 4.83,
  "sample_rate": 16000,
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "segments": [
    { "start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00" },
    { "start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01" }
  ]
}
```
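
A client will usually want to aggregate the `segments` array. The sketch below sums speaking time per speaker from a response shaped like the one above. The `talk_time` helper is hypothetical, and the sample payload repeats only the fields the helper reads.

```python
import json
from collections import defaultdict
from typing import Dict

# Trimmed copy of the example response (only the fields used below).
EXAMPLE_RESPONSE = """
{
  "status": "success",
  "segments": [
    {"start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00"},
    {"start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01"}
  ]
}
"""


def talk_time(response_text: str) -> Dict[str, float]:
    """Return total speaking time in seconds for each speaker label."""
    payload = json.loads(response_text)
    totals: Dict[str, float] = defaultdict(float)
    for seg in payload["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(total, 3) for speaker, total in totals.items()}


if __name__ == "__main__":
    for speaker, seconds in sorted(talk_time(EXAMPLE_RESPONSE).items()):
        print(f"{speaker}: {seconds:.2f}s")
```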

**GET /health**: Service health
```bash
curl http://localhost:8000/health
# {"status":"healthy","device":"cuda","version":"1.0.0"}
```

### WebSocket Streaming

```python
import asyncio
import json

import websockets  # third-party: pip install websockets


async def stream_audio():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        # Send the stream configuration first
        await ws.send(json.dumps({"sample_rate": 16000, "num_speakers": 2}))

        # Send audio chunks (raw float32 PCM); 4096 bytes = 1024 samples
        with open("audio.raw", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # Signal end of stream
        await ws.send(json.dumps({"type": "eof"}))

        # Receive results until the server reports completion
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "segment":
                seg = data["data"]
                print(f"[{seg['speaker']}] {seg['start']:.2f}s - {seg['end']:.2f}s")
            elif data["type"] == "done":
                break


asyncio.run(stream_audio())
```
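
The streaming client above reads headerless float32 PCM from `audio.raw`. One way to produce such a file from a 16-bit mono WAV with only the standard library is sketched below. The helper name is hypothetical, and the little-endian byte order and [-1, 1) scaling are assumptions about what the endpoint expects.

```python
import array
import sys
import wave


def wav_to_float32_raw(wav_path: str, raw_path: str) -> int:
    """Convert a 16-bit mono PCM WAV file to headerless float32 samples.

    Returns the number of samples written.
    """
    with wave.open(wav_path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        pcm = array.array("h")
        pcm.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    # Scale int16 to floats in [-1, 1) (assumed convention).
    floats = array.array("f", (sample / 32768.0 for sample in pcm))
    if sys.byteorder == "big":
        floats.byteswap()  # write little-endian float32 (assumption)

    with open(raw_path, "wb") as out:
        floats.tofile(out)
    return len(floats)
```

The input should already be at the sample rate announced in the config message, since this sketch does not resample.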

---

## Key Design Decisions

| Component | Choice | Rationale |
|-----------|--------|-----------|
| Speaker Embeddings | ECAPA-TDNN (SpeechBrain) | Pretrained on VoxCeleb; strong speaker-verification accuracy |
| Clustering | AHC + cosine distance | No predefined k required; works well with L2-normalized embeddings |
| k-selection | Silhouette analysis | Unsupervised, parameter-free speaker count estimation |
| VAD | pyannote (energy fallback) | pyannote VAD reduces false embeddings on silence/noise |
| Embedding window | 1.5s, 50% overlap | Balances temporal resolution vs. embedding stability |
| Post-processing | Merge consecutive same-speaker | Reduces over-segmentation artifacts |
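
To make the clustering and k-selection rows concrete, here is a dependency-free sketch of AHC over cosine distances with the speaker count chosen by mean silhouette score. It is not the code in `models/clusterer.py`: average linkage, the `k_min`/`k_max` search range, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec = Sequence[float]


def cosine_distance(a: Vec, b: Vec) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ahc(dist: List[List[float]], k: int) -> List[int]:
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        absorbed = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(absorbed)
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


def silhouette(dist: List[List[float]], labels: List[int]) -> float:
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    n = len(labels)
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


def pick_k(embeddings: List[Vec], k_min: int = 2,
           k_max: int = 6) -> Tuple[Optional[int], Optional[List[int]]]:
    """Try each k in range and keep the labeling with the best silhouette."""
    dist = [[cosine_distance(a, b) for b in embeddings] for a in embeddings]
    best_k, best_labels, best_score = None, None, -2.0
    for k in range(k_min, min(k_max, len(embeddings) - 1) + 1):
        labels = ahc(dist, k)
        score = silhouette(dist, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels
```

Silhouette is only defined for at least two clusters, so a sketch like this cannot report a single speaker on its own. A distance threshold or similar check would be needed for that case.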

---

## Evaluation Metrics

Standard speaker diarization evaluation uses **Diarization Error Rate (DER)**:

```
DER = (Miss + False Alarm + Speaker Error) / Total Speech Duration
```
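
As a worked instance of the formula, the helper below takes the three error durations and the total reference speech time, all in seconds. The function name and the numbers are purely illustrative, not measured results from this system.

```python
def der(miss: float, false_alarm: float, speaker_error: float,
        total_speech: float) -> float:
    """Diarization Error Rate as a fraction of total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (miss + false_alarm + speaker_error) / total_speech


if __name__ == "__main__":
    # Illustrative values: 2.0 s missed speech, 1.5 s false alarm,
    # 4.0 s speaker confusion, over 100 s of reference speech.
    print(f"DER = {der(2.0, 1.5, 4.0, 100.0):.1%}")  # DER = 7.5%
```

Because false alarms count toward the numerator but not the denominator, DER can exceed 100%.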

Export RTTM files for evaluation with `md-eval` or `dscore`:
```bash
python demo.py --audio test.wav --rttm hypothesis.rttm
python score.py -r reference.rttm -s hypothesis.rttm  # run from a dscore checkout
```
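
For reference, each speaker turn in an RTTM file is one `SPEAKER` line with ten space-separated fields. The sketch below formats a turn that way. The `rttm_line` helper and the default channel of `1` are assumptions, and the output of `demo.py --rttm` may differ in details such as decimal precision.

```python
def rttm_line(file_id: str, start: float, end: float, speaker: str,
              channel: int = 1) -> str:
    """Format one speaker turn as an RTTM SPEAKER record.

    Fields: type, file id, channel, onset, duration, orthography,
    speaker type, speaker name, confidence, lookahead (unused ones are <NA>).
    """
    return (
        f"SPEAKER {file_id} {channel} {start:.3f} {end - start:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


if __name__ == "__main__":
    print(rttm_line("test", 0.0, 3.25, "SPEAKER_00"))
    print(rttm_line("test", 3.5, 8.12, "SPEAKER_01"))
```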

---

## Running Tests

```bash
pytest tests/ -v
pytest tests/ -v -k "clusterer"  # run only tests whose names match "clusterer"
```

---

## Limitations & Future Work

- Audio longer than about 1 hour should be processed in chunks (`utils.audio.chunk_audio`)
- Real-time streaming needs a low-latency VAD, which the WebSocket endpoint does not implement yet
- Overlapping speech (cross-talk) is assigned to a single speaker
- Fine-tuning ECAPA-TDNN on domain-specific data may help for call analytics