ConvxO2 committed on
Commit 9c441b1 · 1 Parent(s): 8d04859

Rewrite README with clear setup, deployment, and troubleshooting

Files changed (1): README.md (+141, -183)
README.md CHANGED
@@ -1,250 +1,208 @@
- ---
  title: Who Spoke When
- emoji: 🎙️
  colorFrom: blue
- colorTo: purple
  sdk: docker
  app_file: app/main.py
  pinned: false
  ---

- # 🎙 Speaker Diarization System
- ### *Who Spoke When — Multi-Speaker Audio Segmentation*

- > **Tech Stack:** Python · PyTorch · SpeechBrain · Pyannote.audio · Transformers · FastAPI

  ---

- ## Architecture
-
- ```
- Audio Input
-      │
-      ▼
- ┌────────────────────────────┐
- │ Voice Activity Detection   │ ← pyannote/voice-activity-detection
- │ (VAD)                      │   fallback: energy-based VAD
- └─────────────┬──────────────┘
-               │ speech regions (start, end)
-               ▼
- ┌────────────────────────────┐
- │ Sliding Window Segmentation│ ← 1.5s windows, 50% overlap
- │                            │
- └─────────────┬──────────────┘
-               │ segment list
-               ▼
- ┌────────────────────────────┐
- │ ECAPA-TDNN Embedding       │ ← speechbrain/spkrec-ecapa-voxceleb
- │ Extraction                 │   192-dim L2-normalized vectors
- └─────────────┬──────────────┘
-               │ embeddings (N × 192)
-               ▼
- ┌────────────────────────────┐
- │ Agglomerative Hierarchical │ ← cosine distance metric
- │ Clustering (AHC)           │   silhouette-based auto k-selection
- └─────────────┬──────────────┘
-               │ speaker labels
-               ▼
- ┌────────────────────────────┐
- │ Post-processing            │ ← merge consecutive same-speaker segs
- │ & Output Formatting        │   timestamped JSON / RTTM / SRT
- └────────────────────────────┘
- ```

  ---

  ## Project Structure
-
- ```
- speaker-diarization/
- ├── app/
- │   ├── main.py               # FastAPI app — REST + WebSocket endpoints
- │   └── pipeline.py           # Core end-to-end diarization pipeline
- ├── models/
- │   ├── embedder.py           # ECAPA-TDNN speaker embedding extractor
- │   └── clusterer.py          # Agglomerative Hierarchical Clustering (AHC)
- ├── utils/
- │   └── audio.py              # Audio loading, chunking, RTTM/SRT export
- ├── tests/
- │   └── test_diarization.py   # Unit + integration tests
- ├── static/
- │   └── index.html            # Web demo UI
- ├── demo.py                   # CLI interface
- └── requirements.txt
  ```

  ---

- ## Installation

- ```bash
- # 1. Clone / navigate to project
- cd speaker-diarization

- # 2. Create virtual environment
  python -m venv .venv
- source .venv/bin/activate   # Windows: .venv\Scripts\activate

- # 3. Install dependencies
  pip install -r requirements.txt

- # 4. (Optional) Set HuggingFace token for pyannote VAD
- # Accept terms at: https://huggingface.co/pyannote/voice-activity-detection
- export HF_TOKEN=your_token_here
  ```

  ---

- ## Usage

- ### CLI Demo

- ```bash
- # Basic usage (auto-detect speaker count)
- python demo.py --audio meeting.wav

- # Specify 3 speakers
- python demo.py --audio call.wav --speakers 3

- # Export all formats
- python demo.py --audio audio.mp3 \
-     --output result.json \
-     --rttm output.rttm \
-     --srt subtitles.srt
  ```

- **Example output:**
- ```
- ✅ Done in 4.83s
- Speakers found : 3
- Audio duration : 120.50s
- Segments       : 42
-
- START    END      DUR    SPEAKER
- ────────────────────────────────────
- 0.000    3.250    3.250  SPEAKER_00
- 3.500    8.120    4.620  SPEAKER_01
- 8.200    11.800   3.600  SPEAKER_00
- 12.000   17.340   5.340  SPEAKER_02
- ...
- ```

- ### FastAPI Server

- ```bash
- # Start the API server
- uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

- # Open the web UI
- open http://localhost:8000

- # Swagger documentation
- open http://localhost:8000/docs
- ```

- ### REST API

- **POST /diarize** — Upload audio file
  ```bash
  curl -X POST http://localhost:8000/diarize \
-   -F "file=@meeting.wav" \
-   -F "num_speakers=3"
  ```

- **Response:**
- ```json
- {
-   "status": "success",
-   "num_speakers": 3,
-   "audio_duration": 120.5,
-   "processing_time": 4.83,
-   "sample_rate": 16000,
-   "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
-   "segments": [
-     { "start": 0.000, "end": 3.250, "duration": 3.250, "speaker": "SPEAKER_00" },
-     { "start": 3.500, "end": 8.120, "duration": 4.620, "speaker": "SPEAKER_01" }
-   ]
- }
- ```

- **GET /health** — Service health
  ```bash
- curl http://localhost:8000/health
- # {"status":"healthy","device":"cuda","version":"1.0.0"}
  ```

- ### WebSocket Streaming
-
- ```python
- import asyncio, json, websockets
-
- async def stream_audio():
-     async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
-         # Send config
-         await ws.send(json.dumps({"sample_rate": 16000, "num_speakers": 2}))
-
-         # Send audio chunks (raw float32 PCM)
-         with open("audio.raw", "rb") as f:
-             while chunk := f.read(4096):
-                 await ws.send(chunk)
-
-         # Signal end
-         await ws.send(json.dumps({"type": "eof"}))
-
-         # Receive results
-         async for msg in ws:
-             data = json.loads(msg)
-             if data["type"] == "segment":
-                 print(f"[{data['data']['speaker']}] {data['data']['start']:.2f}s – {data['data']['end']:.2f}s")
-             elif data["type"] == "done":
-                 break
-
- asyncio.run(stream_audio())
  ```

  ---

- ## Key Design Decisions

- | Component | Choice | Rationale |
- |-----------|--------|-----------|
- | Speaker Embeddings | ECAPA-TDNN (SpeechBrain) | State-of-the-art speaker verification accuracy on VoxCeleb |
- | Clustering | AHC + cosine distance | No predefined k required; works well with L2-normalized embeddings |
- | k-selection | Silhouette analysis | Unsupervised, parameter-free speaker count estimation |
- | VAD | pyannote (energy fallback) | pyannote VAD reduces false embeddings on silence/noise |
- | Embedding window | 1.5s, 50% overlap | Balances temporal resolution vs. embedding stability |
- | Post-processing | Merge consecutive same-speaker segments | Reduces over-segmentation artifacts |

  ---

- ## Evaluation Metrics

- Standard speaker diarization evaluation uses **Diarization Error Rate (DER)**:

- ```
- DER = (Miss + False Alarm + Speaker Error) / Total Speech Duration
- ```

- Export RTTM files for evaluation with `md-eval` or `dscore`:
- ```bash
- python demo.py --audio test.wav --rttm hypothesis.rttm
- dscore -r reference.rttm -s hypothesis.rttm
- ```

- ---

- ## Running Tests

- ```bash
- pytest tests/ -v
- pytest tests/ -v -k "clusterer"   # run tests matching a keyword
- ```

  ---

- ## Limitations & Future Work

- - Long audio (>1 hr) should use chunked processing (`utils.audio.chunk_audio`)
- - Real-time streaming requires low-latency VAD (not yet implemented in the WS endpoint)
- - Speaker overlap (cross-talk) is assigned to a single speaker
- - Consider fine-tuning ECAPA-TDNN on domain-specific data for call analytics
+ ---
  title: Who Spoke When
+ emoji: '🎙️'
  colorFrom: blue
+ colorTo: cyan
  sdk: docker
  app_file: app/main.py
  pinned: false
  ---

+ # Who Spoke When
+ Speaker diarization service and web app: upload audio and get **who spoke when** segments.

+ The project now runs with a **hybrid pipeline**:
+ - Preferred: `pyannote/speaker-diarization-3.1` (best quality)
+ - Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering

  ---

+ ## What You Get
+ - FastAPI backend (`/diarize`, `/diarize/url`, `/health`)
+ - Web UI (`/`) for file upload and a timeline view
+ - CLI demo (`demo.py`)
+ - Automatic fallback if the pyannote models are unavailable

  ---

  ## Project Structure
+ ```text
+ app/
+   main.py           FastAPI app and endpoints
+   pipeline.py       Hybrid diarization pipeline
+ models/
+   embedder.py       ECAPA-TDNN embedding extractor
+   clusterer.py      Speaker clustering logic
+ utils/
+   audio.py          Audio and export helpers
+ static/
+   index.html        Web UI
+ Dockerfile
+ requirements.txt
+ README.md
  ```

  ---

+ ## Quick Start (Local)

+ ### 1. Create and activate a virtual environment
+
+ Windows PowerShell:
+ ```powershell
+ python -m venv .venv
+ .\.venv\Scripts\Activate.ps1
+ ```

+ Linux/macOS:
+ ```bash
  python -m venv .venv
+ source .venv/bin/activate
+ ```

+ ### 2. Install dependencies
+ ```bash
  pip install -r requirements.txt
+ ```
+
+ ### 3. (Recommended) Set a Hugging Face token
+ `pyannote` models are gated. Create a token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

+ Windows PowerShell:
+ ```powershell
+ $env:HF_TOKEN="your_token_here"
+ ```
+
+ Linux/macOS:
+ ```bash
+ export HF_TOKEN="your_token_here"
  ```

+ ### 4. Run the API server
+ ```bash
+ uvicorn app.main:app --host 0.0.0.0 --port 8000
+ ```
+
+ Open:
+ - UI: `http://localhost:8000`
+ - API docs: `http://localhost:8000/docs`
+
  ---

+ ## Web UI Notes
+ - The UI now defaults to the **same-origin** API (`/diarize`), so it works on Hugging Face Spaces.
+ - If you manually set a custom endpoint, make sure it allows CORS and is reachable from the browser.

+ ---

+ ## Hugging Face Spaces Deployment

+ ### Requirements
+ 1. A Space created with the Docker SDK
+ 2. The Space secret `HF_TOKEN` configured
+ 3. Terms accepted for:
+    - [https://huggingface.co/pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection)
+    - [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

+ ### Push code
+ Push the `main` branch to your Space repo remote:
+ ```bash
+ git push huggingface main
  ```

+ If the push fails with an authorization error:
+ - Use a token with the **Write** role (not Read)
+ - Confirm the token owner has access to the target namespace

+ ---

+ ## API

+ ### `GET /health`
+ Returns service health and the compute device.

+ ### `POST /diarize`
+ Upload an audio file.

+ Form fields:
+ - `file`: the audio file
+ - `num_speakers` (optional): force a known number of speakers

+ Example:
  ```bash
  curl -X POST http://localhost:8000/diarize \
+   -F "file=@meeting.mp3" \
+   -F "num_speakers=2"
  ```

+ ### `POST /diarize/url`
+ Diarize audio from a remote URL.

+ Example:
  ```bash
+ curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"
  ```
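The quotes in the curl command matter because `audio_url` must survive as a single query parameter. The same request can be prepared from Python with the standard library; this sketch only builds the request object (nothing is sent), and `build_diarize_url_request` is an illustrative helper, not project code:

```python
import urllib.parse
import urllib.request

def build_diarize_url_request(base, audio_url):
    """Build a POST request for /diarize/url, URL-encoding the
    audio_url query parameter."""
    query = urllib.parse.urlencode({"audio_url": audio_url})
    return urllib.request.Request(f"{base}/diarize/url?{query}", method="POST")

req = build_diarize_url_request("http://localhost:8000", "https://example.com/sample.wav")
print(req.method, req.full_url)
```

Sending it is then `urllib.request.urlopen(req)` against a running server.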

+ ---
+
+ ## CLI Usage
+ ```bash
+ python demo.py --audio meeting.wav
+ python demo.py --audio meeting.wav --speakers 2
+ python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
  ```
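The `--rttm` output follows the standard RTTM layout: one `SPEAKER` record per segment, with start time and duration in seconds. A minimal formatter sketch, assuming segments as `(start, end, speaker)` tuples (`to_rttm` is a hypothetical helper, not the project's actual export code):

```python
def to_rttm(segments, file_id="audio"):
    # RTTM record: SPEAKER <file> <chan> <start> <dur> <NA> <NA> <speaker> <NA> <NA>
    lines = [
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> {speaker} <NA> <NA>"
        for start, end, speaker in segments
    ]
    return "\n".join(lines)

print(to_rttm([(0.0, 3.25, "SPEAKER_00"), (3.5, 8.12, "SPEAKER_01")]))
```

Files in this shape can be scored directly against a reference with tools such as `dscore`.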

  ---

+ ## Configuration (Environment Variables)

+ | Variable | Default | Description |
+ |---|---|---|
+ | `HF_TOKEN` | unset | Hugging Face token for gated pyannote models |
+ | `CACHE_DIR` | temp model cache path | Model download/cache directory |
+ | `USE_PYANNOTE_DIARIZATION` | `true` | Try full pyannote diarization first |
+ | `PYANNOTE_DIARIZATION_MODEL` | `pyannote/speaker-diarization-3.1` | pyannote diarization model ID |
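Resolving these variables might look like the following sketch (`load_config` and the dictionary keys are illustrative, not the app's actual names):

```python
import os

def load_config():
    """Resolve the environment variables from the table above,
    with the same defaults."""
    return {
        "hf_token": os.getenv("HF_TOKEN"),    # None when unset: gated models stay unavailable
        "cache_dir": os.getenv("CACHE_DIR"),  # None -> fall back to a temp cache path
        "use_pyannote": os.getenv("USE_PYANNOTE_DIARIZATION", "true").lower() == "true",
        "model_id": os.getenv("PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"),
    }

cfg = load_config()
```

Note that any value of `USE_PYANNOTE_DIARIZATION` other than `true` (case-insensitive) disables the pyannote-first path in this sketch.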
 
 

  ---

+ ## How the Pipeline Works
+ 1. Load and normalize the audio
+ 2. Try full pyannote diarization (best quality)
+ 3. If that is unavailable or fails, fall back to:
+    - VAD (pyannote VAD or energy-based VAD)
+    - Sliding windows
+    - ECAPA embeddings
+    - Agglomerative clustering
+ 4. Merge adjacent same-speaker segments
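Step 4 is a small pure function. A sketch, assuming segments arrive sorted as `(start, end, speaker)` tuples (`merge_adjacent` and the 0.5 s gap tolerance are illustrative, not the project's exact implementation):

```python
def merge_adjacent(segments, max_gap=0.5):
    """Merge consecutive segments that share a speaker and are separated
    by at most `max_gap` seconds of silence (step 4 above)."""
    merged = []
    for start, end, speaker in segments:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            last = merged[-1]
            merged[-1] = (last[0], end, speaker)  # extend the previous segment
        else:
            merged.append((start, end, speaker))
    return merged
```

This is what turns many short overlapping windows into the readable per-speaker turns shown in the API response.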

+ ---

+ ## Troubleshooting

+ ### 1) UI shows `Error: Failed to fetch`
+ The API endpoint is likely wrong. Use the same-origin `/diarize` endpoint in the deployed UI.

+ ### 2) Logs show pyannote download/auth warnings
+ You need:
+ - a valid `HF_TOKEN`
+ - accepted model terms on both pyannote model pages

+ ### 3) Poor speaker separation
+ - Provide `num_speakers` when it is known
+ - Ensure clean audio (minimal background noise)
+ - Prefer the pyannote path (set the token and accept the terms)

+ ### 4) `500` during embedding load
+ This is usually a model download, cache, or auth mismatch. Confirm `HF_TOKEN`, write access to the cache path, and internet connectivity.
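For the cache part of issue 4, a quick local check can rule out a permissions problem (`cache_dir_writable` is an illustrative helper; `CACHE_DIR` is the variable from the configuration table):

```python
import os
import tempfile

def cache_dir_writable(path=None):
    """Return True if the model cache directory accepts writes."""
    path = path or os.getenv("CACHE_DIR") or tempfile.gettempdir()
    probe = os.path.join(path, ".write_probe")
    try:
        with open(probe, "w") as f:
            f.write("ok")
        os.remove(probe)
        return True
    except OSError:
        return False

print(cache_dir_writable())
```

If this prints `False`, point `CACHE_DIR` at a directory the service user can write to.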
 
 

  ---

+ ## Limitations
+ - Overlapped speech may still be handled imperfectly in fallback mode
+ - Quality depends on audio clarity, language mix, and noise
+ - Very short utterances are harder to classify reliably
+
+ ---

+ ## License
+ Add your preferred license file (`LICENSE`) if this project is public.