---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app/main.py
pinned: false
---
# Who Spoke When
Speaker diarization service and web app: upload audio and get **who spoke when** segments.
The project now runs with a **hybrid pipeline**:
- Preferred: `pyannote/speaker-diarization-3.1` (best quality)
- Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering
---
## What You Get
- FastAPI backend (`/diarize`, `/diarize/url`, `/health`)
- Web UI (`/`) for file upload and timeline view
- CLI demo (`demo.py`)
- Automatic fallback if pyannote models are unavailable
---
## Project Structure
```text
app/
  main.py          FastAPI app and endpoints
  pipeline.py      Hybrid diarization pipeline
models/
  embedder.py      ECAPA-TDNN embedding extractor
  clusterer.py     Speaker clustering logic
utils/
  audio.py         Audio and export helpers
static/
  index.html       Web UI
Dockerfile
requirements.txt
README.md
```
---
## Quick Start (Local)
### 1. Create and activate a virtual environment
Windows PowerShell:
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```
Linux/macOS:
```bash
python -m venv .venv
source .venv/bin/activate
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
### 3. (Recommended) Set Hugging Face token
`pyannote` models are gated. Create a token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
Windows PowerShell:
```powershell
$env:HF_TOKEN="your_token_here"
```
Linux/macOS:
```bash
export HF_TOKEN="your_token_here"
```
### 4. Run API server
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```
Open:
- UI: `http://localhost:8000`
- API docs: `http://localhost:8000/docs`
---
## Web UI Notes
- The UI now defaults to **same-origin** API (`/diarize`), so it works on Hugging Face Spaces.
- If you manually set a custom endpoint, make sure it allows CORS and is reachable from the browser.
---
## Hugging Face Spaces Deployment
### Requirements
1. Space created (Docker SDK)
2. Space secret `HF_TOKEN` configured
3. Terms accepted for:
   - [https://huggingface.co/pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection)
   - [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
### Push code
Push `main` branch to your Space repo remote:
```bash
git push huggingface main
```
If the push fails with an unauthorized error:
- Use a token with **Write** role (not Read)
- Confirm token owner has access to the target namespace
---
## API
### `GET /health`
Returns service health and device.
### `POST /diarize`
Upload an audio file.
Form fields:
- `file`: audio file
- `num_speakers` (optional): force known number of speakers
Example:
```bash
curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.mp3" \
  -F "num_speakers=2"
```
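The same upload can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_diarize_request` and `diarize_file` are hypothetical, and it assumes the endpoint responds with JSON (the response schema is not documented here).

```python
import json
import os
import urllib.request
import uuid


def build_diarize_request(base_url, filename, audio_bytes, num_speakers=None):
    """Build (but do not send) a multipart/form-data POST for /diarize."""
    boundary = uuid.uuid4().hex
    parts = []

    # Optional form field: num_speakers
    if num_speakers is not None:
        parts.append(
            (f"--{boundary}\r\n"
             'Content-Disposition: form-data; name="num_speakers"\r\n\r\n'
             f"{num_speakers}\r\n").encode()
        )

    # Required form field: file
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode()
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        f"{base_url}/diarize",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def diarize_file(base_url, path, num_speakers=None):
    """Upload a local audio file and return the parsed response (assumed JSON)."""
    with open(path, "rb") as f:
        req = build_diarize_request(
            base_url, os.path.basename(path), f.read(), num_speakers
        )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running locally, `diarize_file("http://localhost:8000", "meeting.mp3", num_speakers=2)` mirrors the curl call above.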
### `POST /diarize/url`
Diarize audio from a remote URL.
Example:
```bash
curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"
```
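If the audio URL carries its own query string (for example a signed link), it should be percent-encoded before being passed as `audio_url`, otherwise its `&` and `?` characters can be read as part of the outer request. A small sketch, with `build_diarize_url` as a hypothetical helper name:

```python
from urllib.parse import urlencode


def build_diarize_url(base_url, audio_url):
    """Return the /diarize/url request URL with audio_url safely percent-encoded."""
    return f"{base_url}/diarize/url?" + urlencode({"audio_url": audio_url})


# Example with a placeholder signed link
print(build_diarize_url("http://localhost:8000", "https://example.com/a.wav?sig=1&x=2"))
```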
---
## CLI Usage
```bash
python demo.py --audio meeting.wav
python demo.py --audio meeting.wav --speakers 2
python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
```
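For orientation, the sketch below shows how diarization segments are commonly serialized to RTTM and SRT. It is not taken from `demo.py`, whose exact output is not shown here; the `(start, end, speaker)` tuple shape, the function names, and the `file_id` default are assumptions for illustration.

```python
def to_rttm(segments, file_id="meeting"):
    """Format (start_sec, end_sec, speaker) tuples as RTTM SPEAKER lines."""
    return "\n".join(
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> {speaker} <NA> <NA>"
        for start, end, speaker in segments
    )


def srt_timestamp(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Format segments as SRT cues, using the speaker label as the cue text."""
    cues = []
    for index, (start, end, speaker) in enumerate(segments, 1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}\n"
        )
    return "\n".join(cues)
```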
---
## Configuration (Environment Variables)
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | unset | Hugging Face token for gated pyannote models |
| `CACHE_DIR` | temp model cache path | Model download/cache directory |
| `USE_PYANNOTE_DIARIZATION` | `true` | Enable full pyannote diarization first |
| `PYANNOTE_DIARIZATION_MODEL` | `pyannote/speaker-diarization-3.1` | pyannote diarization model id |
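
As an illustration of how these variables might be read at startup, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical, the cache directory default is a placeholder (the table only says "temp model cache path"), and the accepted truthy spellings for `USE_PYANNOTE_DIARIZATION` are an assumption.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    hf_token: Optional[str]
    cache_dir: str
    use_pyannote: bool
    pyannote_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, applying the defaults from the table."""
    env = os.environ if env is None else env
    # Placeholder default: the project's real cache path is not documented here.
    default_cache = os.path.join(tempfile.gettempdir(), "model_cache")
    # Assumption: these spellings count as "enabled".
    flag = env.get("USE_PYANNOTE_DIARIZATION", "true").strip().lower()
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,
        cache_dir=env.get("CACHE_DIR", default_cache),
        use_pyannote=flag in {"1", "true", "yes", "on"},
        pyannote_model=env.get(
            "PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"
        ),
    )
```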
---
## How the Pipeline Works
1. Load and normalize audio
2. Try full pyannote diarization (best quality)
3. If pyannote is unavailable or fails, fall back to:
   - VAD (pyannote VAD or energy VAD)
   - Sliding windows
   - ECAPA embeddings
   - Agglomerative clustering
4. Merge adjacent same-speaker segments
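
The control flow of steps 2 to 4 can be sketched as follows. This is an illustration, not the code in `pipeline.py`: the function names, the `(start, end, speaker)` segment shape, and the `max_gap` threshold are assumptions, and the two diarizers are passed in as plain callables.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, speaker_label)


def diarize_with_fallback(
    audio,
    primary: Callable[[object], List[Segment]],
    fallback: Callable[[object], List[Segment]],
) -> List[Segment]:
    """Try the preferred diarizer first; use the fallback if it raises."""
    try:
        return primary(audio)
    except Exception:
        return fallback(audio)


def merge_adjacent(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Merge consecutive segments of the same speaker separated by at most max_gap seconds."""
    merged: List[Segment] = []
    for start, end, speaker in sorted(segments):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged
```

Passing the diarizers as callables keeps the fallback decision in one place, so the merge step runs the same way regardless of which path produced the segments.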
---
## Troubleshooting
### 1) UI shows `Error: Failed to fetch`
The API endpoint is likely wrong. Use the same-origin `/diarize` endpoint in the deployed UI.
### 2) Logs show pyannote download/auth warnings
You need:
- valid `HF_TOKEN`
- accepted model terms on both pyannote model pages
### 3) Poor speaker separation
- Provide `num_speakers` when known
- Ensure clean audio (minimal background noise)
- Prefer pyannote path (set token + accept terms)
### 4) `500` during embedding load
This is usually a model download, cache, or auth problem. Confirm that `HF_TOKEN` is set, the cache path is writable, and the host has internet connectivity.
---
## Limitations
- Overlapping speech may still be handled imperfectly in fallback mode
- Quality depends on audio clarity, language mix, and noise
- Very short utterances are harder to classify reliably
---
## License
Add your preferred license file (`LICENSE`) if this project is public.