---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app/main.py
pinned: false
---
# Who Spoke When

Speaker diarization service and web app: upload audio and get **who spoke when** segments.

The project runs a **hybrid pipeline**:

- Preferred: `pyannote/speaker-diarization-3.1` (best quality)
- Fallback: voice activity detection (VAD) + ECAPA-TDNN embeddings + agglomerative clustering
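
The preferred-then-fallback behaviour can be pictured with a minimal sketch. It is not the code in `app/pipeline.py`: the function `diarize_with_fallback`, the two stand-in backends, and the segment keys (`start`, `end`, `speaker`) are all hypothetical names chosen for illustration, and the real pipeline may detect failures differently.

```python
from typing import Callable, Dict, List, Tuple

Segment = Dict[str, object]


def diarize_with_fallback(
    audio_path: str,
    preferred: Callable[[str], List[Segment]],
    fallback: Callable[[str], List[Segment]],
) -> Tuple[str, List[Segment]]:
    """Try the preferred backend first; use the fallback if it raises."""
    try:
        return "preferred", preferred(audio_path)
    except Exception as exc:  # assumed failure mode: missing token, gated model
        print(f"preferred backend unavailable ({exc}); using fallback")
        return "fallback", fallback(audio_path)


# Hypothetical stand-ins for the two backends.
def fake_pyannote(path: str) -> List[Segment]:
    raise RuntimeError("model not accessible")


def fake_clustering(path: str) -> List[Segment]:
    return [{"start": 0.0, "end": 1.5, "speaker": "SPEAKER_00"}]


backend, segments = diarize_with_fallback("meeting.wav", fake_pyannote, fake_clustering)
print(backend, segments)
```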

---

## What You Get

- FastAPI backend (`/diarize`, `/diarize/url`, `/health`)
- Web UI (`/`) for file upload and timeline view
- CLI demo (`demo.py`)
- Automatic fallback if pyannote models are unavailable

---
## Project Structure

```text
app/
  main.py          FastAPI app and endpoints
  pipeline.py      Hybrid diarization pipeline
models/
  embedder.py      ECAPA-TDNN embedding extractor
  clusterer.py     Speaker clustering logic
utils/
  audio.py         Audio and export helpers
static/
  index.html       Web UI
Dockerfile
requirements.txt
README.md
```

---
## Quick Start (Local)

### 1. Create and activate a virtual environment

Windows PowerShell:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```

Linux/macOS:

```bash
python -m venv .venv
source .venv/bin/activate
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. (Recommended) Set Hugging Face token

`pyannote` models are gated. Create a token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

Windows PowerShell:

```powershell
$env:HF_TOKEN="your_token_here"
```

Linux/macOS:

```bash
export HF_TOKEN="your_token_here"
```

### 4. Run API server

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Open:

- UI: `http://localhost:8000`
- API docs: `http://localhost:8000/docs`

---
## Web UI Notes

- The UI defaults to the **same-origin** API (`/diarize`), so it works on Hugging Face Spaces without extra configuration.
- If you set a custom endpoint manually, make sure it allows CORS and is reachable from the browser.

---
## Hugging Face Spaces Deployment

### Requirements

1. Space created (Docker SDK)
2. Space secret `HF_TOKEN` configured
3. Terms accepted for:
   - [https://huggingface.co/pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection)
   - [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

### Push code

Push the `main` branch to your Space repo remote (named `huggingface` in this example):

```bash
git push huggingface main
```

If the push fails with an unauthorized error:

- Use a token with the **Write** role (not Read)
- Confirm that the token owner has access to the target namespace

---
## API

### `GET /health`

Returns service health and device.

### `POST /diarize`

Upload an audio file.

Form fields:

- `file`: audio file
- `num_speakers` (optional): force known number of speakers

Example:

```bash
curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.mp3" \
  -F "num_speakers=2"
```
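
For callers that prefer Python over `curl`, the same multipart upload can be sketched with the standard library alone. The helper names `encode_multipart` and `build_diarize_request` are hypothetical, and the response format is not documented here, so the sketch only builds the request and leaves sending it as a comment.

```python
import urllib.request
import uuid
from typing import Dict, Optional, Tuple


def encode_multipart(fields: Dict[str, str], filename: str, data: bytes) -> Tuple[bytes, str]:
    """Build a multipart/form-data body: text fields plus one 'file' part."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_diarize_request(
    base_url: str, filename: str, data: bytes, num_speakers: Optional[int] = None
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /diarize."""
    fields = {} if num_speakers is None else {"num_speakers": str(num_speakers)}
    body, content_type = encode_multipart(fields, filename, data)
    return urllib.request.Request(
        f"{base_url}/diarize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for real audio read from disk.
request = build_diarize_request(
    "http://localhost:8000", "meeting.mp3", b"fake-audio-bytes", num_speakers=2
)
# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

With the third-party `requests` package, the same call collapses to a single `requests.post(...)` with `files=` and `data=` arguments; the standard-library version is shown only to avoid an extra dependency.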
### `POST /diarize/url`

Diarize audio from a remote URL.

Example:

```bash
curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"
```
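
When the audio URL carries its own query string, it should be percent-encoded before being passed as `audio_url`, otherwise characters such as `&` would be read as separators of the outer request. A minimal sketch using the standard library follows; the helper name `build_url_diarize_request` and the sample URL are illustrative assumptions.

```python
import urllib.parse
import urllib.request


def build_url_diarize_request(base_url: str, audio_url: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /diarize/url with an encoded audio_url."""
    query = urllib.parse.urlencode({"audio_url": audio_url})
    return urllib.request.Request(f"{base_url}/diarize/url?{query}", method="POST")


source = "https://example.com/sample.wav?token=a&b=1"
request = build_url_diarize_request("http://localhost:8000", source)
print(request.full_url)
```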

---

## CLI Usage

```bash
python demo.py --audio meeting.wav
python demo.py --audio meeting.wav --speakers 2
python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
```
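
For orientation, RTTM is a plain-text format with one `SPEAKER` line per segment, giving onset and duration in seconds. The sketch below shows how a list of segments could be serialized that way. It is not taken from `demo.py`: the function name `segments_to_rttm`, the segment keys, and the three-decimal rounding are assumptions.

```python
from typing import Dict, List


def segments_to_rttm(segments: List[Dict], file_id: str) -> str:
    """Serialize segments as RTTM: SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <name> <NA> <NA>."""
    lines = []
    for seg in segments:
        start = float(seg["start"])
        duration = float(seg["end"]) - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} <NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines) + "\n"


example_segments = [
    {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 4.0, "speaker": "SPEAKER_01"},
]
rttm_text = segments_to_rttm(example_segments, "meeting")
print(rttm_text, end="")
```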

---

## Configuration (Environment Variables)

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | unset | Hugging Face token for gated pyannote models |
| `CACHE_DIR` | temp model cache path | Model download/cache directory |
| `USE_PYANNOTE_DIARIZATION` | `true` | Enable full pyannote diarization first |
| `PYANNOTE_DIARIZATION_MODEL` | `pyannote/speaker-diarization-3.1` | pyannote diarization model id |
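
As a rough illustration of how these variables and their defaults could be read, here is a small sketch. The `Settings` dataclass, the `load_settings` helper, the set of accepted truthy strings, and the `model_cache` folder name are all assumptions; the application may parse its environment differently.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    hf_token: Optional[str]
    cache_dir: str
    use_pyannote: bool
    pyannote_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the defaults from the table."""
    truthy = {"1", "true", "yes", "on"}  # assumed spellings for "enabled"
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,
        # Assumed default location: a folder inside the system temp directory.
        cache_dir=env.get("CACHE_DIR") or os.path.join(tempfile.gettempdir(), "model_cache"),
        use_pyannote=env.get("USE_PYANNOTE_DIARIZATION", "true").strip().lower() in truthy,
        pyannote_model=env.get("PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"),
    )


# Passing a plain dict instead of os.environ makes the behaviour easy to check.
settings = load_settings({"USE_PYANNOTE_DIARIZATION": "false"})
print(settings)
```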

---

## How the Pipeline Works

1. Load and normalize audio
2. Try full pyannote diarization (best quality)
3. If pyannote is unavailable or fails, fall back to:
   - VAD (pyannote VAD or energy VAD)
   - Sliding windows
   - ECAPA embeddings
   - Agglomerative clustering
4. Merge adjacent same-speaker segments
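
Two of the steps above, sliding windows and the final merge, are simple enough to sketch in plain Python. This is an illustration under stated assumptions, not the project's implementation: the function names, the segment keys, and the values for `window`, `step`, and `max_gap` are all hypothetical.

```python
from typing import Dict, List, Tuple


def sliding_windows(
    start: float, end: float, window: float = 1.5, step: float = 0.75
) -> List[Tuple[float, float]]:
    """Split one speech region into overlapping fixed-length windows."""
    windows = []
    t = start
    while t + window <= end:
        windows.append((t, t + window))
        t += step
    if not windows:  # region shorter than one window: keep it whole
        windows.append((start, end))
    return windows


def merge_adjacent(segments: List[Dict], max_gap: float = 0.3) -> List[Dict]:
    """Merge consecutive segments of the same speaker separated by at most max_gap seconds."""
    merged: List[Dict] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        last = merged[-1] if merged else None
        if last and last["speaker"] == seg["speaker"] and seg["start"] - last["end"] <= max_gap:
            last["end"] = max(last["end"], seg["end"])
        else:
            merged.append(dict(seg))  # copy so the input list is not mutated
    return merged


windows = sliding_windows(0.0, 3.0)
labeled = [
    {"start": 0.0, "end": 1.5, "speaker": "A"},
    {"start": 0.75, "end": 2.25, "speaker": "A"},
    {"start": 2.25, "end": 3.75, "speaker": "B"},
    {"start": 3.9, "end": 5.0, "speaker": "B"},
]
merged = merge_adjacent(labeled)
print(windows)
print(merged)
```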

---

## Troubleshooting

### 1) UI shows `Error: Failed to fetch`

The API endpoint is most likely wrong. Use the same-origin `/diarize` endpoint in the deployed UI.

### 2) Logs show pyannote download/auth warnings

You need:

- a valid `HF_TOKEN`
- accepted model terms on both pyannote model pages

### 3) Poor speaker separation

- Provide `num_speakers` when known
- Ensure clean audio (minimal background noise)
- Prefer the pyannote path (set the token and accept the model terms)

### 4) `500` during embedding load

This is usually a model download, cache, or authentication problem. Confirm that `HF_TOKEN` is set, that the cache path is writable, and that the host has internet connectivity.
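
The first two of those checks can be automated with a short diagnostic. This is a sketch only: `check_model_prerequisites` is a hypothetical helper, it falls back to the system temp directory when `CACHE_DIR` is unset (an assumption), and it does not test connectivity or whether the token is actually valid.

```python
import os
import tempfile
from typing import Dict, Mapping


def check_model_prerequisites(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report whether HF_TOKEN is set and whether the cache directory is writable."""
    cache_dir = env.get("CACHE_DIR") or tempfile.gettempdir()
    report = {"hf_token_set": bool(env.get("HF_TOKEN")), "cache_dir_writable": False}
    try:
        os.makedirs(cache_dir, exist_ok=True)
        # Creating a temporary file exercises real write permission.
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
        report["cache_dir_writable"] = True
    except OSError:
        pass
    return report


# Example with an explicit environment: no token, freshly created cache directory.
report = check_model_prerequisites({"HF_TOKEN": "", "CACHE_DIR": tempfile.mkdtemp()})
print(report)
```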

---

## Limitations

- Overlapping speech may still be handled imperfectly in fallback mode
- Quality depends on audio clarity, language mix, and noise
- Very short utterances are harder to attribute to a speaker reliably

---

## License

Add your preferred license file (`LICENSE`) if this project is public.