---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app/main.py
pinned: false
---
# Who Spoke When
Speaker diarization service and web app: upload audio and get **who spoke when** segments.
The project now runs with a **hybrid pipeline**:
- Preferred: `pyannote/speaker-diarization-3.1` (best quality)
- Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering
---
## What You Get
- FastAPI backend (`/diarize`, `/diarize/url`, `/health`)
- Web UI (`/`) for file upload and timeline view
- CLI demo (`demo.py`)
- Automatic fallback if pyannote models are unavailable
---
## Project Structure
```text
app/
  main.py          FastAPI app and endpoints
  pipeline.py      Hybrid diarization pipeline
models/
  embedder.py      ECAPA-TDNN embedding extractor
  clusterer.py     Speaker clustering logic
utils/
  audio.py         Audio and export helpers
static/
  index.html       Web UI
Dockerfile
requirements.txt
README.md
```
---
## Quick Start (Local)
### 1. Create and activate a virtual environment
Windows PowerShell:
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```
Linux/macOS:
```bash
python -m venv .venv
source .venv/bin/activate
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
### 3. (Recommended) Set Hugging Face token
`pyannote` models are gated. Create a token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
Windows PowerShell:
```powershell
$env:HF_TOKEN="your_token_here"
```
Linux/macOS:
```bash
export HF_TOKEN="your_token_here"
```
### 4. Run API server
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```
Open:
- UI: `http://localhost:8000`
- API docs: `http://localhost:8000/docs`
---
## Web UI Notes
- The UI defaults to the **same-origin** API endpoint (`/diarize`), so it works on Hugging Face Spaces without extra configuration.
- If you set a custom endpoint manually, make sure it allows CORS and is reachable from the browser.
---
## Hugging Face Spaces Deployment
### Requirements
1. Space created (Docker SDK)
2. Space secret `HF_TOKEN` configured
3. Terms accepted for:
   - [https://huggingface.co/pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection)
   - [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
### Push code
Push the `main` branch to your Space's git remote (named `huggingface` here):
```bash
git push huggingface main
```
If the push fails with an unauthorized error:
- Use a token with **Write** role (not Read)
- Confirm token owner has access to the target namespace
---
## API
### `GET /health`
Returns service health and device.
### `POST /diarize`
Upload an audio file.
Form fields:
- `file`: audio file
- `num_speakers` (optional): force a known number of speakers
Example:
```bash
curl -X POST http://localhost:8000/diarize \
-F "file=@meeting.mp3" \
-F "num_speakers=2"
```
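
The same upload can be sketched from Python using only the standard library. This is a minimal illustration, not part of the project: the helper names `build_multipart` and `diarize_file` are hypothetical, the file part is sent as `application/octet-stream`, and the response is assumed to be JSON (FastAPI's default).

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode form fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header). Hypothetical helper.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + file_bytes
        + b'\r\n'
    )
    # Closing boundary marks the end of the form.
    parts.append(f'--{boundary}--\r\n'.encode())
    return b''.join(parts), f'multipart/form-data; boundary={boundary}'


def diarize_file(path, num_speakers=None, base_url="http://localhost:8000"):
    """POST an audio file to /diarize and return the decoded response.

    Assumes the endpoint answers with JSON.
    """
    fields = {} if num_speakers is None else {"num_speakers": num_speakers}
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            fields, "file", os.path.basename(path), fh.read()
        )
    req = urllib.request.Request(
        f"{base_url}/diarize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running locally, `diarize_file("meeting.mp3", num_speakers=2)` mirrors the `curl` call above.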
### `POST /diarize/url`
Diarize audio from a remote URL.
Example:
```bash
curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"
```
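
When the audio URL contains characters such as `&`, `?`, or spaces, it should be percent-encoded before it is placed in the query string. A minimal standard-library sketch follows; the helper names `build_diarize_url` and `diarize_remote` are hypothetical, and the response is assumed to be JSON.

```python
import json
import urllib.parse
import urllib.request


def build_diarize_url(audio_url, base_url="http://localhost:8000"):
    """Return the /diarize/url request URL with audio_url percent-encoded."""
    query = urllib.parse.urlencode({"audio_url": audio_url})
    return f"{base_url}/diarize/url?{query}"


def diarize_remote(audio_url, base_url="http://localhost:8000"):
    """POST to /diarize/url and return the decoded response (assumed JSON)."""
    req = urllib.request.Request(
        build_diarize_url(audio_url, base_url), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Encoding the parameter keeps any query string inside the audio URL from being mistaken for parameters of the API call itself.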
---
## CLI Usage
```bash
python demo.py --audio meeting.wav
python demo.py --audio meeting.wav --speakers 2
python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
```
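
For orientation, the sketch below shows what RTTM and SRT exports of diarization segments typically look like. It is not taken from `demo.py`: the `(start, end, speaker)` tuple layout, the function names, and the choice to use the speaker label as the SRT caption text are all assumptions made for illustration.

```python
def to_rttm(segments, file_id):
    """Render (start_s, end_s, speaker) tuples as RTTM SPEAKER lines."""
    lines = []
    for start, end, speaker in segments:
        duration = end - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT cues, using the speaker label as caption text."""
    cues = []
    for index, (start, end, speaker) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}\n"
        )
    return "\n".join(cues)
```

RTTM stores an onset and a duration per turn, while SRT stores start and end timestamps, which is why the two writers handle the same segment differently.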
---
## Configuration (Environment Variables)
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | unset | Hugging Face token for gated pyannote models |
| `CACHE_DIR` | temp model cache path | Model download/cache directory |
| `USE_PYANNOTE_DIARIZATION` | `true` | Enable full pyannote diarization first |
| `PYANNOTE_DIARIZATION_MODEL` | `pyannote/speaker-diarization-3.1` | pyannote diarization model id |
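
As a rough illustration of how these variables and defaults could be read, here is a small sketch. It is not the project's actual loading code: the `Settings` class, `load_settings`, the accepted boolean spellings, and the `model_cache` directory name are assumptions.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    cache_dir: str
    use_pyannote_diarization: bool
    pyannote_diarization_model: str


def parse_bool(raw: str) -> bool:
    """Treat common truthy spellings as True (assumed convention)."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the defaults from the table."""
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,
        # Hypothetical default location inside the system temp directory.
        cache_dir=env.get("CACHE_DIR")
        or os.path.join(tempfile.gettempdir(), "model_cache"),
        use_pyannote_diarization=parse_bool(
            env.get("USE_PYANNOTE_DIARIZATION", "true")
        ),
        pyannote_diarization_model=env.get(
            "PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"
        ),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation.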
---
## How the Pipeline Works
1. Load and normalize audio
2. Try full pyannote diarization (best quality)
3. If pyannote is unavailable or fails, fall back to:
   - VAD (pyannote VAD or energy-based VAD)
   - Sliding windows
   - ECAPA embeddings
   - Agglomerative clustering
4. Merge adjacent same-speaker segments
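
The control flow of steps 2 to 4 can be sketched as follows. This is a simplified illustration, not the code in `app/pipeline.py`: the function names, the `(start, end, speaker)` segment layout, and the `max_gap` tolerance are assumptions.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, speaker), assumed layout


def diarize_with_fallback(
    audio,
    primary: Callable[[object], List[Segment]],
    fallback: Callable[[object], List[Segment]],
) -> List[Segment]:
    """Try the preferred diarizer first; use the fallback if it raises."""
    try:
        return primary(audio)
    except Exception:
        # e.g. gated model not accessible, download failure, runtime error
        return fallback(audio)


def merge_adjacent(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Merge consecutive segments from the same speaker.

    Two segments are merged when the silence between them is at most
    max_gap seconds (hypothetical tolerance).
    """
    merged: List[Segment] = []
    for start, end, speaker in sorted(segments):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged
```

Merging after clustering matters most in the fallback path, where sliding windows tend to produce many short, back-to-back segments for the same speaker.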
---
## Troubleshooting
### 1) UI shows `Error: Failed to fetch`
The UI is most likely pointing at the wrong API endpoint. In a deployed UI, use the same-origin `/diarize` endpoint.
### 2) Logs show pyannote download/auth warnings
You need:
- valid `HF_TOKEN`
- accepted model terms on both pyannote model pages
### 3) Poor speaker separation
- Provide `num_speakers` when known
- Ensure clean audio (minimal background noise)
- Prefer pyannote path (set token + accept terms)
### 4) `500` during embedding load
This is usually a model download, cache, or authentication problem. Confirm that `HF_TOKEN` is valid, that the cache path is writable, and that the host has internet connectivity.
---
## Limitations
- Overlapped speech may still be imperfect in fallback mode
- Quality depends on audio clarity, language mix, and noise
- Very short utterances are harder to classify reliably
---
## License
Add your preferred license file (`LICENSE`) if this project is public.