---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app/main.py
pinned: false
---

# Who Spoke When
Speaker diarization service and web app: upload audio and get **who spoke when** segments.

The project runs a **hybrid pipeline**:
- Preferred: `pyannote/speaker-diarization-3.1` (best quality)
- Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering

---

## What You Get
- FastAPI backend (`/diarize`, `/diarize/url`, `/health`)
- Web UI (`/`) for file upload and timeline view
- CLI demo (`demo.py`)
- Automatic fallback if pyannote models are unavailable

---

## Project Structure
```text
app/
  main.py         FastAPI app and endpoints
  pipeline.py     Hybrid diarization pipeline
models/
  embedder.py     ECAPA-TDNN embedding extractor
  clusterer.py    Speaker clustering logic
utils/
  audio.py        Audio and export helpers
static/
  index.html      Web UI
Dockerfile
requirements.txt
README.md
```

---

## Quick Start (Local)

### 1. Create and activate a virtual environment

Windows PowerShell:
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```

Linux/macOS:
```bash
python -m venv .venv
source .venv/bin/activate
```

### 2. Install dependencies
```bash
pip install -r requirements.txt
```

### 3. (Recommended) Set Hugging Face token
`pyannote` models are gated. Create a token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

Windows PowerShell:
```powershell
$env:HF_TOKEN="your_token_here"
```

Linux/macOS:
```bash
export HF_TOKEN="your_token_here"
```

### 4. Run API server
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Open:
- UI: `http://localhost:8000`
- API docs: `http://localhost:8000/docs`

---

## Web UI Notes
- The UI defaults to the **same-origin** API (`/diarize`), so it works on Hugging Face Spaces.
- If you set a custom endpoint manually, ensure it allows CORS and is reachable from the browser.

---

## Hugging Face Spaces Deployment

### Requirements
1. Space created (Docker SDK)
2. Space secret `HF_TOKEN` configured
3. Terms accepted for:
   - [https://huggingface.co/pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection)
   - [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

### Push code
Push the `main` branch to your Space's Git remote (named `huggingface` in this example):
```bash
git push huggingface main
```

If the push fails with an authorization error:
- Use a token with the **Write** role (not Read)
- Confirm that the token owner has access to the target namespace

---

## API

### `GET /health`
Returns the service health status and the device in use.

### `POST /diarize`
Upload an audio file.

Form fields:
- `file`: audio file
- `num_speakers` (optional): force a known number of speakers

Example:
```bash
curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.mp3" \
  -F "num_speakers=2"
```
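
The same upload can be made from Python. The following is a minimal sketch using only the standard library. `build_multipart` and `diarize_file` are hypothetical helper names, and the code assumes the endpoint responds with JSON, which this README does not spell out.

```python
import json
import urllib.request
import uuid
from pathlib import Path
from typing import Optional, Tuple


def build_multipart(filename: str, data: bytes,
                    num_speakers: Optional[int] = None) -> Tuple[bytes, str]:
    """Encode the `file` and optional `num_speakers` form fields.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    if num_speakers is not None:
        parts.append(
            f'--{boundary}\r\n'
            'Content-Disposition: form-data; name="num_speakers"\r\n\r\n'
            f'{num_speakers}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + data + b'\r\n'
    )
    parts.append(f'--{boundary}--\r\n'.encode())
    return b''.join(parts), f'multipart/form-data; boundary={boundary}'


def diarize_file(path: str, base_url: str = "http://localhost:8000",
                 num_speakers: Optional[int] = None):
    """POST an audio file to /diarize and return the parsed response.

    Assumption: the server answers with a JSON body.
    """
    audio = Path(path)
    body, content_type = build_multipart(audio.name, audio.read_bytes(), num_speakers)
    request = urllib.request.Request(
        f"{base_url}/diarize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server from the Quick Start running, `diarize_file("meeting.mp3", num_speakers=2)` sends the same request as the `curl` command above.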

### `POST /diarize/url`
Diarize audio from a remote URL.

Example:
```bash
curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"
```
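
If the audio URL contains its own query string or special characters, it should be percent-encoded before being placed in the `audio_url` parameter. A small sketch of that step follows; `diarize_url_endpoint` and `diarize_remote` are hypothetical helper names, and a JSON response is assumed.

```python
import json
import urllib.request
from urllib.parse import urlencode


def diarize_url_endpoint(audio_url: str, base_url: str = "http://localhost:8000") -> str:
    """Build the /diarize/url request URL with `audio_url` percent-encoded."""
    return f"{base_url}/diarize/url?{urlencode({'audio_url': audio_url})}"


def diarize_remote(audio_url: str, base_url: str = "http://localhost:8000"):
    """POST to /diarize/url with an empty body (assumes a JSON response)."""
    request = urllib.request.Request(
        diarize_url_endpoint(audio_url, base_url), data=b"", method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Encoding matters for URLs such as `https://example.com/a.wav?sig=1&exp=2`: without it, `exp=2` would be read as a separate parameter of the API request instead of part of the audio URL.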

---

## CLI Usage
```bash
python demo.py --audio meeting.wav
python demo.py --audio meeting.wav --speakers 2
python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
```
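
For orientation, the `--rttm` and `--srt` options refer to two common text formats. The sketch below shows what such exports typically look like. It does not reproduce `demo.py`: the helper names, the `(start, end, speaker)` tuple layout, the `file_id` value, and the choice of the speaker label as subtitle text are all assumptions.

```python
from typing import Iterable, Tuple

Turn = Tuple[float, float, str]  # (start seconds, end seconds, speaker label)


def to_rttm(turns: Iterable[Turn], file_id: str = "meeting") -> str:
    """Render turns as RTTM: one 10-field SPEAKER line per turn."""
    lines = [
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> {speaker} <NA> <NA>"
        for start, end, speaker in turns
    ]
    return "\n".join(lines) + "\n"


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(turns: Iterable[Turn]) -> str:
    """Render turns as numbered SRT cues, using the speaker label as cue text."""
    cues = [
        f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}\n"
        for index, (start, end, speaker) in enumerate(turns, start=1)
    ]
    return "\n".join(cues)
```

RTTM stores a turn's onset and duration rather than its end time, which is why `end - start` appears in the fifth field.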

---

## Configuration (Environment Variables)

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | unset | Hugging Face token for gated pyannote models |
| `CACHE_DIR` | temp model cache path | Model download/cache directory |
| `USE_PYANNOTE_DIARIZATION` | `true` | Try full pyannote diarization before the fallback path |
| `PYANNOTE_DIARIZATION_MODEL` | `pyannote/speaker-diarization-3.1` | pyannote diarization model id |

---

## How the Pipeline Works
1. Load and normalize audio
2. Try full pyannote diarization (best quality)
3. If pyannote is unavailable or fails, fall back to:
   - VAD (pyannote VAD or energy VAD)
   - Sliding windows
   - ECAPA embeddings
   - Agglomerative clustering
4. Merge adjacent same-speaker segments
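
The control flow of steps 2 to 4 can be sketched as follows. This is a simplified illustration, not the code in `app/pipeline.py`: `Segment`, `merge_adjacent`, and `diarize` are hypothetical names, the two diarizers are passed in as plain callables, and the `max_gap` threshold of 0.5 seconds is an assumed value.

```python
from typing import Any, Callable, List, NamedTuple, Sequence


class Segment(NamedTuple):
    start: float   # seconds
    end: float     # seconds
    speaker: str


def merge_adjacent(segments: Sequence[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Join consecutive segments of the same speaker separated by at most `max_gap`."""
    merged: List[Segment] = []
    for seg in sorted(segments):  # NamedTuples sort by start time first
        last = merged[-1] if merged else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            merged[-1] = Segment(last.start, max(last.end, seg.end), last.speaker)
        else:
            merged.append(seg)
    return merged


def diarize(audio: Any,
            primary: Callable[[Any], Sequence[Segment]],
            fallback: Callable[[Any], Sequence[Segment]]) -> List[Segment]:
    """Try the preferred diarizer; on any failure use the fallback, then merge."""
    try:
        segments = primary(audio)
    except Exception:
        # For example: gated model not downloadable, missing token, load error.
        segments = fallback(audio)
    return merge_adjacent(segments)
```

Here `primary` would wrap the pyannote pipeline and `fallback` the VAD, embedding, and clustering path. Keeping both behind the same call signature lets the merge step stay identical regardless of which path produced the segments.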

---

## Troubleshooting

### 1) UI shows `Error: Failed to fetch`
The API endpoint is most likely wrong. Use the same-origin `/diarize` endpoint in the deployed UI.

### 2) Logs show pyannote download/auth warnings
You need:
- a valid `HF_TOKEN`
- accepted model terms on both pyannote model pages

### 3) Poor speaker separation
- Provide `num_speakers` when known
- Ensure clean audio (minimal background noise)
- Prefer the pyannote path (set the token and accept the model terms)

### 4) `500` during embedding load
This is usually a model download, cache, or authentication problem. Confirm that `HF_TOKEN` is set, the cache path is writable, and the host has internet connectivity.

---

## Limitations
- Overlapping speech may still be segmented imperfectly in fallback mode
- Quality depends on audio clarity, language mix, and noise
- Very short utterances are harder to classify reliably

---

## License
Add your preferred license file (`LICENSE`) if this project is public.