	---
	title: Who Spoke When
	emoji: 🎙️
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	app_file: app/main.py
	pinned: false
	---

	# Who Spoke When
	Speaker diarization service and web app: upload audio and get back "who spoke when" segments.

	The project now runs with a hybrid pipeline:
	- Preferred: `pyannote/speaker-diarization-3.1` (best quality)
	- Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering
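The preferred/fallback arrangement can be pictured as a try-then-fall-back wrapper. The following is a minimal sketch, not the project's `pipeline.py`: `diarize_with_fallback`, the `Segment` shape, and the two stub backends are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple

# (start_seconds, end_seconds, speaker_label); hypothetical shape for illustration
Segment = Tuple[float, float, str]
Backend = Callable[[str], List[Segment]]


def diarize_with_fallback(
    audio_path: str, preferred: Backend, fallback: Backend
) -> Tuple[str, List[Segment]]:
    """Try the preferred backend; on any failure, use the fallback instead."""
    try:
        return "preferred", preferred(audio_path)
    except Exception as exc:  # e.g. gated model not accessible, download failure
        print(f"Preferred backend failed ({exc!r}); using fallback")
        return "fallback", fallback(audio_path)


# Stub backends standing in for the pyannote and VAD + ECAPA-TDNN paths.
def pyannote_stub(audio_path: str) -> List[Segment]:
    raise RuntimeError("model unavailable")


def clustering_stub(audio_path: str) -> List[Segment]:
    return [(0.0, 1.5, "SPEAKER_00"), (1.5, 3.0, "SPEAKER_01")]


used, segments = diarize_with_fallback("meeting.wav", pyannote_stub, clustering_stub)
```

Returning which path was used alongside the segments makes it easy to surface the active mode in logs or API responses, though whether the real service reports this is not stated here.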

	---

	## What You Get
	- FastAPI backend (`/diarize`, `/diarize/url`, `/health`)
	- Web UI (`/`) for file upload and timeline view
	- CLI demo (`demo.py`)
	- Automatic fallback if pyannote models are unavailable

	---

	## Project Structure
	```text
	app/
	  main.py           FastAPI app and endpoints
	  pipeline.py       Hybrid diarization pipeline
	models/
	  embedder.py       ECAPA-TDNN embedding extractor
	  clusterer.py      Speaker clustering logic
	utils/
	  audio.py          Audio and export helpers
	static/
	  index.html        Web UI
	Dockerfile
	requirements.txt
	README.md
	```

	---

	## Quick Start (Local)

	### 1. Create and activate a virtual environment

	Windows PowerShell:
	```powershell
	python -m venv .venv
	.\.venv\Scripts\Activate.ps1
	```

	Linux/macOS:
	```bash
	python -m venv .venv
	source .venv/bin/activate
	```

	### 2. Install dependencies
	```bash
	pip install -r requirements.txt
	```

	### 3. (Recommended) Set Hugging Face token
	`pyannote` models are gated. Create a token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

	Windows PowerShell:
	```powershell
	$env:HF_TOKEN="your_token_here"
	```

	Linux/macOS:
	```bash
	export HF_TOKEN="your_token_here"
	```

	### 4. Run API server
	```bash
	uvicorn app.main:app --host 0.0.0.0 --port 8000
	```

	Open:
	- UI: `http://localhost:8000`
	- API docs: `http://localhost:8000/docs`

	---

	## Web UI Notes
	- The UI defaults to the same-origin API (`/diarize`), so it works on Hugging Face Spaces without extra configuration.
	- If you manually set a custom endpoint, make sure it allows CORS and is reachable from the browser.

	---

	## Hugging Face Spaces Deployment

	### Requirements
	1. Space created (Docker SDK)
	2. Space secret `HF_TOKEN` configured
	3. Terms accepted for:
	   - [https://huggingface.co/pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection)
	   - [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

	### Push code
	Push the `main` branch to your Space's git remote (named `huggingface` in this example):
	```bash
	git push huggingface main
	```

	If the push fails with an unauthorized error:
	- Use a token with the Write role (not Read)
	- Confirm the token owner has access to the target namespace

	---

	## API

	### `GET /health`
	Returns the service health status and the device in use.

	### `POST /diarize`
	Upload an audio file.

	Form fields:
	- `file`: audio file
	- `num_speakers` (optional): force known number of speakers

	Example:
	```bash
	curl -X POST http://localhost:8000/diarize \
	  -F "file=@meeting.mp3" \
	  -F "num_speakers=2"
	```

	### `POST /diarize/url`
	Diarize audio from a remote URL.

	Example:
	```bash
	curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"
	```
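When calling this endpoint from Python, it helps to percent-encode `audio_url` so that any query string in the audio link is not mistaken for extra API parameters. The sketch below uses only the standard library; `build_diarize_url` and `diarize_remote` are hypothetical helper names, and the JSON response format is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_diarize_url(base_url: str, audio_url: str) -> str:
    """Build the /diarize/url request URL with audio_url percent-encoded."""
    query = urlencode({"audio_url": audio_url})
    return f"{base_url.rstrip('/')}/diarize/url?{query}"


def diarize_remote(base_url: str, audio_url: str, timeout: float = 300.0):
    """POST to /diarize/url and decode the body (assumed here to be JSON)."""
    request = Request(build_diarize_url(base_url, audio_url), method="POST")
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_diarize_url("http://localhost:8000", "https://example.com/sample.wav"))
# http://localhost:8000/diarize/url?audio_url=https%3A%2F%2Fexample.com%2Fsample.wav
```

With the server running locally, `diarize_remote("http://localhost:8000", "https://example.com/sample.wav")` would send the request; the generous timeout is a guess, since diarization of long recordings can take a while.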

	---

	## CLI Usage
	```bash
	python demo.py --audio meeting.wav
	python demo.py --audio meeting.wav --speakers 2
	python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
	```
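For orientation, RTTM is a plain-text format with one `SPEAKER` line per segment, giving the onset and duration in seconds. The sketch below shows how segments might be serialized; `to_rttm` is a hypothetical helper, and the project's actual export code may differ in details such as channel number or precision.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker_label)


def to_rttm(segments: Iterable[Segment], file_id: str) -> List[str]:
    """Format segments as RTTM SPEAKER lines (onset and duration in seconds)."""
    lines = []
    for start, end, speaker in segments:
        duration = end - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


for line in to_rttm([(0.0, 1.5, "SPEAKER_00"), (1.5, 3.25, "SPEAKER_01")], "meeting"):
    print(line)
# SPEAKER meeting 1 0.000 1.500 <NA> <NA> SPEAKER_00 <NA> <NA>
# SPEAKER meeting 1 1.500 1.750 <NA> <NA> SPEAKER_01 <NA> <NA>
```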

	---

	## Configuration (Environment Variables)

	| Variable | Default | Description |
	|---|---|---|
	| `HF_TOKEN` | unset | Hugging Face token for gated pyannote models |
	| `CACHE_DIR` | temp model cache path | Model download/cache directory |
	| `USE_PYANNOTE_DIARIZATION` | `true` | Enable full pyannote diarization first |
	| `PYANNOTE_DIARIZATION_MODEL` | `pyannote/speaker-diarization-3.1` | pyannote diarization model id |
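As an illustration of how these variables and their defaults fit together, here is a minimal sketch of a settings loader. It is not the application's own code: `Settings` and `load_settings` are hypothetical names, the cache subdirectory is a placeholder (the default is only described as a temp path), and the accepted boolean spellings are an assumption.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    cache_dir: str
    use_pyannote_diarization: bool
    pyannote_diarization_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the defaults from the table."""
    # Assumption: placeholder subdirectory under the system temp directory.
    default_cache = os.path.join(tempfile.gettempdir(), "model_cache")
    # Assumption: common truthy spellings; the app's own parsing may differ.
    raw_flag = env.get("USE_PYANNOTE_DIARIZATION", "true")
    use_pyannote = raw_flag.strip().lower() in {"1", "true", "yes", "on"}
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,
        cache_dir=env.get("CACHE_DIR", default_cache),
        use_pyannote_diarization=use_pyannote,
        pyannote_diarization_model=env.get(
            "PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"
        ),
    )


settings = load_settings({"USE_PYANNOTE_DIARIZATION": "false"})
print(settings.use_pyannote_diarization, settings.pyannote_diarization_model)
# False pyannote/speaker-diarization-3.1
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.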

	---

	## How the Pipeline Works
	1. Load and normalize audio
	2. Try full pyannote diarization (best quality)
	3. If pyannote is unavailable or fails, fall back to:
	   - VAD (pyannote VAD or energy VAD)
	   - Sliding windows
	   - ECAPA embeddings
	   - Agglomerative clustering
	4. Merge adjacent same-speaker segments
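
The final merge step can be sketched as a single pass over time-sorted segments. This is an illustrative version only: `merge_adjacent` and the `max_gap` tolerance of 0.5 seconds are assumptions, not values taken from the project.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker_label)


def merge_adjacent(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    merged: List[Segment] = []
    for start, end, speaker in sorted(segments):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            # Extend the previous segment instead of starting a new one.
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


print(merge_adjacent([(0.0, 1.0, "A"), (1.2, 2.0, "A"), (2.1, 3.0, "B"), (3.0, 4.0, "B")]))
# [(0.0, 2.0, 'A'), (2.1, 4.0, 'B')]
```

A larger `max_gap` yields fewer, longer segments at the cost of bridging short pauses; a speaker change always starts a new segment regardless of the gap.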

	---

	## Troubleshooting

	### 1) UI shows `Error: Failed to fetch`
	The configured API endpoint is most likely wrong. In the deployed UI, use the same-origin `/diarize` endpoint.

	### 2) Logs show pyannote download/auth warnings
	You need:
	- a valid `HF_TOKEN`
	- accepted model terms on both pyannote model pages

	### 3) Poor speaker separation
	- Provide `num_speakers` when known
	- Ensure clean audio (minimal background noise)
	- Prefer pyannote path (set token + accept terms)

	### 4) `500` during embedding load
	This is usually a model download, cache, or authentication problem. Confirm that `HF_TOKEN` is set, the cache path is writable, and the host has internet connectivity.

	---

	## Limitations
	- Overlapping speech may still be handled imperfectly in fallback mode
	- Quality depends on audio clarity, language mix, and noise
	- Very short utterances are harder to classify reliably

	---

	## License
	Add your preferred license file (`LICENSE`) if this project is public.