---
tags:
- audio
- speaker-diarization
- speaker-embedding
- pyannote
- funasr
- meetingmind
library_name: custom
pipeline_tag: audio-classification
---
# MeetingMind GPU Service
GPU-accelerated speaker diarization and embedding extraction for the MeetingMind pipeline. Runs as a Hugging Face Inference Endpoint on a T4 GPU with scale-to-zero enabled.
## API
### `GET /health`
Returns service status and GPU availability.
```bash
curl -H "Authorization: Bearer $HF_TOKEN" $ENDPOINT_URL/health
```
```json
{"status": "ok", "gpu_available": true}
```
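
Because the endpoint scales to zero, the first request after an idle period may fail or time out while the container starts. The following is a minimal Python sketch of a readiness poll against `/health`. The function names, retry count, and delay are illustrative assumptions rather than part of the service, and the exact cold-start behaviour is not documented here.

```python
import json
import time
import urllib.request
from typing import Callable


def health_probe(endpoint_url: str, token: str) -> bool:
    """Return True if GET /health answers with {"status": "ok", ...}."""
    req = urllib.request.Request(
        f"{endpoint_url}/health",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp).get("status") == "ok"
    except (OSError, ValueError):
        # Covers HTTP errors, connection errors, timeouts, and invalid JSON.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,      # assumed value, tune for your cold-start time
    delay_s: float = 10.0,   # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A caller might run `wait_until_ready(lambda: health_probe(os.environ["ENDPOINT_URL"], os.environ["HF_TOKEN"]))` before sending the first `/diarize` request.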
### `POST /diarize`
Speaker diarization using pyannote.audio v4. Accepts common audio formats (FLAC, WAV, MP3, etc.); `min_speakers` and `max_speakers` bound the number of speakers detected.
```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@meeting.flac \
  -F min_speakers=2 \
  -F max_speakers=6 \
  "$ENDPOINT_URL/diarize"
```
```json
{
  "segments": [
    {"speaker": "SPEAKER_00", "start": 0.5, "end": 3.2, "duration": 2.7},
    {"speaker": "SPEAKER_01", "start": 3.4, "end": 7.1, "duration": 3.7}
  ]
}
```
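
As an illustration of consuming the `/diarize` response, the sketch below totals speaking time per speaker from the `segments` list. It assumes only the field names shown in the example response above; the helper name `speaking_time` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def speaking_time(segments: List[dict]) -> Dict[str, float]:
    """Sum the `duration` field of each segment, grouped by speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["duration"]
    return dict(totals)


# Example payload copied from the response shown above.
response = {
    "segments": [
        {"speaker": "SPEAKER_00", "start": 0.5, "end": 3.2, "duration": 2.7},
        {"speaker": "SPEAKER_01", "start": 3.4, "end": 7.1, "duration": 3.7},
    ]
}
totals = speaking_time(response["segments"])
```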
### `POST /embed`
Speaker embedding extraction using FunASR CAM++. Returns an L2-normalized 192-dimensional vector for the audio between `start_time` and `end_time`, suitable for voiceprint matching.
```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@meeting.flac \
  -F start_time=1.0 \
  -F end_time=5.0 \
  "$ENDPOINT_URL/embed"
```
```json
{"embedding": [0.012, -0.034, ...], "dim": 192}
```
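
Since the returned vectors are L2-normalized, the dot product of two embeddings equals their cosine similarity. The sketch below shows one way voiceprint matching could be done on the client side. The `enrolled` dictionary, the function names, and the `0.5` threshold are assumptions for illustration; the service does not prescribe a matching method or threshold.

```python
from typing import Dict, List, Optional, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def best_match(
    query: List[float],
    enrolled: Dict[str, List[float]],  # hypothetical store: name -> embedding
    threshold: float = 0.5,            # placeholder, calibrate on your own data
) -> Tuple[Optional[str], float]:
    """Return (name, score) of the closest enrolled voiceprint, or (None, score)
    if the best score falls below the threshold."""
    best_name, best_score = None, float("-inf")
    for name, ref in enrolled.items():
        score = dot(query, ref)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```

In practice `query` would be the `embedding` field from a `/embed` response and each enrolled vector would come from an earlier `/embed` call on a known speaker.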
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (required) | Hugging Face token for pyannote model access |
| `PYANNOTE_MIN_SPEAKERS` | `1` | Minimum speakers for diarization |
| `PYANNOTE_MAX_SPEAKERS` | `10` | Maximum speakers for diarization |
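
A minimal sketch of how the speaker-count variables in the table could be read and validated, using the documented defaults. The function name and the validation rules are assumptions; the service's actual startup code is not shown in this README.

```python
import os
from typing import Mapping, Optional, Tuple


def load_speaker_bounds(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Read PYANNOTE_MIN_SPEAKERS / PYANNOTE_MAX_SPEAKERS with defaults 1 and 10."""
    env = os.environ if env is None else env
    min_speakers = int(env.get("PYANNOTE_MIN_SPEAKERS", "1"))
    max_speakers = int(env.get("PYANNOTE_MAX_SPEAKERS", "10"))
    # Assumed sanity checks: at least one speaker, and min must not exceed max.
    if min_speakers < 1 or max_speakers < min_speakers:
        raise ValueError(
            f"invalid speaker bounds: min={min_speakers}, max={max_speakers}"
        )
    return min_speakers, max_speakers
```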
## Architecture
- **Base image**: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`
- **Diarization**: pyannote/speaker-diarization-community-1 (~2GB VRAM)
- **Embeddings**: FunASR CAM++ sv_zh-cn_16k-common (~200MB)
- **Total VRAM**: ~3GB (fits T4 16GB with headroom)
- **Scale-to-zero**: 15 min idle timeout (~$0.60/hr when active)