tantk committed · Commit bc4519b · verified · 1 Parent(s): 69c6400

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +3 -50
README.md CHANGED

@@ -2,19 +2,15 @@
tags:
- audio
- speech-to-text
- - speaker-diarization
- - speaker-embedding
- voxtral
- - pyannote
- - funasr
- meetingmind
library_name: custom
pipeline_tag: automatic-speech-recognition
---

- # MeetingMind GPU Endpoint
+ # MeetingMind Voxtral Transcription Endpoint

- GPU-accelerated speech-to-text, speaker diarization, and embedding extraction for the MeetingMind pipeline. Runs as an HF Inference Endpoint on a T4 GPU with scale-to-zero.
+ GPU-accelerated speech-to-text for the MeetingMind pipeline using Voxtral Realtime 4B. Runs as an HF Inference Endpoint on a T4 GPU with scale-to-zero.

**Model weights**: [`mistral-hackaton-2026/voxtral_model`](https://huggingface.co/mistral-hackaton-2026/voxtral_model) — Voxtral Realtime 4B (BF16 safetensors, loaded from `/repository/voxtral-model/`)

@@ -60,58 +56,15 @@ curl -X POST \

Events: `token` (partial), `done` (final text), `error`.

- ### `POST /diarize`
-
- Speaker diarization using pyannote v4. Accepts any audio format (FLAC, WAV, MP3, etc.).
-
- ```bash
- curl -X POST \
-   -H "Authorization: Bearer $HF_TOKEN" \
-   -F audio=@meeting.flac \
-   -F min_speakers=2 \
-   -F max_speakers=6 \
-   $ENDPOINT_URL/diarize
- ```
-
- ```json
- {
-   "segments": [
-     {"speaker": "SPEAKER_00", "start": 0.5, "end": 3.2, "duration": 2.7},
-     {"speaker": "SPEAKER_01", "start": 3.4, "end": 7.1, "duration": 3.7}
-   ]
- }
- ```
-
- ### `POST /embed`
-
- Speaker embedding extraction using FunASR CAM++. Returns L2-normalized 192-dim vectors for voiceprint matching.
-
- ```bash
- curl -X POST \
-   -H "Authorization: Bearer $HF_TOKEN" \
-   -F audio=@meeting.flac \
-   -F start_time=1.0 \
-   -F end_time=5.0 \
-   $ENDPOINT_URL/embed
- ```
-
- ```json
- {"embedding": [0.012, -0.034, ...], "dim": 192}
- ```
-
## Environment Variables

| Variable | Default | Description |
|---|---|---|
- | `HF_TOKEN` | (required) | Hugging Face token for pyannote model access |
| `VOXTRAL_MODEL_DIR` | `/repository/voxtral-model` | Path to Voxtral model weights |
- | `PYANNOTE_MIN_SPEAKERS` | `1` | Minimum speakers for diarization |
- | `PYANNOTE_MAX_SPEAKERS` | `10` | Maximum speakers for diarization |

## Architecture

- **Base image**: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`
- **Transcription**: Voxtral Realtime 4B via direct safetensors loading (~8GB VRAM)
- - **Diarization**: pyannote/speaker-diarization-community-1 (~2GB VRAM)
- - **Embeddings**: FunASR CAM++ sv_zh-cn_16k-common (~200MB)
- **Scale-to-zero**: 15 min idle timeout (~$0.60/hr when active)
+ - **Diarization & embeddings**: Served separately by the GPU service on machine "tanti"
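The README retained by this commit documents the transcription endpoint's SSE events (`token` for partials, `done` for the final text, `error`). A minimal client-side sketch of consuming such a stream, assuming the standard `event:`/`data:` SSE wire format; the exact payload shape is not specified in the README, so treat the field handling here as illustrative:

```python
def parse_sse(raw: str):
    """Yield (event, data) pairs from a raw SSE response body."""
    event, data_lines = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":  # a blank line terminates one event
            if data_lines:
                yield event, "\n".join(data_lines)
            event, data_lines = "message", []

def transcript_from_stream(raw: str) -> str:
    """Accumulate `token` partials; a `done` event carries the final text."""
    partial = []
    for event, data in parse_sse(raw):
        if event == "token":
            partial.append(data)
        elif event == "done":
            return data
        elif event == "error":
            raise RuntimeError(f"transcription failed: {data}")
    return "".join(partial)  # fall back to partials if no `done` arrived

stream = (
    "event: token\ndata: Hello\n\n"
    "event: token\ndata: world\n\n"
    "event: done\ndata: Hello world\n\n"
)
print(transcript_from_stream(stream))  # -> Hello world
```

In practice the raw body would come from a streaming HTTP client reading `$ENDPOINT_URL` with the same `Authorization: Bearer $HF_TOKEN` header as the curl examples; parsing incrementally rather than from a full string is a straightforward extension.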