tantk committed on
Commit efbb752 · verified · 1 Parent(s): 2ae58cf

Upload README.md with huggingface_hub

Files changed (1): README.md (+37 −4)
README.md CHANGED
@@ -1,18 +1,22 @@
 ---
 tags:
 - audio
+- speech-to-text
 - speaker-diarization
 - speaker-embedding
+- voxtral
 - pyannote
 - funasr
 - meetingmind
 library_name: custom
-pipeline_tag: audio-classification
+pipeline_tag: automatic-speech-recognition
 ---
 
-# MeetingMind GPU Service
+# MeetingMind GPU Endpoint
 
-GPU-accelerated speaker diarization and embedding extraction for the MeetingMind pipeline. Runs as an HF Inference Endpoint on a T4 GPU with scale-to-zero.
+GPU-accelerated speech-to-text, speaker diarization, and embedding extraction for the MeetingMind pipeline. Runs as an HF Inference Endpoint on a T4 GPU with scale-to-zero.
+
+**Model weights**: [`mistral-hackaton-2026/voxtral_model`](https://huggingface.co/mistral-hackaton-2026/voxtral_model) — Voxtral Realtime 4B (BF16 safetensors, loaded from `/repository/voxtral-model/`)
 
 ## API
 
@@ -28,6 +32,34 @@ curl -H "Authorization: Bearer $HF_TOKEN" $ENDPOINT_URL/health
 {"status": "ok", "gpu_available": true}
 ```
 
+### `POST /transcribe`
+
+Speech-to-text using Voxtral Realtime 4B. Returns full transcription.
+
+```bash
+curl -X POST \
+  -H "Authorization: Bearer $HF_TOKEN" \
+  -F audio=@speech.wav \
+  $ENDPOINT_URL/transcribe
+```
+
+```json
+{"text": "Hello, this is a test of the voxtral speech to text system."}
+```
+
+### `POST /transcribe/stream`
+
+Streaming speech-to-text via SSE. Tokens are emitted as they are generated.
+
+```bash
+curl -X POST \
+  -H "Authorization: Bearer $HF_TOKEN" \
+  -F audio=@speech.wav \
+  $ENDPOINT_URL/transcribe/stream
+```
+
+Events: `token` (partial), `done` (final text), `error`.
+
 ### `POST /diarize`
 
 Speaker diarization using pyannote v4. Accepts any audio format (FLAC, WAV, MP3, etc.).
@@ -72,13 +104,14 @@ curl -X POST \
 | Variable | Default | Description |
 |---|---|---|
 | `HF_TOKEN` | (required) | Hugging Face token for pyannote model access |
+| `VOXTRAL_MODEL_DIR` | `/repository/voxtral-model` | Path to Voxtral model weights |
 | `PYANNOTE_MIN_SPEAKERS` | `1` | Minimum speakers for diarization |
 | `PYANNOTE_MAX_SPEAKERS` | `10` | Maximum speakers for diarization |
 
 ## Architecture
 
 - **Base image**: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`
+- **Transcription**: Voxtral Realtime 4B via direct safetensors loading (~8GB VRAM)
 - **Diarization**: pyannote/speaker-diarization-community-1 (~2GB VRAM)
 - **Embeddings**: FunASR CAM++ sv_zh-cn_16k-common (~200MB)
-- **Total VRAM**: ~3GB (fits T4 16GB with headroom)
 - **Scale-to-zero**: 15 min idle timeout (~$0.60/hr when active)
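
For reference, the `/transcribe/stream` endpoint added in this commit can be consumed with a small SSE parser. This is a sketch only: the README documents the event names (`token`, `done`, `error`) but not the wire payloads, so the `event:`/`data:` framing and the `{"text": ...}` payload shape below are assumptions, mirrored from the non-streaming `/transcribe` response.

```python
import json

def parse_sse(lines):
    """Parse Server-Sent Events lines into (event, payload) pairs.

    Assumes `event:` / `data:` line pairs separated by blank lines, with
    JSON payloads -- the endpoint's exact wire format is not documented,
    so treat this as an illustrative sketch.
    """
    event, data_lines = "message", []
    for line in lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []

# Simulated stream; a real client would iterate over the HTTP response body.
stream = [
    "event: token", 'data: {"text": "Hello"}', "",
    "event: token", 'data: {"text": ", world"}', "",
    "event: done", 'data: {"text": "Hello, world"}', "",
]

transcript, final = [], None
for name, payload in parse_sse(stream):
    if name == "token":
        transcript.append(payload["text"])  # partial text as it arrives
    elif name == "done":
        final = payload["text"]             # full transcription
```

A quick way to inspect the real event framing is `curl -N` against `$ENDPOINT_URL/transcribe/stream`, which prints the raw SSE lines as they arrive.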
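
The configuration table above maps directly onto environment lookups. The helper below is illustrative, not the service's actual code; the variable names and defaults are taken from the README, but the function name and return shape are invented for this sketch.

```python
import os

def load_config(env=None):
    """Read the endpoint's configuration from environment variables.

    Names and defaults follow the README's table: HF_TOKEN is required,
    everything else falls back to a documented default.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required for pyannote model access")
    return {
        "hf_token": token,
        "voxtral_model_dir": env.get("VOXTRAL_MODEL_DIR", "/repository/voxtral-model"),
        "min_speakers": int(env.get("PYANNOTE_MIN_SPEAKERS", "1")),
        "max_speakers": int(env.get("PYANNOTE_MAX_SPEAKERS", "10")),
    }

# With only the required token set, the documented defaults apply.
cfg = load_config({"HF_TOKEN": "hf_example"})
```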