---
tags:
- audio
- speech-to-text
- voxtral
- meetingmind
library_name: custom
pipeline_tag: automatic-speech-recognition
---

# MeetingMind Voxtral Transcription Endpoint

GPU-accelerated speech-to-text for the MeetingMind pipeline using Voxtral Realtime 4B. Runs as a Hugging Face Inference Endpoint on a T4 GPU with scale-to-zero.

**Model weights**: [`mistral-hackaton-2026/voxtral_model`](https://huggingface.co/mistral-hackaton-2026/voxtral_model) — Voxtral Realtime 4B (BF16 safetensors, loaded from `/repository/voxtral-model/`)

## API

### `GET /health`

Returns service status and GPU availability.

```bash
curl -H "Authorization: Bearer $HF_TOKEN" $ENDPOINT_URL/health
```

```json
{"status": "ok", "gpu_available": true}
```

### `POST /transcribe`

Speech-to-text using Voxtral Realtime 4B. Accepts a multipart form upload in the `audio` field and returns the full transcription.

```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe
```

```json
{"text": "Hello, this is a test of the voxtral speech to text system."}
```

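The same call can be made from Python. A minimal client sketch using the third-party `requests` library; the `extract_text` helper and function names are illustrative, not part of the endpoint:

```python
# Sketch of a Python client for POST /transcribe using the `requests` library.
# `extract_text` is an illustrative helper, not part of the endpoint itself.
import requests


def extract_text(payload: dict) -> str:
    """Pull the transcript out of the JSON response, failing loudly otherwise."""
    if "text" not in payload:
        raise ValueError(f"unexpected response shape: {payload!r}")
    return payload["text"]


def transcribe(audio_path: str, endpoint_url: str, token: str) -> str:
    """Upload an audio file as multipart form data and return the transcript."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{endpoint_url}/transcribe",
            headers={"Authorization": f"Bearer {token}"},
            files={"audio": (audio_path, f, "audio/wav")},
            timeout=300,  # generous: a cold start plus a long recording
        )
    resp.raise_for_status()
    return extract_text(resp.json())
```

For example: `transcribe("speech.wav", os.environ["ENDPOINT_URL"], os.environ["HF_TOKEN"])`.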
### `POST /transcribe/stream`

Streaming speech-to-text via server-sent events (SSE). Tokens are emitted as they are generated.

```bash
curl -N -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe/stream
```

Events: `token` (partial text), `done` (final text), `error`.

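The stream can be consumed from Python as well. A sketch assuming standard SSE framing (`event:` / `data:` lines, with a blank line terminating each event); the exact shape of the `data:` payload is not specified by this card, so adjust to what the endpoint actually emits:

```python
# Sketch of an SSE consumer for POST /transcribe/stream, using `requests`.
# Standard SSE framing is assumed: "event:"/"data:" lines, blank line dispatches.
import requests


def sse_events(lines):
    """Group raw SSE lines into (event, data) pairs."""
    event, data = "message", []
    for line in lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":  # blank line dispatches the pending event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []


def stream_transcript(audio_path, endpoint_url, token):
    """Print partial tokens as they arrive; return the final transcript."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{endpoint_url}/transcribe/stream",
            headers={"Authorization": f"Bearer {token}"},
            files={"audio": f},
            stream=True,  # keep the response open and read it incrementally
            timeout=300,
        )
        resp.raise_for_status()
        for event, data in sse_events(resp.iter_lines(decode_unicode=True)):
            if event == "token":
                print(data, end="", flush=True)
            elif event == "done":
                return data
            elif event == "error":
                raise RuntimeError(data)
```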
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `VOXTRAL_MODEL_DIR` | `/repository/voxtral-model` | Path to Voxtral model weights |

## Architecture

- **Base image**: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`
- **Transcription**: Voxtral Realtime 4B via direct safetensors loading (~8 GB VRAM)
- **Scale-to-zero**: 15-minute idle timeout (~$0.60/hr when active)
- **Diarization & embeddings**: Served separately by the GPU service on machine "tanti"
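Because the endpoint scales to zero, the first request after an idle period waits out a cold start; Inference Endpoints commonly answer `503` while a replica is starting. A hedged retry sketch (the status-code behaviour and backoff values are assumptions, not documented guarantees of this service):

```python
# Sketch: poll /health with exponential backoff to ride out a scale-to-zero
# cold start. The 503-while-starting behaviour is an assumption.
import time

import requests


def backoff_schedule(base=2.0, cap=60.0, retries=6):
    """Delays of 2, 4, 8, ... seconds, capped at `cap`."""
    return [min(base * 2 ** i, cap) for i in range(retries)]


def wait_until_ready(endpoint_url, token):
    """Block until /health answers, or give up after the retry budget."""
    headers = {"Authorization": f"Bearer {token}"}
    for delay in backoff_schedule():
        resp = requests.get(f"{endpoint_url}/health", headers=headers, timeout=30)
        if resp.status_code != 503:  # anything but "replica still starting"
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
    raise TimeoutError("endpoint did not become ready within the retry budget")
```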