---
tags:
- audio
- speech-to-text
- voxtral
- meetingmind
library_name: custom
pipeline_tag: automatic-speech-recognition
---

# MeetingMind Voxtral Transcription Endpoint

GPU-accelerated speech-to-text for the MeetingMind pipeline using Voxtral Realtime 4B. Runs as an HF Inference Endpoint on a T4 GPU with scale-to-zero.

**Model weights**: [`mistral-hackaton-2026/voxtral_model`](https://huggingface.co/mistral-hackaton-2026/voxtral_model), Voxtral Realtime 4B (BF16 safetensors, loaded from `/repository/voxtral-model/`)

## API

### `GET /health`

Returns service status and GPU availability.

```bash
curl -H "Authorization: Bearer $HF_TOKEN" $ENDPOINT_URL/health
```

```json
{"status": "ok", "gpu_available": true}
```

### `POST /transcribe`

Speech-to-text using Voxtral Realtime 4B. Returns the full transcription.

```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe
```

```json
{"text": "Hello, this is a test of the voxtral speech to text system."}
```

### `POST /transcribe/stream`

Streaming speech-to-text via SSE. Tokens are emitted as they are generated.

```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe/stream
```

Events: `token` (partial), `done` (final text), `error`.

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `VOXTRAL_MODEL_DIR` | `/repository/voxtral-model` | Path to Voxtral model weights |

## Architecture

- **Base image**: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`
- **Transcription**: Voxtral Realtime 4B via direct safetensors loading (~8GB VRAM)
- **Scale-to-zero**: 15 min idle timeout (~$0.60/hr when active)
- **Diarization & embeddings**: Served separately by the GPU service on machine "tanti"
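
The `token`, `done`, and `error` events emitted by `POST /transcribe/stream` can be consumed with a small SSE line parser. The sketch below is illustrative only: `parse_sse` and `collect_transcript` are hypothetical helper names, and the payload of each `data:` field is assumed to be plain text, since the exact payload format is not specified here. Adjust the handling if the endpoint sends JSON.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from decoded SSE lines.

    Follows the basic SSE framing rules: `event:` and `data:` fields,
    with a blank line terminating each event.
    """
    event = "message"  # SSE default when no `event:` field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line closes the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # drop the single optional leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def collect_transcript(lines: Iterable[str]) -> str:
    """Return the final text, assuming `done` carries the full transcript."""
    tokens: List[str] = []
    for event, data in parse_sse(lines):
        if event == "token":
            tokens.append(data)  # partial output, usable for live display
        elif event == "done":
            return data
        elif event == "error":
            raise RuntimeError(f"transcription failed: {data}")
    # Stream ended without `done`: fall back to the tokens seen so far.
    return "".join(tokens)
```

Any iterable of decoded lines works as input. With the `requests` library, for example, the result of `response.iter_lines(decode_unicode=True)` on a request made with `stream=True` could be passed directly to `collect_transcript`.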