---

tags:
  - audio
  - speech-to-text
  - voxtral
  - meetingmind
library_name: custom
pipeline_tag: automatic-speech-recognition
---


# MeetingMind Voxtral Transcription Endpoint

GPU-accelerated speech-to-text for the MeetingMind pipeline using Voxtral Realtime 4B. Runs as a Hugging Face Inference Endpoint on a T4 GPU with scale-to-zero enabled.

**Model weights**: [`mistral-hackaton-2026/voxtral_model`](https://huggingface.co/mistral-hackaton-2026/voxtral_model) — Voxtral Realtime 4B (BF16 safetensors, loaded from `/repository/voxtral-model/`)

## API

### `GET /health`

Returns service status and GPU availability.

```bash
curl -H "Authorization: Bearer $HF_TOKEN" $ENDPOINT_URL/health
```

```json
{"status": "ok", "gpu_available": true}
```
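
Because the endpoint scales to zero, the first request after an idle period may arrive before the container is up. The following is a minimal sketch of a readiness poll, assuming `/health` returns the JSON shape shown above. The helper names (`is_ready`, `fetch_health`, `wait_until_ready`), the retry count, and the delay are hypothetical, and how the endpoint responds during a cold start is not documented here, so any network or decoding error is simply treated as "not ready yet".

```python
import json
import time
import urllib.request


def is_ready(payload: dict) -> bool:
    """True when the /health payload reports a running service with a GPU."""
    return payload.get("status") == "ok" and payload.get("gpu_available") is True


def fetch_health(endpoint_url: str, token: str, timeout: float = 10.0) -> dict:
    """GET /health with the bearer token and decode the JSON body."""
    request = urllib.request.Request(
        endpoint_url.rstrip("/") + "/health",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_ready(fetch, attempts: int = 30, delay: float = 10.0, sleep=time.sleep) -> bool:
    """Poll `fetch` until it returns a ready payload or the attempts run out."""
    for attempt in range(attempts):
        try:
            if is_ready(fetch()):
                return True
        except (OSError, ValueError):
            # Assumption: network errors and non-JSON bodies mean "still starting".
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A caller would pass a closure such as `lambda: fetch_health(url, token)` to `wait_until_ready` before sending audio.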

### `POST /transcribe`

Speech-to-text using Voxtral Realtime 4B. Returns the full transcription in a single JSON response.

```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe
```

```json
{"text": "Hello, this is a test of the voxtral speech to text system."}
```
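
The same upload can be sketched in Python with only the standard library. The form field name `audio` and the `text` response key come from the curl example and response above. The helper names, the `audio/wav` content type, and the timeout are assumptions for illustration, not part of the service.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes, content_type: str = "audio/wav"):
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(endpoint_url: str, token: str, path: str, timeout: float = 300.0) -> str:
    """POST an audio file to /transcribe and return the "text" field."""
    with open(path, "rb") as handle:
        audio = handle.read()
    body, content_type = build_multipart("audio", os.path.basename(path), audio)
    request = urllib.request.Request(
        endpoint_url.rstrip("/") + "/transcribe",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["text"]
```

Calling `transcribe(os.environ["ENDPOINT_URL"], os.environ["HF_TOKEN"], "speech.wav")` mirrors the curl command. The generous timeout is a guess meant to leave room for a cold start.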

### `POST /transcribe/stream`

Streaming speech-to-text via SSE. Tokens are emitted as they are generated.

```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe/stream
```

Events: `token` (partial), `done` (final text), `error`.
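
The wire format of these events is not shown here. The sketch below assumes standard SSE framing: an `event:` line carrying the event name, one or more `data:` lines carrying the payload, and a blank line closing each event. It treats payloads as raw strings because their encoding (plain text or JSON) is not documented. `parse_sse` and `collect_transcript` are hypothetical helper names.

```python
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space from the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
    # No flush at end of input: an event without its closing blank line is dropped.


def collect_transcript(events: Iterable[Tuple[str, str]]) -> str:
    """Return the `done` payload, raising on `error`.

    Assumption: if the stream ends without `done`, the joined tokens are
    the best available transcript.
    """
    tokens = []
    for name, payload in events:
        if name == "token":
            tokens.append(payload)
        elif name == "done":
            return payload
        elif name == "error":
            raise RuntimeError(payload)
    return "".join(tokens)
```

Feeding the decoded lines of the HTTP response body into `parse_sse` lets a caller display `token` payloads as they arrive and keep the `done` payload as the final text.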

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `VOXTRAL_MODEL_DIR` | `/repository/voxtral-model` | Path to Voxtral model weights |
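
As an illustration of how the variable and its default might be resolved at startup, here is a small sketch. This is not the service's actual loader: `resolve_model_dir` and `list_weight_files` are hypothetical names, and treating an empty value as unset is an assumption.

```python
import os
from pathlib import Path

DEFAULT_MODEL_DIR = "/repository/voxtral-model"


def resolve_model_dir(environ=os.environ) -> Path:
    """Return the model directory, falling back to the documented default."""
    # Assumption: an empty VOXTRAL_MODEL_DIR is treated the same as unset.
    return Path(environ.get("VOXTRAL_MODEL_DIR") or DEFAULT_MODEL_DIR)


def list_weight_files(model_dir: Path) -> list:
    """Names of *.safetensors files directly under model_dir (empty if missing)."""
    if not model_dir.is_dir():
        return []
    return sorted(p.name for p in model_dir.glob("*.safetensors"))
```

Checking `list_weight_files(resolve_model_dir())` before loading gives an early, readable failure when the weights are not mounted where expected.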

## Architecture

- **Base image**: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`
- **Transcription**: Voxtral Realtime 4B via direct safetensors loading (~8GB VRAM)
- **Scale-to-zero**: 15 min idle timeout (~$0.60/hr when active)
- **Diarization & embeddings**: Served separately by the GPU service on machine "tanti"
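
The VRAM and cost figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes the "4B" in the model name means roughly 4 billion parameters and that BF16 stores 2 bytes per parameter. It also assumes the endpoint is billed for the 15-minute idle window before it scales down. Both function names are hypothetical, and the estimate ignores activations and any runtime overhead.

```python
def weights_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def session_cost_usd(
    active_hours: float,
    rate_per_hour: float = 0.60,
    idle_timeout_hours: float = 0.25,
) -> float:
    """Cost of one active session plus the idle tail before scale-to-zero.

    Assumption: the idle window is billed at the same hourly rate.
    """
    return (active_hours + idle_timeout_hours) * rate_per_hour


# ~4e9 parameters at 2 bytes each is about 8 GB, matching the figure above.
print(weights_size_gb(4e9))
# A one-hour meeting plus the 15-minute idle tail at ~$0.60/hr.
print(round(session_cost_usd(1.0), 2))
```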