---
tags:
- audio
- speech-to-text
- voxtral
- meetingmind
library_name: custom
pipeline_tag: automatic-speech-recognition
---
# MeetingMind Voxtral Transcription Endpoint
GPU-accelerated speech-to-text for the MeetingMind pipeline using Voxtral Realtime 4B. Runs as an HF Inference Endpoint on a T4 GPU with scale-to-zero.
**Model weights**: [`mistral-hackaton-2026/voxtral_model`](https://huggingface.co/mistral-hackaton-2026/voxtral_model) — Voxtral Realtime 4B (BF16 safetensors, loaded from `/repository/voxtral-model/`)
## API
### `GET /health`
Returns service status and GPU availability.
```bash
curl -H "Authorization: Bearer $HF_TOKEN" $ENDPOINT_URL/health
```
```json
{"status": "ok", "gpu_available": true}
```
### `POST /transcribe`
Speech-to-text using Voxtral Realtime 4B. Returns full transcription.
```bash
curl -X POST \
-H "Authorization: Bearer $HF_TOKEN" \
-F audio=@speech.wav \
$ENDPOINT_URL/transcribe
```
```json
{"text": "Hello, this is a test of the voxtral speech to text system."}
```
### `POST /transcribe/stream`
Streaming speech-to-text via SSE. Tokens are emitted as they are generated.
```bash
curl -X POST \
-H "Authorization: Bearer $HF_TOKEN" \
-F audio=@speech.wav \
$ENDPOINT_URL/transcribe/stream
```
Events: `token` (partial), `done` (final text), `error`.
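A streaming client has to split the SSE byte stream into these events itself. The parser below is a minimal sketch of that framing (blank-line-delimited `event:`/`data:` pairs); the event names match the list above, but the assumption that each `data:` payload is a JSON object with a `text` key mirrors the non-streaming response and is not guaranteed by this README:

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from an SSE text stream.

    Events are separated by blank lines; each `data:` line is assumed
    to carry a JSON object (e.g. {"text": "..."}).
    """
    event, data_lines = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":
            # Blank line terminates the current event.
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
```

A typical consumer would append the payloads of `token` events to build the running transcript, replace it with the payload of `done`, and abort on `error`.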
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `VOXTRAL_MODEL_DIR` | `/repository/voxtral-model` | Path to Voxtral model weights |
## Architecture
- **Base image**: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`
- **Transcription**: Voxtral Realtime 4B via direct safetensors loading (~8GB VRAM)
- **Scale-to-zero**: scales down after 15 min of inactivity; ~$0.60/hr while active
- **Diarization & embeddings**: Served separately by the GPU service on machine "tanti"