Voxtral Mini 4B Realtime β€” MLX fp16

This is a float16 MLX conversion of mistralai/Voxtral-Mini-4B-Realtime-2602, Mistral AI's streaming speech-to-text model.

Runs via mlx-audio.

Key Details

Parameters: 4B (~3.4B LM + ~0.6B audio encoder)
Precision: float16 (unquantized)
Base model: mistralai/Voxtral-Mini-4B-Realtime-2602
Languages: 13 (Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, Russian)
License: Apache 2.0

See also: int4 variant (smaller, faster)

Usage

pip install "mlx-audio[stt]"
from mlx_audio.stt.utils import load

model = load("mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16")

# Transcribe audio
result = model.generate("audio.wav")
print(result.text)

# Streaming transcription
for chunk in model.generate("audio.wav", stream=True):
    print(chunk, end="", flush=True)

# Adjust transcription delay (lower = faster but less accurate)
result = model.generate("audio.wav", transcription_delay_ms=480)

Recommended Settings

Setting Value Notes
Temperature 0.0 Always use greedy decoding
Transcription delay 480ms Sweet spot of accuracy vs. latency
Delay range 80ms – 2400ms Multiples of 80ms
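Since valid delays are multiples of 80 ms between 80 ms and 2400 ms, a caller can snap an arbitrary requested delay onto that grid before passing it to the model. The helper below is a sketch, not part of mlx-audio; `snap_delay_ms` is a hypothetical name.

```python
# Hypothetical helper (not part of mlx-audio): snap a requested
# transcription delay to the valid grid of 80 ms multiples
# clamped to the supported 80-2400 ms range.
def snap_delay_ms(requested_ms: float) -> int:
    step, lo, hi = 80, 80, 2400
    snapped = round(requested_ms / step) * step
    return max(lo, min(hi, snapped))

print(snap_delay_ms(500))   # 480
print(snap_delay_ms(3000))  # 2400
```

The snapped value can then be used directly, e.g. `model.generate("audio.wav", transcription_delay_ms=snap_delay_ms(500))`.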

Benchmarks (from upstream)

FLEURS (WER%; AVG is over all 13 languages, selected languages shown)

Delay AVG EN FR DE ES ZH JA KO
160ms 12.60 6.46 9.75 9.50 5.34 17.67 19.17 19.81
480ms 8.72 4.90 6.42 6.19 3.31 10.45 9.59 15.74
960ms 7.70 4.34 5.68 4.87 2.98 8.99 6.80 14.90
2400ms 6.73 4.05 5.23 4.15 2.71 8.48 5.50 14.30

Long-form English (WER%)

Delay Meanwhile Earnings-21 Earnings-22 TEDLIUM
480ms 5.05 10.23 12.30 3.17

Architecture

  • Causal audio encoder (~0.6B) with sliding window attention β€” enables true streaming
  • Language model decoder (~3.4B) based on Ministral-3B with adaptive RMS norm conditioned on transcription delay
  • 4x downsampling from encoder to decoder (frame rate = 12.5 Hz)
  • Both components use sliding window attention for unbounded audio length
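The numbers above fit together: at a 12.5 Hz decoder frame rate, one decoder frame covers 80 ms of audio, which is why transcription delays come in 80 ms steps; the 4x downsampling implies the encoder runs at 50 Hz. A quick arithmetic check:

```python
# Worked arithmetic from the architecture bullets above.
DECODER_HZ = 12.5                 # decoder frame rate after downsampling
ENCODER_HZ = DECODER_HZ * 4       # 4x downsampling => encoder at 50 Hz
FRAME_MS = 1000 / DECODER_HZ      # 80.0 ms of audio per decoder frame

def delay_in_frames(delay_ms: float) -> float:
    """How many decoder frames a given transcription delay spans."""
    return delay_ms / FRAME_MS

print(ENCODER_HZ)            # 50.0
print(delay_in_frames(480))  # 6.0 -> the recommended delay is 6 frames
```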
