# Voxtral 4B Realtime — VQF Quantized
Pre-quantized weights for Voxtral-Mini-4B-Realtime-2602 — Mistral AI's 4.4B streaming speech-to-text model. Packaged in the VQF (Voxtral Quantized Format) container, inspired by voxtral.c.
At the default 480ms delay, Voxtral matches Whisper large-v3; at 960ms+ it surpasses it. These quantizations reduce VRAM from ~8.8 GB (BF16) down to 2-5 GB with minimal quality loss.
| File | Quant | Size | Bits/Wt | VRAM | Use Case |
|---|---|---|---|---|---|
| consolidated-q4_k.vqf | Q4_K | 3.0 GB | 4.6 | ~4 GB | Recommended. Best quality-to-size ratio. |
| consolidated-q4_0.vqf | Q4_0 | 3.2 GB | 4.0 | ~4 GB | Simplest format, fastest dequant. |
| consolidated-q6_k.vqf | Q6_K | 3.8 GB | 6.4 | ~5 GB | Near-lossless. Sweet spot for 8 GB GPUs. |
| consolidated-q8_0.vqf | Q8_0 | 5.0 GB | 9.0 | ~7 GB | Reference quality. 12+ GB GPUs. |
VRAM figures include ~1.5 GB of overhead for KV caches and working buffers. Each file also bundles the Tekken BPE tokenizer (tekken.json).
## Quickstart
These weights work with SpeechVox (GPU-accelerated push-to-talk for Windows), voxtral.c, or any runtime supporting the VQF format. Requires an NVIDIA GPU with CUDA 12.0+.
## What's Quantized
406 linear weight matrices across the 32 encoder + 26 decoder layers (wq/wk/wv/wo/w1/w2/w3). Everything else stays BF16: embeddings, norms, biases, adapter weights, conv stem.
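The 406 figure can be sanity-checked with a little arithmetic: each of the 58 transformer layers carries four attention projections and three SwiGLU projections.

```python
# Sanity check of the "406 linear weight matrices" count.
# Each encoder and decoder layer has 4 attention projections (wq/wk/wv/wo)
# and 3 SwiGLU feed-forward projections (w1/w2/w3).
PROJECTIONS = ("wq", "wk", "wv", "wo", "w1", "w2", "w3")
ENCODER_LAYERS, DECODER_LAYERS = 32, 26

quantized_matrices = (ENCODER_LAYERS + DECODER_LAYERS) * len(PROJECTIONS)
assert quantized_matrices == 406
```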
## Benchmarks
Per-language error rates (%) on Google FLEURS test split. Lower is better.
- Paper (BF16): From the Voxtral Realtime paper, full-precision model at 480ms delay
- Q4_K: Our benchmark of the Q4_K quantized model — 517 samples, 101.5 min audio, RTX 4070 Ti, normalized to -23 dBFS
| Language | Samples | Metric | Paper (BF16) | Q4_K |
|---|---|---|---|---|
| English | 60 (9.5 min) | WER | 4.9 | 5.4 |
| Spanish | 52 (10.5 min) | WER | 3.3 | 2.8 |
| French | 56 (10.2 min) | WER | 6.4 | 7.6 |
| German | 42 (10.1 min) | WER | 6.2 | 4.8 |
| Italian | 23 (5.2 min) | WER | 5.7 | 3.1 |
| Portuguese | 46 (10.1 min) | WER | 5.4 | 5.4 |
| Dutch | 32 (5.1 min) | WER | 8.4 | 5.3 |
| Russian | 26 (5.1 min) | WER | 5.4 | 5.8 |
| Arabic | 27 (5.1 min) | WER | 14.4 | 7.9 |
| Hindi | 25 (5.3 min) | WER | 12.9 | 15.0 |
| Korean | 26 (5.2 min) | WER | 11.4 | 15.3 |
| Japanese | 46 (10.1 min) | CER | 9.6 | 8.5 |
| Chinese | 56 (10.1 min) | CER | 10.5 | 9.6 |
Paper numbers are for reference only; they were produced under different test conditions.
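For readers reproducing these numbers, the metric itself is straightforward: WER is word-level edit distance divided by the reference word count (CER is the same on characters). A minimal sketch of the standard formula — note this does not include whatever text normalization the benchmark above applied:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming edit distance.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (rw != hw))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(h)] / len(r)
```

One substitution in three reference words gives a WER of 1/3 ≈ 33.3%.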
## Delay vs. Quality
| Delay | FLEURS WER | vs. Whisper large-v3 |
|---|---|---|
| 480ms | 8.72% | Matches (8.23%) |
| 960ms | 7.70% | Surpasses |
| 2400ms | 6.73% | Surpasses |
## Speed (RTX 4070 Ti, Q4_K, 10s audio)
~340 ms total (29x real-time): encoder 120ms + decoder 190ms + overhead 30ms.
## Architecture
4.4B parameter causal encoder-decoder: 32-layer audio encoder (1280D, full MHA) -> 4x adapter downsample -> 26-layer language decoder (3072D, GQA 32Q/8KV) -> Tekken tokenizer (131K vocab).
### Detailed architecture & constants
```
16kHz mono PCM -> Mel (128 bins) -> Conv Stem (stride 2, 50Hz)
-> Encoder: 32 layers, 1280D, 32 heads, 64 head-dim, SwiGLU 5120, RoPE, window 750
-> Adapter: concat 4 frames (5120) -> Linear -> GELU -> Linear (3072), 12.5Hz
-> Decoder: 26 layers, 3072D, 32Q/8KV heads, 128 head-dim, SwiGLU 9216, RoPE, window 8192
-> Tekken BPE (131K vocab)
```
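The frame rates in the pipeline above can be checked with a few lines of arithmetic. The 10 ms mel hop (100 Hz) is an assumption, but it is the value consistent with the stated stride-2 conv stem and 50 Hz encoder rate:

```python
# Frame-rate bookkeeping for a 10 s clip, following the pipeline above.
# Assumption: 10 ms mel hop (100 Hz), consistent with 50 Hz after stride 2.
SECONDS = 10
MEL_RATE = 100.0           # mel frames per second (assumed 10 ms hop)
ENC_RATE = MEL_RATE / 2    # stride-2 conv stem -> 50 Hz
DEC_RATE = ENC_RATE / 4    # adapter concatenates 4 frames -> 12.5 Hz

mel_frames = int(SECONDS * MEL_RATE)   # 1000 mel frames
enc_frames = int(SECONDS * ENC_RATE)   # 500 encoder frames
dec_frames = int(SECONDS * DEC_RATE)   # 125 decoder positions

# Adapter width: 4 concatenated 1280-d encoder frames -> 5120,
# projected down to the 3072-d decoder dimension.
ENC_DIM, DEC_DIM = 1280, 3072
adapter_in = 4 * ENC_DIM               # 5120
```

So 10 seconds of audio becomes just 125 decoder positions, which is what makes the 12.5 Hz token rate cheap to decode.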
| | Encoder | Decoder |
|---|---|---|
| Dim | 1280 | 3072 |
| Layers | 32 | 26 |
| Q / KV heads | 32 / 32 | 32 / 8 |
| Head dim | 64 | 128 |
| FFN hidden | 5120 | 9216 |
| Window | 750 | 8192 |
## VQF Format
Memory-mapped binary container: [VQF1 header] [tensor descriptors] [64B-aligned data]. Weights are read directly from the mapped file without copying.
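A minimal sketch of the zero-copy access pattern, assuming only what the format notes state (the `VQF1` magic and 64-byte data alignment); the descriptor layout between header and data is not specified here, so offsets are taken as given:

```python
import mmap
import struct

def open_vqf(path: str) -> mmap.mmap:
    """Memory-map a VQF file read-only and verify the magic."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if mm[:4] != b"VQF1":
        raise ValueError("not a VQF container")
    return mm

def tensor_bytes(mm: mmap.mmap, offset: int, nbytes: int) -> memoryview:
    """Slice tensor data out of the mapping without copying it."""
    assert offset % 64 == 0, "tensor data is 64-byte aligned"
    return memoryview(mm)[offset : offset + nbytes]
```

Because the weights are read through `memoryview` slices of the mapping, nothing is copied until a kernel actually touches the pages.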
### Block format details
- **Q4_0** (20 B / 32 values): float `scale` + `uint8 nibs[16]`. Symmetric: `val = scale * (nibble - 8)`
- **Q4_K** (148 B / 256 values): float `super_scale`, `super_min` + `uint8 scales[12]` (6-bit packed) + `uint8 nibs[128]`. Asymmetric, with per-sub-block scale and min.
- **Q6_K** (204 B / 256 values): float `super_scale` + `int8 scales[8]` + `uint8 ql[128]` + `uint8 qh[64]`. Symmetric; each 6-bit value is a 4-bit low part plus a 2-bit high part: `val = super_scale * scales[sub] * (q6 - 32)`
- **Q8_0** (36 B / 32 values): float `scale` + `int8 quants[32]`. Symmetric: `val = scale * quant`
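To make the two simplest layouts concrete, here are reference dequantizers for Q4_0 and Q8_0. Block sizes and formulas follow the descriptions above; the low-nibble-first packing order for Q4_0 is an assumption, as the notes do not specify it:

```python
import struct

def dequant_q4_0(block: bytes) -> list[float]:
    """20-byte Q4_0 block: float32 scale + 16 bytes packing 32 4-bit values."""
    (scale,) = struct.unpack_from("<f", block, 0)
    out = []
    for byte in block[4:20]:
        out.append(scale * ((byte & 0x0F) - 8))  # low nibble (assumed first)
        out.append(scale * ((byte >> 4) - 8))    # high nibble
    return out

def dequant_q8_0(block: bytes) -> list[float]:
    """36-byte Q8_0 block: float32 scale + 32 signed 8-bit quants."""
    (scale,) = struct.unpack_from("<f", block, 0)
    quants = struct.unpack_from("<32b", block, 4)
    return [scale * q for q in quants]
```

Both formats store the quant centered at zero, so a nibble of 8 (Q4_0) or a quant of 0 (Q8_0) dequantizes exactly to 0.0.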
## How to Quantize
```shell
pip install torch safetensors

# Q4_K (recommended)
python quantize/quantize.py /path/to/model_dir consolidated-q4_k.vqf --type Q4_K

# Also: --type Q4_0 | Q6_K | Q8_0
```
~15 seconds per variant on an RTX 4070 Ti. Requires PyTorch with CUDA.
## Limitations
- Transcription only (no translation, no diarization, no timestamps)
- Accent/dialect coverage varies; noisy environments reduce accuracy
- Q4_K recommended over Q4_0 for low-resource languages
## Citation
```bibtex
@misc{voxtral2025realtime,
  title={Voxtral Realtime},
  author={Alexander H. Liu and Andy Ehrenberg and Andy Lo and Angad Kalra and Anna Googasian and Barret Zoph and Bilal Piot and Changil Kim and Daniel Haziza and Daphne Ippolito and David Grangier and Edouard Grave and Francisco Massa and Guillaume Lample and Jade Copet and Leo Boytsov and Luca Wehrstedt and Martin Sundermeyer and Marta R. Costa-jussà and Michael Auli and Mona Diab and Patrick von Platen and Paul-Ambroise Duquenne and Robin Algayres and Ruslan Mavlyutov and Sravya Popuri and Timothée Lacroix and Vineel Pratap},
  year={2025},
  eprint={2602.11298},
  archivePrefix={arXiv}
}
```
Base model: mistralai/Ministral-3-3B-Base-2512