Voxtral 4B Realtime — VQF Quantized

Pre-quantized weights for Voxtral-Mini-4B-Realtime-2602 — Mistral AI's 4.4B streaming speech-to-text model. Packaged in the VQF (Voxtral Quantized Format) container, inspired by voxtral.c.

At the default 480ms delay, Voxtral matches Whisper large-v3; at 960ms+ it surpasses it. These quantizations reduce VRAM from ~8.8 GB (BF16) down to 2-5 GB with minimal quality loss.

File	Quant	Size	Bits/Wt	VRAM	Use Case
`consolidated-q4_k.vqf`	Q4_K	3.0 GB	4.6	~4 GB	Recommended. Best quality-to-size ratio.
`consolidated-q4_0.vqf`	Q4_0	3.2 GB	4.0	~4 GB	Simplest format, fastest dequant.
`consolidated-q6_k.vqf`	Q6_K	3.8 GB	6.4	~5 GB	Near-lossless. Sweet spot for 8 GB GPUs.
`consolidated-q8_0.vqf`	Q8_0	5.0 GB	9.0	~7 GB	Reference quality. 12+ GB GPUs.

VRAM includes ~1.5 GB overhead for KV caches and working buffers. Also includes the Tekken BPE tokenizer (tekken.json).

Quickstart

These weights work with SpeechVox (GPU-accelerated push-to-talk for Windows), voxtral.c, or any runtime supporting the VQF format. Requires an NVIDIA GPU with CUDA 12.0+.

What's Quantized

406 linear weight matrices across the 32 encoder + 26 decoder layers (wq/wk/wv/wo/w1/w2/w3). Everything else stays BF16: embeddings, norms, biases, adapter weights, conv stem.

Benchmarks

Per-language error rates (%) on Google FLEURS test split. Lower is better.

Paper (BF16): From the Voxtral Realtime paper, full-precision model at 480ms delay
Q4_K: Our benchmark of the Q4_K quantized model — 517 samples, 101.5 min audio, RTX 4070 Ti, normalized to -23 dBFS

Language	Samples	Metric	Paper (BF16)	Q4_K
English	60 (9.5 min)	WER	4.9	5.4
Spanish	52 (10.5 min)	WER	3.3	2.8
French	56 (10.2 min)	WER	6.4	7.6
German	42 (10.1 min)	WER	6.2	4.8
Italian	23 (5.2 min)	WER	5.7	3.1
Portuguese	46 (10.1 min)	WER	5.4	5.4
Dutch	32 (5.1 min)	WER	8.4	5.3
Russian	26 (5.1 min)	WER	5.4	5.8
Arabic	27 (5.1 min)	WER	14.4	7.9
Hindi	25 (5.3 min)	WER	12.9	15.0
Korean	26 (5.2 min)	WER	11.4	15.3
Japanese	46 (10.1 min)	CER	9.6	8.5
Chinese	56 (10.1 min)	CER	10.5	9.6

Paper numbers are reference only (different test conditions).

Delay vs. Quality

Delay	FLEURS WER	vs. Whisper large-v3
480ms	8.72%	Matches (8.23%)
960ms	7.70%	Surpasses
2400ms	6.73%	Surpasses

Speed (RTX 4070 Ti, Q4_K, 10s audio)

~340 ms total (29x real-time): encoder 120ms + decoder 190ms + overhead 30ms.

Architecture

4.4B parameter causal encoder-decoder: 32-layer audio encoder (1280D, full MHA) -> 4x adapter downsample -> 26-layer language decoder (3072D, GQA 32Q/8KV) -> Tekken tokenizer (131K vocab).

Detailed architecture & constants

16kHz mono PCM -> Mel (128 bins) -> Conv Stem (stride 2, 50Hz)
  -> Encoder: 32 layers, 1280D, 32 heads, 64 head-dim, SwiGLU 5120, RoPE, window 750
  -> Adapter: concat 4 frames (5120) -> Linear -> GELU -> Linear (3072), 12.5Hz
  -> Decoder: 26 layers, 3072D, 32Q/8KV heads, 128 head-dim, SwiGLU 9216, RoPE, window 8192
  -> Tekken BPE (131K vocab)

	Encoder	Decoder
Dim	1280	3072
Layers	32	26
Q / KV heads	32 / 32	32 / 8
Head dim	64	128
FFN hidden	5120	9216
Window	750	8192

VQF Format

Memory-mapped binary container: [VQF1 header] [tensor descriptors] [64B-aligned data]. Weights are read directly from the mapped file without copying.

Block format details

Q4_0 (20B / 32 values) — float scale + uint8 nibs[16]. Symmetric: val = scale * (nibble - 8)

Q4_K (148B / 256 values) — float super_scale, super_min + uint8 scales[12] (6-bit packed) + uint8 nibs[128]. Asymmetric with per-sub-block scale+min.

Q6_K (204B / 256 values) — float super_scale + int8 scales[8] + uint8 ql[128] + uint8 qh[64]. Symmetric: 6-bit = 4-bit low + 2-bit high. val = super_scale * scales[sub] * (q6 - 32)

Q8_0 (36B / 32 values) — float scale + int8 quants[32]. Symmetric: val = scale * quant

How to Quantize

pip install torch safetensors

# Q4_K (recommended)
python quantize/quantize.py /path/to/model_dir consolidated-q4_k.vqf --type Q4_K

# Also: --type Q4_0 | Q6_K | Q8_0

~15 seconds per variant on an RTX 4070 Ti. Requires PyTorch with CUDA.

Limitations

Transcription only (no translation, no diarization, no timestamps)
Accent/dialect coverage varies; noisy environments reduce accuracy
Q4_K recommended over Q4_0 for low-resource languages

Citation

@misc{voxtral2025realtime,
    title={Voxtral Realtime},
    author={Alexander H. Liu and Andy Ehrenberg and Andy Lo and Angad Kalra and Anna Googasian and Barret Zoph and Bilal Piot and Changil Kim and Daniel Haziza and Daphne Ippolito and David Grangier and Edouard Grave and Francisco Massa and Guillaume Lample and Jade Copet and Leo Boytsov and Luca Wehrstedt and Martin Sundermeyer and Marta R. Costa-jussà and Michael Auli and Mona Diab and Patrick von Platen and Paul-Ambroise Duquenne and Robin Algayres and Ruslan Mavlyutov and Sravya Popuri and Timothée Lacroix and Vineel Pratap},
    year={2025},
    eprint={2602.11298},
    archivePrefix={arXiv}
}

Downloads last month: 9

Model tree for tantk/Voxtral-4B-Realtime-VQF

Base model

mistralai/Ministral-3-3B-Base-2512

Finetuned

mistralai/Voxtral-Mini-4B-Realtime-2602

Quantized

(30)

this model

Paper for tantk/Voxtral-4B-Realtime-VQF

Voxtral Realtime

Paper • 2602.11298 • Published Feb 11 • 28

tantk
/

Voxtral-4B-Realtime-VQF