# Voxtral Mini 4B Realtime Q4 GGUF

Q4_0 quantized weights for Voxtral Mini 4B Realtime (ASR) in GGUF format. For use with voxtral-mini-realtime-rs.

Try the browser demo: it runs entirely client-side via WASM + WebGPU.

## Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-q4.gguf` | ~2.5 GB | Full Q4 model (single file, for native use) |
| `shard-{aa..ae}` | 5 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
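The shard names follow `split(1)`'s default two-letter suffixes, so the full file can be reassembled with `cat`. A hedged sketch of the round trip, demonstrated on a small dummy file (`model.bin` is illustrative; the real shards use a 512 MB block size, and the repo's actual tooling may differ):

```sh
# Create a 1000-byte dummy file and split it into 5 shards of <=200 bytes,
# mirroring how a ~2.5 GB model splits into 5 shards of <=512 MB.
head -c 1000 /dev/zero > model.bin
split -b 200 -a 2 model.bin shard-        # -> shard-aa .. shard-ae

# Reassemble and verify the shards are byte-identical to the original.
cat shard-a? > model-reassembled.bin
cmp model.bin model-reassembled.bin && echo "shards reassemble cleanly"
```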

## Model Details

- Base model: mistralai/Voxtral-Mini-4B-Realtime-2602
- Quantization: Q4_0 (4-bit, 18 bytes per 32 elements)
- File size: ~2.5 GB (vs. ~9 GB for the BF16 original)
- Format: GGUF v3
- Inference: Burn ML framework with custom WGSL compute shaders
- WER: 8.49% on FLEURS English (647 utterances), vs. Mistral's reported 4.90% at f32

## Benchmarks

NVIDIA DGX Spark (GB10, LPDDR5x), 16 s test audio:

| Path | Encode | Decode | Total | RTF | Tok/s | Memory |
|------|--------|--------|-------|-----|-------|--------|
| Q4 GGUF native | 1021 ms | 5578 ms | 6629 ms | 0.416 | 19.4 | 703 MB |
| BF16 native | 887 ms | 23689 ms | 24607 ms | 1.543 | 4.6 | 9.2 GB |
| Q4 GGUF WASM | — | — | ~225 s | ~14.1 | ~0.5 | (browser) |

Q4 decode is 4.2× faster than BF16. The custom WGSL shaders use a shared-memory tiled kernel for decode and a naive kernel for encode.
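The RTF and tok/s columns follow from the stage timings; a small sketch of the arithmetic (function names here are illustrative, not the crate's API):

```rust
/// Real-time factor: processing time divided by audio duration.
/// RTF < 1.0 means faster than real time.
fn rtf(total_ms: f64, audio_secs: f64) -> f64 {
    (total_ms / 1000.0) / audio_secs
}

/// Decode throughput in tokens per second.
fn tokens_per_sec(n_tokens: usize, decode_ms: f64) -> f64 {
    n_tokens as f64 / (decode_ms / 1000.0)
}

fn main() {
    // Q4 native row: 6629 ms total on 16 s of audio -> RTF ~0.41,
    // matching the table's 0.416 up to rounding of the stage timings.
    println!("RTF = {:.3}", rtf(6629.0, 16.0));
    // 19.4 tok/s over a 5578 ms decode implies roughly 108 tokens emitted.
    println!("tok/s = {:.1}", tokens_per_sec(108, 5578.0));
}
```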

## Usage

### Native CLI

```sh
# Download
uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-mini-realtime-gguf --local-dir models

# Transcribe (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  transcribe --audio audio.wav --gguf models/voxtral-q4.gguf
```

### Browser (WASM + WebGPU)

Shards are pre-split for browser loading. The ASR demo loads them automatically.

For local dev:

```sh
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs  # serves shards from models/voxtral-q4-shards/
```

## Architecture

```
Audio (16 kHz) → Mel [B, 128, T] → Encoder [B, T/4, 1280]
  → Reshape [B, T/16, 5120] → Adapter [B, T/16, 3072]
    → Decoder (autoregressive, 26 layers, GQA 32Q/8KV)
      → Token IDs → Text
```
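The shape flow can be traced numerically. A hedged sketch, assuming a 160-sample (10 ms) mel hop at 16 kHz, which is typical for Whisper-style encoders but may differ in this crate; note 5120 = 4 × 1280, i.e. the reshape stacks 4 encoder frames before the adapter projects down to the 3072-dim decoder width:

```rust
/// Returns (mel frames T, encoder frames T/4, decoder frames T/16)
/// for a mono 16 kHz input of `n_samples` samples.
fn shape_flow(n_samples: usize) -> (usize, usize, usize) {
    let hop = 160;            // assumed 10 ms mel hop at 16 kHz
    let t = n_samples / hop;  // mel: [B, 128, T]
    (t, t / 4, t / 16)        // encoder: [B, T/4, 1280]; decoder input: [B, T/16, 3072]
}

fn main() {
    // 16 s of 16 kHz audio = 256_000 samples.
    let (t, t4, t16) = shape_flow(256_000);
    println!("T={t}, T/4={t4}, T/16={t16}");
}
```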

## WASM Constraints Solved

1. 2 GB allocation limit — `ShardedCursor` over multiple `Vec` buffers
2. 4 GB address space — two-phase loading (parse → drop reader → finalize)
3. 1.5 GiB embedding table — Q4 on GPU, plus CPU-side bytes for row lookups
4. No sync GPU readback — `into_data_async().await` throughout
5. 256 workgroup limit — patched cubecl-wgpu to cap reduce kernels
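Constraint 1's `ShardedCursor` can be sketched as a `Read + Seek` adapter presenting several sub-2 GB buffers as one contiguous stream, so the GGUF parser never needs a single huge allocation. This is a hedged reconstruction of the idea; the crate's actual type may differ in naming and details:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// One logical byte stream backed by multiple shards.
struct ShardedCursor {
    shards: Vec<Vec<u8>>,
    pos: u64,
    total: u64,
}

impl ShardedCursor {
    fn new(shards: Vec<Vec<u8>>) -> Self {
        let total = shards.iter().map(|s| s.len() as u64).sum();
        Self { shards, pos: 0, total }
    }
}

impl Read for ShardedCursor {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let mut offset = self.pos;
        // Walk to the shard containing the current position; a read
        // stops at a shard boundary (callers loop, as Read permits).
        for shard in &self.shards {
            let len = shard.len() as u64;
            if offset < len {
                let start = offset as usize;
                let n = buf.len().min(shard.len() - start);
                buf[..n].copy_from_slice(&shard[start..start + n]);
                self.pos += n as u64;
                return Ok(n);
            }
            offset -= len;
        }
        Ok(0) // EOF
    }
}

impl Seek for ShardedCursor {
    fn seek(&mut self, from: SeekFrom) -> Result<u64> {
        let new = match from {
            SeekFrom::Start(p) => p as i64,
            SeekFrom::End(d) => self.total as i64 + d,
            SeekFrom::Current(d) => self.pos as i64 + d,
        };
        self.pos = new.max(0) as u64;
        Ok(self.pos)
    }
}
```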
