# Voxtral Mini 4B Realtime Q4 GGUF

Q4_0 quantized weights for Voxtral Mini 4B Realtime (ASR) in GGUF format. For use with voxtral-mini-realtime-rs.

Try the browser demo: it runs entirely client-side via WASM + WebGPU.

## Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-q4.gguf` | ~2.5 GB | Full Q4 model (single file, for native use) |
| `shard-{aa..ae}` | 5 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
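The shard names follow `split(1)`'s default two-letter suffixes, so the full file can be reassembled with `cat`. A hedged sketch of the round trip, demonstrated on a small dummy file (`model.bin` is illustrative; the real shards use a 512 MB block size, and the repo's actual tooling may differ):

```sh
# Create a 1000-byte dummy file and split it into 5 shards of <=200 bytes,
# mirroring how a ~2.5 GB model splits into 5 shards of <=512 MB.
head -c 1000 /dev/zero > model.bin
split -b 200 -a 2 model.bin shard-        # -> shard-aa .. shard-ae

# Reassemble and verify the shards are byte-identical to the original.
cat shard-a? > model-reassembled.bin
cmp model.bin model-reassembled.bin && echo "shards reassemble cleanly"
```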

## Model Details

- Base model: mistralai/Voxtral-Mini-4B-Realtime-2602
- Quantization: Q4_0 (4-bit, 18 bytes per 32 elements)
- File size: ~2.5 GB (vs. ~9 GB for the BF16 original)
- Format: GGUF v3
- Inference: Burn ML framework with custom WGSL compute shaders
- WER: 8.49% on FLEURS English (647 utterances), vs. Mistral's reported 4.90% at f32

## Benchmarks

NVIDIA DGX Spark (GB10, LPDDR5x), 16 s test audio:

| Path | Encode | Decode | Total | RTF | Tok/s | Memory |
|------|--------|--------|-------|-----|-------|--------|
| Q4 GGUF native | 1021 ms | 5578 ms | 6629 ms | 0.416 | 19.4 | 703 MB |
| BF16 native | 887 ms | 23689 ms | 24607 ms | 1.543 | 4.6 | 9.2 GB |
| Q4 GGUF WASM | — | — | ~225 s | ~14.1 | ~0.5 | (browser) |

Q4 decode is 4.2× faster than BF16. The custom WGSL shaders use a shared-memory tiled kernel for decode and a naive kernel for encode.
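The RTF and tok/s columns follow from the stage timings; a small sketch of the arithmetic (function names here are illustrative, not the crate's API):

```rust
/// Real-time factor: processing time divided by audio duration.
/// RTF < 1.0 means faster than real time.
fn rtf(total_ms: f64, audio_secs: f64) -> f64 {
    (total_ms / 1000.0) / audio_secs
}

/// Decode throughput in tokens per second.
fn tokens_per_sec(n_tokens: usize, decode_ms: f64) -> f64 {
    n_tokens as f64 / (decode_ms / 1000.0)
}

fn main() {
    // Q4 native row: 6629 ms total on 16 s of audio -> RTF ~0.41,
    // matching the table's 0.416 up to rounding of the stage timings.
    println!("RTF = {:.3}", rtf(6629.0, 16.0));
    // 19.4 tok/s over a 5578 ms decode implies roughly 108 tokens emitted.
    println!("tok/s = {:.1}", tokens_per_sec(108, 5578.0));
}
```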

## Usage

### Native CLI

```sh
# Download
uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-mini-realtime-gguf --local-dir models

# Transcribe (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  transcribe --audio audio.wav --gguf models/voxtral-q4.gguf
```

### Browser (WASM + WebGPU)

Shards are pre-split for browser loading. The ASR demo loads them automatically.

For local dev:

```sh
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs  # serves shards from models/voxtral-q4-shards/
```

## Architecture

```
Audio (16 kHz) → Mel [B, 128, T] → Encoder [B, T/4, 1280]
  → Reshape [B, T/16, 5120] → Adapter [B, T/16, 3072]
    → Decoder (autoregressive, 26 layers, GQA 32Q/8KV)
      → Token IDs → Text
```
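The shape flow can be traced numerically. A hedged sketch, assuming a 160-sample (10 ms) mel hop at 16 kHz, which is typical for Whisper-style encoders but may differ in this crate; note 5120 = 4 × 1280, i.e. the reshape stacks 4 encoder frames before the adapter projects down to the 3072-dim decoder width:

```rust
/// Returns (mel frames T, encoder frames T/4, decoder frames T/16)
/// for a mono 16 kHz input of `n_samples` samples.
fn shape_flow(n_samples: usize) -> (usize, usize, usize) {
    let hop = 160;            // assumed 10 ms mel hop at 16 kHz
    let t = n_samples / hop;  // mel: [B, 128, T]
    (t, t / 4, t / 16)        // encoder: [B, T/4, 1280]; decoder input: [B, T/16, 3072]
}

fn main() {
    // 16 s of 16 kHz audio = 256_000 samples.
    let (t, t4, t16) = shape_flow(256_000);
    println!("T={t}, T/4={t4}, T/16={t16}");
}
```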

## WASM Constraints Solved

1. 2 GB allocation limit — `ShardedCursor` over multiple `Vec` buffers
2. 4 GB address space — two-phase loading (parse → drop reader → finalize)
3. 1.5 GiB embedding table — Q4 on GPU, plus CPU-side bytes for row lookups
4. No sync GPU readback — `into_data_async().await` throughout
5. 256 workgroup limit — patched cubecl-wgpu to cap reduce kernels
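Constraint 1's `ShardedCursor` can be sketched as a `Read + Seek` adapter presenting several sub-2 GB buffers as one contiguous stream, so the GGUF parser never needs a single huge allocation. This is a hedged reconstruction of the idea; the crate's actual type may differ in naming and details:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// One logical byte stream backed by multiple shards.
struct ShardedCursor {
    shards: Vec<Vec<u8>>,
    pos: u64,
    total: u64,
}

impl ShardedCursor {
    fn new(shards: Vec<Vec<u8>>) -> Self {
        let total = shards.iter().map(|s| s.len() as u64).sum();
        Self { shards, pos: 0, total }
    }
}

impl Read for ShardedCursor {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let mut offset = self.pos;
        // Walk to the shard containing the current position; a read
        // stops at a shard boundary (callers loop, as Read permits).
        for shard in &self.shards {
            let len = shard.len() as u64;
            if offset < len {
                let start = offset as usize;
                let n = buf.len().min(shard.len() - start);
                buf[..n].copy_from_slice(&shard[start..start + n]);
                self.pos += n as u64;
                return Ok(n);
            }
            offset -= len;
        }
        Ok(0) // EOF
    }
}

impl Seek for ShardedCursor {
    fn seek(&mut self, from: SeekFrom) -> Result<u64> {
        let new = match from {
            SeekFrom::Start(p) => p as i64,
            SeekFrom::End(d) => self.total as i64 + d,
            SeekFrom::Current(d) => self.pos as i64 + d,
        };
        self.pos = new.max(0) as u64;
        Ok(self.pos)
    }
}
```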
