# Voxtral Mini 4B Realtime – Q4 GGUF

Q4_0-quantized GGUF weights for Voxtral Mini 4B Realtime, converted for use with `voxtral-mini-realtime-rs`.

## Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-q4.gguf` | 2.51 GB | Q4_0 quantized model weights (GGUF v3) |
| `tekken.json` | 14.9 MB | Tekken tokenizer (131,072 vocab) |
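
If you want to sanity-check the download before loading it, the GGUF container makes that cheap: every GGUF file begins with the 4-byte magic `GGUF` followed by a little-endian `u32` version. A minimal Rust sketch (the helper name is ours for illustration, not part of `voxtral-mini-realtime-rs`):

```rust
use std::fs::File;
use std::io::Read;

/// Hypothetical helper: read the first 8 bytes of a GGUF file and check
/// the magic and version. `voxtral-q4.gguf` should report version 3.
fn check_gguf_header(path: &str) -> std::io::Result<()> {
    let mut header = [0u8; 8];
    File::open(path)?.read_exact(&mut header)?;
    assert_eq!(&header[0..4], b"GGUF", "not a GGUF file");
    let version = u32::from_le_bytes(header[4..8].try_into().unwrap());
    println!("GGUF version {version}");
    Ok(())
}
```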

## Usage

### Native CLI

```bash
cargo run --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
  --audio input.wav \
  --gguf models/voxtral-q4.gguf \
  --tokenizer models/voxtral/tekken.json
```

### Browser (WASM + WebGPU)

The Q4 GGUF is designed to run entirely client-side in a browser tab via WebGPU. See the GitHub repo for the full WASM build and dev server setup.
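
As a rough sketch of the prerequisite step, this is how a wasm32 build can acquire a WebGPU device through the `wgpu` crate (illustrative only, written against the wgpu ~0.20 API; see the repo for the actual setup):

```rust
/// Illustrative only (not the repo's actual code): request a WebGPU
/// adapter and device from inside a wasm32 build. A `None` result means
/// the browser tab has no usable WebGPU support.
async fn init_webgpu() -> Option<(wgpu::Device, wgpu::Queue)> {
    let instance = wgpu::Instance::default();
    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions::default())
        .await?;
    adapter
        .request_device(&wgpu::DeviceDescriptor::default(), None)
        .await
        .ok()
}
```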

## Quantization Details

- **Method:** Q4_0 (4-bit quantization, block size 32, 18 bytes per block; see the decode sketch below)
- **Original model:** mistralai/Voxtral-Mini-4B-Realtime-2602 (~16 GB F32)
- **Quantized size:** ~2.5 GB (small enough to fit in browser memory)
- **Inference:** custom WGSL shader fusing GPU dequantization with the matmul
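
To make the block layout concrete, here is a minimal CPU-side sketch of Q4_0 decoding (assuming the `half` crate for f16; in the actual inference path this logic runs fused inside the WGSL matmul shader rather than on the CPU):

```rust
/// Decode one Q4_0 block: 18 bytes -> 32 f32 weights.
/// Layout: a 2-byte f16 scale `d`, then 16 bytes of packed 4-bit quants.
fn dequantize_q4_0_block(block: &[u8; 18], out: &mut [f32; 32]) {
    let d = half::f16::from_le_bytes([block[0], block[1]]).to_f32();
    for j in 0..16 {
        let byte = block[2 + j];
        // Low nibble is element j, high nibble element j + 16; subtracting
        // 8 recenters the unsigned 4-bit value around zero before scaling.
        out[j] = d * f32::from((byte & 0x0F) as i8 - 8);
        out[j + 16] = d * f32::from((byte >> 4) as i8 - 8);
    }
}
```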
