Voxtral Mini 4B Realtime Q4 GGUF
Q4_0 quantized weights for Voxtral Mini 4B Realtime (ASR) in GGUF format. For use with voxtral-mini-realtime-rs.
Try the browser demo β runs entirely client-side via WASM + WebGPU.
Files
| File | Size | Description |
|---|---|---|
voxtral-q4.gguf |
~2.5 GB | Full Q4 model (single file, for native use) |
shard-{aa..ae} |
5 Γ β€512 MB | Sharded for browser (WASM ArrayBuffer limit) |
tekken.json |
14.9 MB | Tekken BPE tokenizer |
Model Details
- Base model: mistralai/Voxtral-Mini-4B-Realtime-2602
- Quantization: Q4_0 (4-bit, 18 bytes per 32 elements)
- File size: ~2.5 GB (vs ~9 GB BF16 original)
- Format: GGUF v3
- Inference: Burn ML framework with custom WGSL compute shaders
- WER: 8.49% on FLEURS English (647 utterances), vs. Mistral's reported 4.90% at f32
Benchmarks
NVIDIA DGX Spark (GB10, LPDDR5x), 16s test audio:
| Path | Encode | Decode | Total | RTF | Tok/s | Memory |
|---|---|---|---|---|---|---|
| Q4 GGUF native | 1021 ms | 5578 ms | 6629 ms | 0.416 | 19.4 | 703 MB |
| BF16 native | 887 ms | 23689 ms | 24607 ms | 1.543 | 4.6 | 9.2 GB |
| Q4 GGUF WASM | β | β | ~225 s | ~14.1 | ~0.5 | (browser) |
Q4 decode is 4.2x faster than BF16. Custom WGSL shaders with shared-memory tiled kernel for decode, naive kernel for encode.
Usage
Native CLI
# Download
uv run --with huggingface_hub \
hf download TrevorJS/voxtral-mini-realtime-gguf --local-dir models
# Transcribe (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
transcribe --audio audio.wav --gguf models/voxtral-q4.gguf
Browser (WASM + WebGPU)
Shards are pre-split for browser loading. The ASR demo loads them automatically.
For local dev:
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs # serves shards from models/voxtral-q4-shards/
Architecture
Audio (16kHz) β Mel [B, 128, T] β Encoder [B, T/4, 1280]
β Reshape [B, T/16, 5120] β Adapter [B, T/16, 3072]
β Decoder (autoregressive, 26 layers, GQA 32Q/8KV)
β Token IDs β Text
WASM Constraints Solved
- 2 GB allocation limit β ShardedCursor over multiple Vec
- 4 GB address space β Two-phase loading (parse β drop reader β finalize)
- 1.5 GiB embedding table β Q4 on GPU + CPU bytes for row lookups
- No sync GPU readback β
into_data_async().awaitthroughout - 256 workgroup limit β Patched cubecl-wgpu to cap reduce kernels
Related
- Code: TrevorS/voxtral-mini-realtime-rs
- TTS Model: TrevorJS/voxtral-tts-q4-gguf
- ASR Demo: TrevorJS/voxtral-mini-realtime
- TTS Demo: TrevorJS/voxtral-4b-tts
- Downloads last month
- 318
Hardware compatibility
Log In to add your hardware
We're not able to determine the quantization variants.
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support
Model tree for TrevorJS/voxtral-mini-realtime-gguf
Base model
mistralai/Ministral-3-3B-Base-2512 Finetuned
mistralai/Voxtral-Mini-4B-Realtime-2602