---
license: apache-2.0
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
tags:
  - asr
  - speech-recognition
  - voxtral
  - gguf
  - q4
  - burn
  - rust
  - wasm
  - webgpu
language:
  - en
---

# Voxtral Mini 4B Realtime Q4 GGUF

Q4_0-quantized weights for Voxtral Mini 4B Realtime (ASR) in GGUF format, for use with `voxtral-mini-realtime-rs`.

Try the browser demo: it runs entirely client-side via WASM + WebGPU.

## Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-q4.gguf` | ~2.5 GB | Full Q4 model (single file, for native use) |
| `shard-{aa..ae}` | 5 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
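The `shard-aa..shard-ae` names match `split(1)`'s default alphabetic suffixes, so re-sharding the single-file model locally should look like the sketch below (the exact command used to produce the published shards is an assumption):

```sh
# Re-shard the single-file model for the browser (assumed equivalent to how
# the published shard-aa..shard-ae files were made):
#   split -b 512M voxtral-q4.gguf shard-
#
# The same mechanics demonstrated on a throwaway file:
head -c 1000 /dev/zero > demo.bin
split -b 300 demo.bin demo-shard-
wc -c demo-shard-*   # three 300-byte shards plus one 100-byte remainder
```

Concatenating the shards in suffix order reproduces the original file byte-for-byte.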

## Model Details

- **Base model:** mistralai/Voxtral-Mini-4B-Realtime-2602
- **Quantization:** Q4_0 (4-bit, 18 bytes per 32 elements)
- **File size:** ~2.5 GB (vs. ~9 GB for the BF16 original)
- **Format:** GGUF v3
- **Inference:** Burn ML framework with custom WGSL compute shaders
- **WER:** 8.49% on FLEURS English (647 utterances), vs. Mistral's reported 4.90% at f32
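The "18 bytes per 32 elements" figure comes from the standard ggml Q4_0 block layout: a 2-byte little-endian f16 scale plus 16 bytes of packed nibbles. A minimal Rust dequantizer for one block (illustrative only; not the repo's actual WGSL kernel):

```rust
/// Decode an f16 bit pattern to f32 (no external crates).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = (bits >> 10) & 0x1f;
    let frac = (bits & 0x3ff) as f32;
    match exp {
        0 => sign * frac * 2f32.powi(-24), // subnormal
        0x1f if frac == 0.0 => sign * f32::INFINITY,
        0x1f => f32::NAN,
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp as i32 - 15),
    }
}

/// One Q4_0 block: 18 bytes -> 32 f32 values. ggml layout: element j sits
/// in the low nibble of byte j, element j+16 in the high nibble, and each
/// nibble decodes as (nibble - 8) * d.
fn dequant_q4_0(block: &[u8; 18]) -> [f32; 32] {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut out = [0.0f32; 32];
    for j in 0..16 {
        let byte = block[2 + j];
        out[j] = ((byte & 0x0f) as i32 - 8) as f32 * d; // low nibble
        out[j + 16] = ((byte >> 4) as i32 - 8) as f32 * d; // high nibble
    }
    out
}

fn main() {
    let mut blk = [0u8; 18];
    blk[1] = 0x3c; // f16 1.0 is bytes [0x00, 0x3c] little-endian
    blk[2] = 0x19; // low nibble 9 -> +1.0, high nibble 1 -> -7.0
    let v = dequant_q4_0(&blk);
    assert_eq!(v[0], 1.0);
    assert_eq!(v[16], -7.0);
    assert_eq!(v[1], -8.0); // zero nibbles decode to -8 * d
    println!("18 bytes / 32 elements = 4.5 bits per weight");
}
```

18 bytes per 32 weights works out to 4.5 bits per weight, which is where the ~2.5 GB total comes from.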

## Benchmarks

NVIDIA DGX Spark (GB10, LPDDR5x), 16s test audio:

| Path | Encode | Decode | Total | RTF | Tok/s | Memory |
|------|--------|--------|-------|-----|-------|--------|
| Q4 GGUF native | 1021 ms | 5578 ms | 6629 ms | 0.416 | 19.4 | 703 MB |
| BF16 native | 887 ms | 23689 ms | 24607 ms | 1.543 | 4.6 | 9.2 GB |
| Q4 GGUF WASM | – | – | ~225 s | ~14.1 | ~0.5 | (browser) |

Q4 decode is 4.2× faster than BF16. Decode uses a custom shared-memory tiled WGSL kernel; encode uses a naive kernel.

## Usage

### Native CLI

```sh
# Download
uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-mini-realtime-gguf --local-dir models

# Transcribe (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  transcribe --audio audio.wav --gguf models/voxtral-q4.gguf
```

### Browser (WASM + WebGPU)

Shards are pre-split for browser loading. The ASR demo loads them automatically.

For local dev:

```sh
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs  # serves shards from models/voxtral-q4-shards/
```

## Architecture

```
Audio (16 kHz) → Mel [B, 128, T] → Encoder [B, T/4, 1280]
  → Reshape [B, T/16, 5120] → Adapter [B, T/16, 3072]
    → Decoder (autoregressive, 26 layers, GQA 32Q/8KV)
      → Token IDs → Text
```
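The shape arithmetic above can be checked mechanically: the encoder downsamples time 4×, then the reshape stacks four 1280-dim encoder frames into one 5120-dim vector (4 × 1280 = 5120) before the adapter projects to the 3072-dim decoder space. A small Rust sketch of the shape walk (arithmetic only, no ML framework):

```rust
/// Tensor shapes at each pipeline stage for a batch of `b` clips with
/// `t` mel frames (illustrative; assumes t is divisible by 16).
fn pipeline_shapes(b: usize, t: usize) -> [(&'static str, [usize; 3]); 4] {
    assert!(t % 16 == 0, "mel frame count assumed divisible by 16 here");
    [
        ("mel", [b, 128, t]),
        ("encoder", [b, t / 4, 1280]),
        ("reshape", [b, t / 16, 5120]),
        ("adapter", [b, t / 16, 3072]),
    ]
}

fn main() {
    for (name, shape) in pipeline_shapes(1, 1600) {
        println!("{name}: {shape:?}");
    }
    // The reshape only regroups elements, so element counts must match:
    assert_eq!(1 * (1600 / 4) * 1280, 1 * (1600 / 16) * 5120);
}
```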

## WASM Constraints Solved

1. **2 GB allocation limit:** `ShardedCursor` over multiple `Vec`s
2. **4 GB address space:** two-phase loading (parse → drop reader → finalize)
3. **1.5 GiB embedding table:** Q4 on GPU plus CPU-side bytes for row lookups
4. **No sync GPU readback:** `into_data_async().await` throughout
5. **256 workgroup limit:** patched cubecl-wgpu to cap reduce kernels
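Constraint 1 is the classic trick of presenting several sub-2 GB buffers as one logical byte stream, so no single allocation crosses the WASM ArrayBuffer limit. A minimal sketch of the idea (the actual `ShardedCursor` in `voxtral-mini-realtime-rs` may differ):

```rust
use std::io::Read;

/// Presents several independently allocated buffers as one contiguous
/// readable stream (illustrative sketch, not the repo's real type).
struct ShardedCursor {
    shards: Vec<Vec<u8>>,
    shard: usize,  // index of the shard currently being read
    offset: usize, // read position within that shard
}

impl ShardedCursor {
    fn new(shards: Vec<Vec<u8>>) -> Self {
        Self { shards, shard: 0, offset: 0 }
    }
}

impl Read for ShardedCursor {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        while self.shard < self.shards.len() {
            let cur = &self.shards[self.shard];
            if self.offset < cur.len() {
                let n = buf.len().min(cur.len() - self.offset);
                buf[..n].copy_from_slice(&cur[self.offset..self.offset + n]);
                self.offset += n;
                return Ok(n);
            }
            // Current shard exhausted; advance to the next one.
            self.shard += 1;
            self.offset = 0;
        }
        Ok(0) // EOF
    }
}

fn main() {
    let mut c = ShardedCursor::new(vec![vec![1, 2], vec![3], vec![4, 5]]);
    let mut out = Vec::new();
    c.read_to_end(&mut out).unwrap();
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
}
```

Because it implements `std::io::Read`, any GGUF parser that takes a reader can consume the shards without ever seeing a single oversized buffer.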

## Related