---
license: apache-2.0
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
tags:
  - asr
  - speech-recognition
  - voxtral
  - gguf
  - q4
  - burn
  - rust
  - wasm
  - webgpu
language:
  - en
---

# Voxtral Mini 4B Realtime Q4 GGUF

Q4_0-quantized weights for Voxtral Mini 4B Realtime (ASR) in GGUF format, for use with `voxtral-mini-realtime-rs`.

Try the browser demo: it runs entirely client-side via WASM + WebGPU.

## Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-q4.gguf` | ~2.5 GB | Full Q4 model (single file, for native use) |
| `shard-{aa..ae}` | 5 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
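The `shard-aa..shard-ae` names match `split(1)`'s default alphabetic suffixes, so re-sharding the single-file model locally should look like the sketch below (the exact command used to produce the published shards is an assumption):

```sh
# Re-shard the single-file model for the browser (assumed equivalent to how
# the published shard-aa..shard-ae files were made):
#   split -b 512M voxtral-q4.gguf shard-
#
# The same mechanics demonstrated on a throwaway file:
head -c 1000 /dev/zero > demo.bin
split -b 300 demo.bin demo-shard-
wc -c demo-shard-*   # three 300-byte shards plus one 100-byte remainder
```

Concatenating the shards in suffix order reproduces the original file byte-for-byte.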

## Model Details

- **Base model:** mistralai/Voxtral-Mini-4B-Realtime-2602
- **Quantization:** Q4_0 (4-bit, 18 bytes per 32 elements)
- **File size:** ~2.5 GB (vs. ~9 GB for the BF16 original)
- **Format:** GGUF v3
- **Inference:** Burn ML framework with custom WGSL compute shaders
- **WER:** 8.49% on FLEURS English (647 utterances), vs. Mistral's reported 4.90% at f32
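The "18 bytes per 32 elements" figure comes from the standard ggml Q4_0 block layout: a 2-byte little-endian f16 scale plus 16 bytes of packed nibbles. A minimal Rust dequantizer for one block (illustrative only; not the repo's actual WGSL kernel):

```rust
/// Decode an f16 bit pattern to f32 (no external crates).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = (bits >> 10) & 0x1f;
    let frac = (bits & 0x3ff) as f32;
    match exp {
        0 => sign * frac * 2f32.powi(-24), // subnormal
        0x1f if frac == 0.0 => sign * f32::INFINITY,
        0x1f => f32::NAN,
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp as i32 - 15),
    }
}

/// One Q4_0 block: 18 bytes -> 32 f32 values. ggml layout: element j sits
/// in the low nibble of byte j, element j+16 in the high nibble, and each
/// nibble decodes as (nibble - 8) * d.
fn dequant_q4_0(block: &[u8; 18]) -> [f32; 32] {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut out = [0.0f32; 32];
    for j in 0..16 {
        let byte = block[2 + j];
        out[j] = ((byte & 0x0f) as i32 - 8) as f32 * d; // low nibble
        out[j + 16] = ((byte >> 4) as i32 - 8) as f32 * d; // high nibble
    }
    out
}

fn main() {
    let mut blk = [0u8; 18];
    blk[1] = 0x3c; // f16 1.0 is bytes [0x00, 0x3c] little-endian
    blk[2] = 0x19; // low nibble 9 -> +1.0, high nibble 1 -> -7.0
    let v = dequant_q4_0(&blk);
    assert_eq!(v[0], 1.0);
    assert_eq!(v[16], -7.0);
    assert_eq!(v[1], -8.0); // zero nibbles decode to -8 * d
    println!("18 bytes / 32 elements = 4.5 bits per weight");
}
```

18 bytes per 32 weights works out to 4.5 bits per weight, which is where the ~2.5 GB total comes from.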

## Benchmarks

NVIDIA DGX Spark (GB10, LPDDR5x), 16s test audio:

| Path | Encode | Decode | Total | RTF | Tok/s | Memory |
|------|--------|--------|-------|-----|-------|--------|
| Q4 GGUF native | 1021 ms | 5578 ms | 6629 ms | 0.416 | 19.4 | 703 MB |
| BF16 native | 887 ms | 23689 ms | 24607 ms | 1.543 | 4.6 | 9.2 GB |
| Q4 GGUF WASM | – | – | ~225 s | ~14.1 | ~0.5 | (browser) |

Q4 decode is 4.2× faster than BF16. Decode uses a custom shared-memory tiled WGSL kernel; encode uses a naive kernel.

## Usage

### Native CLI

```sh
# Download
uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-mini-realtime-gguf --local-dir models

# Transcribe (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  transcribe --audio audio.wav --gguf models/voxtral-q4.gguf
```

### Browser (WASM + WebGPU)

Shards are pre-split for browser loading. The ASR demo loads them automatically.

For local dev:

```sh
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs  # serves shards from models/voxtral-q4-shards/
```

## Architecture

```
Audio (16 kHz) → Mel [B, 128, T] → Encoder [B, T/4, 1280]
  → Reshape [B, T/16, 5120] → Adapter [B, T/16, 3072]
    → Decoder (autoregressive, 26 layers, GQA 32Q/8KV)
      → Token IDs → Text
```
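The shape arithmetic above can be checked mechanically: the encoder downsamples time 4×, then the reshape stacks four 1280-dim encoder frames into one 5120-dim vector (4 × 1280 = 5120) before the adapter projects to the 3072-dim decoder space. A small Rust sketch of the shape walk (arithmetic only, no ML framework):

```rust
/// Tensor shapes at each pipeline stage for a batch of `b` clips with
/// `t` mel frames (illustrative; assumes t is divisible by 16).
fn pipeline_shapes(b: usize, t: usize) -> [(&'static str, [usize; 3]); 4] {
    assert!(t % 16 == 0, "mel frame count assumed divisible by 16 here");
    [
        ("mel", [b, 128, t]),
        ("encoder", [b, t / 4, 1280]),
        ("reshape", [b, t / 16, 5120]),
        ("adapter", [b, t / 16, 3072]),
    ]
}

fn main() {
    for (name, shape) in pipeline_shapes(1, 1600) {
        println!("{name}: {shape:?}");
    }
    // The reshape only regroups elements, so element counts must match:
    assert_eq!(1 * (1600 / 4) * 1280, 1 * (1600 / 16) * 5120);
}
```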

## WASM Constraints Solved

1. **2 GB allocation limit:** `ShardedCursor` over multiple `Vec`s
2. **4 GB address space:** two-phase loading (parse → drop reader → finalize)
3. **1.5 GiB embedding table:** Q4 on GPU plus CPU-side bytes for row lookups
4. **No sync GPU readback:** `into_data_async().await` throughout
5. **256 workgroup limit:** patched cubecl-wgpu to cap reduce kernels
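Constraint 1 is the classic trick of presenting several sub-2 GB buffers as one logical byte stream, so no single allocation crosses the WASM ArrayBuffer limit. A minimal sketch of the idea (the actual `ShardedCursor` in `voxtral-mini-realtime-rs` may differ):

```rust
use std::io::Read;

/// Presents several independently allocated buffers as one contiguous
/// readable stream (illustrative sketch, not the repo's real type).
struct ShardedCursor {
    shards: Vec<Vec<u8>>,
    shard: usize,  // index of the shard currently being read
    offset: usize, // read position within that shard
}

impl ShardedCursor {
    fn new(shards: Vec<Vec<u8>>) -> Self {
        Self { shards, shard: 0, offset: 0 }
    }
}

impl Read for ShardedCursor {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        while self.shard < self.shards.len() {
            let cur = &self.shards[self.shard];
            if self.offset < cur.len() {
                let n = buf.len().min(cur.len() - self.offset);
                buf[..n].copy_from_slice(&cur[self.offset..self.offset + n]);
                self.offset += n;
                return Ok(n);
            }
            // Current shard exhausted; advance to the next one.
            self.shard += 1;
            self.offset = 0;
        }
        Ok(0) // EOF
    }
}

fn main() {
    let mut c = ShardedCursor::new(vec![vec![1, 2], vec![3], vec![4, 5]]);
    let mut out = Vec::new();
    c.read_to_end(&mut out).unwrap();
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
}
```

Because it implements `std::io::Read`, any GGUF parser that takes a reader can consume the shards without ever seeing a single oversized buffer.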

## Related