# MiMo-V2.5-ASR GGUF

GGUF conversion of XiaomiMiMo/MiMo-V2.5-ASR for CrispStrobe/CrispASR. Pure C++ inference – no Python, no Transformers – runs on Apple Silicon (Metal), CPU, and CUDA.

The runtime is functional end-to-end: greedy decode through the 36-layer Qwen2 LM, full prefill + step-decode KV-cached graphs, and prompt construction matching the upstream `MimoAudio.asr_sft` reference exactly. The JFK transcription test passes verbatim.
## Available variants

| File | Quant | Size | Layout | Recommended |
|---|---|---|---|---|
| `mimo-asr-f16.gguf` | F16 | 14.9 GB | separate Q/K/V | Full precision; needs ~16 GB RAM during inference |
| `mimo-asr-q4_k.gguf` | Q4_K | 4.2 GB | fused QKV | Default – fits in 8 GB RAM, no quality loss visible on JFK |
The default `mimo-asr-q4_k.gguf` (re-uploaded May 2026, PLAN #60d) ships with per-LM-layer Q/K/V projections fused into a single `model.layers.{i}.attn.qkv.{weight,bias}` tensor pair, yielding ~1.7× faster per-step decode on M1 versus the prior unfused layout (3058 ms/step → 1806 ms/step on a contended-disk run; ~1.1-1.2× pure-compute on a quiet box). The CrispASR runtime auto-detects either layout: the F16 file above keeps working unchanged via the separate-Q/K/V fallback path. A re-upload of a fused F16 is queued behind disk-headroom availability.
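The equivalence behind the fusion can be sketched in numpy: stacking the Q/K/V weight matrices along the output dimension turns three matmuls plus three bias adds into one of each, with the result sliced back apart afterwards. Toy dimensions below (the real layer uses hidden=4096 with GQA-sized K/V projections); this is an illustration of the technique, not the C++ runtime's code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64              # toy size; the real LM uses hidden=4096
q_dim, kv_dim = 64, 16   # toy GQA split: K/V are smaller than Q

x = rng.standard_normal((3, hidden)).astype(np.float32)  # 3 token positions
W_q = rng.standard_normal((hidden, q_dim)).astype(np.float32)
W_k = rng.standard_normal((hidden, kv_dim)).astype(np.float32)
W_v = rng.standard_normal((hidden, kv_dim)).astype(np.float32)
b_q = rng.standard_normal(q_dim).astype(np.float32)
b_k = rng.standard_normal(kv_dim).astype(np.float32)
b_v = rng.standard_normal(kv_dim).astype(np.float32)

# Unfused layout: 3 matmuls + 3 bias adds per layer
q, k, v = x @ W_q + b_q, x @ W_k + b_k, x @ W_v + b_v

# Fused layout: concatenate along the output dimension -> 1 matmul + 1 add,
# then slice the combined result back into q/k/v
W_qkv = np.concatenate([W_q, W_k, W_v], axis=1)
b_qkv = np.concatenate([b_q, b_k, b_v])
qkv = x @ W_qkv + b_qkv
q2, k2, v2 = np.split(qkv, [q_dim, q_dim + kv_dim], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```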
Pair with cstr/mimo-tokenizer-GGUF – the audio tokenizer is a separate model that converts 16 kHz PCM into the 8-channel RVQ codes that this LM consumes.
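Illustrative only: a toy residual vector quantiser in numpy, showing how 8 stacked codebooks each encode the residual left by the previous stage, so one frame becomes an 8-tuple of code indices. Sizes and codebooks here are random stand-ins, not the real tokenizer's.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_channels = 8, 16, 8  # toy sizes; the real tokenizer also has 8 channels

codebooks = rng.standard_normal((n_channels, codebook_size, dim)).astype(np.float32)
frame = rng.standard_normal(dim).astype(np.float32)  # one audio-frame embedding

codes, residual = [], frame.copy()
for cb in codebooks:
    # each stage quantises what the previous stages missed
    idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
    codes.append(idx)
    residual = residual - cb[idx]

# the LM consumes one 8-tuple of code indices per frame
recon = sum(codebooks[ch][i] for ch, i in enumerate(codes))
print(codes)
```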
## Architecture

- Audio path – 6-layer `input_local_transformer` (1024d, 64 heads, GS=4 group size, SiLU, sinusoidal RoPE on Q/K) + 8-channel RVQ codebook embeddings + linear group-downcast to 4096d
- LM – 36-layer Qwen2 (hidden=4096, 32 attn heads, 8 KV heads, intermediate=12288, RMSNorm, SwiGLU, RoPE θ=640K, max_pos=8192)
- LM head – untied, vocab=151680
- Total params – ~7.5B
- Languages – Mandarin (with Wu / Cantonese / Hokkien / Sichuanese dialect support), English, code-switching
- License – MIT (matches upstream)
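The 32-query-head / 8-KV-head split above is grouped-query attention with group size 4: each KV head serves four query heads, so the KV cache is 4× smaller than full multi-head attention. A minimal numpy sketch (toy sequence length, random tensors, no RoPE or causal mask):

```python
import numpy as np

n_heads, n_kv_heads, head_dim = 32, 8, 128  # from the model card
group = n_heads // n_kv_heads               # 4 query heads share each KV head
hidden = n_heads * head_dim                 # 4096

rng = np.random.default_rng(0)
seq = 5
q = rng.standard_normal((n_heads, seq, head_dim)).astype(np.float32)
k = rng.standard_normal((n_kv_heads, seq, head_dim)).astype(np.float32)
v = rng.standard_normal((n_kv_heads, seq, head_dim)).astype(np.float32)

# broadcast each KV head to its group of 4 query heads
k_rep = np.repeat(k, group, axis=0)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_rep

# only the 8 KV heads need caching, not 32
print(out.shape, hidden, group)
```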
## Usage with CrispASR

```bash
# Build (one-time)
git clone https://github.com/CrispStrobe/CrispASR.git
cd CrispASR
cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build-ninja-compile --target crispasr

# Download both halves
hf download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
hf download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/

# Transcribe
build-ninja-compile/bin/crispasr \
  --backend mimo-asr \
  -m models/mimo-asr-q4_k.gguf \
  --codec-model models/mimo-tokenizer-q4_k.gguf \
  -f samples/jfk.wav
```

If `--codec-model` is omitted, the runtime auto-discovers `mimo-tokenizer-q4_k.gguf` (or `mimo-tokenizer.gguf`, `mimo-audio-tokenizer.gguf`) next to the LM file.
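That fallback search can be sketched in Python, under the assumption that the runtime simply probes those filenames in order next to the LM file (the actual C++ logic may differ in details):

```python
from pathlib import Path
import tempfile

# fallback names from the text above, tried in order next to the LM file
CANDIDATES = ["mimo-tokenizer-q4_k.gguf", "mimo-tokenizer.gguf", "mimo-audio-tokenizer.gguf"]

def discover_codec(lm_path):
    """Return the first existing sibling tokenizer file, or None."""
    folder = Path(lm_path).parent
    for name in CANDIDATES:
        cand = folder / name
        if cand.is_file():
            return cand
    return None

# toy demonstration in a scratch directory
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "mimo-asr-q4_k.gguf").touch()
    (Path(d) / "mimo-tokenizer.gguf").touch()
    found = discover_codec(Path(d) / "mimo-asr-q4_k.gguf")
    print(found.name)  # mimo-tokenizer.gguf
```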
## Expected output (JFK sample)

> And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.

This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.
## Performance

On Apple M1, Metal backend, Q4_K, warm cache:

| Phase | Time |
|---|---|
| LM load (mmap, lazy) | ~1 s |
| Audio tokenize (11 s sample) | ~0.5 s |
| Prefill (T_groups=71) | ~3 s |
| Step decode (~25 tokens) | ~20 s with the fused-QKV file (≈0.8 s/token; was ~30 s pre-fusion) |
| End-to-end | ~25 s |
Per-step decode is the bottleneck; PLAN #60d (May 2026) fused the per-LM-layer Q/K/V projections into one matmul, replacing 3 mul_mat + 3 ggml_add per layer × 36 layers with 1 + 1, for a measured ~1.7× speedup at the same disk-pressure level. KV-cache reuse via cached step graphs (PLAN #51b') is also live. Future perf wins: F16 with fused QKV (queued behind disk headroom), and `CRISPASR_KV_QUANT=q8_0` for hour-long inputs (the PLAN #60e env flag is already plumbed; the default stays F16 until the per-backend rollout completes).
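Rough arithmetic for why a quantised KV cache matters on long inputs, assuming head_dim = 4096/32 = 128 (so KV dim = 8 × 128 = 1024) and ggml's q8_0 block layout (32 int8 values plus one f16 scale per block, i.e. 34 bytes per 32 elements):

```python
layers, kv_heads, head_dim, max_pos = 36, 8, 128, 8192  # from the model card
kv_dim = kv_heads * head_dim                            # 1024 per K (or V) per layer

elems = layers * 2 * max_pos * kv_dim   # K and V, every layer, every position
f16_bytes = elems * 2                   # F16: 2 bytes per element
q8_bytes = elems * (34 / 32)            # q8_0: 34 bytes per 32-element block

print(round(f16_bytes / 2**30, 2), "GiB F16 KV cache at max_pos")   # 1.12
print(round(q8_bytes / 2**30, 2), "GiB q8_0 KV cache at max_pos")   # 0.6
```

So at the full 8192-position context the F16 KV cache costs on the order of a gigabyte, and q8_0 roughly halves that, which is the headroom hour-long inputs need.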
## Validation

Stage-by-stage cosine similarity against the bf16 PyTorch reference on JFK (Q4_K weights, bf16 ref):

| Stage | cos_mean | cos_min |
|---|---|---|
| prefill_audio_features | 0.998 | 0.992 |
| prefill_text_embeds | 0.996 | 0.901 |
| prefill_inputs_embeds | 0.998 | 0.901 |
| prefill_last_hidden | 0.963 | 0.963 |
| prefill_text_logits_step0 | 0.981 | 0.981 |
Argmax of the step-0 logits is token 1597 (' And'), matching the Python reference. The strict cos ≥ 0.999 gate is tracked under F16 + fp32 ref but requires >28 GB RAM; in practice the Q4_K + bf16-ref ceiling reflects quantisation noise, not bugs.
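The validation harness itself is not shown here; one plausible way to compute per-stage cos_mean / cos_min numbers like those above is a per-position cosine similarity over the hidden axis (a sketch, not the actual tooling):

```python
import numpy as np

def cos_stats(a, b):
    """Per-position cosine similarity along the last (hidden) axis -> (mean, min)."""
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = num / np.maximum(den, 1e-12)
    return float(cos.mean()), float(cos.min())

rng = np.random.default_rng(0)
ref = rng.standard_normal((71, 4096)).astype(np.float32)  # e.g. prefill hidden states
# simulate small quantisation noise on the test activations
test = ref + 0.01 * rng.standard_normal(ref.shape).astype(np.float32)

cos_mean, cos_min = cos_stats(ref, test)
print(cos_mean, cos_min)
```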
## Conversion (reproducibility)

```bash
# Set OMP_NUM_THREADS=1 to avoid a torch+OpenMP deadlock during bf16 -> f16
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1 \
python models/convert-mimo-asr-to-gguf.py \
  --input XiaomiMiMo/MiMo-V2.5-ASR \
  --output mimo-asr-f16.gguf \
  --outtype f16

build-ninja-compile/bin/crispasr-quantize \
  mimo-asr-f16.gguf mimo-asr-q4_k.gguf q4_k
```

Vocab is padded to 151680 entries (151643 BPE + 30 special + 7 unused slots) and `tokenizer.ggml.merges` is populated (151291 entries). Earlier filenames (`mimo-asr.gguf`, `mimo-asr-q2_k.gguf`) shipped before commit 2191a70 with a truncated vocab and missing merges, and were removed from the repo on 2026-05-01.
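The padding arithmetic checks out directly; the placeholder token names below are illustrative, not the actual padded entries:

```python
bpe, special, pad = 151643, 30, 7          # counts from the text above
vocab = [f"tok{i}" for i in range(bpe + special)]
# pad to the LM head's vocab size with unused placeholder slots
vocab += [f"<unused_{i}>" for i in range(pad)]  # hypothetical names
print(len(vocab))  # 151680
```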
## Citation

```bibtex
@misc{mimo2025v25asr,
  title     = {MiMo-V2.5-ASR},
  author    = {Xiaomi MiMo Team},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR}
}
```
## License

MIT – same as upstream.