---
license: mit
language:
- zh
- en
- yue
pipeline_tag: automatic-speech-recognition
tags:
- safetensors
- fp8
- quantization
- speech-recognition
base_model: XiaomiMiMo/MiMo-V2.5-ASR
---

# MiMo-V2.5-ASR FP8 (e4m3fn)

FP8-quantized build of [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR),
Xiaomi MiMo's end-to-end ASR model. The base model handles Mandarin/English code-switching,
Chinese dialects, and song lyrics, is robust to noisy and multi-speaker audio, and emits punctuation natively.

## What this is

- **Weights:** `float8_e4m3fn`, per-output-channel absmax scaling (one fp32 scale per row), baked in at save time.
- **Activations:** dynamic per-tensor fp8 quantization on each forward pass.
- **Matmul:** `torch._scaled_mm` (FP8 tensor cores on Ada / Hopper / Blackwell).
- **Skipped (kept in bf16):** embeddings, RMSNorm/LayerNorm, biases. 417 Linear layers are converted.
- **Audio tokenizer:** the audio encoder/tokenizer ([MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer)) is **not** quantized; download it separately for inference.

This roughly halves the LLM weight footprint (~32 GB in bf16 → ~16-17 GB on disk).

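The footprint figure above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the parameter count is back-derived from the ~32 GB bf16 figure, while the share of parameters left in bf16 and the number of scale rows are assumed values, not measurements of this checkpoint.

```python
def fp8_checkpoint_bytes(total_params, bf16_fraction, scale_rows):
    """Estimate checkpoint size when most weights are stored as fp8.

    total_params  : number of weight elements in the model
    bf16_fraction : share of elements left in bf16 (embeddings, norms, biases)
    scale_rows    : total output rows across quantized layers (one fp32 scale each)
    """
    fp8_bytes = total_params * (1.0 - bf16_fraction) * 1  # 1 byte per fp8 weight
    bf16_bytes = total_params * bf16_fraction * 2         # 2 bytes per bf16 weight
    scale_bytes = scale_rows * 4                          # 4 bytes per fp32 scale
    return fp8_bytes + bf16_bytes + scale_bytes


# Assumption: ~32 GB of bf16 weights corresponds to roughly 16e9 parameters.
total_params = 32e9 / 2
# Assumptions for illustration: 5% of parameters stay in bf16, 3 million scale rows.
estimate = fp8_checkpoint_bytes(total_params, bf16_fraction=0.05, scale_rows=3_000_000)
print(f"bf16: {total_params * 2 / 1e9:.1f} GB, fp8 estimate: {estimate / 1e9:.1f} GB")
```

With these assumed inputs the estimate falls inside the ~16-17 GB range; the per-row fp32 scales contribute only a few megabytes, so the bf16 remainder is what pushes the total above an exact halving.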
## Important: this is NOT a drop-in `from_pretrained` checkpoint

`model.safetensors` stores custom `FP8Linear` buffers (`*.weight_fp8`, `*.weight_scale`),
not standard HF Linear weights, so `from_pretrained` cannot load it directly. It must be loaded
through the matching `FP8Linear` modules. Use the loader described in the Usage section.

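One way to confirm this layout before loading anything is to read the safetensors header, which is a length-prefixed JSON blob at the start of the file. The sketch below uses only the standard library. The helper names `list_tensors` and `summarize` are hypothetical, and the expectation that `*.weight_fp8` entries report dtype `F8_E4M3` is an assumption based on safetensors dtype naming, not something this repo states.

```python
import json
import struct


def list_tensors(path):
    """Return {tensor_name: (dtype, shape)} from a .safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }


def summarize(tensors):
    """Count entries by the FP8Linear buffer suffixes named in this README."""
    fp8 = sum(name.endswith(".weight_fp8") for name in tensors)
    scales = sum(name.endswith(".weight_scale") for name in tensors)
    return {"weight_fp8": fp8, "weight_scale": scales, "other": len(tensors) - fp8 - scales}
```

For example, `summarize(list_tensors("./MiMo-V2.5-ASR-FP8/model.safetensors"))` would be expected to report equal `weight_fp8` and `weight_scale` counts if every converted layer stores both buffers. Reading only the header avoids pulling the full weights into memory.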
## Usage

```bash
git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1  # required by the audio tokenizer
hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download Infatoshi/MiMo-V2.5-ASR-FP8 --local-dir ./MiMo-V2.5-ASR-FP8
```

Then load the model with the `FP8Linear` loader in `quantize_fp8.py`, which is included in this repository:

```python
from quantize_fp8 import load_fp8_model

mimo = load_fp8_model(
    fp8_dir="./MiMo-V2.5-ASR-FP8",
    tokenizer_path="./models/MiMo-Audio-Tokenizer",
    repo_root=".",  # the cloned MiMo-V2.5-ASR repo
)
print(mimo.asr_sft("audio.wav", audio_tag="<english>"))
```

## Quantization fidelity

Dequantization error of the per-output-channel absmax scheme versus the original fp32 weights, sampled across
depth (layers 0/17/35) and covering all attn+mlp projections, lm_head, and the audio local transformer:

- relative Frobenius error: **~0.026, uniform** across every sampled layer (max 0.027 on lm_head)
- no corrupted or outlier layers

This is the expected magnitude for fp8 e4m3 (3 mantissa bits) with per-channel scaling.

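The metric above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not the script used to produce the numbers in this README: it emulates round-to-nearest `float8_e4m3fn` (3 mantissa bits, max finite value 448), applies per-row absmax scaling, and measures the relative Frobenius error on synthetic Gaussian weights. The Gaussian data, the matrix size, and all function names are assumptions for illustration, and the scale is kept as a Python float rather than fp32.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3(x):
    """Round a Python float to the nearest float8_e4m3fn value (saturating)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits: 8 steps per binade
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def quantize_row(row):
    """Per-row absmax quantization: returns (fp8-valued list, scale)."""
    absmax = max(abs(w) for w in row)
    scale = absmax / E4M3_MAX if absmax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in row], scale


def relative_frobenius_error(weight):
    """||W - dequant(quant(W))||_F / ||W||_F for a list-of-rows matrix."""
    err_sq = ref_sq = 0.0
    for row in weight:
        q, scale = quantize_row(row)
        err_sq += sum((w - qi * scale) ** 2 for w, qi in zip(row, q))
        ref_sq += sum(w * w for w in row)
    return math.sqrt(err_sq / ref_sq)


# Assumption: Gaussian weights stand in for a real Linear layer.
rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.02) for _ in range(512)] for _ in range(64)]
print(f"relative Frobenius error: {relative_frobenius_error(weight):.4f}")
```

With 3 mantissa bits the rounding error is at most 1/16 of a value's power-of-two magnitude, so the aggregate error sits at a few percent regardless of layer, which is consistent with the uniform figure reported above. The exact value depends on the weight distribution.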
## Requirements

- CUDA GPU with FP8 tensor cores (Ada / Hopper / Blackwell), CUDA >= 12.0
- torch >= 2.6, safetensors
- **Blackwell (sm_120, e.g. RTX PRO 6000 / RTX 50xx):** use a torch build with CUDA 12.8+
  (torch >= 2.7, `cu128`). torch 2.6 `cu124` ships no sm_120 kernels and fails with
  "no kernel image is available for execution on the device".

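These requirements can be checked up front instead of waiting for a kernel launch to fail. The sketch below is one possible preflight helper, not part of this repo: `fp8_support` and `check_local_gpu` are hypothetical names, and the sm_89 threshold for FP8 tensor cores is an assumption drawn from the Ada / Hopper / Blackwell list above.

```python
def fp8_support(capability):
    """Classify a CUDA compute capability tuple such as (8, 9) or (12, 0)."""
    major, minor = capability
    return {
        # Assumption: FP8 tensor cores start at sm_89 (Ada); Hopper is sm_90.
        "fp8_tensor_cores": (major, minor) >= (8, 9),
        # The case this README flags as needing a CUDA 12.8+ (cu128) torch build.
        "sm_120_or_newer": (major, minor) >= (12, 0),
    }


def check_local_gpu():
    """Query the first CUDA device; torch is imported lazily so fp8_support stays dependency-free."""
    import torch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible to torch")
    info = fp8_support(torch.cuda.get_device_capability(0))
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version the torch wheel was built with
    return info
```

Calling `check_local_gpu()` before loading the model shows whether the device has FP8 tensor cores and, on sm_120 hardware, whether the installed torch wheel was built against CUDA 12.8 or later.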
## Notes / caveats

- FP8 e4m3fn quantization of weights and activations is lossy; expect small WER deltas vs bf16.
- Per-tensor dynamic activation scaling is simple and fast but less accurate than
  per-token scaling on activations with large outliers.

Derivative of an MIT-licensed model; original credit to the Xiaomi MiMo team.