---
license: mit
language:
- zh
- en
- yue
pipeline_tag: automatic-speech-recognition
tags:
- safetensors
- fp8
- quantization
- speech-recognition
base_model: XiaomiMiMo/MiMo-V2.5-ASR
---

# MiMo-V2.5-ASR — FP8 (e4m3fn)

FP8-quantized build of [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR),
the Xiaomi MiMo end-to-end ASR model with native Mandarin/English code-switching,
Chinese dialects, song lyrics, noisy/multi-speaker robustness, and native punctuation.

## What this is

- **Weights:** `float8_e4m3fn`, per-output-channel absmax scaling (one fp32 scale per row), baked at save time.
- **Activations:** dynamic per-tensor fp8 quantization each forward pass.
- **Matmul:** `torch._scaled_mm` (FP8 tensor cores on Ada / Hopper / Blackwell).
- **Skipped (kept bf16):** embeddings, RMSNorm/LayerNorm, biases. 417 Linear layers converted.
- The audio encoder/tokenizer ([MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer)) is **not** quantized; download it separately for inference.

This roughly halves the LLM weight footprint (~32 GB bf16 → ~16-17 GB on disk).

## Important: this is NOT a drop-in `from_pretrained` checkpoint

`model.safetensors` stores custom `FP8Linear` buffers (`*.weight_fp8`, `*.weight_scale`),
not standard HF Linear weights. It must be loaded through the matching `FP8Linear`
modules. Use the loader below.
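
To confirm what a downloaded checkpoint actually contains, the tensor names can be listed without loading any weights. The sketch below is a hypothetical helper, not part of this repository; it assumes only the two suffixes named above and uses `safetensors.safe_open` to read the key names.

```python
from collections import Counter


def summarize_fp8_keys(keys):
    """Count tensor names by the suffixes this card describes."""
    counts = Counter()
    for key in keys:
        if key.endswith(".weight_fp8"):
            counts["weight_fp8"] += 1
        elif key.endswith(".weight_scale"):
            counts["weight_scale"] += 1
        else:
            counts["other"] += 1
    return dict(counts)


def summarize_checkpoint(path):
    """Read only the tensor names from a safetensors file (no weights are loaded)."""
    from safetensors import safe_open  # lazy import keeps the helper above dependency-free

    with safe_open(path, framework="pt") as f:
        return summarize_fp8_keys(f.keys())


# Hypothetical key names, for illustration only.
example = [
    "model.layers.0.self_attn.q_proj.weight_fp8",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.embed_tokens.weight",
]
print(summarize_fp8_keys(example))
```

Calling `summarize_checkpoint("./MiMo-V2.5-ASR-FP8/model.safetensors")` on the real file should show matching `weight_fp8` and `weight_scale` counts if every converted layer stores exactly one of each, which is an assumption about the file layout rather than something this card states.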

## Usage

```bash
git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1   # required by the audio tokenizer
hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download Infatoshi/MiMo-V2.5-ASR-FP8 --local-dir ./MiMo-V2.5-ASR-FP8
```

Then load the checkpoint with the `FP8Linear` loader in `quantize_fp8.py`, which is included in this repository:

```python
from quantize_fp8 import load_fp8_model
mimo = load_fp8_model(
    fp8_dir="./MiMo-V2.5-ASR-FP8",
    tokenizer_path="./models/MiMo-Audio-Tokenizer",
    repo_root=".",                     # the cloned MiMo-V2.5-ASR repo
)
print(mimo.asr_sft("audio.wav", audio_tag="<english>"))
```

## Quantization fidelity

Per-output-channel absmax dequant error vs the original fp32 weights, sampled across
depth (layers 0/17/35), all attn+mlp projections, lm_head, and the audio local transformer:

- relative Frobenius error: **~0.026, uniform** across every sampled layer (max 0.027 on lm_head)
- no corrupted or outlier layers

This is the expected magnitude for fp8 e4m3 with per-channel scaling (3 mantissa bits).
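
To see where a figure of this size comes from: with 3 mantissa bits each power-of-two interval is split into 8 steps, so the worst-case relative rounding error is 1/16 (6.25%) and the root-mean-square error is a few percent. The sketch below simulates e4m3 rounding from scratch in plain Python and measures the relative Frobenius error on synthetic Gaussian weights. The rounding function is a simplified stand-in for the real cast, and the Gaussian weights are an assumption, not the model's actual distribution.

```python
import math
import random

E4M3_MAX = 448.0      # largest finite e4m3fn value
MIN_NORMAL_EXP = -6   # smallest normal exponent; below it the grid spacing stays fixed


def round_e4m3(x: float) -> float:
    """Round to the nearest point on the e4m3 grid, saturating at +-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    exp = max(math.frexp(a)[1] - 1, MIN_NORMAL_EXP)  # floor(log2(a)), clamped for subnormals
    step = 2.0 ** (exp - 3)                          # 8 steps per power of two
    return math.copysign(round(a / step) * step, x)  # round() breaks ties to even


def relative_frobenius_error(rows):
    """Quantize each row with its own absmax scale, dequantize, and compare."""
    err_sq = ref_sq = 0.0
    for row in rows:
        scale = max(abs(v) for v in row) / E4M3_MAX
        for v in row:
            deq = round_e4m3(v / scale) * scale
            err_sq += (deq - v) ** 2
            ref_sq += v * v
    return math.sqrt(err_sq / ref_sq)


rng = random.Random(0)
rows = [[rng.gauss(0.0, 0.02) for _ in range(256)] for _ in range(64)]
err = relative_frobenius_error(rows)
print(f"relative Frobenius error: {err:.4f}")
```

On synthetic data like this the result typically lands between roughly 0.02 and 0.03, the same range as the measured values above.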

## Requirements

- CUDA GPU with FP8 tensor cores (Ada / Hopper / Blackwell), CUDA >= 12.0
- torch >= 2.6, safetensors
- **Blackwell (sm_120, e.g. RTX PRO 6000 / RTX 50xx):** use a torch build with CUDA 12.8+
  (torch >= 2.7, `cu128`). torch 2.6 `cu124` ships no sm_120 kernels and will fail with
  "no kernel image is available for execution on the device".
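
A quick pre-flight check can catch an unsupported GPU before the model is loaded. The helper below is a hypothetical sketch: it relies on the fact that Ada is compute capability 8.9, Hopper 9.0, and Blackwell 10.0 / 12.0, and it only reports the torch and CUDA versions rather than verifying that the build ships kernels for the device.

```python
def has_fp8_tensor_cores(capability):
    """True for compute capability 8.9 (Ada) and newer, e.g. 9.0 (Hopper), 12.0 (Blackwell)."""
    return tuple(capability) >= (8, 9)


def check_environment():
    """Describe whether the visible GPU looks usable for FP8 matmuls."""
    import torch  # lazy import keeps has_fp8_tensor_cores dependency-free

    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability()
    arch = f"sm_{major}{minor}"
    if not has_fp8_tensor_cores((major, minor)):
        return f"{arch}: no FP8 tensor cores"
    return f"{arch}: FP8 capable (torch {torch.__version__}, CUDA {torch.version.cuda})"


print(has_fp8_tensor_cores((8, 6)))   # Ampere consumer GPU
print(has_fp8_tensor_cores((12, 0)))  # Blackwell sm_120
```

On an sm_120 device, a reported CUDA version below 12.8 from `check_environment()` points to the kernel-image failure described above.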

## Notes / caveats

- FP8 e4m3fn quantization is lossy, and here it applies to both weights (static) and activations (dynamic), not weights only; expect small WER deltas vs bf16.
- Per-tensor dynamic activation scaling is simple and fast but less accurate than
  per-token scaling on activations with large outliers.

Derivative of an MIT-licensed model; original credit to the Xiaomi MiMo team.