---
license: mit
language:
- zh
- en
- yue
pipeline_tag: automatic-speech-recognition
tags:
- safetensors
- fp8
- quantization
- speech-recognition
base_model: XiaomiMiMo/MiMo-V2.5-ASR
---
# MiMo-V2.5-ASR — FP8 (e4m3fn)
FP8-quantized build of [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR),
the Xiaomi MiMo end-to-end ASR model. The base model handles Mandarin/English code-switching,
Chinese dialects, and song lyrics, is robust to noisy and multi-speaker audio, and produces punctuation natively.
## What this is
- **Weights:** `float8_e4m3fn`, per-output-channel absmax scaling (one fp32 scale per row), baked at save time.
- **Activations:** dynamic per-tensor fp8 quantization each forward pass.
- **Matmul:** `torch._scaled_mm` (FP8 tensor cores on Ada / Hopper / Blackwell).
- **Skipped (kept bf16):** embeddings, RMSNorm/LayerNorm, biases. 417 Linear layers converted.
- The audio encoder/tokenizer ([MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer)) is **not** quantized; download it separately for inference.
This roughly halves the LLM weight footprint (~32 GB bf16 → ~16-17 GB on disk).
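For intuition, the weight path in the bullets above can be simulated in plain Python, without a GPU. This is only a sketch, not the code in `quantize_fp8.py`: `round_e4m3`, `quantize_rows` and `dequantize_rows` are hypothetical helper names, the rounding is a software emulation of `float8_e4m3fn` (3 mantissa bits, largest finite value 448), and saturating at ±448 is an assumption of this sketch.

```python
import math

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_e4m3(x: float) -> float:
    """Emulate rounding to float8_e4m3fn (hypothetical helper, saturating)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so the leading bit is 2**(e - 1).
    exponent = max(math.frexp(a)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)              # 3 mantissa bits: 8 steps per binade
    q = round(a / step) * step                # round() rounds halves to even
    return math.copysign(q, x)


def quantize_rows(weight):
    """Per-output-channel absmax: one scale per row, values stored as e4m3."""
    rows_q, scales = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        scale = absmax / E4M3_MAX if absmax > 0 else 1.0
        scales.append(scale)
        rows_q.append([round_e4m3(v / scale) for v in row])
    return rows_q, scales


def dequantize_rows(rows_q, scales):
    """Multiply each stored row by its scale to approximate the original."""
    return [[q * s for q in row] for row, s in zip(rows_q, scales)]


# Toy 2x3 weight matrix (made-up numbers): each row is one output channel.
weight = [[0.50, -0.031, 0.007], [4.0, 1.3, -0.25]]
rows_q, scales = quantize_rows(weight)
restored = dequantize_rows(rows_q, scales)
for original, approx in zip(weight, restored):
    print(original, "->", [round(v, 5) for v in approx])
```

Because each row is scaled by its own absmax, the largest entry of every row lands on 448 and the smaller entries keep a few percent of relative precision regardless of how the rows differ in magnitude.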
## Important: this is NOT a drop-in `from_pretrained` checkpoint
`model.safetensors` stores custom `FP8Linear` buffers (`*.weight_fp8`, `*.weight_scale`),
not standard HF Linear weights. It must be loaded through the matching `FP8Linear`
modules. Use the loader below.
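One way to see the mismatch is to group tensor names by suffix. The sketch below assumes names end in the `weight_fp8` / `weight_scale` suffixes mentioned above; the layer prefixes in the example are made up for illustration, and `group_fp8_keys` is a hypothetical helper, not part of the loader.

```python
from collections import defaultdict


def group_fp8_keys(keys):
    """Split tensor names into FP8 layers (weight + scale) and everything else."""
    layers = defaultdict(set)
    other = []
    for key in keys:
        prefix, _, suffix = key.rpartition(".")
        if suffix in ("weight_fp8", "weight_scale"):
            layers[prefix].add(suffix)
        else:
            other.append(key)
    # A usable FP8 layer needs both the quantized weight and its scale.
    complete = sorted(p for p, s in layers.items() if len(s) == 2)
    incomplete = sorted(p for p, s in layers.items() if len(s) < 2)
    return complete, incomplete, other


# Illustrative names only; the real checkpoint's prefixes may differ.
example_keys = [
    "layers.0.q_proj.weight_fp8",
    "layers.0.q_proj.weight_scale",
    "layers.0.q_proj.bias",
    "layers.0.norm.weight",
    "layers.1.q_proj.weight_fp8",
]
complete, incomplete, other = group_fp8_keys(example_keys)
print("fp8 layers:", complete)
print("missing a buffer:", incomplete)
print("other tensors:", other)
```

On the real file, the names can be listed with `safetensors.safe_open(path, framework="pt")` and its `keys()` method. A stock `nn.Linear` looks for a `<prefix>.weight` entry, which the converted layers do not provide in that form.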
## Usage
```bash
git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 # required by the audio tokenizer
hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download Infatoshi/MiMo-V2.5-ASR-FP8 --local-dir ./MiMo-V2.5-ASR-FP8
```
Then load with the `FP8Linear` loader, `quantize_fp8.py`, which is included in this repository:
```python
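import sys

# quantize_fp8.py ships with the FP8 download, so make that folder importable
# (assumes the --local-dir used above).
sys.path.insert(0, "./MiMo-V2.5-ASR-FP8")
from quantize_fp8 import load_fp8_model

mimo = load_fp8_model(
    fp8_dir="./MiMo-V2.5-ASR-FP8",
    tokenizer_path="./models/MiMo-Audio-Tokenizer",
    repo_root=".",  # the cloned MiMo-V2.5-ASR repo
)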
print(mimo.asr_sft("audio.wav", audio_tag="<english>"))
```
## Quantization fidelity
Per-output-channel absmax dequant error vs the original fp32 weights, sampled across
depth (layers 0/17/35), all attn+mlp projections, lm_head, and the audio local transformer:
- relative Frobenius error: **~0.026, uniform** across every sampled layer (max 0.027 on lm_head)
- no corrupted or outlier layers
This is the expected magnitude for fp8 e4m3 with per-channel scaling (3 mantissa bits).
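The metric quoted above is the Frobenius norm of the weight difference divided by the Frobenius norm of the original weight. A minimal sketch of that ratio on nested lists follows; `relative_frobenius_error` is a hypothetical helper name, and the matrices are toy values, not measurements from this checkpoint.

```python
import math


def relative_frobenius_error(original, approx):
    """Return ||original - approx||_F / ||original||_F for equal-shape matrices."""
    diff_sq = sum(
        (o - a) ** 2
        for row_o, row_a in zip(original, approx)
        for o, a in zip(row_o, row_a)
    )
    norm_sq = sum(o * o for row in original for o in row)
    return math.sqrt(diff_sq / norm_sq)


original = [[3.0, 4.0], [0.0, 12.0]]
approx = [[3.0, 4.0], [0.0, 12.5]]  # toy perturbation, not real model data
print(relative_frobenius_error(original, approx))
```

With 3 mantissa bits, neighboring normal e4m3 values are at most 12.5% apart, so a single rounded value is off by at most 6.25% in relative terms; averaged over a whole layer, an error of a few percent is consistent with that bound.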
## Requirements
- CUDA GPU with FP8 tensor cores (Ada / Hopper / Blackwell), CUDA >= 12.0
- torch >= 2.6, safetensors
- **Blackwell (sm_120, e.g. RTX PRO 6000 / RTX 50xx):** use a torch build with CUDA 12.8+
(torch >= 2.7, `cu128`). torch 2.6 `cu124` ships no sm_120 kernels and will fail with
"no kernel image is available for execution on the device".
## Notes / caveats
- FP8 e4m3fn quantization of both weights and activations is lossy; expect small WER deltas vs bf16.
- Per-tensor dynamic activation scaling is simple and fast but less accurate than
per-token scaling on activations with large outliers.
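The outlier caveat can be illustrated with a toy simulation. The sketch below emulates e4m3 rounding in plain Python (`round_e4m3` and `fake_quant` are hypothetical helpers, saturation at ±448 is an assumption, and the activation values are made up) and compares one shared scale against a scale computed for the quiet token alone.

```python
import math

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_e4m3(x):
    """Emulate rounding to float8_e4m3fn (hypothetical helper, saturating)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    exponent = max(math.frexp(a)[1] - 1, -6)  # clamp to the subnormal range
    step = 2.0 ** (exponent - 3)
    return math.copysign(round(a / step) * step, x)


def fake_quant(values, scale):
    """Quantize to emulated e4m3 with the given scale, then dequantize."""
    return [round_e4m3(v / scale) * scale for v in values]


def max_rel_error(values, approx):
    return max(abs(a - v) / abs(v) for v, a in zip(values, approx))


# Two toy "tokens": ordinary magnitudes vs. one large outlier.
quiet = [0.011, -0.007, 0.004]
loud = [900.0, -2.0, 0.5]

# Per-tensor: one scale shared by both tokens, set by the outlier.
tensor_scale = max(abs(v) for v in quiet + loud) / E4M3_MAX
# Per-token: the quiet token gets its own scale.
quiet_scale = max(abs(v) for v in quiet) / E4M3_MAX

err_tensor = max_rel_error(quiet, fake_quant(quiet, tensor_scale))
err_token = max_rel_error(quiet, fake_quant(quiet, quiet_scale))
print(f"quiet token, per-tensor scale: {err_tensor:.3f}")
print(f"quiet token, per-token scale:  {err_token:.3f}")
```

With the shared scale, the quiet token is pushed into the subnormal range of e4m3, where the spacing between values is fixed, so its relative error grows; with its own scale it stays in the normal range.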
Derivative of an MIT-licensed model; original credit to the Xiaomi MiMo team.