	---
	license: mit
	language:
	- zh
	- en
	- yue
	pipeline_tag: automatic-speech-recognition
	tags:
	- safetensors
	- fp8
	- quantization
	- speech-recognition
	base_model: XiaomiMiMo/MiMo-V2.5-ASR
	---

	# MiMo-V2.5-ASR — FP8 (e4m3fn)

	FP8-quantized build of [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR),
	the Xiaomi MiMo end-to-end ASR model. The base model supports native Mandarin/English
	code-switching, Chinese dialects, and song lyrics, is robust to noisy and multi-speaker
	audio, and emits punctuation natively.

	## What this is

	- Weights: `float8_e4m3fn`, per-output-channel absmax scaling (one fp32 scale per row), baked at save time.
	- Activations: dynamic per-tensor fp8 quantization each forward pass.
	- Matmul: `torch._scaled_mm` (FP8 tensor cores on Ada / Hopper / Blackwell).
	- Skipped (kept bf16): embeddings, RMSNorm/LayerNorm, biases. 417 Linear layers converted.
	- The audio encoder/tokenizer ([MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer)) is not quantized; download it separately for inference.

	This roughly halves the LLM weight footprint (~32 GB bf16 → ~16-17 GB on disk).
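
The scheme in the list above can be written compactly. This is a sketch and is not taken from `quantize_fp8.py`: it assumes each scale maps the relevant absmax onto 448 (the largest finite `float8_e4m3fn` value), and the symbols $s_i$, $s_x$, and $Q(\cdot)$ (round-to-nearest into e4m3fn) are introduced here for illustration only.

$$
s_i = \frac{\max_k |W_{ik}|}{448}, \qquad \hat{W}_{ik} = Q\!\left(\frac{W_{ik}}{s_i}\right)
$$

$$
s_x = \frac{\max_{t,k} |x_{tk}|}{448}, \qquad \hat{x}_{tk} = Q\!\left(\frac{x_{tk}}{s_x}\right)
$$

$$
y_{ti} \approx s_x \, s_i \sum_k \hat{x}_{tk} \, \hat{W}_{ik}
$$

Here $i$ indexes output channels (rows of $W$), $k$ the input dimension, and $t$ the tokens in the batch. Because $s_i$ is constant along row $i$ and $s_x$ is a single scalar, both factor out of the sum, so the inner product runs on fp8 operands and the scales are applied once per output element.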

	## Important: this is NOT a drop-in `from_pretrained` checkpoint

	`model.safetensors` stores custom `FP8Linear` buffers (`.weight_fp8`, `.weight_scale`),
	not standard HF Linear weights. It must be loaded through the matching `FP8Linear`
	modules. Use the loader below.
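
To check what a downloaded checkpoint actually contains before loading it, the safetensors header can be read with the standard library alone: the file starts with an 8-byte little-endian length followed by a JSON header. The sketch below counts tensor names by suffix. The helper names `read_safetensors_header` and `summarize_fp8_buffers` are hypothetical, and the suffixes are taken from the buffer names given above.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def summarize_fp8_buffers(header):
    """Count tensors by name suffix (hypothetical helper for illustration)."""
    suffixes = (".weight_fp8", ".weight_scale", ".weight")
    return {s: sum(1 for name in header if name.endswith(s)) for s in suffixes}


# Example (path assumes the download location used in the Usage section):
# header = read_safetensors_header("./MiMo-V2.5-ASR-FP8/model.safetensors")
# print(summarize_fp8_buffers(header))
```

If the checkpoint matches the description in this card, the `.weight_fp8` and `.weight_scale` counts should be equal, and any remaining plain `.weight` entries would be the layers kept in bf16.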

	## Usage

	```bash
	git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
	cd MiMo-V2.5-ASR
	pip install -r requirements.txt
	pip install flash-attn==2.7.4.post1 # required by the audio tokenizer
	hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
	hf download Infatoshi/MiMo-V2.5-ASR-FP8 --local-dir ./MiMo-V2.5-ASR-FP8
	```

	Then load with the `FP8Linear` loader in `quantize_fp8.py`, which is included in this repository
	(make sure the directory containing it is on your Python path):

	```python
	from quantize_fp8 import load_fp8_model
	mimo = load_fp8_model(
	    fp8_dir="./MiMo-V2.5-ASR-FP8",
	    tokenizer_path="./models/MiMo-Audio-Tokenizer",
	    repo_root=".",  # the cloned MiMo-V2.5-ASR repo
	)
	print(mimo.asr_sft("audio.wav", audio_tag="<english>"))
	```

	## Quantization fidelity

	Per-output-channel absmax dequant error vs the original fp32 weights, sampled across
	depth (layers 0/17/35), all attn+mlp projections, lm_head, and the audio local transformer:

	- relative Frobenius error: ~0.026, uniform across every sampled layer (max 0.027 on lm_head)
	- no corrupted or outlier layers

	This is the expected magnitude for fp8 e4m3 with per-channel scaling (3 mantissa bits).
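
For intuition about where a figure of this size comes from, the metric can be reproduced on synthetic data. The sketch below is not the script used for the numbers above: it simulates e4m3fn rounding in pure Python, assumes saturation at 448 and an absmax-to-448 scale per row, and uses random Gaussian rows as a stand-in for real weights. All function names are hypothetical.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_e4m3(x):
    """Round x to the nearest e4m3fn-representable value, saturating at +-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)        # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # binade exponent; -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)     # 3 mantissa bits give 8 steps per binade
    q = round(a / step) * step  # Python's round() rounds half to even
    return math.copysign(min(q, E4M3_MAX), x)


def quantize_row(row):
    """Per-row absmax scaling: returns (quantized values, scale)."""
    absmax = max(abs(v) for v in row)
    scale = absmax / E4M3_MAX if absmax > 0 else 1.0
    return [round_e4m3(v / scale) for v in row], scale


def relative_frobenius_error(weights):
    """||W - dequant(quant(W))||_F / ||W||_F over a list of rows."""
    err_sq = ref_sq = 0.0
    for row in weights:
        q, scale = quantize_row(row)
        err_sq += sum((w - qv * scale) ** 2 for w, qv in zip(row, q))
        ref_sq += sum(w * w for w in row)
    return math.sqrt(err_sq / ref_sq)


rng = random.Random(0)
synthetic = [[rng.gauss(0.0, 0.02) for _ in range(1024)] for _ in range(8)]
print(f"relative Frobenius error: {relative_frobenius_error(synthetic):.4f}")
```

On Gaussian inputs the result should land in the low-percent range, the same order as the value quoted above; the exact figure depends on the weight distribution.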

	## Requirements

	- CUDA GPU with FP8 tensor cores (Ada / Hopper / Blackwell), CUDA >= 12.0
	- torch >= 2.6, safetensors
	- Blackwell (sm_120, e.g. RTX PRO 6000 / RTX 50xx): use a torch build with CUDA 12.8+
	(torch >= 2.7, `cu128`). torch 2.6 `cu124` ships no sm_120 kernels and will fail with
	"no kernel image is available for execution on the device".

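The requirements above can be turned into a quick preflight check. This is a minimal sketch with a hypothetical helper name, `check_fp8_env`. It assumes compute capability 8.9 (Ada) as the lower bound for FP8 tensor cores and applies the CUDA 12.8 rule only to sm_120 and above, as stated in the list. The `torch` part is optional and only runs if PyTorch with CUDA is installed.

```python
def check_fp8_env(capability, cuda_version):
    """Return a list of problems for a (major, minor) capability and a CUDA version string."""
    problems = []
    cuda = tuple(int(part) for part in cuda_version.split(".")[:2])
    if capability < (8, 9):
        problems.append("GPU has no FP8 tensor cores (need Ada / Hopper / Blackwell)")
    if cuda < (12, 0):
        problems.append("CUDA >= 12.0 is required")
    if capability >= (12, 0) and cuda < (12, 8):
        problems.append("sm_120 needs a torch build with CUDA 12.8+ (cu128)")
    return problems


try:
    import torch
except ImportError:  # the check itself does not depend on torch
    torch = None

if torch is not None and torch.cuda.is_available() and torch.version.cuda:
    issues = check_fp8_env(torch.cuda.get_device_capability(), torch.version.cuda)
    print("OK" if not issues else "\n".join(issues))
```

An empty list means none of the listed constraints are violated; it does not guarantee that every kernel the loader needs is present in a given torch build.
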
	## Notes / caveats

	- FP8 e4m3fn quantization (static weight scales, dynamic activation scales) is lossy; expect small WER deltas vs bf16.
	- Per-tensor dynamic activation scaling is simple and fast but less accurate than
	per-token scaling on activations with large outliers.

	Derivative of an MIT-licensed model; original credit to the Xiaomi MiMo team.