---
license: mit
language:
- zh
- en
- yue
pipeline_tag: automatic-speech-recognition
tags:
- safetensors
- fp8
- quantization
- speech-recognition
base_model: XiaomiMiMo/MiMo-V2.5-ASR
---

# MiMo-V2.5-ASR: FP8 (e4m3fn)

FP8-quantized build of [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR), the Xiaomi MiMo end-to-end ASR model with native Mandarin/English code-switching, Chinese dialects, song lyrics, noisy/multi-speaker robustness, and native punctuation.

## What this is

- **Weights:** `float8_e4m3fn`, per-output-channel absmax scaling (one fp32 scale per row), baked at save time.
- **Activations:** dynamic per-tensor fp8 quantization each forward pass.
- **Matmul:** `torch._scaled_mm` (FP8 tensor cores on Ada / Hopper / Blackwell).
- **Skipped (kept bf16):** embeddings, RMSNorm/LayerNorm, biases. 417 Linear layers converted.
- The audio encoder/tokenizer ([MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer)) is **not** quantized; download it separately for inference.

This roughly halves the LLM weight footprint (~32 GB bf16 → ~16-17 GB on disk).

## Important: this is NOT a drop-in `from_pretrained` checkpoint

`model.safetensors` stores custom `FP8Linear` buffers (`*.weight_fp8`, `*.weight_scale`), not standard HF Linear weights. It must be loaded through the matching `FP8Linear` modules. Use the loader below.
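For intuition about what the `*.weight_fp8` and `*.weight_scale` buffers hold, the per-output-channel absmax scheme described under "What this is" can be sketched in plain Python. This is an illustrative simulation, not the code in `quantize_fp8.py`: the helper names (`round_e4m3`, `quantize_rows`, `dequantize_rows`, `rel_frobenius_error`) are hypothetical, the e4m3fn rounding is emulated with floats rather than a real FP8 dtype, and saturating at 448 is an assumption of this sketch.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_e4m3(x: float) -> float:
    """Round x to the nearest e4m3fn-representable value (float simulation)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)      # assumption: saturate rather than overflow
    _, exp = math.frexp(a)         # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)           # binade exponent; -6 is the smallest normal
    step = 2.0 ** (e - 3)          # 3 mantissa bits -> 8 steps per binade
    return sign * round(a / step) * step  # round() rounds half to even


def quantize_rows(weight):
    """Per-output-channel absmax: one scale per row, row max maps to 448."""
    q_rows, scales = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        scale = absmax / E4M3_MAX if absmax > 0 else 1.0
        q_rows.append([round_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Multiply each quantized row by its scale to recover approximate weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def rel_frobenius_error(w, w_hat):
    """||w - w_hat||_F / ||w||_F over the whole matrix."""
    num = sum((a - b) ** 2 for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb))
    den = sum(a * a for ra in w for a in ra)
    return math.sqrt(num / den)


# Toy "Linear" weight: 8 output channels x 256 inputs, Gaussian entries.
random.seed(0)
weight = [[random.gauss(0.0, 0.02) for _ in range(256)] for _ in range(8)]
q_rows, scales = quantize_rows(weight)
err = rel_frobenius_error(weight, dequantize_rows(q_rows, scales))
print(f"relative Frobenius error: {err:.4f}")
```

On Gaussian-like toy weights a simulation like this should give a relative error of a few percent, the same order as the figures quoted under "Quantization fidelity"; exact values depend on the weight distribution.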

## Usage

```bash
git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1   # required by the audio tokenizer
hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download Infatoshi/MiMo-V2.5-ASR-FP8 --local-dir ./MiMo-V2.5-ASR-FP8
```

Then load with the `FP8Linear` loader (`quantize_fp8.py`, included in this repo):

```python
from quantize_fp8 import load_fp8_model

mimo = load_fp8_model(
    fp8_dir="./MiMo-V2.5-ASR-FP8",
    tokenizer_path="./models/MiMo-Audio-Tokenizer",
    repo_root=".",   # the cloned MiMo-V2.5-ASR repo
)
print(mimo.asr_sft("audio.wav", audio_tag=""))
```

## Quantization fidelity

Per-output-channel absmax dequantization error vs the original fp32 weights, sampled across depth (layers 0/17/35), all attn+mlp projections, lm_head, and the audio local transformer:

- relative Frobenius error: **~0.026, uniform** across every sampled layer (max 0.027 on lm_head)
- no corrupted or outlier layers

This is the expected magnitude for fp8 e4m3 with per-channel scaling (3 mantissa bits).

## Requirements

- CUDA GPU with FP8 tensor cores (Ada / Hopper / Blackwell), CUDA >= 12.0
- torch >= 2.6, safetensors
- **Blackwell (sm_120, e.g. RTX PRO 6000 / RTX 50xx):** use a torch build with CUDA 12.8+ (torch >= 2.7, `cu128`). torch 2.6 `cu124` ships no sm_120 kernels and will fail with "no kernel image is available for execution on the device".

## Notes / caveats

- FP8 e4m3fn quantization (static per-channel weights, dynamic activations) is lossy; expect small WER deltas vs bf16.
- Per-tensor dynamic activation scaling is simple and fast, but less accurate than per-token scaling when activations have large outliers.

Derivative of an MIT-licensed model; original credit to the Xiaomi MiMo team.