MiMo-V2.5-NVFP4-Experts
Expert-only NVFP4 quantization of XiaomiMiMo/MiMo-V2.5 (310B MoE, 15B active params).
This is the non-Pro variant โ MiMo-V2.5 (310B, hidden_size 4096), quantized identically to how Xiaomi produced their MiMo-V2.5-Pro-FP4-DFlash.
What Was Done
| Step | Details |
|---|---|
| Quantization | Expert-only NVFP4 via NVIDIA ModelOpt 0.44.0 |
| Calibration | 64 synthetic calibration samples |
| KV Cache | FP8 |
| Hardware | Single H200 NVL (141GB VRAM + 773GB RAM) |
| Duration | ~30 min quantization |
Quantization Approach
MiMo-V2.5 is a Mixture-of-Experts (MoE) model with 256 routed experts per layer. Only the MoE expert weights are quantized to NVFP4 (4-bit NormalFloat), while attention projections, vision encoder (729M), audio encoder (261M), and other modules remain at higher precision.
This matches Xiaomi's own approach for the Pro version:
"We quantize only the MoE experts to FP4 (MXFP4) and keep the other modules at their original precision." โ MiMo-V2.5-Pro-FP4-DFlash
Model Summary
| Base (FP8) | This (NVFP4-Experts) | |
|---|---|---|
| Architecture | MiMoV2ForCausalLM | MiMoV2ForCausalLM |
| Total Parameters | 310B | 310B |
| Active Parameters | 15B | 15B |
| Hidden Size | 4096 | 4096 |
| Layers | 48 (1 dense + 47 MoE) | 48 |
| Routed Experts | 256 per layer | 256 per layer |
| Expert Precision | FP8/E8M3 | NVFP4 |
| Attention Precision | BF16/FP8 | Untouched |
| Vision Encoder | 729M (untouched) | 729M (untouched) |
| Audio Encoder | 261M (untouched) | 261M (untouched) |
| Model Size | ~295 GB | ~125 GB |
| MTP Layers | 3 (329M) | 3 (untouched) |
Quantization Config
{
"quantization_method": "nvfp4_experts",
"kv_cache_dtype": "fp8",
"quantization_library": "nvidia-modelopt",
"calibration_samples": 64
}
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"gaber/MiMo-V2.5-NVFP4-Experts",
trust_remote_code=True,
device_map="auto",
torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
"gaber/MiMo-V2.5-NVFP4-Experts",
trust_remote_code=True,
)
Deployment
SGLang
python3 -m sglang.launch_server \
--model-path gaber/MiMo-V2.5-NVFP4-Experts \
--quantization fp8 \
--trust-remote-code \
--dtype bfloat16 \
--context-length 32768
vLLM
vllm serve gaber/MiMo-V2.5-NVFP4-Experts \
--quantization fp8 \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 32768
Hardware Requirements
- Minimum VRAM: ~125 GB (for quantized weights alone)
- Recommended: 2ร H100/H200 or 4ร A100 80GB
- Single H200 NVL: Works with
device_map="auto"(VRAM + RAM split)
Limitations
- DFlash Drafter: Not included. Xiaomi's DFlash draft model targets MiMo-V2.5-Pro (hidden_size 6144), which is incompatible with this base model (hidden_size 4096).
- MTP Heads: Untouched (3-layer multi-token prediction preserved)
- Vision/Audio: Untouched (729M ViT + 261M Audio encoders preserved)
Base Model
XiaomiMiMo/MiMo-V2.5 โ 310B MoE, 15B active, hybrid SWA/GA attention, 1M context, multimodal (text/image/video/audio).
Quantization Tool
NVIDIA ModelOpt 0.44.0 โ NVFP4 experts-only quantization with 64-sample calibration.
License
Same as base model: MIT
- Downloads last month
- 100