MiMo-V2.5-NVFP4-Experts

Expert-only NVFP4 quantization of XiaomiMiMo/MiMo-V2.5 (310B MoE, 15B active params).

This is the non-Pro variant — MiMo-V2.5 (310B, hidden_size 4096), quantized identically to how Xiaomi produced their MiMo-V2.5-Pro-FP4-DFlash.

What Was Done

Step	Details
Quantization	Expert-only NVFP4 via NVIDIA ModelOpt 0.44.0
Calibration	64 synthetic calibration samples
KV Cache	FP8
Hardware	Single H200 NVL (141GB VRAM + 773GB RAM)
Duration	~30 min quantization

Quantization Approach

MiMo-V2.5 is a Mixture-of-Experts (MoE) model with 256 routed experts per layer. Only the MoE expert weights are quantized to NVFP4 (4-bit NormalFloat), while attention projections, vision encoder (729M), audio encoder (261M), and other modules remain at higher precision.

This matches Xiaomi's own approach for the Pro version:

"We quantize only the MoE experts to FP4 (MXFP4) and keep the other modules at their original precision." — MiMo-V2.5-Pro-FP4-DFlash

Model Summary

	Base (FP8)	This (NVFP4-Experts)
Architecture	MiMoV2ForCausalLM	MiMoV2ForCausalLM
Total Parameters	310B	310B
Active Parameters	15B	15B
Hidden Size	4096	4096
Layers	48 (1 dense + 47 MoE)	48
Routed Experts	256 per layer	256 per layer
Expert Precision	FP8/E8M3	NVFP4
Attention Precision	BF16/FP8	Untouched
Vision Encoder	729M (untouched)	729M (untouched)
Audio Encoder	261M (untouched)	261M (untouched)
Model Size	~295 GB	~125 GB
MTP Layers	3 (329M)	3 (untouched)

Quantization Config

{
  "quantization_method": "nvfp4_experts",
  "kv_cache_dtype": "fp8",
  "quantization_library": "nvidia-modelopt",
  "calibration_samples": 64
}

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "gaber/MiMo-V2.5-NVFP4-Experts",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "gaber/MiMo-V2.5-NVFP4-Experts",
    trust_remote_code=True,
)

Deployment

SGLang

python3 -m sglang.launch_server \
    --model-path gaber/MiMo-V2.5-NVFP4-Experts \
    --quantization fp8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --context-length 32768

vLLM

vllm serve gaber/MiMo-V2.5-NVFP4-Experts \
    --quantization fp8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 32768

Hardware Requirements

Minimum VRAM: ~125 GB (for quantized weights alone)
Recommended: 2× H100/H200 or 4× A100 80GB
Single H200 NVL: Works with device_map="auto" (VRAM + RAM split)

Limitations

DFlash Drafter: Not included. Xiaomi's DFlash draft model targets MiMo-V2.5-Pro (hidden_size 6144), which is incompatible with this base model (hidden_size 4096).
MTP Heads: Untouched (3-layer multi-token prediction preserved)
Vision/Audio: Untouched (729M ViT + 261M Audio encoders preserved)

Base Model

XiaomiMiMo/MiMo-V2.5 — 310B MoE, 15B active, hybrid SWA/GA attention, 1M context, multimodal (text/image/video/audio).

Quantization Tool

NVIDIA ModelOpt 0.44.0 — NVFP4 experts-only quantization with 64-sample calibration.

License

Same as base model: MIT

Downloads last month: 100

Safetensors

Model size

67B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support