MiMo-V2.5-NVFP4-Experts

Expert-only NVFP4 quantization of XiaomiMiMo/MiMo-V2.5 (310B MoE, 15B active params).

This is the non-Pro variant โ€” MiMo-V2.5 (310B, hidden_size 4096), quantized identically to how Xiaomi produced their MiMo-V2.5-Pro-FP4-DFlash.

What Was Done

Step Details
Quantization Expert-only NVFP4 via NVIDIA ModelOpt 0.44.0
Calibration 64 synthetic calibration samples
KV Cache FP8
Hardware Single H200 NVL (141GB VRAM + 773GB RAM)
Duration ~30 min quantization

Quantization Approach

MiMo-V2.5 is a Mixture-of-Experts (MoE) model with 256 routed experts per layer. Only the MoE expert weights are quantized to NVFP4 (4-bit NormalFloat), while attention projections, vision encoder (729M), audio encoder (261M), and other modules remain at higher precision.

This matches Xiaomi's own approach for the Pro version:

"We quantize only the MoE experts to FP4 (MXFP4) and keep the other modules at their original precision." โ€” MiMo-V2.5-Pro-FP4-DFlash

Model Summary

Base (FP8) This (NVFP4-Experts)
Architecture MiMoV2ForCausalLM MiMoV2ForCausalLM
Total Parameters 310B 310B
Active Parameters 15B 15B
Hidden Size 4096 4096
Layers 48 (1 dense + 47 MoE) 48
Routed Experts 256 per layer 256 per layer
Expert Precision FP8/E8M3 NVFP4
Attention Precision BF16/FP8 Untouched
Vision Encoder 729M (untouched) 729M (untouched)
Audio Encoder 261M (untouched) 261M (untouched)
Model Size ~295 GB ~125 GB
MTP Layers 3 (329M) 3 (untouched)

Quantization Config

{
  "quantization_method": "nvfp4_experts",
  "kv_cache_dtype": "fp8",
  "quantization_library": "nvidia-modelopt",
  "calibration_samples": 64
}

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "gaber/MiMo-V2.5-NVFP4-Experts",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "gaber/MiMo-V2.5-NVFP4-Experts",
    trust_remote_code=True,
)

Deployment

SGLang

python3 -m sglang.launch_server \
    --model-path gaber/MiMo-V2.5-NVFP4-Experts \
    --quantization fp8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --context-length 32768

vLLM

vllm serve gaber/MiMo-V2.5-NVFP4-Experts \
    --quantization fp8 \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 32768

Hardware Requirements

  • Minimum VRAM: ~125 GB (for quantized weights alone)
  • Recommended: 2ร— H100/H200 or 4ร— A100 80GB
  • Single H200 NVL: Works with device_map="auto" (VRAM + RAM split)

Limitations

  • DFlash Drafter: Not included. Xiaomi's DFlash draft model targets MiMo-V2.5-Pro (hidden_size 6144), which is incompatible with this base model (hidden_size 4096).
  • MTP Heads: Untouched (3-layer multi-token prediction preserved)
  • Vision/Audio: Untouched (729M ViT + 261M Audio encoders preserved)

Base Model

XiaomiMiMo/MiMo-V2.5 โ€” 310B MoE, 15B active, hybrid SWA/GA attention, 1M context, multimodal (text/image/video/audio).

Quantization Tool

NVIDIA ModelOpt 0.44.0 โ€” NVFP4 experts-only quantization with 64-sample calibration.

License

Same as base model: MIT

Downloads last month
100
Safetensors
Model size
67B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support