MiMo-V2-Flash Neuron BF16 (Compiled for trn2.48xlarge)

Pre-compiled BF16 checkpoint for XiaomiMiMo/MiMo-V2-Flash on AWS Trainium2 (trn2.48xlarge).

Configuration

  • Precision: BF16
  • Instance: trn2.48xlarge (16 Neuron devices, 64 logical NeuronCores, LNC=2)
  • TP degree: 64 (Expert Parallel: moe_ep=64, moe_tp=1)
  • Max sequence length: 4096
  • Batch size: 32
  • SDK: Neuron SDK 2.29 (DLAMI 20260410)
  • NxDI: neuronx-distributed-inference 0.9.x
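
How these numbers fit together, as a back-of-envelope sketch (the 8-physical-cores-per-device figure is an assumption about Trainium2, not stated in this card):

# LNC=2 pairs physical cores into logical NeuronCores, and tensor
# parallelism runs one rank per logical core.
devices = 16
physical_cores_per_device = 8                             # assumed for Trainium2
logical_cores = devices * physical_cores_per_device // 2  # LNC=2 -> 64
tp_degree = logical_cores                                 # matches --tensor-parallel-size 64
assert tp_degree == 64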

Quick Start

1. Download the compiled checkpoint

huggingface-cli download jburtoft/MiMo-V2-Flash-Neuron-BF16 \
  --local-dir /opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16
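
If you prefer Python, the equivalent call via huggingface_hub (the library behind huggingface-cli) is:

# Python equivalent of the CLI download above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="jburtoft/MiMo-V2-Flash-Neuron-BF16",
    local_dir="/opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16",
)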

2. Install NxDI with MiMo-V2-Flash support

git clone https://github.com/jimburtoft/neuronx-distributed-inference.git nxdi \
  -b contrib/MiMo-V2-Flash-nki-moe
pip install -e nxdi

3. Launch vLLM server

cd nxdi/contrib/models/MiMo-V2-Flash/src

python3 register_vllm.py \
  --model /opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16 \
  --served-model-name MiMo-V2-Flash-BF16 \
  --tensor-parallel-size 64 \
  --max-model-len 4096 \
  --max-num-seqs 32 \
  --swap-space 0 \
  --no-enable-chunked-prefill \
  --no-enable-prefix-caching \
  --port 8000 \
  --trust-remote-code \
  --additional-config '{
    "override_neuron_config": {
      "tp_degree": 64,
      "logical_nc_config": 2,
      "fused_qkv": false,
      "sequence_parallel_enabled": false,
      "glu_mlp": true,
      "normalize_top_k_affinities": true,
      "save_sharded_checkpoint": true,
      "quantized": false,
      "router_config": {"act_fn": "sigmoid", "dtype": "float32"},
      "moe_tp_degree": 1,
      "moe_ep_degree": 64,
      "batch_size": 32,
      "ctx_batch_size": 1,
      "tkg_batch_size": 32,
      "max_context_length": 4096,
      "seq_len": 4096,
      "is_continuous_batching": true,
      "enable_bucketing": true,
      "context_encoding_buckets": [128, 256, 512, 1024, 2048, 4096],
      "token_generation_buckets": [4096],
      "async_mode": true,
      "on_device_sampling_config": {"do_sample": false}
    }
  }'
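
The context_encoding_buckets entry controls which prompt lengths get dedicated compiled graphs. As a conceptual illustration (not the NxDI code path), each request is padded up to the smallest bucket that fits:

# Conceptual sketch of bucket selection: a request runs in the smallest
# compiled bucket that can hold it, avoiding recompilation per length.
import bisect

CTX_BUCKETS = [128, 256, 512, 1024, 2048, 4096]

def pick_bucket(prompt_len, buckets=CTX_BUCKETS):
    i = bisect.bisect_left(buckets, prompt_len)
    if i == len(buckets):
        raise ValueError(f"{prompt_len} tokens exceeds the largest bucket ({buckets[-1]})")
    return buckets[i]

print(pick_bucket(300))   # -> 512
print(pick_bucket(4096))  # -> 4096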

4. Query the model

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MiMo-V2-Flash-BF16", "messages": [{"role": "user", "content": "Write hello world in Python"}], "max_tokens": 200, "temperature": 0}'
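
The server exposes the OpenAI-compatible API, so the same request also works from Python with the openai client (pip install openai); the api_key value is arbitrary because the local server does not authenticate:

# Same request as the curl above, via the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="MiMo-V2-Flash-BF16",
    messages=[{"role": "user", "content": "Write hello world in Python"}],
    max_tokens=200,
    temperature=0,
)
print(resp.choices[0].message.content)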

Key Fix: e_score_correction_bias

This checkpoint includes the fix for MiMo-V2-Flash's noaux_tc routing bias. Without the fix, e_score_correction_bias was silently dropped during weight loading, routing tokens to the wrong MoE experts and degrading output quality.

The fix:

  • Loads e_score_correction_bias from the checkpoint into RouterTopK
  • Disables the fused TKG mega-kernel (which cannot apply the bias)
  • Falls back to the non-fused path where RouterTopK.forward() applies the bias

See PR #10 for details.
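
For intuition, here is a minimal sketch of what the bias changes in this style of routing (a conceptual illustration with made-up names, not the RouterTopK implementation): the bias shifts which experts get selected, while the gating weights still come from the unbiased affinities.

# Conceptual sketch of bias-corrected top-k routing.
import torch

def route_with_bias(router_logits, e_score_correction_bias, top_k):
    scores = torch.sigmoid(router_logits)                  # router act_fn: sigmoid
    _, idx = torch.topk(scores + e_score_correction_bias,  # bias steers selection only
                        top_k, dim=-1)
    weights = scores.gather(-1, idx)                       # affinities stay unbiased
    weights = weights / weights.sum(-1, keepdim=True)      # normalize_top_k_affinities
    return idx, weights

Dropping the bias is equivalent to passing zeros, which picks a different expert set whenever the bias reorders the top-k, matching the degraded output described above.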

Files

neuron-compiled-artifacts/
  model.pt                    # Traced model (218 MB)
  neuron_config.json          # Neuron configuration
  weights/
    tp{0..63}_sharded_checkpoint.safetensors  # 64 shards, ~594 GB total
src/
  modeling_mimo_v2.py         # NxDI model implementation
  register_vllm.py            # vLLM 0.16 registration script
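
A quick sanity check that the download is complete before serving (assumes the local path from step 1):

# Verify that all 64 TP shards are present under the download directory.
from pathlib import Path

weights = Path("/opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16"
               "/neuron-compiled-artifacts/weights")
missing = [r for r in range(64)
           if not (weights / f"tp{r}_sharded_checkpoint.safetensors").exists()]
print("all 64 shards present" if not missing else f"missing ranks: {missing}")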

Notes

  • Requires trn2.48xlarge (64 logical NeuronCores with LNC=2)
  • The pre-compiled artifacts skip the ~20-minute compilation step
  • Weight loading still takes ~20 minutes on first serve
  • Uses Expert Parallel (EP) routing: each NeuronCore handles 4 of 256 experts
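
A sketch of the expert placement this implies, assuming a contiguous layout (the actual NxDI mapping may differ):

# 256 experts / moe_ep=64 -> 4 experts per EP rank.
NUM_EXPERTS, EP_DEGREE = 256, 64
PER_RANK = NUM_EXPERTS // EP_DEGREE  # 4

def experts_on_rank(rank):
    return list(range(rank * PER_RANK, (rank + 1) * PER_RANK))

print(experts_on_rank(0))   # [0, 1, 2, 3]
print(experts_on_rank(63))  # [252, 253, 254, 255]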