# MiMo-V2-Flash Neuron BF16 (Compiled for trn2.48xlarge)
Pre-compiled BF16 checkpoint for XiaomiMiMo/MiMo-V2-Flash on AWS Trainium2 (trn2.48xlarge).
## Configuration
- Precision: BF16
- Instance: trn2.48xlarge (16 Trainium2 devices, 64 logical NeuronCores with LNC=2)
- TP degree: 64 (Expert Parallel: moe_ep=64, moe_tp=1)
- Max sequence length: 4096
- Batch size: 32
- SDK: Neuron SDK 2.29 (DLAMI 20260410)
- NxDI: neuronx-distributed-inference 0.9.x
## Quick Start
### 1. Download the compiled checkpoint
```bash
huggingface-cli download jburtoft/MiMo-V2-Flash-Neuron-BF16 \
  --local-dir /opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16
```
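If you prefer to stay in Python, `huggingface_hub` can do the same download (a minimal sketch; the repo id and target path match the CLI command above):

```python
# Sketch: Python equivalent of the huggingface-cli download above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="jburtoft/MiMo-V2-Flash-Neuron-BF16",
    local_dir="/opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16",
)
```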
### 2. Install NxDI with MiMo-V2-Flash support
```bash
git clone -b contrib/MiMo-V2-Flash-nki-moe \
  https://github.com/jimburtoft/neuronx-distributed-inference.git nxdi
pip install -e nxdi
```
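A quick sanity check that the editable install resolved (a sketch; the import name `neuronx_distributed_inference` is my assumption for this package's module path):

```python
# Sanity check: confirm the editable NxDI install is importable.
# NOTE: the module name is assumed from the package name
# neuronx-distributed-inference; adjust if your install differs.
import neuronx_distributed_inference as nxdi

print(nxdi.__file__)  # should point into the cloned nxdi/ tree
```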
### 3. Launch the vLLM server
```bash
cd nxdi/contrib/models/MiMo-V2-Flash/src
python3 register_vllm.py \
  --model /opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16 \
  --served-model-name MiMo-V2-Flash-BF16 \
  --tensor-parallel-size 64 \
  --max-model-len 4096 \
  --max-num-seqs 32 \
  --swap-space 0 \
  --no-enable-chunked-prefill \
  --no-enable-prefix-caching \
  --port 8000 \
  --trust-remote-code \
  --additional-config '{"override_neuron_config": {"tp_degree": 64, "logical_nc_config": 2, "fused_qkv": false, "sequence_parallel_enabled": false, "glu_mlp": true, "normalize_top_k_affinities": true, "save_sharded_checkpoint": true, "quantized": false, "router_config": {"act_fn": "sigmoid", "dtype": "float32"}, "moe_tp_degree": 1, "moe_ep_degree": 64, "batch_size": 32, "ctx_batch_size": 1, "tkg_batch_size": 32, "max_context_length": 4096, "seq_len": 4096, "is_continuous_batching": true, "enable_bucketing": true, "context_encoding_buckets": [128, 256, 512, 1024, 2048, 4096], "token_generation_buckets": [4096], "async_mode": true, "on_device_sampling_config": {"do_sample": false}}}'
```
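The `--additional-config` payload is long and easy to break with shell quoting. One option (a sketch, not a requirement) is to generate it from a Python dict and paste the output into the command:

```python
# Sketch: build the --additional-config JSON from a dict to avoid
# shell-quoting mistakes. Only a few keys are shown; copy the full
# set from the command above.
import json

override = {
    "tp_degree": 64,
    "logical_nc_config": 2,
    "moe_tp_degree": 1,
    "moe_ep_degree": 64,
    "batch_size": 32,
    "seq_len": 4096,
    # ...remaining keys as in the command above...
}
print(json.dumps({"override_neuron_config": override}))
```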
### 4. Query the model
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MiMo-V2-Flash-BF16", "messages": [{"role": "user", "content": "Write hello world in Python"}], "max_tokens": 200, "temperature": 0}'
```
## Key Fix: `e_score_correction_bias`
This checkpoint includes the fix for the MiMo-V2-Flash `noaux_tc` routing bias. Without the fix, `e_score_correction_bias` was silently dropped during weight loading, so tokens were routed to the wrong MoE experts and output quality degraded.
The fix:
- Loads `e_score_correction_bias` from the checkpoint into `RouterTopK`
- Disables the fused TKG mega-kernel (which cannot apply the bias)
- Falls back to the non-fused path, where `RouterTopK.forward()` applies the bias
See PR #10 for details.
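For intuition, here is a minimal sketch of `noaux_tc`-style routing with a correction bias (illustrative only, not the NxDI implementation; `top_k=8` is an assumed value). The bias shifts the scores used to *select* experts, while the combine weights come from the unbiased scores, consistent with `act_fn: sigmoid` and `normalize_top_k_affinities` in the config above:

```python
# Illustrative noaux_tc routing with e_score_correction_bias.
# The bias only affects which experts are chosen; combine weights
# use the unbiased affinities. top_k=8 is an assumed value.
import torch

def route(logits: torch.Tensor, bias: torch.Tensor, top_k: int = 8):
    scores = torch.sigmoid(logits)                 # router affinities
    _, idx = (scores + bias).topk(top_k, dim=-1)   # bias steers selection
    weights = scores.gather(-1, idx)               # unbiased scores as weights
    weights = weights / weights.sum(-1, keepdim=True)  # normalize top-k
    return idx, weights

logits = torch.randn(2, 256)  # 2 tokens, 256 experts
bias = torch.zeros(256)       # would be loaded from e_score_correction_bias
idx, weights = route(logits, bias)
print(idx.shape, weights.shape)  # torch.Size([2, 8]) torch.Size([2, 8])
```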
## Files
```text
neuron-compiled-artifacts/
  model.pt            # Traced model (218 MB)
  neuron_config.json  # Neuron configuration
weights/
  tp{0..63}_sharded_checkpoint.safetensors  # 64 shards, ~594 GB total
src/
  modeling_mimo_v2.py  # NxDI model implementation
  register_vllm.py     # vLLM 0.16 registration script
```
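To confirm the download is complete before serving, a quick shard check (a sketch; adjust the base path if `weights/` sits elsewhere under your `--local-dir`):

```python
# Verify all 64 weight shards are present and report their total size.
from pathlib import Path

weights = Path("/opt/dlami/nvme/MiMo-V2-Flash-Neuron-BF16/weights")
shards = sorted(weights.glob("tp*_sharded_checkpoint.safetensors"))
total_gb = sum(p.stat().st_size for p in shards) / 1e9
print(f"{len(shards)} shards, {total_gb:.0f} GB")  # expect 64 shards, ~594 GB
```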
## Notes
- Requires trn2.48xlarge (64 logical NeuronCores with LNC=2)
- The pre-compiled artifacts skip the ~20-minute compilation step
- Weight loading still takes ~20 minutes on the first serve
- Uses Expert Parallel (EP) routing: each logical NeuronCore handles 4 of the 256 experts (see the sketch below)
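The EP arithmetic, as a tiny illustration (contiguous expert-to-core assignment is my assumption; the actual placement is decided by NxDI):

```python
# With moe_ep=64 over 256 experts, each logical NeuronCore owns
# 256 / 64 = 4 experts. Contiguous assignment is assumed here.
NUM_EXPERTS, EP_DEGREE = 256, 64
per_core = NUM_EXPERTS // EP_DEGREE

def experts_on_core(core: int) -> range:
    return range(core * per_core, (core + 1) * per_core)

print(per_core)                  # 4
print(list(experts_on_core(0)))  # [0, 1, 2, 3]
```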