---
tags:
- unsloth
base_model:
- XiaomiMiMo/MiMo-V2-Flash
license: mit
library_name: transformers
---
> [!NOTE]
> Includes Unsloth **chat template fixes**! <br> For `llama.cpp`, use `--jinja`
>
<div>
<p style="margin-top: 0;margin-bottom: 0;">
<em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em>
</p>
<div style="display: flex; gap: 5px; align-items: center; ">
<a href="https://github.com/unslothai/unsloth/">
<img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
</a>
<a href="https://discord.gg/unsloth">
<img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
</a>
<a href="https://docs.unsloth.ai/">
<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
</a>
</div>
</div>
<br/><br/>
<div align="center">
<picture>
<source srcset="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/Xiaomi_MiMo_darkmode.png?raw=true" media="(prefers-color-scheme: dark)">
<img src="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/Xiaomi_MiMo.png?raw=true" width="60%" alt="Xiaomi-MiMo" />
</picture>
</div>
<br/>
<div align="center" style="line-height: 1;">
|
<a href="https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash" target="_blank">🤗 HuggingFace</a>
|
<a href="https://github.com/XiaomiMiMo/MiMo-V2-Flash/blob/main/paper.pdf" target="_blank">📔 Technical Report </a>
|
<a href="https://mimo.xiaomi.com/blog/mimo-v2-flash" target="_blank">📰 Blog </a>
|
<br/><br/>
<strong>Play around!</strong>
<a href="https://aistudio.xiaomimimo.com" target="_blank">🗨️ Xiaomi MiMo Studio </a>
<a href="https://platform.xiaomimimo.com/" target="_blank">🎨 Xiaomi MiMo API Platform </a>
</div>
<br/>
# MiMo-V2-Flash
**MiMo-V2-Flash** is a Mixture-of-Experts (MoE) language model with **309B total parameters** and **15B active parameters**. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.
<p align="center">
<img width="80%" src="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/MiMo-v2-flash-performance.jpg?raw=true">
</p>
-----
## 1. Introduction
MiMo-V2-Flash creates a new balance between long-context modeling capability and inference efficiency. Key features include:
* **Hybrid Attention Architecture**: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with an aggressive 128-token window. This reduces KV-cache storage by nearly 6x while maintaining long-context performance via a learnable **attention sink bias** (a quick arithmetic check follows this list).
* **Multi-Token Prediction (MTP)**: Equipped with a lightweight MTP module (0.33B params per block) using dense FFNs. This triples output speed during inference and accelerates rollout in RL training.
* **Efficient Pre-Training**: Trained on 27T tokens with FP8 mixed precision at a native 32K sequence length; the context window extends to 256K tokens.
* **Agentic Capabilities**: Post-training utilizes Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic RL, achieving superior performance on **SWE-Bench** and complex reasoning tasks.
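As a quick sanity check on the KV-cache figure, the 5:1 interleaving with a 128-token window can be compared against an all-global stack. This is a back-of-the-envelope sketch; head counts and dtype cancel out of the ratio and are not modeled:

```python
# Rough KV-cache arithmetic for the 5:1 SWA/GA interleaving with a 128-token
# window, compared per hybrid block (5 SWA layers + 1 GA layer) against the
# same block with global attention everywhere.
def kv_tokens_hybrid_block(context_len: int, swa_layers: int = 5, window: int = 128) -> int:
    """KV entries cached per hybrid block: each SWA layer keeps at most `window` tokens."""
    return swa_layers * min(window, context_len) + context_len  # + one GA layer

def kv_tokens_all_global(context_len: int, layers_per_block: int = 6) -> int:
    """Same block if every layer used global attention."""
    return layers_per_block * context_len

for ctx in (32_768, 131_072, 262_144):
    hybrid = kv_tokens_hybrid_block(ctx)
    dense = kv_tokens_all_global(ctx)
    print(f"{ctx:>7} tokens: hybrid cache is ~{dense / hybrid:.1f}x smaller than all-global")
```

At 256K context the ratio works out to roughly 6x, matching the "nearly 6x" figure above.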
-----
## 2. Model Downloads
| Model | Total Params | Active Params | Context Length | Download |
| :--------------------- | :----------: | :-----------: | :------------: | :-------------------------------------------------------------------: |
| **MiMo-V2-Flash-Base** | 309B | 15B | 256k | [🤗 HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash-Base) |
| **MiMo-V2-Flash** | 309B | 15B | 256k | [🤗 HuggingFace](https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash) |
> [!IMPORTANT]
> We also open-source the 3-layer MTP weights to foster community research.
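For local experimentation, the checkpoints can be fetched with `huggingface_hub` (a minimal sketch; the target directory is hypothetical and the full checkpoint is very large, so point it at a volume with enough space):

```python
# Minimal download sketch (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="XiaomiMiMo/MiMo-V2-Flash",
    local_dir="./MiMo-V2-Flash",  # hypothetical target directory
)
```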
-----
## 3. Evaluation Results
### Base Model Evaluation
MiMo-V2-Flash-Base demonstrates strong performance across standard benchmarks, surpassing models with significantly larger parameter counts.
| Category | Benchmark | Setting/Length | MiMo-V2-Flash Base | Kimi-K2 Base | DeepSeek-V3.1 Base | DeepSeek-V3.2 Exp Base |
| :--------------- | :---------------------- | :------------- | :----------------: | :-------------: | :----------------: | :--------------------: |
| **Params** | **#Activated / #Total** | - | **15B / 309B** | **32B / 1043B** | **37B / 671B** | **37B / 671B** |
| **General** | BBH | 3-shot | 88.5 | 88.7 | 88.2 | 88.7 |
| | MMLU | 5-shot | 86.7 | 87.8 | 87.4 | 87.8 |
| | MMLU-Redux | 5-shot | 90.6 | 90.2 | 90.0 | 90.4 |
| | MMLU-Pro | 5-shot | 73.2 | 69.2 | 58.8 | 62.1 |
| | DROP | 3-shot | 84.7 | 83.6 | 86.3 | 86.6 |
| | ARC-Challenge | 25-shot | 95.9 | 96.2 | 95.6 | 95.5 |
| | HellaSwag | 10-shot | 88.5 | 94.6 | 89.2 | 89.4 |
| | WinoGrande | 5-shot | 83.8 | 85.3 | 85.9 | 85.6 |
| | TriviaQA | 5-shot | 80.3 | 85.1 | 83.5 | 83.9 |
| | GPQA-Diamond | 5-shot | 55.1 | 48.1 | 51.0 | 52.0 |
| | SuperGPQA | 5-shot | 41.1 | 44.7 | 42.3 | 43.6 |
| | SimpleQA | 5-shot | 20.6 | 35.3 | 26.3 | 27.0 |
| **Math** | GSM8K | 8-shot | 92.3 | 92.1 | 91.4 | 91.1 |
| | MATH | 4-shot | 71.0 | 70.2 | 62.6 | 62.5 |
| | AIME 24&25 | 2-shot | 35.3 | 31.6 | 21.6 | 24.8 |
| **Code** | HumanEval+ | 1-shot | 70.7 | 84.8 | 64.6 | 67.7 |
| | MBPP+ | 3-shot | 71.4 | 73.8 | 72.2 | 69.8 |
| | CRUXEval-I | 1-shot | 67.5 | 74.0 | 62.1 | 63.9 |
| | CRUXEval-O | 1-shot | 79.1 | 83.5 | 76.4 | 74.9 |
| | MultiPL-E HumanEval | 0-shot | 59.5 | 60.5 | 45.9 | 45.7 |
| | MultiPL-E MBPP | 0-shot | 56.7 | 58.8 | 52.5 | 50.6 |
| | BigCodeBench | 0-shot | 70.1 | 61.7 | 63.0 | 62.9 |
| | LiveCodeBench v6 | 1-shot | 30.8 | 26.3 | 24.8 | 24.9 |
| | SWE-Bench (AgentLess) | 3-shot | 30.8 | 28.2 | 24.8 | 9.4* |
| **Chinese** | C-Eval | 5-shot | 87.9 | 92.5 | 90.0 | 91.0 |
| | CMMLU | 5-shot | 87.4 | 90.9 | 88.8 | 88.9 |
| | C-SimpleQA | 5-shot | 61.5 | 77.6 | 70.9 | 68.0 |
| **Multilingual** | GlobalMMLU | 5-shot | 76.6 | 80.7 | 81.9 | 82.0 |
| | INCLUDE | 5-shot | 71.4 | 75.3 | 77.2 | 77.2 |
| **Long Context** | NIAH-Multi | 32K | 99.3 | 99.8 | 99.7 | 85.6* |
| | | 64K | 99.9 | 100.0 | 98.6 | 85.9* |
| | | 128K | 98.6 | 99.5 | 97.2 | 94.3* |
| | | 256K | 96.7 | - | - | - |
| | GSM-Infinite Hard | 16K | 37.7 | 34.6 | 41.5 | 50.4 |
| | | 32K | 33.7 | 26.1 | 38.8 | 45.2 |
| | | 64K | 31.5 | 16.0 | 34.7 | 32.6 |
| | | 128K | 29.0 | 8.8 | 28.7 | 25.7 |
> \* indicates the model may fail to follow the prompt or format.
### Post-training Model Evaluation
Following our Post-Training Paradigm with MOPD and Agentic RL, the model achieves SOTA reasoning and agentic performance.
| Benchmark | MiMo-V2-Flash | Kimi-K2 Thinking | DeepSeek-V3.2 Thinking | Gemini-3.0 Pro | Claude Sonnet 4.5 | GPT-5 High |
| :----------------------------- | :-----------: | :--------------: | :--------------------: | :------------: | :---------------: | :--------: |
| **Reasoning** | | | | | | |
| MMLU-Pro | 84.9 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 |
| GPQA-Diamond | 83.7 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 |
| HLE (no tools) | 22.1 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 |
| AIME 2025 | 94.1 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 |
| HMMT Feb. 2025 | 84.4 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 |
| LiveCodeBench-v6 | 80.6 | 83.1 | 83.3 | 90.7 | 64.0 | 84.5 |
| **General Writing** | | | | | | |
| Arena-Hard (Hard Prompt) | 54.1 | 71.9 | 53.4 | 72.6 | 63.3 | 71.9 |
| Arena-Hard (Creative Writing) | 86.2 | 80.1 | 88.8 | 93.6 | 76.7 | 92.2 |
| **Long Context** | | | | | | |
| LongBench V2 | 60.6 | 45.1 | 58.4 | 65.6 | 61.8 | - |
| MRCR | 45.7 | 44.2 | 55.5 | 89.7 | 55.4 | - |
| **Code Agent** | | | | | | |
| SWE-Bench Verified | 73.4 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 |
| SWE-Bench Multilingual | 71.7 | 61.1 | 70.2 | - | 68.0 | 55.3 |
| Terminal-Bench Hard | 30.5 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 |
| Terminal-Bench 2.0 | 38.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 |
| **General Agent** | | | | | | |
| BrowseComp | 45.4 | - | 51.4 | - | 24.1 | 54.9 |
| BrowseComp (w/ Context Manage) | 58.3 | 60.2 | 67.6 | 59.2 | - | - |
| \\(\tau^2\\)-Bench | 80.3 | 74.3 | 80.3 | 85.4 | 84.7 | 80.2 |
-----
## 4. Model Architecture
<p align="center">
<img width="80%" src="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/MiMo-v2-flash-arch.png?raw=true">
</p>
### Hybrid Sliding Window Attention
MiMo-V2-Flash addresses the quadratic complexity of long contexts by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA).
* **Configuration**: Stacks of \\(M=8\\) hybrid blocks; each block contains \\(N=5\\) SWA layers followed by 1 GA layer (see the layer-schedule sketch after this list).
* **Efficiency**: SWA layers use a window size of 128 tokens, cutting KV-cache storage to roughly one sixth at long context.
* **Sink Bias**: Learnable attention sink bias is applied to maintain performance despite the aggressive window size.
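A minimal sketch of the resulting layer schedule, assuming the hybrid blocks are simply stacked; the constants follow the description above, while the names and layout are illustrative rather than the released config keys:

```python
# Illustrative layer schedule for M=8 hybrid blocks, each with N=5 sliding-window
# layers followed by 1 global-attention layer. None marks a global layer.
M_BLOCKS, N_SWA, SWA_WINDOW = 8, 5, 128

def hybrid_schedule():
    schedule = []
    for _ in range(M_BLOCKS):
        schedule.extend([("swa", SWA_WINDOW)] * N_SWA)  # local layers, 128-token window
        schedule.append(("global", None))               # one GA layer per block
    return schedule

layers = hybrid_schedule()
print(len(layers), "attention layers")  # 48
print(layers[:6])                       # first block: 5 SWA layers + 1 GA layer
```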
### Lightweight Multi-Token Prediction (MTP)
Unlike traditional speculative decoding, our MTP module is natively integrated for training and inference.
* **Structure**: Uses a dense FFN (instead of MoE) and SWA (instead of GA) to keep the parameter count low (0.33B per block).
* **Performance**: Facilitates self-speculative decoding, tripling generation speed and mitigating GPU idleness during small-batch RL training.
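A minimal sketch of how an MTP draft head can drive self-speculative decoding, written as a generic draft-then-verify loop; the `draft_next_tokens` and `verify_with_main_model` callables are hypothetical stand-ins, not the production implementation:

```python
# Generic draft-and-verify loop: the lightweight MTP head proposes a few tokens
# per step and the main model verifies them in a single forward pass, so each
# accepted draft amortizes the cost of one large MoE forward.
from typing import Callable, List

def speculative_generate(
    prompt: List[int],
    draft_next_tokens: Callable[[List[int], int], List[int]],            # MTP head (hypothetical)
    verify_with_main_model: Callable[[List[int], List[int]], List[int]], # main model (hypothetical)
    max_new_tokens: int = 64,
    num_draft: int = 3,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_next_tokens(tokens, num_draft)        # cheap multi-token draft
        accepted = verify_with_main_model(tokens, draft)    # prefix of the draft the main model agrees with
        tokens.extend(accepted if accepted else draft[:1])  # always make progress
    return tokens

# Toy demo: the "main model" counts up by one and the "draft head" guesses the same rule,
# so every draft is accepted.
if __name__ == "__main__":
    draft = lambda toks, k: [toks[-1] + i + 1 for i in range(k)]
    verify = lambda toks, d: [t for i, t in enumerate(d) if t == toks[-1] + i + 1]
    print(speculative_generate([0, 1, 2], draft, verify, max_new_tokens=8))
```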
-----
## 5. Post-Training Technical Highlights
MiMo-V2-Flash leverages a post-training pipeline designed to maximize reasoning and agentic capabilities through innovative distillation and reinforcement learning strategies.
### 5.1 Multi-Teacher On-Policy Distillation (MOPD)
We introduce **Multi-Teacher On-Policy Distillation (MOPD)**, a new paradigm that formulates knowledge distillation as a reinforcement learning process.
* **Dense Token-Level Guidance**: Unlike methods relying on sparse sequence-level feedback, MOPD utilizes domain-specific expert models (teachers) to provide supervision at every token position.
* **On-Policy Optimization**: The student model learns from its own generated responses rather than a fixed dataset. This eliminates exposure bias and ensures smaller, more stable gradient updates.
* **Inherent Reward Robustness**: Rewards are derived from the distribution divergence between student and teacher, making the process naturally resistant to reward hacking.
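A minimal sketch of the kind of dense, token-level signal this implies, assuming a per-token reverse-KL comparison between the student's and a teacher's next-token distributions on the student's own rollout; the exact MOPD objective and teacher selection may differ:

```python
# Token-level distillation signal on a student rollout: for each generated token,
# compare the student's and teacher's next-token distributions. The per-position
# reverse KL gives one value per token (dense guidance) rather than a single
# sequence-level score. Illustrative objective, not the exact MOPD formulation.
import torch
import torch.nn.functional as F

def token_level_kl_rewards(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """student_logits, teacher_logits: [seq_len, vocab] over the same rollout tokens."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher) per position; lower divergence => higher reward.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return -kl  # dense, per-token reward

# Example with random logits for a 16-token rollout and a toy vocabulary of 32.
rewards = token_level_kl_rewards(torch.randn(16, 32), torch.randn(16, 32))
print(rewards.shape)  # torch.Size([16])
```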
### 5.2 Scaling Agentic RL
We significantly scale up the agentic training environments to improve intelligence and generalization.
* **Massive Code Agent Environments**: We utilize real-world GitHub issues to create over 100,000 verifiable tasks. Our automated pipeline maintains a Kubernetes cluster capable of running over 10,000 concurrent pods with a 70% environment setup success rate.
* **Multimodal Verifier for WebDev**: For web development tasks, we employ a vision-based verifier that evaluates code execution via recorded videos rather than static screenshots. This reduces visual hallucination and ensures functional correctness.
* **Cross-Domain Generalization**: Our experiments show that large-scale RL training on code agents effectively generalizes to other domains, boosting performance in Math and General Agent tasks.
### 5.3 Advanced RL Infrastructure
To support high-throughput RL training for large-scale MoE models, we implemented several infrastructure optimizations on top of SGLang and Megatron-LM.
* **Rollout Routing Replay (R3)**: Addresses numerical precision inconsistencies in MoE routing between inference and training. R3 reuses the exact routed experts from rollout during the training pass, ensuring consistency with negligible overhead (a minimal sketch follows this list).
* **Request-Level Prefix Cache**: In multi-turn agent training, this cache stores KV states and routed experts from prior turns. It avoids re-computation and ensures sampling consistency across turns.
* **Fine-Grained Data Scheduler**: We extend the rollout engine to schedule fine-grained sequences instead of micro-batches. Combined with partial rollout, this significantly reduces GPU idleness caused by long-tail stragglers.
* **Toolbox & Tool Manager**: A two-layer design using Ray actor pools to handle resource contention. It eliminates cold-start delays for tool execution and isolates task logic from system policies.
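A minimal sketch of the routing-replay idea behind R3, assuming the rollout engine can export its top-k expert indices so the training pass replays them instead of recomputing the top-k; the interfaces here are illustrative, and the real system implements this inside the SGLang/Megatron-LM stack:

```python
# Rollout Routing Replay (R3), schematically: record which experts each token was
# routed to during rollout, then dispatch to exactly those experts in the training
# forward pass instead of recomputing top-k from numerically different logits.
import torch

def route_topk(router_logits: torch.Tensor, k: int = 2):
    """Standard MoE routing: returns (expert_ids, gate_weights), each [tokens, k]."""
    gates, expert_ids = torch.topk(torch.softmax(router_logits, dim=-1), k, dim=-1)
    return expert_ids, gates

def route_with_replay(router_logits: torch.Tensor, recorded_expert_ids: torch.Tensor):
    """R3-style routing: keep the rollout's expert choice, recompute only the gate
    weights so gradients still flow through the router."""
    probs = torch.softmax(router_logits, dim=-1)
    gates = torch.gather(probs, dim=-1, index=recorded_expert_ids)
    return recorded_expert_ids, gates

# Rollout side (inference numerics):
rollout_logits = torch.randn(4, 8)                 # 4 tokens, 8 experts
replay_ids, _ = route_topk(rollout_logits)
# Training side sees slightly perturbed logits but replays the same experts:
train_logits = rollout_logits + 1e-3 * torch.randn_like(rollout_logits)
ids, gates = route_with_replay(train_logits, replay_ids)
assert torch.equal(ids, replay_ids)
```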
-----
## 6. Inference & Deployment
MiMo-V2-Flash supports FP8 mixed precision inference. We recommend using **SGLang** for optimal performance.
### Quick Start with SGLang
```bash
pip install sglang
# Launch server
python3 -m sglang.launch_server \
--model-path XiaomiMiMo/MiMo-V2-Flash \
--served-model-name mimo-v2-flash \
--pp-size 1 \
--dp-size 2 \
--enable-dp-attention \
--tp-size 8 \
--moe-a2a-backend deepep \
--page-size 1 \
--host 0.0.0.0 \
--port 9001 \
--trust-remote-code \
--mem-fraction-static 0.75 \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--reasoning-parser qwen3 \
--tool-call-parser mimo \
--context-length 262144 \
--attention-backend fa3 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-mtp
# Send request
curl -i http://localhost:9001/v1/chat/completions \
-H 'Content-Type:application/json' \
-d '{
"messages" : [{
"role": "user",
"content": "Nice to meet you MiMo"
}],
"model": "mimo-v2-flash",
"max_tokens": 4096,
"temperature": 0.8,
"top_p": 0.95,
"stream": true,
"chat_template_kwargs": {
"enable_thinking": true
}
}'
```
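Once the server is running, the same request can be sent from the OpenAI-compatible Python client; `extra_body` is how the `openai` SDK forwards the non-standard `chat_template_kwargs` field (a minimal sketch mirroring the curl call above):

```python
# OpenAI-compatible request to the local SGLang server started above.
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9001/v1", api_key="EMPTY")  # SGLang ignores the key

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Nice to meet you MiMo"}],
    max_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
```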
### Notes
#### 1. System prompt
> [!IMPORTANT]
> The following system prompts are **HIGHLY** recommended; choose either the English or the Chinese version.
English
```plaintext
You are MiMo, an AI assistant developed by Xiaomi.
Today's date: {date} {week}. Your knowledge cutoff date is December 2024.
```
Chinese
```plaintext
你是MiMo(中文名称也是MiMo),是小米公司研发的AI智能助手。
今天的日期:{date} {week},你的知识截止日期是2024年12月。
```
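The `{date}` and `{week}` placeholders are filled in at request time; below is a minimal sketch of constructing the English system message (the weekday formatting is an assumption, adjust as needed):

```python
# Fill the {date} and {week} placeholders in the recommended system prompt.
from datetime import date

SYSTEM_PROMPT = (
    "You are MiMo, an AI assistant developed by Xiaomi.\n"
    "Today's date: {date} {week}. Your knowledge cutoff date is December 2024."
)

today = date.today()
system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT.format(date=today.isoformat(), week=today.strftime("%A")),
}
print(system_message["content"])
```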
#### 2. Sampling parameters
> [!IMPORTANT]
> Recommended sampling parameters:
>
> `top_p=0.95`
>
> `temperature=0.8` for math, writing, web-dev
>
> `temperature=0.3` for agentic tasks (e.g., vibe-coding, tool-use)
#### 3. Tool-use practice
> [!IMPORTANT]
> In thinking mode with multi-turn tool calls, the model returns a `reasoning_content` field alongside `tool_calls`. To continue the conversation, include all previous `reasoning_content` in the `messages` array of each subsequent request (see the sketch below).
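A minimal sketch of what the `messages` array can look like on the follow-up request, assuming the previous assistant turn returned both `reasoning_content` and `tool_calls` (the weather tool and field values are illustrative):

```python
# Persisting reasoning_content across tool-call turns: the assistant message from
# the previous turn is appended verbatim (including reasoning_content and
# tool_calls), followed by the tool result, before the next request is sent.
follow_up_messages = [
    {"role": "system", "content": "You are MiMo, an AI assistant developed by Xiaomi."},
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": "",
        "reasoning_content": "<previous turn's reasoning, kept verbatim>",
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'},
        }],
    },
    {"role": "tool", "tool_call_id": "call_0", "content": '{"temp_c": 21, "sky": "clear"}'},
]
# follow_up_messages is then sent as the "messages" field of the next
# /v1/chat/completions request, with thinking mode still enabled.
```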
-----
## 7. Citation
If you find our work helpful, please cite our technical report:
```bibtex
@misc{mimo2025flash,
title={MiMo-V2-Flash Technical Report},
author={LLM-Core Xiaomi},
year={2025},
url={https://github.com/XiaomiMiMo/MiMo-V2-Flash/blob/main/paper.pdf}
}
```
## 8. Contact
Please contact us at [mimo@xiaomi.com](mailto:mimo@xiaomi.com), join our WeChat group below or open an issue if you have any questions.
<p align="center">
<img src="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/wechat_group/wechat1.jpg?raw=true" width="20%" />
<img src="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/wechat_group/wechat2.jpg?raw=true" width="20%" />
<img src="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/wechat_group/wechat3.jpg?raw=true" width="20%" />
<img src="https://github.com/XiaomiMiMo/MiMo-V2-Flash/raw/main/figures/wechat_group/wechat4.jpg?raw=true" width="20%" />
</p>