| # Guide to Using MTP in SFT/RL Training and Inference |
|
|
| **Author**: `https://github.com/meituan-search` |
|
|
| Last updated: 02/15/2026 |
|
|
| # 1. Scope of Support |
|
|
| Currently, RL training can be performed on mimo-7B-RL, Qwen-next, and Deepseek series models based on the MTP architecture. The support rules for training and inference engines are as follows: |
|
|
| - **Training Engine**: Only supports the `mbridge/Megatron-Bridge + megatron` combination; other training engines are not compatible at this time; |
|
|
| - **Inference Engine**: Compatible with all engines, but the model must be in the corresponding engine's compatibility list; |
|
|
| - **Dependency Versions**: |
|
|
| - mbridge: Apply the patches and review suggestions from PR: [#62](https://github.com/ISEEKYAN/mbridge/pull/62) (Already merged into the main branch); |
|
|
| - Megatron-Bridge: Apply the patches and review suggestions from PR if you want to try out mimo-7B-RL: [#2387](https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/2387) (will be merged into the main branch in the future); |
|
|
| - megatron: Use the latest dev version (commit: [23e092f41ec8bc659020e401ddac9576c1cfed7e](https://github.com/NVIDIA/Megatron-LM/tree/23e092f41ec8bc659020e401ddac9576c1cfed7e)), which supports MTP + CP training methods. |
| |
| - sglang: Use the specified branch: [https://github.com/ArronHZG/sglang/tree/fix_mtp_update_weights_from_tensor](https://github.com/ArronHZG/sglang/tree/fix_mtp_update_weights_from_tensor), [PR](https://github.com/sgl-project/sglang/pull/17870) , which fix the MTP update weights from tensor OOM issue. |
| |
| # 2. MTP Training Configuration (Core Parameters) |
|
|
| The MTP training process can be flexibly controlled through the following configurations. All configurations are based on the `actor_rollout_ref.model.mtp` prefix: |
|
|
| | Configuration Scenario | Core Parameters | Description | |
| |------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------| |
| | Load MTP Parameters Only | `enable=True` | VRAM usage will increase, but the exported parameters include the MTP module and can be directly used for online deployment | |
| | Full-Parameter MTP Training | `enable=True`<br>`enable_train=True`<br>`mtp_loss_scaling_factor=0.1` | MTP Loss will apply to all model parameters | |
| | MTP Parameter-Only Training | `enable=True`<br>`enable_train=True`<br>`detach_encoder=True` | Freeze the Encoder layer, update only MTP module parameters, MTP Loss applies only to MTP parameters | |
| | MTP Accelerated Rollout | 1. vLLM configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`method="mtp"`<br>`num_speculative_tokens=1`<br>2. SGLang configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`speculative_algorithm="EAGLE"`<br>`speculative_num_steps=2`<br>`speculative_eagle_topk=2`<br>`speculative_num_draft_tokens=4` | Achieve inference acceleration during the Rollout phase based on MTP | |
|
|
| # 3. Experimental Results |
|
|
| The experiment was conducted as follows: |
|
|
| * model = mimo-7B-math |
| * max_response_length = 8k |
|
|
| Experiment chart: |
|
|
|  |
| |
| The wandb link for the graph: [wandb](https://wandb.ai/hou-zg-meituan/mimo-7b-sft-mtp?nw=nwuserhouzg) |
| |
| **Scenarios with No Significant Effect** |
| |
| The following configurations will not have a noticeable impact on training results: |
| |
| 1. The base model does not carry MTP parameters; |
| |
| 2. The base model carries MTP parameters, but the MTP module is not trained; |
| |
| 3. The base model carries MTP parameters and trains MTP, with `mtp_loss_scaling_factor=0`; |
|
|
| 4. The base model carries MTP parameters, trains MTP and detaches the encoder, with `mtp_loss_scaling_factor=0.1`. |
|
|
| **Scenarios with Significant Effect** |
|
|
| Only the following configuration will have a noticeable impact on training results: |
|
|
| - The base model carries MTP parameters, MTP Loss applies to all model parameters, and `mtp_loss_scaling_factor=0.1`. |
|
|
| **Recommended Training Method** |
|
|
| It is recommended to adopt the `detach_encoder=True` approach for MTP training. |
|
|
| # 4. Performance Notes for MTP in Rollout Inference |
|
|
| Enabling MTP improves the rollout acceptance rate by around 14%. However, on H20 GPUs, overall throughput does not increase and even decreases slightly. |
|
|
|  |
|
|
| The effectiveness of MTP-accelerated Rollout is significantly affected by **model size** and **inference hardware**. Key reference information is as follows: |
|
|
| **Hardware Tensor Core Performance** |
|
|
| | Hardware Model | FP16 Performance (TFLOPS) | |
| |----------------|---------------------------| |
| | H20 | 148 | |
| | H800 | 1,671 | |
| | H200 | 1,979 | |
|
|
| **Measured Performance and Recommendations** |
|
|
| Taking the mimo-7B model deployed separately on H20 hardware using SGLang as an example: After enabling MTP speculative decoding, the Rollout throughput decreases by approximately 50%. |
|
|
| - Current priority recommendation: Do not enable MTP acceleration during the inference phase for now; |
|
|
| - Future planning: Further optimization of the speculative logic in the Rollout phase will be conducted to improve throughput performance. |
|
|
| # 5. SFT training |
|
|
| The SFT training with MTP is supported, using the same MTP training configuration as RL training. |
|
|
| An example configuration for running SFT can be found in `examples/sft/gsm8k/run_mimo_megatron_mtp.sh` |
|
|
| **SFT result** |
|
|
| The experiment was conducted using following data: |
| - model = mimo-7B-math |
| - dataset = gsm8k |
|
|
| The result: [wandb link](https://wandb.ai/hou-zg-meituan/mimo-7b-sft-mtp?nw=nwuserhouzg) |
|
|
| The presence of mtp layer has limited effect on main loss. However, when MTP layer is detached, the mtp_loss converges to a higher value. |
| |
| |