Guide to Using MTP in SFT/RL Training and Inference
Author: https://github.com/meituan-search
Last updated: 01/30/2026
1. Scope of Support
Currently, RL training can be performed on MTP-architecture models such as mimo-7B-RL, Qwen-next, and the DeepSeek series. The support rules for training and inference engines are as follows:
- Training Engine: only the `mbridge + megatron` combination is supported; other training engines are not compatible at this time.
- Inference Engine: compatible with all engines, provided the model is on the corresponding engine's compatibility list.
- Dependency versions:
  - mbridge: use the specified branch https://github.com/ArronHZG/mbridge/tree/feature/verl_mtp (it will be merged into the main branch in the future);
  - megatron: use the latest dev version (commit: 23e092f41ec8bc659020e401ddac9576c1cfed7e), which supports MTP + CP training;
  - sglang: use the specified branch https://github.com/ArronHZG/sglang/tree/fix_mtp_update_weights_from_tensor (PR ), which fixes the OOM issue when updating MTP weights from tensor.
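A minimal sketch of pinning these dependencies; the exact install commands are assumptions based on standard git/pip usage (and the assumption that "megatron" refers to the upstream Megatron-LM repository), not instructions from the upstream projects:

```shell
# Sketch: pinning the MTP-compatible dependency versions (standard git/pip usage assumed).
# mbridge: the feature branch named in this guide.
pip install "git+https://github.com/ArronHZG/mbridge.git@feature/verl_mtp"

# megatron: check out the dev commit named in this guide
# (assumes the upstream Megatron-LM repository).
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM && git checkout 23e092f41ec8bc659020e401ddac9576c1cfed7e && cd ..

# sglang: the fix branch named in this guide, installed from source.
git clone -b fix_mtp_update_weights_from_tensor https://github.com/ArronHZG/sglang.git
pip install -e "sglang/python[all]"
```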
2. MTP Training Configuration (Core Parameters)
The MTP training process can be flexibly controlled through the following configurations. All parameters live under the `actor_rollout_ref.model.mtp` prefix:
| Configuration Scenario | Core Parameters | Description |
|---|---|---|
| Load MTP parameters only | `enable=True` | VRAM usage increases, but the exported parameters include the MTP module and can be used directly for online deployment |
| Full-parameter MTP training | `enable=True`<br>`enable_train=True`<br>`mtp_loss_scaling_factor=0.1` | The MTP loss applies to all model parameters |
| MTP-parameter-only training | `enable=True`<br>`enable_train=True`<br>`detach_encoder=True` | Freezes the encoder layers and updates only the MTP module parameters; the MTP loss applies only to MTP parameters |
| MTP-accelerated rollout | 1. vLLM configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`method="mtp"`<br>`num_speculative_tokens=1`<br>2. SGLang configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`speculative_algorithm="EAGLE"`<br>`speculative_num_steps=2`<br>`speculative_eagle_topk=2`<br>`speculative_num_draft_tokens=4` | MTP-based inference acceleration during the rollout phase |
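As a concrete illustration, the SGLang rollout-acceleration row could be expressed as Hydra-style command-line overrides. This is a sketch: the trainer entrypoint and the exact nesting of the speculative keys under the `actor_rollout_ref.model.mtp` prefix are assumptions; only the parameter names and values come from the table above.

```shell
# Sketch: MTP-accelerated rollout with SGLang speculative decoding.
# Entrypoint and key nesting are assumptions; parameter names/values are from the table.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.mtp.enable=True \
    actor_rollout_ref.model.mtp.enable_rollout=True \
    actor_rollout_ref.model.mtp.speculative_algorithm="EAGLE" \
    actor_rollout_ref.model.mtp.speculative_num_steps=2 \
    actor_rollout_ref.model.mtp.speculative_eagle_topk=2 \
    actor_rollout_ref.model.mtp.speculative_num_draft_tokens=4
```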
3. Experimental Results
The experiment was conducted as follows:
- model = mimo-7B-math
- max_response_length = 8k
The experiment chart is available via the wandb link.
Scenarios with No Significant Effect
The following configurations have no noticeable impact on training results:
- The base model does not carry MTP parameters;
- The base model carries MTP parameters, but the MTP module is not trained;
- The base model carries MTP parameters and trains MTP, with `mtp_loss_scaling_factor=0`;
- The base model carries MTP parameters, trains MTP, and detaches the encoder, with `mtp_loss_scaling_factor=0.1`.
Scenarios with Significant Effect
Only the following configuration has a noticeable impact on training results:
- The base model carries MTP parameters, the MTP loss applies to all model parameters, and `mtp_loss_scaling_factor=0.1`.
Recommended Training Method
It is recommended to adopt the `detach_encoder=True` approach for MTP training.
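Following that recommendation, a verl-style override sketch for the `detach_encoder` setup might look like the following (the trainer entrypoint is an assumption; the `actor_rollout_ref.model.mtp.*` keys are the ones documented in this guide):

```shell
# Sketch: recommended MTP-parameter-only training (encoder detached from the MTP loss).
# The entrypoint is an assumption; the mtp.* keys are from the configuration table.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.mtp.enable=True \
    actor_rollout_ref.model.mtp.enable_train=True \
    actor_rollout_ref.model.mtp.detach_encoder=True
```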
4. Performance Notes for MTP in Rollout Inference
The effectiveness of MTP-accelerated Rollout is significantly affected by model size and inference hardware. Key reference information is as follows:
Hardware Tensor Core Performance
| Hardware Model | FP16 Performance (TFLOPS) |
|---|---|
| H20 | 148 |
| H800 | 1,671 |
| H200 | 1,979 |
Measured Performance and Recommendations
Taking the mimo-7B model deployed standalone on H20 hardware using SGLang as an example: after enabling MTP speculative decoding, rollout throughput decreases by approximately 50%. This is consistent with the hardware table above: on compute-limited hardware, verifying draft tokens adds FLOPs faster than it saves decoding steps.
- Current recommendation: do not enable MTP acceleration during the rollout phase for now;
- Future plan: the speculative-decoding logic in the rollout phase will be further optimized to improve throughput.
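To build intuition for this trade-off, the expected number of tokens committed per verification step can be estimated with the standard speculative-sampling formula. This is an illustrative sketch only: the acceptance rate `alpha` is a hypothetical value, not a measurement from these experiments.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens accepted per verify step with draft length k and
    per-token acceptance rate alpha: (1 - alpha**(k + 1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token (plus the bonus token) is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Each verify step runs the target model over k + 1 positions in parallel.
# On bandwidth-bound hardware that extra compute is nearly free; on
# compute-bound hardware (low TFLOPS, e.g. H20) it can outweigh the
# tokens gained, which matches the ~50% throughput drop observed above.
for alpha in (0.6, 0.8):
    print(f"alpha={alpha}: ~{expected_tokens_per_step(alpha, k=3):.2f} tokens/step")
```

With a higher acceptance rate the draft pays for itself more often, which is why speculative decoding helps most when the verifier has compute headroom to spare.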
5. SFT Training
SFT training with MTP is supported, using the same MTP training configuration as RL training.
An example configuration for running SFT can be found in examples/sft/gsm8k/run_mimo_megatron_mtp.sh.
SFT result
The experiment was conducted with the following setup:
- model = mimo-7B-math
- dataset = gsm8k
The result: wandb link
The presence of the MTP layer has limited effect on the main loss. However, when the encoder is detached (`detach_encoder=True`), the mtp_loss converges to a higher value.
