Guide to Using MTP in SFT/RL Training and Inference
Author: https://github.com/meituan-search
Last updated: 01/30/2026
1. Scope of Support
Currently, RL training can be performed on MTP-architecture models such as mimo-7B-RL, Qwen-next, and the DeepSeek series. The support rules for training and inference engines are as follows:
- Training Engine: only the `mbridge + megatron` combination is supported; other training engines are not compatible at this time.
- Inference Engine: compatible with all engines, provided the model is on the corresponding engine's compatibility list.
- Dependency versions:
  - mbridge: use the specified branch https://github.com/ArronHZG/mbridge/tree/feature/verl_mtp (it will be merged into the main branch in the future);
  - megatron: use the latest dev version (commit: 23e092f41ec8bc659020e401ddac9576c1cfed7e), which supports MTP + CP training;
  - sglang: use the specified branch https://github.com/ArronHZG/sglang/tree/fix_mtp_update_weights_from_tensor (PR ), which fixes the OOM issue when updating MTP weights from tensor.
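A minimal sketch of pinning these dependencies; the exact install commands are assumptions based on standard git/pip usage (and the assumption that "megatron" refers to the upstream Megatron-LM repository), not instructions from the upstream projects:

```shell
# Sketch: pinning the MTP-compatible dependency versions (standard git/pip usage assumed).
# mbridge: the feature branch named in this guide.
pip install "git+https://github.com/ArronHZG/mbridge.git@feature/verl_mtp"

# megatron: check out the dev commit named in this guide
# (assumes the upstream Megatron-LM repository).
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM && git checkout 23e092f41ec8bc659020e401ddac9576c1cfed7e && cd ..

# sglang: the fix branch named in this guide, installed from source.
git clone -b fix_mtp_update_weights_from_tensor https://github.com/ArronHZG/sglang.git
pip install -e "sglang/python[all]"
```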
2. MTP Training Configuration (Core Parameters)
The MTP training process can be flexibly controlled through the following configurations. All parameters live under the `actor_rollout_ref.model.mtp` prefix:
| Configuration Scenario | Core Parameters | Description |
|---|---|---|
| Load MTP parameters only | `enable=True` | VRAM usage increases, but the exported parameters include the MTP module and can be used directly for online deployment |
| Full-parameter MTP training | `enable=True`<br>`enable_train=True`<br>`mtp_loss_scaling_factor=0.1` | The MTP loss applies to all model parameters |
| MTP-parameter-only training | `enable=True`<br>`enable_train=True`<br>`detach_encoder=True` | Freezes the encoder layers and updates only the MTP module parameters; the MTP loss applies only to MTP parameters |
| MTP-accelerated rollout | 1. vLLM configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`method="mtp"`<br>`num_speculative_tokens=1`<br>2. SGLang configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`speculative_algorithm="EAGLE"`<br>`speculative_num_steps=2`<br>`speculative_eagle_topk=2`<br>`speculative_num_draft_tokens=4` | MTP-based inference acceleration during the rollout phase |
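As a concrete illustration, the SGLang rollout-acceleration row could be expressed as Hydra-style command-line overrides. This is a sketch: the trainer entrypoint and the exact nesting of the speculative keys under the `actor_rollout_ref.model.mtp` prefix are assumptions; only the parameter names and values come from the table above.

```shell
# Sketch: MTP-accelerated rollout with SGLang speculative decoding.
# Entrypoint and key nesting are assumptions; parameter names/values are from the table.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.mtp.enable=True \
    actor_rollout_ref.model.mtp.enable_rollout=True \
    actor_rollout_ref.model.mtp.speculative_algorithm="EAGLE" \
    actor_rollout_ref.model.mtp.speculative_num_steps=2 \
    actor_rollout_ref.model.mtp.speculative_eagle_topk=2 \
    actor_rollout_ref.model.mtp.speculative_num_draft_tokens=4
```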
3. Experimental Results
The experiment was conducted as follows:
- model = mimo-7B-math
- max_response_length = 8k
The experiment chart is available via the wandb link.
Scenarios with No Significant Effect
The following configurations have no noticeable impact on training results:
- The base model does not carry MTP parameters;
- The base model carries MTP parameters, but the MTP module is not trained;
- The base model carries MTP parameters and trains MTP, with `mtp_loss_scaling_factor=0`;
- The base model carries MTP parameters, trains MTP, and detaches the encoder, with `mtp_loss_scaling_factor=0.1`.
Scenarios with Significant Effect
Only the following configuration has a noticeable impact on training results:
- The base model carries MTP parameters, the MTP loss applies to all model parameters, and `mtp_loss_scaling_factor=0.1`.
Recommended Training Method
It is recommended to adopt the `detach_encoder=True` approach for MTP training.
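Following that recommendation, a verl-style override sketch for the `detach_encoder` setup might look like the following (the trainer entrypoint is an assumption; the `actor_rollout_ref.model.mtp.*` keys are the ones documented in this guide):

```shell
# Sketch: recommended MTP-parameter-only training (encoder detached from the MTP loss).
# The entrypoint is an assumption; the mtp.* keys are from the configuration table.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.mtp.enable=True \
    actor_rollout_ref.model.mtp.enable_train=True \
    actor_rollout_ref.model.mtp.detach_encoder=True
```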
4. Performance Notes for MTP in Rollout Inference
The effectiveness of MTP-accelerated Rollout is significantly affected by model size and inference hardware. Key reference information is as follows:
Hardware Tensor Core Performance
| Hardware Model | FP16 Performance (TFLOPS) |
|---|---|
| H20 | 148 |
| H800 | 1,671 |
| H200 | 1,979 |
Measured Performance and Recommendations
Taking the mimo-7B model deployed standalone on H20 hardware using SGLang as an example: after enabling MTP speculative decoding, rollout throughput decreases by approximately 50%. This is consistent with the hardware table above: on compute-limited hardware, verifying draft tokens adds FLOPs faster than it saves decoding steps.
- Current recommendation: do not enable MTP acceleration during the rollout phase for now;
- Future plan: the speculative-decoding logic in the rollout phase will be further optimized to improve throughput.
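To build intuition for this trade-off, the expected number of tokens committed per verification step can be estimated with the standard speculative-sampling formula. This is an illustrative sketch only: the acceptance rate `alpha` is a hypothetical value, not a measurement from these experiments.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens accepted per verify step with draft length k and
    per-token acceptance rate alpha: (1 - alpha**(k + 1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token (plus the bonus token) is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Each verify step runs the target model over k + 1 positions in parallel.
# On bandwidth-bound hardware that extra compute is nearly free; on
# compute-bound hardware (low TFLOPS, e.g. H20) it can outweigh the
# tokens gained, which matches the ~50% throughput drop observed above.
for alpha in (0.6, 0.8):
    print(f"alpha={alpha}: ~{expected_tokens_per_step(alpha, k=3):.2f} tokens/step")
```

With a higher acceptance rate the draft pays for itself more often, which is why speculative decoding helps most when the verifier has compute headroom to spare.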
5. SFT Training
SFT training with MTP is supported, using the same MTP training configuration as RL training.
An example configuration for running SFT can be found in examples/sft/gsm8k/run_mimo_megatron_mtp.sh.
SFT result
The experiment was conducted with the following setup:
- model = mimo-7B-math
- dataset = gsm8k
The result: wandb link
The presence of the MTP layer has limited effect on the main loss. However, when the encoder is detached (`detach_encoder=True`), the mtp_loss converges to a higher value.
